Enhance Crawl4AI with new features and documentation

- Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
- Introduced Managed Browsers for enhanced crawling experience.
- Updated documentation for clearer navigation on configuration.
- Changed 'text_only' to 'text_mode' in configuration and methods.
- Improved performance and relevance in content filtering strategies.
@@ -1,124 +1,109 @@
# Download Handling in Crawl4AI

This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.

## Enabling Downloads

By default, Crawl4AI does not download files. To enable downloads, set the `accept_downloads` parameter to `True` in either the `AsyncWebCrawler` constructor or the `arun` method.
To enable downloads, set the `accept_downloads` parameter in the `BrowserConfig` object and pass it to the crawler.

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

async def main():
    async with AsyncWebCrawler(accept_downloads=True) as crawler: # Globally enable downloads
    config = BrowserConfig(accept_downloads=True) # Enable downloads globally
    async with AsyncWebCrawler(config=config) as crawler:
        # ... your crawling logic ...

asyncio.run(main())
```

Or, enable it for a specific crawl:
Or, enable it for a specific crawl by using `CrawlerRunConfig`:

```python
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="...", accept_downloads=True)
        config = CrawlerRunConfig(accept_downloads=True)
        result = await crawler.arun(url="https://example.com", config=config)
        # ...
```

## Specifying Download Location

You can specify the download directory using the `downloads_path` parameter. If not provided, Crawl4AI creates a "downloads" directory inside the `.crawl4ai` folder in your home directory.
Specify the download directory using the `downloads_path` attribute in the `BrowserConfig` object. If not provided, Crawl4AI defaults to creating a "downloads" directory inside the `.crawl4ai` folder in your home directory.

```python
from crawl4ai.async_configs import BrowserConfig
import os
from pathlib import Path

# ... inside your crawl function:

downloads_path = os.path.join(os.getcwd(), "my_downloads") # Custom download path
os.makedirs(downloads_path, exist_ok=True)

result = await crawler.arun(url="...", downloads_path=downloads_path, accept_downloads=True)
config = BrowserConfig(accept_downloads=True, downloads_path=downloads_path)

# ...
```

If you are setting it globally, provide the path to the AsyncWebCrawler:
```python
async def crawl_with_downloads(url: str, download_path: str):
    async with AsyncWebCrawler(
        accept_downloads=True,
        downloads_path=download_path, # or set it on arun
        verbose=True
    ) as crawler:
        result = await crawler.arun(url=url) # you still need to enable downloads per call.
async def main():
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun(url="https://example.com")
        # ...
```
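
The fallback described in this section can be sketched with the standard library alone. This is a minimal illustration, not Crawl4AI's own logic: `resolve_downloads_path` is a hypothetical helper name, and it only prepares a directory whose path could then be passed as `downloads_path`.

```python
from pathlib import Path
from typing import Optional


def resolve_downloads_path(custom: Optional[str] = None) -> Path:
    """Return the directory to use for downloads, creating it if needed.

    Hypothetical helper: falls back to ~/.crawl4ai/downloads, the default
    location this guide describes, when no custom path is given.
    """
    if custom:
        base = Path(custom).expanduser()
    else:
        base = Path.home() / ".crawl4ai" / "downloads"
    # parents=True creates intermediate folders; exist_ok=True makes reruns safe.
    base.mkdir(parents=True, exist_ok=True)
    return base.resolve()
```

The returned `Path` can be converted with `str(...)` if the configuration object expects a plain string.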

## Triggering Downloads

Downloads are typically triggered by user interactions on a web page (e.g., clicking a download button). You can simulate these actions with the `js_code` parameter, injecting JavaScript code to be executed within the browser context. The `wait_for` parameter might also be crucial to allowing sufficient time for downloads to initiate before the crawler proceeds.
Downloads are typically triggered by user interactions on a web page, such as clicking a download button. Use `js_code` in `CrawlerRunConfig` to simulate these actions and `wait_for` to allow sufficient time for downloads to start.

```python
result = await crawler.arun(
    url="https://www.python.org/downloads/",
from crawl4ai.async_configs import CrawlerRunConfig

config = CrawlerRunConfig(
    js_code="""
        // Find and click the first Windows installer link
        const downloadLink = document.querySelector('a[href$=".exe"]');
        if (downloadLink) {
            downloadLink.click();
        }
    """,
    wait_for=5 # Wait for 5 seconds for the download to start
    wait_for=5 # Wait 5 seconds for the download to start
)

result = await crawler.arun(url="https://www.python.org/downloads/", config=config)
```
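
When the same click pattern is needed for several selectors, the JavaScript string can be generated rather than hand-written. The sketch below uses only the standard library; `build_click_script` is a hypothetical helper name, not a Crawl4AI function, and its return value is intended to be passed as `js_code`.

```python
import json


def build_click_script(selector: str) -> str:
    """Return JavaScript that clicks the first element matching `selector`.

    Hypothetical helper for illustration only.
    """
    # json.dumps produces a valid JavaScript string literal, so quotes inside
    # the selector cannot terminate the string early in the generated code.
    quoted = json.dumps(selector)
    return (
        f"const target = document.querySelector({quoted});\n"
        "if (target) {\n"
        "    target.click();\n"
        "}\n"
    )


# Same selector as the example in this section.
script = build_click_script('a[href$=".exe"]')
```

Encoding the selector through `json.dumps` avoids manual escaping when a selector itself contains double quotes, as attribute selectors often do.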

## Accessing Downloaded Files

Downloaded file paths are stored in the `downloaded_files` attribute of the returned `CrawlResult` object. This is a list of strings, with each string representing the absolute path to a downloaded file.
The `downloaded_files` attribute of the `CrawlResult` object contains paths to downloaded files.

```python
if result.downloaded_files:
    print("Downloaded files:")
    for file_path in result.downloaded_files:
        print(f"- {file_path}")
        # Perform operations with downloaded files, e.g., check file size
        file_size = os.path.getsize(file_path)
        print(f"- File size: {file_size} bytes")
else:
    print("No files downloaded.")
```
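
The size check above raises an error if a reported path no longer exists on disk. A slightly more defensive variant might look like the following sketch. `summarize_downloads` is a hypothetical helper name, and the code assumes (based on the truthiness check in this section) that `downloaded_files` can be `None` or empty when nothing was downloaded.

```python
import os
from typing import Dict, Iterable, Optional


def summarize_downloads(paths: Optional[Iterable[str]]) -> Dict[str, Optional[int]]:
    """Map each reported path to its size in bytes, or None if it is missing.

    Hypothetical helper: accepts None so it can be called directly with
    `result.downloaded_files` even when no download happened.
    """
    summary: Dict[str, Optional[int]] = {}
    for path in paths or []:
        # os.path.isfile guards against paths that were moved or never written.
        summary[path] = os.path.getsize(path) if os.path.isfile(path) else None
    return summary
```

A call such as `summarize_downloads(result.downloaded_files)` would then return a dictionary that can be logged or inspected without extra guards.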

## Example: Downloading Multiple Files

```python
import asyncio
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import os
from pathlib import Path
from crawl4ai import AsyncWebCrawler

async def download_multiple_files(url: str, download_path: str):

    async with AsyncWebCrawler(
        accept_downloads=True,
        downloads_path=download_path,
        verbose=True
    ) as crawler:
        result = await crawler.arun(
            url=url,
    config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
    async with AsyncWebCrawler(config=config) as crawler:
        run_config = CrawlerRunConfig(
            js_code="""
                // Trigger multiple downloads (example)
                const downloadLinks = document.querySelectorAll('a[download]'); // Or a more specific selector
                for (const link of downloadLinks) {
                    link.click();
                    await new Promise(r => setTimeout(r, 2000)); // Add a small delay between clicks if needed
                }
                const downloadLinks = document.querySelectorAll('a[download]');
                for (const link of downloadLinks) {
                    link.click();
                    await new Promise(r => setTimeout(r, 2000)); // Delay between clicks
                }
            """,
            wait_for=10 # Adjust the timeout to match the expected time for all downloads to start
            wait_for=10 # Wait for all downloads to start
        )
        result = await crawler.arun(url=url, config=run_config)

        if result.downloaded_files:
            print("Downloaded files:")
@@ -126,23 +111,19 @@ async def download_multiple_files(url: str, download_path: str):
                print(f"- {file}")
        else:
            print("No files downloaded.")

# Example usage
# Usage
download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
os.makedirs(download_path, exist_ok=True) # Create directory if it doesn't exist

os.makedirs(download_path, exist_ok=True)

asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
```

## Important Considerations

- **Browser Context:** Downloads are managed within the browser context. Ensure your `js_code` correctly targets the download triggers on the specific web page.
- **Waiting:** Use `wait_for` to manage the timing of the crawl process if immediate download might not occur.
- **Error Handling:** Implement proper error handling to gracefully manage failed downloads or incorrect file paths.
- **Security:** Downloaded files should be scanned for potential security threats before use.
- **Browser Context:** Downloads are managed within the browser context. Ensure `js_code` correctly targets the download triggers on the webpage.
- **Timing:** Use `wait_for` in `CrawlerRunConfig` to manage download timing.
- **Error Handling:** Handle errors to manage failed downloads or incorrect paths gracefully.
- **Security:** Scan downloaded files for potential security threats before use.
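
One small piece of the error-handling and security points above is checking a downloaded file against a checksum published by its source. The following is a sketch under that assumption: it is not a malware scan and not part of Crawl4AI, and the names `sha256_of_file` and `verify_download` are hypothetical.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_sha256: str) -> bool:
    """Return True only if the file exists, is readable, and matches the expected digest."""
    try:
        return sha256_of_file(path) == expected_sha256.strip().lower()
    except OSError:
        # Missing or unreadable files count as failed downloads.
        return False
```

Treating an unreadable file as a failed verification keeps the calling code to a single boolean check per entry in `downloaded_files`.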

This guide provides a foundation for handling downloads with Crawl4AI. You can adapt these techniques to manage downloads in various scenarios and integrate them into more complex crawling workflows.
This revised guide ensures consistency with the `Crawl4AI` codebase by using `BrowserConfig` and `CrawlerRunConfig` for all download-related configurations.