Enhance crawler capabilities and documentation

- Add llm.txt generator
  - Added SSL certificate extraction in AsyncWebCrawler.
  - Introduced new content filters and chunking strategies for more robust data extraction.
  - Updated documentation.
This commit is contained in:
UncleCode
2024-12-25 21:34:31 +08:00
parent 84b311760f
commit d5ed451299
59 changed files with 2208 additions and 1763 deletions

View File

@@ -1,63 +1,10 @@
### Hypothetical Questions
1. **Enabling Downloads**
- *"How do I configure Crawl4AI to allow file downloads during a crawl?"*
- *"Where in my code should I set `accept_downloads=True` to enable downloads?"*
2. **Specifying the Download Location**
- *"How can I choose a custom directory for storing downloaded files?"*
- *"What is the default download directory if I dont specify one?"*
3. **Triggering Downloads from Pages**
- *"How do I simulate a click on a download link or button to initiate file downloads?"*
- *"Can I use JavaScript injection (`js_code`) to trigger downloads from the webpage elements?"*
- *"What does `wait_for` do, and how do I use it to ensure the download starts before proceeding?"*
4. **Accessing Downloaded Files**
- *"Where can I find the paths to the files that Ive downloaded?"*
- *"How do I check if any files were downloaded after my crawl completes?"*
5. **Multiple Downloads**
- *"How do I handle scenarios where multiple files need to be downloaded sequentially?"*
- *"Can I introduce delays between file downloads to prevent server overload?"*
6. **Error Handling and Reliability**
- *"What if the files I expect to download dont appear or the links are broken?"*
- *"How can I handle incorrect paths, nonexistent directories, or failed downloads gracefully?"*
7. **Timing and Performance**
- *"When should I use `wait_for` and how do I choose an appropriate delay?"*
- *"Can I start the download and continue processing other tasks concurrently?"*
8. **Security Considerations**
- *"What precautions should I take with downloaded files?"*
- *"How can I ensure that downloaded files are safe before processing them further?"*
9. **Integration with Other Crawl4AI Features**
- *"Can I combine file downloading with other extraction strategies or LLM-based processes?"*
- *"How do I manage downloads when running multiple parallel crawls?"*
### Topics Discussed in the File
- **Enabling Downloads in Crawl4AI**:
Configure the crawler through `BrowserConfig` or `CrawlerRunConfig` to allow file downloads.
- **Download Locations**:
Specify a custom `downloads_path` or rely on the default directory (`~/.crawl4ai/downloads`).
- **Triggering File Downloads**:
Use JavaScript code injection (`js_code`) to simulate user interactions (e.g., clicking a download link). Employ `wait_for` to allow time for downloads to initiate.
- **Accessing Downloaded Files**:
After the crawl, `result.downloaded_files` provides a list of paths to the downloaded files. Use these paths to verify file sizes or further process the files.
- **Handling Multiple Files**:
Loop through downloadable elements on the page, introduce delays, and wait for downloads to complete before proceeding.
- **Error and Timing Considerations**:
Manage potential errors when downloads fail or timing issues arise. Adjust `wait_for` and error handling logic to ensure stable and reliable file retrievals.
- **Security Precautions**:
Always verify the integrity and safety of downloaded files before using them in your application.
In summary, the file explains how to set up, initiate, and manage file downloads within the Crawl4AI framework, including specifying directories, triggering downloads programmatically, handling multiple files, and accessing downloaded results. It also covers timing, error handling, and security best practices.
enable_downloads: Downloads must be enabled using accept_downloads parameter in BrowserConfig or CrawlerRunConfig | download settings, enable downloads | BrowserConfig(accept_downloads=True)
download_location: Set custom download directory using downloads_path in BrowserConfig, defaults to .crawl4ai/downloads | download folder, save location | BrowserConfig(downloads_path="/path/to/downloads")
download_trigger: Trigger downloads using js_code in CrawlerRunConfig to simulate click actions | download button, click download | CrawlerRunConfig(js_code="document.querySelector('a[download]').click()")
download_timing: Control download timing using wait_for parameter in CrawlerRunConfig | download wait, timeout | CrawlerRunConfig(wait_for=5)
access_downloads: Access downloaded files through downloaded_files attribute in CrawlResult | download results, file paths | result.downloaded_files
multiple_downloads: Download multiple files by clicking multiple download links with delay | batch download, multiple files | js_code="const links = document.querySelectorAll('a[download]'); for(const link of links) { link.click(); }"
download_verification: Check download success by examining downloaded_files list and file sizes | verify downloads, file check | if result.downloaded_files: print(os.path.getsize(file_path))
browser_context: Downloads are managed within browser context and require proper js_code targeting | download management, browser scope | CrawlerRunConfig(js_code="...")
error_handling: Handle failed downloads and incorrect paths for robust download management | download errors, error handling | try-except around download operations
security_consideration: Scan downloaded files for security threats before use | security check, virus scan | No direct code reference