Commit Message:
Enhance Crawl4AI with CLI and documentation updates - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py` - Added chunking strategies and their documentation in `llm.txt`
This commit is contained in:
81
docs/llm.txt/3_async_webcrawler.q.md
Normal file
81
docs/llm.txt/3_async_webcrawler.q.md
Normal file
@@ -0,0 +1,81 @@
|
||||
### Questions
|
||||
|
||||
1. **Asynchronous Crawling Basics**
|
||||
- *"How do I perform asynchronous web crawling using `AsyncWebCrawler`?"*
|
||||
- *"What are the performance benefits of asynchronous I/O in `crawl4ai`?"*
|
||||
|
||||
2. **Browser Configuration**
|
||||
- *"How can I configure `BrowserConfig` for headless Chromium or Firefox?"*
|
||||
- *"How do I set viewport dimensions and proxies in the `BrowserConfig`?"*
|
||||
- *"How can I enable verbose logging for browser interactions?"*
|
||||
|
||||
3. **Docker and Containerization**
|
||||
- *"How do I run `AsyncWebCrawler` inside a Docker container for scalability?"*
|
||||
- *"Which dependencies are needed in the Dockerfile to run asynchronous crawls?"*
|
||||
|
||||
4. **Crawling Strategies**
|
||||
- *"What is `AsyncPlaywrightCrawlerStrategy` and when should I use it?"*
|
||||
- *"How do I switch between different crawler strategies if multiple are available?"*
|
||||
|
||||
5. **Handling Dynamic Content**
|
||||
- *"How can I inject custom JavaScript to load more content or simulate user actions?"*
|
||||
- *"What is the best way to wait for specific DOM elements before extracting content?"*
|
||||
|
||||
6. **Extraction Strategies**
|
||||
- *"How do I use `JsonCssExtractionStrategy` to extract structured JSON data?"*
|
||||
- *"What are the differences between regex-based chunking and NLP-based chunking?"*
|
||||
- *"How can I integrate `LLMExtractionStrategy` for more intelligent data extraction?"*
|
||||
|
||||
7. **Caching and Performance**
|
||||
- *"How does caching improve the performance of asynchronous crawling?"*
|
||||
- *"How do I clear or bypass the cache in `AsyncWebCrawler`?"*
|
||||
- *"What are the available `CacheMode` options and when should I use each?"*
|
||||
|
||||
8. **Batch Crawling and Concurrency**
|
||||
- *"How do I crawl multiple URLs concurrently using `arun_many`?"*
|
||||
- *"How can I limit concurrency with `semaphore_count` for resource management?"*
|
||||
|
||||
9. **Scaling Crawls**
|
||||
- *"What strategies can I use to scale asynchronous crawls across multiple machines?"*
|
||||
- *"How do I integrate job queues or distribute tasks for larger crawl projects?"*
|
||||
|
||||
10. **Screenshots and PDFs**
|
||||
- *"How do I enable screenshot or PDF capture during a crawl?"*
|
||||
- *"How can I save visual outputs for troubleshooting rendering issues?"*
|
||||
|
||||
11. **Troubleshooting**
|
||||
- *"What should I do if the browser fails to launch or times out?"*
|
||||
- *"How do I debug JavaScript code injections that don’t work as expected?"*
|
||||
- *"How can I handle partial loads or missing content due to timeouts?"*
|
||||
|
||||
12. **Best Practices**
|
||||
- *"How do I handle authentication or session management in `AsyncWebCrawler`?"*
|
||||
- *"How can I avoid getting blocked by target sites, e.g., by using proxies?"*
|
||||
- *"What error handling approaches are recommended for production crawls?"*
|
||||
- *"How can I adhere to legal and ethical guidelines when crawling?"*
|
||||
|
||||
13. **Configuration Options**
|
||||
- *"How do I customize `CrawlerRunConfig` parameters like `mean_delay` and `max_range`?"*
|
||||
- *"How can I run the crawler non-headless for debugging dynamic interactions?"*
|
||||
|
||||
14. **Integration and Reference**
|
||||
- *"Where can I find the GitHub repository or additional documentation?"*
|
||||
- *"How do I incorporate Playwright’s advanced features with `AsyncWebCrawler`?"*
|
||||
|
||||
### Topics Discussed in the File
|
||||
|
||||
- **Asynchronous Crawling and Performance**
|
||||
- **`AsyncWebCrawler` Initialization and Usage**
|
||||
- **`BrowserConfig` for Browser Choice, Headless Mode, Viewport, Proxy, and Verbosity**
|
||||
- **Running Crawlers in Docker and Containerized Environments**
|
||||
- **`AsyncPlaywrightCrawlerStrategy` and DOM Interactions**
|
||||
- **Dynamic Content Handling via JavaScript Injection**
|
||||
- **Extraction Strategies (e.g., `JsonCssExtractionStrategy`, `LLMExtractionStrategy`)**
|
||||
- **Content Chunking Approaches (Regex and NLP-based)**
|
||||
- **Caching Mechanisms and Cache Modes**
|
||||
- **Parallel Crawling with `arun_many` and Concurrency Controls**
|
||||
- **Scaling Crawls Across Multiple Workers or Containers**
|
||||
- **Screenshot and PDF Generation for Debugging**
|
||||
- **Common Troubleshooting Techniques and Error Handling**
|
||||
- **Authentication, Session Management, and Ethical Guidelines**
|
||||
- **Adjusting `CrawlerRunConfig` for Delays, Concurrency, Extraction, and JavaScript Injection**
|
||||
Reference in New Issue
Block a user