Enhance crawler capabilities and documentation

- Add llm.txt generator
  - Added SSL certificate extraction in AsyncWebCrawler.
  - Introduced new content filters and chunking strategies for more robust data extraction.
  - Updated documentation.
UncleCode
2024-12-25 21:34:31 +08:00
parent 84b311760f
commit d5ed451299
59 changed files with 2208 additions and 1763 deletions


@@ -1,81 +1,15 @@
### Questions
1. **Asynchronous Crawling Basics**
- *"How do I perform asynchronous web crawling using `AsyncWebCrawler`?"*
- *"What are the performance benefits of asynchronous I/O in `crawl4ai`?"*
2. **Browser Configuration**
- *"How can I configure `BrowserConfig` for headless Chromium or Firefox?"*
- *"How do I set viewport dimensions and proxies in the `BrowserConfig`?"*
- *"How can I enable verbose logging for browser interactions?"*
3. **Docker and Containerization**
- *"How do I run `AsyncWebCrawler` inside a Docker container for scalability?"*
- *"Which dependencies are needed in the Dockerfile to run asynchronous crawls?"*
4. **Crawling Strategies**
- *"What is `AsyncPlaywrightCrawlerStrategy` and when should I use it?"*
- *"How do I switch between different crawler strategies if multiple are available?"*
5. **Handling Dynamic Content**
- *"How can I inject custom JavaScript to load more content or simulate user actions?"*
- *"What is the best way to wait for specific DOM elements before extracting content?"*
6. **Extraction Strategies**
- *"How do I use `JsonCssExtractionStrategy` to extract structured JSON data?"*
- *"What are the differences between regex-based chunking and NLP-based chunking?"*
- *"How can I integrate `LLMExtractionStrategy` for more intelligent data extraction?"*
7. **Caching and Performance**
- *"How does caching improve the performance of asynchronous crawling?"*
- *"How do I clear or bypass the cache in `AsyncWebCrawler`?"*
- *"What are the available `CacheMode` options and when should I use each?"*
8. **Batch Crawling and Concurrency**
- *"How do I crawl multiple URLs concurrently using `arun_many`?"*
- *"How can I limit concurrency with `semaphore_count` for resource management?"*
9. **Scaling Crawls**
- *"What strategies can I use to scale asynchronous crawls across multiple machines?"*
- *"How do I integrate job queues or distribute tasks for larger crawl projects?"*
10. **Screenshots and PDFs**
- *"How do I enable screenshot or PDF capture during a crawl?"*
- *"How can I save visual outputs for troubleshooting rendering issues?"*
11. **Troubleshooting**
- *"What should I do if the browser fails to launch or times out?"*
- *"How do I debug JavaScript code injections that don't work as expected?"*
- *"How can I handle partial loads or missing content due to timeouts?"*
12. **Best Practices**
- *"How do I handle authentication or session management in `AsyncWebCrawler`?"*
- *"How can I avoid getting blocked by target sites, e.g., by using proxies?"*
- *"What error handling approaches are recommended for production crawls?"*
- *"How can I adhere to legal and ethical guidelines when crawling?"*
13. **Configuration Options**
- *"How do I customize `CrawlerRunConfig` parameters like `mean_delay` and `max_range`?"*
- *"How can I run the crawler non-headless for debugging dynamic interactions?"*
14. **Integration and Reference**
- *"Where can I find the GitHub repository or additional documentation?"*
- *"How do I incorporate Playwright's advanced features with `AsyncWebCrawler`?"*
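The first question group asks about the performance payoff of asynchronous I/O. A minimal stdlib-only sketch makes the point without crawl4ai itself: the `fetch` stub below stands in for a real network request, and awaiting all of them concurrently takes roughly one request's latency rather than the sum.

```python
import asyncio
import time

async def fetch(url: str, delay: float = 0.1) -> str:
    # Stub for a network request: sleep stands in for real I/O latency.
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"

async def crawl_all(urls: list[str]) -> list[str]:
    # All fetches are awaited concurrently; total wall time is roughly
    # one `delay`, not len(urls) * delay as in a sequential loop.
    return await asyncio.gather(*(fetch(u) for u in urls))
```

This is the core win `AsyncWebCrawler` builds on: while one page's response is in flight, the event loop services the others.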
### Topics Discussed in the File
- **Asynchronous Crawling and Performance**
- **`AsyncWebCrawler` Initialization and Usage**
- **`BrowserConfig` for Browser Choice, Headless Mode, Viewport, Proxy, and Verbosity**
- **Running Crawlers in Docker and Containerized Environments**
- **`AsyncPlaywrightCrawlerStrategy` and DOM Interactions**
- **Dynamic Content Handling via JavaScript Injection**
- **Extraction Strategies (e.g., `JsonCssExtractionStrategy`, `LLMExtractionStrategy`)**
- **Content Chunking Approaches (Regex and NLP-based)**
- **Caching Mechanisms and Cache Modes**
- **Parallel Crawling with `arun_many` and Concurrency Controls**
- **Scaling Crawls Across Multiple Workers or Containers**
- **Screenshot and PDF Generation for Debugging**
- **Common Troubleshooting Techniques and Error Handling**
- **Authentication, Session Management, and Ethical Guidelines**
- **Adjusting `CrawlerRunConfig` for Delays, Concurrency, Extraction, and JavaScript Injection**
quick_start: Basic async crawl setup requires BrowserConfig and AsyncWebCrawler initialization | getting started, basic usage, initialization | async with AsyncWebCrawler(config=BrowserConfig(browser_type="chromium", headless=True)) as crawler: result = await crawler.arun(url)
browser_types: AsyncWebCrawler supports multiple browser types including Chromium and Firefox | supported browsers, browser options | BrowserConfig(browser_type="chromium")
headless_mode: Browser can run in headless mode without UI for better performance | invisible browser, no GUI | BrowserConfig(headless=True)
viewport_settings: Configure browser viewport dimensions for proper page rendering | screen size, window size | BrowserConfig(viewport_width=1920, viewport_height=1080)
docker_deployment: AsyncWebCrawler can run in Docker containers for scalability | containerization, deployment | FROM python:3.10-slim; RUN pip install crawl4ai playwright; RUN playwright install --with-deps chromium
dynamic_content: Handle JavaScript-loaded content using custom JS injection | javascript handling, dynamic loading | CrawlerRunConfig(js_code=["document.querySelector('.load-more').click()"])
extraction_strategies: Multiple strategies available for content extraction including JsonCssExtractionStrategy and LLMExtractionStrategy | content extraction, data parsing | JsonCssExtractionStrategy(schema={"name": "Example", "baseSelector": "article", "fields": [{"name": "title", "selector": "h1", "type": "text"}]})
caching_modes: Control cache behavior with different modes: ENABLED, BYPASS, DISABLED | cache control, caching options | CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
batch_crawling: Process multiple URLs concurrently using arun_many method | parallel crawling, multiple urls | await crawler.arun_many(urls, config=CrawlerRunConfig(semaphore_count=10))
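The `semaphore_count` knob corresponds to the standard asyncio bounded-concurrency pattern. A stdlib-only sketch of the idea (the `fetch_one` stub and the peak counter are illustrative, not crawl4ai internals): a semaphore caps how many fetches are in flight at once, while `gather` still returns results in input order.

```python
import asyncio

async def bounded_crawl(urls: list[str], limit: int = 10):
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0  # track the highest observed concurrency, to show the cap holds

    async def fetch_one(url: str) -> str:
        nonlocal in_flight, peak
        async with sem:  # at most `limit` fetches run concurrently
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the real request
            in_flight -= 1
            return url.upper()

    results = await asyncio.gather(*(fetch_one(u) for u in urls))
    return results, peak
```

Raising the limit trades memory and politeness for throughput; the cap is what keeps a large URL list from opening hundreds of pages at once.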
rate_limiting: Control crawl rate using mean_delay and max_range parameters | throttling, delay control | CrawlerRunConfig(mean_delay=1.0, max_range=0.5)
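A small sketch of how a jittered inter-request delay can be derived from the two parameters above. This is an assumption about the shape of the behavior (a base delay plus random jitter, clamped at zero), not crawl4ai's exact distribution:

```python
import random

def next_delay(mean_delay: float, max_range: float) -> float:
    # Illustrative only: jitter the base delay by up to +/- max_range,
    # never returning a negative sleep. crawl4ai's exact formula may differ.
    return max(0.0, mean_delay + random.uniform(-max_range, max_range))
```

Randomizing the delay avoids the fixed-interval request pattern that rate limiters and bot detectors key on.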
visual_capture: Generate screenshots and PDFs of crawled pages | page capture, visual output | CrawlerRunConfig(screenshot=True, pdf=True)
error_handling: Common issues include browser launch failures, timeouts, and JS execution problems | troubleshooting, debugging | try/except blocks with crawler.logger
authentication: Handle login requirements through js_code or Playwright selectors | login handling, sessions | CrawlerRunConfig with login steps via js_code
proxy_configuration: Configure proxy settings to bypass IP restrictions | proxy setup, IP rotation | BrowserConfig(proxy="http://proxy-server:port")
chunking_strategies: Split content using regex or NLP-based chunking | content splitting, text processing | CrawlerRunConfig(chunking_strategy=RegexChunking())
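The regex-chunking entry can be pictured with a stdlib-only sketch that mirrors the idea behind `RegexChunking` (not its exact implementation): split the text on a pattern, paragraph breaks by default, and drop empty pieces.

```python
import re

def regex_chunk(text: str, pattern: str = r"\n\n+") -> list[str]:
    # Split on the pattern (blank lines by default) and discard
    # empty or whitespace-only chunks.
    return [c.strip() for c in re.split(pattern, text) if c.strip()]
```

NLP-based chunking differs in that it splits on linguistic boundaries (sentences, topics) rather than a fixed lexical pattern, which usually yields more coherent chunks at higher cost.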