Enhance crawler capabilities and documentation

- Add llm.txt generator - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation.
2024-12-25 21:34:31 +08:00
parent 84b311760f
commit d5ed451299
59 changed files with 2208 additions and 1763 deletions
--- a/docs/llm.txt/2_configuration.q.md
+++ b/docs/llm.txt/2_configuration.q.md
@@ -1,97 +1,20 @@
-### Hypothetical Questions
-
-**BrowserConfig:**
-
-1. **Browser Types and Headless Mode**
-   - *"How do I choose between `chromium`, `firefox`, or `webkit` for `browser_type`?"*
-   - *"What are the benefits of running the browser in `headless=True` mode versus a visible UI?"*
-
-2. **Managed Browser and Persistent Context**
-   - *"When should I enable `use_managed_browser` for advanced session control?"*
-   - *"How do I use `use_persistent_context` and `user_data_dir` to maintain login sessions and persistent storage?"*
-
-3. **Debugging and Remote Access**
-   - *"How do I use the `debugging_port` to remotely inspect the browser with DevTools?"*
-
-4. **Proxy and Network Configurations**
-   - *"How can I configure a `proxy` or `proxy_config` for region-specific crawling or authentication?"*
-
-5. **Viewports and Layout Testing**
-   - *"How do I adjust `viewport_width` and `viewport_height` for responsive layout testing?"*
-
-6. **Downloads and Storage States**
-   - *"What steps do I need to take to enable `accept_downloads` and specify a `downloads_path`?"*
-   - *"How can I use `storage_state` to preload cookies or session data?"*
-
-7. **HTTPS and JavaScript Settings**
-   - *"What happens if I set `ignore_https_errors=True` on sites with invalid SSL certificates?"*
-   - *"When should I disable `java_script_enabled` to improve speed and stability?"*
-
-8. **Cookies, Headers, and User Agents**
-   - *"How do I add custom `cookies` or `headers` to every browser request?"*
-   - *"How can I set a custom `user_agent` or use a `user_agent_mode` like `random` to avoid detection?"*
-
-9. **Performance Tuning**
-   - *"What is the difference between `text_mode`, `light_mode`, and adding `extra_args` for performance tuning?"*
-
---
-
-**CrawlerRunConfig:**
-
-10. **Content Extraction and Filtering**
-    - *"How does the `word_count_threshold` affect which pages or sections get processed?"*
-    - *"What `extraction_strategy` should I use for structured data extraction and how does `chunking_strategy` help organize the content?"*
-    - *"How do I apply a `css_selector` or `excluded_tags` to refine my extracted content?"*
-
-11. **Markdown and Text-Only Modes**
-    - *"Can I generate Markdown output directly and what `markdown_generator` should I use?"*
-    - *"When should I set `only_text=True` to strip out non-textual content?"*
-
-12. **Caching and Session Handling**
-    - *"How does `cache_mode=ENABLED` improve performance, and when should I consider `WRITE_ONLY` or disabling the cache?"*
-    - *"What is the role of `session_id` in maintaining state across requests?"*
-
-13. **Page Loading and Timing**
-    - *"How do `wait_until`, `page_timeout`, and `wait_for` elements help control page load timing before extraction?"*
-    - *"When should I disable `wait_for_images` to speed up the crawl?"*
-
-14. **Delays and Concurrency**
-    - *"How do `mean_delay` and `max_range` randomize request intervals to avoid detection?"*
-    - *"What is `semaphore_count` and how does it manage concurrency for multiple crawling tasks?"*
-
-15. **JavaScript Execution and Dynamic Content**
-    - *"How can I inject custom `js_code` to load additional data or simulate user interactions?"*
-    - *"When should I use `scan_full_page` or `adjust_viewport_to_content` to handle infinite scrolling?"*
-
-16. **Screenshots, PDFs, and Media**
-    - *"How do I enable `screenshot` or `pdf` generation to capture page states?"*
-    - *"What are `image_description_min_word_threshold` and `image_score_threshold` for, and how do they enhance image-related extraction?"*
-
-17. **Logging and Debugging**
-    - *"How do `verbose` and `log_console` help me troubleshoot issues with crawling or page scripts?"*
-
---
-
-### Topics Discussed in the File
-
- **BrowserConfig Essentials:**  
-  - Browser types (`chromium`, `firefox`, `webkit`)  
-  - Headless vs. non-headless mode  
-  - Persistent context and managed browser sessions  
-  - Proxy configurations and network settings  
-  - Viewport dimensions and responsive testing  
-  - Download handling and storage states  
-  - HTTPS errors and JavaScript enablement  
-  - Cookies, headers, and user agents  
-  - Performance tuning via `text_mode`, `light_mode`, and `extra_args`
-
- **CrawlerRunConfig Core Settings:**  
-  - Content extraction parameters (`word_count_threshold`, `extraction_strategy`, `chunking_strategy`)  
-  - Markdown generation and text-only extraction  
-  - Content filtering (`css_selector`, `excluded_tags`)  
-  - Caching strategies and `cache_mode` options  
-  - Page load conditions (`wait_until`, `wait_for`) and timeouts (`page_timeout`)  
-  - Delays, concurrency, and scaling (`mean_delay`, `max_range`, `semaphore_count`)  
-  - JavaScript injections (`js_code`) and handling dynamic/infinite scroll content  
-  - Screenshots, PDFs, and image thresholds for enhanced outputs  
-  - Logging and debugging modes (`verbose`, `log_console`)
+browser_config: Configure browser type with chromium, firefox, or webkit support | browser selection, browser engine, web engine | BrowserConfig(browser_type="chromium")
+headless_mode: Toggle headless browser mode for GUI-less operation | headless browser, no GUI, background mode | BrowserConfig(headless=True)
+managed_browser: Enable advanced browser manipulation and control | browser management, session control | BrowserConfig(use_managed_browser=True)
+debugging_setup: Configure remote debugging port for browser inspection | debug port, devtools connection | BrowserConfig(debugging_port=9222)
+persistent_context: Enable persistent browser sessions for maintaining state | session persistence, profile saving | BrowserConfig(use_persistent_context=True)
+browser_profile: Specify directory for storing browser profile data | user data, profile storage | BrowserConfig(user_data_dir="/path/to/profile")
+proxy_configuration: Set up proxy settings for browser connections | proxy server, network routing | BrowserConfig(proxy="http://proxy.example.com:8080")
+viewport_settings: Configure browser window dimensions | screen size, window dimensions | BrowserConfig(viewport_width=1920, viewport_height=1080)
+download_handling: Configure browser download behavior and location | file downloads, download directory | BrowserConfig(accept_downloads=True, downloads_path="/downloads")
+content_threshold: Set minimum word count for processing page content | word limit, content filter | CrawlerRunConfig(word_count_threshold=200)
+extraction_strategy: Configure method for extracting structured data | data extraction, parsing strategy | CrawlerRunConfig(extraction_strategy=CustomStrategy())
+content_chunking: Define strategy for breaking content into chunks | text chunking, content splitting | CrawlerRunConfig(chunking_strategy=RegexChunking())
+cache_behavior: Control caching mode for crawler operations | cache control, data caching | CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
+page_navigation: Configure page load and navigation timing | page timeout, navigation wait | CrawlerRunConfig(wait_until="domcontentloaded", page_timeout=60000)
+javascript_execution: Enable or disable JavaScript processing | JS handling, script execution | CrawlerRunConfig(java_script_enabled=True)
+content_filtering: Configure HTML tag exclusion and content cleanup | tag filtering, content cleanup | CrawlerRunConfig(excluded_tags=["script", "style"])
+concurrent_operations: Set limit for simultaneous crawler operations | concurrency control, parallel crawling | CrawlerRunConfig(semaphore_count=5)
+page_interaction: Configure JavaScript execution and page scanning | page automation, interaction control | CrawlerRunConfig(js_code="custom_script()", scan_full_page=True)
+media_capture: Enable screenshot and PDF generation capabilities | visual capture, page export | CrawlerRunConfig(screenshot=True, pdf=True)
+debugging_options: Configure logging and console message capture | debug logging, error tracking | CrawlerRunConfig(verbose=True, log_console=True)