Commit Message:
Enhance Crawl4AI with CLI and documentation updates - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py` - Added chunking strategies and their documentation in `llm.txt`
This commit is contained in:
97
docs/llm.txt/2_configuration.q.md
Normal file
97
docs/llm.txt/2_configuration.q.md
Normal file
@@ -0,0 +1,97 @@
|
||||
### Hypothetical Questions
|
||||
|
||||
**BrowserConfig:**
|
||||
|
||||
1. **Browser Types and Headless Mode**
|
||||
- *"How do I choose between `chromium`, `firefox`, or `webkit` for `browser_type`?"*
|
||||
- *"What are the benefits of running the browser in `headless=True` mode versus a visible UI?"*
|
||||
|
||||
2. **Managed Browser and Persistent Context**
|
||||
- *"When should I enable `use_managed_browser` for advanced session control?"*
|
||||
- *"How do I use `use_persistent_context` and `user_data_dir` to maintain login sessions and persistent storage?"*
|
||||
|
||||
3. **Debugging and Remote Access**
|
||||
- *"How do I use the `debugging_port` to remotely inspect the browser with DevTools?"*
|
||||
|
||||
4. **Proxy and Network Configurations**
|
||||
- *"How can I configure a `proxy` or `proxy_config` for region-specific crawling or authentication?"*
|
||||
|
||||
5. **Viewports and Layout Testing**
|
||||
- *"How do I adjust `viewport_width` and `viewport_height` for responsive layout testing?"*
|
||||
|
||||
6. **Downloads and Storage States**
|
||||
- *"What steps do I need to take to enable `accept_downloads` and specify a `downloads_path`?"*
|
||||
- *"How can I use `storage_state` to preload cookies or session data?"*
|
||||
|
||||
7. **HTTPS and JavaScript Settings**
|
||||
- *"What happens if I set `ignore_https_errors=True` on sites with invalid SSL certificates?"*
|
||||
- *"When should I disable `java_script_enabled` to improve speed and stability?"*
|
||||
|
||||
8. **Cookies, Headers, and User Agents**
|
||||
- *"How do I add custom `cookies` or `headers` to every browser request?"*
|
||||
- *"How can I set a custom `user_agent` or use a `user_agent_mode` like `random` to avoid detection?"*
|
||||
|
||||
9. **Performance Tuning**
|
||||
- *"What is the difference between `text_mode`, `light_mode`, and adding `extra_args` for performance tuning?"*
|
||||
|
||||
---
|
||||
|
||||
**CrawlerRunConfig:**
|
||||
|
||||
10. **Content Extraction and Filtering**
|
||||
- *"How does the `word_count_threshold` affect which pages or sections get processed?"*
|
||||
- *"What `extraction_strategy` should I use for structured data extraction and how does `chunking_strategy` help organize the content?"*
|
||||
- *"How do I apply a `css_selector` or `excluded_tags` to refine my extracted content?"*
|
||||
|
||||
11. **Markdown and Text-Only Modes**
|
||||
- *"Can I generate Markdown output directly and what `markdown_generator` should I use?"*
|
||||
- *"When should I set `only_text=True` to strip out non-textual content?"*
|
||||
|
||||
12. **Caching and Session Handling**
|
||||
- *"How does `cache_mode=ENABLED` improve performance, and when should I consider `WRITE_ONLY` or disabling the cache?"*
|
||||
- *"What is the role of `session_id` in maintaining state across requests?"*
|
||||
|
||||
13. **Page Loading and Timing**
|
||||
- *"How do `wait_until`, `page_timeout`, and `wait_for` elements help control page load timing before extraction?"*
|
||||
- *"When should I disable `wait_for_images` to speed up the crawl?"*
|
||||
|
||||
14. **Delays and Concurrency**
|
||||
- *"How do `mean_delay` and `max_range` randomize request intervals to avoid detection?"*
|
||||
- *"What is `semaphore_count` and how does it manage concurrency for multiple crawling tasks?"*
|
||||
|
||||
15. **JavaScript Execution and Dynamic Content**
|
||||
- *"How can I inject custom `js_code` to load additional data or simulate user interactions?"*
|
||||
- *"When should I use `scan_full_page` or `adjust_viewport_to_content` to handle infinite scrolling?"*
|
||||
|
||||
16. **Screenshots, PDFs, and Media**
|
||||
- *"How do I enable `screenshot` or `pdf` generation to capture page states?"*
|
||||
- *"What are `image_description_min_word_threshold` and `image_score_threshold` for, and how do they enhance image-related extraction?"*
|
||||
|
||||
17. **Logging and Debugging**
|
||||
- *"How do `verbose` and `log_console` help me troubleshoot issues with crawling or page scripts?"*
|
||||
|
||||
---
|
||||
|
||||
### Topics Discussed in the File
|
||||
|
||||
- **BrowserConfig Essentials:**
|
||||
- Browser types (`chromium`, `firefox`, `webkit`)
|
||||
- Headless vs. non-headless mode
|
||||
- Persistent context and managed browser sessions
|
||||
- Proxy configurations and network settings
|
||||
- Viewport dimensions and responsive testing
|
||||
- Download handling and storage states
|
||||
- HTTPS errors and JavaScript enablement
|
||||
- Cookies, headers, and user agents
|
||||
- Performance tuning via `text_mode`, `light_mode`, and `extra_args`
|
||||
|
||||
- **CrawlerRunConfig Core Settings:**
|
||||
- Content extraction parameters (`word_count_threshold`, `extraction_strategy`, `chunking_strategy`)
|
||||
- Markdown generation and text-only extraction
|
||||
- Content filtering (`css_selector`, `excluded_tags`)
|
||||
- Caching strategies and `cache_mode` options
|
||||
- Page load conditions (`wait_until`, `wait_for`) and timeouts (`page_timeout`)
|
||||
- Delays, concurrency, and scaling (`mean_delay`, `max_range`, `semaphore_count`)
|
||||
- JavaScript injections (`js_code`) and handling dynamic/infinite scroll content
|
||||
- Screenshots, PDFs, and image thresholds for enhanced outputs
|
||||
- Logging and debugging modes (`verbose`, `log_console`)
|
||||
Reference in New Issue
Block a user