Enhance Crawl4AI with CLI and documentation updates - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py` - Added chunking strategies and their documentation in `llm.txt`
Hypothetical Questions
BrowserConfig:
- Browser Types and Headless Mode
  - "How do I choose between `chromium`, `firefox`, or `webkit` for `browser_type`?"
  - "What are the benefits of running the browser in `headless=True` mode versus a visible UI?"
- Managed Browser and Persistent Context
  - "When should I enable `use_managed_browser` for advanced session control?"
  - "How do I use `use_persistent_context` and `user_data_dir` to maintain login sessions and persistent storage?"
- Debugging and Remote Access
  - "How do I use the `debugging_port` to remotely inspect the browser with DevTools?"
- Proxy and Network Configurations
  - "How can I configure a `proxy` or `proxy_config` for region-specific crawling or authentication?"
- Viewports and Layout Testing
  - "How do I adjust `viewport_width` and `viewport_height` for responsive layout testing?"
- Downloads and Storage States
  - "What steps do I need to take to enable `accept_downloads` and specify a `downloads_path`?"
  - "How can I use `storage_state` to preload cookies or session data?"
- HTTPS and JavaScript Settings
  - "What happens if I set `ignore_https_errors=True` on sites with invalid SSL certificates?"
  - "When should I disable `java_script_enabled` to improve speed and stability?"
- Cookies, Headers, and User Agents
  - "How do I add custom `cookies` or `headers` to every browser request?"
  - "How can I set a custom `user_agent` or use a `user_agent_mode` like `random` to avoid detection?"
- Performance Tuning
  - "What is the difference between `text_mode`, `light_mode`, and adding `extra_args` for performance tuning?"
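
Several of the BrowserConfig questions above come down to which keyword arguments to pass. The following is a minimal sketch, not the library's reference usage: the parameter names are the ones listed in the questions, while the helper names (`make_browser_kwargs`, `to_browser_config`), the chosen values, and the `crawl4ai.async_configs` import path are assumptions for illustration.

```python
# Sketch: assemble BrowserConfig keyword arguments from the parameters named above.
# Helper names and default values are hypothetical.
ALLOWED_BROWSER_TYPES = {"chromium", "firefox", "webkit"}


def make_browser_kwargs(browser_type="chromium", headless=True,
                        viewport_width=1280, viewport_height=720):
    """Return a dict of keyword arguments using the parameter names listed above."""
    if browser_type not in ALLOWED_BROWSER_TYPES:
        raise ValueError(f"unsupported browser_type: {browser_type!r}")
    return {
        "browser_type": browser_type,
        "headless": headless,
        "viewport_width": viewport_width,
        "viewport_height": viewport_height,
        "ignore_https_errors": True,   # tolerate invalid SSL certificates
        "java_script_enabled": True,   # set False for static pages
        "user_agent_mode": "random",   # value taken from the question above
        "text_mode": True,             # performance-oriented option
    }


def to_browser_config(kwargs):
    """Build a BrowserConfig if crawl4ai is installed (import path is an assumption)."""
    try:
        from crawl4ai.async_configs import BrowserConfig
    except ImportError:
        return None  # library not available; the kwargs dict is still usable for review
    return BrowserConfig(**kwargs)


browser_kwargs = make_browser_kwargs(browser_type="firefox", headless=False)
print(browser_kwargs)
```

Keeping the arguments in a plain dict makes it easy to log or diff the configuration before any browser is launched.
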
CrawlerRunConfig:
- Content Extraction and Filtering
  - "How does the `word_count_threshold` affect which pages or sections get processed?"
  - "What `extraction_strategy` should I use for structured data extraction and how does `chunking_strategy` help organize the content?"
  - "How do I apply a `css_selector` or `excluded_tags` to refine my extracted content?"
- Markdown and Text-Only Modes
  - "Can I generate Markdown output directly and what `markdown_generator` should I use?"
  - "When should I set `only_text=True` to strip out non-textual content?"
- Caching and Session Handling
  - "How does `cache_mode=ENABLED` improve performance, and when should I consider `WRITE_ONLY` or disabling the cache?"
  - "What is the role of `session_id` in maintaining state across requests?"
- Page Loading and Timing
  - "How do `wait_until`, `page_timeout`, and `wait_for` elements help control page load timing before extraction?"
  - "When should I disable `wait_for_images` to speed up the crawl?"
- Delays and Concurrency
  - "How do `mean_delay` and `max_range` randomize request intervals to avoid detection?"
  - "What is `semaphore_count` and how does it manage concurrency for multiple crawling tasks?"
- JavaScript Execution and Dynamic Content
  - "How can I inject custom `js_code` to load additional data or simulate user interactions?"
  - "When should I use `scan_full_page` or `adjust_viewport_to_content` to handle infinite scrolling?"
- Screenshots, PDFs, and Media
  - "How do I enable `screenshot` or `pdf` generation to capture page states?"
  - "What are `image_description_min_word_threshold` and `image_score_threshold` for, and how do they enhance image-related extraction?"
- Logging and Debugging
  - "How do `verbose` and `log_console` help me troubleshoot issues with crawling or page scripts?"
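
The questions about `mean_delay`, `max_range`, and `semaphore_count` can be illustrated without the library. The sketch below assumes each request waits `mean_delay` plus a uniform random jitter of up to `max_range` seconds, and that `semaphore_count` caps how many fetches run at once. That delay formula, the helper names (`next_delay`, `crawl_all`), and the stand-in `fake_fetch` are assumptions, not confirmed Crawl4AI internals.

```python
import asyncio
import random


def next_delay(mean_delay: float, max_range: float, rng: random.Random) -> float:
    """Assumed formula: a base delay plus uniform jitter in [0, max_range]."""
    return mean_delay + rng.uniform(0, max_range)


async def crawl_all(urls, fetch, semaphore_count=5, mean_delay=0.1, max_range=0.3, seed=None):
    """Run fetch(url) for every URL with bounded concurrency and randomized delays."""
    rng = random.Random(seed)
    semaphore = asyncio.Semaphore(semaphore_count)  # at most semaphore_count fetches in flight

    async def worker(url):
        async with semaphore:
            await asyncio.sleep(next_delay(mean_delay, max_range, rng))
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(worker(u) for u in urls))


async def fake_fetch(url):
    """Hypothetical stand-in for a real page fetch."""
    return f"fetched {url}"


results = asyncio.run(
    crawl_all(["a", "b", "c"], fake_fetch, semaphore_count=2, mean_delay=0.0, max_range=0.01)
)
print(results)
```

A smaller `semaphore_count` trades throughput for a lighter load on the target site, while a wider `max_range` makes the request intervals less regular.
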
Topics Discussed in the File
- BrowserConfig Essentials:
  - Browser types (`chromium`, `firefox`, `webkit`)
  - Headless vs. non-headless mode
  - Persistent context and managed browser sessions
  - Proxy configurations and network settings
  - Viewport dimensions and responsive testing
  - Download handling and storage states
  - HTTPS errors and JavaScript enablement
  - Cookies, headers, and user agents
  - Performance tuning via `text_mode`, `light_mode`, and `extra_args`
- CrawlerRunConfig Core Settings:
  - Content extraction parameters (`word_count_threshold`, `extraction_strategy`, `chunking_strategy`)
  - Markdown generation and text-only extraction
  - Content filtering (`css_selector`, `excluded_tags`)
  - Caching strategies and `cache_mode` options
  - Page load conditions (`wait_until`, `wait_for`) and timeouts (`page_timeout`)
  - Delays, concurrency, and scaling (`mean_delay`, `max_range`, `semaphore_count`)
  - JavaScript injections (`js_code`) and handling dynamic/infinite scroll content
  - Screenshots, PDFs, and image thresholds for enhanced outputs
  - Logging and debugging modes (`verbose`, `log_console`)
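
As a rough illustration of the CrawlerRunConfig settings listed above, the sketch below keeps a set of per-run defaults and rejects override keys that are not among the listed parameter names. The helper `make_run_kwargs`, the default values, and the reading of `page_timeout` as milliseconds are assumptions; the resulting dict is meant to be passed as `CrawlerRunConfig(**kwargs)` only after checking it against the installed version.

```python
# Sketch: per-run settings keyed by the CrawlerRunConfig parameter names listed above.
# Values are illustrative assumptions, not recommended defaults.
RUN_DEFAULTS = {
    "word_count_threshold": 10,         # skip very short text blocks
    "css_selector": None,               # e.g. "article" to narrow extraction
    "excluded_tags": ["nav", "footer"],
    "wait_until": "domcontentloaded",   # page load condition
    "page_timeout": 60000,              # assumed to be milliseconds
    "mean_delay": 0.1,
    "max_range": 0.3,
    "semaphore_count": 5,
    "screenshot": False,
    "pdf": False,
    "verbose": True,
    "log_console": False,
}


def make_run_kwargs(**overrides):
    """Merge overrides onto the defaults, rejecting names not listed above."""
    unknown = set(overrides) - set(RUN_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown run settings: {sorted(unknown)}")
    return {**RUN_DEFAULTS, **overrides}


run_kwargs = make_run_kwargs(css_selector="article", screenshot=True)
print(run_kwargs["css_selector"], run_kwargs["screenshot"])
```

Rejecting unknown keys early catches typos such as `page_timout` before they reach the crawler.
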