Files
crawl4ai/docs/llm.txt/2_configuration.q.md
UncleCode 84b311760f Commit Message:
Enhance Crawl4AI with CLI and documentation updates
  - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py`
  - Added chunking strategies and their documentation in `llm.txt`
2024-12-21 14:26:56 +08:00

4.8 KiB

Hypothetical Questions

BrowserConfig:

  1. Browser Types and Headless Mode

    • "How do I choose between chromium, firefox, or webkit for browser_type?"
    • "What are the benefits of running the browser in headless=True mode versus a visible UI?"
  2. Managed Browser and Persistent Context

    • "When should I enable use_managed_browser for advanced session control?"
    • "How do I use use_persistent_context and user_data_dir to maintain login sessions and persistent storage?"
  3. Debugging and Remote Access

    • "How do I use the debugging_port to remotely inspect the browser with DevTools?"
  4. Proxy and Network Configurations

    • "How can I configure a proxy or proxy_config for region-specific crawling or authentication?"
  5. Viewports and Layout Testing

    • "How do I adjust viewport_width and viewport_height for responsive layout testing?"
  6. Downloads and Storage States

    • "What steps do I need to take to enable accept_downloads and specify a downloads_path?"
    • "How can I use storage_state to preload cookies or session data?"
  7. HTTPS and JavaScript Settings

    • "What happens if I set ignore_https_errors=True on sites with invalid SSL certificates?"
    • "When should I disable java_script_enabled to improve speed and stability?"
  8. Cookies, Headers, and User Agents

    • "How do I add custom cookies or headers to every browser request?"
    • "How can I set a custom user_agent or use a user_agent_mode like random to avoid detection?"
  9. Performance Tuning

    • "What is the difference between text_mode, light_mode, and adding extra_args for performance tuning?"

CrawlerRunConfig:

  1. Content Extraction and Filtering

    • "How does the word_count_threshold affect which pages or sections get processed?"
    • "What extraction_strategy should I use for structured data extraction and how does chunking_strategy help organize the content?"
    • "How do I apply a css_selector or excluded_tags to refine my extracted content?"
  2. Markdown and Text-Only Modes

    • "Can I generate Markdown output directly and what markdown_generator should I use?"
    • "When should I set only_text=True to strip out non-textual content?"
  3. Caching and Session Handling

    • "How does cache_mode=ENABLED improve performance, and when should I consider WRITE_ONLY or disabling the cache?"
    • "What is the role of session_id in maintaining state across requests?"
  4. Page Loading and Timing

    • "How do wait_until, page_timeout, and wait_for elements help control page load timing before extraction?"
    • "When should I disable wait_for_images to speed up the crawl?"
  5. Delays and Concurrency

    • "How do mean_delay and max_range randomize request intervals to avoid detection?"
    • "What is semaphore_count and how does it manage concurrency for multiple crawling tasks?"
  6. JavaScript Execution and Dynamic Content

    • "How can I inject custom js_code to load additional data or simulate user interactions?"
    • "When should I use scan_full_page or adjust_viewport_to_content to handle infinite scrolling?"
  7. Screenshots, PDFs, and Media

    • "How do I enable screenshot or pdf generation to capture page states?"
    • "What are image_description_min_word_threshold and image_score_threshold for, and how do they enhance image-related extraction?"
  8. Logging and Debugging

    • "How do verbose and log_console help me troubleshoot issues with crawling or page scripts?"

Topics Discussed in the File

  • BrowserConfig Essentials:

    • Browser types (chromium, firefox, webkit)
    • Headless vs. non-headless mode
    • Persistent context and managed browser sessions
    • Proxy configurations and network settings
    • Viewport dimensions and responsive testing
    • Download handling and storage states
    • HTTPS errors and JavaScript enablement
    • Cookies, headers, and user agents
    • Performance tuning via text_mode, light_mode, and extra_args
  • CrawlerRunConfig Core Settings:

    • Content extraction parameters (word_count_threshold, extraction_strategy, chunking_strategy)
    • Markdown generation and text-only extraction
    • Content filtering (css_selector, excluded_tags)
    • Caching strategies and cache_mode options
    • Page load conditions (wait_until, wait_for) and timeouts (page_timeout)
    • Delays, concurrency, and scaling (mean_delay, max_range, semaphore_count)
    • JavaScript injections (js_code) and handling dynamic/infinite scroll content
    • Screenshots, PDFs, and image thresholds for enhanced outputs
    • Logging and debugging modes (verbose, log_console)