browser_config: Configure browser type with chromium, firefox, or webkit support | browser selection, browser engine, web engine | BrowserConfig(browser_type="chromium")

headless_mode: Toggle headless browser mode for GUI-less operation | headless browser, no GUI, background mode | BrowserConfig(headless=True)

managed_browser: Enable advanced browser manipulation and control | browser management, session control | BrowserConfig(use_managed_browser=True)

debugging_setup: Configure remote debugging port for browser inspection | debug port, devtools connection | BrowserConfig(debugging_port=9222)

persistent_context: Enable persistent browser sessions for maintaining state | session persistence, profile saving | BrowserConfig(use_persistent_context=True)

browser_profile: Specify directory for storing browser profile data | user data, profile storage | BrowserConfig(user_data_dir="/path/to/profile")

proxy_configuration: Set up proxy settings for browser connections | proxy server, network routing | BrowserConfig(proxy="http://proxy.example.com:8080")

viewport_settings: Configure browser window dimensions | screen size, window dimensions | BrowserConfig(viewport_width=1920, viewport_height=1080)

download_handling: Configure browser download behavior and location | file downloads, download directory | BrowserConfig(accept_downloads=True, downloads_path="/downloads")
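The BrowserConfig options above can be combined in a single constructor call. A minimal sketch, using only parameters listed above (the profile path and proxy URL are placeholders, not working values):

```python
from crawl4ai import BrowserConfig

# Headless Chromium with a persistent profile, a proxy, and a fixed
# viewport -- every keyword argument here comes from the entries above.
browser_config = BrowserConfig(
    browser_type="chromium",
    headless=True,
    use_persistent_context=True,
    user_data_dir="/path/to/profile",          # placeholder path
    proxy="http://proxy.example.com:8080",     # placeholder proxy
    viewport_width=1920,
    viewport_height=1080,
    accept_downloads=True,
    downloads_path="/downloads",               # placeholder path
)
```

The resulting object is passed to AsyncWebCrawler when the crawler is created, so one browser configuration is shared across all crawls in that session.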
content_threshold: Set minimum word count for processing page content | word limit, content filter | CrawlerRunConfig(word_count_threshold=200)

extraction_strategy: Configure method for extracting structured data | data extraction, parsing strategy | CrawlerRunConfig(extraction_strategy=CustomStrategy())

content_chunking: Define strategy for breaking content into chunks | text chunking, content splitting | CrawlerRunConfig(chunking_strategy=RegexChunking())

cache_behavior: Control caching mode for crawler operations | cache control, data caching | CrawlerRunConfig(cache_mode=CacheMode.ENABLED)

page_navigation: Configure page load and navigation timing | page timeout, navigation wait | CrawlerRunConfig(wait_until="domcontentloaded", page_timeout=60000)

javascript_execution: Enable or disable JavaScript processing | JS handling, script execution | CrawlerRunConfig(java_script_enabled=True)

content_filtering: Configure HTML tag exclusion and content cleanup | tag filtering, content cleanup | CrawlerRunConfig(excluded_tags=["script", "style"])

concurrent_operations: Set limit for simultaneous crawler operations | concurrency control, parallel crawling | CrawlerRunConfig(semaphore_count=5)

page_interaction: Configure JavaScript execution and page scanning | page automation, interaction control | CrawlerRunConfig(js_code="custom_script()", scan_full_page=True)

media_capture: Enable screenshot and PDF generation capabilities | visual capture, page export | CrawlerRunConfig(screenshot=True, pdf=True)

debugging_options: Configure logging and console message capture | debug logging, error tracking | CrawlerRunConfig(verbose=True, log_console=True)
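The CrawlerRunConfig options above apply per crawl rather than per browser. A minimal sketch combining several of them in one run, assuming the standard AsyncWebCrawler entry point (the target URL is a placeholder):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def main():
    # Per-run settings -- each keyword argument is taken from the
    # entries above; values are illustrative, not recommendations.
    run_config = CrawlerRunConfig(
        word_count_threshold=200,
        excluded_tags=["script", "style"],
        cache_mode=CacheMode.ENABLED,
        wait_until="domcontentloaded",
        page_timeout=60000,
        verbose=True,
    )

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        # Placeholder URL; pass the run config to a single crawl.
        result = await crawler.arun("https://example.com", config=run_config)
        print(result.markdown)

asyncio.run(main())
```

Keeping browser-level settings in BrowserConfig and crawl-level settings in CrawlerRunConfig lets one browser session serve many crawls, each with different thresholds, filters, or caching behavior.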