# Crawl4ai - AI Friendly Documentation (aka LLM.TXT)

This document provides AI-friendly documentation for the Crawl4ai library. It contains three types of content:

**Memory (Facts)**: Similar to traditional LLM.txt files, this section contains factual information about the library - what it is, what it does, its components, APIs, and capabilities. This is the reference knowledge that an AI needs to understand the library.

**Reasoning (Instructions)**: This section instructs AI models on how to use the factual knowledge, think about problems in the way the library authors intended, and provide solutions that align with the library's design philosophy. It guides the AI's problem-solving approach when working with Crawl4ai.

**Examples**: Practical code examples demonstrating how to use the library's features in real-world scenarios. These examples help AI models understand the practical application of the concepts.

## Content (Memory)

# Detailed Outline for crawl4ai - config_objects Component

**Target Document Type:** memory

**Target Output Filename Suggestion:** `llm_memory_config_objects.md`

**Library Version Context:** 0.6.3

**Outline Generation Date:** 2024-05-24

---

## 1. Introduction to Configuration Objects in Crawl4ai

* **1.1. Purpose of Configuration Objects**
    * Explanation: Configuration objects in `crawl4ai` centralize and manage settings for the library's components and behaviors, including browser setup, individual crawl run parameters, LLM provider interactions, proxy settings, and more.
    * Benefit: This approach enhances code readability by grouping related settings, improves maintainability by giving configurations a clear structure, and makes it easy for users to tailor the library's behavior to their specific needs.
* **1.2. General Principles and Usage**
    * **1.2.1. Immutability/Cloning:**
        * Concept: Most configuration objects provide a `clone()` method that creates a modified copy without altering the original instance. This promotes safer state management, especially when reusing a base configuration across multiple tasks.
        * Method: `clone(**kwargs)` on most configuration objects.
    * **1.2.2. Serialization and Deserialization:**
        * Concept: `crawl4ai` configuration objects can be serialized to dictionary format (e.g., for saving to JSON) and deserialized back into their respective class instances.
        * Methods:
            * `dump() -> dict`: Serializes the object to a dictionary suitable for JSON, often using the internal `to_serializable_dict` helper.
            * `load(data: dict) -> ConfigClass` (Static Method): Deserializes an object from a dictionary, often using the internal `from_serializable_dict` helper.
            * `to_dict() -> dict`: Converts the object to a standard Python dictionary.
            * `from_dict(data: dict) -> ConfigClass` (Static Method): Creates an instance from a standard Python dictionary.
        * Helper Functions:
            * `crawl4ai.async_configs.to_serializable_dict(obj: Any, ignore_default_value: bool = False) -> Dict`: Recursively converts objects into a serializable dictionary format, handling complex types like enums and nested objects.
            * `crawl4ai.async_configs.from_serializable_dict(data: Any) -> Any`: Reconstructs Python objects from the serializable dictionary format.
* **1.3. Scope of this Document**
    * Statement: This document provides a factual API reference for the primary configuration objects within the `crawl4ai` library, detailing their purpose, initialization parameters, attributes, and key methods.

## 2. Core Configuration Objects

### 2.1. `BrowserConfig`

Located in `crawl4ai.async_configs`.

* **2.1.1. Purpose:**
    * Description: The `BrowserConfig` class configures a browser instance and its associated contexts when using browser-based crawler strategies such as `AsyncPlaywrightCrawlerStrategy`. It centralizes all parameters that affect the creation and behavior of the browser.
* **2.1.2. Initialization (`__init__`)**
    * Signature:

        ```python
        class BrowserConfig:
            def __init__(
                self,
                browser_type: str = "chromium",
                headless: bool = True,
                browser_mode: str = "dedicated",
                use_managed_browser: bool = False,
                cdp_url: Optional[str] = None,
                use_persistent_context: bool = False,
                user_data_dir: Optional[str] = None,
                chrome_channel: Optional[str] = "chromium",  # Note: 'channel' is preferred
                channel: Optional[str] = "chromium",
                proxy: Optional[str] = None,
                proxy_config: Optional[Union[ProxyConfig, dict]] = None,
                viewport_width: int = 1080,
                viewport_height: int = 600,
                viewport: Optional[dict] = None,
                accept_downloads: bool = False,
                downloads_path: Optional[str] = None,
                storage_state: Optional[Union[str, dict]] = None,
                ignore_https_errors: bool = True,
                java_script_enabled: bool = True,
                sleep_on_close: bool = False,
                verbose: bool = True,
                cookies: Optional[List[dict]] = None,
                headers: Optional[dict] = None,
                user_agent: Optional[str] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36",
                user_agent_mode: Optional[str] = "",
                user_agent_generator_config: Optional[dict] = None,  # Default is {} in __init__
                text_mode: bool = False,
                light_mode: bool = False,
                extra_args: Optional[List[str]] = None,
                debugging_port: int = 9222,
                host: str = "localhost",
            ): ...
        ```

    * Parameters:
        * `browser_type (str, default: "chromium")`: Specifies the browser engine to use. Supported values: `"chromium"`, `"firefox"`, `"webkit"`.
        * `headless (bool, default: True)`: If `True`, runs the browser without a visible GUI. Set to `False` for debugging or visual interaction.
        * `browser_mode (str, default: "dedicated")`: Defines how the browser is initialized.
          Options: `"builtin"` (uses the built-in CDP browser), `"dedicated"` (a new instance each time), `"cdp"` (connects to an existing CDP endpoint specified by `cdp_url`), `"docker"` (runs the browser in a Docker container).
        * `use_managed_browser (bool, default: False)`: If `True`, launches the browser using a managed approach (e.g., via CDP or Docker), allowing for more advanced control. Automatically set to `True` if `browser_mode` is `"builtin"` or `"docker"`, if `cdp_url` is provided, or if `use_persistent_context` is `True`.
        * `cdp_url (Optional[str], default: None)`: The URL for the Chrome DevTools Protocol (CDP) endpoint. If not provided and `use_managed_browser` is active, it may be set by an internal browser manager.
        * `use_persistent_context (bool, default: False)`: If `True`, uses a persistent browser context (profile), saving cookies, localStorage, etc., across sessions. Requires `user_data_dir`. Sets `use_managed_browser=True`.
        * `user_data_dir (Optional[str], default: None)`: Path to a directory for storing user data for persistent sessions. If `None` and `use_persistent_context` is `True`, a temporary directory may be used.
        * `chrome_channel (Optional[str], default: "chromium")`: Specifies the Chrome channel (e.g., "chrome", "msedge", "chromium-beta"). Only applicable if `browser_type` is `"chromium"`.
        * `channel (Optional[str], default: "chromium")`: Preferred alias for `chrome_channel`. Set to `""` for Firefox or WebKit.
        * `proxy (Optional[str], default: None)`: A string representing the proxy server URL (e.g., "http://username:password@proxy.example.com:8080").
        * `proxy_config (Optional[Union[ProxyConfig, dict]], default: None)`: A `ProxyConfig` object or a dictionary specifying detailed proxy settings. Overrides the `proxy` string if both are provided.
        * `viewport_width (int, default: 1080)`: Default width of the browser viewport in pixels.
        * `viewport_height (int, default: 600)`: Default height of the browser viewport in pixels.
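The copy-with-overrides behavior that section 1.2.1 attributes to `clone(**kwargs)` can be illustrated without the library at all. Below is a minimal self-contained sketch of the pattern; `StubConfig` is a hypothetical stand-in, not a crawl4ai class:

```python
from dataclasses import dataclass, replace

# Hypothetical stand-in for a crawl4ai config object; it demonstrates only
# the clone(**kwargs) copy-with-overrides idea from section 1.2.1.
@dataclass(frozen=True)
class StubConfig:
    browser_type: str = "chromium"
    headless: bool = True
    viewport_width: int = 1080

    def clone(self, **kwargs) -> "StubConfig":
        # Return a copy with selected fields overridden; the original stays intact.
        return replace(self, **kwargs)

base = StubConfig()
debug = base.clone(headless=False, viewport_width=1920)
print(base.headless, debug.headless)  # True False
```

The real `BrowserConfig.clone()` is meant to be used the same way: keep one base configuration and derive task-specific variants from it instead of mutating shared state.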
        * `viewport (Optional[dict], default: None)`: A dictionary specifying viewport dimensions, e.g., `{"width": 1920, "height": 1080}`. If set, overrides `viewport_width` and `viewport_height`.
        * `accept_downloads (bool, default: False)`: If `True`, allows files to be downloaded by the browser.
        * `downloads_path (Optional[str], default: None)`: Directory path where downloaded files will be stored. Required if `accept_downloads` is `True`.
        * `storage_state (Optional[Union[str, dict]], default: None)`: Path to a JSON file, or a dictionary, containing the browser's storage state (cookies, localStorage, etc.) to load.
        * `ignore_https_errors (bool, default: True)`: If `True`, HTTPS certificate errors are ignored.
        * `java_script_enabled (bool, default: True)`: If `True`, JavaScript execution is enabled on web pages.
        * `sleep_on_close (bool, default: False)`: If `True`, introduces a small delay before the browser is closed.
        * `verbose (bool, default: True)`: If `True`, enables verbose logging for browser operations.
        * `cookies (Optional[List[dict]], default: None)`: A list of cookie dictionaries to set in the browser context. Each dictionary should conform to Playwright's cookie format.
        * `headers (Optional[dict], default: None)`: A dictionary of additional HTTP headers to send with every request made by the browser.
        * `user_agent (Optional[str], default: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36")`: The User-Agent string the browser will use.
        * `user_agent_mode (Optional[str], default: "")`: Mode for generating the User-Agent string. If set (e.g., to "random"), `user_agent_generator_config` can be used.
        * `user_agent_generator_config (Optional[dict], default: {})`: Configuration dictionary for the User-Agent generator when `user_agent_mode` is active.
        * `text_mode (bool, default: False)`: If `True`, attempts to disable images and other rich content to potentially speed up loading for text-focused crawls.
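The viewport precedence described above (an explicit `viewport` dict wins over `viewport_width`/`viewport_height`) can be sketched as a tiny standalone function. `resolve_viewport` is illustrative only, not a crawl4ai API:

```python
from typing import Optional, Tuple

def resolve_viewport(
    viewport: Optional[dict],
    viewport_width: int = 1080,
    viewport_height: int = 600,
) -> Tuple[int, int]:
    # Illustrative sketch: a `viewport` dict, when given, overrides the
    # separate width/height fields, mirroring the documented precedence.
    if viewport is not None:
        return viewport["width"], viewport["height"]
    return viewport_width, viewport_height

print(resolve_viewport({"width": 1920, "height": 1080}))  # (1920, 1080)
print(resolve_viewport(None, 1280, 800))  # (1280, 800)
```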
        * `light_mode (bool, default: False)`: If `True`, disables certain background browser features for potential performance gains.
        * `extra_args (Optional[List[str]], default: None)`: A list of additional command-line arguments to pass to the browser executable upon launch.
        * `debugging_port (int, default: 9222)`: The port to use for the browser's remote debugging protocol (CDP).
        * `host (str, default: "localhost")`: The host on which the browser's remote debugging protocol will listen.
* **2.1.3. Key Public Attributes/Properties:**
    * All parameters listed in `__init__` are available as public attributes with the same names and types.
    * `browser_hint (str)`: [Read-only] - A string representing client hints (Sec-CH-UA) generated from the `user_agent` string. It is set automatically during initialization.
* **2.1.4. Key Public Methods:**
    * `from_kwargs(cls, kwargs: dict) -> BrowserConfig` (Static Method):
        * Purpose: Creates a `BrowserConfig` instance from a dictionary of keyword arguments.
    * `to_dict(self) -> dict`:
        * Purpose: Converts the `BrowserConfig` instance into a dictionary representation.
    * `clone(self, **kwargs) -> BrowserConfig`:
        * Purpose: Creates a deep copy of the current `BrowserConfig` instance. Keyword arguments can be provided to override specific attributes in the new instance.
    * `dump(self) -> dict`:
        * Purpose: Serializes the `BrowserConfig` object into a dictionary format suitable for JSON storage or transmission, using the `to_serializable_dict` helper.
    * `load(cls, data: dict) -> BrowserConfig` (Static Method):
        * Purpose: Deserializes a `BrowserConfig` object from a dictionary (typically one created by `dump()`), using the `from_serializable_dict` helper.

### 2.2. `CrawlerRunConfig`

Located in `crawl4ai.async_configs`.

* **2.2.1. Purpose:**
    * Description: The `CrawlerRunConfig` class encapsulates all settings that control the behavior of a single crawl operation performed by `AsyncWebCrawler.arun()`, or of multiple operations within `AsyncWebCrawler.arun_many()`. This includes parameters for content processing, page interaction, caching, and media handling.
* **2.2.2. Initialization (`__init__`)**
    * Signature:

        ```python
        class CrawlerRunConfig:
            def __init__(
                self,
                url: Optional[str] = None,
                word_count_threshold: int = MIN_WORD_THRESHOLD,
                extraction_strategy: Optional[ExtractionStrategy] = None,
                chunking_strategy: Optional[ChunkingStrategy] = RegexChunking(),
                markdown_generator: Optional[MarkdownGenerationStrategy] = DefaultMarkdownGenerator(),
                only_text: bool = False,
                css_selector: Optional[str] = None,
                target_elements: Optional[List[str]] = None,  # Default is [] in __init__
                excluded_tags: Optional[List[str]] = None,  # Default is [] in __init__
                excluded_selector: Optional[str] = "",  # Default is "" in __init__
                keep_data_attributes: bool = False,
                keep_attrs: Optional[List[str]] = None,  # Default is [] in __init__
                remove_forms: bool = False,
                prettify: bool = False,
                parser_type: str = "lxml",
                scraping_strategy: Optional[ContentScrapingStrategy] = None,  # Instantiated with WebScrapingStrategy() if None
                proxy_config: Optional[Union[ProxyConfig, dict]] = None,
                proxy_rotation_strategy: Optional[ProxyRotationStrategy] = None,
                locale: Optional[str] = None,
                timezone_id: Optional[str] = None,
                geolocation: Optional[GeolocationConfig] = None,
                fetch_ssl_certificate: bool = False,
                cache_mode: CacheMode = CacheMode.BYPASS,
                session_id: Optional[str] = None,
                shared_data: Optional[dict] = None,
                wait_until: str = "domcontentloaded",
                page_timeout: int = PAGE_TIMEOUT,
                wait_for: Optional[str] = None,
                wait_for_timeout: Optional[int] = None,
                wait_for_images: bool = False,
                delay_before_return_html: float = 0.1,
                mean_delay: float = 0.1,
                max_range: float = 0.3,
                semaphore_count: int = 5,
                js_code: Optional[Union[str, List[str]]] = None,
                js_only: bool = False,
                ignore_body_visibility: bool = True,
                scan_full_page: bool = False,
                scroll_delay: float = 0.2,
                process_iframes: bool = False,
                remove_overlay_elements: bool = False,
                simulate_user: bool = False,
                override_navigator: bool = False,
                magic: bool = False,
                adjust_viewport_to_content: bool = False,
                screenshot: bool = False,
                screenshot_wait_for: Optional[float] = None,
                screenshot_height_threshold: int = SCREENSHOT_HEIGHT_THRESHOLD,
                pdf: bool = False,
                capture_mhtml: bool = False,
                image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
                image_score_threshold: int = IMAGE_SCORE_THRESHOLD,
                table_score_threshold: int = 7,
                exclude_external_images: bool = False,
                exclude_all_images: bool = False,
                exclude_social_media_domains: Optional[List[str]] = None,  # Uses SOCIAL_MEDIA_DOMAINS if None
                exclude_external_links: bool = False,
                exclude_social_media_links: bool = False,
                exclude_domains: Optional[List[str]] = None,  # Default is [] in __init__
                exclude_internal_links: bool = False,
                verbose: bool = True,
                log_console: bool = False,
                capture_network_requests: bool = False,
                capture_console_messages: bool = False,
                method: str = "GET",
                stream: bool = False,
                check_robots_txt: bool = False,
                user_agent: Optional[str] = None,
                user_agent_mode: Optional[str] = None,
                user_agent_generator_config: Optional[dict] = None,  # Default is {} in __init__
                deep_crawl_strategy: Optional[DeepCrawlStrategy] = None,
                experimental: Optional[Dict[str, Any]] = None,  # Default is {} in __init__
            ): ...
        ```

    * Parameters:
        * `url (Optional[str], default: None)`: The target URL for this specific crawl run.
        * `word_count_threshold (int, default: MIN_WORD_THRESHOLD)`: Minimum word count for a text block to be considered significant during content processing.
        * `extraction_strategy (Optional[ExtractionStrategy], default: None)`: Strategy for extracting structured data from the page. If `None`, `NoExtractionStrategy` is used.
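`CrawlerRunConfig` carries enum-valued fields such as `cache_mode`, so the `dump()`/`load()` round-trip from section 1.2.2 has to translate enums to plain JSON values and back. Below is a self-contained sketch of that round-trip pattern, using hypothetical stand-ins (`StubRunConfig`, and a local `CacheMode`) rather than the real classes:

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum

class CacheMode(Enum):
    # Local stand-in for crawl4ai's CacheMode, used only to show enum handling.
    ENABLED = "enabled"
    BYPASS = "bypass"

@dataclass
class StubRunConfig:
    # Hypothetical stand-in for CrawlerRunConfig.
    cache_mode: CacheMode = CacheMode.BYPASS
    verbose: bool = True

    def dump(self) -> dict:
        d = asdict(self)
        d["cache_mode"] = self.cache_mode.value  # enums are not JSON-serializable as-is
        return d

    @staticmethod
    def load(data: dict) -> "StubRunConfig":
        return StubRunConfig(
            cache_mode=CacheMode(data["cache_mode"]),
            verbose=data["verbose"],
        )

cfg = StubRunConfig()
restored = StubRunConfig.load(json.loads(json.dumps(cfg.dump())))
print(restored == cfg)  # True
```

The library's `to_serializable_dict`/`from_serializable_dict` helpers generalize this idea recursively to nested config objects; the sketch shows only the enum-to-value conversion step.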
        * `chunking_strategy (Optional[ChunkingStrategy], default: RegexChunking())`: Strategy to split content into chunks before extraction.
        * `markdown_generator (Optional[MarkdownGenerationStrategy], default: DefaultMarkdownGenerator())`: Strategy for converting HTML to Markdown.
        * `only_text (bool, default: False)`: If `True`, attempts to extract only textual content, potentially ignoring structural elements that benefit rich Markdown.
        * `css_selector (Optional[str], default: None)`: A CSS selector defining the primary region of the page to focus on for content extraction. The raw HTML is reduced to this region.
        * `target_elements (Optional[List[str]], default: [])`: A list of CSS selectors. If provided, only the content within these elements is considered for Markdown generation and structured data extraction. Unlike `css_selector`, this does not reduce the raw HTML but scopes the processing.
        * `excluded_tags (Optional[List[str]], default: [])`: A list of HTML tag names (e.g., "nav", "footer") to remove from the HTML before processing.
        * `excluded_selector (Optional[str], default: "")`: A CSS selector specifying elements to remove from the HTML before processing.
        * `keep_data_attributes (bool, default: False)`: If `True`, `data-*` attributes on HTML elements are preserved during cleaning.
        * `keep_attrs (Optional[List[str]], default: [])`: A list of specific HTML attribute names to preserve during HTML cleaning.
        * `remove_forms (bool, default: False)`: If `True`, all `