# Core Configurations

## BrowserConfig

`BrowserConfig` centralizes all parameters required to set up and manage a browser instance and its context. This configuration ensures consistent and documented browser behavior for the crawler. Below is a detailed explanation of each parameter and its optimal use cases.

### Parameters and Use Cases

#### `browser_type`
- **Description**: Specifies the type of browser to launch.
  - Supported values: `"chromium"`, `"firefox"`, `"webkit"`
  - Default: `"chromium"`
- **Use Case**:
  - Use `"chromium"` for general-purpose crawling with modern web standards.
  - Use `"firefox"` when testing against Firefox-specific behavior.
  - Use `"webkit"` for testing Safari-like environments.

#### `headless`
- **Description**: Determines whether the browser runs in headless mode (no GUI).
  - Default: `True`
- **Use Case**:
  - Enable for faster, automated operations without UI overhead.
  - Disable (`False`) when debugging or inspecting browser behavior visually.

#### `use_managed_browser`
- **Description**: Enables advanced manipulation via a managed browser approach.
  - Default: `False`
- **Use Case**:
  - Use when fine-grained control is needed over browser sessions, such as debugging network requests or reusing sessions.

#### `debugging_port`
- **Description**: Port for remote debugging.
  - Default: `9222`
- **Use Case**:
  - Use for debugging browser sessions with DevTools or external tools.

#### `use_persistent_context`
- **Description**: Uses a persistent browser context (e.g., saved profiles).
  - Automatically enables `use_managed_browser`.
  - Default: `False`
- **Use Case**:
  - Persistent login sessions for authenticated crawling.
  - Retaining cookies or local storage across multiple runs.

#### `user_data_dir`
- **Description**: Path to a directory for storing persistent browser data.
  - Default: `None`
- **Use Case**:
  - Specify a directory to save browser profiles for multi-run crawls or debugging.
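The coupling between `use_persistent_context` and `use_managed_browser` can be sketched in plain Python. This is an illustrative model of the documented behavior, not the library's actual implementation; `resolve_browser_flags` is a hypothetical helper name:

```python
def resolve_browser_flags(use_persistent_context: bool,
                          use_managed_browser: bool) -> bool:
    """Return the effective use_managed_browser value.

    Per the description above, enabling a persistent context
    automatically enables managed-browser mode, even when
    use_managed_browser was left at its default of False.
    """
    return use_managed_browser or use_persistent_context


# A persistent profile forces managed mode on:
managed = resolve_browser_flags(use_persistent_context=True,
                                use_managed_browser=False)
# With both flags off, the default (False) is kept:
unmanaged = resolve_browser_flags(use_persistent_context=False,
                                  use_managed_browser=False)
```

In practice this means that when you point `user_data_dir` at a saved profile and enable `use_persistent_context`, you get managed-browser behavior without setting `use_managed_browser` yourself.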
#### `chrome_channel`
- **Description**: Specifies the Chrome channel to launch (e.g., `"chrome"`, `"msedge"`).
  - Applies only when `browser_type` is `"chromium"`.
  - Default: `"chrome"`
- **Use Case**:
  - Use `"msedge"` for compatibility testing with Edge browsers.

#### `proxy` and `proxy_config`
- **Description**:
  - `proxy`: Proxy server URL for the browser.
  - `proxy_config`: Detailed proxy configuration.
  - Default: `None`
- **Use Case**:
  - Set `proxy` for single-proxy setups.
  - Use `proxy_config` for advanced configurations, such as authenticated proxies or regional routing.

#### `viewport_width` and `viewport_height`
- **Description**: Sets the default browser viewport dimensions.
  - Default: `1080` (width), `600` (height)
- **Use Case**:
  - Adjust for crawling responsive layouts or specific device emulations.

#### `accept_downloads` and `downloads_path`
- **Description**:
  - `accept_downloads`: Allows file downloads.
  - `downloads_path`: Directory for storing downloads.
  - Default: `False`, `None`
- **Use Case**:
  - Use when downloading and analyzing files like PDFs or spreadsheets.

#### `storage_state`
- **Description**: Specifies cookies and local storage state.
  - Default: `None`
- **Use Case**:
  - Provide state data for authenticated or preconfigured sessions.

#### `ignore_https_errors`
- **Description**: Ignores HTTPS certificate errors.
  - Default: `True`
- **Use Case**:
  - Enable for crawling sites with invalid certificates (testing environments).

#### `java_script_enabled`
- **Description**: Toggles JavaScript execution in pages.
  - Default: `True`
- **Use Case**:
  - Disable for simpler, faster crawls where JavaScript is unnecessary.

#### `cookies`
- **Description**: List of cookies to add to the browser context.
  - Default: `[]`
- **Use Case**:
  - Use for authenticated or preconfigured crawling scenarios.

#### `headers`
- **Description**: Extra HTTP headers applied to all requests.
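One way to think about how `proxy` and `proxy_config` might interact is sketched below. The helper name and the precedence rule (detailed config wins over the simple URL) are assumptions for illustration, not the library's documented code:

```python
def effective_proxy(proxy=None, proxy_config=None):
    """Pick the proxy settings a crawl would use.

    Illustrative precedence: a detailed proxy_config dict (server,
    credentials, etc.) is assumed to take priority over the simple
    proxy URL string; with neither set, no proxy is used.
    """
    if proxy_config is not None:
        return proxy_config
    if proxy is not None:
        return {"server": proxy}
    return None


# Simple single-proxy setup via a URL string:
simple = effective_proxy(proxy="http://proxy.example.com:8080")

# Authenticated proxy via the detailed configuration form:
detailed = effective_proxy(proxy_config={
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "secret",
})
```

For regional routing or rotating credentials, the `proxy_config` form is the natural fit; the plain `proxy` string covers the common one-proxy case.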
  - Default: `{}`
- **Use Case**:
  - Customize headers for API-like crawling or bypassing bot detection.

#### `user_agent` and `user_agent_mode`
- **Description**:
  - `user_agent`: Custom User-Agent string.
  - `user_agent_mode`: Mode for generating the User-Agent (e.g., `"random"`).
  - Default: a standard Chromium-based User-Agent.
- **Use Case**:
  - Set a static User-Agent for consistent identification.
  - Use `"random"` mode to reduce the likelihood of bot detection.

#### `text_mode`
- **Description**: Disables images and other rich content for faster load times.
  - Default: `False`
- **Use Case**:
  - Enable for text-only extraction tasks where speed is prioritized.

#### `light_mode`
- **Description**: Disables background features for performance gains.
  - Default: `False`
- **Use Case**:
  - Enable for high-performance crawls in resource-constrained environments.

#### `extra_args`
- **Description**: Additional command-line arguments for browser execution.
  - Default: `[]`
- **Use Case**:
  - Use for advanced browser configurations, such as WebRTC or GPU tuning.

#### `verbose`
- **Description**: Enables verbose logging of browser operations.
  - Default: `True`
- **Use Case**:
  - Enable for detailed logging during development and debugging.
  - Disable in production for better performance.

#### `sleep_on_close`
- **Description**: Adds a delay before closing the browser.
  - Default: `False`
- **Use Case**:
  - Enable when you need to ensure all browser operations are complete before closing.

## CrawlerRunConfig

The `CrawlerRunConfig` class centralizes parameters for controlling crawl operations. This configuration covers content extraction, page interactions, caching, and runtime behaviors. Below is an exhaustive breakdown of parameters and their best-use scenarios.

### Parameters and Use Cases

#### Content Processing Parameters

##### `word_count_threshold`
- **Description**: Minimum word count threshold for processing content.
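The `user_agent` / `user_agent_mode` selection logic can be sketched as follows. This is an illustrative model only; `pick_user_agent`, the placeholder default string, and the caller-supplied pool are assumptions, not the library's actual implementation:

```python
import random


def pick_user_agent(user_agent=None, user_agent_mode=None, pool=None):
    """Choose the User-Agent string for a crawl.

    Illustrative logic: "random" mode samples from a pool of UA
    strings to reduce bot-detection likelihood; otherwise the static
    user_agent (or a standard default) is used.
    """
    default_ua = "Mozilla/5.0 (hypothetical default UA)"  # placeholder
    if user_agent_mode == "random":
        return random.choice(pool or [default_ua])
    return user_agent or default_ua


# A static UA gives consistent identification across requests:
static = pick_user_agent(user_agent="my-crawler/1.0")

# Random mode draws from a pool, varying the fingerprint per run:
randomized = pick_user_agent(user_agent_mode="random",
                             pool=["ua-a", "ua-b"])
```

The trade-off mirrors the use cases above: a fixed string is easier to allowlist on sites you control, while randomization makes automated traffic harder to fingerprint.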
  - Default: `200`
- **Use Case**:
  - Set a higher threshold for content-heavy pages to skip lightweight or irrelevant content.

##### `extraction_strategy`
- **Description**: Strategy for extracting structured data from crawled pages.
  - Default: `None` (falls back to `NoExtractionStrategy`).
- **Use Case**:
  - Use for schema-driven extraction when working with well-defined data models such as JSON.

##### `chunking_strategy`
- **Description**: Strategy for chunking content before extraction.
  - Default: `RegexChunking()`.
- **Use Case**:
  - Use NLP-based chunking for semantic extractions, or regex for predictable text blocks.

##### `markdown_generator`
- **Description**: Strategy for generating Markdown output.
  - Default: `None`.
- **Use Case**:
  - Use custom Markdown strategies for AI-ready outputs such as RAG pipelines.

##### `content_filter`
- **Description**: Optional filter to prune irrelevant content.
  - Default: `None`.
- **Use Case**:
  - Use relevance-based filters for focused crawls, e.g., keyword-specific searches.

##### `only_text`
- **Description**: Extracts text-only content where applicable.
  - Default: `False`.
- **Use Case**:
  - Enable for extracting clean text without HTML tags or rich content.

##### `css_selector`
- **Description**: CSS selector to extract a specific portion of the page.
  - Default: `None`.
- **Use Case**:
  - Use when targeting specific page elements, such as articles or headlines.

##### `excluded_tags`
- **Description**: List of HTML tags to exclude from processing.
  - Default: `None`.
- **Use Case**:
  - Remove elements like `