# CrawlerRunConfig Parameters Documentation

## Content Processing Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `word_count_threshold` | int | 200 | Minimum word count threshold before processing content |
| `extraction_strategy` | ExtractionStrategy | None | Strategy to extract structured data from crawled pages. When None, uses NoExtractionStrategy |
| `chunking_strategy` | ChunkingStrategy | RegexChunking() | Strategy to chunk content before extraction |
| `markdown_generator` | MarkdownGenerationStrategy | None | Strategy for generating markdown from extracted content |
| `content_filter` | RelevantContentFilter | None | Optional filter to prune irrelevant content |
| `only_text` | bool | False | If True, attempt to extract text-only content where applicable |
| `css_selector` | str | None | CSS selector to extract a specific portion of the page |
| `excluded_tags` | list[str] | [] | List of HTML tags to exclude from processing |
| `keep_data_attributes` | bool | False | If True, retain `data-*` attributes while removing unwanted attributes |
| `remove_forms` | bool | False | If True, remove all `<form>` elements from the HTML |
| `prettiify` | bool | False | If True, apply `fast_format_html` to produce prettified HTML output |

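
To make the effect of `word_count_threshold` concrete, here is a minimal sketch of a word-count filter. It is not the library's implementation: the helper name `filter_blocks`, the idea that the threshold is applied per text block, and the inclusive comparison are all assumptions made for illustration.

```python
from typing import List


def filter_blocks(blocks: List[str], word_count_threshold: int = 200) -> List[str]:
    """Keep only text blocks with at least `word_count_threshold` words.

    Hypothetical helper: per-block filtering and the inclusive (>=)
    comparison are assumptions, not confirmed library behavior.
    """
    return [block for block in blocks if len(block.split()) >= word_count_threshold]


# A short navigation fragment is dropped, a long article body is kept.
blocks = ["Home | About | Contact", "word " * 250]
kept = filter_blocks(blocks)
print(len(kept))
```

Lowering the threshold keeps more short fragments (menus, captions); raising it biases the output toward long-form content.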
## Caching Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `cache_mode` | CacheMode | None | Defines how caching is handled. Defaults to CacheMode.ENABLED internally |
| `session_id` | str | None | Optional session ID to persist browser context and page instance |
| `bypass_cache` | bool | False | Legacy parameter, if True acts like CacheMode.BYPASS |
| `disable_cache` | bool | False | Legacy parameter, if True acts like CacheMode.DISABLED |
| `no_cache_read` | bool | False | Legacy parameter, if True acts like CacheMode.WRITE_ONLY |
| `no_cache_write` | bool | False | Legacy parameter, if True acts like CacheMode.READ_ONLY |

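
The mapping from the legacy flags to `CacheMode` values can be sketched as a small resolver. This is an illustration only: the `CacheMode` enum below is a local stand-in that mirrors the names in the table, `resolve_cache_mode` is a hypothetical helper, and the precedence order (explicit `cache_mode` first, then `disable_cache`, `bypass_cache`, `no_cache_read`, `no_cache_write`) is an assumption rather than documented behavior.

```python
from enum import Enum
from typing import Optional


class CacheMode(Enum):
    # Local stand-in mirroring the mode names used in the table above.
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def resolve_cache_mode(
    cache_mode: Optional[CacheMode] = None,
    bypass_cache: bool = False,
    disable_cache: bool = False,
    no_cache_read: bool = False,
    no_cache_write: bool = False,
) -> CacheMode:
    """Translate legacy boolean flags into a single CacheMode.

    The precedence order here is assumed for illustration.
    """
    if cache_mode is not None:
        return cache_mode  # an explicit mode wins over legacy flags
    if disable_cache:
        return CacheMode.DISABLED
    if bypass_cache:
        return CacheMode.BYPASS
    if no_cache_read:
        return CacheMode.WRITE_ONLY  # never read, still write
    if no_cache_write:
        return CacheMode.READ_ONLY  # read, never write
    return CacheMode.ENABLED  # default when nothing is set


print(resolve_cache_mode(no_cache_read=True))
```

New code is probably better off passing `cache_mode` directly, since a single enum value avoids conflicting flag combinations.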
## Page Navigation and Timing Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `wait_until` | str | "domcontentloaded" | The condition to wait for when navigating |
| `page_timeout` | int | 60000 | Timeout in milliseconds for page operations like navigation |
| `wait_for` | str | None | CSS selector or JS condition to wait for before extracting content |
| `wait_for_images` | bool | True | If True, wait for images to load before extracting content |
| `delay_before_return_html` | float | 0.1 | Delay in seconds before retrieving final HTML |
| `mean_delay` | float | 0.1 | Mean base delay between requests when calling arun_many |
| `max_range` | float | 0.3 | Max random additional delay range for requests in arun_many |
| `semaphore_count` | int | 5 | Number of concurrent operations allowed |

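
The timing parameters mix units (milliseconds for `page_timeout`, seconds for the delays), and `mean_delay` combines with `max_range` to produce a jittered pause. The sketch below shows one plausible reading of the table: a base delay plus a uniform random extra in `[0, max_range]`. The helper names and the uniform distribution are assumptions, not the library's confirmed formula.

```python
import random
from typing import Optional


def request_delay(
    mean_delay: float = 0.1,
    max_range: float = 0.3,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a delay in seconds: base delay plus random jitter.

    Assumes jitter is drawn uniformly from [0, max_range].
    """
    source = rng if rng is not None else random
    return mean_delay + source.uniform(0, max_range)


def timeout_seconds(page_timeout: int = 60000) -> float:
    """Convert the millisecond `page_timeout` to seconds."""
    return page_timeout / 1000


# With the defaults, each pause falls between 0.1 s and 0.4 s.
delay = request_delay(rng=random.Random(0))
print(f"{delay:.3f}s delay, {timeout_seconds():.0f}s timeout")
```

Passing a seeded `random.Random` makes the jitter reproducible, which helps when testing crawl schedules.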
## Page Interaction Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `js_code` | str or list[str] | None | JavaScript code/snippets to run on the page |
| `js_only` | bool | False | If True, indicates subsequent calls are JS-driven updates |
| `ignore_body_visibility` | bool | True | If True, ignore whether the body is visible before proceeding |
| `scan_full_page` | bool | False | If True, scroll through the entire page to load all content |
| `scroll_delay` | float | 0.2 | Delay in seconds between scroll steps if scan_full_page is True |
| `process_iframes` | bool | False | If True, attempts to process and inline iframe content |
| `remove_overlay_elements` | bool | False | If True, remove overlays/popups before extracting HTML |
| `simulate_user` | bool | False | If True, simulate user interactions for anti-bot measures |
| `override_navigator` | bool | False | If True, overrides navigator properties for more human-like behavior |
| `magic` | bool | False | If True, attempts automatic handling of overlays/popups |
| `adjust_viewport_to_content` | bool | False | If True, adjust viewport according to page content dimensions |

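
Because `js_code` accepts either a single string or a list of strings, calling code usually wants one uniform shape. A minimal sketch of that normalization follows; `normalize_js_code` is a hypothetical helper, not part of the library.

```python
from typing import List, Optional, Union


def normalize_js_code(js_code: Optional[Union[str, List[str]]]) -> List[str]:
    """Return `js_code` as a list of snippets, in execution order."""
    if js_code is None:
        return []  # nothing to run
    if isinstance(js_code, str):
        return [js_code]  # single snippet becomes a one-item list
    return list(js_code)  # copy so the caller's list is not mutated


snippets = normalize_js_code("window.scrollTo(0, document.body.scrollHeight);")
print(len(snippets))
```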
## Media Handling Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `screenshot` | bool | False | Whether to take a screenshot after crawling |
| `screenshot_wait_for` | float | None | Additional wait time before taking a screenshot |
| `screenshot_height_threshold` | int | 20000 | Threshold for page height to decide screenshot strategy |
| `pdf` | bool | False | Whether to generate a PDF of the page |
| `image_description_min_word_threshold` | int | 50 | Minimum words for image description extraction |
| `image_score_threshold` | int | 3 | Minimum score threshold for processing an image |
| `exclude_external_images` | bool | False | If True, exclude all external images from processing |

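
The two image filters above can be read as a simple predicate: drop images below the score threshold, and optionally drop images hosted on another domain. The sketch below assumes an image is a dict with `src` and `score` keys, that the score comparison is inclusive, and that "external" means a different host than the page. None of this is confirmed by the table; it is only one reasonable interpretation.

```python
from urllib.parse import urlparse


def keep_image(
    image: dict,
    page_url: str,
    image_score_threshold: int = 3,
    exclude_external_images: bool = False,
) -> bool:
    """Decide whether an image should be processed.

    `image` is a hypothetical record with "src" and "score" keys.
    """
    if image["score"] < image_score_threshold:
        return False  # below the minimum score
    if exclude_external_images:
        image_host = urlparse(image["src"]).netloc
        page_host = urlparse(page_url).netloc
        # Relative URLs have no host and are treated as internal.
        if image_host and image_host != page_host:
            return False
    return True


image = {"src": "https://cdn.example.net/hero.png", "score": 5}
print(keep_image(image, "https://example.com/post", exclude_external_images=True))
```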
## Link and Domain Handling Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `exclude_social_media_domains` | list[str] | SOCIAL_MEDIA_DOMAINS | List of domains to exclude for social media links |
| `exclude_external_links` | bool | False | If True, exclude all external links from the results |
| `exclude_social_media_links` | bool | False | If True, exclude links pointing to social media domains |
| `exclude_domains` | list[str] | [] | List of specific domains to exclude from results |

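
How the link filters might interact is easiest to see in code. The following is a sketch under stated assumptions, not the library's logic: `filter_links` is a hypothetical helper, the `SOCIAL_MEDIA_DOMAINS` list here is a short illustrative subset rather than the real constant, and subdomains are assumed to match their parent domain.

```python
from typing import List, Optional
from urllib.parse import urlparse

# Illustrative subset only; the real SOCIAL_MEDIA_DOMAINS constant may differ.
SOCIAL_MEDIA_DOMAINS = ["facebook.com", "twitter.com", "linkedin.com"]


def _matches(host: str, domains: List[str]) -> bool:
    """True if host equals a listed domain or is one of its subdomains."""
    return any(host == d or host.endswith("." + d) for d in domains)


def filter_links(
    links: List[str],
    base_url: str,
    exclude_external_links: bool = False,
    exclude_social_media_links: bool = False,
    exclude_social_media_domains: Optional[List[str]] = None,
    exclude_domains: Optional[List[str]] = None,
) -> List[str]:
    """Apply the link and domain exclusion flags to a list of URLs."""
    social = (
        SOCIAL_MEDIA_DOMAINS
        if exclude_social_media_domains is None
        else exclude_social_media_domains
    )
    blocked = exclude_domains or []
    base_host = urlparse(base_url).netloc.lower()
    kept = []
    for link in links:
        host = urlparse(link).netloc.lower()
        is_external = bool(host) and host != base_host  # relative links are internal
        if exclude_external_links and is_external:
            continue
        if exclude_social_media_links and _matches(host, social):
            continue
        if _matches(host, blocked):
            continue
        kept.append(link)
    return kept


links = [
    "https://example.com/about",
    "https://www.facebook.com/page",
    "https://ads.tracker.io/x",
    "/contact",
]
print(filter_links(links, "https://example.com",
                   exclude_social_media_links=True,
                   exclude_domains=["tracker.io"]))
```

In this sketch `exclude_domains` always applies, while the social media list only takes effect when `exclude_social_media_links` is True.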
## Debugging and Logging Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `verbose` | bool | True | Enable verbose logging |
| `log_console` | bool | False | If True, log console messages from the page |