docs: update browser and crawler run config documentation to match async_configs.py implementation
Updated browser-crawler-config.md and parameters.md to ensure complete accuracy with the actual BrowserConfig and CrawlerRunConfig implementations. Changes: - Removed non-existent parameters from documentation: * enable_rate_limiting, rate_limit_config (never implemented) * memory_threshold_percent, check_interval, max_session_permit (internal to AsyncDispatcher) * display_mode (doesn't exist) - Added missing BrowserConfig parameters (14 total): * browser_mode, use_managed_browser, cdp_url, debugging_port, host * viewport, chrome_channel, channel * accept_downloads, downloads_path, storage_state, sleep_on_close * user_agent_mode, user_agent_generator_config, enable_stealth - Added missing CrawlerRunConfig parameters (29 total): * chunking_strategy, keep_attrs, parser_type, scraping_strategy * proxy_config, proxy_rotation_strategy * locale, timezone_id, geolocation, fetch_ssl_certificate * shared_data, wait_for_timeout * c4a_script, max_scroll_steps * exclude_all_images, table_score_threshold, table_extraction * exclude_internal_links, score_links * capture_network_requests, capture_console_messages * method, stream, url, user_agent, user_agent_mode, user_agent_generator_config * deep_crawl_strategy, link_preview_config, url_matcher, match_mode, experimental - Marked deprecated cache parameters (bypass_cache, disable_cache, no_cache_read, no_cache_write) - Reorganized parameters into logical sections (Content Processing, Browser Location & Identity, Caching & Session, Page Navigation & Timing, Page Interaction, Media Handling, Link/Domain Handling, Debug & Logging, Connection & HTTP, Virtual Scroll, URL Matching, Advanced Features) - Ensured all parameter descriptions match source code docstrings - Added proper default values from __init__ signatures
This commit is contained in:
@@ -17,6 +17,11 @@ class BrowserConfig:
|
||||
def __init__(
|
||||
browser_type="chromium",
|
||||
headless=True,
|
||||
browser_mode="dedicated",
|
||||
use_managed_browser=False,
|
||||
cdp_url=None,
|
||||
debugging_port=9222,
|
||||
host="localhost",
|
||||
proxy_config=None,
|
||||
viewport_width=1080,
|
||||
viewport_height=600,
|
||||
@@ -25,7 +30,13 @@ class BrowserConfig:
|
||||
user_data_dir=None,
|
||||
cookies=None,
|
||||
headers=None,
|
||||
user_agent=None,
|
||||
user_agent=(
|
||||
# "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 "
|
||||
# "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
|
||||
# "(KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
|
||||
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36"
|
||||
),
|
||||
user_agent_mode="",
|
||||
text_mode=False,
|
||||
light_mode=False,
|
||||
extra_args=None,
|
||||
@@ -37,17 +48,33 @@ class BrowserConfig:
|
||||
|
||||
### Key Fields to Note
|
||||
|
||||
1. **`browser_type`**
|
||||
- Options: `"chromium"`, `"firefox"`, or `"webkit"`.
|
||||
- Defaults to `"chromium"`.
|
||||
- If you need a different engine, specify it here.
|
||||
1.⠀**`browser_type`**
|
||||
- Options: `"chromium"`, `"firefox"`, or `"webkit"`.
|
||||
- Defaults to `"chromium"`.
|
||||
- If you need a different engine, specify it here.
|
||||
|
||||
2. **`headless`**
|
||||
2.⠀**`headless`**
|
||||
- `True`: Runs the browser in headless mode (invisible browser).
|
||||
- `False`: Runs the browser in visible mode, which helps with debugging.
|
||||
|
||||
3. **`proxy_config`**
|
||||
- A dictionary with fields like:
|
||||
3.⠀**`browser_mode`**
|
||||
- Determines how the browser should be initialized:
|
||||
- `"dedicated"` (default): Creates a new browser instance each time
|
||||
- `"builtin"`: Uses the builtin CDP browser running in background
|
||||
- `"custom"`: Uses explicit CDP settings provided in `cdp_url`
|
||||
- `"docker"`: Runs browser in Docker container with isolation
|
||||
|
||||
4.⠀**`use_managed_browser`** & **`cdp_url`**
|
||||
- `use_managed_browser=True`: Launch browser using Chrome DevTools Protocol (CDP) for advanced control
|
||||
- `cdp_url`: URL for CDP endpoint (e.g., `"ws://localhost:9222/devtools/browser/"`)
|
||||
- Automatically set based on `browser_mode`
|
||||
|
||||
5.⠀**`debugging_port`** & **`host`**
|
||||
- `debugging_port`: Port for browser debugging protocol (default: 9222)
|
||||
- `host`: Host for browser connection (default: "localhost")
|
||||
|
||||
6.⠀**`proxy_config`**
|
||||
- A `ProxyConfig` object or dictionary with fields like:
|
||||
```json
|
||||
{
|
||||
"server": "http://proxy.example.com:8080",
|
||||
@@ -57,35 +84,35 @@ class BrowserConfig:
|
||||
```
|
||||
- Leave as `None` if a proxy is not required.
|
||||
|
||||
4. **`viewport_width` & `viewport_height`**:
|
||||
7.⠀**`viewport_width` & `viewport_height`**
|
||||
- The initial window size.
|
||||
- Some sites behave differently with smaller or bigger viewports.
|
||||
|
||||
5. **`verbose`**:
|
||||
8.⠀**`verbose`**
|
||||
- If `True`, prints extra logs.
|
||||
- Handy for debugging.
|
||||
|
||||
6. **`use_persistent_context`**:
|
||||
9.⠀**`use_persistent_context`**
|
||||
- If `True`, uses a **persistent** browser profile, storing cookies/local storage across runs.
|
||||
- Typically also set `user_data_dir` to point to a folder.
|
||||
|
||||
7. **`cookies`** & **`headers`**:
|
||||
- If you want to start with specific cookies or add universal HTTP headers, set them here.
|
||||
- E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.
|
||||
10.⠀**`cookies`** & **`headers`**
|
||||
- If you want to start with specific cookies or add universal HTTP headers to the browser context, set them here.
|
||||
- E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.
|
||||
|
||||
8. **`user_agent`**:
|
||||
- Custom User-Agent string. If `None`, a default is used.
|
||||
- You can also set `user_agent_mode="random"` for randomization (if you want to fight bot detection).
|
||||
11.⠀**`user_agent`** & **`user_agent_mode`**
|
||||
- `user_agent`: Custom User-Agent string. If `None`, a default is used.
|
||||
- `user_agent_mode`: Set to `"random"` for randomization (helps fight bot detection).
|
||||
|
||||
9. **`text_mode`** & **`light_mode`**:
|
||||
- `text_mode=True` disables images, possibly speeding up text-only crawls.
|
||||
- `light_mode=True` turns off certain background features for performance.
|
||||
12.⠀**`text_mode`** & **`light_mode`**
|
||||
- `text_mode=True` disables images, possibly speeding up text-only crawls.
|
||||
- `light_mode=True` turns off certain background features for performance.
|
||||
|
||||
10. **`extra_args`**:
|
||||
13.⠀**`extra_args`**
|
||||
- Additional flags for the underlying browser.
|
||||
- E.g. `["--disable-extensions"]`.
|
||||
|
||||
11. **`enable_stealth`**:
|
||||
14.⠀**`enable_stealth`**
|
||||
- If `True`, enables stealth mode using playwright-stealth.
|
||||
- Modifies browser fingerprints to avoid basic bot detection.
|
||||
- Default is `False`. Recommended for sites with bot protection.
|
||||
@@ -134,9 +161,11 @@ class CrawlerRunConfig:
|
||||
def __init__(
|
||||
word_count_threshold=200,
|
||||
extraction_strategy=None,
|
||||
chunking_strategy=RegexChunking(),
|
||||
markdown_generator=None,
|
||||
cache_mode=None,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
js_code=None,
|
||||
c4a_script=None,
|
||||
wait_for=None,
|
||||
screenshot=False,
|
||||
pdf=False,
|
||||
@@ -145,13 +174,18 @@ class CrawlerRunConfig:
|
||||
locale=None, # e.g. "en-US", "fr-FR"
|
||||
timezone_id=None, # e.g. "America/New_York"
|
||||
geolocation=None, # GeolocationConfig object
|
||||
# Resource Management
|
||||
enable_rate_limiting=False,
|
||||
rate_limit_config=None,
|
||||
memory_threshold_percent=70.0,
|
||||
check_interval=1.0,
|
||||
max_session_permit=20,
|
||||
display_mode=None,
|
||||
# Proxy Configuration
|
||||
proxy_config=None,
|
||||
proxy_rotation_strategy=None,
|
||||
# Page Interaction Parameters
|
||||
scan_full_page=False,
|
||||
scroll_delay=0.2,
|
||||
wait_until="domcontentloaded",
|
||||
page_timeout=60000,
|
||||
delay_before_return_html=0.1,
|
||||
# URL Matching Parameters
|
||||
url_matcher=None, # For URL-specific configurations
|
||||
match_mode=MatchMode.OR,
|
||||
verbose=True,
|
||||
stream=False, # Enable streaming for arun_many()
|
||||
# ... other advanced parameters omitted
|
||||
@@ -161,69 +195,68 @@ class CrawlerRunConfig:
|
||||
|
||||
### Key Fields to Note
|
||||
|
||||
1. **`word_count_threshold`**:
|
||||
1.⠀**`word_count_threshold`**:
|
||||
- The minimum word count before a block is considered.
|
||||
- If your site has lots of short paragraphs or items, you can lower it.
|
||||
|
||||
2. **`extraction_strategy`**:
|
||||
2.⠀**`extraction_strategy`**:
|
||||
- Where you plug in JSON-based extraction (CSS, LLM, etc.).
|
||||
- If `None`, no structured extraction is done (only raw/cleaned HTML + markdown).
|
||||
|
||||
3. **`markdown_generator`**:
|
||||
3.⠀**`chunking_strategy`**:
|
||||
- Strategy to chunk content before extraction.
|
||||
- Defaults to `RegexChunking()`. Can be customized for different chunking approaches.
|
||||
|
||||
4.⠀**`markdown_generator`**:
|
||||
- E.g., `DefaultMarkdownGenerator(...)`, controlling how HTML→Markdown conversion is done.
|
||||
- If `None`, a default approach is used.
|
||||
|
||||
4. **`cache_mode`**:
|
||||
5.⠀**`cache_mode`**:
|
||||
- Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
|
||||
- If `None`, defaults to some level of caching or you can specify `CacheMode.ENABLED`.
|
||||
- Defaults to `CacheMode.BYPASS`.
|
||||
|
||||
5. **`js_code`**:
|
||||
- A string or list of JS strings to execute.
|
||||
6.⠀**`js_code`** & **`c4a_script`**:
|
||||
- `js_code`: A string or list of JavaScript strings to execute.
|
||||
- `c4a_script`: C4A script that compiles to JavaScript.
|
||||
- Great for "Load More" buttons or user interactions.
|
||||
|
||||
6. **`wait_for`**:
|
||||
7.⠀**`wait_for`**:
|
||||
- A CSS or JS expression to wait for before extracting content.
|
||||
- Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.
|
||||
|
||||
7. **`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
|
||||
8.⠀**`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
|
||||
- If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.
|
||||
- The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string).
|
||||
|
||||
8. **Location Parameters**:
|
||||
9.⠀**Location Parameters**:
|
||||
- **`locale`**: Browser's locale (e.g., `"en-US"`, `"fr-FR"`) for language preferences
|
||||
- **`timezone_id`**: Browser's timezone (e.g., `"America/New_York"`, `"Europe/Paris"`)
|
||||
- **`geolocation`**: GPS coordinates via `GeolocationConfig(latitude=48.8566, longitude=2.3522)`
|
||||
- See [Identity Based Crawling](../advanced/identity-based-crawling.md#7-locale-timezone-and-geolocation-control)
|
||||
|
||||
9. **`verbose`**:
|
||||
- Logs additional runtime details.
|
||||
- Overlaps with the browser's verbosity if also set to `True` in `BrowserConfig`.
|
||||
10.⠀**Proxy Configuration**:
|
||||
- **`proxy_config`**: Proxy server configuration (ProxyConfig object or dict) e.g. {"server": "...", "username": "...", "password"}
|
||||
- **`proxy_rotation_strategy`**: Strategy for rotating proxies during crawls
|
||||
|
||||
10. **`enable_rate_limiting`**:
|
||||
- If `True`, enables rate limiting for batch processing.
|
||||
- Requires `rate_limit_config` to be set.
|
||||
11.⠀**Page Interaction Parameters**:
|
||||
- **`scan_full_page`**: If `True`, scroll through the entire page to load all content
|
||||
- **`wait_until`**: Condition to wait for when navigating (e.g., "domcontentloaded", "networkidle")
|
||||
- **`page_timeout`**: Timeout in milliseconds for page operations (default: 60000)
|
||||
- **`delay_before_return_html`**: Delay in seconds before retrieving final HTML.
|
||||
|
||||
11. **`memory_threshold_percent`**:
|
||||
- The memory threshold (as a percentage) to monitor.
|
||||
- If exceeded, the crawler will pause or slow down.
|
||||
|
||||
12. **`check_interval`**:
|
||||
- The interval (in seconds) to check system resources.
|
||||
- Affects how often memory and CPU usage are monitored.
|
||||
|
||||
13. **`max_session_permit`**:
|
||||
- The maximum number of concurrent crawl sessions.
|
||||
- Helps prevent overwhelming the system.
|
||||
|
||||
14. **`url_matcher`** & **`match_mode`**:
|
||||
12.⠀**`url_matcher`** & **`match_mode`**:
|
||||
- Enable URL-specific configurations when used with `arun_many()`.
|
||||
- Set `url_matcher` to patterns (glob, function, or list) to match specific URLs.
|
||||
- Use `match_mode` (OR/AND) to control how multiple patterns combine.
|
||||
- See [URL-Specific Configurations](../api/arun_many.md#url-specific-configurations) for examples.
|
||||
|
||||
15. **`display_mode`**:
|
||||
- The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
|
||||
- Affects how much information is printed during the crawl.
|
||||
13.⠀**`verbose`**:
|
||||
- Logs additional runtime details.
|
||||
- Overlaps with the browser's verbosity if also set to `True` in `BrowserConfig`.
|
||||
|
||||
14.⠀**`stream`**:
|
||||
- If `True`, enables streaming mode for `arun_many()` to process URLs as they complete.
|
||||
- Allows handling results incrementally instead of waiting for all URLs to finish.
|
||||
|
||||
|
||||
### Helper Methods
|
||||
@@ -263,16 +296,16 @@ The `clone()` method:
|
||||
|
||||
### Key fields to note
|
||||
|
||||
1. **`provider`**:
|
||||
1.⠀**`provider`**:
|
||||
- Which LLM provider to use.
|
||||
- Possible values are `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)*
|
||||
|
||||
2. **`api_token`**:
|
||||
2.⠀**`api_token`**:
|
||||
- Optional. When not provided explicitly, api_token will be read from environment variables based on provider. For example: If a gemini model is passed as provider then,`"GEMINI_API_KEY"` will be read from environment variables
|
||||
- API token of LLM provider <br/> eg: `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"`
|
||||
- Environment variable - use with prefix "env:" <br/> eg:`api_token = "env: GROQ_API_KEY"`
|
||||
|
||||
3. **`base_url`**:
|
||||
3.⠀**`base_url`**:
|
||||
- If your provider has a custom endpoint
|
||||
|
||||
```python
|
||||
|
||||
Reference in New Issue
Block a user