diff --git a/CHANGELOG.md b/CHANGELOG.md index fea79456..b50e4eef 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -19,6 +19,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - Added intelligent context search with query filtering - Added syntax-aware code function chunking - Implemented efficient HTML processing pipeline +- Added support for controlling browser geolocation via new GeolocationConfig class + - Added locale and timezone configuration options to CrawlerRunConfig + - Added example script demonstrating geolocation and locale usage + - Added documentation for location-based identity features ### [Refactor] 2025-04-20 - Replaced crawler_manager.py with simpler crawler_pool.py implementation diff --git a/docs/md_v2/core/browser-crawler-config.md b/docs/md_v2/core/browser-crawler-config.md index 0dc846a7..b8817c6f 100644 --- a/docs/md_v2/core/browser-crawler-config.md +++ b/docs/md_v2/core/browser-crawler-config.md @@ -1,9 +1,9 @@ # Browser, Crawler & LLM Configuration (Quick Overview) -Crawl4AI’s flexibility stems from two key classes: +Crawl4AI's flexibility stems from two key classes: -1. **`BrowserConfig`** – Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent). -2. **`CrawlerRunConfig`** – Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.). +1. **`BrowserConfig`** – Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent). +2. **`CrawlerRunConfig`** – Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.). 3. **`LLMConfig`** - Dictates **how** LLM providers are configured. (model, api token, base url, temperature etc.) In most examples, you create **one** `BrowserConfig` for the entire crawler session, then pass a **fresh** or re-used `CrawlerRunConfig` whenever you call `arun()`. 
This tutorial shows the most commonly used parameters. If you need advanced or rarely used fields, see the [Configuration Parameters](../api/parameters.md). @@ -38,16 +38,16 @@ class BrowserConfig: -1. **`browser_type`** +1. **`browser_type`** - Options: `"chromium"`, `"firefox"`, or `"webkit"`. - Defaults to `"chromium"`. - If you need a different engine, specify it here. -2. **`headless`** +2. **`headless`** - `True`: Runs the browser in headless mode (invisible browser). - `False`: Runs the browser in visible mode, which helps with debugging. -3. **`proxy_config`** +3. **`proxy_config`** - A dictionary with fields like: ```json { @@ -58,31 +58,31 @@ class BrowserConfig: ``` - Leave as `None` if a proxy is not required. -4. **`viewport_width` & `viewport_height`**: +4. **`viewport_width` & `viewport_height`**: - The initial window size. - Some sites behave differently with smaller or bigger viewports. -5. **`verbose`**: +5. **`verbose`**: - If `True`, prints extra logs. - Handy for debugging. -6. **`use_persistent_context`**: +6. **`use_persistent_context`**: - If `True`, uses a **persistent** browser profile, storing cookies/local storage across runs. - Typically also set `user_data_dir` to point to a folder. -7. **`cookies`** & **`headers`**: +7. **`cookies`** & **`headers`**: - If you want to start with specific cookies or add universal HTTP headers, set them here. - E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`. -8. **`user_agent`**: +8. **`user_agent`**: - Custom User-Agent string. If `None`, a default is used. - You can also set `user_agent_mode="random"` for randomization (if you want to fight bot detection). -9. **`text_mode`** & **`light_mode`**: +9. **`text_mode`** & **`light_mode`**: - `text_mode=True` disables images, possibly speeding up text-only crawls. - `light_mode=True` turns off certain background features for performance. -10. **`extra_args`**: +10. 
**`extra_args`**: - Additional flags for the underlying browser. - E.g. `["--disable-extensions"]`. @@ -157,54 +157,61 @@ class CrawlerRunConfig: ### Key Fields to Note -1. **`word_count_threshold`**: +1. **`word_count_threshold`**: - The minimum word count before a block is considered. - If your site has lots of short paragraphs or items, you can lower it. -2. **`extraction_strategy`**: +2. **`extraction_strategy`**: - Where you plug in JSON-based extraction (CSS, LLM, etc.). - If `None`, no structured extraction is done (only raw/cleaned HTML + markdown). -3. **`markdown_generator`**: +3. **`markdown_generator`**: - E.g., `DefaultMarkdownGenerator(...)`, controlling how HTML→Markdown conversion is done. - If `None`, a default approach is used. -4. **`cache_mode`**: +4. **`cache_mode`**: - Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.). - If `None`, defaults to some level of caching or you can specify `CacheMode.ENABLED`. -5. **`js_code`**: +5. **`js_code`**: - A string or list of JS strings to execute. - - Great for “Load More” buttons or user interactions. + - Great for "Load More" buttons or user interactions. -6. **`wait_for`**: +6. **`wait_for`**: - A CSS or JS expression to wait for before extracting content. - Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`. 7. **`screenshot`**, **`pdf`**, & **`capture_mhtml`**: - If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded. - The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string). -8. **`verbose`**: - - Logs additional runtime details. - - Overlaps with the browser’s verbosity if also set to `True` in `BrowserConfig`. -9. **`enable_rate_limiting`**: +8. 
**Location Parameters**: + - **`locale`**: Browser's locale (e.g., `"en-US"`, `"fr-FR"`) for language preferences + - **`timezone_id`**: Browser's timezone (e.g., `"America/New_York"`, `"Europe/Paris"`) + - **`geolocation`**: GPS coordinates via `GeolocationConfig(latitude=48.8566, longitude=2.3522)` + - See [Identity Based Crawling](../advanced/identity-based-crawling.md#7-locale-timezone-and-geolocation-control) + +9. **`verbose`**: + - Logs additional runtime details. + - Overlaps with the browser's verbosity if also set to `True` in `BrowserConfig`. + +10. **`enable_rate_limiting`**: - If `True`, enables rate limiting for batch processing. - Requires `rate_limit_config` to be set. -10. **`memory_threshold_percent`**: +11. **`memory_threshold_percent`**: - The memory threshold (as a percentage) to monitor. - If exceeded, the crawler will pause or slow down. -11. **`check_interval`**: +12. **`check_interval`**: - The interval (in seconds) to check system resources. - Affects how often memory and CPU usage are monitored. -12. **`max_session_permit`**: +13. **`max_session_permit`**: - The maximum number of concurrent crawl sessions. - Helps prevent overwhelming the system. -13. **`display_mode`**: +14. **`display_mode`**: - The display mode for progress information (`DETAILED`, `BRIEF`, etc.). - Affects how much information is printed during the crawl. @@ -248,16 +255,16 @@ The `clone()` method: ### Key fields to note -1. **`provider`**: +1. **`provider`**: - Which LLM provider to use.
- Possible values are `"ollama/llama3", "groq/llama3-70b-8192", "groq/llama3-8b-8192", "openai/gpt-4o-mini", "openai/gpt-4o", "openai/o1-mini", "openai/o1-preview", "openai/o3-mini", "openai/o3-mini-high", "anthropic/claude-3-haiku-20240307", "anthropic/claude-3-opus-20240229", "anthropic/claude-3-sonnet-20240229", "anthropic/claude-3-5-sonnet-20240620", "gemini/gemini-pro", "gemini/gemini-1.5-pro", "gemini/gemini-2.0-flash", "gemini/gemini-2.0-flash-exp", "gemini/gemini-2.0-flash-lite-preview-02-05", "deepseek/deepseek-chat"`
*(default: `"openai/gpt-4o-mini"`)* -2. **`api_token`**: +2. **`api_token`**: - Optional. If not provided explicitly, the API token is read from an environment variable chosen by the provider; for example, if a Gemini model is passed as the provider, `"GEMINI_API_KEY"` is read from the environment. - API token of the LLM provider
eg: `api_token = "your-api-token"` - Environment variable - use with prefix "env:"
eg: `api_token = "env: GROQ_API_KEY"` -3. **`base_url`**: +3. **`base_url`**: - If your provider has a custom endpoint ```python @@ -266,7 +273,7 @@ llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENA ## 4. Putting It All Together -In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LLMConfig` depending on each call’s needs: +In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LLMConfig` depending on each call's needs: ```python import asyncio