Update changelog

2025-04-21 23:21:49 +08:00
parent b5c25731e6
commit 0007aea204
2 changed files with 43 additions and 32 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -19,6 +19,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
  - Added intelligent context search with query filtering
  - Added syntax-aware code function chunking
  - Implemented efficient HTML processing pipeline
+- Added support for controlling browser geolocation via new GeolocationConfig class
+  - Added locale and timezone configuration options to CrawlerRunConfig
+  - Added example script demonstrating geolocation and locale usage
+  - Added documentation for location-based identity features

 ### [Refactor] 2025-04-20
 - Replaced crawler_manager.py with simpler crawler_pool.py implementation
--- a/docs/md_v2/core/browser-crawler-config.md
+++ b/docs/md_v2/core/browser-crawler-config.md
@@ -1,9 +1,9 @@
 # Browser, Crawler & LLM Configuration (Quick Overview)

-Crawl4AI’s flexibility stems from two key classes:
+Crawl4AI's flexibility stems from two key classes:

-1. **`BrowserConfig`** – Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent).  
-2. **`CrawlerRunConfig`** – Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.).  
+1. **`BrowserConfig`** – Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent).  
+2. **`CrawlerRunConfig`** – Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.).  
 3. **`LLMConfig`** - Dictates **how** LLM providers are configured. (model, api token, base url, temperature etc.)

 In most examples, you create **one** `BrowserConfig` for the entire crawler session, then pass a **fresh** or re-used `CrawlerRunConfig` whenever you call `arun()`. This tutorial shows the most commonly used parameters. If you need advanced or rarely used fields, see the [Configuration Parameters](../api/parameters.md).
@@ -38,16 +38,16 @@ class BrowserConfig:



-1. **`browser_type`**  
+1. **`browser_type`**  
 - Options: `"chromium"`, `"firefox"`, or `"webkit"`.  
 - Defaults to `"chromium"`.  
 - If you need a different engine, specify it here.

-2. **`headless`**  
+2. **`headless`**  
   - `True`: Runs the browser in headless mode (invisible browser).  
   - `False`: Runs the browser in visible mode, which helps with debugging.

-3. **`proxy_config`**  
+3. **`proxy_config`**  
   - A dictionary with fields like:  
 ```json
 {
@@ -58,31 +58,31 @@ class BrowserConfig:
 ```
   - Leave as `None` if a proxy is not required.

-4. **`viewport_width` & `viewport_height`**:  
+4. **`viewport_width` & `viewport_height`**:  
   - The initial window size.  
   - Some sites behave differently with smaller or bigger viewports.

-5. **`verbose`**:  
+5. **`verbose`**:  
   - If `True`, prints extra logs.  
   - Handy for debugging.

-6. **`use_persistent_context`**:  
+6. **`use_persistent_context`**:  
   - If `True`, uses a **persistent** browser profile, storing cookies/local storage across runs.  
   - Typically also set `user_data_dir` to point to a folder.

-7. **`cookies`** & **`headers`**:  
+7. **`cookies`** & **`headers`**:  
   - If you want to start with specific cookies or add universal HTTP headers, set them here.  
   - E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.

-8. **`user_agent`**:  
+8. **`user_agent`**:  
   - Custom User-Agent string. If `None`, a default is used.  
   - You can also set `user_agent_mode="random"` for randomization (if you want to fight bot detection).

-9. **`text_mode`** & **`light_mode`**:  
+9. **`text_mode`** & **`light_mode`**:  
   - `text_mode=True` disables images, possibly speeding up text-only crawls.  
   - `light_mode=True` turns off certain background features for performance.  

-10. **`extra_args`**:  
+10. **`extra_args`**:  
    - Additional flags for the underlying browser.  
    - E.g. `["--disable-extensions"]`.

@@ -157,54 +157,61 @@ class CrawlerRunConfig:

 ### Key Fields to Note

-1. **`word_count_threshold`**:  
+1. **`word_count_threshold`**:  
   - The minimum word count before a block is considered.  
   - If your site has lots of short paragraphs or items, you can lower it.

-2. **`extraction_strategy`**:  
+2. **`extraction_strategy`**:  
   - Where you plug in JSON-based extraction (CSS, LLM, etc.).  
   - If `None`, no structured extraction is done (only raw/cleaned HTML + markdown).

-3. **`markdown_generator`**:  
+3. **`markdown_generator`**:  
   - E.g., `DefaultMarkdownGenerator(...)`, controlling how HTML→Markdown conversion is done.  
   - If `None`, a default approach is used.

-4. **`cache_mode`**:  
+4. **`cache_mode`**:  
   - Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).  
   - If `None`, defaults to some level of caching or you can specify `CacheMode.ENABLED`.

-5. **`js_code`**:  
+5. **`js_code`**:  
   - A string or list of JS strings to execute.  
-   - Great for “Load More” buttons or user interactions.  
+   - Great for "Load More" buttons or user interactions.  

-6. **`wait_for`**:  
+6. **`wait_for`**:  
   - A CSS or JS expression to wait for before extracting content.  
   - Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.

 7. **`screenshot`**, **`pdf`**, & **`capture_mhtml`**:  
   - If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.  
   - The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string).
-8. **`verbose`**:  
-   - Logs additional runtime details.  
-   - Overlaps with the browser’s verbosity if also set to `True` in `BrowserConfig`.

-9. **`enable_rate_limiting`**:  
+8. **Location Parameters**:  
+   - **`locale`**: Browser's locale (e.g., `"en-US"`, `"fr-FR"`) for language preferences
+   - **`timezone_id`**: Browser's timezone (e.g., `"America/New_York"`, `"Europe/Paris"`)
+   - **`geolocation`**: GPS coordinates via `GeolocationConfig(latitude=48.8566, longitude=2.3522)`
+   - See [Identity Based Crawling](../advanced/identity-based-crawling.md#7-locale-timezone-and-geolocation-control)
+
+9. **`verbose`**:  
+   - Logs additional runtime details.  
+   - Overlaps with the browser's verbosity if also set to `True` in `BrowserConfig`.
+
+10. **`enable_rate_limiting`**:  
   - If `True`, enables rate limiting for batch processing.  
   - Requires `rate_limit_config` to be set.

-10. **`memory_threshold_percent`**:  
+11. **`memory_threshold_percent`**:  
    - The memory threshold (as a percentage) to monitor.  
    - If exceeded, the crawler will pause or slow down.

-11. **`check_interval`**:  
+12. **`check_interval`**:  
    - The interval (in seconds) to check system resources.  
    - Affects how often memory and CPU usage are monitored.

-12. **`max_session_permit`**:  
+13. **`max_session_permit`**:  
    - The maximum number of concurrent crawl sessions.  
    - Helps prevent overwhelming the system.

-13. **`display_mode`**:  
+14. **`display_mode`**:  
    - The display mode for progress information (`DETAILED`, `BRIEF`, etc.).  
    - Affects how much information is printed during the crawl.

@@ -248,16 +255,16 @@ The `clone()` method:

 ### Key fields to note

-1. **`provider`**:  
+1. **`provider`**:  
 - Which LLM provoder to use. 
 - Possible values are `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)*

-2. **`api_token`**:  
+2. **`api_token`**:  
    - Optional. When not provided explicitly, api_token will be read from environment variables based on provider. For example: If a gemini model is passed as provider then,`"GEMINI_API_KEY"` will be read from environment variables  
    - API token of LLM provider <br/> eg: `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"`
    - Environment variable - use with prefix "env:" <br/> eg:`api_token = "env: GROQ_API_KEY"`            

-3. **`base_url`**:  
+3. **`base_url`**:  
   - If your provider has a custom endpoint

 ```python
@@ -266,7 +273,7 @@ llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENA

 ## 4. Putting It All Together

-In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LLMConfig` depending on each call’s needs:
+In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LLMConfig` depending on each call's needs:

 ```python
 import asyncio