refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring

Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.
UncleCode
2025-01-11 21:10:27 +08:00
parent 3865342c93
commit 825c78a048
19 changed files with 1742 additions and 484 deletions


@@ -1,7 +1,3 @@
Below is a **revised parameter guide** for **`arun()`** in **AsyncWebCrawler**, reflecting the **new** approach where all parameters are passed via a **`CrawlerRunConfig`** instead of directly to `arun()`. Each section includes example usage in the new style, ensuring a clear, modern approach.
---
# `arun()` Parameter Guide (New Approach)
In Crawl4AI's **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:

docs/md_v2/api/arun_many.md Normal file

@@ -0,0 +1,100 @@
# `arun_many(...)` Reference
> **Note**: This function is very similar to [`arun()`](./arun.md) but focused on **concurrent** or **batch** crawling. If you're unfamiliar with `arun()` usage, please read that doc first, then review this for differences.
## Function Signature
```python
async def arun_many(
    urls: Union[List[str], List[Any]],
    config: Optional[CrawlerRunConfig] = None,
    dispatcher: Optional[BaseDispatcher] = None,
    ...
) -> List[CrawlResult]:
    """
    Crawl multiple URLs concurrently or in batches.

    :param urls: A list of URLs (or tasks) to crawl.
    :param config: (Optional) A default `CrawlerRunConfig` applying to each crawl.
    :param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
    ...
    :return: A list of `CrawlResult` objects, one per URL.
    """
```
## Differences from `arun()`
1. **Multiple URLs**:
- Instead of crawling a single URL, you pass a list of them (strings or tasks).
- The function returns a **list** of `CrawlResult`, in the same order as `urls`.
2. **Concurrency & Dispatchers**:
- **`dispatcher`** param allows advanced concurrency control.
- If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
- Dispatchers handle concurrency, rate limiting, and memory-based adaptive throttling (see [Multi-URL Crawling](../advanced/multi-url-crawling.md)).
3. **Parallel Execution**:
- `arun_many()` can run multiple requests concurrently under the hood.
- Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).
### Basic Example
```python
# Minimal usage: The default dispatcher will be used
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=my_run_config
)
for res in results:
    if res.success:
        print(res.url, "crawled OK!")
    else:
        print("Failed:", res.url, "-", res.error_message)
```
### With a Custom Dispatcher
```python
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    max_session_permit=10
)
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=my_run_config,
    dispatcher=dispatcher
)
```
**Key Points**:
- Each URL is processed by the same or separate sessions, depending on the dispatcher's strategy.
- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.
- If you need to handle authentication or session IDs, pass them in each individual task or within your run config.
### Return Value
A **list** of [`CrawlResult`](./crawl-result.md) objects, one per URL. You can iterate to check `result.success` or read each item's `extracted_content`, `markdown`, or `dispatch_result`.
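As a minimal sketch of post-processing that list (assuming `results` came from the earlier `arun_many()` calls and that a dispatcher populated `dispatch_result`):
```python
for result in results:
    if not result.success:
        print(f"{result.url} failed: {result.error_message}")
        continue
    # dispatch_result is only populated when a dispatcher ran the crawl
    if result.dispatch_result:
        dr = result.dispatch_result
        print(f"{result.url}: took {dr.end_time - dr.start_time}, peak {dr.peak_memory:.1f} MB")
```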
---
## Dispatcher Reference
- **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage.
- **`SemaphoreDispatcher`**: Fixed concurrency limit, simpler but less adaptive.
For advanced usage or custom settings, see [Multi-URL Crawling with Dispatchers](../advanced/multi-url-crawling.md).
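As a rough sketch of picking between the two (the import path follows the `crawl4ai.dispatcher` module used elsewhere in these docs, and the `SemaphoreDispatcher` constructor argument is an assumption; check the dispatcher docs for the exact signature):
```python
from crawl4ai.dispatcher import MemoryAdaptiveDispatcher, SemaphoreDispatcher

# Adaptive: throttles new crawls when system memory runs high
adaptive = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    max_session_permit=10
)

# Fixed: a plain concurrency cap (parameter name assumed here)
fixed = SemaphoreDispatcher(semaphore_count=5)

# Either one plugs into arun_many() the same way (inside an async context)
results = await crawler.arun_many(urls=my_urls, config=my_run_config, dispatcher=fixed)
```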
---
## Common Pitfalls
1. **Large Lists**: If you pass thousands of URLs, be mindful of memory and rate limits. A dispatcher can help, as can batching (see the sketch after this list).
2. **Session Reuse**: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly.
3. **Error Handling**: Each `CrawlResult` might fail for different reasons—always check `result.success` or the `error_message` before proceeding.
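For the first pitfall, a simple mitigation is to feed very large lists to `arun_many()` in batches; a minimal sketch (the batch size and helper name are arbitrary):
```python
async def crawl_in_batches(crawler, urls, config, batch_size=50):
    """Crawl a large URL list in fixed-size batches to bound memory use."""
    all_results = []
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        results = await crawler.arun_many(urls=batch, config=config)
        all_results.extend(results)
        # Surface failures per batch instead of only at the very end
        for r in results:
            if not r.success:
                print(f"Failed: {r.url} - {r.error_message}")
    return all_results
```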
---
## Conclusion
Use `arun_many()` when you want to **crawl multiple URLs** simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate-limiting), provide a **dispatcher**. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) docs.


@@ -130,51 +130,88 @@ For **backward** compatibility, `arun()` can still accept direct arguments like
---
## 4. Helper Methods
### 4.1 `arun_many()`
## 4. Batch Processing: `arun_many()`
```python
async def arun_many(
    self,
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters maintained for backwards compatibility...
) -> List[CrawlResult]:
    """
    Process multiple URLs with intelligent rate limiting and resource monitoring.
    """
```
Crawls multiple URLs concurrently, accepting the same style of `CrawlerRunConfig`.
### 4.1 Resource-Aware Crawling
The `arun_many()` method now uses an intelligent dispatcher that:
- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
- Manages concurrent crawls efficiently
### 4.2 Example Usage
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, RateLimitConfig
from crawl4ai.dispatcher import DisplayMode

# Configure browser
browser_cfg = BrowserConfig(headless=True)

# Configure crawler with rate limiting
run_cfg = CrawlerRunConfig(
    # Enable rate limiting
    enable_rate_limiting=True,
    rate_limit_config=RateLimitConfig(
        base_delay=(1.0, 2.0),        # Random delay between 1-2 seconds
        max_delay=30.0,               # Maximum delay after rate limit hits
        max_retries=2,                # Number of retries before giving up
        rate_limit_codes=[429, 503]   # Status codes that trigger rate limiting
    ),
    # Resource monitoring
    memory_threshold_percent=70.0,    # Pause if memory exceeds this
    check_interval=0.5,               # How often to check resources
    max_session_permit=3,             # Maximum concurrent crawls
    display_mode=DisplayMode.DETAILED.value  # Show detailed progress
)

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(urls, config=run_cfg)
    for result in results:
        print(f"URL: {result.url}, Success: {result.success}")
```
### 4.2 `start()` & `close()`
Allows manual lifecycle usage instead of the context manager:
```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

# Perform multiple operations
resultA = await crawler.arun("https://exampleA.com", config=run_cfg)
resultB = await crawler.arun("https://exampleB.com", config=run_cfg)

await crawler.close()
```
### 4.3 Key Features
1. **Rate Limiting**
   - Automatic delay between requests
   - Exponential backoff on rate limit detection
   - Domain-specific rate limiting
   - Configurable retry strategy
2. **Resource Monitoring**
   - Memory usage tracking
   - Adaptive concurrency based on system load
   - Automatic pausing when resources are constrained
3. **Progress Monitoring**
   - Detailed or aggregated progress display
   - Real-time status updates
   - Memory usage statistics
4. **Error Handling**
   - Graceful handling of rate limits
   - Automatic retries with backoff
   - Detailed error reporting
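A small sketch of acting on the error-handling behavior above, separating rate-limited failures from other errors (field names follow the `CrawlResult` reference; the 429/503 codes mirror the default `rate_limit_codes`):
```python
rate_limited, failed = [], []
for result in results:
    if result.success:
        continue
    # status_code may be None if the request never completed
    if result.status_code in (429, 503):
        rate_limited.append(result)   # still throttled after retries/backoff
    else:
        failed.append(result)

print(f"rate-limited: {len(rate_limited)}, other failures: {len(failed)}")
for r in failed:
    print(r.url, "-", r.error_message)
```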
---


@@ -26,6 +26,7 @@ class CrawlResult(BaseModel):
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None
    dispatch_result: Optional[DispatchResult] = None
    ...
```
@@ -262,7 +263,31 @@ if result.metadata:
---
## 6. Example: Accessing Everything
## 6. `dispatch_result` (optional)
A `DispatchResult` object providing additional concurrency and resource usage information when crawling URLs in parallel (e.g., via `arun_many()` with custom dispatchers). It contains:
- **`task_id`**: A unique identifier for the parallel task.
- **`memory_usage`** (float): The memory (in MB) used at the time of completion.
- **`peak_memory`** (float): The peak memory usage (in MB) recorded during the task's execution.
- **`start_time`** / **`end_time`** (datetime): Time range for this crawling task.
- **`error_message`** (str): Any dispatcher- or concurrency-related error encountered.
```python
# Example usage:
for result in results:
    if result.success and result.dispatch_result:
        dr = result.dispatch_result
        print(f"URL: {result.url}, Task ID: {dr.task_id}")
        print(f"Memory: {dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)")
        print(f"Duration: {dr.end_time - dr.start_time}")
```
> **Note**: This field is typically populated when using `arun_many(...)` alongside a **dispatcher** (e.g., `MemoryAdaptiveDispatcher` or `SemaphoreDispatcher`). If no concurrency or dispatcher is used, `dispatch_result` may remain `None`.
---
## 7. Example: Accessing Everything
```python
async def handle_result(result: CrawlResult):
@@ -306,7 +331,7 @@ async def handle_result(result: CrawlResult):
---
## 7. Key Points & Future
## 8. Key Points & Future
1. **`markdown_v2` vs `markdown`**
- Right now, `markdown_v2` is the more robust container (`MarkdownGenerationResult`), providing **raw_markdown**, **markdown_with_citations**, references, plus possible **fit_markdown**.
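For instance, a brief sketch of reading those fields (hedged: `fit_markdown` is only present when a content filter produced it):
```python
md = result.markdown_v2  # a MarkdownGenerationResult
if md:
    print(len(md.raw_markdown), "chars of raw markdown")
    print(md.markdown_with_citations[:200])
    if md.fit_markdown:
        print("fit_markdown:", len(md.fit_markdown), "chars")
```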


@@ -157,7 +157,32 @@ Use these for link-level content filtering (often to keep crawls “internal”
---
### G) **Debug & Logging**
### G) **Rate Limiting & Resource Management**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| **`enable_rate_limiting`** | `bool` (default: `False`) | Enable intelligent rate limiting for multiple URLs |
| **`rate_limit_config`** | `RateLimitConfig` (default: `None`) | Configuration for rate limiting behavior |
The `RateLimitConfig` class has these fields:
| **Field** | **Type / Default** | **What It Does** |
|--------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| **`base_delay`** | `Tuple[float, float]` (1.0, 3.0) | Random delay range between requests to the same domain |
| **`max_delay`** | `float` (60.0) | Maximum delay after rate limit detection |
| **`max_retries`** | `int` (3) | Number of retries before giving up on rate-limited requests |
| **`rate_limit_codes`** | `List[int]` ([429, 503]) | HTTP status codes that trigger rate limiting behavior |
| **Parameter** | **Type / Default** | **What It Does** |
|-------------------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| **`memory_threshold_percent`** | `float` (70.0) | Maximum memory usage before pausing new crawls |
| **`check_interval`** | `float` (1.0) | How often to check system resources (in seconds) |
| **`max_session_permit`** | `int` (20) | Maximum number of concurrent crawl sessions |
| **`display_mode`** | `str` (`None`, "DETAILED", "AGGREGATED") | How to display progress information |
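Taken together, a minimal sketch wiring these fields into a `CrawlerRunConfig` (values simply echo the defaults listed above):
```python
from crawl4ai import CrawlerRunConfig, RateLimitConfig

run_cfg = CrawlerRunConfig(
    enable_rate_limiting=True,
    rate_limit_config=RateLimitConfig(
        base_delay=(1.0, 3.0),      # random per-domain delay range
        max_delay=60.0,             # cap after repeated rate-limit hits
        max_retries=3,
        rate_limit_codes=[429, 503]
    ),
    memory_threshold_percent=70.0,  # pause new crawls above this usage
    check_interval=1.0,             # resource check frequency (seconds)
    max_session_permit=20,
    display_mode="DETAILED"
)
```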
---
### H) **Debug & Logging**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------|--------------------|---------------------------------------------------------------------------|
@@ -170,7 +195,7 @@ Use these for link-level content filtering (often to keep crawls “internal”
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, RateLimitConfig
async def main():
    # Configure the browser
@@ -190,7 +215,18 @@ async def main():
        excluded_tags=["script", "style"],
        exclude_external_links=True,
        wait_for="css:.article-loaded",
        screenshot=True,
        enable_rate_limiting=True,
        rate_limit_config=RateLimitConfig(
            base_delay=(1.0, 3.0),
            max_delay=60.0,
            max_retries=3,
            rate_limit_codes=[429, 503]
        ),
        memory_threshold_percent=70.0,
        check_interval=1.0,
        max_session_permit=20,
        display_mode="DETAILED"
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
@@ -223,4 +259,3 @@ if __name__ == "__main__":
- **Use** `BrowserConfig` for **global** browser settings: engine, headless, proxy, user agent.
- **Use** `CrawlerRunConfig` for each crawl's **context**: how to filter content, handle caching, wait for dynamic elements, or run JS.
- **Pass** both configs to `AsyncWebCrawler` (the `BrowserConfig`) and then to `arun()` (the `CrawlerRunConfig`), as in the sketch below.
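A minimal end-to-end sketch of that flow (the URL and `wait_for` selector are placeholders):
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)                   # global browser settings
    run_cfg = CrawlerRunConfig(wait_for="css:.article-loaded")   # per-crawl context

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun("https://example.com", config=run_cfg)
        print(result.success, result.status_code)

if __name__ == "__main__":
    asyncio.run(main())
```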