refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring

Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.
2025-01-11 21:10:27 +08:00
parent 3865342c93
commit 825c78a048
19 changed files with 1742 additions and 484 deletions
--- a/docs/md_v2/api/parameters.md
+++ b/docs/md_v2/api/parameters.md
@@ -157,7 +157,32 @@ Use these for link-level content filtering (often to keep crawls “internal”

 ---

-### G) **Debug & Logging**
+### G) **Rate Limiting & Resource Management**
+
+| **Parameter**                | **Type / Default**                     | **What It Does**                                                                                                           |
+|------------------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
+| **`enable_rate_limiting`**  | `bool` (default: `False`)              | Enable intelligent rate limiting for multiple URLs                                                                          |
+| **`rate_limit_config`**     | `RateLimitConfig` (default: `None`)    | Configuration for rate limiting behavior                                                                                   |
+
+The `RateLimitConfig` class has these fields:
+
+| **Field**           | **Type / Default**                     | **What It Does**                                                                                                           |
+|--------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
+| **`base_delay`**   | `Tuple[float, float]` (1.0, 3.0)      | Random delay range between requests to the same domain                                                                      |
+| **`max_delay`**    | `float` (60.0)                        | Maximum delay after rate limit detection                                                                                    |
+| **`max_retries`**  | `int` (3)                             | Number of retries before giving up on rate-limited requests                                                                 |
+| **`rate_limit_codes`** | `List[int]` ([429, 503])          | HTTP status codes that trigger rate limiting behavior                                                                       |
+
+| **Parameter**                  | **Type / Default**                     | **What It Does**                                                                                                           |
+|-------------------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
+| **`memory_threshold_percent`** | `float` (70.0)                        | Maximum memory usage before pausing new crawls                                                                              |
+| **`check_interval`**          | `float` (1.0)                         | How often to check system resources (in seconds)                                                                           |
+| **`max_session_permit`**      | `int` (20)                            | Maximum number of concurrent crawl sessions                                                                                |
+| **`display_mode`**            | `str` (`None`, `"DETAILED"`, `"AGGREGATED"`) | How to display progress information                                                                                 |
+
+---
+
+### H) **Debug & Logging**

 | **Parameter**  | **Type / Default** | **What It Does**                                                         |
 |----------------|--------------------|---------------------------------------------------------------------------|
@@ -170,7 +195,7 @@ Use these for link-level content filtering (often to keep crawls “internal”

 ```python
 import asyncio
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, RateLimitConfig

 async def main():
    # Configure the browser
@@ -190,7 +215,18 @@ async def main():
        excluded_tags=["script", "style"],
        exclude_external_links=True,
        wait_for="css:.article-loaded",
-        screenshot=True
+        screenshot=True,
+        enable_rate_limiting=True,
+        rate_limit_config=RateLimitConfig(
+            base_delay=(1.0, 3.0),
+            max_delay=60.0,
+            max_retries=3,
+            rate_limit_codes=[429, 503]
+        ),
+        memory_threshold_percent=70.0,
+        check_interval=1.0,
+        max_session_permit=20,
+        display_mode="DETAILED"
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
@@ -223,4 +259,3 @@ if __name__ == "__main__":
 - **Use** `BrowserConfig` for **global** browser settings: engine, headless, proxy, user agent.  
 - **Use** `CrawlerRunConfig` for each crawl’s **context**: how to filter content, handle caching, wait for dynamic elements, or run JS.  
 - **Pass** both configs to `AsyncWebCrawler` (the `BrowserConfig`) and then to `arun()` (the `CrawlerRunConfig`).  
-
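
The `RateLimitConfig` fields documented in this diff (`base_delay`, `max_delay`, `max_retries`, `rate_limit_codes`) suggest a retry-with-backoff policy. The sketch below is one possible reading of how such fields could interact, not the library's implementation. `BackoffSettings` and `next_delay` are hypothetical names, and doubling the delay after each rate-limited response is an assumption; the diff only says the delay is capped at `max_delay`.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class BackoffSettings:
    """Hypothetical stand-in mirroring the documented RateLimitConfig fields."""
    base_delay: Tuple[float, float] = (1.0, 3.0)  # random delay range, seconds
    max_delay: float = 60.0                       # cap after rate limit detection
    max_retries: int = 3                          # retries before giving up
    rate_limit_codes: List[int] = field(default_factory=lambda: [429, 503])


def next_delay(
    settings: BackoffSettings,
    status_code: int,
    retries_so_far: int,
    current_delay: Optional[float] = None,
) -> Optional[float]:
    """Return the wait before the next request, or None to give up.

    Assumption: a rate-limited response doubles the previous delay,
    capped at max_delay. Any other response resets to a random delay
    drawn from the base_delay range.
    """
    if status_code not in settings.rate_limit_codes:
        return random.uniform(*settings.base_delay)
    if retries_so_far >= settings.max_retries:
        return None  # retry budget exhausted
    start = current_delay if current_delay is not None else settings.base_delay[1]
    return min(start * 2, settings.max_delay)
```

A caller would sleep for the returned number of seconds before retrying the same domain and treat `None` as a permanent failure for that URL.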