feat(proxy): add proxy rotation strategy

Implements a new proxy rotation system with the following changes:
- Add ProxyRotationStrategy abstract base class
- Add RoundRobinProxyStrategy concrete implementation
- Integrate proxy rotation with AsyncWebCrawler
- Add proxy_rotation_strategy parameter to CrawlerRunConfig
- Add example script demonstrating proxy rotation usage
- Remove deprecated synchronous WebCrawler code
- Clean up rate limiting documentation

BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations
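
For reference, a minimal sketch of what the new API introduced here might look like in use (the import path and the `ProxyConfig` fields are assumptions based on the class names above, not taken from this diff):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
# Import path is an assumption based on the class names in this commit
from crawl4ai.proxy_strategy import ProxyConfig, RoundRobinProxyStrategy

# Hypothetical proxy endpoints; RoundRobinProxyStrategy cycles through them
proxies = [
    ProxyConfig(server="http://proxy1.example.com:8080"),
    ProxyConfig(server="http://proxy2.example.com:8080"),
]

run_cfg = CrawlerRunConfig(proxy_rotation_strategy=RoundRobinProxyStrategy(proxies))

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=run_cfg)
        print(result.success)

asyncio.run(main())
```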
UncleCode
2025-02-09 18:49:10 +08:00
parent b957ff2ecd
commit 19df96ed56
12 changed files with 257 additions and 162 deletions


@@ -160,41 +160,9 @@ The `arun_many()` method now uses an intelligent dispatcher that:
### 4.2 Example Usage
Check page [Multi-url Crawling](../advanced/multi-url-crawling.md) for a detailed example of how to use `arun_many()`.
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, RateLimitConfig
from crawl4ai.dispatcher import DisplayMode

# Configure browser
browser_cfg = BrowserConfig(headless=True)

# Configure crawler with rate limiting
run_cfg = CrawlerRunConfig(
    # Enable rate limiting
    enable_rate_limiting=True,
    rate_limit_config=RateLimitConfig(
        base_delay=(1.0, 2.0),       # Random delay between 1-2 seconds
        max_delay=30.0,              # Maximum delay after rate limit hits
        max_retries=2,               # Number of retries before giving up
        rate_limit_codes=[429, 503]  # Status codes that trigger rate limiting
    ),
    # Resource monitoring
    memory_threshold_percent=70.0,   # Pause if memory exceeds this
    check_interval=0.5,              # How often to check resources
    max_session_permit=3,            # Maximum concurrent crawls
    display_mode=DisplayMode.DETAILED.value  # Show detailed progress
)

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(urls, config=run_cfg)
    for result in results:
        print(f"URL: {result.url}, Success: {result.success}")
```
### 4.3 Key Features


@@ -159,32 +159,7 @@ Use these for link-level content filtering (often to keep crawls “internal”
---
### G) **Rate Limiting & Resource Management**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| **`enable_rate_limiting`** | `bool` (default: `False`) | Enable intelligent rate limiting for multiple URLs |
| **`rate_limit_config`** | `RateLimitConfig` (default: `None`) | Configuration for rate limiting behavior |
The `RateLimitConfig` class has these fields:
| **Field** | **Type / Default** | **What It Does** |
|--------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| **`base_delay`** | `Tuple[float, float]` (1.0, 3.0) | Random delay range between requests to the same domain |
| **`max_delay`** | `float` (60.0) | Maximum delay after rate limit detection |
| **`max_retries`** | `int` (3) | Number of retries before giving up on rate-limited requests |
| **`rate_limit_codes`** | `List[int]` ([429, 503]) | HTTP status codes that trigger rate limiting behavior |

| **Parameter** | **Type / Default** | **What It Does** |
|-------------------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| **`memory_threshold_percent`** | `float` (70.0) | Maximum memory usage before pausing new crawls |
| **`check_interval`** | `float` (1.0) | How often to check system resources (in seconds) |
| **`max_session_permit`** | `int` (20) | Maximum number of concurrent crawl sessions |
| **`display_mode`** | `str` (`None`, "DETAILED", "AGGREGATED") | How to display progress information |
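
Taken together, the defaults in the tables above correspond to a configuration like the following (illustrative only; it simply restates the documented defaults):

```python
from crawl4ai import CrawlerRunConfig, RateLimitConfig

run_cfg = CrawlerRunConfig(
    enable_rate_limiting=True,
    rate_limit_config=RateLimitConfig(
        base_delay=(1.0, 3.0),       # default per-domain delay range
        max_delay=60.0,              # cap after rate-limit detection
        max_retries=3,
        rate_limit_codes=[429, 503]
    ),
    memory_threshold_percent=70.0,   # pause new crawls above 70% memory
    check_interval=1.0,              # poll system resources every second
    max_session_permit=20,           # at most 20 concurrent sessions
    display_mode="DETAILED"
)
```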
---
### H) **Debug & Logging**
### G) **Debug & Logging**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------|--------------------|---------------------------------------------------------------------------|
@@ -218,7 +193,7 @@ The `clone()` method is particularly useful when you need slightly different con
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, RateLimitConfig
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Configure the browser
```
@@ -239,17 +214,6 @@ async def main():
```python
        exclude_external_links=True,
        wait_for="css:.article-loaded",
        screenshot=True,
        enable_rate_limiting=True,
        rate_limit_config=RateLimitConfig(
            base_delay=(1.0, 3.0),
            max_delay=60.0,
            max_retries=3,
            rate_limit_codes=[429, 503]
        ),
        memory_threshold_percent=70.0,
        check_interval=1.0,
        max_session_permit=20,
        display_mode="DETAILED",
        stream=True
    )
```


@@ -186,23 +186,19 @@ class CrawlerRunConfig:
    - If `True`, enables rate limiting for batch processing.
    - Requires `rate_limit_config` to be set.
10. **`rate_limit_config`**:
    - A `RateLimitConfig` object controlling rate limiting behavior.
    - See below for details.
11. **`memory_threshold_percent`**:
10. **`memory_threshold_percent`**:
    - The memory threshold (as a percentage) to monitor.
    - If exceeded, the crawler will pause or slow down.
12. **`check_interval`**:
11. **`check_interval`**:
    - The interval (in seconds) to check system resources.
    - Affects how often memory and CPU usage are monitored.
13. **`max_session_permit`**:
12. **`max_session_permit`**:
    - The maximum number of concurrent crawl sessions.
    - Helps prevent overwhelming the system.
14. **`display_mode`**:
13. **`display_mode`**:
    - The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
    - Affects how much information is printed during the crawl.
@@ -236,58 +232,6 @@ The `clone()` method:
- Leaves the original configuration unchanged
- Perfect for creating variations without repeating all parameters
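For example (a small sketch assuming `clone()` accepts keyword overrides, as the bullets above describe):

```python
from crawl4ai import CrawlerRunConfig

base_cfg = CrawlerRunConfig(
    wait_for="css:.article-loaded",
    screenshot=True
)

# New instance with one field changed; base_cfg itself is untouched
stream_cfg = base_cfg.clone(stream=True)
```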
### Rate Limiting & Resource Management
For batch processing with `arun_many()`, you can enable intelligent rate limiting:
```python
from crawl4ai import CrawlerRunConfig, RateLimitConfig

config = CrawlerRunConfig(
    enable_rate_limiting=True,
    rate_limit_config=RateLimitConfig(
        base_delay=(1.0, 3.0),       # Random delay range
        max_delay=60.0,              # Max delay after rate limits
        max_retries=3,               # Retries before giving up
        rate_limit_codes=[429, 503]  # Status codes to watch
    ),
    memory_threshold_percent=70.0,   # Memory threshold
    check_interval=1.0,              # Resource check interval
    max_session_permit=20,           # Max concurrent crawls
    display_mode="DETAILED"          # Progress display mode
)
```
This configuration:
- Implements intelligent rate limiting per domain
- Monitors system resources
- Provides detailed progress information
- Manages concurrent crawls efficiently
**Minimal Example**:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, RateLimitConfig

crawl_conf = CrawlerRunConfig(
    js_code="document.querySelector('button#loadMore')?.click()",
    wait_for="css:.loaded-content",
    screenshot=True,
    enable_rate_limiting=True,
    rate_limit_config=RateLimitConfig(
        base_delay=(1.0, 3.0),
        max_delay=60.0,
        max_retries=3,
        rate_limit_codes=[429, 503]
    ),
    stream=True  # Enable streaming
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=crawl_conf)
    print(result.screenshot[:100])  # Base64-encoded PNG snippet
```
---
## 3. Putting It All Together
@@ -322,13 +266,6 @@ async def main():
```python
    run_conf = CrawlerRunConfig(
        extraction_strategy=extraction,
        cache_mode=CacheMode.BYPASS,
        enable_rate_limiting=True,
        rate_limit_config=RateLimitConfig(
            base_delay=(1.0, 3.0),
            max_delay=60.0,
            max_retries=3,
            rate_limit_codes=[429, 503]
        )
    )

    async with AsyncWebCrawler(config=browser_conf) as crawler:
```