feat(proxy): add proxy rotation strategy
Implements a new proxy rotation system with the following changes:

- Add ProxyRotationStrategy abstract base class
- Add RoundRobinProxyStrategy concrete implementation
- Integrate proxy rotation with AsyncWebCrawler
- Add proxy_rotation_strategy parameter to CrawlerRunConfig
- Add example script demonstrating proxy rotation usage
- Remove deprecated synchronous WebCrawler code
- Clean up rate limiting documentation

BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations
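The class names above come from the commit message; their internals are not shown in this diff. A minimal sketch of what the pair might look like, assuming a `get_next_proxy()` method and a list-of-dicts proxy format (both illustrative, not crawl4ai's confirmed interface):

```python
from abc import ABC, abstractmethod
from itertools import cycle


class ProxyRotationStrategy(ABC):
    """Abstract base for proxy rotation strategies (sketch only;
    the real crawl4ai interface may differ)."""

    @abstractmethod
    def get_next_proxy(self) -> dict:
        """Return the proxy config to use for the next request."""
        ...


class RoundRobinProxyStrategy(ProxyRotationStrategy):
    """Cycle through a fixed list of proxies in order, wrapping around."""

    def __init__(self, proxies: list[dict]):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = cycle(proxies)  # endless iterator over the list

    def get_next_proxy(self) -> dict:
        return next(self._cycle)


proxies = [{"server": "http://proxy1:8080"}, {"server": "http://proxy2:8080"}]
strategy = RoundRobinProxyStrategy(proxies)
picked = [strategy.get_next_proxy()["server"] for _ in range(3)]
```

Round-robin is the simplest fair policy: with two proxies, the third request wraps back to the first proxy.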
```diff
@@ -186,23 +186,19 @@ class CrawlerRunConfig:
-   - If `True`, enables rate limiting for batch processing.
-   - Requires `rate_limit_config` to be set.
-
-10. **`rate_limit_config`**:
-   - A `RateLimitConfig` object controlling rate limiting behavior.
-   - See below for details.
-
-11. **`memory_threshold_percent`**:
+10. **`memory_threshold_percent`**:
    - The memory threshold (as a percentage) to monitor.
    - If exceeded, the crawler will pause or slow down.
 
-12. **`check_interval`**:
+11. **`check_interval`**:
    - The interval (in seconds) to check system resources.
    - Affects how often memory and CPU usage are monitored.
 
-13. **`max_session_permit`**:
+12. **`max_session_permit`**:
    - The maximum number of concurrent crawl sessions.
    - Helps prevent overwhelming the system.
 
-14. **`display_mode`**:
+13. **`display_mode`**:
    - The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
    - Affects how much information is printed during the crawl.
```
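The `memory_threshold_percent` and `check_interval` parameters above describe a periodic pause-when-over-threshold check. A minimal sketch of that idea, with the memory reader injected as a callable so the example needs no system-specific dependency (function and variable names here are illustrative, not crawl4ai's internals):

```python
from typing import Callable


def should_pause(memory_percent_fn: Callable[[], float],
                 memory_threshold_percent: float = 70.0) -> bool:
    """Return True when current memory usage exceeds the threshold.

    In a real crawler this would run every `check_interval` seconds,
    with memory_percent_fn reading actual system usage (e.g. what
    psutil.virtual_memory().percent would report).
    """
    return memory_percent_fn() > memory_threshold_percent


# Simulated readings: one below the 70% threshold, one above it.
readings = iter([55.0, 72.5])
read = lambda: next(readings)

first = should_pause(read, 70.0)   # reading 55.0 -> keep crawling
second = should_pause(read, 70.0)  # reading 72.5 -> pause / slow down
```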
````diff
@@ -236,58 +232,6 @@ The `clone()` method:
 - Leaves the original configuration unchanged
 - Perfect for creating variations without repeating all parameters
 
-### Rate Limiting & Resource Management
-
-For batch processing with `arun_many()`, you can enable intelligent rate limiting:
-
-```python
-from crawl4ai import RateLimitConfig
-
-config = CrawlerRunConfig(
-    enable_rate_limiting=True,
-    rate_limit_config=RateLimitConfig(
-        base_delay=(1.0, 3.0),       # Random delay range
-        max_delay=60.0,              # Max delay after rate limits
-        max_retries=3,               # Retries before giving up
-        rate_limit_codes=[429, 503]  # Status codes to watch
-    ),
-    memory_threshold_percent=70.0,   # Memory threshold
-    check_interval=1.0,              # Resource check interval
-    max_session_permit=20,           # Max concurrent crawls
-    display_mode="DETAILED"          # Progress display mode
-)
-```
-
-This configuration:
-- Implements intelligent rate limiting per domain
-- Monitors system resources
-- Provides detailed progress information
-- Manages concurrent crawls efficiently
-
-**Minimal Example**:
-
-```python
-from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
-
-crawl_conf = CrawlerRunConfig(
-    js_code="document.querySelector('button#loadMore')?.click()",
-    wait_for="css:.loaded-content",
-    screenshot=True,
-    enable_rate_limiting=True,
-    rate_limit_config=RateLimitConfig(
-        base_delay=(1.0, 3.0),
-        max_delay=60.0,
-        max_retries=3,
-        rate_limit_codes=[429, 503]
-    ),
-    stream=True  # Enable streaming
-)
-
-async with AsyncWebCrawler() as crawler:
-    result = await crawler.arun(url="https://example.com", config=crawl_conf)
-    print(result.screenshot[:100])  # Base64-encoded PNG snippet
-```
 
 ---
 
 ## 3. Putting It All Together
````
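The `RateLimitConfig` fields being removed above (`base_delay`, `max_delay`, `max_retries`, `rate_limit_codes`) describe a conventional backoff-and-retry loop. A rough, standalone sketch of that behavior, assuming exponential doubling of the random base delay (crawl4ai's actual implementation may differ):

```python
import random


def next_delay(attempt: int,
               base_delay: tuple = (1.0, 3.0),
               max_delay: float = 60.0) -> float:
    """Delay before retry `attempt` (0-based): a random base delay,
    doubled per attempt, capped at max_delay."""
    low, high = base_delay
    return min(random.uniform(low, high) * (2 ** attempt), max_delay)


def should_retry(status_code: int, attempt: int,
                 max_retries: int = 3,
                 rate_limit_codes: tuple = (429, 503)) -> bool:
    """Retry only on rate-limit status codes, at most max_retries times."""
    return status_code in rate_limit_codes and attempt < max_retries
```

With these defaults, a 429 response triggers up to three retries with growing delays, while any other status is surfaced immediately.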
```diff
@@ -322,13 +266,6 @@ async def main():
     run_conf = CrawlerRunConfig(
         extraction_strategy=extraction,
         cache_mode=CacheMode.BYPASS,
-        enable_rate_limiting=True,
-        rate_limit_config=RateLimitConfig(
-            base_delay=(1.0, 3.0),
-            max_delay=60.0,
-            max_retries=3,
-            rate_limit_codes=[429, 503]
-        )
     )
 
     async with AsyncWebCrawler(config=browser_conf) as crawler:
```