feat(proxy): add proxy rotation strategy

Implements a new proxy rotation system with the following changes:
- Add ProxyRotationStrategy abstract base class
- Add RoundRobinProxyStrategy concrete implementation
- Integrate proxy rotation with AsyncWebCrawler
- Add proxy_rotation_strategy parameter to CrawlerRunConfig
- Add example script demonstrating proxy rotation usage
- Remove deprecated synchronous WebCrawler code
- Clean up rate limiting documentation

BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations
This commit is contained in:
UncleCode
2025-02-09 18:49:10 +08:00
parent b957ff2ecd
commit 19df96ed56
12 changed files with 257 additions and 162 deletions

View File

@@ -159,32 +159,7 @@ Use these for link-level content filtering (often to keep crawls “internal”
---
### G) **Rate Limiting & Resource Management**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| **`enable_rate_limiting`** | `bool` (default: `False`) | Enable intelligent rate limiting for multiple URLs |
| **`rate_limit_config`** | `RateLimitConfig` (default: `None`) | Configuration for rate limiting behavior |
The `RateLimitConfig` class has these fields:
| **Field** | **Type / Default** | **What It Does** |
|--------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| **`base_delay`** | `Tuple[float, float]` (1.0, 3.0) | Random delay range between requests to the same domain |
| **`max_delay`** | `float` (60.0) | Maximum delay after rate limit detection |
| **`max_retries`** | `int` (3) | Number of retries before giving up on rate-limited requests |
| **`rate_limit_codes`** | `List[int]` ([429, 503]) | HTTP status codes that trigger rate limiting behavior |
| **Parameter** | **Type / Default** | **What It Does** |
|-------------------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| **`memory_threshold_percent`** | `float` (70.0) | Maximum memory usage before pausing new crawls |
| **`check_interval`** | `float` (1.0) | How often to check system resources (in seconds) |
| **`max_session_permit`** | `int` (20) | Maximum number of concurrent crawl sessions |
| **`display_mode`** | `str` (`None`, "DETAILED", "AGGREGATED") | How to display progress information |
---
### H) **Debug & Logging**
### G) **Debug & Logging**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------|--------------------|---------------------------------------------------------------------------|
@@ -218,7 +193,7 @@ The `clone()` method is particularly useful when you need slightly different con
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, RateLimitConfig
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def main():
# Configure the browser
@@ -239,17 +214,6 @@ async def main():
exclude_external_links=True,
wait_for="css:.article-loaded",
screenshot=True,
enable_rate_limiting=True,
rate_limit_config=RateLimitConfig(
base_delay=(1.0, 3.0),
max_delay=60.0,
max_retries=3,
rate_limit_codes=[429, 503]
),
memory_threshold_percent=70.0,
check_interval=1.0,
max_session_permit=20,
display_mode="DETAILED",
stream=True
)