feat(proxy): add proxy rotation strategy
Implements a new proxy rotation system with the following changes:

- Add ProxyRotationStrategy abstract base class
- Add RoundRobinProxyStrategy concrete implementation
- Integrate proxy rotation with AsyncWebCrawler
- Add proxy_rotation_strategy parameter to CrawlerRunConfig
- Add example script demonstrating proxy rotation usage
- Remove deprecated synchronous WebCrawler code
- Clean up rate limiting documentation

BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations
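The class names above come from the commit message; their internals are not shown in this diff. A minimal sketch of what the pair might look like, assuming a `get_next_proxy()` method and a list-of-dicts proxy format (both illustrative, not crawl4ai's confirmed interface):

```python
from abc import ABC, abstractmethod
from itertools import cycle


class ProxyRotationStrategy(ABC):
    """Abstract base for proxy rotation strategies (sketch only;
    the real crawl4ai interface may differ)."""

    @abstractmethod
    def get_next_proxy(self) -> dict:
        """Return the proxy config to use for the next request."""
        ...


class RoundRobinProxyStrategy(ProxyRotationStrategy):
    """Cycle through a fixed list of proxies in order, wrapping around."""

    def __init__(self, proxies: list[dict]):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = cycle(proxies)  # endless iterator over the list

    def get_next_proxy(self) -> dict:
        return next(self._cycle)


proxies = [{"server": "http://proxy1:8080"}, {"server": "http://proxy2:8080"}]
strategy = RoundRobinProxyStrategy(proxies)
picked = [strategy.get_next_proxy()["server"] for _ in range(3)]
```

Round-robin is the simplest fair policy: with two proxies, the third request wraps back to the first proxy.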
```diff
@@ -186,23 +186,19 @@ class CrawlerRunConfig:
-   - If `True`, enables rate limiting for batch processing.
-   - Requires `rate_limit_config` to be set.
-
-10. **`rate_limit_config`**:
-   - A `RateLimitConfig` object controlling rate limiting behavior.
-   - See below for details.
-
-11. **`memory_threshold_percent`**:
+10. **`memory_threshold_percent`**:
    - The memory threshold (as a percentage) to monitor.
    - If exceeded, the crawler will pause or slow down.
 
-12. **`check_interval`**:
+11. **`check_interval`**:
    - The interval (in seconds) to check system resources.
    - Affects how often memory and CPU usage are monitored.
 
-13. **`max_session_permit`**:
+12. **`max_session_permit`**:
    - The maximum number of concurrent crawl sessions.
    - Helps prevent overwhelming the system.
 
-14. **`display_mode`**:
+13. **`display_mode`**:
    - The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
    - Affects how much information is printed during the crawl.
```
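The `memory_threshold_percent` and `check_interval` parameters above describe a periodic pause-when-over-threshold check. A minimal sketch of that idea, with the memory reader injected as a callable so the example needs no system-specific dependency (function and variable names here are illustrative, not crawl4ai's internals):

```python
from typing import Callable


def should_pause(memory_percent_fn: Callable[[], float],
                 memory_threshold_percent: float = 70.0) -> bool:
    """Return True when current memory usage exceeds the threshold.

    In a real crawler this would run every `check_interval` seconds,
    with memory_percent_fn reading actual system usage (e.g. what
    psutil.virtual_memory().percent would report).
    """
    return memory_percent_fn() > memory_threshold_percent


# Simulated readings: one below the 70% threshold, one above it.
readings = iter([55.0, 72.5])
read = lambda: next(readings)

first = should_pause(read, 70.0)   # reading 55.0 -> keep crawling
second = should_pause(read, 70.0)  # reading 72.5 -> pause / slow down
```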
````diff
@@ -236,58 +232,6 @@ The `clone()` method:
 - Leaves the original configuration unchanged
 - Perfect for creating variations without repeating all parameters
 
-### Rate Limiting & Resource Management
-
-For batch processing with `arun_many()`, you can enable intelligent rate limiting:
-
-```python
-from crawl4ai import RateLimitConfig
-
-config = CrawlerRunConfig(
-    enable_rate_limiting=True,
-    rate_limit_config=RateLimitConfig(
-        base_delay=(1.0, 3.0),       # Random delay range
-        max_delay=60.0,              # Max delay after rate limits
-        max_retries=3,               # Retries before giving up
-        rate_limit_codes=[429, 503]  # Status codes to watch
-    ),
-    memory_threshold_percent=70.0,   # Memory threshold
-    check_interval=1.0,              # Resource check interval
-    max_session_permit=20,           # Max concurrent crawls
-    display_mode="DETAILED"          # Progress display mode
-)
-```
-
-This configuration:
-- Implements intelligent rate limiting per domain
-- Monitors system resources
-- Provides detailed progress information
-- Manages concurrent crawls efficiently
-
-**Minimal Example**:
-
-```python
-from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
-
-crawl_conf = CrawlerRunConfig(
-    js_code="document.querySelector('button#loadMore')?.click()",
-    wait_for="css:.loaded-content",
-    screenshot=True,
-    enable_rate_limiting=True,
-    rate_limit_config=RateLimitConfig(
-        base_delay=(1.0, 3.0),
-        max_delay=60.0,
-        max_retries=3,
-        rate_limit_codes=[429, 503]
-    ),
-    stream=True  # Enable streaming
-)
-
-async with AsyncWebCrawler() as crawler:
-    result = await crawler.arun(url="https://example.com", config=crawl_conf)
-    print(result.screenshot[:100])  # Base64-encoded PNG snippet
-```
 
 ---
 
 ## 3. Putting It All Together
````
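The `RateLimitConfig` fields being removed above (`base_delay`, `max_delay`, `max_retries`, `rate_limit_codes`) describe a conventional backoff-and-retry loop. A rough, standalone sketch of that behavior, assuming exponential doubling of the random base delay (crawl4ai's actual implementation may differ):

```python
import random


def next_delay(attempt: int,
               base_delay: tuple = (1.0, 3.0),
               max_delay: float = 60.0) -> float:
    """Delay before retry `attempt` (0-based): a random base delay,
    doubled per attempt, capped at max_delay."""
    low, high = base_delay
    return min(random.uniform(low, high) * (2 ** attempt), max_delay)


def should_retry(status_code: int, attempt: int,
                 max_retries: int = 3,
                 rate_limit_codes: tuple = (429, 503)) -> bool:
    """Retry only on rate-limit status codes, at most max_retries times."""
    return status_code in rate_limit_codes and attempt < max_retries
```

With these defaults, a 429 response triggers up to three retries with growing delays, while any other status is surfaced immediately.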
```diff
@@ -322,13 +266,6 @@ async def main():
     run_conf = CrawlerRunConfig(
         extraction_strategy=extraction,
         cache_mode=CacheMode.BYPASS,
-        enable_rate_limiting=True,
-        rate_limit_config=RateLimitConfig(
-            base_delay=(1.0, 3.0),
-            max_delay=60.0,
-            max_retries=3,
-            rate_limit_codes=[429, 503]
-        )
     )
 
     async with AsyncWebCrawler(config=browser_conf) as crawler:
```