refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring

Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.
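A rough before/after sketch of the migration (parameter names are taken from the updated docs in this diff; exact signatures may vary):

```python
from crawl4ai import CrawlerRunConfig

# Before: concurrency was configured directly on CrawlerRunConfig
run_cfg = CrawlerRunConfig(semaphore_count=5)

# After: throttling and monitoring parameters drive the new dispatcher system
run_cfg = CrawlerRunConfig(
    enable_rate_limiting=True,
    memory_threshold_percent=70.0,  # pause new crawls above this memory usage
    max_session_permit=3,           # maximum concurrent crawls
)
```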
Author: UncleCode
Date: 2025-01-11 21:10:27 +08:00
Parent: 3865342c93
Commit: 825c78a048

19 changed files with 1742 additions and 484 deletions


@@ -130,51 +130,88 @@ For **backward** compatibility, `arun()` can still accept direct arguments like
---
## 4. Batch Processing: `arun_many()`
```python
async def arun_many(
    self,
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters maintained for backwards compatibility...
) -> List[CrawlResult]:
    """
    Process multiple URLs with intelligent rate limiting and resource monitoring.
    """
    ...
```
### 4.1 Resource-Aware Crawling
The `arun_many()` method now uses an intelligent dispatcher that:
- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
- Manages concurrent crawls efficiently
### 4.2 Example Usage
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, RateLimitConfig
from crawl4ai.dispatcher import DisplayMode

# Configure browser
browser_cfg = BrowserConfig(headless=True)

# Configure crawler with rate limiting and resource monitoring
run_cfg = CrawlerRunConfig(
    # Enable rate limiting
    enable_rate_limiting=True,
    rate_limit_config=RateLimitConfig(
        base_delay=(1.0, 2.0),       # Random delay between 1 and 2 seconds
        max_delay=30.0,              # Maximum delay after rate-limit hits
        max_retries=2,               # Number of retries before giving up
        rate_limit_codes=[429, 503]  # Status codes that trigger rate limiting
    ),
    # Resource monitoring
    memory_threshold_percent=70.0,   # Pause if memory usage exceeds this
    check_interval=0.5,              # How often (seconds) to check resources
    max_session_permit=3,            # Maximum concurrent crawls
    display_mode=DisplayMode.DETAILED.value  # Show detailed progress
)

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(urls, config=run_cfg)
    for result in results:
        print(f"URL: {result.url}, Success: {result.success}")
```
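The snippet above uses `await` at top level, as docs examples often do; in a standalone script you would wrap it in an async entry point (reusing `browser_cfg`, `run_cfg`, and `urls` from above):

```python
import asyncio

async def main():
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        results = await crawler.arun_many(urls, config=run_cfg)
        for result in results:
            print(f"URL: {result.url}, Success: {result.success}")

if __name__ == "__main__":
    asyncio.run(main())
```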
### 4.2 `start()` & `close()`
### 4.3 Key Features
Allows manual lifecycle usage instead of context manager:
1. **Rate Limiting**
- Automatic delay between requests
- Exponential backoff on rate limit detection
- Domain-specific rate limiting
- Configurable retry strategy
```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()
2. **Resource Monitoring**
- Memory usage tracking
- Adaptive concurrency based on system load
- Automatic pausing when resources are constrained
# Perform multiple operations
resultA = await crawler.arun("https://exampleA.com", config=run_cfg)
resultB = await crawler.arun("https://exampleB.com", config=run_cfg)
3. **Progress Monitoring**
- Detailed or aggregated progress display
- Real-time status updates
- Memory usage statistics
await crawler.close()
```
4. **Error Handling**
- Graceful handling of rate limits
- Automatic retries with backoff
- Detailed error reporting
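For intuition, here is a minimal, library-agnostic sketch of the exponential-backoff behavior described in items 1 and 4; it is not crawl4ai's actual implementation, just the usual pattern using the config values from the example above:

```python
import random

def next_delay(attempt: int,
               base_delay: tuple = (1.0, 2.0),
               max_delay: float = 30.0) -> float:
    """Compute the delay before retry number `attempt`, doubling a random base each time."""
    low, high = base_delay
    delay = random.uniform(low, high) * (2 ** attempt)  # exponential growth per retry
    return min(delay, max_delay)                        # never exceed the cap

# attempt 0, 1, 2 -> roughly 1-2 s, 2-4 s, 4-8 s, capped at 30 s
```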
---