refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring

Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.
UncleCode
2025-01-11 21:10:27 +08:00
parent 3865342c93
commit 825c78a048
19 changed files with 1742 additions and 484 deletions


@@ -1,7 +1,3 @@
Below is a **revised parameter guide** for **`arun()`** in **AsyncWebCrawler**, reflecting the **new** approach where all parameters are passed via a **`CrawlerRunConfig`** instead of directly to `arun()`. Each section includes example usage in the new style, ensuring a clear, modern approach.
---
# `arun()` Parameter Guide (New Approach)
In Crawl4AI's **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:

docs/md_v2/api/arun_many.md Normal file

@@ -0,0 +1,100 @@
# `arun_many(...)` Reference
> **Note**: This function is very similar to [`arun()`](./arun.md) but focused on **concurrent** or **batch** crawling. If you're unfamiliar with `arun()` usage, please read that doc first, then review this for differences.
## Function Signature
```python
async def arun_many(
    urls: Union[List[str], List[Any]],
    config: Optional[CrawlerRunConfig] = None,
    dispatcher: Optional[BaseDispatcher] = None,
    ...
) -> List[CrawlResult]:
    """
    Crawl multiple URLs concurrently or in batches.

    :param urls: A list of URLs (or tasks) to crawl.
    :param config: (Optional) A default `CrawlerRunConfig` applying to each crawl.
    :param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
    ...
    :return: A list of `CrawlResult` objects, one per URL.
    """
```
## Differences from `arun()`
1. **Multiple URLs**:
- Instead of crawling a single URL, you pass a list of them (strings or tasks).
- The function returns a **list** of `CrawlResult`, in the same order as `urls`.
2. **Concurrency & Dispatchers**:
- **`dispatcher`** param allows advanced concurrency control.
- If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
- Dispatchers handle concurrency, rate limiting, and memory-based adaptive throttling (see [Multi-URL Crawling](../advanced/multi-url-crawling.md)).
3. **Parallel Execution**:
- `arun_many()` can run multiple requests concurrently under the hood.
- Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).
### Basic Example
```python
# Minimal usage: The default dispatcher will be used
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=my_run_config
)
for res in results:
    if res.success:
        print(res.url, "crawled OK!")
    else:
        print("Failed:", res.url, "-", res.error_message)
```
### With a Custom Dispatcher
```python
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    max_session_permit=10
)
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=my_run_config,
    dispatcher=dispatcher
)
```
**Key Points**:
- Each URL is processed by the same or separate sessions, depending on the dispatcher's strategy.
- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.
- If you need to handle authentication or session IDs, pass them in each individual task or within your run config.
### Return Value
A **list** of [`CrawlResult`](./crawl-result.md) objects, one per URL. You can iterate to check `result.success` or read each item's `extracted_content`, `markdown`, or `dispatch_result`.
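As a minimal sketch of post-processing that list (assuming `results` came from the earlier `arun_many()` calls and that a dispatcher populated `dispatch_result`):
```python
for result in results:
    if not result.success:
        print(f"{result.url} failed: {result.error_message}")
        continue
    # dispatch_result is only populated when a dispatcher ran the crawl
    if result.dispatch_result:
        dr = result.dispatch_result
        print(f"{result.url}: took {dr.end_time - dr.start_time}, peak {dr.peak_memory:.1f} MB")
```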
---
## Dispatcher Reference
- **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage.
- **`SemaphoreDispatcher`**: Fixed concurrency limit, simpler but less adaptive.
For advanced usage or custom settings, see [Multi-URL Crawling with Dispatchers](../advanced/multi-url-crawling.md).
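As a rough sketch of picking between the two (the import path follows the `crawl4ai.dispatcher` module used elsewhere in these docs, and the `SemaphoreDispatcher` constructor argument is an assumption; check the dispatcher docs for the exact signature):
```python
from crawl4ai.dispatcher import MemoryAdaptiveDispatcher, SemaphoreDispatcher

# Adaptive: throttles new crawls when system memory runs high
adaptive = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    max_session_permit=10
)

# Fixed: a plain concurrency cap (parameter name assumed here)
fixed = SemaphoreDispatcher(semaphore_count=5)

# Either one plugs into arun_many() the same way (inside an async context)
results = await crawler.arun_many(urls=my_urls, config=my_run_config, dispatcher=fixed)
```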
---
## Common Pitfalls
1. **Large Lists**: If you pass thousands of URLs, be mindful of memory and rate limits. A dispatcher can help, as can batching (see the sketch after this list).
2. **Session Reuse**: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly.
3. **Error Handling**: Each `CrawlResult` might fail for different reasons—always check `result.success` or the `error_message` before proceeding.
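For the first pitfall, a simple mitigation is to feed very large lists to `arun_many()` in batches; a minimal sketch (the batch size and helper name are arbitrary):
```python
async def crawl_in_batches(crawler, urls, config, batch_size=50):
    """Crawl a large URL list in fixed-size batches to bound memory use."""
    all_results = []
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        results = await crawler.arun_many(urls=batch, config=config)
        all_results.extend(results)
        # Surface failures per batch instead of only at the very end
        for r in results:
            if not r.success:
                print(f"Failed: {r.url} - {r.error_message}")
    return all_results
```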
---
## Conclusion
Use `arun_many()` when you want to **crawl multiple URLs** simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate-limiting), provide a **dispatcher**. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) docs.


@@ -130,51 +130,88 @@ For **backward** compatibility, `arun()` can still accept direct arguments like
---
## 4. Helper Methods
### 4.1 `arun_many()`
## 4. Batch Processing: `arun_many()`
```python
async def arun_many(
    self,
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters maintained for backwards compatibility...
) -> List[CrawlResult]:
    """
    Process multiple URLs with intelligent rate limiting and resource monitoring.
    """
```
Crawls multiple URLs concurrently, accepting the same style of `CrawlerRunConfig`.
### 4.1 Resource-Aware Crawling
The `arun_many()` method now uses an intelligent dispatcher that:
- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
- Manages concurrent crawls efficiently
### 4.2 Example Usage
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, RateLimitConfig
from crawl4ai.dispatcher import DisplayMode

# Configure browser
browser_cfg = BrowserConfig(headless=True)

# Configure crawler with rate limiting
run_cfg = CrawlerRunConfig(
    # Enable rate limiting
    enable_rate_limiting=True,
    rate_limit_config=RateLimitConfig(
        base_delay=(1.0, 2.0),        # Random delay between 1-2 seconds
        max_delay=30.0,               # Maximum delay after rate limit hits
        max_retries=2,                # Number of retries before giving up
        rate_limit_codes=[429, 503]   # Status codes that trigger rate limiting
    ),
    # Resource monitoring
    memory_threshold_percent=70.0,    # Pause if memory exceeds this
    check_interval=0.5,               # How often to check resources
    max_session_permit=3,             # Maximum concurrent crawls
    display_mode=DisplayMode.DETAILED.value  # Show detailed progress
)

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(urls, config=run_cfg)
    for result in results:
        print(f"URL: {result.url}, Success: {result.success}")
```
### 4.2 `start()` & `close()`
Allows manual lifecycle usage instead of the context manager:
```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

# Perform multiple operations
resultA = await crawler.arun("https://exampleA.com", config=run_cfg)
resultB = await crawler.arun("https://exampleB.com", config=run_cfg)

await crawler.close()
```
### 4.3 Key Features
1. **Rate Limiting**
   - Automatic delay between requests
   - Exponential backoff on rate limit detection
   - Domain-specific rate limiting
   - Configurable retry strategy
2. **Resource Monitoring**
   - Memory usage tracking
   - Adaptive concurrency based on system load
   - Automatic pausing when resources are constrained
3. **Progress Monitoring**
   - Detailed or aggregated progress display
   - Real-time status updates
   - Memory usage statistics
4. **Error Handling**
   - Graceful handling of rate limits
   - Automatic retries with backoff
   - Detailed error reporting
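A small sketch of acting on the error-handling behavior above, separating rate-limited failures from other errors (field names follow the `CrawlResult` reference; the 429/503 codes mirror the default `rate_limit_codes`):
```python
rate_limited, failed = [], []
for result in results:
    if result.success:
        continue
    # status_code may be None if the request never completed
    if result.status_code in (429, 503):
        rate_limited.append(result)   # still throttled after retries/backoff
    else:
        failed.append(result)

print(f"rate-limited: {len(rate_limited)}, other failures: {len(failed)}")
for r in failed:
    print(r.url, "-", r.error_message)
```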
---


@@ -26,6 +26,7 @@ class CrawlResult(BaseModel):
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None
    dispatch_result: Optional[DispatchResult] = None
    ...
```
@@ -262,7 +263,31 @@ if result.metadata:
---
## 6. Example: Accessing Everything
## 6. `dispatch_result` (optional)
A `DispatchResult` object providing additional concurrency and resource usage information when crawling URLs in parallel (e.g., via `arun_many()` with custom dispatchers). It contains:
- **`task_id`**: A unique identifier for the parallel task.
- **`memory_usage`** (float): The memory (in MB) used at the time of completion.
- **`peak_memory`** (float): The peak memory usage (in MB) recorded during the task's execution.
- **`start_time`** / **`end_time`** (datetime): Time range for this crawling task.
- **`error_message`** (str): Any dispatcher- or concurrency-related error encountered.
```python
# Example usage:
for result in results:
    if result.success and result.dispatch_result:
        dr = result.dispatch_result
        print(f"URL: {result.url}, Task ID: {dr.task_id}")
        print(f"Memory: {dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)")
        print(f"Duration: {dr.end_time - dr.start_time}")
```
> **Note**: This field is typically populated when using `arun_many(...)` alongside a **dispatcher** (e.g., `MemoryAdaptiveDispatcher` or `SemaphoreDispatcher`). If no concurrency or dispatcher is used, `dispatch_result` may remain `None`.
---
## 7. Example: Accessing Everything
```python
async def handle_result(result: CrawlResult):
@@ -306,7 +331,7 @@ async def handle_result(result: CrawlResult):
---
## 7. Key Points & Future
## 8. Key Points & Future
1. **`markdown_v2` vs `markdown`**
- Right now, `markdown_v2` is the more robust container (`MarkdownGenerationResult`), providing **raw_markdown**, **markdown_with_citations**, references, plus possible **fit_markdown**.
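For instance, a brief sketch of reading those fields (hedged: `fit_markdown` is only present when a content filter produced it):
```python
md = result.markdown_v2  # a MarkdownGenerationResult
if md:
    print(len(md.raw_markdown), "chars of raw markdown")
    print(md.markdown_with_citations[:200])
    if md.fit_markdown:
        print("fit_markdown:", len(md.fit_markdown), "chars")
```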


@@ -157,7 +157,32 @@ Use these for link-level content filtering (often to keep crawls “internal”
---
### G) **Debug & Logging**
### G) **Rate Limiting & Resource Management**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| **`enable_rate_limiting`** | `bool` (default: `False`) | Enable intelligent rate limiting for multiple URLs |
| **`rate_limit_config`** | `RateLimitConfig` (default: `None`) | Configuration for rate limiting behavior |
The `RateLimitConfig` class has these fields:
| **Field** | **Type / Default** | **What It Does** |
|--------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| **`base_delay`** | `Tuple[float, float]` (1.0, 3.0) | Random delay range between requests to the same domain |
| **`max_delay`** | `float` (60.0) | Maximum delay after rate limit detection |
| **`max_retries`** | `int` (3) | Number of retries before giving up on rate-limited requests |
| **`rate_limit_codes`** | `List[int]` ([429, 503]) | HTTP status codes that trigger rate limiting behavior |
| **Parameter** | **Type / Default** | **What It Does** |
|-------------------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| **`memory_threshold_percent`** | `float` (70.0) | Maximum memory usage before pausing new crawls |
| **`check_interval`** | `float` (1.0) | How often to check system resources (in seconds) |
| **`max_session_permit`** | `int` (20) | Maximum number of concurrent crawl sessions |
| **`display_mode`** | `str` (`None`, "DETAILED", "AGGREGATED") | How to display progress information |
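Taken together, a minimal sketch wiring these fields into a `CrawlerRunConfig` (values simply echo the defaults listed above):
```python
from crawl4ai import CrawlerRunConfig, RateLimitConfig

run_cfg = CrawlerRunConfig(
    enable_rate_limiting=True,
    rate_limit_config=RateLimitConfig(
        base_delay=(1.0, 3.0),      # random per-domain delay range
        max_delay=60.0,             # cap after repeated rate-limit hits
        max_retries=3,
        rate_limit_codes=[429, 503]
    ),
    memory_threshold_percent=70.0,  # pause new crawls above this usage
    check_interval=1.0,             # resource check frequency (seconds)
    max_session_permit=20,
    display_mode="DETAILED"
)
```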
---
### H) **Debug & Logging**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------|--------------------|---------------------------------------------------------------------------|
@@ -170,7 +195,7 @@ Use these for link-level content filtering (often to keep crawls “internal”
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, RateLimitConfig
async def main():
    # Configure the browser
@@ -190,7 +215,18 @@ async def main():
        excluded_tags=["script", "style"],
        exclude_external_links=True,
        wait_for="css:.article-loaded",
        screenshot=True,
        enable_rate_limiting=True,
        rate_limit_config=RateLimitConfig(
            base_delay=(1.0, 3.0),
            max_delay=60.0,
            max_retries=3,
            rate_limit_codes=[429, 503]
        ),
        memory_threshold_percent=70.0,
        check_interval=1.0,
        max_session_permit=20,
        display_mode="DETAILED"
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
@@ -223,4 +259,3 @@ if __name__ == "__main__":
- **Use** `BrowserConfig` for **global** browser settings: engine, headless, proxy, user agent.
- **Use** `CrawlerRunConfig` for each crawl's **context**: how to filter content, handle caching, wait for dynamic elements, or run JS.
- **Pass** both configs to `AsyncWebCrawler` (the `BrowserConfig`) and then to `arun()` (the `CrawlerRunConfig`), as in the sketch below.
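A minimal end-to-end sketch of that flow (the URL and `wait_for` selector are placeholders):
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)                   # global browser settings
    run_cfg = CrawlerRunConfig(wait_for="css:.article-loaded")   # per-crawl context

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun("https://example.com", config=run_cfg)
        print(result.success, result.status_code)

if __name__ == "__main__":
    asyncio.run(main())
```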