refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring
Reorganize dispatcher functionality into separate components: - Create dedicated dispatcher classes (MemoryAdaptive, Semaphore) - Add RateLimiter for smart request throttling - Implement CrawlerMonitor for real-time progress tracking - Move dispatcher config from CrawlerRunConfig to separate classes BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.
This commit is contained in:
@@ -1,7 +1,3 @@
|
||||
Below is a **revised parameter guide** for **`arun()`** in **AsyncWebCrawler**, reflecting the **new** approach where all parameters are passed via a **`CrawlerRunConfig`** instead of directly to `arun()`. Each section includes example usage in the new style, ensuring a clear, modern approach.
|
||||
|
||||
---
|
||||
|
||||
# `arun()` Parameter Guide (New Approach)
|
||||
|
||||
In Crawl4AI’s **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:
|
||||
|
||||
100
docs/md_v2/api/arun_many.md
Normal file
100
docs/md_v2/api/arun_many.md
Normal file
@@ -0,0 +1,100 @@
|
||||
# `arun_many(...)` Reference
|
||||
|
||||
> **Note**: This function is very similar to [`arun()`](./arun.md) but focused on **concurrent** or **batch** crawling. If you’re unfamiliar with `arun()` usage, please read that doc first, then review this for differences.
|
||||
|
||||
## Function Signature
|
||||
|
||||
```python
|
||||
async def arun_many(
|
||||
urls: Union[List[str], List[Any]],
|
||||
config: Optional[CrawlerRunConfig] = None,
|
||||
dispatcher: Optional[BaseDispatcher] = None,
|
||||
...
|
||||
) -> List[CrawlResult]:
|
||||
"""
|
||||
Crawl multiple URLs concurrently or in batches.
|
||||
|
||||
:param urls: A list of URLs (or tasks) to crawl.
|
||||
:param config: (Optional) A default `CrawlerRunConfig` applying to each crawl.
|
||||
:param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
|
||||
...
|
||||
:return: A list of `CrawlResult` objects, one per URL.
|
||||
"""
|
||||
```
|
||||
|
||||
## Differences from `arun()`
|
||||
|
||||
1. **Multiple URLs**:
|
||||
- Instead of crawling a single URL, you pass a list of them (strings or tasks).
|
||||
- The function returns a **list** of `CrawlResult`, in the same order as `urls`.
|
||||
|
||||
2. **Concurrency & Dispatchers**:
|
||||
- **`dispatcher`** param allows advanced concurrency control.
|
||||
- If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
|
||||
- Dispatchers handle concurrency, rate limiting, and memory-based adaptive throttling (see [Multi-URL Crawling](../advanced/multi-url-crawling.md)).
|
||||
|
||||
3. **Parallel** Execution**:
|
||||
- `arun_many()` can run multiple requests concurrently under the hood.
|
||||
- Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).
|
||||
|
||||
### Basic Example
|
||||
|
||||
```python
|
||||
# Minimal usage: The default dispatcher will be used
|
||||
results = await crawler.arun_many(
|
||||
urls=["https://site1.com", "https://site2.com"],
|
||||
config=my_run_config
|
||||
)
|
||||
|
||||
for res in results:
|
||||
if res.success:
|
||||
print(res.url, "crawled OK!")
|
||||
else:
|
||||
print("Failed:", res.url, "-", res.error_message)
|
||||
```
|
||||
|
||||
### With a Custom Dispatcher
|
||||
|
||||
```python
|
||||
dispatcher = MemoryAdaptiveDispatcher(
|
||||
memory_threshold_percent=70.0,
|
||||
max_session_permit=10
|
||||
)
|
||||
results = await crawler.arun_many(
|
||||
urls=["https://site1.com", "https://site2.com", "https://site3.com"],
|
||||
config=my_run_config,
|
||||
dispatcher=dispatcher
|
||||
)
|
||||
```
|
||||
|
||||
**Key Points**:
|
||||
- Each URL is processed by the same or separate sessions, depending on the dispatcher’s strategy.
|
||||
- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.
|
||||
- If you need to handle authentication or session IDs, pass them in each individual task or within your run config.
|
||||
|
||||
### Return Value
|
||||
|
||||
A **list** of [`CrawlResult`](./crawl-result.md) objects, one per URL. You can iterate to check `result.success` or read each item’s `extracted_content`, `markdown`, or `dispatch_result`.
|
||||
|
||||
---
|
||||
|
||||
## Dispatcher Reference
|
||||
|
||||
- **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage.
|
||||
- **`SemaphoreDispatcher`**: Fixed concurrency limit, simpler but less adaptive.
|
||||
|
||||
For advanced usage or custom settings, see [Multi-URL Crawling with Dispatchers](../advanced/multi-url-crawling.md).
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
1. **Large Lists**: If you pass thousands of URLs, be mindful of memory or rate-limits. A dispatcher can help.
|
||||
2. **Session Reuse**: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly.
|
||||
3. **Error Handling**: Each `CrawlResult` might fail for different reasons—always check `result.success` or the `error_message` before proceeding.
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
Use `arun_many()` when you want to **crawl multiple URLs** simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate-limiting), provide a **dispatcher**. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) docs.
|
||||
@@ -130,51 +130,88 @@ For **backward** compatibility, `arun()` can still accept direct arguments like
|
||||
|
||||
---
|
||||
|
||||
## 4. Helper Methods
|
||||
|
||||
### 4.1 `arun_many()`
|
||||
## 4. Batch Processing: `arun_many()`
|
||||
|
||||
```python
|
||||
async def arun_many(
|
||||
self,
|
||||
urls: List[str],
|
||||
config: Optional[CrawlerRunConfig] = None,
|
||||
# Legacy parameters...
|
||||
# Legacy parameters maintained for backwards compatibility...
|
||||
) -> List[CrawlResult]:
|
||||
...
|
||||
"""
|
||||
Process multiple URLs with intelligent rate limiting and resource monitoring.
|
||||
"""
|
||||
```
|
||||
|
||||
Crawls multiple URLs in concurrency. Accepts the same style `CrawlerRunConfig`. Example:
|
||||
### 4.1 Resource-Aware Crawling
|
||||
|
||||
The `arun_many()` method now uses an intelligent dispatcher that:
|
||||
- Monitors system memory usage
|
||||
- Implements adaptive rate limiting
|
||||
- Provides detailed progress monitoring
|
||||
- Manages concurrent crawls efficiently
|
||||
|
||||
### 4.2 Example Usage
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, RateLimitConfig
|
||||
from crawl4ai.dispatcher import DisplayMode
|
||||
|
||||
# Configure browser
|
||||
browser_cfg = BrowserConfig(headless=True)
|
||||
|
||||
# Configure crawler with rate limiting
|
||||
run_cfg = CrawlerRunConfig(
|
||||
# e.g., concurrency, wait_for, caching, extraction, etc.
|
||||
semaphore_count=5
|
||||
# Enable rate limiting
|
||||
enable_rate_limiting=True,
|
||||
rate_limit_config=RateLimitConfig(
|
||||
base_delay=(1.0, 2.0), # Random delay between 1-2 seconds
|
||||
max_delay=30.0, # Maximum delay after rate limit hits
|
||||
max_retries=2, # Number of retries before giving up
|
||||
rate_limit_codes=[429, 503] # Status codes that trigger rate limiting
|
||||
),
|
||||
# Resource monitoring
|
||||
memory_threshold_percent=70.0, # Pause if memory exceeds this
|
||||
check_interval=0.5, # How often to check resources
|
||||
max_session_permit=3, # Maximum concurrent crawls
|
||||
display_mode=DisplayMode.DETAILED.value # Show detailed progress
|
||||
)
|
||||
|
||||
urls = [
|
||||
"https://example.com/page1",
|
||||
"https://example.com/page2",
|
||||
"https://example.com/page3"
|
||||
]
|
||||
|
||||
async with AsyncWebCrawler(config=browser_cfg) as crawler:
|
||||
results = await crawler.arun_many(
|
||||
urls=["https://example.com", "https://another.com"],
|
||||
config=run_cfg
|
||||
)
|
||||
for r in results:
|
||||
print(r.url, ":", len(r.cleaned_html))
|
||||
results = await crawler.arun_many(urls, config=run_cfg)
|
||||
for result in results:
|
||||
print(f"URL: {result.url}, Success: {result.success}")
|
||||
```
|
||||
|
||||
### 4.2 `start()` & `close()`
|
||||
### 4.3 Key Features
|
||||
|
||||
Allows manual lifecycle usage instead of context manager:
|
||||
1. **Rate Limiting**
|
||||
- Automatic delay between requests
|
||||
- Exponential backoff on rate limit detection
|
||||
- Domain-specific rate limiting
|
||||
- Configurable retry strategy
|
||||
|
||||
```python
|
||||
crawler = AsyncWebCrawler(config=browser_cfg)
|
||||
await crawler.start()
|
||||
2. **Resource Monitoring**
|
||||
- Memory usage tracking
|
||||
- Adaptive concurrency based on system load
|
||||
- Automatic pausing when resources are constrained
|
||||
|
||||
# Perform multiple operations
|
||||
resultA = await crawler.arun("https://exampleA.com", config=run_cfg)
|
||||
resultB = await crawler.arun("https://exampleB.com", config=run_cfg)
|
||||
3. **Progress Monitoring**
|
||||
- Detailed or aggregated progress display
|
||||
- Real-time status updates
|
||||
- Memory usage statistics
|
||||
|
||||
await crawler.close()
|
||||
```
|
||||
4. **Error Handling**
|
||||
- Graceful handling of rate limits
|
||||
- Automatic retries with backoff
|
||||
- Detailed error reporting
|
||||
|
||||
---
|
||||
|
||||
|
||||
@@ -26,6 +26,7 @@ class CrawlResult(BaseModel):
|
||||
response_headers: Optional[dict] = None
|
||||
status_code: Optional[int] = None
|
||||
ssl_certificate: Optional[SSLCertificate] = None
|
||||
dispatch_result: Optional[DispatchResult] = None
|
||||
...
|
||||
```
|
||||
|
||||
@@ -262,7 +263,31 @@ if result.metadata:
|
||||
|
||||
---
|
||||
|
||||
## 6. Example: Accessing Everything
|
||||
## 6. `dispatch_result` (optional)
|
||||
|
||||
A `DispatchResult` object providing additional concurrency and resource usage information when crawling URLs in parallel (e.g., via `arun_many()` with custom dispatchers). It contains:
|
||||
|
||||
- **`task_id`**: A unique identifier for the parallel task.
|
||||
- **`memory_usage`** (float): The memory (in MB) used at the time of completion.
|
||||
- **`peak_memory`** (float): The peak memory usage (in MB) recorded during the task’s execution.
|
||||
- **`start_time`** / **`end_time`** (datetime): Time range for this crawling task.
|
||||
- **`error_message`** (str): Any dispatcher- or concurrency-related error encountered.
|
||||
|
||||
```python
|
||||
# Example usage:
|
||||
for result in results:
|
||||
if result.success and result.dispatch_result:
|
||||
dr = result.dispatch_result
|
||||
print(f"URL: {result.url}, Task ID: {dr.task_id}")
|
||||
print(f"Memory: {dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)")
|
||||
print(f"Duration: {dr.end_time - dr.start_time}")
|
||||
```
|
||||
|
||||
> **Note**: This field is typically populated when using `arun_many(...)` alongside a **dispatcher** (e.g., `MemoryAdaptiveDispatcher` or `SemaphoreDispatcher`). If no concurrency or dispatcher is used, `dispatch_result` may remain `None`.
|
||||
|
||||
---
|
||||
|
||||
## 7. Example: Accessing Everything
|
||||
|
||||
```python
|
||||
async def handle_result(result: CrawlResult):
|
||||
@@ -306,7 +331,7 @@ async def handle_result(result: CrawlResult):
|
||||
|
||||
---
|
||||
|
||||
## 7. Key Points & Future
|
||||
## 8. Key Points & Future
|
||||
|
||||
1. **`markdown_v2` vs `markdown`**
|
||||
- Right now, `markdown_v2` is the more robust container (`MarkdownGenerationResult`), providing **raw_markdown**, **markdown_with_citations**, references, plus possible **fit_markdown**.
|
||||
|
||||
@@ -157,7 +157,32 @@ Use these for link-level content filtering (often to keep crawls “internal”
|
||||
|
||||
---
|
||||
|
||||
### G) **Debug & Logging**
|
||||
### G) **Rate Limiting & Resource Management**
|
||||
|
||||
| **Parameter** | **Type / Default** | **What It Does** |
|
||||
|------------------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
|
||||
| **`enable_rate_limiting`** | `bool` (default: `False`) | Enable intelligent rate limiting for multiple URLs |
|
||||
| **`rate_limit_config`** | `RateLimitConfig` (default: `None`) | Configuration for rate limiting behavior |
|
||||
|
||||
The `RateLimitConfig` class has these fields:
|
||||
|
||||
| **Field** | **Type / Default** | **What It Does** |
|
||||
|--------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
|
||||
| **`base_delay`** | `Tuple[float, float]` (1.0, 3.0) | Random delay range between requests to the same domain |
|
||||
| **`max_delay`** | `float` (60.0) | Maximum delay after rate limit detection |
|
||||
| **`max_retries`** | `int` (3) | Number of retries before giving up on rate-limited requests |
|
||||
| **`rate_limit_codes`** | `List[int]` ([429, 503]) | HTTP status codes that trigger rate limiting behavior |
|
||||
|
||||
| **Parameter** | **Type / Default** | **What It Does** |
|
||||
|-------------------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
|
||||
| **`memory_threshold_percent`** | `float` (70.0) | Maximum memory usage before pausing new crawls |
|
||||
| **`check_interval`** | `float` (1.0) | How often to check system resources (in seconds) |
|
||||
| **`max_session_permit`** | `int` (20) | Maximum number of concurrent crawl sessions |
|
||||
| **`display_mode`** | `str` (`None`, "DETAILED", "AGGREGATED") | How to display progress information |
|
||||
|
||||
---
|
||||
|
||||
### H) **Debug & Logging**
|
||||
|
||||
| **Parameter** | **Type / Default** | **What It Does** |
|
||||
|----------------|--------------------|---------------------------------------------------------------------------|
|
||||
@@ -170,7 +195,7 @@ Use these for link-level content filtering (often to keep crawls “internal”
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, RateLimitConfig
|
||||
|
||||
async def main():
|
||||
# Configure the browser
|
||||
@@ -190,7 +215,18 @@ async def main():
|
||||
excluded_tags=["script", "style"],
|
||||
exclude_external_links=True,
|
||||
wait_for="css:.article-loaded",
|
||||
screenshot=True
|
||||
screenshot=True,
|
||||
enable_rate_limiting=True,
|
||||
rate_limit_config=RateLimitConfig(
|
||||
base_delay=(1.0, 3.0),
|
||||
max_delay=60.0,
|
||||
max_retries=3,
|
||||
rate_limit_codes=[429, 503]
|
||||
),
|
||||
memory_threshold_percent=70.0,
|
||||
check_interval=1.0,
|
||||
max_session_permit=20,
|
||||
display_mode="DETAILED"
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_cfg) as crawler:
|
||||
@@ -223,4 +259,3 @@ if __name__ == "__main__":
|
||||
- **Use** `BrowserConfig` for **global** browser settings: engine, headless, proxy, user agent.
|
||||
- **Use** `CrawlerRunConfig` for each crawl’s **context**: how to filter content, handle caching, wait for dynamic elements, or run JS.
|
||||
- **Pass** both configs to `AsyncWebCrawler` (the `BrowserConfig`) and then to `arun()` (the `CrawlerRunConfig`).
|
||||
|
||||
|
||||
Reference in New Issue
Block a user