refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring
Reorganize dispatcher functionality into separate components:

- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.
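For users affected by the breaking change, a hedged migration sketch follows. The class names come from this commit message; the constructor parameters and the `dispatcher=` keyword on `arun_many()` are assumptions inferred from the old `CrawlerRunConfig` fields shown in the diff below, not confirmed by it.

```python
# Migration sketch (assumptions noted inline). Old style: dispatcher fields
# lived on CrawlerRunConfig. New style: dedicated dispatcher classes.
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Assumed import path for the classes named in this commit.
from crawl4ai import MemoryAdaptiveDispatcher, RateLimiter

async def main():
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,  # was CrawlerRunConfig.memory_threshold_percent
        check_interval=1.0,             # was CrawlerRunConfig.check_interval
        max_session_permit=20,          # was CrawlerRunConfig.max_session_permit
        rate_limiter=RateLimiter(       # replaces enable_rate_limiting + rate_limit_config
            base_delay=(1.0, 3.0),
            max_delay=60.0,
            max_retries=3,
            rate_limit_codes=[429, 503],
        ),
    )
    run_conf = CrawlerRunConfig()  # dispatcher settings no longer set here

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            ["https://example.com/page1", "https://example.com/page2"],
            config=run_conf,
            dispatcher=dispatcher,  # assumed keyword for supplying a dispatcher
        )
        for res in results:
            print(res.url, res.success)

asyncio.run(main())
```

A `SemaphoreDispatcher` (fixed concurrency) and a `CrawlerMonitor` (progress display) are also named in this commit and would presumably slot in the same way.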
```diff
@@ -116,6 +116,12 @@ class CrawlerRunConfig:
     wait_for=None,
     screenshot=False,
     pdf=False,
+    enable_rate_limiting=False,
+    rate_limit_config=None,
+    memory_threshold_percent=70.0,
+    check_interval=1.0,
+    max_session_permit=20,
+    display_mode=None,
     verbose=True,
     # ... other advanced parameters omitted
 ):
```
@@ -156,6 +162,58 @@ class CrawlerRunConfig:

   - Logs additional runtime details.
   - Overlaps with the browser’s verbosity if also set to `True` in `BrowserConfig`.

9. **`enable_rate_limiting`**:
   - If `True`, enables rate limiting for batch processing.
   - Requires `rate_limit_config` to be set.

10. **`rate_limit_config`**:
    - A `RateLimitConfig` object controlling rate-limiting behavior.
    - See below for details.

11. **`memory_threshold_percent`**:
    - The memory threshold, as a percentage of total system memory, to monitor.
    - If usage exceeds it, the crawler pauses or slows down; the sketch after this list shows the idea.

12. **`check_interval`**:
    - The interval (in seconds) between system resource checks.
    - Controls how often memory and CPU usage are monitored.

13. **`max_session_permit`**:
    - The maximum number of concurrent crawl sessions.
    - Helps prevent overwhelming the system.

14. **`display_mode`**:
    - The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
    - Determines how much information is printed during the crawl.
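As referenced in item 11, here is a minimal sketch of the kind of check a memory-adaptive crawler might run every `check_interval` seconds. The helper function is illustrative only, not crawl4ai's actual implementation.

```python
# Illustrative only: a simplified resource check of the kind a
# memory-adaptive dispatcher might run every `check_interval` seconds.
import asyncio
import psutil

async def throttle_if_memory_high(memory_threshold_percent: float = 70.0,
                                  check_interval: float = 1.0) -> None:
    """Block while system memory usage is above the threshold."""
    while psutil.virtual_memory().percent >= memory_threshold_percent:
        # Memory pressure is high: hold off on launching new crawl sessions.
        await asyncio.sleep(check_interval)
```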
### Rate Limiting & Resource Management

For batch processing with `arun_many()`, you can enable intelligent rate limiting:

```python
from crawl4ai import CrawlerRunConfig, RateLimitConfig

config = CrawlerRunConfig(
    enable_rate_limiting=True,
    rate_limit_config=RateLimitConfig(
        base_delay=(1.0, 3.0),          # Random delay range between requests
        max_delay=60.0,                 # Cap on the delay after rate-limit responses
        max_retries=3,                  # Retries before giving up on a URL
        rate_limit_codes=[429, 503]     # Status codes treated as rate limits
    ),
    memory_threshold_percent=70.0,      # Memory threshold to monitor
    check_interval=1.0,                 # Resource check interval (seconds)
    max_session_permit=20,              # Max concurrent crawl sessions
    display_mode="DETAILED"             # Progress display mode
)
```
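To make the back-off semantics of those four `RateLimitConfig` fields concrete: the usual approach is a per-domain delay that grows when a 429/503 is seen, is capped at `max_delay`, and gives up after `max_retries`. The sketch below illustrates that idea only; the `DomainBackoff` class is hypothetical, not crawl4ai's actual implementation.

```python
# Illustration of the rate-limiting idea (not crawl4ai's actual code):
# each domain keeps its own delay, which backs off on 429/503 responses.
import random

class DomainBackoff:
    """Hypothetical helper: per-domain delay that backs off on rate limits."""

    def __init__(self, base_delay=(1.0, 3.0), max_delay=60.0, max_retries=3,
                 rate_limit_codes=(429, 503)):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.max_retries = max_retries
        self.rate_limit_codes = set(rate_limit_codes)
        self.current_delay = {}  # domain -> delay currently in force
        self.failures = {}       # domain -> consecutive rate-limit responses

    def next_delay(self, domain):
        # Start from a random delay in base_delay; grow it after rate limits.
        return self.current_delay.get(domain, random.uniform(*self.base_delay))

    def record_response(self, domain, status_code):
        """Update state; return False once max_retries is exhausted."""
        if status_code in self.rate_limit_codes:
            self.failures[domain] = self.failures.get(domain, 0) + 1
            # Double the delay on each rate-limit response, capped at max_delay.
            self.current_delay[domain] = min(self.max_delay,
                                             self.next_delay(domain) * 2)
            return self.failures[domain] <= self.max_retries
        # Success: reset this domain's backoff state.
        self.failures[domain] = 0
        self.current_delay.pop(domain, None)
        return True
```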
This configuration:

- Implements intelligent rate limiting per domain
- Monitors system resources
- Provides detailed progress information
- Manages concurrent crawls efficiently
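A minimal sketch of how such a config might be passed to `arun_many()`, mirroring the Quickstart example later in this diff; the URLs are placeholders, and `config` refers to the object defined in the snippet above.

```python
# Minimal usage sketch: pass the rate-limited `config` from the previous
# snippet to arun_many(). URLs are placeholders.
import asyncio
from crawl4ai import AsyncWebCrawler

async def batch_crawl():
    urls = ["https://example.com/page1", "https://example.com/page2"]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=config)
        for res in results:
            print(f"{res.url}: {'OK' if res.success else res.error_message}")

asyncio.run(batch_crawl())
```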
**Minimal Example**:

```diff
@@ -164,7 +222,14 @@ from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
 crawl_conf = CrawlerRunConfig(
     js_code="document.querySelector('button#loadMore')?.click()",
     wait_for="css:.loaded-content",
-    screenshot=True
+    screenshot=True,
+    enable_rate_limiting=True,
+    rate_limit_config=RateLimitConfig(
+        base_delay=(1.0, 3.0),
+        max_delay=60.0,
+        max_retries=3,
+        rate_limit_codes=[429, 503]
+    )
 )
 
 async with AsyncWebCrawler() as crawler:
```
```diff
@@ -205,7 +270,14 @@ async def main():
     # 3) Crawler run config: skip cache, use extraction
     run_conf = CrawlerRunConfig(
         extraction_strategy=extraction,
-        cache_mode=CacheMode.BYPASS
+        cache_mode=CacheMode.BYPASS,
+        enable_rate_limiting=True,
+        rate_limit_config=RateLimitConfig(
+            base_delay=(1.0, 3.0),
+            max_delay=60.0,
+            max_retries=3,
+            rate_limit_codes=[429, 503]
+        )
     )
 
     async with AsyncWebCrawler(config=browser_conf) as crawler:
```
```diff
@@ -1,7 +1,3 @@
-Below is the **revised Quickstart** guide with the **Installation** section removed, plus an updated **dynamic content** crawl example that uses `BrowserConfig` and `CrawlerRunConfig` (instead of passing parameters directly to `arun()`). Everything else remains as before.
-
----
-
 # Getting Started with Crawl4AI
 
 Welcome to **Crawl4AI**, an open-source LLM-friendly Web Crawler & Scraper. In this tutorial, you’ll:
```
````diff
@@ -254,7 +250,39 @@ if __name__ == "__main__":
 
 ---
 
-## 7. Dynamic Content Example
+## 7. Multi-URL Concurrency (Preview)
+
+If you need to crawl multiple URLs in **parallel**, you can use `arun_many()`. By default, Crawl4AI employs a **MemoryAdaptiveDispatcher**, automatically adjusting concurrency based on system resources. Here’s a quick glimpse:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
+
+async def quick_parallel_example():
+    urls = [
+        "https://example.com/page1",
+        "https://example.com/page2",
+        "https://example.com/page3"
+    ]
+
+    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+
+    async with AsyncWebCrawler() as crawler:
+        results = await crawler.arun_many(urls, config=run_conf)
+        for res in results:
+            if res.success:
+                print(f"[OK] {res.url}, length: {len(res.markdown_v2.raw_markdown)}")
+            else:
+                print(f"[ERROR] {res.url} => {res.error_message}")
+
+if __name__ == "__main__":
+    asyncio.run(quick_parallel_example())
+```
+
+For more advanced concurrency (e.g., a **semaphore-based** approach, **adaptive memory usage throttling**, or customized rate limiting), see [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md).
+
+
+## 8. Dynamic Content Example
 
 Some sites require multiple “page clicks” or dynamic JavaScript updates. Below is an example showing how to **click** a “Next Page” button and wait for new commits to load on GitHub, using **`BrowserConfig`** and **`CrawlerRunConfig`**:
````
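The tutorial's actual example body falls outside this diff hunk. Below is a hedged sketch of the click-and-wait pattern it describes: `js_code`, `js_only`, `wait_for`, and `session_id` are real `CrawlerRunConfig` knobs, while the repository URL, CSS selectors, and the three-page loop are illustrative assumptions.

```python
# Hedged sketch of the click-and-wait pattern described above; selectors,
# the URL, and the paging loop are assumptions, not the tutorial's code.
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def crawl_dynamic_pages():
    browser_conf = BrowserConfig(headless=True)
    session_id = "github_commits"  # reuse one browser tab across steps

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        for page in range(3):
            run_conf = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                session_id=session_id,
                # Click "Next Page" on every step after the first (assumed selector).
                js_code=None if page == 0 else
                    "document.querySelector('a.pagination-next')?.click()",
                js_only=page > 0,          # after the first load, only run JS in the same tab
                wait_for="css:li.commit",  # wait until commit items appear (assumed selector)
            )
            result = await crawler.arun(
                "https://github.com/example/repo/commits/main", config=run_conf)
            print(f"Page {page + 1}: success={result.success}")

asyncio.run(crawl_dynamic_pages())
```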
```diff
@@ -343,7 +371,7 @@ if __name__ == "__main__":
 
 ---
 
-## 8. Next Steps
+## 9. Next Steps
 
 Congratulations! You have:
```