`arun_many(...)` Reference
Note: This function is very similar to `arun()` but focused on concurrent or batch crawling. If you're unfamiliar with `arun()` usage, please read that doc first, then review this for differences.
Function Signature
```python
async def arun_many(
    urls: Union[List[str], List[Any]],
    config: Optional[CrawlerRunConfig] = None,
    dispatcher: Optional[BaseDispatcher] = None,
    ...
) -> List[CrawlResult]:
    """
    Crawl multiple URLs concurrently or in batches.

    :param urls: A list of URLs (or tasks) to crawl.
    :param config: (Optional) A default `CrawlerRunConfig` applying to each crawl.
    :param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
    ...
    :return: A list of `CrawlResult` objects, one per URL.
    """
```
Differences from `arun()`
- **Multiple URLs**:
  - Instead of crawling a single URL, you pass a list of them (strings or tasks).
  - The function returns a list of `CrawlResult` objects, in the same order as `urls`.
- **Concurrency & Dispatchers**:
  - The `dispatcher` param allows advanced concurrency control.
  - If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
  - Dispatchers handle concurrency, rate limiting, and memory-based adaptive throttling (see Multi-URL Crawling).
- **Parallel Execution**:
  - `arun_many()` can run multiple requests concurrently under the hood.
  - Each `CrawlResult` might also include a `dispatch_result` with concurrency details (like memory usage, start/end times).
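Since results come back in the same order as `urls`, you can pair each input with its output directly. A short sketch, reusing the `crawler` and `my_run_config` from the setup above:

```python
urls = ["https://site1.com", "https://site2.com"]
results = await crawler.arun_many(urls=urls, config=my_run_config)

# Results preserve input order, so zip() pairs each URL with its result.
for url, res in zip(urls, results):
    print(url, "->", "ok" if res.success else res.error_message)
```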
Basic Example
```python
# Minimal usage: the default dispatcher will be used
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=my_run_config
)

for res in results:
    if res.success:
        print(res.url, "crawled OK!")
    else:
        print("Failed:", res.url, "-", res.error_message)
```
With a Custom Dispatcher
```python
# Import path may vary by version
from crawl4ai import MemoryAdaptiveDispatcher

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    max_session_permit=10
)

results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=my_run_config,
    dispatcher=dispatcher
)
```
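With these settings, the dispatcher keeps at most 10 crawl sessions in flight (`max_session_permit=10`) and holds back new ones whenever system memory usage climbs past 70% (`memory_threshold_percent=70.0`), which is what makes it "memory-adaptive".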
Key Points:
- Each URL is processed by the same or separate sessions, depending on the dispatcher’s strategy.
- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.
- If you need to handle authentication or session IDs, pass them in each individual task or within your run config (see the sketch below).
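For the session case, one approach is to set a session ID on the run config so related URLs can share browser state. A sketch, assuming `session_id` is available on `CrawlerRunConfig` in your version:

```python
# Reuse one browser session (cookies, local storage) across the batch.
# Whether URLs actually share the session also depends on the
# dispatcher's strategy.
session_config = CrawlerRunConfig(session_id="my_logged_in_session")

results = await crawler.arun_many(
    urls=["https://site1.com/account", "https://site1.com/orders"],
    config=session_config,
)
```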
Return Value
A list of `CrawlResult` objects, one per URL. You can iterate to check `result.success` or read each item's `extracted_content`, `markdown`, or `dispatch_result`.
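For example, concurrency stats can be inspected per result. A sketch; the attribute names on the dispatch result (`memory_usage`, `start_time`, `end_time`) are illustrative, so check your version's fields:

```python
for res in results:
    if res.success and res.dispatch_result:
        dr = res.dispatch_result
        # Attribute names are illustrative; inspect dr for exact fields.
        print(res.url, dr.memory_usage, dr.start_time, dr.end_time)
```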
Dispatcher Reference
- `MemoryAdaptiveDispatcher`: Dynamically manages concurrency based on system memory usage.
- `SemaphoreDispatcher`: Fixed concurrency limit, simpler but less adaptive.
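A fixed-limit dispatcher can be swapped in the same way. A rough sketch; the import path and the `max_session_permit` parameter (assumed here to mirror `MemoryAdaptiveDispatcher`) may differ in your installed version:

```python
from crawl4ai import SemaphoreDispatcher  # import path may vary by version

# Fixed concurrency: at most 5 crawls in flight, regardless of memory.
dispatcher = SemaphoreDispatcher(max_session_permit=5)  # parameter name assumed

results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=my_run_config,
    dispatcher=dispatcher,
)
```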
For advanced usage or custom settings, see Multi-URL Crawling with Dispatchers.
Common Pitfalls
- Large Lists: If you pass thousands of URLs, be mindful of memory or rate-limits. A dispatcher can help.
- Session Reuse: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly.
- Error Handling: Each `CrawlResult` might fail for different reasons; always check `result.success` or the `error_message` before proceeding (see the retry sketch below).
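A simple pattern for the error-handling point is to partition results and retry only the failures:

```python
results = await crawler.arun_many(urls=urls, config=my_run_config)

# Collect URLs that failed, then retry just those.
failed = [res.url for res in results if not res.success]
if failed:
    retry_results = await crawler.arun_many(urls=failed, config=my_run_config)
```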
Conclusion
Use `arun_many()` when you want to crawl multiple URLs simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate-limiting), provide a dispatcher. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the Advanced Multi-URL Crawling docs.