# `arun_many(...)` Reference
> **Note**: This function is very similar to [`arun()`](./arun.md) but focused on **concurrent** or **batch** crawling. If you’re unfamiliar with `arun()` usage, please read that doc first, then review this for differences.

## Function Signature

```python
async def arun_many(
    urls: Union[List[str], List[Any]],
    config: Optional[CrawlerRunConfig] = None,
    dispatcher: Optional[BaseDispatcher] = None,
    ...
) -> Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
    """
    Crawl multiple URLs concurrently or in batches.

    :param urls: A list of URLs (or tasks) to crawl.
    :param config: (Optional) A default `CrawlerRunConfig` applied to each crawl.
    :param dispatcher: (Optional) A concurrency controller (e.g., MemoryAdaptiveDispatcher).
    ...
    :return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
    """
```

## Differences from `arun()`

1. **Multiple URLs**:
   - Instead of crawling a single URL, you pass a list of them (strings or tasks).
   - The function returns either a **list** of `CrawlResult` or an **async generator** if streaming is enabled.

2. **Concurrency & Dispatchers**:
   - The **`dispatcher`** param allows advanced concurrency control.
   - If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
   - Dispatchers handle concurrency, rate limiting, and memory-based adaptive throttling (see [Multi-URL Crawling](../advanced/multi-url-crawling.md)).

3. **Streaming Support**:
   - Enable streaming by setting `stream=True` in your `CrawlerRunConfig`.
   - When streaming, use `async for` to process results as they become available.
   - Ideal for processing large numbers of URLs without waiting for all to complete.

4. **Parallel Execution**:
   - `arun_many()` can run multiple requests concurrently under the hood.
   - Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).

### Basic Example (Batch Mode)

```python
# Minimal usage: the default dispatcher will be used
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig(stream=False)  # Default behavior
)

for res in results:
    if res.success:
        print(res.url, "crawled OK!")
    else:
        print("Failed:", res.url, "-", res.error_message)
```

### Streaming Example

```python
config = CrawlerRunConfig(
    stream=True,  # Enable streaming mode
    cache_mode=CacheMode.BYPASS
)

# Process results as they complete
async for result in await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=config
):
    if result.success:
        print(f"Just completed: {result.url}")
        # Process each result immediately
        process_result(result)
```
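Note the `async for result in await ...` shape above: awaiting `arun_many()` in streaming mode yields an async generator, which you then iterate. A minimal, self-contained sketch of that same shape (no real crawling; `fake_arun_many` is a hypothetical stand-in, not a library function):

```python
import asyncio

# Hypothetical stand-in for arun_many(stream=True): an async function that,
# once awaited, returns an async generator yielding results as they finish.
async def fake_arun_many(urls):
    async def results():
        for url in urls:
            await asyncio.sleep(0)  # placeholder for the actual crawl work
            yield f"crawled:{url}"
    return results()

async def main():
    seen = []
    # Same shape as the streaming example above: await first, then async for.
    async for item in await fake_arun_many(["https://site1.com", "https://site2.com"]):
        seen.append(item)
    return seen

print(asyncio.run(main()))
```

The key point: the outer `await` produces the generator; the `async for` then consumes results one at a time as they become available.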

### With a Custom Dispatcher

```python
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    max_session_permit=10
)

results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=my_run_config,
    dispatcher=dispatcher
)
```

**Key Points**:
- Each URL is processed by the same or separate sessions, depending on the dispatcher’s strategy.
- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.
- If you need to handle authentication or session IDs, pass them in each individual task or within your run config.

### Return Value

Either a **list** of [`CrawlResult`](./crawl-result.md) objects, or an **async generator** if streaming is enabled. You can iterate to check `result.success` or read each item’s `extracted_content`, `markdown`, or `dispatch_result`.

---

## Dispatcher Reference

- **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage.
- **`SemaphoreDispatcher`**: Fixed concurrency limit; simpler but less adaptive.

For advanced usage or custom settings, see [Multi-URL Crawling with Dispatchers](../advanced/multi-url-crawling.md).
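To make the fixed-limit idea concrete, here is a minimal, self-contained sketch of what a `SemaphoreDispatcher`-style controller does conceptually: a semaphore caps how many crawls run at once. This is an illustration only, not the library's actual implementation (`crawl_all` and its parameters are hypothetical):

```python
import asyncio

async def crawl_all(urls, limit=3):
    # A fixed-size semaphore caps how many "crawls" run concurrently,
    # analogous to a fixed-limit dispatcher.
    sem = asyncio.Semaphore(limit)

    async def crawl_one(url):
        async with sem:
            await asyncio.sleep(0)  # placeholder for real crawl work
            return f"done:{url}"

    # gather preserves input order even though tasks overlap in time
    return await asyncio.gather(*(crawl_one(u) for u in urls))

urls = [f"https://site{i}.com" for i in range(5)]
print(asyncio.run(crawl_all(urls)))
```

A memory-adaptive dispatcher replaces the fixed `limit` with a value that shrinks or grows based on observed system memory pressure.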

---

## Common Pitfalls

1. **Large Lists**: If you pass thousands of URLs, be mindful of memory or rate limits. A dispatcher can help.
2. **Session Reuse**: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly.
3. **Error Handling**: Each `CrawlResult` might fail for different reasons, so always check `result.success` or `result.error_message` before proceeding.
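For pitfall 1, one simple way to keep memory bounded is to split a huge URL list into fixed-size batches and crawl each batch in turn. `chunked` below is a hypothetical helper, not part of the library:

```python
def chunked(urls, size):
    """Split a long URL list into fixed-size batches (the last may be shorter)."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

batches = chunked([f"https://site{i}.com" for i in range(7)], 3)
print([len(b) for b in batches])  # batch sizes: 3, 3, 1
```

Each batch can then be passed to `await crawler.arun_many(urls=batch, config=...)`, optionally with a dispatcher for finer control within the batch.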

---

## Conclusion

Use `arun_many()` when you want to **crawl multiple URLs** simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate limiting), provide a **dispatcher**. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) docs.