docs(api): add streaming mode documentation and examples

Add comprehensive documentation for the new streaming mode feature in arun_many():
- Update arun_many() API docs to reflect streaming return type
- Add streaming examples in quickstart and multi-url guides
- Document stream parameter in configuration classes
- Add clone() helper method documentation for configs

This change improves documentation for processing large numbers of URLs efficiently.
This commit is contained in:
UncleCode
2025-01-19 18:21:34 +08:00
parent 91463e34f1
commit 8b6fe6a98f
5 changed files with 184 additions and 31 deletions

View File

@@ -10,7 +10,7 @@ async def arun_many(
config: Optional[CrawlerRunConfig] = None,
dispatcher: Optional[BaseDispatcher] = None,
...
) -> List[CrawlResult]:
) -> Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
"""
Crawl multiple URLs concurrently or in batches.
@@ -18,7 +18,7 @@ async def arun_many(
:param config: (Optional) A default `CrawlerRunConfig` applying to each crawl.
:param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
...
:return: A list of `CrawlResult` objects, one per URL.
:return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
"""
```
@@ -26,24 +26,29 @@ async def arun_many(
1. **Multiple URLs**:
- Instead of crawling a single URL, you pass a list of them (strings or tasks).
- The function returns a **list** of `CrawlResult`, in the same order as `urls`.
- The function returns either a **list** of `CrawlResult` or an **async generator** if streaming is enabled.
2. **Concurrency & Dispatchers**:
- **`dispatcher`** param allows advanced concurrency control.
- If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
- Dispatchers handle concurrency, rate limiting, and memory-based adaptive throttling (see [Multi-URL Crawling](../advanced/multi-url-crawling.md)).
3. **Parallel** Execution**:
3. **Streaming Support**:
- Enable streaming by setting `stream=True` in your `CrawlerRunConfig`.
- When streaming, use `async for` to process results as they become available.
- Ideal for processing large numbers of URLs without waiting for all to complete.
4. **Parallel** Execution**:
- `arun_many()` can run multiple requests concurrently under the hood.
- Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).
### Basic Example
### Basic Example (Batch Mode)
```python
# Minimal usage: The default dispatcher will be used
results = await crawler.arun_many(
urls=["https://site1.com", "https://site2.com"],
config=my_run_config
config=CrawlerRunConfig(stream=False) # Default behavior
)
for res in results:
@@ -53,6 +58,25 @@ for res in results:
print("Failed:", res.url, "-", res.error_message)
```
### Streaming Example
```python
config = CrawlerRunConfig(
stream=True, # Enable streaming mode
cache_mode=CacheMode.BYPASS
)
# Process results as they complete
async for result in await crawler.arun_many(
urls=["https://site1.com", "https://site2.com", "https://site3.com"],
config=config
):
if result.success:
print(f"Just completed: {result.url}")
# Process each result immediately
process_result(result)
```
### With a Custom Dispatcher
```python
@@ -74,7 +98,7 @@ results = await crawler.arun_many(
### Return Value
A **list** of [`CrawlResult`](./crawl-result.md) objects, one per URL. You can iterate to check `result.success` or read each items `extracted_content`, `markdown`, or `dispatch_result`.
Either a **list** of [`CrawlResult`](./crawl-result.md) objects, or an **async generator** if streaming is enabled. You can iterate to check `result.success` or read each items `extracted_content`, `markdown`, or `dispatch_result`.
---

View File

@@ -56,6 +56,7 @@ run_cfg = CrawlerRunConfig(
word_count_threshold=15,
excluded_tags=["nav", "footer"],
exclude_external_links=True,
stream=True, # Enable streaming for arun_many()
)
```
@@ -191,7 +192,28 @@ The `RateLimitConfig` class has these fields:
---
## 2.2 Example Usage
## 2.2 Helper Methods
Both `BrowserConfig` and `CrawlerRunConfig` provide a `clone()` method to create modified copies:
```python
# Create a base configuration
base_config = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED,
word_count_threshold=200
)
# Create variations using clone()
stream_config = base_config.clone(stream=True)
no_cache_config = base_config.clone(
cache_mode=CacheMode.BYPASS,
stream=True
)
```
The `clone()` method is particularly useful when you need slightly different configurations for different use cases, without modifying the original config.
## 2.3 Example Usage
```python
import asyncio
@@ -226,7 +248,8 @@ async def main():
memory_threshold_percent=70.0,
check_interval=1.0,
max_session_permit=20,
display_mode="DETAILED"
display_mode="DETAILED",
stream=True
)
async with AsyncWebCrawler(config=browser_cfg) as crawler:
@@ -259,3 +282,10 @@ if __name__ == "__main__":
- **Use** `BrowserConfig` for **global** browser settings: engine, headless, proxy, user agent.
- **Use** `CrawlerRunConfig` for each crawls **context**: how to filter content, handle caching, wait for dynamic elements, or run JS.
- **Pass** both configs to `AsyncWebCrawler` (the `BrowserConfig`) and then to `arun()` (the `CrawlerRunConfig`).
```python
# Create a modified copy with the clone() method
stream_cfg = run_cfg.clone(
stream=True,
cache_mode=CacheMode.BYPASS
)