docs(api): add streaming mode documentation and examples

Add comprehensive documentation for the new streaming mode feature in arun_many():
- Update arun_many() API docs to reflect streaming return type
- Add streaming examples in quickstart and multi-url guides
- Document stream parameter in configuration classes
- Add clone() helper method documentation for configs

This change improves documentation for processing large numbers of URLs efficiently.
UncleCode
2025-01-19 18:21:34 +08:00
parent 91463e34f1
commit 8b6fe6a98f
5 changed files with 184 additions and 31 deletions


@@ -85,6 +85,25 @@ class BrowserConfig:
- Additional flags for the underlying browser.
- E.g. `["--disable-extensions"]`.
### Helper Methods
Both configuration classes provide a `clone()` method to create modified copies:
```python
# Create a base browser config
base_browser = BrowserConfig(
    browser_type="chromium",
    headless=True,
    text_mode=True
)

# Create a visible browser config for debugging
debug_browser = base_browser.clone(
    headless=False,
    verbose=True
)
```
**Minimal Example**:
```python
@@ -123,6 +142,7 @@ class CrawlerRunConfig:
        max_session_permit=20,
        display_mode=None,
        verbose=True,
        stream=False,  # Enable streaming for arun_many()
        # ... other advanced parameters omitted
    ):
        ...
@@ -186,6 +206,36 @@ class CrawlerRunConfig:
- The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
- Affects how much information is printed during the crawl.
### Helper Methods
The `clone()` method is particularly useful for creating variations of your crawler configuration:
```python
# Create a base configuration
base_config = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,
    word_count_threshold=200,
    wait_until="networkidle"
)

# Create variations for different use cases
stream_config = base_config.clone(
    stream=True,  # Enable streaming mode
    cache_mode=CacheMode.BYPASS
)

debug_config = base_config.clone(
    page_timeout=120000,  # Longer timeout for debugging
    verbose=True
)
```
The `clone()` method:
- Creates a new instance with all the same settings
- Updates only the specified parameters
- Leaves the original configuration unchanged
- Lets you create variations without repeating every parameter
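The behavior listed above resembles `dataclasses.replace` from the Python standard library. The following is a minimal sketch of that copy-and-override pattern, assuming a hypothetical `DemoConfig` dataclass as a stand-in. It is not the library's `CrawlerRunConfig`, and the real `clone()` may be implemented differently.

```python
from dataclasses import dataclass, replace


@dataclass
class DemoConfig:
    """Hypothetical stand-in for a config class; not part of the library."""
    cache_mode: str = "enabled"
    word_count_threshold: int = 200
    stream: bool = False

    def clone(self, **overrides):
        # Build a new instance: copy every field, then apply the overrides.
        return replace(self, **overrides)


base = DemoConfig()
streaming = base.clone(stream=True, cache_mode="bypass")

print(streaming)  # new instance with two fields changed
print(base)       # original instance, unchanged
```

Because `clone()` returns a new object here, the base configuration can be shared safely across several variations.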
### Rate Limiting & Resource Management
For batch processing with `arun_many()`, you can enable intelligent rate limiting:
@@ -229,7 +279,8 @@ crawl_conf = CrawlerRunConfig(
        max_delay=60.0,
        max_retries=3,
        rate_limit_codes=[429, 503]
    ),
    stream=True  # Enable streaming
)
async with AsyncWebCrawler() as crawler:


@@ -265,9 +265,21 @@ async def quick_parallel_example():
"https://example.com/page3"
    ]

    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True  # Enable streaming mode
    )

    async with AsyncWebCrawler() as crawler:
        # Stream results as they complete
        async for result in await crawler.arun_many(urls, config=run_conf):
            if result.success:
                print(f"[OK] {result.url}, length: {len(result.markdown_v2.raw_markdown)}")
            else:
                print(f"[ERROR] {result.url} => {result.error_message}")

        # Or get all results at once (default behavior)
        run_conf = run_conf.clone(stream=False)
        results = await crawler.arun_many(urls, config=run_conf)
        for res in results:
            if res.success:
@@ -279,8 +291,13 @@ if __name__ == "__main__":
    asyncio.run(quick_parallel_example())
```
The example above shows two ways to handle multiple URLs:
1. **Streaming mode** (`stream=True`): Process each result as soon as it becomes available, using `async for`.
2. **Batch mode** (`stream=False`, the default): Wait for all crawls to finish, then iterate over the complete set of results.
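The difference between the two modes can be illustrated with plain `asyncio`, without the crawler. This is only a sketch under stated assumptions: `fake_fetch`, `stream_results`, and `batch_results` are hypothetical names, the delays simulate page load times, and nothing here reflects how `arun_many()` is implemented internally.

```python
import asyncio


async def fake_fetch(url: str, delay: float) -> str:
    # Hypothetical stand-in for a crawl: sleeps instead of doing network I/O.
    await asyncio.sleep(delay)
    return url


async def stream_results(jobs):
    # Streaming: yield each result as soon as its task finishes.
    tasks = [asyncio.create_task(fake_fetch(u, d)) for u, d in jobs]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def batch_results(jobs):
    # Batch: wait for every task, then return the full list in input order.
    return await asyncio.gather(*(fake_fetch(u, d) for u, d in jobs))


async def main():
    jobs = [("slow", 0.2), ("fast", 0.01), ("medium", 0.1)]
    streamed = [r async for r in stream_results(jobs)]
    batched = await batch_results(jobs)
    return streamed, batched


streamed, batched = asyncio.run(main())
print(streamed)  # results in completion order
print(batched)   # results in input order
```

In the streaming variant the fastest job can be handled while the slower ones are still running, which is what makes this style attractive for long URL lists. The batch variant is simpler when all results are needed before any processing starts.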
For more advanced concurrency (e.g., a **semaphore-based** approach, **adaptive memory usage throttling**, or customized rate limiting), see [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md).
---
## 8. Dynamic Content Example