docs(api): add streaming mode documentation and examples

Add comprehensive documentation for the new streaming mode feature in arun_many(): - Update arun_many() API docs to reflect streaming return type - Add streaming examples in quickstart and multi-url guides - Document stream parameter in configuration classes - Add clone() helper method documentation for configs This change improves documentation for processing large numbers of URLs efficiently.
2025-01-19 18:21:34 +08:00
parent 91463e34f1
commit 8b6fe6a98f
5 changed files with 184 additions and 31 deletions
--- a/docs/md_v2/api/arun_many.md
+++ b/docs/md_v2/api/arun_many.md
@@ -10,7 +10,7 @@ async def arun_many(
    config: Optional[CrawlerRunConfig] = None,
    dispatcher: Optional[BaseDispatcher] = None,
    ...
-) -> List[CrawlResult]:
+) -> Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
    """
    Crawl multiple URLs concurrently or in batches.

@@ -18,7 +18,7 @@ async def arun_many(
    :param config: (Optional) A default `CrawlerRunConfig` applying to each crawl.
    :param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
    ...
-    :return: A list of `CrawlResult` objects, one per URL.
+    :return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
    """
 ```

@@ -26,24 +26,29 @@ async def arun_many(

 1. **Multiple URLs**:  
   - Instead of crawling a single URL, you pass a list of them (strings or tasks).  
-   - The function returns a **list** of `CrawlResult`, in the same order as `urls`.
+   - The function returns either a **list** of `CrawlResult` or an **async generator** if streaming is enabled.

 2. **Concurrency & Dispatchers**:  
   - **`dispatcher`** param allows advanced concurrency control.  
   - If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.  
   - Dispatchers handle concurrency, rate limiting, and memory-based adaptive throttling (see [Multi-URL Crawling](../advanced/multi-url-crawling.md)).

-3. **Parallel** Execution**:  
+3. **Streaming Support**:  
+   - Enable streaming by setting `stream=True` in your `CrawlerRunConfig`.
+   - When streaming, use `async for` to process results as they become available.
+   - Ideal for processing large numbers of URLs without waiting for all to complete.
+
+4. **Parallel** Execution**:  
   - `arun_many()` can run multiple requests concurrently under the hood.  
   - Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).

-### Basic Example
+### Basic Example (Batch Mode)

 ```python
 # Minimal usage: The default dispatcher will be used
 results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
-    config=my_run_config
+    config=CrawlerRunConfig(stream=False)  # Default behavior
 )

 for res in results:
@@ -53,6 +58,25 @@ for res in results:
        print("Failed:", res.url, "-", res.error_message)
 ```

+### Streaming Example
+
+```python
+config = CrawlerRunConfig(
+    stream=True,  # Enable streaming mode
+    cache_mode=CacheMode.BYPASS
+)
+
+# Process results as they complete
+async for result in await crawler.arun_many(
+    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
+    config=config
+):
+    if result.success:
+        print(f"Just completed: {result.url}")
+        # Process each result immediately
+        process_result(result)
+```
+
 ### With a Custom Dispatcher

 ```python
@@ -74,7 +98,7 @@ results = await crawler.arun_many(

 ### Return Value

-A **list** of [`CrawlResult`](./crawl-result.md) objects, one per URL. You can iterate to check `result.success` or read each item’s `extracted_content`, `markdown`, or `dispatch_result`.
+Either a **list** of [`CrawlResult`](./crawl-result.md) objects, or an **async generator** if streaming is enabled. You can iterate to check `result.success` or read each item’s `extracted_content`, `markdown`, or `dispatch_result`.

 ---