docs(api): add streaming mode documentation and examples

Add comprehensive documentation for the new streaming mode feature in arun_many(): - Update arun_many() API docs to reflect streaming return type - Add streaming examples in quickstart and multi-url guides - Document stream parameter in configuration classes - Add clone() helper method documentation for configs This change improves documentation for processing large numbers of URLs efficiently.
2025-01-19 18:21:34 +08:00
parent 91463e34f1
commit 8b6fe6a98f
5 changed files with 184 additions and 31 deletions
--- a/docs/md_v2/core/browser-crawler-config.md
+++ b/docs/md_v2/core/browser-crawler-config.md
@@ -85,6 +85,25 @@ class BrowserConfig:
    - Additional flags for the underlying browser.  
    - E.g. `["--disable-extensions"]`.

+### Helper Methods
+
+Both configuration classes provide a `clone()` method to create modified copies:
+
+```python
+# Create a base browser config
+base_browser = BrowserConfig(
+    browser_type="chromium",
+    headless=True,
+    text_mode=True
+)
+
+# Create a visible browser config for debugging
+debug_browser = base_browser.clone(
+    headless=False,
+    verbose=True
+)
+```
+
 **Minimal Example**:

 ```python
@@ -123,6 +142,7 @@ class CrawlerRunConfig:
        max_session_permit=20,
        display_mode=None,
        verbose=True,
+        stream=False,  # Enable streaming for arun_many()
        # ... other advanced parameters omitted
    ):
        ...
@@ -186,6 +206,36 @@ class CrawlerRunConfig:
    - The display mode for progress information (`DETAILED`, `BRIEF`, etc.).  
    - Affects how much information is printed during the crawl.

+### Helper Methods
+
+The `clone()` method is particularly useful for creating variations of your crawler configuration:
+
+```python
+# Create a base configuration
+base_config = CrawlerRunConfig(
+    cache_mode=CacheMode.ENABLED,
+    word_count_threshold=200,
+    wait_until="networkidle"
+)
+
+# Create variations for different use cases
+stream_config = base_config.clone(
+    stream=True,  # Enable streaming mode
+    cache_mode=CacheMode.BYPASS
+)
+
+debug_config = base_config.clone(
+    page_timeout=120000,  # Longer timeout for debugging
+    verbose=True
+)
+```
+
+The `clone()` method:
+- Creates a new instance with all the same settings
+- Updates only the specified parameters
+- Leaves the original configuration unchanged
+- Perfect for creating variations without repeating all parameters
+
 ### Rate Limiting & Resource Management

 For batch processing with `arun_many()`, you can enable intelligent rate limiting:
@@ -229,7 +279,8 @@ crawl_conf = CrawlerRunConfig(
        max_delay=60.0,
        max_retries=3,
        rate_limit_codes=[429, 503]
-    )
+    ),
+    stream=True  # Enable streaming
 )

 async with AsyncWebCrawler() as crawler:
--- a/docs/md_v2/core/quickstart.md
+++ b/docs/md_v2/core/quickstart.md
@@ -265,9 +265,21 @@ async def quick_parallel_example():
        "https://example.com/page3"
    ]
    
-    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+    run_conf = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS,
+        stream=True  # Enable streaming mode
+    )

    async with AsyncWebCrawler() as crawler:
+        # Stream results as they complete
+        async for result in await crawler.arun_many(urls, config=run_conf):
+            if result.success:
+                print(f"[OK] {result.url}, length: {len(result.markdown_v2.raw_markdown)}")
+            else:
+                print(f"[ERROR] {result.url} => {result.error_message}")
+
+        # Or get all results at once (default behavior)
+        run_conf = run_conf.clone(stream=False)
        results = await crawler.arun_many(urls, config=run_conf)
        for res in results:
            if res.success:
@@ -279,8 +291,13 @@ if __name__ == "__main__":
    asyncio.run(quick_parallel_example())
 ```

+The example above shows two ways to handle multiple URLs:
+1. **Streaming mode** (`stream=True`): Process results as they become available using `async for`
+2. **Batch mode** (`stream=False`): Wait for all results to complete
+
 For more advanced concurrency (e.g., a **semaphore-based** approach, **adaptive memory usage throttling**, or customized rate limiting), see [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md).

+---

 ## 8. Dynamic Content Example