Add examples for deep crawl crash recovery and prefetch mode in documentation
@@ -4,11 +4,13 @@ One of Crawl4AI's most powerful features is its ability to perform **configurabl
In this tutorial, you'll learn:

1. How to set up a **Basic Deep Crawler** with BFS strategy
2. Understanding the difference between **streamed and non-streamed** output
3. Implementing **filters and scorers** to target specific content
4. Creating **advanced filtering chains** for sophisticated crawls
5. Using **BestFirstCrawling** for intelligent exploration prioritization
6. **Crash recovery** for long-running production crawls
7. **Prefetch mode** for fast URL discovery

> **Prerequisites**
> - You’ve completed or read [AsyncWebCrawler Basics](../core/simple-crawling.md) to understand how to run a simple crawl.

@@ -485,7 +487,249 @@ This is especially useful for security-conscious crawling or when dealing with s
---
## 10. Crash Recovery for Long-Running Crawls

For production deployments, especially in cloud environments where instances can be terminated unexpectedly, Crawl4AI provides built-in crash recovery support for all deep crawl strategies.

### 10.1 Enabling State Persistence

All deep crawl strategies (BFS, DFS, Best-First) support two optional parameters:

- **`resume_state`**: Pass a previously saved state to resume from a checkpoint
- **`on_state_change`**: Async callback fired after each URL is processed

```python
import json

import redis.asyncio as redis
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# An async Redis client to hold the checkpoint
redis_client = redis.Redis(host="localhost", port=6379, db=0)

# Callback to save state after each URL
async def save_state_to_redis(state: dict):
    await redis_client.set("crawl_state", json.dumps(state))

strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    on_state_change=save_state_to_redis,  # Called after each URL
)
```

### 10.2 State Structure

The state dictionary is JSON-serializable and contains:

```python
{
    "strategy_type": "bfs",                           # or "dfs", "best_first"
    "visited": ["url1", "url2", ...],                 # Already crawled URLs
    "pending": [{"url": "...", "parent_url": "..."}], # Queue/stack
    "depths": {"url1": 0, "url2": 1},                 # Depth tracking
    "pages_crawled": 42                               # Counter
}
```
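
Because the state is plain JSON, you can inspect a checkpoint before deciding to resume. A minimal sketch, assuming only the fields documented above (the `describe_state` helper is hypothetical, not part of the library):

```python
def describe_state(state: dict) -> str:
    """Summarize a saved crawl checkpoint (fields as documented in 10.2)."""
    return (
        f"{state['pages_crawled']} pages crawled, "
        f"{len(state['visited'])} visited, "
        f"{len(state['pending'])} pending"
    )

# Example: print(describe_state(json.loads(raw_checkpoint)))
```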

### 10.3 Resuming from a Checkpoint

```python
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Load saved state (e.g., from Redis, a database, or a file)
saved_state = json.loads(await redis_client.get("crawl_state"))

# Resume crawling from where we left off
strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    resume_state=saved_state,             # Continue from checkpoint
    on_state_change=save_state_to_redis,  # Keep saving progress (from 10.1)
)

config = CrawlerRunConfig(deep_crawl_strategy=strategy)

async with AsyncWebCrawler() as crawler:
    # Will skip already-visited URLs and continue from the pending queue
    # (start_url defined elsewhere, e.g. "https://example.com")
    results = await crawler.arun(start_url, config=config)
```

### 10.4 Manual State Export

You can export the last captured state using `export_state()`. Note that this requires `on_state_change` to be set (state is captured in the callback):

```python
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

captured_state = None

async def capture_state(state: dict):
    global captured_state
    captured_state = state

strategy = BFSDeepCrawlStrategy(
    max_depth=2,
    on_state_change=capture_state,  # Required for state capture
)
config = CrawlerRunConfig(deep_crawl_strategy=strategy)

async with AsyncWebCrawler() as crawler:
    # start_url defined elsewhere, e.g. "https://example.com"
    results = await crawler.arun(start_url, config=config)

# Get the last captured state
state = strategy.export_state()
if state:
    # Save to your preferred storage
    with open("crawl_checkpoint.json", "w") as f:
        json.dump(state, f)
```

### 10.5 Complete Example: Redis-Based Recovery

```python
import asyncio
import json

import redis.asyncio as redis
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

REDIS_KEY = "crawl4ai:crawl_state"

async def main():
    redis_client = redis.Redis(host="localhost", port=6379, db=0)

    # Check for existing state
    saved_state = None
    existing = await redis_client.get(REDIS_KEY)
    if existing:
        saved_state = json.loads(existing)
        print(f"Resuming from checkpoint: {saved_state['pages_crawled']} pages already crawled")

    # State persistence callback
    async def persist_state(state: dict):
        await redis_client.set(REDIS_KEY, json.dumps(state))

    # Create strategy with recovery support
    strategy = BFSDeepCrawlStrategy(
        max_depth=3,
        max_pages=100,
        resume_state=saved_state,
        on_state_change=persist_state,
    )

    config = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=True)

    try:
        async with AsyncWebCrawler() as crawler:
            async for result in await crawler.arun("https://example.com", config=config):
                print(f"Crawled: {result.url}")
    except Exception as e:
        print(f"Crawl interrupted: {e}")
        print("State saved - restart to resume")
    finally:
        await redis_client.close()

if __name__ == "__main__":
    asyncio.run(main())
```

### 10.6 Zero Overhead

When `resume_state=None` and `on_state_change=None` (the defaults), there is no performance impact. State tracking only activates when you enable these features.
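
For contrast, a sketch of the default construction, where no checkpointing work happens at all:

```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Defaults: resume_state=None, on_state_change=None - no state tracking runs
strategy = BFSDeepCrawlStrategy(max_depth=3)
```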

---

## 11. Prefetch Mode for Fast URL Discovery

When you need to quickly discover URLs without full page processing, use **prefetch mode**. This is ideal for two-phase crawling where you first map the site, then selectively process specific pages.

### 11.1 Enabling Prefetch Mode

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(prefetch=True)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com", config=config)

    # Result contains only HTML and links - no markdown, no extraction
    print(f"Found {len(result.links['internal'])} internal links")
    print(f"Found {len(result.links['external'])} external links")
```

### 11.2 What Gets Skipped

Prefetch mode uses a fast path that bypasses heavy processing:

| Processing Step | Normal Mode | Prefetch Mode |
|-----------------|-------------|---------------|
| Fetch HTML | ✅ | ✅ |
| Extract links | ✅ | ✅ (fast `quick_extract_links()`) |
| Generate markdown | ✅ | ❌ Skipped |
| Content scraping | ✅ | ❌ Skipped |
| Media extraction | ✅ | ❌ Skipped |
| LLM extraction | ✅ | ❌ Skipped |

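A quick way to see this on a live result; the sketch assumes the skipped outputs (`markdown` and similar) are simply left empty on the result object, as the table implies:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def inspect_prefetch(url: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=CrawlerRunConfig(prefetch=True))
        print(f"HTML fetched: {bool(result.html)}")                     # expected True
        print(f"Links found: {len(result.links.get('internal', []))}")  # fast-path extraction
        print(f"Markdown generated: {bool(result.markdown)}")           # expected False (skipped)

asyncio.run(inspect_prefetch("https://example.com"))
```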
### 11.3 Performance Benefit

- **Normal mode**: Full pipeline (~2-5 seconds per page)
- **Prefetch mode**: HTML + links only (~200-500ms per page)

This makes prefetch mode **5-10x faster** for URL discovery.
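
These figures vary by site and network, so it is worth measuring against your own target. A minimal timing sketch (the URL is a stand-in, and `CacheMode.BYPASS` is used here only so both runs do real work):

```python
import asyncio
import time

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def compare_modes(url: str):
    async with AsyncWebCrawler() as crawler:
        for label, config in [
            ("normal", CrawlerRunConfig(cache_mode=CacheMode.BYPASS)),
            ("prefetch", CrawlerRunConfig(prefetch=True, cache_mode=CacheMode.BYPASS)),
        ]:
            start = time.perf_counter()
            await crawler.arun(url, config=config)
            print(f"{label}: {time.perf_counter() - start:.2f}s")

asyncio.run(compare_modes("https://example.com"))
```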

### 11.4 Two-Phase Crawling Pattern

The most common use case is two-phase crawling:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def two_phase_crawl(start_url: str):
    async with AsyncWebCrawler() as crawler:
        # ═══════════════════════════════════════════════
        # Phase 1: Fast discovery (prefetch mode)
        # ═══════════════════════════════════════════════
        prefetch_config = CrawlerRunConfig(prefetch=True)
        discovery = await crawler.arun(start_url, config=prefetch_config)

        all_urls = [link["href"] for link in discovery.links.get("internal", [])]
        print(f"Discovered {len(all_urls)} URLs")

        # Filter to URLs you care about
        blog_urls = [url for url in all_urls if "/blog/" in url]
        print(f"Found {len(blog_urls)} blog posts to process")

        # ═══════════════════════════════════════════════
        # Phase 2: Full processing on selected URLs only
        # ═══════════════════════════════════════════════
        full_config = CrawlerRunConfig(
            # Your normal extraction settings
            word_count_threshold=100,
            remove_overlay_elements=True,
        )

        results = []
        for url in blog_urls:
            result = await crawler.arun(url, config=full_config)
            if result.success:
                results.append(result)
                print(f"Processed: {url}")

        return results

if __name__ == "__main__":
    results = asyncio.run(two_phase_crawl("https://example.com"))
    print(f"Fully processed {len(results)} pages")
```

### 11.5 Use Cases

- **Site mapping**: Quickly discover all URLs before deciding what to process
- **Link validation**: Check which pages exist without heavy processing (see the sketch after this list)
- **Selective deep crawl**: Prefetch to find URLs, filter by pattern, then full crawl
- **Crawl planning**: Estimate crawl size before committing resources

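As an example of the link-validation case, a sketch that rides the prefetch fast path; it assumes `success` and `status_code` on the result behave as they do in normal mode:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def validate_links(urls: list[str]) -> dict[str, bool]:
    """Check which pages exist, without markdown or extraction overhead."""
    config = CrawlerRunConfig(prefetch=True)
    alive = {}
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url, config=config)
            alive[url] = result.success and (result.status_code or 0) < 400
    return alive

# Example:
# report = asyncio.run(validate_links(["https://example.com", "https://example.com/missing"]))
```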
---

## 12. Summary & Next Steps

In this **Deep Crawling with Crawl4AI** tutorial, you learned to:

@@ -495,5 +739,7 @@ In this **Deep Crawling with Crawl4AI** tutorial, you learned to:

- Use scorers to prioritize the most relevant pages
- Limit crawls with `max_pages` and `score_threshold` parameters
- Build a complete advanced crawler with combined techniques
- **Implement crash recovery** with `resume_state` and `on_state_change` for production deployments
- **Use prefetch mode** for fast URL discovery and two-phase crawling

With these tools, you can efficiently extract structured data from websites at scale, focusing precisely on the content you need for your specific use case.