Add documentation and example for deep crawl cancellation

- Add Section 11 "Cancellation Support for Deep Crawls" to deep-crawling.md
- Document should_cancel callback, cancel() method, and cancelled property
- Include complete example for cloud platform job cancellation
- Add docs/examples/deep_crawl_cancellation.py with 6 comprehensive examples
- Update summary section to mention cancellation feature
unclecode
2026-01-22 06:10:54 +00:00
parent f6897d1429
commit 1e2b7fe7e6
2 changed files with 585 additions and 7 deletions

@@ -635,11 +635,172 @@ When `resume_state=None` and `on_state_change=None` (the defaults), there is no
---
## 11. Prefetch Mode for Fast URL Discovery
## 11. Cancellation Support for Deep Crawls
For production environments like cloud platforms, you often need to stop a running crawl mid-execution—whether the user changed their mind, specified the wrong URL, or wants to control costs. Crawl4AI provides built-in cancellation support for all deep crawl strategies.
### 11.1 Two Ways to Cancel
**Option A: Callback-based cancellation** (recommended for external systems)
Use `should_cancel` to check an external source (Redis, database, API) before each URL:
```python
import json

from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def check_if_cancelled():
    # Check Redis, database, or any external source; here the job record
    # is assumed to be stored as a JSON string under job:{job_id}
    job = json.loads(await redis.get(f"job:{job_id}"))
    return job.get("status") == "cancelled"

strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    max_pages=1000,
    should_cancel=check_if_cancelled,  # Called before each URL
)
```
**Option B: Direct cancellation** (for in-process control)
Call `cancel()` directly on the strategy instance:
```python
strategy = BFSDeepCrawlStrategy(max_depth=3, max_pages=1000)
# In another coroutine or thread:
strategy.cancel() # Thread-safe, stops before next URL
```
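As a minimal sketch of this second option, a watchdog coroutine can enforce a time budget by calling `cancel()` from outside the crawl loop (the `crawl_with_time_budget` wrapper and the budget value are illustrative, not part of the API):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def crawl_with_time_budget(start_url: str, budget_seconds: float = 60.0):
    strategy = BFSDeepCrawlStrategy(max_depth=3, max_pages=1000)
    config = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=True)

    async def watchdog():
        # Illustrative time budget: stop the crawl once it is spent
        await asyncio.sleep(budget_seconds)
        strategy.cancel()  # thread-safe; the crawl stops before the next URL

    watchdog_task = asyncio.create_task(watchdog())
    results = []
    try:
        async with AsyncWebCrawler() as crawler:
            async for result in await crawler.arun(start_url, config=config):
                results.append(result)
    finally:
        watchdog_task.cancel()
    return results
```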
### 11.2 Checking Cancellation Status
Use the `cancelled` property to check if a crawl was cancelled:
```python
async with AsyncWebCrawler() as crawler:
    results = await crawler.arun(url, config=config)

    if strategy.cancelled:
        print(f"Crawl was cancelled after {len(results)} pages")
    else:
        print(f"Crawl completed with {len(results)} pages")
```
### 11.3 State Notifications Include Cancelled Flag
When using `on_state_change`, the state dictionary includes a `cancelled` field:
```python
async def handle_state(state: dict):
    if state.get("cancelled"):
        print("Crawl was cancelled!")
        print(f"Crawled {state['pages_crawled']} pages before cancellation")
    # Save state for potential resume
    await redis.set("crawl_state", json.dumps(state))

strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    should_cancel=check_cancelled,
    on_state_change=handle_state,
)
```
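Because this is the same state dictionary used for crash recovery, a cancelled job can later be picked up again via `resume_state`. A minimal sketch, assuming the Redis key and the `check_cancelled`/`handle_state` helpers from the snippets above:

```python
import json

from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def build_resumed_strategy():
    # Load the state dict saved by handle_state() above (assumed Redis key)
    saved = await redis.get("crawl_state")
    state = json.loads(saved) if saved else None

    return BFSDeepCrawlStrategy(
        max_depth=3,
        resume_state=state,             # continue where the cancelled run stopped
        should_cancel=check_cancelled,  # the resumed run stays cancellable
        on_state_change=handle_state,
    )
```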
### 11.4 Key Behaviors
| Scenario | Behavior |
|----------|----------|
| Cancel before first URL | Returns empty results, `cancelled=True` |
| Cancel during crawl | Completes current URL, then stops |
| Callback raises exception | Logged as warning, crawl continues (fail-open) |
| Strategy reuse after cancel | Works normally (cancel flag auto-resets) |
| Sync callback function | Supported (auto-detected and handled) |
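To illustrate the last two rows, a plain synchronous callback can drive cancellation, and the same strategy instance can be reused after a cancelled run; the `threading.Event` wiring below is just one possible sketch:

```python
import threading

from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

stop_flag = threading.Event()

def check_cancelled() -> bool:
    # Synchronous callback: auto-detected and handled by the strategy
    return stop_flag.is_set()

strategy = BFSDeepCrawlStrategy(max_depth=2, should_cancel=check_cancelled)

# From a signal handler, UI thread, etc.:
stop_flag.set()  # the next pre-URL check sees the flag and the crawl stops

# After a cancelled run the cancel flag auto-resets, so the same
# strategy instance can be handed to a fresh crawl without rebuilding it.
```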
### 11.5 Complete Example: Cloud Platform Job Cancellation
```python
import asyncio
import json
import redis.asyncio as redis

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def run_cancellable_crawl(job_id: str, start_url: str):
    redis_client = redis.Redis(host='localhost', port=6379, db=0)

    # Check external cancellation source
    async def check_cancelled():
        status = await redis_client.get(f"job:{job_id}:status")
        return status == b"cancelled"

    # Save progress for monitoring and recovery
    async def save_progress(state: dict):
        await redis_client.set(
            f"job:{job_id}:state",
            json.dumps(state)
        )
        # Update job progress
        await redis_client.set(
            f"job:{job_id}:pages_crawled",
            state["pages_crawled"]
        )

    strategy = BFSDeepCrawlStrategy(
        max_depth=3,
        max_pages=500,
        should_cancel=check_cancelled,
        on_state_change=save_progress,
    )

    config = CrawlerRunConfig(
        deep_crawl_strategy=strategy,
        stream=True,
    )

    results = []
    try:
        async with AsyncWebCrawler() as crawler:
            async for result in await crawler.arun(start_url, config=config):
                results.append(result)
                print(f"Crawled: {result.url}")
    finally:
        # Report final status
        if strategy.cancelled:
            await redis_client.set(f"job:{job_id}:status", "cancelled")
            print(f"Job cancelled after {len(results)} pages")
        else:
            await redis_client.set(f"job:{job_id}:status", "completed")
            print(f"Job completed with {len(results)} pages")
        await redis_client.close()

    return results

# Usage
# asyncio.run(run_cancellable_crawl("job-123", "https://example.com"))
#
# To cancel from another process:
# redis_client.set("job:job-123:status", "cancelled")
```
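The cancelling side only has to flip the status key that `check_cancelled` polls. A minimal sketch of that producer-side script (the key names match the example above; the rest is illustrative):

```python
import redis

def cancel_job(job_id: str) -> None:
    # Flip the status key that check_cancelled() polls before each URL
    client = redis.Redis(host="localhost", port=6379, db=0)
    client.set(f"job:{job_id}:status", "cancelled")

cancel_job("job-123")
```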
### 11.6 Supported Strategies
Cancellation works identically across all deep crawl strategies:
- **BFSDeepCrawlStrategy** - Breadth-first search
- **DFSDeepCrawlStrategy** - Depth-first search
- **BestFirstCrawlingStrategy** - Priority-based crawling
All strategies support:
- `should_cancel` callback parameter
- `cancel()` method
- `cancelled` property
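A rough sketch of that shared interface, assuming only the parameters documented here (the other constructor arguments are illustrative):

```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy, DFSDeepCrawlStrategy

async def check_cancel() -> bool:
    # Replace with your own check (Redis, database, API, ...)
    return False

# Identical cancellation interface regardless of traversal order
bfs = BFSDeepCrawlStrategy(max_depth=3, should_cancel=check_cancel)
dfs = DFSDeepCrawlStrategy(max_depth=3, should_cancel=check_cancel)
# BestFirstCrawlingStrategy accepts the same parameter alongside its
# scoring configuration; all three expose cancel() and the cancelled property.
```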
---
## 12. Prefetch Mode for Fast URL Discovery
When you need to quickly discover URLs without full page processing, use **prefetch mode**. This is ideal for two-phase crawling where you first map the site, then selectively process specific pages.
### 11.1 Enabling Prefetch Mode
### 12.1 Enabling Prefetch Mode
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
@@ -654,7 +815,7 @@ async with AsyncWebCrawler() as crawler:
print(f"Found {len(result.links['external'])} external links")
```
### 11.2 What Gets Skipped
### 12.2 What Gets Skipped
Prefetch mode uses a fast path that bypasses heavy processing:
@@ -667,14 +828,14 @@ Prefetch mode uses a fast path that bypasses heavy processing:
| Media extraction | ✅ | ❌ Skipped |
| LLM extraction | ✅ | ❌ Skipped |
### 11.3 Performance Benefit
### 12.3 Performance Benefit
- **Normal mode**: Full pipeline (~2-5 seconds per page)
- **Prefetch mode**: HTML + links only (~200-500ms per page)
This makes prefetch mode **5-10x faster** for URL discovery.
### 11.4 Two-Phase Crawling Pattern
### 12.4 Two-Phase Crawling Pattern
The most common use case is two-phase crawling:
@@ -720,7 +881,7 @@ if __name__ == "__main__":
print(f"Fully processed {len(results)} pages")
```
### 11.5 Use Cases
### 12.5 Use Cases
- **Site mapping**: Quickly discover all URLs before deciding what to process
- **Link validation**: Check which pages exist without heavy processing
@@ -729,7 +890,7 @@ if __name__ == "__main__":
---
## 12. Summary & Next Steps
## 13. Summary & Next Steps
In this **Deep Crawling with Crawl4AI** tutorial, you learned to:
@@ -740,6 +901,7 @@ In this **Deep Crawling with Crawl4AI** tutorial, you learned to:
- Limit crawls with `max_pages` and `score_threshold` parameters
- Build a complete advanced crawler with combined techniques
- **Implement crash recovery** with `resume_state` and `on_state_change` for production deployments
- **Cancel running crawls** with `should_cancel` callback or `cancel()` method for cloud platform job management
- **Use prefetch mode** for fast URL discovery and two-phase crawling
With these tools, you can efficiently extract structured data from websites at scale, focusing precisely on the content you need for your specific use case.