# Deep Crawling

One of Crawl4AI's most powerful features is its ability to perform **configurable deep crawling** that can explore websites beyond a single page. With fine-tuned control over crawl depth, domain boundaries, and content filtering, Crawl4AI gives you the tools to extract precisely the content you need.

In this tutorial, you'll learn how to:

1. Set up a **Basic Deep Crawler** with the BFS strategy
2. Understand the difference between **streamed and non-streamed** output
3. Implement **filters and scorers** to target specific content
4. Create **advanced filtering chains** for sophisticated crawls
5. Use **BestFirstCrawling** for intelligent exploration prioritization
6. Add **crash recovery** for long-running production crawls
7. Use **prefetch mode** for fast URL discovery

> **Prerequisites**
> - You've completed or read [AsyncWebCrawler Basics](../core/simple-crawling.md) to understand how to run a simple crawl.
> - You know how to configure `CrawlerRunConfig`.

---

## 1. Quick Example

Here's a minimal code snippet that implements a basic deep crawl using the **BFSDeepCrawlStrategy**:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

async def main():
    # Configure a 2-level deep crawl
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            include_external=False
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)

        print(f"Crawled {len(results)} pages in total")

        # Access individual results
        for result in results[:3]:  # Show first 3 results
            print(f"URL: {result.url}")
            print(f"Depth: {result.metadata.get('depth', 0)}")

if __name__ == "__main__":
    asyncio.run(main())
```

**What's happening?**

- `BFSDeepCrawlStrategy(max_depth=2, include_external=False)` instructs Crawl4AI to:
  - Crawl the starting page (depth 0) plus 2 more levels
  - Stay within the same domain (don't follow external links)
- Each result contains metadata like the crawl depth
- Results are returned as a list after all crawling is complete

---

## 2. Understanding Deep Crawling Strategy Options

### 2.1 BFSDeepCrawlStrategy (Breadth-First Search)

The **BFSDeepCrawlStrategy** uses a breadth-first approach, exploring all links at one depth before moving deeper:

```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Basic configuration
strategy = BFSDeepCrawlStrategy(
    max_depth=2,               # Crawl initial page + 2 levels deep
    include_external=False,    # Stay within the same domain
    max_pages=50,              # Maximum number of pages to crawl (optional)
    score_threshold=0.3,       # Minimum score for URLs to be crawled (optional)
)
```

**Key parameters:**

- **`max_depth`**: Number of levels to crawl beyond the starting page
- **`include_external`**: Whether to follow links to other domains
- **`max_pages`**: Maximum number of pages to crawl (default: infinite)
- **`score_threshold`**: Minimum score for URLs to be crawled (default: -inf)
- **`filter_chain`**: FilterChain instance for URL filtering
- **`url_scorer`**: Scorer instance for evaluating URLs

A quick way to check how `max_depth` and `max_pages` play out on a real site is shown in the sketch below.
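As a rough illustration (the start URL is a placeholder), this minimal sketch counts how many pages ended up at each depth, using only the `depth` metadata that every deep-crawl result carries:

```python
import asyncio
from collections import Counter
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def depth_report():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, max_pages=50)
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)

    # Depth 0 is the start page, depth 1 its direct links, and so on
    counts = Counter(r.metadata.get("depth", 0) for r in results)
    for depth in sorted(counts):
        print(f"Depth {depth}: {counts[depth]} pages")

if __name__ == "__main__":
    asyncio.run(depth_report())
```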
### 2.2 DFSDeepCrawlStrategy (Depth-First Search)

The **DFSDeepCrawlStrategy** uses a depth-first approach, exploring as far down a branch as possible before backtracking.

```python
from crawl4ai.deep_crawling import DFSDeepCrawlStrategy

# Basic configuration
strategy = DFSDeepCrawlStrategy(
    max_depth=2,               # Crawl initial page + 2 levels deep
    include_external=False,    # Stay within the same domain
    max_pages=30,              # Maximum number of pages to crawl (optional)
    score_threshold=0.5,       # Minimum score for URLs to be crawled (optional)
)
```

**Key parameters:**

- **`max_depth`**: Number of levels to crawl beyond the starting page
- **`include_external`**: Whether to follow links to other domains
- **`max_pages`**: Maximum number of pages to crawl (default: infinite)
- **`score_threshold`**: Minimum score for URLs to be crawled (default: -inf)
- **`filter_chain`**: FilterChain instance for URL filtering
- **`url_scorer`**: Scorer instance for evaluating URLs

### 2.3 BestFirstCrawlingStrategy (⭐️ Recommended deep crawl strategy)

For more intelligent crawling, use **BestFirstCrawlingStrategy** with scorers to prioritize the most relevant pages:

```python
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Create a scorer
scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7
)

# Configure the strategy
strategy = BestFirstCrawlingStrategy(
    max_depth=2,
    include_external=False,
    url_scorer=scorer,
    max_pages=25,              # Maximum number of pages to crawl (optional)
)
```

This crawling approach:

- Evaluates each discovered URL based on scorer criteria
- Visits higher-scoring pages first
- Helps focus crawl resources on the most relevant content
- Can limit total pages crawled with `max_pages`
- Does not need `score_threshold`, as it naturally prioritizes by score

---

## 3. Streaming vs. Non-Streaming Results

Crawl4AI can return results in two modes:

### 3.1 Non-Streaming Mode (Default)

```python
config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=False  # Default behavior
)

async with AsyncWebCrawler() as crawler:
    # Wait for ALL results to be collected before returning
    results = await crawler.arun("https://example.com", config=config)

    for result in results:
        process_result(result)  # process_result is your own handler
```

**When to use non-streaming mode:**

- You need the complete dataset before processing
- You're performing batch operations on all results together
- Crawl time isn't a critical factor

### 3.2 Streaming Mode

```python
config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=True  # Enable streaming
)

async with AsyncWebCrawler() as crawler:
    # Returns an async iterator
    async for result in await crawler.arun("https://example.com", config=config):
        # Process each result as it becomes available
        process_result(result)  # process_result is your own handler
```

**Benefits of streaming mode:**

- Process results immediately as they're discovered
- Start working with early results while crawling continues
- Better for real-time applications or progressive display
- Reduces memory pressure when handling many pages (see the sketch below)
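As a minimal sketch of that low-memory pattern (the start URL and output path are placeholders), this example writes each result to a JSON Lines file as soon as it arrives instead of holding the whole crawl in memory:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def stream_to_disk(start_url: str, out_path: str = "results.jsonl"):
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
        stream=True,
    )
    async with AsyncWebCrawler() as crawler:
        with open(out_path, "a") as f:
            async for result in await crawler.arun(start_url, config=config):
                # Persist each page immediately; nothing accumulates in memory
                record = {
                    "url": result.url,
                    "depth": result.metadata.get("depth", 0),
                    "success": result.success,
                }
                f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    asyncio.run(stream_to_disk("https://example.com"))
```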
---

## 4. Filtering Content with Filter Chains

Filters help you narrow down which pages to crawl. Combine multiple filters using **FilterChain** for powerful targeting.

### 4.1 Basic URL Pattern Filter

```python
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

# Only follow URLs containing "blog" or "docs"
url_filter = URLPatternFilter(patterns=["*blog*", "*docs*"])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([url_filter])
    )
)
```

### 4.2 Combining Multiple Filters

```python
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    URLPatternFilter,
    DomainFilter,
    ContentTypeFilter
)

# Create a chain of filters
filter_chain = FilterChain([
    # Only follow URLs with specific patterns
    URLPatternFilter(patterns=["*guide*", "*tutorial*"]),

    # Only crawl specific domains
    DomainFilter(
        allowed_domains=["docs.example.com"],
        blocked_domains=["old.docs.example.com"]
    ),

    # Only include specific content types
    ContentTypeFilter(allowed_types=["text/html"])
])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,
        filter_chain=filter_chain
    )
)
```

### 4.3 Available Filter Types

Crawl4AI includes several specialized filters:

- **`URLPatternFilter`**: Matches URL patterns using wildcard syntax
- **`DomainFilter`**: Controls which domains to include or exclude
- **`ContentTypeFilter`**: Filters based on HTTP Content-Type
- **`ContentRelevanceFilter`**: Uses similarity to a text query
- **`SEOFilter`**: Evaluates SEO elements (meta tags, headers, etc.)

---

## 5. Using Scorers for Prioritized Crawling

Scorers assign priority values to discovered URLs, helping the crawler focus on the most relevant content first.

### 5.1 KeywordRelevanceScorer

```python
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy

# Create a keyword relevance scorer
keyword_scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7  # Importance of this scorer (0.0 to 1.0)
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BestFirstCrawlingStrategy(
        max_depth=2,
        url_scorer=keyword_scorer
    ),
    stream=True  # Recommended with BestFirstCrawling
)

# Results will come in order of relevance score
async with AsyncWebCrawler() as crawler:
    async for result in await crawler.arun("https://example.com", config=config):
        score = result.metadata.get("score", 0)
        print(f"Score: {score:.2f} | {result.url}")
```

**How scorers work:**

- Evaluate each discovered URL before crawling
- Calculate relevance based on various signals
- Help the crawler make intelligent choices about traversal order

---
## 6. Advanced Filtering Techniques

### 6.1 SEO Filter for Quality Assessment

The **SEOFilter** helps you identify pages with strong SEO characteristics:

```python
from crawl4ai.deep_crawling.filters import FilterChain, SEOFilter

# Create an SEO filter that looks for specific keywords in page metadata
seo_filter = SEOFilter(
    threshold=0.5,  # Minimum score (0.0 to 1.0)
    keywords=["tutorial", "guide", "documentation"]
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([seo_filter])
    )
)
```

### 6.2 Content Relevance Filter

The **ContentRelevanceFilter** analyzes the actual content of pages:

```python
from crawl4ai.deep_crawling.filters import FilterChain, ContentRelevanceFilter

# Create a content relevance filter
relevance_filter = ContentRelevanceFilter(
    query="Web crawling and data extraction with Python",
    threshold=0.7  # Minimum similarity score (0.0 to 1.0)
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([relevance_filter])
    )
)
```

This filter:

- Measures semantic similarity between the query and each page's content
- Uses BM25-based relevance scoring over the page's head section content

---

## 7. Building a Complete Advanced Crawler

This example combines multiple techniques for a sophisticated crawl:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter
)
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

async def run_advanced_crawler():
    # Create a sophisticated filter chain
    filter_chain = FilterChain([
        # Domain boundaries
        DomainFilter(
            allowed_domains=["docs.example.com"],
            blocked_domains=["old.docs.example.com"]
        ),
        # URL patterns to include
        URLPatternFilter(patterns=["*guide*", "*tutorial*", "*blog*"]),
        # Content type filtering
        ContentTypeFilter(allowed_types=["text/html"])
    ])

    # Create a relevance scorer
    keyword_scorer = KeywordRelevanceScorer(
        keywords=["crawl", "example", "async", "configuration"],
        weight=0.7
    )

    # Set up the configuration
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            include_external=False,
            filter_chain=filter_chain,
            url_scorer=keyword_scorer
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        stream=True,
        verbose=True
    )

    # Execute the crawl
    results = []
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.example.com", config=config):
            results.append(result)
            score = result.metadata.get("score", 0)
            depth = result.metadata.get("depth", 0)
            print(f"Depth: {depth} | Score: {score:.2f} | {result.url}")

    # Analyze the results
    print(f"Crawled {len(results)} high-value pages")
    print(f"Average score: {sum(r.metadata.get('score', 0) for r in results) / len(results):.2f}")

    # Group by depth
    depth_counts = {}
    for result in results:
        depth = result.metadata.get("depth", 0)
        depth_counts[depth] = depth_counts.get(depth, 0) + 1

    print("Pages crawled by depth:")
    for depth, count in sorted(depth_counts.items()):
        print(f"  Depth {depth}: {count} pages")

if __name__ == "__main__":
    asyncio.run(run_advanced_crawler())
```

---
## 8. Limiting and Controlling Crawl Size

### 8.1 Using max_pages

You can limit the total number of pages crawled with the `max_pages` parameter:

```python
# Limit to exactly 20 pages regardless of depth
strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    max_pages=20
)
```

This feature is useful for:

- Controlling API costs
- Setting predictable execution times
- Focusing on the most important content
- Testing crawl configurations before full execution

### 8.2 Using score_threshold

For the BFS and DFS strategies, you can set a minimum score threshold to only crawl high-quality pages:

```python
# Only follow links with scores above 0.4
strategy = DFSDeepCrawlStrategy(
    max_depth=2,
    url_scorer=KeywordRelevanceScorer(keywords=["api", "guide", "reference"]),
    score_threshold=0.4  # Skip URLs with scores below this value
)
```

Note that for BestFirstCrawlingStrategy, `score_threshold` is not needed, since pages are already processed in order of highest score first.

## 9. Common Pitfalls & Tips

1. **Set realistic limits.** Be cautious with `max_depth` values > 3, which can exponentially increase crawl size. Use `max_pages` to set hard limits.

2. **Don't neglect the scoring component.** BestFirstCrawling works best with well-tuned scorers. Experiment with keyword weights for optimal prioritization.

3. **Be a good web citizen.** Respect robots.txt where appropriate (robots.txt checking is disabled by default).

4. **Handle page errors gracefully.** Not all pages will be accessible. Check `result.success` when processing results (see the sketch at the end of this section).

5. **Balance breadth vs. depth.** Choose your strategy wisely: BFS for comprehensive coverage, DFS for deep exploration, BestFirst for focused, relevance-based crawling.

6. **Preserve HTTPS for security.** If crawling HTTPS sites that redirect to HTTP, use `preserve_https_for_internal_links=True` to maintain secure connections:

```python
config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2),
    preserve_https_for_internal_links=True  # Keep HTTPS even if server redirects to HTTP
)
```

This is especially useful for security-conscious crawling or when dealing with sites that support both protocols.
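Here is a minimal sketch of tip 4 (the start URL is a placeholder): split results into successes and failures, and report the status code or error message for anything that failed.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def crawl_with_error_handling():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1, max_pages=20)
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)

    ok = [r for r in results if r.success]
    failed = [r for r in results if not r.success]
    print(f"{len(ok)} pages succeeded, {len(failed)} failed")
    for r in failed:
        # status_code may be None for network-level failures
        print(f"  {r.url} -> status={r.status_code}, error={r.error_message}")

if __name__ == "__main__":
    asyncio.run(crawl_with_error_handling())
```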
---

## 10. Crash Recovery for Long-Running Crawls

For production deployments, especially in cloud environments where instances can be terminated unexpectedly, Crawl4AI provides built-in crash recovery support for all deep crawl strategies.

### 10.1 Enabling State Persistence

All deep crawl strategies (BFS, DFS, Best-First) support two optional parameters:

- **`resume_state`**: Pass a previously saved state to resume from a checkpoint
- **`on_state_change`**: Async callback fired after each URL is processed

```python
import json
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Callback to save state after each URL
# (assumes an already-connected async Redis client named `redis`)
async def save_state_to_redis(state: dict):
    await redis.set("crawl_state", json.dumps(state))

strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    on_state_change=save_state_to_redis,  # Called after each URL
)
```

### 10.2 State Structure

The state dictionary is JSON-serializable and contains:

```python
{
    "strategy_type": "bfs",                             # or "dfs", "best_first"
    "visited": ["url1", "url2", ...],                   # Already crawled URLs
    "pending": [{"url": "...", "parent_url": "..."}],   # Queue/stack
    "depths": {"url1": 0, "url2": 1},                   # Depth tracking
    "pages_crawled": 42                                 # Counter
}
```

### 10.3 Resuming from a Checkpoint

```python
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Load saved state (e.g., from Redis, database, or file)
saved_state = json.loads(await redis.get("crawl_state"))

# Resume crawling from where we left off
strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    resume_state=saved_state,             # Continue from checkpoint
    on_state_change=save_state_to_redis,  # Keep saving progress
)

config = CrawlerRunConfig(deep_crawl_strategy=strategy)

async with AsyncWebCrawler() as crawler:
    # Will skip already-visited URLs and continue from the pending queue
    results = await crawler.arun(start_url, config=config)
```

### 10.4 Manual State Export

You can export the last captured state using `export_state()`. Note that this requires `on_state_change` to be set (the state is captured in the callback):

```python
import json

captured_state = None

async def capture_state(state: dict):
    global captured_state
    captured_state = state

strategy = BFSDeepCrawlStrategy(
    max_depth=2,
    on_state_change=capture_state,  # Required for state capture
)

config = CrawlerRunConfig(deep_crawl_strategy=strategy)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun(start_url, config=config)

# Get the last captured state
state = strategy.export_state()
if state:
    # Save to your preferred storage
    with open("crawl_checkpoint.json", "w") as f:
        json.dump(state, f)
```

### 10.5 Complete Example: Redis-Based Recovery

```python
import asyncio
import json
import redis.asyncio as redis
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

REDIS_KEY = "crawl4ai:crawl_state"

async def main():
    redis_client = redis.Redis(host='localhost', port=6379, db=0)

    # Check for existing state
    saved_state = None
    existing = await redis_client.get(REDIS_KEY)
    if existing:
        saved_state = json.loads(existing)
        print(f"Resuming from checkpoint: {saved_state['pages_crawled']} pages already crawled")

    # State persistence callback
    async def persist_state(state: dict):
        await redis_client.set(REDIS_KEY, json.dumps(state))

    # Create strategy with recovery support
    strategy = BFSDeepCrawlStrategy(
        max_depth=3,
        max_pages=100,
        resume_state=saved_state,
        on_state_change=persist_state,
    )

    config = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=True)

    try:
        async with AsyncWebCrawler() as crawler:
            async for result in await crawler.arun("https://example.com", config=config):
                print(f"Crawled: {result.url}")
    except Exception as e:
        print(f"Crawl interrupted: {e}")
        print("State saved - restart to resume")
    finally:
        await redis_client.close()

if __name__ == "__main__":
    asyncio.run(main())
```
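If Redis isn't available, the same pattern works with a local checkpoint file. Below is a minimal file-based sketch of the recovery loop above (the checkpoint path and start URL are placeholders):

```python
import asyncio
import json
import os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

CHECKPOINT = "crawl_checkpoint.json"

async def main():
    # Resume from a previous checkpoint if one exists
    saved_state = None
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            saved_state = json.load(f)
        print(f"Resuming: {saved_state['pages_crawled']} pages already crawled")

    async def persist_state(state: dict):
        # Write to a temp file first, then swap, so a crash mid-write
        # never leaves a corrupt checkpoint behind
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, CHECKPOINT)

    strategy = BFSDeepCrawlStrategy(
        max_depth=3,
        max_pages=100,
        resume_state=saved_state,
        on_state_change=persist_state,
    )
    config = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=True)

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://example.com", config=config):
            print(f"Crawled: {result.url}")

if __name__ == "__main__":
    asyncio.run(main())
```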
### 10.6 Zero Overhead

When `resume_state=None` and `on_state_change=None` (the defaults), there is no performance impact. State tracking only activates when you enable these features.

---

## 11. Prefetch Mode for Fast URL Discovery

When you need to quickly discover URLs without full page processing, use **prefetch mode**. This is ideal for two-phase crawling where you first map the site, then selectively process specific pages.

### 11.1 Enabling Prefetch Mode

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(prefetch=True)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com", config=config)

    # Result contains only HTML and links - no markdown, no extraction
    print(f"Found {len(result.links['internal'])} internal links")
    print(f"Found {len(result.links['external'])} external links")
```

### 11.2 What Gets Skipped

Prefetch mode uses a fast path that bypasses heavy processing:

| Processing Step | Normal Mode | Prefetch Mode |
|-----------------|-------------|---------------|
| Fetch HTML | ✅ | ✅ |
| Extract links | ✅ | ✅ (fast `quick_extract_links()`) |
| Generate markdown | ✅ | ❌ Skipped |
| Content scraping | ✅ | ❌ Skipped |
| Media extraction | ✅ | ❌ Skipped |
| LLM extraction | ✅ | ❌ Skipped |

### 11.3 Performance Benefit

- **Normal mode**: Full pipeline (~2-5 seconds per page)
- **Prefetch mode**: HTML + links only (~200-500ms per page)

This makes prefetch mode **5-10x faster** for URL discovery.

### 11.4 Two-Phase Crawling Pattern

The most common use case is two-phase crawling:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def two_phase_crawl(start_url: str):
    async with AsyncWebCrawler() as crawler:
        # ═══════════════════════════════════════════════
        # Phase 1: Fast discovery (prefetch mode)
        # ═══════════════════════════════════════════════
        prefetch_config = CrawlerRunConfig(prefetch=True)
        discovery = await crawler.arun(start_url, config=prefetch_config)

        all_urls = [link["href"] for link in discovery.links.get("internal", [])]
        print(f"Discovered {len(all_urls)} URLs")

        # Filter to URLs you care about
        blog_urls = [url for url in all_urls if "/blog/" in url]
        print(f"Found {len(blog_urls)} blog posts to process")

        # ═══════════════════════════════════════════════
        # Phase 2: Full processing on selected URLs only
        # ═══════════════════════════════════════════════
        full_config = CrawlerRunConfig(
            # Your normal extraction settings
            word_count_threshold=100,
            remove_overlay_elements=True,
        )

        results = []
        for url in blog_urls:
            result = await crawler.arun(url, config=full_config)
            if result.success:
                results.append(result)
                print(f"Processed: {url}")

        return results

if __name__ == "__main__":
    results = asyncio.run(two_phase_crawl("https://example.com"))
    print(f"Fully processed {len(results)} pages")
```

### 11.5 Use Cases

- **Site mapping**: Quickly discover all URLs before deciding what to process
- **Link validation**: Check which pages exist without heavy processing (see the sketch below)
- **Selective deep crawl**: Prefetch to find URLs, filter by pattern, then full crawl
- **Crawl planning**: Estimate crawl size before committing resources
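As a minimal sketch of the link-validation use case (the start URL is a placeholder), this example prefetches every internal link found on a page and reports the ones that fail or return an error status:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def validate_links(start_url: str):
    prefetch_config = CrawlerRunConfig(prefetch=True)

    async with AsyncWebCrawler() as crawler:
        discovery = await crawler.arun(start_url, config=prefetch_config)
        urls = [link["href"] for link in discovery.links.get("internal", [])]

        broken = []
        for url in urls:
            result = await crawler.arun(url, config=prefetch_config)
            # status_code may be None for network-level failures
            if not result.success or (result.status_code or 0) >= 400:
                broken.append((url, result.status_code))

    print(f"Checked {len(urls)} links, {len(broken)} broken")
    for url, status in broken:
        print(f"  {status}: {url}")

if __name__ == "__main__":
    asyncio.run(validate_links("https://example.com"))
```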
---

## 12. Summary & Next Steps

In this **Deep Crawling with Crawl4AI** tutorial, you learned to:

- Configure **BFSDeepCrawlStrategy**, **DFSDeepCrawlStrategy**, and **BestFirstCrawlingStrategy**
- Process results in streaming or non-streaming mode
- Apply filters to target specific content
- Use scorers to prioritize the most relevant pages
- Limit crawls with the `max_pages` and `score_threshold` parameters
- Build a complete advanced crawler with combined techniques
- **Implement crash recovery** with `resume_state` and `on_state_change` for production deployments
- **Use prefetch mode** for fast URL discovery and two-phase crawling

With these tools, you can efficiently extract structured data from websites at scale, focusing precisely on the content you need for your specific use case.