Deep Crawling

One of Crawl4AI's most powerful features is its ability to perform configurable deep crawling that can explore websites beyond a single page. With fine-tuned control over crawl depth, domain boundaries, and content filtering, Crawl4AI gives you the tools to extract precisely the content you need.

In this tutorial, you'll learn how to:

Set up a basic deep crawler with the BFS strategy
Choose between streamed and non-streamed output
Implement filters and scorers to target specific content
Build advanced filter chains for sophisticated crawls
Use BestFirstCrawlingStrategy to prioritize exploration intelligently
Recover from crashes in long-running production crawls
Use prefetch mode for fast URL discovery

Prerequisites

You’ve completed or read AsyncWebCrawler Basics to understand how to run a simple crawl.

You know how to configure CrawlerRunConfig.

1. Quick Example

Here's a minimal code snippet that implements a basic deep crawl using the BFSDeepCrawlStrategy:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

async def main():
    # Configure a 2-level deep crawl
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2, 
            include_external=False
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )
    
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)
        
        print(f"Crawled {len(results)} pages in total")
        
        # Access individual results
        for result in results[:3]:  # Show first 3 results
            print(f"URL: {result.url}")
            print(f"Depth: {result.metadata.get('depth', 0)}")

if __name__ == "__main__":
    asyncio.run(main())

What's happening?

BFSDeepCrawlStrategy(max_depth=2, include_external=False) instructs Crawl4AI to:
- Crawl the starting page (depth 0) plus 2 more levels
- Stay within the same domain (don't follow external links)
Each result contains metadata like the crawl depth
Results are returned as a list after all crawling is complete

2. Understanding Deep Crawling Strategy Options

2.1 BFSDeepCrawlStrategy (Breadth-First Search)

The BFSDeepCrawlStrategy uses a breadth-first approach, exploring all links at one depth before moving deeper:

from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Basic configuration
strategy = BFSDeepCrawlStrategy(
    max_depth=2,               # Crawl initial page + 2 levels deep
    include_external=False,    # Stay within the same domain
    max_pages=50,              # Maximum number of pages to crawl (optional)
    score_threshold=0.3,       # Minimum score for URLs to be crawled (optional)
)

Key parameters:

max_depth: Number of levels to crawl beyond the starting page
include_external: Whether to follow links to other domains
max_pages: Maximum number of pages to crawl (default: infinite)
score_threshold: Minimum score for URLs to be crawled (default: -inf)
filter_chain: FilterChain instance for URL filtering
url_scorer: Scorer instance for evaluating URLs
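
To make the traversal order concrete, here is a minimal sketch of breadth-first crawling over an in-memory link graph. It is not Crawl4AI's implementation: `LINKS` and `bfs_order` are hypothetical names, and the dictionary stands in for real pages. It only illustrates how `max_depth` and `max_pages` bound a level-by-level traversal.

```python
from collections import deque

# Hypothetical site: each URL maps to the links found on that page.
LINKS = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api", "/docs/guide"],
    "/blog": ["/blog/post-1"],
    "/docs/api": ["/docs/api/v2"],
    "/docs/guide": [],
    "/blog/post-1": [],
    "/docs/api/v2": [],
}

def bfs_order(start, max_depth, max_pages=float("inf")):
    """Return (url, depth) pairs in breadth-first order."""
    visited = {start}
    queue = deque([(start, 0)])
    crawled = []
    while queue and len(crawled) < max_pages:
        url, depth = queue.popleft()
        crawled.append((url, depth))
        if depth >= max_depth:
            continue  # Do not expand links beyond max_depth
        for link in LINKS.get(url, []):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return crawled

print(bfs_order("/", max_depth=2))
```

With `max_depth=2`, every depth-1 page is visited before any depth-2 page, and `/docs/api/v2` (depth 3) is never reached.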

2.2 DFSDeepCrawlStrategy (Depth-First Search)

The DFSDeepCrawlStrategy uses a depth-first approach, exploring as far down a branch as possible before backtracking:

from crawl4ai.deep_crawling import DFSDeepCrawlStrategy

# Basic configuration
strategy = DFSDeepCrawlStrategy(
    max_depth=2,               # Crawl initial page + 2 levels deep
    include_external=False,    # Stay within the same domain
    max_pages=30,              # Maximum number of pages to crawl (optional)
    score_threshold=0.5,       # Minimum score for URLs to be crawled (optional)
)

Key parameters:

max_depth: Number of levels to crawl beyond the starting page
include_external: Whether to follow links to other domains
max_pages: Maximum number of pages to crawl (default: infinite)
score_threshold: Minimum score for URLs to be crawled (default: -inf)
filter_chain: FilterChain instance for URL filtering
url_scorer: Scorer instance for evaluating URLs
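
For comparison with breadth-first order, the following sketch walks a small in-memory link graph depth-first using an explicit stack. The graph and the `dfs_order` helper are hypothetical and only illustrate the traversal order, not how Crawl4AI schedules requests internally.

```python
# Hypothetical site: each URL maps to the links found on that page.
LINKS = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api", "/docs/guide"],
    "/blog": ["/blog/post-1"],
    "/docs/api": ["/docs/api/v2"],
    "/docs/guide": [],
    "/blog/post-1": [],
    "/docs/api/v2": [],
}

def dfs_order(start, max_depth, max_pages=float("inf")):
    """Return (url, depth) pairs in depth-first order."""
    visited = set()
    stack = [(start, 0)]
    crawled = []
    while stack and len(crawled) < max_pages:
        url, depth = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        crawled.append((url, depth))
        if depth < max_depth:
            # Push in reverse so the first link on the page is explored first.
            for link in reversed(LINKS.get(url, [])):
                if link not in visited:
                    stack.append((link, depth + 1))
    return crawled

print([url for url, _ in dfs_order("/", max_depth=3)])
```

Here the `/docs` branch is followed all the way down to `/docs/api/v2` before `/blog` is touched at all.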

2.3 BestFirstCrawlingStrategy (⭐️ - Recommended Deep crawl strategy)

For more intelligent crawling, use BestFirstCrawlingStrategy with scorers to prioritize the most relevant pages:

from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Create a scorer
scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7
)

# Configure the strategy
strategy = BestFirstCrawlingStrategy(
    max_depth=2,
    include_external=False,
    url_scorer=scorer,
    max_pages=25,              # Maximum number of pages to crawl (optional)
)

This crawling approach:

Evaluates each discovered URL based on scorer criteria
Visits higher-scoring pages first
Helps focus crawl resources on the most relevant content
Can limit total pages crawled with max_pages
Does not need score_threshold as it naturally prioritizes by score
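
The "visit higher-scoring pages first" idea can be sketched with a priority queue. This is an illustration under assumptions, not the library's code: the graph, the fixed `SCORES` table, and `best_first_order` are all hypothetical, and a real scorer would compute scores from the URL instead of looking them up.

```python
import heapq

# Hypothetical link graph and precomputed URL scores.
LINKS = {
    "/": ["/about", "/docs/async", "/docs"],
    "/docs": ["/docs/async-crawl-config"],
    "/docs/async": [],
    "/about": [],
    "/docs/async-crawl-config": [],
}
SCORES = {
    "/": 0.0,
    "/about": 0.1,
    "/docs": 0.5,
    "/docs/async": 0.8,
    "/docs/async-crawl-config": 0.9,
}

def best_first_order(start, max_pages):
    """Visit URLs highest-score first until max_pages is reached."""
    visited = {start}
    counter = 0  # Tie-breaker: earlier discoveries win on equal scores
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-SCORES[start], counter, start)]
    order = []
    while frontier and len(order) < max_pages:
        _, _, url = heapq.heappop(frontier)
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in visited:
                visited.add(link)
                counter += 1
                heapq.heappush(frontier, (-SCORES[link], counter, link))
    return order

print(best_first_order("/", max_pages=4))
```

With a budget of four pages, the low-scoring `/about` page is the one left uncrawled, which is why a page cap pairs naturally with score-ordered traversal.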

3. Streaming vs. Non-Streaming Results

Crawl4AI can return results in two modes:

3.1 Non-Streaming Mode (Default)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=False  # Default behavior
)

async with AsyncWebCrawler() as crawler:
    # Wait for ALL results to be collected before returning
    results = await crawler.arun("https://example.com", config=config)
    
    for result in results:
        process_result(result)

When to use non-streaming mode:

You need the complete dataset before processing
You're performing batch operations on all results together
Crawl time isn't a critical factor

3.2 Streaming Mode

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=True  # Enable streaming
)

async with AsyncWebCrawler() as crawler:
    # Returns an async iterator
    async for result in await crawler.arun("https://example.com", config=config):
        # Process each result as it becomes available
        process_result(result)

Benefits of streaming mode:

Process results immediately as they're discovered
Start working with early results while crawling continues
Better for real-time applications or progressive display
Reduces memory pressure when handling many pages
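
The difference between the two modes is the same as the difference between returning a list and yielding from an async generator. The sketch below uses only asyncio; `fetch`, `crawl_batch`, and `crawl_stream` are hypothetical stand-ins, and the sleep replaces real network I/O.

```python
import asyncio

async def fetch(url):
    """Stand-in for a page fetch (sleeps instead of doing network I/O)."""
    await asyncio.sleep(0.01)
    return f"result for {url}"

async def crawl_batch(urls):
    """Non-streaming: collect every result, then return the full list."""
    return [await fetch(u) for u in urls]

async def crawl_stream(urls):
    """Streaming: yield each result as soon as it is ready."""
    for u in urls:
        yield await fetch(u)

async def demo():
    urls = ["/a", "/b", "/c"]
    batch = await crawl_batch(urls)         # Nothing is usable until all are done
    streamed = []
    async for item in crawl_stream(urls):   # Each item is usable immediately
        streamed.append(item)
    return batch, streamed

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Both paths produce the same results; streaming only changes when each one becomes available to the caller.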

4. Filtering Content with Filter Chains

Filters help you narrow down which pages to crawl. Combine multiple filters using FilterChain for powerful targeting.

4.1 Basic URL Pattern Filter

from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

# Only follow URLs containing "blog" or "docs"
url_filter = URLPatternFilter(patterns=["*blog*", "*docs*"])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([url_filter])
    )
)

4.2 Combining Multiple Filters

from crawl4ai.deep_crawling.filters import (
    FilterChain,
    URLPatternFilter,
    DomainFilter,
    ContentTypeFilter
)

# Create a chain of filters
filter_chain = FilterChain([
    # Only follow URLs with specific patterns
    URLPatternFilter(patterns=["*guide*", "*tutorial*"]),
    
    # Only crawl specific domains
    DomainFilter(
        allowed_domains=["docs.example.com"],
        blocked_domains=["old.docs.example.com"]
    ),
    
    # Only include specific content types
    ContentTypeFilter(allowed_types=["text/html"])
])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,
        filter_chain=filter_chain
    )
)

4.3 Available Filter Types

Crawl4AI includes several specialized filters:

URLPatternFilter: Matches URL patterns using wildcard syntax
DomainFilter: Controls which domains to include or exclude
ContentTypeFilter: Filters based on HTTP Content-Type
ContentRelevanceFilter: Uses similarity to a text query
SEOFilter: Evaluates SEO elements (meta tags, headers, etc.)
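
Conceptually, a filter chain is a set of predicates that must all accept a URL. The sketch below mimics that with the standard library; the helper names are hypothetical, and the exact-host matching is an assumption that may differ from how Crawl4AI's DomainFilter treats subdomains.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

def url_pattern_filter(patterns):
    """Accept a URL if it matches any wildcard pattern."""
    return lambda url: any(fnmatchcase(url, p) for p in patterns)

def domain_filter(allowed, blocked=()):
    """Accept a URL whose host is allowed and not blocked (exact match)."""
    def check(url):
        host = urlparse(url).hostname or ""
        return host in allowed and host not in blocked
    return check

def chain(*filters):
    """A URL passes the chain only if every filter accepts it."""
    return lambda url: all(f(url) for f in filters)

accept = chain(
    url_pattern_filter(["*guide*", "*tutorial*"]),
    domain_filter({"docs.example.com"}, {"old.docs.example.com"}),
)

for url in [
    "https://docs.example.com/guide/intro",
    "https://docs.example.com/api",
    "https://old.docs.example.com/guide",
]:
    print(url, accept(url))
```

Because every filter must agree, ordering cheap checks (patterns, domains) before expensive ones keeps the chain fast.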

5. Using Scorers for Prioritized Crawling

Scorers assign priority values to discovered URLs, helping the crawler focus on the most relevant content first.

5.1 KeywordRelevanceScorer

from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy

# Create a keyword relevance scorer
keyword_scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7  # Importance of this scorer (0.0 to 1.0)
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BestFirstCrawlingStrategy(
        max_depth=2,
        url_scorer=keyword_scorer
    ),
    stream=True  # Recommended with BestFirstCrawling
)

# Results will come in order of relevance score
async with AsyncWebCrawler() as crawler:
    async for result in await crawler.arun("https://example.com", config=config):
        score = result.metadata.get("score", 0)
        print(f"Score: {score:.2f} | {result.url}")

How scorers work:

Evaluate each discovered URL before crawling
Calculate relevance based on various signals
Help the crawler make intelligent choices about traversal order
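
As a rough illustration of URL scoring, the sketch below scores a URL by the fraction of keywords it contains, scaled by a weight. The formula and the `keyword_relevance` helper are assumptions made for this example; the source does not state the formula KeywordRelevanceScorer actually uses.

```python
def keyword_relevance(url, keywords, weight=1.0):
    """Fraction of keywords that occur in the URL, scaled by weight."""
    if not keywords:
        return 0.0
    url_lower = url.lower()
    hits = sum(1 for kw in keywords if kw.lower() in url_lower)
    return weight * hits / len(keywords)

keywords = ["crawl", "example", "async", "configuration"]
urls = [
    "https://example.com/about",
    "https://example.com/docs/async-crawl",
    "https://example.com/docs/async-crawl-configuration",
]

# Rank discovered URLs so the most relevant are crawled first.
ranked = sorted(urls, key=lambda u: keyword_relevance(u, keywords, 0.7), reverse=True)
for url in ranked:
    print(f"{keyword_relevance(url, keywords, 0.7):.3f}  {url}")
```

The scoring happens on the URL alone, before any request is made, which is what lets a crawler decide traversal order cheaply.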

6. Advanced Filtering Techniques

6.1 SEO Filter for Quality Assessment

The SEOFilter helps you identify pages with strong SEO characteristics:

from crawl4ai.deep_crawling.filters import FilterChain, SEOFilter

# Create an SEO filter that looks for specific keywords in page metadata
seo_filter = SEOFilter(
    threshold=0.5,  # Minimum score (0.0 to 1.0)
    keywords=["tutorial", "guide", "documentation"]
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([seo_filter])
    )
)

6.2 Content Relevance Filter

The ContentRelevanceFilter analyzes the actual content of pages:

from crawl4ai.deep_crawling.filters import FilterChain, ContentRelevanceFilter

# Create a content relevance filter
relevance_filter = ContentRelevanceFilter(
    query="Web crawling and data extraction with Python",
    threshold=0.7  # Minimum similarity score (0.0 to 1.0)
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([relevance_filter])
    )
)

This filter:

Scores how relevant a page is to the query and keeps pages at or above the threshold
Is BM25-based and uses head section content, so it measures term overlap with the query rather than deeper semantic similarity
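
For readers unfamiliar with BM25, here is a minimal sketch of the standard Okapi BM25 formula applied to a tiny corpus. It is not Crawl4AI's code: how the library tokenizes head content and maps scores onto the 0.0 to 1.0 threshold is not described here, and `bm25_scores` with its `k1` and `b` defaults is an illustrative choice.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "python web crawling guide",
    "cooking recipes for pasta",
    "data extraction with python",
]
print(bm25_scores("web crawling python", docs))
```

Documents sharing more query terms score higher, and a document with no query terms scores zero, which is the behavior a relevance threshold relies on.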

7. Building a Complete Advanced Crawler

This example combines multiple techniques for a sophisticated crawl:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter
)
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

async def run_advanced_crawler():
    # Create a sophisticated filter chain
    filter_chain = FilterChain([
        # Domain boundaries
        DomainFilter(
            allowed_domains=["docs.example.com"],
            blocked_domains=["old.docs.example.com"]
        ),
        
        # URL patterns to include
        URLPatternFilter(patterns=["*guide*", "*tutorial*", "*blog*"]),
        
        # Content type filtering
        ContentTypeFilter(allowed_types=["text/html"])
    ])

    # Create a relevance scorer
    keyword_scorer = KeywordRelevanceScorer(
        keywords=["crawl", "example", "async", "configuration"],
        weight=0.7
    )

    # Set up the configuration
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            include_external=False,
            filter_chain=filter_chain,
            url_scorer=keyword_scorer
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        stream=True,
        verbose=True
    )

    # Execute the crawl
    results = []
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.example.com", config=config):
            results.append(result)
            score = result.metadata.get("score", 0)
            depth = result.metadata.get("depth", 0)
            print(f"Depth: {depth} | Score: {score:.2f} | {result.url}")

    # Analyze the results
    print(f"Crawled {len(results)} high-value pages")
    if results:  # Avoid dividing by zero when nothing was crawled
        avg_score = sum(r.metadata.get('score', 0) for r in results) / len(results)
        print(f"Average score: {avg_score:.2f}")

    # Group by depth
    depth_counts = {}
    for result in results:
        depth = result.metadata.get("depth", 0)
        depth_counts[depth] = depth_counts.get(depth, 0) + 1

    print("Pages crawled by depth:")
    for depth, count in sorted(depth_counts.items()):
        print(f"  Depth {depth}: {count} pages")

if __name__ == "__main__":
    asyncio.run(run_advanced_crawler())

8. Limiting and Controlling Crawl Size

8.1 Using max_pages

You can limit the total number of pages crawled with the max_pages parameter:

# Crawl at most 20 pages, regardless of depth
strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    max_pages=20
)

This feature is useful for:

Controlling API costs
Setting predictable execution times
Focusing on the most important content
Testing crawl configurations before full execution

8.2 Using score_threshold

For BFS and DFS strategies, you can set a minimum score threshold to only crawl high-quality pages:

# Only follow links with scores above 0.4
strategy = DFSDeepCrawlStrategy(
    max_depth=2,
    url_scorer=KeywordRelevanceScorer(keywords=["api", "guide", "reference"]),
    score_threshold=0.4  # Skip URLs with scores below this value
)

Note that for BestFirstCrawlingStrategy, score_threshold is not needed since pages are already processed in order of highest score first.
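
The contrast between a score gate and score ordering can be shown in a few lines. This is a conceptual sketch with made-up scores; `apply_threshold` and `best_first` are hypothetical helpers, not library functions.

```python
# Hypothetical scores for URLs discovered on one page, in discovery order.
candidates = {"/api/guide": 0.9, "/about": 0.1, "/reference": 0.5, "/careers": 0.3}

def apply_threshold(scored, threshold):
    """BFS/DFS-style gate: keep discovery order, drop URLs scoring below the threshold."""
    return [url for url, score in scored.items() if score >= threshold]

def best_first(scored, max_pages):
    """Best-first style: no gate, just take the top-scoring URLs first."""
    return sorted(scored, key=scored.get, reverse=True)[:max_pages]

print(apply_threshold(candidates, 0.4))
print(best_first(candidates, 3))
```

A threshold discards weak URLs outright, while best-first ordering merely postpones them until the page budget runs out.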

9. Common Pitfalls & Tips

1. Set realistic limits. Be cautious with max_depth values > 3, which can exponentially increase crawl size. Use max_pages to set hard limits.

2. Don't neglect the scoring component. BestFirstCrawling works best with well-tuned scorers. Experiment with keyword weights for optimal prioritization.

3. Be a good web citizen. Respect robots.txt; note that robots.txt checking is disabled by default.

4. Handle page errors gracefully. Not all pages will be accessible. Check result.success (and result.status_code) when processing results.

5. Balance breadth vs. depth. Choose your strategy wisely - BFS for comprehensive coverage, DFS for deep exploration, BestFirst for focused relevance-based crawling.

6. Preserve HTTPS for security. If crawling HTTPS sites that redirect to HTTP, use preserve_https_for_internal_links=True to maintain secure connections:

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2),
    preserve_https_for_internal_links=True  # Keep HTTPS even if server redirects to HTTP
)

This is especially useful for security-conscious crawling or when dealing with sites that support both protocols.
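
One way to picture this option is as a scheme upgrade applied to same-host links. The sketch below is an assumption about the intent, not the library's logic: `preserve_https` is a hypothetical helper that rewrites an internal http:// link to https:// when the crawl started on HTTPS.

```python
from urllib.parse import urlparse, urlunparse

def preserve_https(link, original_url):
    """Upgrade an internal http:// link to https:// if the crawl began on HTTPS."""
    start = urlparse(original_url)
    target = urlparse(link)
    same_host = target.hostname == start.hostname
    if start.scheme == "https" and target.scheme == "http" and same_host:
        return urlunparse(target._replace(scheme="https"))
    return link  # External links and non-HTTPS crawls are left untouched

print(preserve_https("http://example.com/a?x=1", "https://example.com/"))
print(preserve_https("http://other.org/b", "https://example.com/"))
```

External links keep their original scheme in this sketch, since the site being crawled gives no guarantee that another host serves HTTPS.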

10. Crash Recovery for Long-Running Crawls

For production deployments, especially in cloud environments where instances can be terminated unexpectedly, Crawl4AI provides built-in crash recovery support for all deep crawl strategies.

10.1 Enabling State Persistence

All deep crawl strategies (BFS, DFS, Best-First) support two optional parameters:

resume_state: Pass a previously saved state to resume from a checkpoint
on_state_change: Async callback fired after each URL is processed

import json
from redis.asyncio import Redis
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Async Redis client; host and port are placeholders for your deployment
redis = Redis(host="localhost", port=6379, db=0)

# Callback to save state after each URL
async def save_state_to_redis(state: dict):
    await redis.set("crawl_state", json.dumps(state))

strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    on_state_change=save_state_to_redis,  # Called after each URL
)

10.2 State Structure

The state dictionary is JSON-serializable and contains:

{
    "strategy_type": "bfs",  # or "dfs", "best_first"
    "visited": ["url1", "url2", ...],  # Already crawled URLs
    "pending": [{"url": "...", "parent_url": "..."}],  # Queue/stack
    "depths": {"url1": 0, "url2": 1},  # Depth tracking
    "pages_crawled": 42  # Counter
}

10.3 Resuming from a Checkpoint

import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Load saved state (e.g., from Redis, database, or file).
# Run this inside an async function; it assumes a checkpoint exists under "crawl_state".
saved_state = json.loads(await redis.get("crawl_state"))

# Resume crawling from where we left off
strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    resume_state=saved_state,  # Continue from checkpoint
    on_state_change=save_state_to_redis,  # Keep saving progress
)

config = CrawlerRunConfig(deep_crawl_strategy=strategy)

async with AsyncWebCrawler() as crawler:
    # Will skip already-visited URLs and continue from pending queue
    results = await crawler.arun(start_url, config=config)
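
To show how checkpoint and resume fit together, here is a toy resumable BFS over an in-memory graph. It mirrors the state keys listed in 10.2 but is otherwise hypothetical: the `crawl` function, the `stop_after` argument (used to simulate a crash), and the synchronous callback are simplifications, whereas the real on_state_change callback is async.

```python
import json
from collections import deque

LINKS = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": []}  # Hypothetical site

def crawl(start, on_state_change=None, resume_state=None, stop_after=None):
    """Toy resumable BFS; returns the URLs crawled during this run."""
    if resume_state:
        visited = set(resume_state["visited"])
        pending = deque(resume_state["pending"])
        depths = dict(resume_state["depths"])
        pages = resume_state["pages_crawled"]
    else:
        visited, depths, pages = set(), {start: 0}, 0
        pending = deque([{"url": start, "parent_url": None}])
    done = []
    while pending and (stop_after is None or len(done) < stop_after):
        url = pending.popleft()["url"]
        if url in visited:
            continue
        visited.add(url)
        pages += 1
        done.append(url)
        for link in LINKS[url]:
            if link not in visited and link not in depths:
                depths[link] = depths[url] + 1
                pending.append({"url": link, "parent_url": url})
        if on_state_change:
            on_state_change({
                "strategy_type": "bfs",
                "visited": sorted(visited),
                "pending": list(pending),
                "depths": depths,
                "pages_crawled": pages,
            })
    return done

# First run "crashes" after two pages; each checkpoint is stored as JSON text.
checkpoints = []
first = crawl("/", on_state_change=lambda s: checkpoints.append(json.dumps(s)), stop_after=2)
# Second run picks up from the last checkpoint and skips visited URLs.
resumed = crawl("/", resume_state=json.loads(checkpoints[-1]))
print(first, resumed)
```

Serializing the state inside the callback matters: it snapshots the queue at that moment, so later mutations cannot leak into a saved checkpoint.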

10.4 Manual State Export

You can export the last captured state using export_state(). Note that this requires on_state_change to be set (state is captured in the callback):

import json

captured_state = None

async def capture_state(state: dict):
    global captured_state
    captured_state = state

strategy = BFSDeepCrawlStrategy(
    max_depth=2,
    on_state_change=capture_state,  # Required for state capture
)
config = CrawlerRunConfig(deep_crawl_strategy=strategy)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun(start_url, config=config)

# Get the last captured state
state = strategy.export_state()
if state:
    # Save to your preferred storage
    with open("crawl_checkpoint.json", "w") as f:
        json.dump(state, f)

10.5 Complete Example: Redis-Based Recovery

import asyncio
import json
import redis.asyncio as redis
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

REDIS_KEY = "crawl4ai:crawl_state"

async def main():
    redis_client = redis.Redis(host='localhost', port=6379, db=0)

    # Check for existing state
    saved_state = None
    existing = await redis_client.get(REDIS_KEY)
    if existing:
        saved_state = json.loads(existing)
        print(f"Resuming from checkpoint: {saved_state['pages_crawled']} pages already crawled")

    # State persistence callback
    async def persist_state(state: dict):
        await redis_client.set(REDIS_KEY, json.dumps(state))

    # Create strategy with recovery support
    strategy = BFSDeepCrawlStrategy(
        max_depth=3,
        max_pages=100,
        resume_state=saved_state,
        on_state_change=persist_state,
    )

    config = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=True)

    try:
        async with AsyncWebCrawler() as crawler:
            async for result in await crawler.arun("https://example.com", config=config):
                print(f"Crawled: {result.url}")
    except Exception as e:
        print(f"Crawl interrupted: {e}")
        print("State saved - restart to resume")
    finally:
        await redis_client.close()

if __name__ == "__main__":
    asyncio.run(main())

10.6 Zero Overhead

When resume_state=None and on_state_change=None (the defaults), there is no performance impact. State tracking only activates when you enable these features.
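
The source does not show how Crawl4AI achieves this internally. As an illustration of the general opt-in pattern only, here is a toy sketch in which snapshot building is guarded by a check on the callback, so nothing extra runs when it is None. ToyCrawlLoop is a hypothetical stand-in, not a Crawl4AI class.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

StateCallback = Callable[[dict], Awaitable[None]]


class ToyCrawlLoop:
    """Hypothetical stand-in for a crawl loop; not part of Crawl4AI."""

    def __init__(self, on_state_change: Optional[StateCallback] = None):
        self.on_state_change = on_state_change
        self.snapshots_built = 0  # counts how often state was serialized

    def _build_snapshot(self, visited: List[str]) -> dict:
        self.snapshots_built += 1
        return {"visited": list(visited), "pages_crawled": len(visited)}

    async def run(self, urls: List[str]) -> List[str]:
        visited: List[str] = []
        for url in urls:
            visited.append(url)  # stands in for fetching the page
            # Snapshot work happens only when a callback was provided
            if self.on_state_change is not None:
                await self.on_state_change(self._build_snapshot(visited))
        return visited


if __name__ == "__main__":
    loop = ToyCrawlLoop()  # defaults: no callback
    asyncio.run(loop.run(["a", "b", "c"]))
    print(loop.snapshots_built)  # 0
```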

11. Prefetch Mode for Fast URL Discovery

When you need to quickly discover URLs without full page processing, use prefetch mode. This is ideal for two-phase crawling where you first map the site, then selectively process specific pages.

11.1 Enabling Prefetch Mode

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(prefetch=True)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)

        # Result contains only HTML and links - no markdown, no extraction
        print(f"Found {len(result.links['internal'])} internal links")
        print(f"Found {len(result.links['external'])} external links")

if __name__ == "__main__":
    asyncio.run(main())

11.2 What Gets Skipped

Prefetch mode uses a fast path that bypasses heavy processing:

Processing Step	Normal Mode	Prefetch Mode
Fetch HTML	✅	✅
Extract links	✅	✅ (fast `quick_extract_links()`)
Generate markdown	✅	❌ Skipped
Content scraping	✅	❌ Skipped
Media extraction	✅	❌ Skipped
LLM extraction	✅	❌ Skipped

11.3 Performance Benefit

Normal mode: Full pipeline (~2-5 seconds per page)
Prefetch mode: HTML + links only (~200-500ms per page)

By these rough per-page figures, prefetch mode is roughly 5-10x faster for URL discovery. Actual timings depend on the site and your network.
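
To see what these per-page figures mean for a whole crawl, here is a small back-of-the-envelope sketch. It assumes pages are fetched one after another with no concurrency, and it uses the approximate timings quoted above as inputs; the function name estimate_seconds is made up for this example.

```python
# Per-page timings in milliseconds, taken from the rough figures above.
NORMAL_MS = (2000, 5000)
PREFETCH_MS = (200, 500)


def estimate_seconds(pages: int, per_page_ms: tuple) -> tuple:
    """Return (best_case, worst_case) total seconds for a sequential crawl."""
    low_ms, high_ms = per_page_ms
    return (pages * low_ms / 1000, pages * high_ms / 1000)


if __name__ == "__main__":
    pages = 1000
    print("normal:  ", estimate_seconds(pages, NORMAL_MS))    # (2000.0, 5000.0)
    print("prefetch:", estimate_seconds(pages, PREFETCH_MS))  # (200.0, 500.0)
```

For 1,000 pages, that is roughly 33 to 83 minutes in normal mode versus about 3 to 8 minutes in prefetch mode, under the sequential assumption.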

11.4 Two-Phase Crawling Pattern

The most common use case is two-phase crawling:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def two_phase_crawl(start_url: str):
    async with AsyncWebCrawler() as crawler:
        # ═══════════════════════════════════════════════
        # Phase 1: Fast discovery (prefetch mode)
        # ═══════════════════════════════════════════════
        prefetch_config = CrawlerRunConfig(prefetch=True)
        discovery = await crawler.arun(start_url, config=prefetch_config)

        all_urls = [link["href"] for link in discovery.links.get("internal", [])]
        print(f"Discovered {len(all_urls)} URLs")

        # Filter to URLs you care about
        blog_urls = [url for url in all_urls if "/blog/" in url]
        print(f"Found {len(blog_urls)} blog posts to process")

        # ═══════════════════════════════════════════════
        # Phase 2: Full processing on selected URLs only
        # ═══════════════════════════════════════════════
        full_config = CrawlerRunConfig(
            # Your normal extraction settings
            word_count_threshold=100,
            remove_overlay_elements=True,
        )

        results = []
        for url in blog_urls:
            result = await crawler.arun(url, config=full_config)
            if result.success:
                results.append(result)
                print(f"Processed: {url}")

        return results

if __name__ == "__main__":
    results = asyncio.run(two_phase_crawl("https://example.com"))
    print(f"Fully processed {len(results)} pages")
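
Discovered link lists often contain duplicates, such as the same page linked with different #fragment anchors, which would cause Phase 2 to process a page more than once. The sketch below shows one possible way to filter and de-duplicate before Phase 2. It assumes each link is a dict with an "href" key, as in the example above; select_urls is a hypothetical helper, not a Crawl4AI function.

```python
from typing import Dict, List
from urllib.parse import urldefrag


def select_urls(links: List[Dict], path_fragment: str) -> List[str]:
    """Keep unique URLs containing path_fragment, preserving first-seen order."""
    seen = set()
    selected = []
    for link in links:
        href = link.get("href")
        if not href:
            continue  # skip entries without a usable URL
        url, _fragment = urldefrag(href)  # drop any "#..." anchor
        if path_fragment in url and url not in seen:
            seen.add(url)
            selected.append(url)
    return selected


if __name__ == "__main__":
    links = [
        {"href": "https://example.com/blog/a#top"},
        {"href": "https://example.com/blog/a"},
        {"href": "https://example.com/about"},
        {"href": "https://example.com/blog/b"},
    ]
    print(select_urls(links, "/blog/"))
    # ['https://example.com/blog/a', 'https://example.com/blog/b']
```

In the two-phase example, this could replace the two list comprehensions with blog_urls = select_urls(discovery.links.get("internal", []), "/blog/").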

11.5 Use Cases

Site mapping: Quickly discover all URLs before deciding what to process
Link validation: Check which pages exist without heavy processing
Selective deep crawl: Prefetch to find URLs, filter by pattern, then full crawl
Crawl planning: Estimate crawl size before committing resources
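
For the crawl planning case, one simple approach is to group the discovered URLs by their first path segment to see where most of a site's pages live. This is a minimal sketch using only the standard library; count_by_section is a hypothetical helper name, and the input is assumed to be a plain list of URL strings collected during prefetch.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import urlparse


def count_by_section(urls: Iterable[str]) -> Counter:
    """Count URLs by first path segment; URLs with no path go to '(root)'."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        counts[segments[0] if segments else "(root)"] += 1
    return counts


if __name__ == "__main__":
    discovered = [
        "https://example.com/",
        "https://example.com/blog/a",
        "https://example.com/blog/b",
        "https://example.com/docs/x",
    ]
    for section, n in count_by_section(discovered).most_common():
        print(f"{section}: {n}")
```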

12. Summary & Next Steps

In this Deep Crawling with Crawl4AI tutorial, you learned to:

Configure BFSDeepCrawlStrategy, DFSDeepCrawlStrategy, and BestFirstCrawlingStrategy
Process results in streaming or non-streaming mode
Apply filters to target specific content
Use scorers to prioritize the most relevant pages
Limit crawls with max_pages and score_threshold parameters
Build a complete advanced crawler with combined techniques
Implement crash recovery with resume_state and on_state_change for production deployments
Use prefetch mode for fast URL discovery and two-phase crawling

With these tools, you can efficiently extract structured data from websites at scale, focusing precisely on the content you need for your specific use case.
