Files

UncleCode d1de82a332 feat(crawl4ai): Implement SMART cache mode

This commit introduces a new cache mode, SMART, to the crawl4ai library. The SMART mode intelligently validates cached content using HEAD requests before using it, saving significant bandwidth while ensuring fresh content. The changes include modifications to the async_webcrawler.py, cache_context.py, and utils.py files in the crawl4ai directory. The async_webcrawler.py file now includes a check for the SMART cache mode and performs a HEAD check to see if the content has changed. If the content has changed, the url is re-crawled; otherwise, the cached result is used. The cache_context.py and utils.py files have been updated to support these changes.

The documentation has also been updated to reflect these changes. The cache-modes.md file now includes a detailed explanation of the SMART mode, its logs, limitations, and an advanced example. The examples.md file now includes a link to the SMART Cache Mode example. The quickstart.md file now mentions the SMART mode in the note about cache modes.

These changes improve the efficiency of the crawl4ai library by reducing unnecessary re-crawling and bandwidth usage.

BREAKING CHANGE: The introduction of the SMART cache mode may affect existing code that uses the crawl4ai library and does not expect this new mode. Users should review the updated documentation to understand how to use this new mode.

2025-07-21 21:19:37 +08:00

6.7 KiB

Raw Permalink Blame History

Crawl4AI Cache System and Migration Guide

Overview

Starting from version 0.5.0, Crawl4AI introduces a new caching system that replaces the old boolean flags with a more intuitive CacheMode enum. This change simplifies cache control and makes the behavior more predictable.

Old vs New Approach

Old Way (Deprecated)

The old system used multiple boolean flags:

bypass_cache: Skip cache entirely
disable_cache: Disable all caching
no_cache_read: Don't read from cache
no_cache_write: Don't write to cache

New Way (Recommended)

The new system uses a single CacheMode enum:

CacheMode.ENABLED: Normal caching (read/write)
CacheMode.DISABLED: No caching at all
CacheMode.READ_ONLY: Only read from cache
CacheMode.WRITE_ONLY: Only write to cache
CacheMode.BYPASS: Skip cache for this operation
CacheMode.SMART: NEW - Intelligently validate cache with HEAD requests

Migration Example

Old Code (Deprecated)

import asyncio
from crawl4ai import AsyncWebCrawler

async def use_proxy():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True  # Old way
        )
        print(len(result.markdown))

async def main():
    await use_proxy()

if __name__ == "__main__":
    asyncio.run(main())

New Code (Recommended)

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig

async def use_proxy():
    # Use CacheMode in CrawlerRunConfig
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)  
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=config  # Pass the configuration object
        )
        print(len(result.markdown))

async def main():
    await use_proxy()

if __name__ == "__main__":
    asyncio.run(main())

Common Migration Patterns

Old Flag	New Mode
`bypass_cache=True`	`cache_mode=CacheMode.BYPASS`
`disable_cache=True`	`cache_mode=CacheMode.DISABLED`
`no_cache_read=True`	`cache_mode=CacheMode.WRITE_ONLY`
`no_cache_write=True`	`cache_mode=CacheMode.READ_ONLY`

SMART Cache Mode: Only Crawl When Changes

Starting from version 0.7.1, Crawl4AI introduces the SMART cache mode - an intelligent caching strategy that validates cached content before using it. This mode uses HTTP HEAD requests to check if content has changed, potentially saving 70-95% bandwidth on unchanged content.

How SMART Mode Works

When you use CacheMode.SMART, Crawl4AI:

Retrieves cached content (if available)
Sends a HEAD request with conditional headers (ETag, Last-Modified)
Validates the response:
- If server returns 304 Not Modified → uses cache
- If content changed → performs fresh crawl
- If headers indicate changes → performs fresh crawl

Benefits

Bandwidth Efficient: Only downloads full content when necessary
Always Fresh: Ensures you get the latest content when it changes
Cost Effective: Reduces API calls and bandwidth usage
Intelligent: Uses multiple signals to detect changes (ETag, Last-Modified, Content-Length)

Basic Usage

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.cache_context import CacheMode
from crawl4ai.async_configs import CrawlerRunConfig

async def smart_crawl():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # First crawl - caches the content
        config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
        result1 = await crawler.arun(
            url="https://example.com",
            config=config
        )
        print(f"First crawl: {len(result1.html)} bytes")
        
        # Second crawl - uses SMART mode
        smart_config = CrawlerRunConfig(cache_mode=CacheMode.SMART)
        result2 = await crawler.arun(
            url="https://example.com",
            config=smart_config
        )
        print(f"SMART crawl: {len(result2.html)} bytes (from cache if unchanged)")

asyncio.run(smart_crawl())

When to Use SMART Mode

SMART mode is ideal for:

Periodic crawling of websites that update irregularly
News sites where you want fresh content but avoid re-downloading unchanged pages
API endpoints that provide proper caching headers
Large-scale crawling where bandwidth costs are significant

How It Detects Changes

SMART mode checks these signals in order:

304 Not Modified status (most reliable)
Content-Digest header (RFC 9530)
Strong ETag comparison
Last-Modified timestamp
Content-Length changes (as a hint)

Example: News Site Monitoring

async def monitor_news_site():
    async with AsyncWebCrawler(verbose=True) as crawler:
        config = CrawlerRunConfig(cache_mode=CacheMode.SMART)
        
        # Check multiple times
        for i in range(3):
            result = await crawler.arun(
                url="https://news.ycombinator.com",
                config=config
            )
            
            # SMART mode will only re-crawl if content changed
            print(f"Check {i+1}: Retrieved {len(result.html)} bytes")
            await asyncio.sleep(300)  # Wait 5 minutes

asyncio.run(monitor_news_site())

Understanding SMART Mode Logs

When using SMART mode with verbose=True, you'll see informative logs:

[SMART] ℹ SMART cache: 304 Not Modified - Content unchanged - Using cache for https://example.com
[SMART] ℹ SMART cache: Content-Length changed (12345 -> 12789) - Re-crawling https://example.com
[SMART] ℹ SMART cache: No definitive cache headers matched - Assuming content changed - Re-crawling https://example.com

Limitations

Some servers don't properly support HEAD requests
Dynamic content without proper cache headers will always be re-crawled
Content changes must be reflected in HTTP headers for detection

Advanced Example

For a complete example demonstrating SMART mode with both static and dynamic content, check out docs/examples/smart_cache.py.

Cache Mode Reference

Mode	Read from Cache	Write to Cache	Use Case
`ENABLED`	✓	✓	Normal operation
`DISABLED`	✗	✗	No caching needed
`READ_ONLY`	✓	✗	Use existing cache only
`WRITE_ONLY`	✗	✓	Refresh cache only
`BYPASS`	✗	✗	Skip cache for this request
`SMART`	✓*	✓	Validate before using cache

*SMART mode reads from cache but validates it first with a HEAD request.

6.7 KiB Raw Permalink Blame History Unescape Escape