feat(crawl4ai): Implement SMART cache mode
This commit introduces a new cache mode, SMART, to the crawl4ai library. SMART mode intelligently validates cached content using HEAD requests before using it, saving significant bandwidth while ensuring fresh content.

Code changes (in the crawl4ai directory):

- async_webcrawler.py: checks for the SMART cache mode and performs a HEAD request to see whether the content has changed. If it has, the URL is re-crawled; otherwise, the cached result is used.
- cache_context.py and utils.py: updated to support these changes.

Documentation changes:

- cache-modes.md: adds a detailed explanation of the SMART mode, its logs, its limitations, and an advanced example.
- examples.md: adds a link to the SMART Cache Mode example.
- quickstart.md: mentions the SMART mode in the note about cache modes.

These changes improve the efficiency of the crawl4ai library by reducing unnecessary re-crawling and bandwidth usage.

BREAKING CHANGE: The introduction of the SMART cache mode may affect existing code that uses the crawl4ai library and does not expect this new mode. Users should review the updated documentation to understand how to use the new mode.
@@ -19,6 +19,7 @@ The new system uses a single `CacheMode` enum:
- `CacheMode.READ_ONLY`: Only read from cache
- `CacheMode.WRITE_ONLY`: Only write to cache
- `CacheMode.BYPASS`: Skip cache for this operation
- `CacheMode.SMART`: **NEW** - Intelligently validate cache with HEAD requests

## Migration Example

@@ -72,4 +73,128 @@ if __name__ == "__main__":
| `bypass_cache=True` | `cache_mode=CacheMode.BYPASS` |
| `disable_cache=True` | `cache_mode=CacheMode.DISABLED` |
| `no_cache_read=True` | `cache_mode=CacheMode.WRITE_ONLY` |
| `no_cache_write=True` | `cache_mode=CacheMode.READ_ONLY` |

## SMART Cache Mode: Only Crawl When Content Changes

Starting from version 0.7.1, Crawl4AI introduces the **SMART cache mode** - an intelligent caching strategy that validates cached content before using it. This mode uses HTTP HEAD requests to check if content has changed, potentially saving 70-95% bandwidth on unchanged content.

### How SMART Mode Works

When you use `CacheMode.SMART`, Crawl4AI:

1. **Retrieves cached content** (if available)
2. **Sends a HEAD request** with conditional headers (ETag, Last-Modified)
3. **Validates the response**:
   - If the server returns `304 Not Modified` → uses the cache
   - If the headers indicate the content has changed → performs a fresh crawl
   - If no definitive headers are available → assumes a change and performs a fresh crawl
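
The validation handshake in steps 2-3 can be sketched as a pair of pure helper functions. This is an illustrative sketch of the idea, not crawl4ai's actual implementation; `CachedPage`, `conditional_headers`, and `should_recrawl` are assumed names:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CachedPage:
    # Validators saved alongside the cached HTML (illustrative structure)
    etag: Optional[str] = None
    last_modified: Optional[str] = None

def conditional_headers(cached: CachedPage) -> dict:
    """Build the conditional headers to send with the HEAD request."""
    headers = {}
    if cached.etag:
        headers["If-None-Match"] = cached.etag
    if cached.last_modified:
        headers["If-Modified-Since"] = cached.last_modified
    return headers

def should_recrawl(status: int, response_etag: Optional[str],
                   cached: CachedPage) -> bool:
    """Decide whether to re-crawl based on the HEAD response."""
    if status == 304:
        return False  # server confirmed nothing changed
    if cached.etag and response_etag == cached.etag:
        return False  # same validator -> cache is still valid
    return True       # anything else is treated as changed
```

A well-behaved server that supports validators will answer the conditional HEAD with `304`, so the full body is never transferred for unchanged pages.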

### Benefits

- **Bandwidth Efficient**: Only downloads full content when necessary
- **Always Fresh**: Ensures you get the latest content when it changes
- **Cost Effective**: Reduces API calls and bandwidth usage
- **Intelligent**: Uses multiple signals to detect changes (ETag, Last-Modified, Content-Length)

### Basic Usage

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.cache_context import CacheMode
from crawl4ai.async_configs import CrawlerRunConfig

async def smart_crawl():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # First crawl - caches the content
        config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
        result1 = await crawler.arun(
            url="https://example.com",
            config=config
        )
        print(f"First crawl: {len(result1.html)} bytes")

        # Second crawl - uses SMART mode
        smart_config = CrawlerRunConfig(cache_mode=CacheMode.SMART)
        result2 = await crawler.arun(
            url="https://example.com",
            config=smart_config
        )
        print(f"SMART crawl: {len(result2.html)} bytes (from cache if unchanged)")

asyncio.run(smart_crawl())
```

### When to Use SMART Mode

SMART mode is ideal for:

- **Periodic crawling** of websites that update irregularly
- **News sites** where you want fresh content but avoid re-downloading unchanged pages
- **API endpoints** that provide proper caching headers
- **Large-scale crawling** where bandwidth costs are significant

### How It Detects Changes

SMART mode checks these signals in order:

1. **304 Not Modified** status (most reliable)
2. **Content-Digest** header (RFC 9530)
3. **Strong ETag** comparison
4. **Last-Modified** timestamp
5. **Content-Length** changes (as a hint)
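
The cascade above can be sketched as a single decision function that falls through from the strongest signal to the weakest. This is illustrative only; `detect_change` and the header dictionaries are assumptions, not crawl4ai internals:

```python
from typing import Tuple

def detect_change(status: int, cached: dict, fresh: dict) -> Tuple[bool, str]:
    """Return (changed, reason), checking signals from strongest to weakest.

    `cached` and `fresh` map lowercase header names to values, e.g.
    {"etag": '"abc"', "last-modified": "...", "content-length": "123"}.
    """
    if status == 304:
        return False, "304 Not Modified"
    # Fall through the weaker signals in priority order
    for header, reason in [("content-digest", "RFC 9530 digest"),
                           ("etag", "strong ETag"),
                           ("last-modified", "Last-Modified"),
                           ("content-length", "Content-Length hint")]:
        old, new = cached.get(header), fresh.get(header)
        if old is not None and new is not None:
            return old != new, reason
    # No usable validators at all: assume the content changed
    return True, "no definitive cache headers"
```

Note the conservative default at the end: when a server provides no comparable headers, the safe choice is to re-crawl, which matches the "Assuming content changed" log line shown below.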

### Example: News Site Monitoring

```python
async def monitor_news_site():
    async with AsyncWebCrawler(verbose=True) as crawler:
        config = CrawlerRunConfig(cache_mode=CacheMode.SMART)

        # Check multiple times
        for i in range(3):
            result = await crawler.arun(
                url="https://news.ycombinator.com",
                config=config
            )

            # SMART mode will only re-crawl if content changed
            print(f"Check {i+1}: Retrieved {len(result.html)} bytes")
            await asyncio.sleep(300)  # Wait 5 minutes

asyncio.run(monitor_news_site())
```

### Understanding SMART Mode Logs

When using SMART mode with `verbose=True`, you'll see informative logs:

```
[SMART] ℹ SMART cache: 304 Not Modified - Content unchanged - Using cache for https://example.com
[SMART] ℹ SMART cache: Content-Length changed (12345 -> 12789) - Re-crawling https://example.com
[SMART] ℹ SMART cache: No definitive cache headers matched - Assuming content changed - Re-crawling https://example.com
```

### Limitations

- Some servers don't properly support HEAD requests
- Dynamic content without proper cache headers will always be re-crawled
- Content changes must be reflected in HTTP headers for detection

### Advanced Example

For a complete example demonstrating SMART mode with both static and dynamic content, check out `docs/examples/smart_cache.py`.

## Cache Mode Reference

| Mode | Read from Cache | Write to Cache | Use Case |
|------|-----------------|----------------|----------|
| `ENABLED` | ✓ | ✓ | Normal operation |
| `DISABLED` | ✗ | ✗ | No caching needed |
| `READ_ONLY` | ✓ | ✗ | Use existing cache only |
| `WRITE_ONLY` | ✗ | ✓ | Refresh cache only |
| `BYPASS` | ✗ | ✗ | Skip cache for this request |
| `SMART` | ✓* | ✓ | Validate before using cache |

*SMART mode reads from cache but validates it first with a HEAD request.

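
If you need the table's semantics programmatically, e.g. to reason about modes in your own wrapper code, it boils down to a small lookup. This is illustrative only; `CACHE_BEHAVIOR` is not part of the crawl4ai API:

```python
# Read/write behavior per mode, mirroring the reference table above
CACHE_BEHAVIOR = {
    "ENABLED":    {"read": True,  "write": True},
    "DISABLED":   {"read": False, "write": False},
    "READ_ONLY":  {"read": True,  "write": False},
    "WRITE_ONLY": {"read": False, "write": True},
    "BYPASS":     {"read": False, "write": False},
    "SMART":      {"read": True,  "write": True},  # read is validated first
}

def reads_cache(mode: str) -> bool:
    """True if the given mode may serve a response from cache."""
    return CACHE_BEHAVIOR[mode]["read"]
```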
@@ -37,6 +37,12 @@ This page provides a comprehensive list of example scripts that demonstrate vari

| Storage State | Tutorial on managing browser storage state for persistence. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/storage_state_tutorial.md) |
| Network Console Capture | Demonstrates how to capture and analyze network requests and console logs. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/network_console_capture_example.py) |

## Caching & Performance

| Example | Description | Link |
|---------|-------------|------|
| SMART Cache Mode | Demonstrates the intelligent SMART cache mode that validates cached content using HEAD requests, saving 70-95% bandwidth while ensuring fresh content. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/smart_cache.py) |

## Extraction Strategies

| Example | Description | Link |

@@ -79,7 +79,7 @@ if __name__ == "__main__":

    asyncio.run(main())
```

> IMPORTANT: By default cache mode is set to `CacheMode.ENABLED`. So to have fresh content, you need to set it to `CacheMode.BYPASS`. For intelligent caching that validates content before using cache, use the new `CacheMode.SMART` - it saves bandwidth while ensuring fresh content.

We’ll explore more advanced config in later tutorials (like enabling proxies, PDF output, multi-tab sessions, etc.). For now, just note how you pass these objects to manage crawling.