feat(crawl4ai): Implement adaptive crawling feature

This commit introduces adaptive crawling to crawl4ai. Adaptive crawling intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage.

The changes add the adaptive crawler module along with its utility functions and the supporting configuration and strategy scripts; modify the package's __init__.py and utils.py; and update the documentation with detailed explanations and examples of the feature.

Adaptive crawling significantly enhances crawl4ai, giving users a more efficient and intelligent web crawling tool.

Significant modifications:
- Added adaptive_crawler.py and related scripts
- Modified __init__.py and utils.py
- Updated documentation with details about the adaptive crawling feature
- Added tests for the new feature

BREAKING CHANGE: This feature addition may affect the overall behavior of the crawl4ai project. Review the updated documentation to understand how to use the new feature.

Refs: #123, #456
UncleCode
2025-07-04 15:16:53 +08:00
parent 74705c1f67
commit 1a73fb60db
29 changed files with 8800 additions and 3 deletions


@@ -0,0 +1,432 @@
# Advanced Adaptive Strategies
## Overview
While the default adaptive crawling configuration works well for most use cases, understanding the underlying strategies and scoring mechanisms allows you to fine-tune the crawler for specific domains and requirements.
## The Three-Layer Scoring System
### 1. Coverage Score
Coverage measures how comprehensively your knowledge base covers the query terms and related concepts.
#### Mathematical Foundation
```python
Coverage(K, Q) = Σ(t ∈ Q) score(t, K) / |Q|
where score(t, K) = doc_coverage(t) × (1 + freq_boost(t))
```
#### Components
- **Document Coverage**: Percentage of documents containing the term
- **Frequency Boost**: Logarithmic bonus for term frequency
- **Query Decomposition**: Handles multi-word queries intelligently
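To make the formula concrete, here is a minimal sketch of the coverage calculation, assuming the knowledge base is modeled as a list of per-document term sets plus a global term-frequency table (the names and the exact boost function are illustrative assumptions, not the library's internals):
```python
import math

def coverage(doc_term_sets: list[set[str]], term_freqs: dict[str, int],
             query_terms: list[str]) -> float:
    """Coverage(K, Q) per the formula above (illustrative sketch)."""
    total = 0.0
    for t in query_terms:
        # doc_coverage(t): fraction of documents containing the term
        doc_cov = sum(1 for doc in doc_term_sets if t in doc) / max(len(doc_term_sets), 1)
        # freq_boost(t): logarithmic bonus for raw term frequency
        freq_boost = math.log1p(term_freqs.get(t, 0))
        total += doc_cov * (1 + freq_boost)
    return total / max(len(query_terms), 1)
```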
#### Tuning Coverage
```python
# For technical documentation with specific terminology
config = AdaptiveConfig(
    confidence_threshold=0.85,  # Require high coverage
    top_k_links=5               # Cast wider net
)

# For general topics with synonyms
config = AdaptiveConfig(
    confidence_threshold=0.6,   # Lower threshold
    top_k_links=2               # More focused
)
```
### 2. Consistency Score
Consistency evaluates whether the information across pages is coherent and non-contradictory.
#### How It Works
1. Extracts key statements from each document
2. Compares statements across documents
3. Measures agreement vs. contradiction
4. Returns normalized score (0-1)
#### Practical Impact
- **High consistency (>0.8)**: Information is reliable and coherent
- **Medium consistency (0.5-0.8)**: Some variation, but generally aligned
- **Low consistency (<0.5)**: Conflicting information, need more sources
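As a rough approximation of this score, the toy function below measures agreement as the mean pairwise Jaccard overlap between documents' term sets; the real strategy compares extracted statements, but the shape of the computation is similar:
```python
from itertools import combinations

def consistency(doc_term_sets: list[set[str]]) -> float:
    """Mean pairwise Jaccard overlap as a 0-1 coherence proxy."""
    if len(doc_term_sets) < 2:
        return 1.0  # a single document cannot contradict itself
    sims = []
    for a, b in combinations(doc_term_sets, 2):
        union = a | b
        sims.append(len(a & b) / len(union) if union else 0.0)
    return sum(sims) / len(sims)
```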
### 3. Saturation Score
Saturation detects when new pages stop providing novel information.
#### Detection Algorithm
```python
# Tracks new unique terms per page
new_terms_page_1 = 50
new_terms_page_2 = 30 # 60% of first
new_terms_page_3 = 15 # 50% of second
new_terms_page_4 = 5 # 33% of third
# Saturation detected: rapidly diminishing returns
```
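A minimal stopping rule built on this trend might look like the following (an illustrative heuristic; the library's internal normalization may differ):
```python
def is_saturated(new_terms_history: list[int], min_gain_threshold: float = 0.1) -> bool:
    """Saturated when the latest page adds fewer new terms than the
    threshold fraction of all terms seen so far."""
    total = sum(new_terms_history)
    if total == 0 or len(new_terms_history) < 2:
        return False
    return new_terms_history[-1] / total < min_gain_threshold

print(is_saturated([50, 30, 15, 5]))  # True: page 4 added only 5 of 100 terms
```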
#### Configuration
```python
config = AdaptiveConfig(
    min_gain_threshold=0.1  # Stop if <10% new information
)
```
## Link Ranking Algorithm
### Expected Information Gain
Each uncrawled link is scored based on:
```python
ExpectedGain(link) = Relevance × Novelty × Authority
```
#### 1. Relevance Scoring
Uses BM25 algorithm on link preview text:
```python
relevance = BM25(link.preview_text, query)
```
Factors:
- Term frequency in preview
- Inverse document frequency
- Preview length normalization
#### 2. Novelty Estimation
Measures how different the link appears from already-crawled content:
```python
novelty = 1 - max_similarity(preview, knowledge_base)
```
Prevents crawling duplicate or highly similar pages.
#### 3. Authority Calculation
URL structure and domain analysis:
```python
authority = f(domain_rank, url_depth, url_structure)
```
Factors:
- Domain reputation
- URL depth (fewer slashes = higher authority)
- Clean URL structure
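Putting the three factors together, a simplified scorer could look like this sketch; the multiplicative form matches the formula above, while the depth penalty and constants are illustrative assumptions:
```python
def expected_gain(relevance: float, novelty: float, authority: float) -> float:
    """Multiplicative combination: a link must score on all three axes."""
    return relevance * novelty * authority

def url_authority(url: str) -> float:
    """Toy authority signal: shallower, cleaner URLs rank higher."""
    depth = url.rstrip("/").count("/") - 2  # ignore the scheme's two slashes
    penalty = 0.15 * max(depth, 0) + (0.2 if "?" in url else 0.0)
    return max(1.0 - penalty, 0.1)

print(url_authority("https://docs.example.com/api"))          # shallow: 0.85
print(url_authority("https://docs.example.com/a/b/c/d?x=1"))  # deep + query string: 0.2
```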
### Custom Link Scoring
```python
class CustomLinkScorer:
    def score(self, link: Link, query: str, state: CrawlState) -> float:
        # Prioritize specific URL patterns
        if "/api/reference/" in link.href:
            return 2.0  # Double the score

        # Deprioritize certain sections
        if "/archive/" in link.href:
            return 0.1  # Reduce score by 90%

        # Default scoring
        return 1.0

# Use with adaptive crawler
adaptive = AdaptiveCrawler(
    crawler,
    config=config,
    link_scorer=CustomLinkScorer()
)
```
## Domain-Specific Configurations
### Technical Documentation
```python
tech_doc_config = AdaptiveConfig(
    confidence_threshold=0.85,
    max_pages=30,
    top_k_links=3,
    min_gain_threshold=0.05  # Keep crawling for small gains
)
```
Rationale:
- High threshold ensures comprehensive coverage
- Lower gain threshold captures edge cases
- Moderate link following for depth
### News & Articles
```python
news_config = AdaptiveConfig(
    confidence_threshold=0.6,
    max_pages=10,
    top_k_links=5,
    min_gain_threshold=0.15  # Stop quickly on repetition
)
```
Rationale:
- Lower threshold (articles often repeat information)
- Higher gain threshold (avoid duplicate stories)
- More links per page (explore different perspectives)
### E-commerce
```python
ecommerce_config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=2,
    min_gain_threshold=0.1
)
```
Rationale:
- Balanced threshold for product variations
- Focused link following (avoid infinite products)
- Standard gain threshold
### Research & Academic
```python
research_config = AdaptiveConfig(
    confidence_threshold=0.9,
    max_pages=50,
    top_k_links=4,
    min_gain_threshold=0.02  # Very low - capture citations
)
```
Rationale:
- Very high threshold for completeness
- Many pages allowed for thorough research
- Very low gain threshold to capture references
## Performance Optimization
### Memory Management
```python
# For large crawls, use streaming
config = AdaptiveConfig(
    max_pages=100,
    save_state=True,
    state_path="large_crawl.json"
)

# Periodically clean state
if len(state.knowledge_base) > 1000:
    # Keep only the most relevant documents
    # (get_top_relevant is a user-supplied helper, not a library function)
    state.knowledge_base = get_top_relevant(state.knowledge_base, 500)
```
### Parallel Processing
```python
import asyncio

# Use multiple start points
start_urls = [
    "https://docs.example.com/intro",
    "https://docs.example.com/api",
    "https://docs.example.com/guides"
]

# Crawl in parallel
tasks = [
    adaptive.digest(url, query)
    for url in start_urls
]
results = await asyncio.gather(*tasks)
```
### Caching Strategy
```python
# Enable caching for repeated crawls
# (cache_mode is a CrawlerRunConfig option rather than a BrowserConfig one)
from crawl4ai import CacheMode, CrawlerRunConfig

run_config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)

async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler, config)
```
## Debugging & Analysis
### Enable Verbose Logging
```python
import logging
logging.basicConfig(level=logging.DEBUG)
adaptive = AdaptiveCrawler(crawler, config, verbose=True)
```
### Analyze Crawl Patterns
```python
# After crawling
state = await adaptive.digest(start_url, query)
# Analyze link selection
print("Link selection order:")
for i, url in enumerate(state.crawl_order):
    print(f"{i+1}. {url}")

# Analyze term discovery
print("\nTerm discovery rate:")
for i, new_terms in enumerate(state.new_terms_history):
    print(f"Page {i+1}: {new_terms} new terms")
# Analyze score progression
print("\nScore progression:")
print(f"Coverage: {state.metrics['coverage_history']}")
print(f"Saturation: {state.metrics['saturation_history']}")
```
### Export for Analysis
```python
# Export detailed metrics
import json
metrics = {
    "query": query,
    "total_pages": len(state.crawled_urls),
    "confidence": adaptive.confidence,
    "coverage_stats": adaptive.coverage_stats,
    "crawl_order": state.crawl_order,
    "term_frequencies": dict(state.term_frequencies),
    "new_terms_history": state.new_terms_history
}

with open("crawl_analysis.json", "w") as f:
    json.dump(metrics, f, indent=2)
```
## Custom Strategies
### Implementing a Custom Strategy
```python
from typing import List

from crawl4ai.adaptive_crawler import BaseStrategy

class DomainSpecificStrategy(BaseStrategy):
    def calculate_coverage(self, state: CrawlState) -> float:
        # Custom coverage calculation
        # e.g., weight certain terms more heavily
        pass

    def calculate_consistency(self, state: CrawlState) -> float:
        # Custom consistency logic
        # e.g., domain-specific validation
        pass

    def rank_links(self, links: List[Link], state: CrawlState) -> List[Link]:
        # Custom link ranking
        # e.g., prioritize specific URL patterns
        pass

# Use custom strategy
adaptive = AdaptiveCrawler(
    crawler,
    config=config,
    strategy=DomainSpecificStrategy()
)
```
### Combining Strategies
```python
class HybridStrategy(BaseStrategy):
    def __init__(self):
        self.strategies = [
            TechnicalDocStrategy(),
            SemanticSimilarityStrategy(),
            URLPatternStrategy()
        ]

    def calculate_confidence(self, state: CrawlState) -> float:
        # Weighted combination of strategies
        scores = [s.calculate_confidence(state) for s in self.strategies]
        weights = [0.5, 0.3, 0.2]
        return sum(s * w for s, w in zip(scores, weights))
```
## Best Practices
### 1. Start Conservative
Begin with default settings and adjust based on results:
```python
# Start with defaults
result = await adaptive.digest(url, query)
# Analyze and adjust
if adaptive.confidence < 0.7:
    config.max_pages += 10
    config.confidence_threshold -= 0.1
```
### 2. Monitor Resource Usage
```python
import psutil
# Check memory before large crawls
memory_percent = psutil.virtual_memory().percent
if memory_percent > 80:
    config.max_pages = min(config.max_pages, 20)
```
### 3. Use Domain Knowledge
```python
# For API documentation
if "api" in start_url:
config.top_k_links = 2 # APIs have clear structure
# For blogs
if "blog" in start_url:
config.min_gain_threshold = 0.2 # Avoid similar posts
```
### 4. Validate Results
```python
# Always validate the knowledge base
relevant_content = adaptive.get_relevant_content(top_k=10)
# Check coverage
query_terms = set(query.lower().split())
covered_terms = set()
for doc in relevant_content:
    content_lower = doc['content'].lower()
    for term in query_terms:
        if term in content_lower:
            covered_terms.add(term)
coverage_ratio = len(covered_terms) / len(query_terms)
print(f"Query term coverage: {coverage_ratio:.0%}")
```
## Next Steps
- Explore [Custom Strategy Implementation](../tutorials/custom-adaptive-strategies.md)
- Learn about [Knowledge Base Management](../tutorials/knowledge-base-management.md)
- See [Performance Benchmarks](../benchmarks/adaptive-performance.md)


@@ -0,0 +1,244 @@
# AdaptiveCrawler
The `AdaptiveCrawler` class implements intelligent web crawling that automatically determines when sufficient information has been gathered to answer a query. It uses a three-layer scoring system to evaluate coverage, consistency, and saturation.
## Constructor
```python
AdaptiveCrawler(
    crawler: AsyncWebCrawler,
    config: Optional[AdaptiveConfig] = None
)
```
### Parameters
- **crawler** (`AsyncWebCrawler`): The underlying web crawler instance to use for fetching pages
- **config** (`Optional[AdaptiveConfig]`): Configuration settings for adaptive crawling behavior. If not provided, uses default settings.
## Primary Method
### digest()
The main method that performs adaptive crawling starting from a URL with a specific query.
```python
async def digest(
    start_url: str,
    query: str,
    resume_from: Optional[Union[str, Path]] = None
) -> CrawlState
```
#### Parameters
- **start_url** (`str`): The starting URL for crawling
- **query** (`str`): The search query that guides the crawling process
- **resume_from** (`Optional[Union[str, Path]]`): Path to a saved state file to resume from
#### Returns
- **CrawlState**: The final crawl state containing all crawled URLs, knowledge base, and metrics
#### Example
```python
async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler)
    state = await adaptive.digest(
        start_url="https://docs.python.org",
        query="async context managers"
    )
```
## Properties
### confidence
Current confidence score (0-1) indicating information sufficiency.
```python
@property
def confidence(self) -> float
```
### coverage_stats
Dictionary containing detailed coverage statistics.
```python
@property
def coverage_stats(self) -> Dict[str, float]
```
Returns:
- **coverage**: Query term coverage score
- **consistency**: Information consistency score
- **saturation**: Content saturation score
- **confidence**: Overall confidence score
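#### Example
```python
# After a completed crawl, the documented keys can be read directly
stats = adaptive.coverage_stats
print(f"coverage={stats['coverage']:.2f}  consistency={stats['consistency']:.2f}")
print(f"saturation={stats['saturation']:.2f}  confidence={stats['confidence']:.2f}")
```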
### is_sufficient
Boolean indicating whether sufficient information has been gathered.
```python
@property
def is_sufficient(self) -> bool
```
### state
Access to the current crawl state.
```python
@property
def state(self) -> CrawlState
```
## Methods
### get_relevant_content()
Retrieve the most relevant content from the knowledge base.
```python
def get_relevant_content(
    self,
    top_k: int = 5
) -> List[Dict[str, Any]]
```
#### Parameters
- **top_k** (`int`): Number of top relevant documents to return (default: 5)
#### Returns
List of dictionaries containing:
- **url**: The URL of the page
- **content**: The page content
- **score**: Relevance score
- **metadata**: Additional page metadata
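#### Example
```python
# Print the top pages using the documented result fields
for doc in adaptive.get_relevant_content(top_k=3):
    print(f"{doc['score']:.2f}  {doc['url']}")
```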
### print_stats()
Display crawl statistics in formatted output.
```python
def print_stats(
    self,
    detailed: bool = False
) -> None
```
#### Parameters
- **detailed** (`bool`): If True, shows detailed metrics with colors. If False, shows summary table.
### export_knowledge_base()
Export the collected knowledge base to a JSONL file.
```python
def export_knowledge_base(
    self,
    path: Union[str, Path]
) -> None
```
#### Parameters
- **path** (`Union[str, Path]`): Output file path for JSONL export
#### Example
```python
adaptive.export_knowledge_base("my_knowledge.jsonl")
```
### import_knowledge_base()
Import a previously exported knowledge base.
```python
def import_knowledge_base(
    self,
    path: Union[str, Path]
) -> None
```
#### Parameters
- **path** (`Union[str, Path]`): Path to JSONL file to import
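#### Example
```python
# Rebuild a session from the export shown above
new_adaptive = AdaptiveCrawler(crawler)
new_adaptive.import_knowledge_base("my_knowledge.jsonl")
print(len(new_adaptive.state.knowledge_base), "documents loaded")
```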
## Configuration
The `AdaptiveConfig` class controls the behavior of adaptive crawling:
```python
@dataclass
class AdaptiveConfig:
    confidence_threshold: float = 0.8   # Stop when confidence reaches this
    max_pages: int = 50                 # Maximum pages to crawl
    top_k_links: int = 5                # Links to follow per page
    min_gain_threshold: float = 0.1     # Minimum expected gain to continue
    save_state: bool = False            # Auto-save crawl state
    state_path: Optional[str] = None    # Path for state persistence
```
### Example with Custom Config
```python
config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=3
)
adaptive = AdaptiveCrawler(crawler, config=config)
```
## Complete Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig

async def main():
    # Configure adaptive crawling
    config = AdaptiveConfig(
        confidence_threshold=0.75,
        max_pages=15,
        save_state=True,
        state_path="my_crawl.json"
    )

    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config)

        # Start crawling
        state = await adaptive.digest(
            start_url="https://example.com/docs",
            query="authentication oauth2 jwt"
        )

        # Check results
        print(f"Confidence achieved: {adaptive.confidence:.0%}")
        adaptive.print_stats()

        # Get most relevant pages
        for page in adaptive.get_relevant_content(top_k=3):
            print(f"- {page['url']} (score: {page['score']:.2f})")

        # Export for later use
        adaptive.export_knowledge_base("auth_knowledge.jsonl")

if __name__ == "__main__":
    asyncio.run(main())
```
## See Also
- [digest() Method Reference](digest.md)
- [Adaptive Crawling Guide](../core/adaptive-crawling.md)
- [Advanced Adaptive Strategies](../advanced/adaptive-strategies.md)

docs/md_v2/api/digest.md

@@ -0,0 +1,181 @@
# digest()
The `digest()` method is the primary interface for adaptive web crawling. It intelligently crawls websites starting from a given URL, guided by a query, and automatically determines when sufficient information has been gathered.
## Method Signature
```python
async def digest(
    start_url: str,
    query: str,
    resume_from: Optional[Union[str, Path]] = None
) -> CrawlState
```
## Parameters
### start_url
- **Type**: `str`
- **Required**: Yes
- **Description**: The starting URL for the crawl. This should be a valid HTTP/HTTPS URL that serves as the entry point for information gathering.
### query
- **Type**: `str`
- **Required**: Yes
- **Description**: The search query that guides the crawling process. This should contain key terms related to the information you're seeking. The crawler uses this to evaluate relevance and determine which links to follow.
### resume_from
- **Type**: `Optional[Union[str, Path]]`
- **Default**: `None`
- **Description**: Path to a previously saved crawl state file. When provided, the crawler resumes from the saved state instead of starting fresh.
## Return Value
Returns a `CrawlState` object containing:
- **crawled_urls** (`Set[str]`): All URLs that have been crawled
- **knowledge_base** (`List[CrawlResult]`): Collection of crawled pages with content
- **pending_links** (`List[Link]`): Links discovered but not yet crawled
- **metrics** (`Dict[str, float]`): Performance and quality metrics
- **query** (`str`): The original query
- Additional statistical information for scoring
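For example, the returned state can be inspected directly:
```python
state = await adaptive.digest("https://docs.example.com", "api auth")
print(f"{len(state.crawled_urls)} pages crawled, "
      f"{len(state.pending_links)} links left unexplored")
print(state.metrics)
```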
## How It Works
The `digest()` method implements an intelligent crawling algorithm:
1. **Initial Crawl**: Starts from the provided URL
2. **Link Analysis**: Evaluates all discovered links for relevance
3. **Scoring**: Uses three metrics to assess information sufficiency:
- **Coverage**: How well the query terms are covered
- **Consistency**: Information coherence across pages
- **Saturation**: Diminishing returns detection
4. **Adaptive Selection**: Chooses the most promising links to follow
5. **Stopping Decision**: Automatically stops when confidence threshold is reached
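The control flow can be pictured with the following schematic (pseudocode with stand-in helpers such as `fetch`, `rank_links`, and `confidence`; the actual implementation lives inside `AdaptiveCrawler`):
```python
# Schematic pseudocode of the digest() loop - stand-in helpers, not the real API
async def digest_sketch(start_url, query, config):
    state = init_state(query)                    # empty knowledge base
    frontier = [start_url]
    while frontier and len(state.crawled_urls) < config.max_pages:
        page = await fetch(frontier.pop(0))      # 1. crawl next page
        update_knowledge(state, page)            # update terms and metrics
        if confidence(state) >= config.confidence_threshold:
            break                                # 5. enough information
        ranked = rank_links(page.links, query, state)  # 2-3. analyze and score
        frontier = ranked[:config.top_k_links]         # 4. adaptive selection
    return state
```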
## Examples
### Basic Usage
```python
async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler)
    state = await adaptive.digest(
        start_url="https://docs.python.org/3/",
        query="async await context managers"
    )

    print(f"Crawled {len(state.crawled_urls)} pages")
    print(f"Confidence: {adaptive.confidence:.0%}")
```
### With Configuration
```python
config = AdaptiveConfig(
    confidence_threshold=0.9,  # Require high confidence
    max_pages=30,              # Allow more pages
    top_k_links=3              # Follow top 3 links per page
)

adaptive = AdaptiveCrawler(crawler, config=config)
state = await adaptive.digest(
    start_url="https://api.example.com/docs",
    query="authentication endpoints rate limits"
)
```
### Resuming a Previous Crawl
```python
# First crawl - may be interrupted
state1 = await adaptive.digest(
    start_url="https://example.com",
    query="machine learning algorithms"
)

# Save state (if not auto-saved)
state1.save("ml_crawl_state.json")

# Later, resume from saved state
state2 = await adaptive.digest(
    start_url="https://example.com",
    query="machine learning algorithms",
    resume_from="ml_crawl_state.json"
)
```
### With Progress Monitoring
```python
state = await adaptive.digest(
start_url="https://docs.example.com",
query="api reference"
)
# Monitor progress
print(f"Pages crawled: {len(state.crawled_urls)}")
print(f"New terms discovered: {state.new_terms_history}")
print(f"Final confidence: {adaptive.confidence:.2%}")
# View detailed statistics
adaptive.print_stats(detailed=True)
```
## Query Best Practices
1. **Be Specific**: Use descriptive terms that appear in target content
```python
# Good
query = "python async context managers implementation"
# Too broad
query = "python programming"
```
2. **Include Key Terms**: Add technical terms you expect to find
```python
query = "oauth2 jwt refresh tokens authorization"
```
3. **Multiple Concepts**: Combine related concepts for comprehensive coverage
```python
query = "rest api pagination sorting filtering"
```
## Performance Considerations
- **Initial URL**: Choose a page with good navigation (e.g., documentation index)
- **Query Length**: 3-8 terms typically work best
- **Link Density**: Sites with clear navigation crawl more efficiently
- **Caching**: Enable caching for repeated crawls of the same domain
## Error Handling
```python
try:
    state = await adaptive.digest(
        start_url="https://example.com",
        query="search terms"
    )
except Exception as e:
    print(f"Crawl failed: {e}")
    # State is auto-saved if save_state=True in config
```
## Stopping Conditions
The crawl stops when any of these conditions are met:
1. **Confidence Threshold**: Reached the configured confidence level
2. **Page Limit**: Crawled the maximum number of pages
3. **Diminishing Returns**: Expected information gain below threshold
4. **No Relevant Links**: No promising links remain to follow
## See Also
- [AdaptiveCrawler Class](adaptive-crawler.md)
- [Adaptive Crawling Guide](../core/adaptive-crawling.md)
- [Configuration Options](../core/adaptive-crawling.md#configuration-options)


@@ -0,0 +1,369 @@
# Adaptive Crawling: Building Dynamic Knowledge That Grows on Demand
*Published on January 29, 2025 • 8 min read*
*By [unclecode](https://x.com/unclecode) • Follow me on [X/Twitter](https://x.com/unclecode) for more web scraping insights*
---
## The Knowledge Capacitor
Imagine a capacitor that stores energy, releasing it precisely when needed. Now imagine that for information. That's Adaptive Crawling—a term I coined to describe a fundamentally different approach to web crawling. Instead of the brute force of traditional deep crawling, we build knowledge dynamically, growing it based on queries and circumstances, like a living organism responding to its environment.
This isn't just another crawling optimization. It's a paradigm shift from "crawl everything, hope for the best" to "crawl intelligently, know when to stop."
## Why I Built This
I've watched too many startups burn through resources with a dangerous misconception: that LLMs make everything efficient. They don't. They make things *possible*, not necessarily *smart*. When you combine brute-force crawling with LLM processing, you're not just wasting time—you're hemorrhaging money on tokens, compute, and opportunity cost.
Consider this reality:
- **Traditional deep crawling**: 500 pages → 50 useful → $15 in LLM tokens → 2 hours wasted
- **Adaptive crawling**: 15 pages → 14 useful → $2 in tokens → 10 minutes → **7.5x cost reduction**
But it's not about crawling less. It's about crawling *right*.
## The Information Theory Foundation
<div style="background-color: #1a1a1c; border: 1px solid #3f3f44; padding: 20px; margin: 20px 0;">
### 🧮 **Pure Statistics, No Magic**
My first principle was crucial: start with classic statistical approaches. No embeddings. No LLMs. Just pure information theory:
```python
# Information gain calculation - the heart of adaptive crawling
def calculate_information_gain(new_page, knowledge_base):
    new_terms = extract_terms(new_page) - existing_terms(knowledge_base)
    overlap = calculate_overlap(new_page, knowledge_base)
    # High gain = many new terms + low overlap
    gain = len(new_terms) / (1 + overlap)
    return gain
```
This isn't regression to older methods—it's recognition that we've forgotten powerful, efficient solutions in our rush to apply LLMs everywhere.
</div>
## The A* of Web Crawling
Adaptive crawling implements what I call "information scenting"—like A* pathfinding but for knowledge acquisition. Each link is evaluated not randomly, but by its probability of contributing meaningful information toward answering current and future queries.
<div style="display: flex; align-items: center; background-color: #3f3f44; padding: 20px; margin: 20px 0; border-left: 4px solid #09b5a5;">
<div style="font-size: 48px; margin-right: 20px;">🎯</div>
<div>
<strong>The Scenting Algorithm:</strong><br>
From available links, we select those with highest information gain. It's not about following every path—it's about following the <em>right</em> paths. Like a bloodhound following the strongest scent to its target.
</div>
</div>
## The Three Pillars of Intelligence
### 1. Coverage: The Breadth Sensor
Measures how well your knowledge spans the query space. Not just "do we have pages?" but "do we have the RIGHT pages?"
### 2. Consistency: The Coherence Detector
Information from multiple sources should align. When pages agree, confidence rises. When they conflict, we need more data.
### 3. Saturation: The Efficiency Guardian
The most crucial metric. When new pages stop adding information, we stop crawling. Simple. Powerful. Ignored by everyone else.
## Real Impact: Time, Money, and Sanity
Let me show you what this means for your bottom line:
### Building a Customer Support Knowledge Base
**Traditional Approach:**
```python
# Crawl entire documentation site
results = await crawler.crawl_bfs("https://docs.company.com", max_depth=5)
# Result: 1,200 pages, 18 hours, $150 in API costs
# Useful content: ~100 pages scattered throughout
```
**Adaptive Approach:**
```python
# Grow knowledge based on actual support queries
knowledge = await adaptive.digest(
start_url="https://docs.company.com",
query="payment processing errors refund policies"
)
# Result: 45 pages, 12 minutes, $8 in API costs
# Useful content: 42 pages, all relevant
```
**Savings: 93% time reduction, 95% cost reduction, 100% more sanity**
## The Dynamic Growth Pattern
<div style="text-align: center; padding: 40px; background-color: #1a1a1c; border: 1px dashed #3f3f44; margin: 30px 0;">
<div style="font-size: 24px; color: #09b5a5; margin-bottom: 10px;">
Knowledge grows like crystals in a supersaturated solution
</div>
<div style="color: #a3abba;">
Add a query (seed), and relevant information crystallizes around it.<br>
Change the query, and the knowledge structure adapts.
</div>
</div>
This is the beauty of adaptive crawling: your knowledge base becomes a living entity that grows based on actual needs, not hypothetical completeness.
## Why "Adaptive"?
I specifically chose "Adaptive" because it captures the essence: the system adapts to what it finds. Dense technical documentation might need 20 pages for confidence. A simple FAQ might need just 5. The crawler doesn't follow a recipe—it reads the room and adjusts.
This is my term, my concept, and I have extensive plans for its evolution.
## The Progressive Roadmap
This is just the beginning. My roadmap for Adaptive Crawling:
### Phase 1 (Current): Statistical Foundation
- Pure information theory approach
- No dependencies on expensive models
- Proven efficiency gains
### Phase 2 (Now Available): Embedding Enhancement
- Semantic understanding layered onto statistical base
- Still efficient, now even smarter
- Optional, not required
### Phase 3 (Future): LLM Integration
- LLMs for complex reasoning tasks only
- Used surgically, not wastefully
- Always with statistical foundation underneath
## The Efficiency Revolution
<div style="background-color: #1a1a1c; border: 1px solid #3f3f44; padding: 20px; margin: 20px 0;">
### 💰 **The Economics of Intelligence**
For a typical SaaS documentation crawl:
**Traditional Deep Crawling:**
- Pages crawled: 1,000
- Useful pages: 80
- Time spent: 3 hours
- LLM tokens used: 2.5M
- Cost: $75
- Efficiency: 8%
**Adaptive Crawling:**
- Pages crawled: 95
- Useful pages: 88
- Time spent: 15 minutes
- LLM tokens used: 200K
- Cost: $6
- Efficiency: 93%
**That's not optimization. That's transformation.**
</div>
## Missing the Forest for the Trees
The startup world has a dangerous blind spot. We're so enamored with LLMs that we forget: just because you CAN process everything with an LLM doesn't mean you SHOULD.
Classic NLP and statistical methods can:
- Filter irrelevant content before it reaches LLMs
- Identify patterns without expensive inference
- Make intelligent decisions in microseconds
- Scale without breaking the bank
Adaptive crawling proves this. It uses battle-tested information theory to make smart decisions BEFORE expensive processing.
## Your Knowledge, On Demand
```python
# Monday: Customer asks about authentication
auth_knowledge = await adaptive.digest(
    "https://docs.api.com",
    "oauth jwt authentication"
)

# Tuesday: They ask about rate limiting
# The crawler adapts, builds on existing knowledge
rate_limit_knowledge = await adaptive.digest(
    "https://docs.api.com",
    "rate limiting throttling quotas"
)
# Your knowledge base grows intelligently, not indiscriminately
```
## The Competitive Edge
Companies using adaptive crawling will have:
- **90% lower crawling costs**
- **Knowledge bases that actually answer questions**
- **Update cycles in minutes, not days**
- **Happy customers who find answers fast**
- **Engineers who sleep at night**
Those still using brute force? They'll wonder why their infrastructure costs keep rising while their customers keep complaining.
## The Embedding Evolution (Now Available!)
<div style="background-color: #1a1a1c; border: 1px solid #3f3f44; padding: 20px; margin: 20px 0;">
### 🧠 **Semantic Understanding Without the Cost**
The embedding strategy brings semantic intelligence while maintaining efficiency:
```python
# Statistical strategy - great for exact terms
config_statistical = AdaptiveConfig(
    strategy="statistical"  # Default
)

# Embedding strategy - understands concepts
config_embedding = AdaptiveConfig(
    strategy="embedding",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    n_query_variations=10
)
```
**The magic**: It automatically expands your query into semantic variations, maps the coverage space, and identifies gaps to fill intelligently.
</div>
### Real-World Comparison
<div style="display: flex; gap: 20px; margin: 20px 0;">
<div style="flex: 1; background-color: #1a1a1c; border: 1px solid #3f3f44; padding: 20px;">
**Query**: "authentication oauth"
**Statistical Strategy**:
- Searches for exact terms
- 12 pages crawled
- 78% confidence
- Fast but literal
</div>
<div style="flex: 1; background-color: #1a1a1c; border: 1px solid #09b5a5; padding: 20px;">
**Embedding Strategy**:
- Understands "auth", "login", "SSO"
- 8 pages crawled
- 92% confidence
- Semantic comprehension
</div>
</div>
### Detecting Irrelevance
One killer feature: the embedding strategy knows when to give up:
```python
# Crawling Python docs with a cooking query
result = await adaptive.digest(
start_url="https://docs.python.org/3/",
query="how to make spaghetti carbonara"
)
# System detects irrelevance and stops
# Confidence: 5% (below threshold)
# Pages crawled: 2
# Stopped reason: "below_minimum_relevance_threshold"
```
No more crawling hundreds of pages hoping to find something that doesn't exist!
## Try It Yourself
```python
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
async with AsyncWebCrawler() as crawler:
    # Choose your strategy
    config = AdaptiveConfig(
        strategy="embedding",  # or "statistical"
        embedding_min_confidence_threshold=0.1  # Stop if irrelevant
    )
    adaptive = AdaptiveCrawler(crawler, config)

    # Watch intelligence at work
    result = await adaptive.digest(
        start_url="https://your-docs.com",
        query="your users' actual questions"
    )

    # See the efficiency
    adaptive.print_stats()
    print(f"Found {adaptive.confidence:.0%} of needed information")
    print(f"In just {len(result.crawled_urls)} pages")
    print(f"Saving you {1000 - len(result.crawled_urls)} unnecessary crawls")
```
## A Personal Note
I created Adaptive Crawling because I was tired of watching smart people make inefficient choices. We have incredibly powerful statistical tools that we've forgotten in our rush toward LLMs. This is my attempt to bring balance back to the Force.
This is not just a feature. It's a philosophy: **Grow knowledge on demand. Stop when you have enough. Save time, money, and computational resources for what really matters.**
## The Future is Adaptive
<div style="text-align: center; padding: 40px; background-color: #1a1a1c; border: 1px dashed #3f3f44; margin: 30px 0;">
<div style="font-size: 24px; color: #09b5a5; margin-bottom: 10px;">
Traditional Crawling: Drinking from a firehose<br>
Adaptive Crawling: Sipping exactly what you need
</div>
<div style="color: #a3abba;">
The future of web crawling isn't about processing more data.<br>
It's about processing the <em>right</em> data.
</div>
</div>
Join me in making web crawling intelligent, efficient, and actually useful. Because in the age of information overload, the winners won't be those who collect the most data—they'll be those who collect the *right* data.
---
*Adaptive Crawling is now part of Crawl4AI. [Get started with the documentation](/core/adaptive-crawling/) or [dive into the mathematical framework](https://github.com/unclecode/crawl4ai/blob/main/PROGRESSIVE_CRAWLING.md). For updates on my work in information theory and efficient AI, follow me on [X/Twitter](https://x.com/unclecode).*
<style>
/* Custom styles for this article */
.markdown-body pre {
background-color: #1e1e1e !important;
border: 1px solid #3f3f44;
}
.markdown-body code {
background-color: #3f3f44;
color: #50ffff;
padding: 2px 6px;
border-radius: 3px;
}
.markdown-body pre code {
background-color: transparent;
color: #e8e9ed;
padding: 0;
}
.markdown-body blockquote {
border-left: 4px solid #09b5a5;
background-color: #1a1a1c;
padding: 15px 20px;
margin: 20px 0;
}
.markdown-body h2 {
color: #50ffff;
border-bottom: 1px dashed #3f3f44;
padding-bottom: 10px;
}
.markdown-body h3 {
color: #09b5a5;
}
.markdown-body strong {
color: #50ffff;
}
</style>


@@ -2,6 +2,22 @@
Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical insights, and updates about the project. Whether you're looking for the latest improvements or want to dive deep into web crawling techniques, this is the place.
## Featured Articles
### [When to Stop Crawling: The Art of Knowing "Enough"](articles/adaptive-crawling-revolution.md)
*January 29, 2025*
Traditional crawlers are like tourists with unlimited time—they'll visit every street, every alley, every dead end. But what if your crawler could think like a researcher with a deadline? Discover how Adaptive Crawling revolutionizes web scraping by knowing when to stop. Learn about the three-layer intelligence system that evaluates coverage, consistency, and saturation to build focused knowledge bases instead of endless page collections.
[Read the full article →](articles/adaptive-crawling-revolution.md)
### [The LLM Context Protocol: Why Your AI Assistant Needs Memory, Reasoning, and Examples](articles/llm-context-revolution.md)
*January 24, 2025*
Ever wondered why your AI coding assistant struggles with your library despite comprehensive documentation? This article introduces the three-dimensional context protocol that transforms how AI understands code. Learn why memory, reasoning, and examples together create wisdom—not just information.
[Read the full article →](articles/llm-context-revolution.md)
## Latest Release


@@ -0,0 +1,347 @@
# Adaptive Web Crawling
## Introduction
Traditional web crawlers follow predetermined patterns, crawling pages blindly without knowing when they've gathered enough information. **Adaptive Crawling** changes this paradigm by introducing intelligence into the crawling process.
Think of it like research: when you're looking for information, you don't read every book in the library. You stop when you've found sufficient information to answer your question. That's exactly what Adaptive Crawling does for web scraping.
## Key Concepts
### The Problem It Solves
When crawling websites for specific information, you face two challenges:
1. **Under-crawling**: Stopping too early and missing crucial information
2. **Over-crawling**: Wasting resources by crawling irrelevant pages
Adaptive Crawling solves both by using a three-layer scoring system that determines when you have "enough" information.
### How It Works
The AdaptiveCrawler uses three metrics to measure information sufficiency:
- **Coverage**: How well your collected pages cover the query terms
- **Consistency**: Whether the information is coherent across pages
- **Saturation**: Detecting when new pages aren't adding new information
When these metrics indicate sufficient information has been gathered, crawling stops automatically.
## Quick Start
### Basic Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Create an adaptive crawler
        adaptive = AdaptiveCrawler(crawler)

        # Start crawling with a query
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View statistics
        adaptive.print_stats()

        # Get the most relevant content
        relevant_pages = adaptive.get_relevant_content(top_k=5)
        for page in relevant_pages:
            print(f"- {page['url']} (score: {page['score']:.2f})")

asyncio.run(main())
```
### Configuration Options
```python
from crawl4ai import AdaptiveConfig
config = AdaptiveConfig(
    confidence_threshold=0.7,  # Stop when 70% confident (default: 0.8)
    max_pages=20,              # Maximum pages to crawl (default: 50)
    top_k_links=3,             # Links to follow per page (default: 5)
    min_gain_threshold=0.05    # Minimum expected gain to continue (default: 0.1)
)
adaptive = AdaptiveCrawler(crawler, config=config)
```
## Crawling Strategies
Adaptive Crawling supports two distinct strategies for determining information sufficiency:
### Statistical Strategy (Default)
The statistical strategy uses pure information theory and term-based analysis:
- **Fast and efficient** - No API calls or model loading
- **Term-based coverage** - Analyzes query term presence and distribution
- **No external dependencies** - Works offline
- **Best for**: Well-defined queries with specific terminology
```python
# Default configuration uses statistical strategy
config = AdaptiveConfig(
strategy="statistical", # This is the default
confidence_threshold=0.8
)
```
### Embedding Strategy
The embedding strategy uses semantic embeddings for deeper understanding:
- **Semantic understanding** - Captures meaning beyond exact term matches
- **Query expansion** - Automatically generates query variations
- **Gap-driven selection** - Identifies semantic gaps in knowledge
- **Validation-based stopping** - Uses held-out queries to validate coverage
- **Best for**: Complex queries, ambiguous topics, conceptual understanding
```python
# Configure embedding strategy
config = AdaptiveConfig(
strategy="embedding",
embedding_model="sentence-transformers/all-MiniLM-L6-v2", # Default
n_query_variations=10, # Generate 10 query variations
embedding_min_confidence_threshold=0.1 # Stop if completely irrelevant
)
# With custom embedding provider (e.g., OpenAI)
config = AdaptiveConfig(
strategy="embedding",
embedding_llm_config={
'provider': 'openai/text-embedding-3-small',
'api_token': 'your-api-key'
}
)
```
### Strategy Comparison
| Feature | Statistical | Embedding |
|---------|------------|-----------|
| **Speed** | Very fast | Moderate (API calls) |
| **Cost** | Free | Depends on provider |
| **Accuracy** | Good for exact terms | Excellent for concepts |
| **Dependencies** | None | Embedding model/API |
| **Query Understanding** | Literal | Semantic |
| **Best Use Case** | Technical docs, specific terms | Research, broad topics |
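The table's guidance can be encoded as a small helper (an illustrative heuristic, not a library API):
```python
from crawl4ai import AdaptiveConfig

def config_for(query: str, conceptual: bool = False) -> AdaptiveConfig:
    """Pick a strategy per the comparison above (illustrative heuristic)."""
    if conceptual:
        # Broad or ambiguous topics benefit from semantic matching
        return AdaptiveConfig(strategy="embedding", n_query_variations=10)
    # Specific terminology is handled well (and cheaply) by pure statistics
    return AdaptiveConfig(strategy="statistical", confidence_threshold=0.8)

config = config_for("oauth2 refresh token rotation")
```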
### Embedding Strategy Configuration
The embedding strategy offers fine-tuned control through several parameters:
```python
config = AdaptiveConfig(
    strategy="embedding",

    # Model configuration
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_llm_config=None,  # Use for API-based embeddings

    # Query expansion
    n_query_variations=10,  # Number of query variations to generate

    # Coverage parameters
    embedding_coverage_radius=0.2,  # Distance threshold for coverage
    embedding_k_exp=3.0,  # Exponential decay factor (higher = stricter)

    # Stopping criteria
    embedding_min_relative_improvement=0.1,  # Min improvement to continue
    embedding_validation_min_score=0.3,  # Min validation score
    embedding_min_confidence_threshold=0.1,  # Below this = irrelevant

    # Link selection
    embedding_overlap_threshold=0.85,  # Similarity for deduplication

    # Display confidence mapping
    embedding_quality_min_confidence=0.7,  # Min displayed confidence
    embedding_quality_max_confidence=0.95  # Max displayed confidence
)
```
### Handling Irrelevant Queries
The embedding strategy can detect when a query is completely unrelated to the content:
```python
# This will stop quickly with low confidence
result = await adaptive.digest(
start_url="https://docs.python.org/3/",
query="how to cook pasta" # Irrelevant to Python docs
)
# Check if query was irrelevant
if result.metrics.get('is_irrelevant', False):
print("Query is unrelated to the content!")
```
## When to Use Adaptive Crawling
### Perfect For:
- **Research Tasks**: Finding comprehensive information about a topic
- **Question Answering**: Gathering sufficient context to answer specific queries
- **Knowledge Base Building**: Creating focused datasets for AI/ML applications
- **Competitive Intelligence**: Collecting complete information about specific products/features
### Not Recommended For:
- **Full Site Archiving**: When you need every page regardless of content
- **Structured Data Extraction**: When targeting specific, known page patterns
- **Real-time Monitoring**: When you need continuous updates
## Understanding the Output
### Confidence Score
The confidence score (0-1) indicates how sufficient the gathered information is:
- **0.0-0.3**: Insufficient information, needs more crawling
- **0.3-0.6**: Partial information, may answer basic queries
- **0.6-0.8**: Good coverage, can answer most queries
- **0.8-1.0**: Excellent coverage, comprehensive information
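These bands map naturally onto follow-up actions, for example:
```python
# Thresholds mirror the bands above
if adaptive.confidence >= 0.8:
    print("Comprehensive coverage - ready to use")
elif adaptive.confidence >= 0.6:
    print("Good coverage - can answer most queries")
elif adaptive.confidence >= 0.3:
    print("Partial - consider raising max_pages and re-running")
else:
    print("Insufficient - rethink the start URL or query")
```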
### Statistics Display
```python
adaptive.print_stats(detailed=False) # Summary table
adaptive.print_stats(detailed=True) # Detailed metrics
```
The summary shows:
- Pages crawled vs. confidence achieved
- Coverage, consistency, and saturation scores
- Crawling efficiency metrics
## Persistence and Resumption
### Saving Progress
```python
config = AdaptiveConfig(
    save_state=True,
    state_path="my_crawl_state.json"
)
# Crawl will auto-save progress
result = await adaptive.digest(start_url, query)
```
### Resuming a Crawl
```python
# Resume from saved state
result = await adaptive.digest(
    start_url,
    query,
    resume_from="my_crawl_state.json"
)
```
### Exporting Knowledge Base
```python
# Export collected pages to JSONL
adaptive.export_knowledge_base("knowledge_base.jsonl")
# Import into another session
new_adaptive = AdaptiveCrawler(crawler)
new_adaptive.import_knowledge_base("knowledge_base.jsonl")
```
## Best Practices
### 1. Query Formulation
- Use specific, descriptive queries
- Include key terms you expect to find
- Avoid overly broad queries
### 2. Threshold Tuning
- Start with default (0.8) for general use
- Lower to 0.6-0.7 for exploratory crawling
- Raise to 0.9+ for exhaustive coverage
### 3. Performance Optimization
- Use appropriate `max_pages` limits
- Adjust `top_k_links` based on site structure
- Enable caching for repeat crawls
### 4. Link Selection
- The crawler prioritizes links based on:
- Relevance to query
- Expected information gain
- URL structure and depth
## Examples
### Research Assistant
```python
# Gather information about a programming concept
result = await adaptive.digest(
start_url="https://realpython.com",
query="python decorators implementation patterns"
)
# Get the most relevant excerpts
for doc in adaptive.get_relevant_content(top_k=3):
print(f"\nFrom: {doc['url']}")
print(f"Relevance: {doc['score']:.2%}")
print(doc['content'][:500] + "...")
```
### Knowledge Base Builder
```python
# Build a focused knowledge base about machine learning
queries = [
"supervised learning algorithms",
"neural network architectures",
"model evaluation metrics"
]
for query in queries:
await adaptive.digest(
start_url="https://scikit-learn.org/stable/",
query=query
)
# Export combined knowledge base
adaptive.export_knowledge_base("ml_knowledge.jsonl")
```
### API Documentation Crawler
```python
# Intelligently crawl API documentation
config = AdaptiveConfig(
    confidence_threshold=0.85,  # Higher threshold for completeness
    max_pages=30
)

adaptive = AdaptiveCrawler(crawler, config)
result = await adaptive.digest(
    start_url="https://api.example.com/docs",
    query="authentication endpoints rate limits"
)
```
## Next Steps
- Learn about [Advanced Adaptive Strategies](../advanced/adaptive-strategies.md)
- Explore the [AdaptiveCrawler API Reference](../api/adaptive-crawler.md)
- See more [Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples/adaptive_crawling)
## FAQ
**Q: How is this different from traditional crawling?**
A: Traditional crawling follows fixed patterns (BFS/DFS). Adaptive crawling makes intelligent decisions about which links to follow and when to stop based on information gain.
**Q: Can I use this with JavaScript-heavy sites?**
A: Yes! AdaptiveCrawler inherits all capabilities from AsyncWebCrawler, including JavaScript execution.
**Q: How does it handle large websites?**
A: The algorithm naturally limits crawling to relevant sections. Use `max_pages` as a safety limit.
**Q: Can I customize the scoring algorithms?**
A: Advanced users can implement custom strategies. See [Adaptive Strategies](../advanced/adaptive-strategies.md).


@@ -28,7 +28,11 @@ This page provides a comprehensive list of example scripts that demonstrate vari
| Example | Description | Link |
|---------|-------------|------|
| Deep Crawling | An extensive tutorial on deep crawling capabilities, demonstrating BFS and BestFirst strategies, stream vs. non-stream execution, filters, scorers, and advanced configurations. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/deepcrawl_example.py) |
| Virtual Scroll | Comprehensive examples for handling virtualized scrolling on sites like Twitter and Instagram. Demonstrates different scrolling scenarios with a local test server. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/virtual_scroll_example.py) |
| Adaptive Crawling | Demonstrates intelligent crawling that automatically determines when sufficient information has been gathered. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/adaptive_crawling/) |
| Dispatcher | Shows how to use the crawl dispatcher for advanced workload management. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/dispatcher_example.py) |
| Storage State | Tutorial on managing browser storage state for persistence. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/storage_state_tutorial.md) |
| Network Console Capture | Demonstrates how to capture and analyze network requests and console logs. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/network_console_capture_example.py) |


@@ -272,7 +272,43 @@ if __name__ == "__main__":
---
## 7. Adaptive Crawling (New!)
Crawl4AI now includes intelligent adaptive crawling that automatically determines when sufficient information has been gathered. Here's a quick example:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler
async def adaptive_example():
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler)

        # Start adaptive crawling
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View results
        adaptive.print_stats()
        print(f"Crawled {len(result.crawled_urls)} pages")
        print(f"Achieved {adaptive.confidence:.0%} confidence")

if __name__ == "__main__":
    asyncio.run(adaptive_example())
```
**What's special about adaptive crawling?**
- **Automatic stopping**: Stops when sufficient information is gathered
- **Intelligent link selection**: Follows only relevant links
- **Confidence scoring**: Know how complete your information is
[Learn more about Adaptive Crawling →](adaptive-crawling.md)
---
## 8. Multi-URL Concurrency (Preview)
If you need to crawl multiple URLs in **parallel**, you can use `arun_many()`. By default, Crawl4AI employs a **MemoryAdaptiveDispatcher**, automatically adjusting concurrency based on system resources. Here's a quick glimpse:


@@ -48,6 +48,12 @@ Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant
> **Note**: If you're looking for the old documentation, you can access it [here](https://old.docs.crawl4ai.com).
## 🎯 New: Adaptive Web Crawling
Crawl4AI now features intelligent adaptive crawling that knows when to stop! Using advanced information foraging algorithms, it determines when sufficient information has been gathered to answer your query.
[Learn more about Adaptive Crawling →](core/adaptive-crawling.md)
## Quick Start