# Advanced Adaptive Strategies

## Overview

While the default adaptive crawling configuration works well for most use cases, understanding the underlying strategies and scoring mechanisms allows you to fine-tune the crawler for specific domains and requirements.

## The Three-Layer Scoring System

### 1. Coverage Score

Coverage measures how comprehensively your knowledge base covers the query terms and related concepts.

#### Mathematical Foundation

```python
Coverage(K, Q) = Σ(t ∈ Q) score(t, K) / |Q|

where score(t, K) = doc_coverage(t) × (1 + freq_boost(t))
```

#### Components

- **Document Coverage**: Percentage of documents containing the term
- **Frequency Boost**: Logarithmic bonus for term frequency
- **Query Decomposition**: Handles multi-word queries intelligently
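
To make the formula concrete, here is a minimal, self-contained sketch of computing such a coverage score over a list of documents. It illustrates the formula above rather than the library's internal implementation; the function name, the log-based boost, and the `/ 10` scaling are illustrative choices.

```python
import math

def coverage_score(documents: list[str], query: str) -> float:
    """Illustrative coverage calculation following the formula above."""
    terms = query.lower().split()
    if not terms or not documents:
        return 0.0

    docs_lower = [d.lower() for d in documents]
    total = 0.0
    for term in terms:
        containing = [d for d in docs_lower if term in d]
        doc_coverage = len(containing) / len(documents)    # fraction of docs with the term
        term_count = sum(d.count(term) for d in docs_lower)
        freq_boost = math.log1p(term_count) / 10            # arbitrary logarithmic bonus
        total += doc_coverage * (1 + freq_boost)

    return total / len(terms)                               # normalize by |Q|

docs = ["async crawling with python", "adaptive crawling strategies", "css selectors"]
print(f"{coverage_score(docs, 'adaptive crawling'):.2f}")
```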

#### Tuning Coverage

```python
# For technical documentation with specific terminology
config = AdaptiveConfig(
    confidence_threshold=0.85,  # Require high coverage
    top_k_links=5               # Cast wider net
)

# For general topics with synonyms
config = AdaptiveConfig(
    confidence_threshold=0.6,  # Lower threshold
    top_k_links=2              # More focused
)
```

### 2. Consistency Score

Consistency evaluates whether the information across pages is coherent and non-contradictory.

#### How It Works

1. Extracts key statements from each document
2. Compares statements across documents
3. Measures agreement vs. contradiction
4. Returns normalized score (0-1)
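
How the library extracts and compares statements is internal to its consistency strategy; the sketch below only illustrates the general idea with a simple pairwise term-overlap comparison, which is one rough way agreement between documents can be approximated.

```python
from itertools import combinations

def consistency_score(documents: list[str]) -> float:
    """Illustrative consistency estimate: average pairwise term overlap."""
    term_sets = [set(doc.lower().split()) for doc in documents if doc.strip()]
    if len(term_sets) < 2:
        return 1.0  # a single document cannot contradict itself

    similarities = []
    for a, b in combinations(term_sets, 2):
        union = a | b
        similarities.append(len(a & b) / len(union) if union else 0.0)

    return sum(similarities) / len(similarities)

docs = [
    "the api requires an access token",
    "an access token is required by the api",
    "no authentication is needed",
]
print(f"{consistency_score(docs):.2f}")
```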

#### Practical Impact

- **High consistency (>0.8)**: Information is reliable and coherent
- **Medium consistency (0.5-0.8)**: Some variation, but generally aligned
- **Low consistency (<0.5)**: Conflicting information; more sources are needed

### 3. Saturation Score

Saturation detects when new pages stop providing novel information.

#### Detection Algorithm

```python
# Tracks new unique terms per page
new_terms_page_1 = 50
new_terms_page_2 = 30  # 60% of first
new_terms_page_3 = 15  # 50% of second
new_terms_page_4 = 5   # 33% of third
# Saturation detected: rapidly diminishing returns
```
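
These counts translate directly into a stopping rule: track how much new vocabulary each page contributes and stop once the latest contribution falls below `min_gain_threshold`. The sketch below is illustrative only, not the crawler's internal code.

```python
def should_stop(new_terms_history: list[int], min_gain_threshold: float = 0.1) -> bool:
    """Stop when the most recent page adds too little new vocabulary."""
    if len(new_terms_history) < 2:
        return False  # not enough data yet

    total_terms = sum(new_terms_history)
    latest_gain = new_terms_history[-1] / total_terms if total_terms else 0.0
    return latest_gain < min_gain_threshold

print(should_stop([50, 30, 15, 5]))  # True: the last page added only 5% of all terms seen
```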

#### Configuration

```python
config = AdaptiveConfig(
    min_gain_threshold=0.1  # Stop if <10% new information
)
```

## Link Ranking Algorithm

### Expected Information Gain

Each uncrawled link is scored based on:

```python
ExpectedGain(link) = Relevance × Novelty × Authority
```

#### 1. Relevance Scoring

Uses the BM25 algorithm on link preview text:

```python
relevance = BM25(link.preview_text, query)
```

Factors:
- Term frequency in preview
- Inverse document frequency
- Preview length normalization
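
To experiment with BM25-style relevance outside the crawler, the third-party `rank_bm25` package works well (it is used here as an assumption for illustration, not as a crawl4ai dependency): score each link's preview text against the query and keep the highest-scoring links.

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

previews = [
    "Introduction to asynchronous crawling",
    "API reference for the crawler configuration",
    "Company history and press releases",
]
query = "crawler configuration api"

bm25 = BM25Okapi([p.lower().split() for p in previews])
scores = bm25.get_scores(query.lower().split())

for preview, score in sorted(zip(previews, scores), key=lambda x: -x[1]):
    print(f"{score:5.2f}  {preview}")
```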

#### 2. Novelty Estimation

Measures how different the link appears from already-crawled content:

```python
novelty = 1 - max_similarity(preview, knowledge_base)
```

Prevents crawling duplicate or highly similar pages.
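
A simple way to approximate the `max_similarity` term is to compare the preview against every crawled document and take the best match. The token-overlap version below is only an illustration of the idea; the library may use a different similarity measure.

```python
def novelty(preview: str, knowledge_base: list[str]) -> float:
    """1 minus the best term overlap between the preview and any crawled document."""
    preview_terms = set(preview.lower().split())
    if not preview_terms or not knowledge_base:
        return 1.0

    max_similarity = 0.0
    for doc in knowledge_base:
        doc_terms = set(doc.lower().split())
        union = preview_terms | doc_terms
        similarity = len(preview_terms & doc_terms) / len(union) if union else 0.0
        max_similarity = max(max_similarity, similarity)

    return 1.0 - max_similarity

print(novelty("advanced configuration options", ["basic configuration options guide"]))
```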

#### 3. Authority Calculation

URL structure and domain analysis:

```python
authority = f(domain_rank, url_depth, url_structure)
```

Factors:
- Domain reputation
- URL depth (fewer slashes = higher authority)
- Clean URL structure
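
Putting the three factors together, the `ExpectedGain` product can be sketched as below. The depth-based `authority` heuristic is an illustrative stand-in for the real domain analysis, and `relevance` and `novelty` would come from scorers like the ones sketched above.

```python
from urllib.parse import urlparse

def authority(url: str) -> float:
    """Crude authority heuristic: shallower, cleaner URLs score higher."""
    path = urlparse(url).path
    depth = len([segment for segment in path.split("/") if segment])
    clean = 1.0 if "?" not in url and "#" not in url else 0.8
    return clean / (1 + depth)

def expected_gain(relevance: float, novelty: float, url: str) -> float:
    return relevance * novelty * authority(url)

print(f"{expected_gain(0.8, 0.6, 'https://docs.example.com/api/config'):.3f}")
print(f"{expected_gain(0.8, 0.6, 'https://docs.example.com/a/b/c/d?page=2'):.3f}")
```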

## Domain-Specific Configurations

### Technical Documentation

```python
tech_doc_config = AdaptiveConfig(
    confidence_threshold=0.85,
    max_pages=30,
    top_k_links=3,
    min_gain_threshold=0.05  # Keep crawling for small gains
)
```

Rationale:
- High threshold ensures comprehensive coverage
- Lower gain threshold captures edge cases
- Moderate link following for depth

### News & Articles

```python
news_config = AdaptiveConfig(
    confidence_threshold=0.6,
    max_pages=10,
    top_k_links=5,
    min_gain_threshold=0.15  # Stop quickly on repetition
)
```

Rationale:
- Lower threshold (articles often repeat information)
- Higher gain threshold (avoid duplicate stories)
- More links per page (explore different perspectives)

### E-commerce

```python
ecommerce_config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=2,
    min_gain_threshold=0.1
)
```

Rationale:
- Balanced threshold for product variations
- Focused link following (avoid endless product listings)
- Standard gain threshold

### Research & Academic

```python
research_config = AdaptiveConfig(
    confidence_threshold=0.9,
    max_pages=50,
    top_k_links=4,
    min_gain_threshold=0.02  # Very low - capture citations
)
```

Rationale:
- Very high threshold for completeness
- Many pages allowed for thorough research
- Very low gain threshold to capture references
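
One way to apply these presets is to pick a configuration from the start URL before crawling. The helper below is hypothetical (the keyword-to-preset mapping is not part of the library) and assumes the four configs above are already defined.

```python
def pick_config(start_url: str) -> AdaptiveConfig:
    """Hypothetical helper that maps a URL to one of the presets above."""
    url = start_url.lower()
    if "docs" in url or "api" in url:
        return tech_doc_config
    if "news" in url or "blog" in url:
        return news_config
    if "shop" in url or "store" in url:
        return ecommerce_config
    return research_config

adaptive = AdaptiveCrawler(crawler, config=pick_config("https://docs.example.com/intro"))
```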

## Performance Optimization

### Memory Management

```python
# For large crawls, persist state to disk
config = AdaptiveConfig(
    max_pages=100,
    save_state=True,
    state_path="large_crawl.json"
)

# Periodically prune the knowledge base
if len(state.knowledge_base) > 1000:
    # Keep only the top 500 most relevant docs
    top_content = adaptive.get_relevant_content(top_k=500)
    keep_indices = {d["index"] for d in top_content}
    state.knowledge_base = [
        doc for i, doc in enumerate(state.knowledge_base) if i in keep_indices
    ]
```
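
Because the state is persisted to `large_crawl.json`, a later session can continue where the first one stopped. The snippet below assumes `digest()` accepts a `resume_from` path in your crawl4ai version; the start URL and query are placeholders.

```python
# Later session: continue the crawl from the saved state
# (resume_from is an assumption about the installed crawl4ai version)
result = await adaptive.digest(
    start_url="https://docs.example.com",
    query="adaptive crawling",
    resume_from="large_crawl.json"
)
```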

### Parallel Processing

```python
import asyncio

# Use multiple start points
start_urls = [
    "https://docs.example.com/intro",
    "https://docs.example.com/api",
    "https://docs.example.com/guides"
]

# Crawl in parallel
tasks = [
    adaptive.digest(url, query)
    for url in start_urls
]
results = await asyncio.gather(*tasks)
```

## Debugging & Analysis

### Enable Verbose Logging

```python
import logging

logging.basicConfig(level=logging.DEBUG)
adaptive = AdaptiveCrawler(crawler, config, verbose=True)
```

### Analyze Crawl Patterns

```python
# After crawling
state = await adaptive.digest(start_url, query)

# Analyze link selection
print("Link selection order:")
for i, url in enumerate(state.crawl_order):
    print(f"{i+1}. {url}")

# Analyze term discovery
print("\nTerm discovery rate:")
for i, new_terms in enumerate(state.new_terms_history):
    print(f"Page {i+1}: {new_terms} new terms")

# Analyze score progression
print("\nScore progression:")
print(f"Coverage: {state.metrics['coverage_history']}")
print(f"Saturation: {state.metrics['saturation_history']}")
```

### Export for Analysis

```python
# Export detailed metrics
import json

metrics = {
    "query": query,
    "total_pages": len(state.crawled_urls),
    "confidence": adaptive.confidence,
    "coverage_stats": adaptive.coverage_stats,
    "crawl_order": state.crawl_order,
    "term_frequencies": dict(state.term_frequencies),
    "new_terms_history": state.new_terms_history
}

with open("crawl_analysis.json", "w") as f:
    json.dump(metrics, f, indent=2)
```

## Custom Strategies

### Implementing a Custom Strategy

```python
from typing import List

from crawl4ai.adaptive_crawler import CrawlStrategy

class DomainSpecificStrategy(CrawlStrategy):
    def calculate_coverage(self, state: CrawlState) -> float:
        # Custom coverage calculation
        # e.g., weight certain terms more heavily
        pass

    def calculate_consistency(self, state: CrawlState) -> float:
        # Custom consistency logic
        # e.g., domain-specific validation
        pass

    def rank_links(self, links: List[Link], state: CrawlState) -> List[Link]:
        # Custom link ranking
        # e.g., prioritize specific URL patterns
        pass

# Use the custom strategy
adaptive = AdaptiveCrawler(
    crawler,
    config=config,
    strategy=DomainSpecificStrategy()
)
```

### Combining Strategies

```python
class HybridStrategy(CrawlStrategy):
    def __init__(self):
        self.strategies = [
            TechnicalDocStrategy(),
            SemanticSimilarityStrategy(),
            URLPatternStrategy()
        ]

    def calculate_confidence(self, state: CrawlState) -> float:
        # Weighted combination of strategies
        scores = [s.calculate_confidence(state) for s in self.strategies]
        weights = [0.5, 0.3, 0.2]
        return sum(s * w for s, w in zip(scores, weights))
```

## Best Practices

### 1. Start Conservative

Begin with default settings and adjust based on results:

```python
# Start with defaults
result = await adaptive.digest(url, query)

# Analyze and adjust
if adaptive.confidence < 0.7:
    config.max_pages += 10
    config.confidence_threshold -= 0.1
```

### 2. Monitor Resource Usage

```python
import psutil

# Check memory before large crawls
memory_percent = psutil.virtual_memory().percent
if memory_percent > 80:
    config.max_pages = min(config.max_pages, 20)
```

### 3. Use Domain Knowledge

```python
# For API documentation
if "api" in start_url:
    config.top_k_links = 2  # APIs have clear structure

# For blogs
if "blog" in start_url:
    config.min_gain_threshold = 0.2  # Avoid similar posts
```

### 4. Validate Results

```python
# Always validate the knowledge base
relevant_content = adaptive.get_relevant_content(top_k=10)

# Check coverage
query_terms = set(query.lower().split())
covered_terms = set()

for doc in relevant_content:
    content_lower = doc['content'].lower()
    for term in query_terms:
        if term in content_lower:
            covered_terms.add(term)

coverage_ratio = len(covered_terms) / len(query_terms)
print(f"Query term coverage: {coverage_ratio:.0%}")
```

## Next Steps

- Explore [Custom Strategy Implementation](../tutorials/custom-adaptive-strategies.md)
- Learn about [Knowledge Base Management](../tutorials/knowledge-base-management.md)
- See [Performance Benchmarks](../benchmarks/adaptive-performance.md)