feat(crawl4ai): Implement adaptive crawling feature

This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage.

The change adds the adaptive crawler module together with its utility functions, configuration, and strategy scripts; modifies the package's `__init__.py` and `utils.py`; and extends the documentation with explanations and examples of the adaptive crawling feature.

The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool.

Significant modifications:
- Added adaptive_crawler.py and related scripts
- Modified __init__.py and utils.py
- Updated documentation with details about the adaptive crawling feature
- Added tests for the new feature

BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature.

Refs: #123, #456
Author: UncleCode
Date: 2025-07-04 15:16:53 +08:00
Parent: 74705c1f67
Commit: 1a73fb60db
29 changed files with 8800 additions and 3 deletions


@@ -0,0 +1,85 @@
# Adaptive Crawling Examples
This directory contains examples demonstrating various aspects of Crawl4AI's Adaptive Crawling feature.
## Examples Overview
### 1. `basic_usage.py`
- Simple introduction to adaptive crawling
- Uses default statistical strategy
- Shows how to get crawl statistics and relevant content
### 2. `embedding_strategy.py` ⭐ NEW
- Demonstrates the embedding-based strategy for semantic understanding
- Shows query expansion and irrelevance detection
- Includes configuration for both local and API-based embeddings
### 3. `embedding_vs_statistical.py` ⭐ NEW
- Direct comparison between statistical and embedding strategies
- Helps you choose the right strategy for your use case
- Shows performance and accuracy trade-offs
### 4. `embedding_configuration.py` ⭐ NEW
- Advanced configuration options for embedding strategy
- Parameter tuning guide for different scenarios
- Examples for research, exploration, and quality-focused crawling
### 5. `advanced_configuration.py`
- Shows various configuration options for both strategies
- Demonstrates threshold tuning and performance optimization
### 6. `custom_strategies.py`
- How to implement your own crawling strategy
- Defines standalone scoring classes and plugs them into the crawler's link ranking
- Advanced use case for specialized requirements
### 7. `export_import_kb.py`
- Export crawled knowledge base to JSONL
- Import and continue crawling from saved state
- Useful for building persistent knowledge bases
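
JSONL stores one JSON object per line, which is why an export can be inspected or appended to without loading the whole file. The sketch below shows a round trip using only the standard library. The helper names and the record fields (`url`, `content`) are assumptions for illustration; the actual export schema is whatever `export_knowledge_base` writes.

```python
import json


def write_jsonl(path, records):
    """Write one JSON object per line (hypothetical helper, not a Crawl4AI API)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Reading line by line like this also makes it easy to peek at the first record of a large export before deciding whether to import it.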
## Quick Start
For your first adaptive crawling experience, run:
```bash
python basic_usage.py
```
To try the new embedding strategy with semantic understanding:
```bash
python embedding_strategy.py
```
To compare strategies and see which works best for your use case:
```bash
python embedding_vs_statistical.py
```
## Strategy Selection Guide
### Use Statistical Strategy (Default) When:
- Working with technical documentation
- Queries contain specific terms or code
- Speed is critical
- No API access available
### Use Embedding Strategy When:
- Queries are conceptual or ambiguous
- Need semantic understanding beyond exact matches
- Want to detect irrelevant content
- Working with diverse content sources
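
The guide above can be approximated as a rule of thumb in code. This is a minimal sketch, not part of Crawl4AI: `choose_strategy` and its regular expression are hypothetical, and the "looks like code" test (a dot or underscore inside a token, or `()`) is only a rough stand-in for "queries contain specific terms or code".

```python
import re

# Rough signal that a query names identifiers, e.g. "asyncio.create_task" or "event_loop".
CODE_LIKE = re.compile(r"\w[._]\w|\(\)")


def choose_strategy(query: str, embeddings_available: bool, speed_critical: bool = False) -> str:
    """Return "statistical" or "embedding" following the selection guide (hypothetical helper)."""
    if not embeddings_available or speed_critical:
        return "statistical"  # no local model or API access, or speed matters most
    if CODE_LIKE.search(query):
        return "statistical"  # exact terms and code match well on term statistics
    return "embedding"  # conceptual or ambiguous queries benefit from semantics
```

The returned string can be passed as `AdaptiveConfig(strategy=...)`, which is how the example scripts in this directory select a strategy.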
## Requirements
- Crawl4AI installed
- For embedding strategy with local models: `sentence-transformers`
- For embedding strategy with OpenAI: Set `OPENAI_API_KEY` environment variable
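
Before running the embedding examples, it can help to check which of these requirements are met. The following sketch uses only the standard library; `embedding_backends` is a hypothetical helper, and it only tests that the `sentence_transformers` package is importable and that `OPENAI_API_KEY` is non-empty, not that either backend actually works.

```python
import importlib.util
import os


def embedding_backends(env=None) -> dict:
    """Report which embedding backends look usable (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        # sentence-transformers installs the importable package "sentence_transformers"
        "local": importlib.util.find_spec("sentence_transformers") is not None,
        "openai": bool(env.get("OPENAI_API_KEY")),
    }


if __name__ == "__main__":
    print(embedding_backends())
```

If both values are false, the default statistical strategy remains available.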
## Learn More
- [Adaptive Crawling Documentation](https://docs.crawl4ai.com/core/adaptive-crawling/)
- [Mathematical Framework](https://github.com/unclecode/crawl4ai/blob/main/PROGRESSIVE_CRAWLING.md)
- [Blog: The Adaptive Crawling Revolution](https://docs.crawl4ai.com/blog/adaptive-crawling-revolution/)


@@ -0,0 +1,207 @@
"""
Advanced Adaptive Crawling Configuration

This example demonstrates all configuration options available for adaptive crawling,
including threshold tuning, persistence, and custom parameters.
"""

import asyncio
from pathlib import Path

from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig


async def main():
    """Demonstrate advanced configuration options"""

    # Example 1: Custom thresholds for different use cases
    print("="*60)
    print("EXAMPLE 1: Custom Confidence Thresholds")
    print("="*60)

    # High-precision configuration (exhaustive crawling)
    high_precision_config = AdaptiveConfig(
        confidence_threshold=0.9,  # Very high confidence required
        max_pages=50,  # Allow more pages
        top_k_links=5,  # Follow more links per page
        min_gain_threshold=0.02  # Lower threshold to continue
    )

    # Balanced configuration (default use case)
    balanced_config = AdaptiveConfig(
        confidence_threshold=0.7,  # Moderate confidence
        max_pages=20,  # Reasonable limit
        top_k_links=3,  # Moderate branching
        min_gain_threshold=0.05  # Standard gain threshold
    )

    # Quick exploration configuration
    quick_config = AdaptiveConfig(
        confidence_threshold=0.5,  # Lower confidence acceptable
        max_pages=10,  # Strict limit
        top_k_links=2,  # Minimal branching
        min_gain_threshold=0.1  # High gain required
    )

    async with AsyncWebCrawler(verbose=False) as crawler:
        # Test different configurations
        for config_name, config in [
            ("High Precision", high_precision_config),
            ("Balanced", balanced_config),
            ("Quick Exploration", quick_config)
        ]:
            print(f"\nTesting {config_name} configuration...")

            adaptive = AdaptiveCrawler(crawler, config=config)
            result = await adaptive.digest(
                start_url="https://httpbin.org",
                query="http headers authentication"
            )

            print(f" - Pages crawled: {len(result.crawled_urls)}")
            print(f" - Confidence achieved: {adaptive.confidence:.2%}")
            print(f" - Coverage score: {adaptive.coverage_stats['coverage']:.2f}")

    # Example 2: Persistence and state management
    print("\n" + "="*60)
    print("EXAMPLE 2: State Persistence")
    print("="*60)

    state_file = "crawl_state_demo.json"

    # Configuration with persistence
    persistent_config = AdaptiveConfig(
        confidence_threshold=0.8,
        max_pages=30,
        save_state=True,  # Enable auto-save
        state_path=state_file  # Specify save location
    )

    async with AsyncWebCrawler(verbose=False) as crawler:
        # First crawl - will be interrupted
        print("\nStarting initial crawl (will interrupt after 5 pages)...")

        interrupt_config = AdaptiveConfig(
            confidence_threshold=0.8,
            max_pages=5,  # Artificially low to simulate interruption
            save_state=True,
            state_path=state_file
        )

        adaptive = AdaptiveCrawler(crawler, config=interrupt_config)
        result1 = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="exception handling try except finally"
        )

        print(f"First crawl completed: {len(result1.crawled_urls)} pages")
        print(f"Confidence reached: {adaptive.confidence:.2%}")

        # Resume crawl with higher page limit
        print("\nResuming crawl from saved state...")

        resume_config = AdaptiveConfig(
            confidence_threshold=0.8,
            max_pages=20,  # Increase limit
            save_state=True,
            state_path=state_file
        )

        adaptive2 = AdaptiveCrawler(crawler, config=resume_config)
        result2 = await adaptive2.digest(
            start_url="https://docs.python.org/3/",
            query="exception handling try except finally",
            resume_from=state_file
        )

        print(f"Resumed crawl completed: {len(result2.crawled_urls)} total pages")
        print(f"Final confidence: {adaptive2.confidence:.2%}")

    # Clean up
    Path(state_file).unlink(missing_ok=True)

    # Example 3: Link selection strategies
    print("\n" + "="*60)
    print("EXAMPLE 3: Link Selection Strategies")
    print("="*60)

    # Conservative link following
    conservative_config = AdaptiveConfig(
        confidence_threshold=0.7,
        max_pages=15,
        top_k_links=1,  # Only follow best link
        min_gain_threshold=0.15  # High threshold
    )

    # Aggressive link following
    aggressive_config = AdaptiveConfig(
        confidence_threshold=0.7,
        max_pages=15,
        top_k_links=10,  # Follow many links
        min_gain_threshold=0.01  # Very low threshold
    )

    async with AsyncWebCrawler(verbose=False) as crawler:
        for strategy_name, config in [
            ("Conservative", conservative_config),
            ("Aggressive", aggressive_config)
        ]:
            print(f"\n{strategy_name} link selection:")

            adaptive = AdaptiveCrawler(crawler, config=config)
            result = await adaptive.digest(
                start_url="https://httpbin.org",
                query="api endpoints"
            )

            # Analyze crawl pattern
            print(f" - Total pages: {len(result.crawled_urls)}")
            print(f" - Unique domains: {len(set(url.split('/')[2] for url in result.crawled_urls))}")
            print(f" - Max depth reached: {max(url.count('/') for url in result.crawled_urls) - 2}")

            # Show saturation trend
            if hasattr(result, 'new_terms_history') and result.new_terms_history:
                print(f" - New terms discovered: {result.new_terms_history[:5]}...")
                print(f" - Saturation trend: {'decreasing' if result.new_terms_history[-1] < result.new_terms_history[0] else 'increasing'}")

    # Example 4: Monitoring crawl progress
    print("\n" + "="*60)
    print("EXAMPLE 4: Progress Monitoring")
    print("="*60)

    # Configuration with detailed monitoring
    monitor_config = AdaptiveConfig(
        confidence_threshold=0.75,
        max_pages=10,
        top_k_links=3
    )

    async with AsyncWebCrawler(verbose=False) as crawler:
        adaptive = AdaptiveCrawler(crawler, config=monitor_config)

        # Start crawl
        print("\nMonitoring crawl progress...")
        result = await adaptive.digest(
            start_url="https://httpbin.org",
            query="http methods headers"
        )

        # Detailed statistics
        print("\nDetailed crawl analysis:")
        adaptive.print_stats(detailed=True)

        # Export for analysis
        print("\nExporting knowledge base for external analysis...")
        adaptive.export_knowledge_base("knowledge_export_demo.jsonl")
        print("Knowledge base exported to: knowledge_export_demo.jsonl")

        # Show sample of exported data
        with open("knowledge_export_demo.jsonl", 'r') as f:
            first_line = f.readline()
            print(f"Sample export: {first_line[:100]}...")

        # Clean up
        Path("knowledge_export_demo.jsonl").unlink(missing_ok=True)


if __name__ == "__main__":
    asyncio.run(main())


@@ -0,0 +1,76 @@
"""
Basic Adaptive Crawling Example

This example demonstrates the simplest use case of adaptive crawling:
finding information about a specific topic and knowing when to stop.
"""

import asyncio

from crawl4ai import AsyncWebCrawler, AdaptiveCrawler


async def main():
    """Basic adaptive crawling example"""

    # Initialize the crawler
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Create an adaptive crawler with default settings (statistical strategy)
        adaptive = AdaptiveCrawler(crawler)

        # Note: You can also use embedding strategy for semantic understanding:
        # from crawl4ai import AdaptiveConfig
        # config = AdaptiveConfig(strategy="embedding")
        # adaptive = AdaptiveCrawler(crawler, config)

        # Start adaptive crawling
        print("Starting adaptive crawl for Python async programming information...")
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/library/asyncio.html",
            query="async await context managers coroutines"
        )

        # Display crawl statistics
        print("\n" + "="*50)
        print("CRAWL STATISTICS")
        print("="*50)
        adaptive.print_stats(detailed=False)

        # Get the most relevant content found
        print("\n" + "="*50)
        print("MOST RELEVANT PAGES")
        print("="*50)

        relevant_pages = adaptive.get_relevant_content(top_k=5)
        for i, page in enumerate(relevant_pages, 1):
            print(f"\n{i}. {page['url']}")
            print(f" Relevance Score: {page['score']:.2%}")

            # Show a snippet of the content
            content = page['content'] or ""
            if content:
                snippet = content[:200].replace('\n', ' ')
                if len(content) > 200:
                    snippet += "..."
                print(f" Preview: {snippet}")

        # Show final confidence
        print(f"\n{'='*50}")
        print(f"Final Confidence: {adaptive.confidence:.2%}")
        print(f"Total Pages Crawled: {len(result.crawled_urls)}")
        print(f"Knowledge Base Size: {len(adaptive.state.knowledge_base)} documents")

        # Example: Check if we can answer specific questions
        print(f"\n{'='*50}")
        print("INFORMATION SUFFICIENCY CHECK")
        print(f"{'='*50}")

        if adaptive.confidence >= 0.8:
            print("✓ High confidence - can answer detailed questions about async Python")
        elif adaptive.confidence >= 0.6:
            print("~ Moderate confidence - can answer basic questions")
        else:
            print("✗ Low confidence - need more information")


if __name__ == "__main__":
    asyncio.run(main())


@@ -0,0 +1,373 @@
"""
Custom Adaptive Crawling Strategies

This example demonstrates how to implement custom scoring strategies
for domain-specific crawling needs.
"""

import asyncio
import re
from typing import List, Dict, Set

from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
from crawl4ai.adaptive_crawler import CrawlState, Link
import math


class APIDocumentationStrategy:
    """
    Custom strategy optimized for API documentation crawling.
    Prioritizes endpoint references, code examples, and parameter descriptions.
    """

    def __init__(self):
        # Keywords that indicate high-value API documentation
        self.api_keywords = {
            'endpoint', 'request', 'response', 'parameter', 'authentication',
            'header', 'body', 'query', 'path', 'method', 'get', 'post', 'put',
            'delete', 'patch', 'status', 'code', 'example', 'curl', 'python'
        }

        # URL patterns that typically contain API documentation
        self.valuable_patterns = [
            r'/api/',
            r'/reference/',
            r'/endpoints?/',
            r'/methods?/',
            r'/resources?/'
        ]

        # Patterns to avoid
        self.avoid_patterns = [
            r'/blog/',
            r'/news/',
            r'/about/',
            r'/contact/',
            r'/legal/'
        ]

    def score_link(self, link: Link, query: str, state: CrawlState) -> float:
        """Custom link scoring for API documentation"""
        score = 1.0
        url = link.href.lower()

        # Boost API-related URLs
        for pattern in self.valuable_patterns:
            if re.search(pattern, url):
                score *= 2.0
                break

        # Reduce score for non-API content
        for pattern in self.avoid_patterns:
            if re.search(pattern, url):
                score *= 0.1
                break

        # Boost if preview contains API keywords
        if link.text:
            preview_lower = link.text.lower()
            keyword_count = sum(1 for kw in self.api_keywords if kw in preview_lower)
            score *= (1 + keyword_count * 0.2)

        # Prioritize shallow URLs (likely overview pages)
        depth = url.count('/') - 2  # Subtract protocol slashes
        if depth <= 3:
            score *= 1.5
        elif depth > 6:
            score *= 0.5

        return score

    def calculate_api_coverage(self, state: CrawlState, query: str) -> Dict[str, float]:
        """Calculate specialized coverage metrics for API documentation"""
        metrics = {
            'endpoint_coverage': 0.0,
            'example_coverage': 0.0,
            'parameter_coverage': 0.0
        }

        # Analyze knowledge base for API-specific content
        endpoint_patterns = [r'GET\s+/', r'POST\s+/', r'PUT\s+/', r'DELETE\s+/']
        example_patterns = [r'```\w+', r'curl\s+-', r'import\s+requests']
        param_patterns = [r'param(?:eter)?s?\s*:', r'required\s*:', r'optional\s*:']

        total_docs = len(state.knowledge_base)
        if total_docs == 0:
            return metrics

        docs_with_endpoints = 0
        docs_with_examples = 0
        docs_with_params = 0

        for doc in state.knowledge_base:
            content = doc.markdown.raw_markdown if hasattr(doc, 'markdown') else str(doc)

            # Check for endpoints
            if any(re.search(pattern, content, re.IGNORECASE) for pattern in endpoint_patterns):
                docs_with_endpoints += 1

            # Check for examples
            if any(re.search(pattern, content, re.IGNORECASE) for pattern in example_patterns):
                docs_with_examples += 1

            # Check for parameters
            if any(re.search(pattern, content, re.IGNORECASE) for pattern in param_patterns):
                docs_with_params += 1

        metrics['endpoint_coverage'] = docs_with_endpoints / total_docs
        metrics['example_coverage'] = docs_with_examples / total_docs
        metrics['parameter_coverage'] = docs_with_params / total_docs

        return metrics


class ResearchPaperStrategy:
    """
    Strategy optimized for crawling research papers and academic content.
    Prioritizes citations, abstracts, and methodology sections.
    """

    def __init__(self):
        self.academic_keywords = {
            'abstract', 'introduction', 'methodology', 'results', 'conclusion',
            'references', 'citation', 'paper', 'study', 'research', 'analysis',
            'hypothesis', 'experiment', 'findings', 'doi'
        }

        self.citation_patterns = [
            r'\[\d+\]',  # [1] style citations
            r'\(\w+\s+\d{4}\)',  # (Author 2024) style
            r'doi:\s*\S+',  # DOI references
        ]

    def calculate_academic_relevance(self, content: str, query: str) -> float:
        """Calculate relevance score for academic content"""
        score = 0.0
        content_lower = content.lower()

        # Check for academic keywords
        keyword_matches = sum(1 for kw in self.academic_keywords if kw in content_lower)
        score += keyword_matches * 0.1

        # Check for citations
        citation_count = sum(
            len(re.findall(pattern, content))
            for pattern in self.citation_patterns
        )
        score += min(citation_count * 0.05, 1.0)  # Cap at 1.0

        # Check for query terms in academic context
        query_terms = query.lower().split()
        for term in query_terms:
            # Boost if term appears near academic keywords
            for keyword in ['abstract', 'conclusion', 'results']:
                if keyword in content_lower:
                    section = content_lower[content_lower.find(keyword):content_lower.find(keyword) + 500]
                    if term in section:
                        score += 0.2

        return min(score, 2.0)  # Cap total score


async def demo_custom_strategies():
    """Demonstrate custom strategy usage"""

    # Example 1: API Documentation Strategy
    print("="*60)
    print("EXAMPLE 1: Custom API Documentation Strategy")
    print("="*60)

    api_strategy = APIDocumentationStrategy()

    async with AsyncWebCrawler() as crawler:
        # Standard adaptive crawler
        config = AdaptiveConfig(
            confidence_threshold=0.8,
            max_pages=15
        )

        adaptive = AdaptiveCrawler(crawler, config)

        # Override link scoring with custom strategy
        original_rank_links = adaptive._rank_links

        def custom_rank_links(links, query, state):
            # Apply custom scoring
            scored_links = []
            for link in links:
                base_score = api_strategy.score_link(link, query, state)
                scored_links.append((link, base_score))

            # Sort by score
            scored_links.sort(key=lambda x: x[1], reverse=True)
            return [link for link, _ in scored_links[:config.top_k_links]]

        adaptive._rank_links = custom_rank_links

        # Crawl API documentation
        print("\nCrawling API documentation with custom strategy...")
        state = await adaptive.digest(
            start_url="https://httpbin.org",
            query="api endpoints authentication headers"
        )

        # Calculate custom metrics
        api_metrics = api_strategy.calculate_api_coverage(state, "api endpoints")

        print(f"\nResults:")
        print(f"Pages crawled: {len(state.crawled_urls)}")
        print(f"Confidence: {adaptive.confidence:.2%}")
        print(f"\nAPI-Specific Metrics:")
        print(f" - Endpoint coverage: {api_metrics['endpoint_coverage']:.2%}")
        print(f" - Example coverage: {api_metrics['example_coverage']:.2%}")
        print(f" - Parameter coverage: {api_metrics['parameter_coverage']:.2%}")

    # Example 2: Combined Strategy
    print("\n" + "="*60)
    print("EXAMPLE 2: Hybrid Strategy Combining Multiple Approaches")
    print("="*60)

    class HybridStrategy:
        """Combines multiple strategies with weights"""

        def __init__(self):
            self.api_strategy = APIDocumentationStrategy()
            self.research_strategy = ResearchPaperStrategy()
            self.weights = {
                'api': 0.7,
                'research': 0.3
            }

        def score_content(self, content: str, query: str) -> float:
            # Get scores from each strategy
            api_score = self._calculate_api_score(content, query)
            research_score = self.research_strategy.calculate_academic_relevance(content, query)

            # Weighted combination
            total_score = (
                api_score * self.weights['api'] +
                research_score * self.weights['research']
            )

            return total_score

        def _calculate_api_score(self, content: str, query: str) -> float:
            # Simplified API scoring based on keyword presence
            content_lower = content.lower()
            api_keywords = self.api_strategy.api_keywords

            keyword_count = sum(1 for kw in api_keywords if kw in content_lower)
            return min(keyword_count * 0.1, 2.0)

    hybrid_strategy = HybridStrategy()

    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler)

        # Crawl with hybrid scoring
        print("\nTesting hybrid strategy on technical documentation...")
        state = await adaptive.digest(
            start_url="https://docs.python.org/3/library/asyncio.html",
            query="async await coroutines api"
        )

        # Analyze results with hybrid strategy
        print(f"\nHybrid Strategy Analysis:")
        total_score = 0
        top_docs = list(adaptive.get_relevant_content(top_k=5))
        for doc in top_docs:
            content = doc['content'] or ""
            score = hybrid_strategy.score_content(content, "async await api")
            total_score += score
            print(f" - {doc['url'][:50]}... Score: {score:.2f}")

        # Average over the documents actually returned, which may be fewer than 5
        if top_docs:
            print(f"\nAverage hybrid score: {total_score/len(top_docs):.2f}")


async def demo_performance_optimization():
    """Demonstrate performance optimization with custom strategies"""
    print("\n" + "="*60)
    print("EXAMPLE 3: Performance-Optimized Strategy")
    print("="*60)

    class PerformanceOptimizedStrategy:
        """Strategy that balances thoroughness with speed"""

        def __init__(self):
            self.url_cache: Set[str] = set()
            self.domain_scores: Dict[str, float] = {}

        def should_crawl_domain(self, url: str) -> bool:
            """Implement domain-level filtering"""
            domain = url.split('/')[2] if url.startswith('http') else url

            # Skip if we've already crawled many pages from this domain
            domain_count = sum(1 for cached in self.url_cache if domain in cached)
            if domain_count > 5:
                return False

            # Skip low-scoring domains
            if domain in self.domain_scores and self.domain_scores[domain] < 0.3:
                return False

            return True

        def update_domain_score(self, url: str, relevance: float):
            """Track domain-level performance"""
            domain = url.split('/')[2] if url.startswith('http') else url

            if domain not in self.domain_scores:
                self.domain_scores[domain] = relevance
            else:
                # Moving average
                self.domain_scores[domain] = (
                    0.7 * self.domain_scores[domain] + 0.3 * relevance
                )

    perf_strategy = PerformanceOptimizedStrategy()

    async with AsyncWebCrawler() as crawler:
        config = AdaptiveConfig(
            confidence_threshold=0.7,
            max_pages=10,
            top_k_links=2  # Fewer links for speed
        )

        adaptive = AdaptiveCrawler(crawler, config)

        # Track performance
        import time
        start_time = time.time()

        state = await adaptive.digest(
            start_url="https://httpbin.org",
            query="http methods headers"
        )

        elapsed = time.time() - start_time
        pages = len(state.crawled_urls)

        print(f"\nPerformance Results:")
        print(f" - Time elapsed: {elapsed:.2f} seconds")
        print(f" - Pages crawled: {pages}")
        print(f" - Pages per second: {pages/elapsed:.2f}")
        print(f" - Final confidence: {adaptive.confidence:.2%}")
        # Guard against dividing by zero when no page was crawled
        if pages:
            print(f" - Efficiency: {adaptive.confidence/pages:.2%} confidence per page")


async def main():
    """Run all demonstrations"""
    try:
        await demo_custom_strategies()
        await demo_performance_optimization()

        print("\n" + "="*60)
        print("All custom strategy examples completed!")
        print("="*60)

    except Exception as e:
        print(f"Error: {e}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    asyncio.run(main())


@@ -0,0 +1,206 @@
"""
Advanced Embedding Configuration Example

This example demonstrates all configuration options available for the
embedding strategy, including fine-tuning parameters for different use cases.
"""

import asyncio
import os

from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig


async def test_configuration(name: str, config: AdaptiveConfig, url: str, query: str):
    """Test a specific configuration"""
    print(f"\n{'='*60}")
    print(f"Configuration: {name}")
    print(f"{'='*60}")

    async with AsyncWebCrawler(verbose=False) as crawler:
        adaptive = AdaptiveCrawler(crawler, config)
        result = await adaptive.digest(start_url=url, query=query)

        print(f"Pages crawled: {len(result.crawled_urls)}")
        print(f"Final confidence: {adaptive.confidence:.1%}")
        print(f"Stopped reason: {result.metrics.get('stopped_reason', 'max_pages')}")

        if result.metrics.get('is_irrelevant', False):
            print("⚠️ Query detected as irrelevant!")

        return result


async def main():
    """Demonstrate various embedding configurations"""

    print("EMBEDDING STRATEGY CONFIGURATION EXAMPLES")
    print("=" * 60)

    # Base URL and query for testing
    test_url = "https://docs.python.org/3/library/asyncio.html"

    # 1. Default Configuration
    config_default = AdaptiveConfig(
        strategy="embedding",
        max_pages=10
    )

    await test_configuration(
        "Default Settings",
        config_default,
        test_url,
        "async programming patterns"
    )

    # 2. Strict Coverage Requirements
    config_strict = AdaptiveConfig(
        strategy="embedding",
        max_pages=20,

        # Stricter similarity requirements
        embedding_k_exp=5.0,  # Default is 3.0, higher = stricter
        embedding_coverage_radius=0.15,  # Default is 0.2, lower = stricter

        # Higher validation threshold
        embedding_validation_min_score=0.6,  # Default is 0.3

        # More query variations for better coverage
        n_query_variations=15  # Default is 10
    )

    await test_configuration(
        "Strict Coverage (Research/Academic)",
        config_strict,
        test_url,
        "comprehensive guide async await"
    )

    # 3. Fast Exploration
    config_fast = AdaptiveConfig(
        strategy="embedding",
        max_pages=10,
        top_k_links=5,  # Follow more links per page

        # Relaxed requirements for faster convergence
        embedding_k_exp=1.0,  # Lower = more lenient
        embedding_min_relative_improvement=0.05,  # Stop earlier

        # Lower quality thresholds
        embedding_quality_min_confidence=0.5,  # Display lower confidence
        embedding_quality_max_confidence=0.85,

        # Fewer query variations for speed
        n_query_variations=5
    )

    await test_configuration(
        "Fast Exploration (Quick Overview)",
        config_fast,
        test_url,
        "async basics"
    )

    # 4. Irrelevance Detection Focus
    config_irrelevance = AdaptiveConfig(
        strategy="embedding",
        max_pages=5,

        # Aggressive irrelevance detection
        embedding_min_confidence_threshold=0.2,  # Higher threshold (default 0.1)
        embedding_k_exp=5.0,  # Strict similarity

        # Quick stopping for irrelevant content
        embedding_min_relative_improvement=0.15
    )

    await test_configuration(
        "Irrelevance Detection",
        config_irrelevance,
        test_url,
        "recipe for chocolate cake"  # Irrelevant query
    )

    # 5. High-Quality Knowledge Base
    config_quality = AdaptiveConfig(
        strategy="embedding",
        max_pages=30,

        # Deduplication settings
        embedding_overlap_threshold=0.75,  # More aggressive deduplication

        # Quality focus
        embedding_validation_min_score=0.5,
        embedding_quality_scale_factor=1.0,  # Linear quality mapping

        # Balanced parameters
        embedding_k_exp=3.0,
        embedding_nearest_weight=0.8,  # Focus on best matches
        embedding_top_k_weight=0.2
    )

    await test_configuration(
        "High-Quality Knowledge Base",
        config_quality,
        test_url,
        "asyncio advanced patterns best practices"
    )

    # 6. Custom Embedding Provider
    if os.getenv('OPENAI_API_KEY'):
        config_openai = AdaptiveConfig(
            strategy="embedding",
            max_pages=10,

            # Use OpenAI embeddings
            embedding_llm_config={
                'provider': 'openai/text-embedding-3-small',
                'api_token': os.getenv('OPENAI_API_KEY')
            },

            # OpenAI embeddings are high quality, can be stricter
            embedding_k_exp=4.0,
            n_query_variations=12
        )

        await test_configuration(
            "OpenAI Embeddings",
            config_openai,
            test_url,
            "event-driven architecture patterns"
        )

    # Parameter Guide
    print("\n" + "="*60)
    print("PARAMETER TUNING GUIDE")
    print("="*60)

    print("\n📊 Key Parameters and Their Effects:")
    print("\n1. embedding_k_exp (default: 3.0)")
    print(" - Lower (1-2): More lenient, faster convergence")
    print(" - Higher (4-5): Stricter, better precision")

    print("\n2. embedding_coverage_radius (default: 0.2)")
    print(" - Lower (0.1-0.15): Requires closer matches")
    print(" - Higher (0.25-0.3): Accepts broader matches")

    print("\n3. n_query_variations (default: 10)")
    print(" - Lower (5-7): Faster, less comprehensive")
    print(" - Higher (15-20): Better coverage, slower")

    print("\n4. embedding_min_confidence_threshold (default: 0.1)")
    print(" - Set to 0.15-0.2 for aggressive irrelevance detection")
    print(" - Set to 0.05 to crawl even barely relevant content")

    print("\n5. embedding_validation_min_score (default: 0.3)")
    print(" - Higher (0.5-0.6): Requires strong validation")
    print(" - Lower (0.2): More permissive stopping")

    print("\n💡 Tips:")
    print("- For research: High k_exp, more variations, strict validation")
    print("- For exploration: Low k_exp, fewer variations, relaxed thresholds")
    print("- For quality: Focus on overlap_threshold and validation scores")
    print("- For speed: Reduce variations, increase min_relative_improvement")


if __name__ == "__main__":
    asyncio.run(main())


@@ -0,0 +1,109 @@
"""
Embedding Strategy Example for Adaptive Crawling

This example demonstrates how to use the embedding-based strategy
for semantic understanding and intelligent crawling.
"""

import asyncio
import os

from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig


async def main():
    """Demonstrate embedding strategy for adaptive crawling"""

    # Configure embedding strategy
    config = AdaptiveConfig(
        strategy="embedding",  # Use embedding strategy
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # Default model
        n_query_variations=10,  # Generate 10 semantic variations
        max_pages=15,
        top_k_links=3,
        min_gain_threshold=0.05,

        # Embedding-specific parameters
        embedding_k_exp=3.0,  # Higher = stricter similarity requirements
        embedding_min_confidence_threshold=0.1,  # Stop if <10% relevant
        embedding_validation_min_score=0.4  # Validation threshold
    )

    # Optional: Use OpenAI embeddings instead
    if os.getenv('OPENAI_API_KEY'):
        config.embedding_llm_config = {
            'provider': 'openai/text-embedding-3-small',
            'api_token': os.getenv('OPENAI_API_KEY')
        }
        print("Using OpenAI embeddings")
    else:
        print("Using sentence-transformers (local embeddings)")

    async with AsyncWebCrawler(verbose=True) as crawler:
        adaptive = AdaptiveCrawler(crawler, config)

        # Test 1: Relevant query with semantic understanding
        print("\n" + "="*50)
        print("TEST 1: Semantic Query Understanding")
        print("="*50)

        result = await adaptive.digest(
            start_url="https://docs.python.org/3/library/asyncio.html",
            query="concurrent programming event-driven architecture"
        )

        print("\nQuery Expansion:")
        print(f"Original query expanded to {len(result.expanded_queries)} variations")
        for i, q in enumerate(result.expanded_queries[:3], 1):
            print(f" {i}. {q}")
        print(" ...")

        print("\nResults:")
        adaptive.print_stats(detailed=False)

        # Test 2: Detecting irrelevant queries
        print("\n" + "="*50)
        print("TEST 2: Irrelevant Query Detection")
        print("="*50)

        # Reset crawler for new query
        adaptive = AdaptiveCrawler(crawler, config)

        result = await adaptive.digest(
            start_url="https://docs.python.org/3/library/asyncio.html",
            query="how to bake chocolate chip cookies"
        )

        if result.metrics.get('is_irrelevant', False):
            print("\n✅ Successfully detected irrelevant query!")
            print(f"Stopped after just {len(result.crawled_urls)} pages")
            print(f"Reason: {result.metrics.get('stopped_reason', 'unknown')}")
        else:
            print("\n❌ Failed to detect irrelevance")

        print(f"Final confidence: {adaptive.confidence:.1%}")

        # Test 3: Semantic gap analysis
        print("\n" + "="*50)
        print("TEST 3: Semantic Gap Analysis")
        print("="*50)

        # Show how embedding strategy identifies gaps
        adaptive = AdaptiveCrawler(crawler, config)

        result = await adaptive.digest(
            start_url="https://realpython.com",
            query="python decorators advanced patterns"
        )

        print(f"\nSemantic gaps identified: {len(result.semantic_gaps)}")
        print(f"Knowledge base embeddings shape: {result.kb_embeddings.shape if result.kb_embeddings is not None else 'None'}")

        # Show coverage metrics specific to embedding strategy
        print("\nEmbedding-specific metrics:")
        print(f" Average best similarity: {result.metrics.get('avg_best_similarity', 0):.3f}")
        print(f" Coverage score: {result.metrics.get('coverage_score', 0):.3f}")
        print(f" Validation confidence: {result.metrics.get('validation_confidence', 0):.2%}")


if __name__ == "__main__":
    asyncio.run(main())


@@ -0,0 +1,167 @@
"""
Comparison: Embedding vs Statistical Strategy
This example demonstrates the differences between statistical and embedding
strategies for adaptive crawling, showing when to use each approach.
"""
import asyncio
import time
import os
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
async def crawl_with_strategy(url: str, query: str, strategy: str, **kwargs):
    """Helper function to crawl with a specific strategy"""
    config = AdaptiveConfig(
        strategy=strategy,
        max_pages=20,
        top_k_links=3,
        min_gain_threshold=0.05,
        **kwargs
    )

    async with AsyncWebCrawler(verbose=False) as crawler:
        adaptive = AdaptiveCrawler(crawler, config)

        # perf_counter() is monotonic, so it suits interval timing better than time.time()
        start_time = time.perf_counter()
        result = await adaptive.digest(start_url=url, query=query)
        elapsed = time.perf_counter() - start_time

        return {
            'result': result,
            'crawler': adaptive,
            'elapsed': elapsed,
            'pages': len(result.crawled_urls),
            'confidence': adaptive.confidence
        }
async def main():
    """Compare embedding and statistical strategies"""
    # Test scenarios
    test_cases = [
        {
            'name': 'Technical Documentation (Specific Terms)',
            'url': 'https://docs.python.org/3/library/asyncio.html',
            'query': 'asyncio.create_task event_loop.run_until_complete'
        },
        {
            'name': 'Conceptual Query (Semantic Understanding)',
            'url': 'https://docs.python.org/3/library/asyncio.html',
            'query': 'concurrent programming patterns'
        },
        {
            'name': 'Ambiguous Query',
            'url': 'https://realpython.com',
            'query': 'python performance optimization'
        }
    ]

    # Configure embedding strategy
    embedding_config = {}
    if os.getenv('OPENAI_API_KEY'):
        embedding_config['embedding_llm_config'] = {
            'provider': 'openai/text-embedding-3-small',
            'api_token': os.getenv('OPENAI_API_KEY')
        }
    for test in test_cases:
        print("\n" + "="*70)
        print(f"TEST: {test['name']}")
        print(f"URL: {test['url']}")
        print(f"Query: '{test['query']}'")
        print("="*70)

        # Run statistical strategy
        print("\n📊 Statistical Strategy:")
        stat_result = await crawl_with_strategy(
            test['url'],
            test['query'],
            'statistical'
        )
        print(f" Pages crawled: {stat_result['pages']}")
        print(f" Time taken: {stat_result['elapsed']:.2f}s")
        print(f" Confidence: {stat_result['confidence']:.1%}")
        print(f" Sufficient: {'Yes' if stat_result['crawler'].is_sufficient else 'No'}")

        # Show term coverage
        if hasattr(stat_result['result'], 'term_frequencies'):
            query_terms = test['query'].lower().split()
            covered = sum(1 for term in query_terms
                          if term in stat_result['result'].term_frequencies)
            print(f" Term coverage: {covered}/{len(query_terms)} query terms found")

        # Run embedding strategy
        print("\n🧠 Embedding Strategy:")
        emb_result = await crawl_with_strategy(
            test['url'],
            test['query'],
            'embedding',
            **embedding_config
        )
        print(f" Pages crawled: {emb_result['pages']}")
        print(f" Time taken: {emb_result['elapsed']:.2f}s")
        print(f" Confidence: {emb_result['confidence']:.1%}")
        print(f" Sufficient: {'Yes' if emb_result['crawler'].is_sufficient else 'No'}")

        # Show semantic understanding
        if emb_result['result'].expanded_queries:
            print(f" Query variations: {len(emb_result['result'].expanded_queries)}")
            print(f" Semantic gaps: {len(emb_result['result'].semantic_gaps)}")
        # Compare results
        print("\n📈 Comparison:")
        efficiency_diff = ((stat_result['pages'] - emb_result['pages']) /
                           stat_result['pages'] * 100) if stat_result['pages'] > 0 else 0

        print(" Efficiency: ", end="")
        if efficiency_diff > 0:
            print(f"Embedding used {efficiency_diff:.0f}% fewer pages")
        elif efficiency_diff < 0:
            print(f"Statistical used {-efficiency_diff:.0f}% fewer pages")
        else:
            # Avoids printing "-0% fewer pages" when both runs crawl the same count
            print("Both strategies crawled the same number of pages")

        print(" Speed: ", end="")
        if stat_result['elapsed'] <= 0 or emb_result['elapsed'] <= 0:
            # Guard against dividing by a zero-length timing
            print("n/a (a run was too short to time)")
        elif stat_result['elapsed'] < emb_result['elapsed']:
            print(f"Statistical was {emb_result['elapsed']/stat_result['elapsed']:.1f}x faster")
        else:
            print(f"Embedding was {stat_result['elapsed']/emb_result['elapsed']:.1f}x faster")

        print(f" Confidence difference: {abs(stat_result['confidence'] - emb_result['confidence'])*100:.0f} percentage points")
        # Recommendation
        print("\n💡 Recommendation:")
        name = test['name'].lower()
        # Check the scenario label before the term-length heuristic; otherwise a
        # conceptual query made of long words is classified as "specific terms".
        if 'conceptual' in name or 'semantic' in name:
            print(" → Embedding strategy is likely better for this use case (semantic understanding)")
        elif 'specific' in name or all(len(term) > 5 for term in test['query'].split()):
            print(" → Statistical strategy is likely better for this use case (specific terms)")
        else:
            if emb_result['confidence'] > stat_result['confidence'] + 0.1:
                print(" → Embedding strategy achieved significantly better understanding")
            elif stat_result['elapsed'] < emb_result['elapsed'] / 2:
                print(" → Statistical strategy is much faster with similar results")
            else:
                print(" → Both strategies performed similarly; choose based on your priorities")
    # Summary recommendations
    print("\n" + "="*70)
    print("STRATEGY SELECTION GUIDE")
    print("="*70)
    print("\n✅ Use STATISTICAL strategy when:")
    print(" - Queries contain specific technical terms")
    print(" - Speed is critical")
    print(" - No API access available")
    print(" - Working with well-structured documentation")
    print("\n✅ Use EMBEDDING strategy when:")
    print(" - Queries are conceptual or ambiguous")
    print(" - Semantic understanding is important")
    print(" - Need to detect irrelevant content")
    print(" - Working with diverse content sources")


if __name__ == "__main__":
    asyncio.run(main())
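The selection guide printed at the end of this example can also be expressed as a small decision helper. This is a minimal sketch of that guide only: `pick_strategy` and its three flags are hypothetical names, not part of crawl4ai, and the returned strings are the `strategy` values the example passes to `AdaptiveConfig`.

```python
def pick_strategy(*, conceptual_query: bool, speed_critical: bool,
                  has_embedding_backend: bool) -> str:
    """Return 'statistical' or 'embedding' following the printed guide.

    Hypothetical helper: the flags summarise the guide's bullet points.
    """
    # No embedding backend (e.g. no API access) or tight latency budget:
    # the guide points to the statistical strategy.
    if not has_embedding_backend or speed_critical:
        return "statistical"
    # Conceptual or ambiguous queries benefit from semantic matching;
    # queries made of specific technical terms do not need it.
    return "embedding" if conceptual_query else "statistical"
```

A caller could pass the result straight through, for example `AdaptiveConfig(strategy=pick_strategy(...))`, and still override it per query when measured results disagree with the heuristic.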

@@ -0,0 +1,232 @@
"""
Knowledge Base Export and Import
This example demonstrates how to export crawled knowledge bases and
import them for reuse, sharing, or analysis.
"""
import asyncio
import json
from pathlib import Path
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
async def build_knowledge_base():
    """Build a knowledge base about web technologies"""
    print("="*60)
    print("PHASE 1: Building Knowledge Base")
    print("="*60)

    async with AsyncWebCrawler(verbose=False) as crawler:
        adaptive = AdaptiveCrawler(crawler)

        # Crawl information about HTTP
        print("\n1. Gathering HTTP protocol information...")
        await adaptive.digest(
            start_url="https://httpbin.org",
            query="http methods headers status codes"
        )
        print(f" - Pages crawled: {len(adaptive.state.crawled_urls)}")
        print(f" - Confidence: {adaptive.confidence:.2%}")

        # Add more information about APIs
        print("\n2. Adding API documentation knowledge...")
        await adaptive.digest(
            start_url="https://httpbin.org/anything",
            query="rest api json response request"
        )
        print(f" - Total pages: {len(adaptive.state.crawled_urls)}")
        print(f" - Confidence: {adaptive.confidence:.2%}")

        # Export the knowledge base
        export_path = "web_tech_knowledge.jsonl"
        print(f"\n3. Exporting knowledge base to {export_path}")
        adaptive.export_knowledge_base(export_path)

        # Show export statistics
        export_size = Path(export_path).stat().st_size / 1024
        with open(export_path, 'r') as f:
            line_count = sum(1 for _ in f)

        print(f" - Exported {line_count} documents")
        print(f" - File size: {export_size:.1f} KB")

        return export_path
async def analyze_knowledge_base(kb_path):
    """Analyze the exported knowledge base"""
    print("\n" + "="*60)
    print("PHASE 2: Analyzing Exported Knowledge Base")
    print("="*60)

    # Read and analyze JSONL (skip blank lines, which json.loads would reject)
    documents = []
    with open(kb_path, 'r') as f:
        for line in f:
            if line.strip():
                documents.append(json.loads(line))

    print(f"\nKnowledge base contains {len(documents)} documents:")
    if not documents:
        # Nothing to analyze; also avoids dividing by zero in the average below
        return

    # Analyze document properties
    total_content_length = 0
    urls_by_domain = {}

    for index, doc in enumerate(documents):
        # Content analysis
        content_length = len(doc.get('content', ''))
        total_content_length += content_length

        # URL analysis
        url = doc.get('url', '')
        domain = url.split('/')[2] if url.startswith('http') else 'unknown'
        urls_by_domain[domain] = urls_by_domain.get(domain, 0) + 1

        # Show sample document (enumerate avoids an O(n) documents.index() lookup
        # that would also match a later duplicate of the first document)
        if index == 0:
            print("\nSample document structure:")
            print(f" - URL: {url}")
            print(f" - Content length: {content_length} chars")
            print(f" - Has metadata: {'metadata' in doc}")
            print(f" - Has links: {len(doc.get('links', []))} links")
            print(f" - Query: {doc.get('query', 'N/A')}")

    print("\nContent statistics:")
    print(f" - Total content: {total_content_length:,} characters")
    print(f" - Average per document: {total_content_length/len(documents):,.0f} chars")

    print("\nDomain distribution:")
    for domain, count in urls_by_domain.items():
        print(f" - {domain}: {count} pages")
async def import_and_continue():
    """Import a knowledge base and continue crawling"""
    print("\n" + "="*60)
    print("PHASE 3: Importing and Extending Knowledge Base")
    print("="*60)

    kb_path = "web_tech_knowledge.jsonl"

    async with AsyncWebCrawler(verbose=False) as crawler:
        # Create new adaptive crawler
        adaptive = AdaptiveCrawler(crawler)

        # Import existing knowledge base
        print(f"\n1. Importing knowledge base from {kb_path}")
        adaptive.import_knowledge_base(kb_path)
        print(f" - Imported {len(adaptive.state.knowledge_base)} documents")
        print(f" - Existing URLs: {len(adaptive.state.crawled_urls)}")

        # Check current state
        print("\n2. Checking imported knowledge state:")
        adaptive.print_stats(detailed=False)

        # Continue crawling with new query
        print("\n3. Extending knowledge with new query...")
        await adaptive.digest(
            start_url="https://httpbin.org/status/200",
            query="error handling retry timeout"
        )

        print("\n4. Final knowledge base state:")
        adaptive.print_stats(detailed=False)

        # Export extended knowledge base
        extended_path = "web_tech_knowledge_extended.jsonl"
        adaptive.export_knowledge_base(extended_path)
        print(f"\n5. Extended knowledge base exported to {extended_path}")
async def share_knowledge_bases():
    """Demonstrate sharing knowledge bases between projects"""
    print("\n" + "="*60)
    print("PHASE 4: Sharing Knowledge Between Projects")
    print("="*60)

    # Simulate two different projects
    project_a_kb = "project_a_knowledge.jsonl"
    project_b_kb = "project_b_knowledge.jsonl"

    async with AsyncWebCrawler(verbose=False) as crawler:
        # Project A: Security documentation
        print("\n1. Project A: Building security knowledge...")
        crawler_a = AdaptiveCrawler(crawler)
        await crawler_a.digest(
            start_url="https://httpbin.org/basic-auth/user/pass",
            query="authentication security headers"
        )
        crawler_a.export_knowledge_base(project_a_kb)
        print(f" - Exported {len(crawler_a.state.knowledge_base)} documents")

        # Project B: API testing
        print("\n2. Project B: Building testing knowledge...")
        crawler_b = AdaptiveCrawler(crawler)
        await crawler_b.digest(
            start_url="https://httpbin.org/anything",
            query="testing endpoints mocking"
        )
        crawler_b.export_knowledge_base(project_b_kb)
        print(f" - Exported {len(crawler_b.state.knowledge_base)} documents")

        # Merge knowledge bases
        print("\n3. Merging knowledge bases...")
        merged_crawler = AdaptiveCrawler(crawler)

        # Import both knowledge bases
        merged_crawler.import_knowledge_base(project_a_kb)
        initial_size = len(merged_crawler.state.knowledge_base)
        merged_crawler.import_knowledge_base(project_b_kb)
        final_size = len(merged_crawler.state.knowledge_base)

        print(f" - Project A documents: {initial_size}")
        print(f" - Additional from Project B: {final_size - initial_size}")
        print(f" - Total merged documents: {final_size}")

        # Export merged knowledge
        merged_kb = "merged_knowledge.jsonl"
        merged_crawler.export_knowledge_base(merged_kb)
        print(f"\n4. Merged knowledge base exported to {merged_kb}")

        # Show combined coverage
        print("\n5. Combined knowledge coverage:")
        merged_crawler.print_stats(detailed=False)
async def main():
    """Run all examples"""
    try:
        # Build initial knowledge base
        kb_path = await build_knowledge_base()

        # Analyze the export
        await analyze_knowledge_base(kb_path)

        # Import and extend
        await import_and_continue()

        # Demonstrate sharing
        await share_knowledge_bases()

        print("\n" + "="*60)
        print("All examples completed successfully!")
        print("="*60)
    finally:
        # Clean up generated files
        print("\nCleaning up generated files...")
        for file in [
            "web_tech_knowledge.jsonl",
            "web_tech_knowledge_extended.jsonl",
            "project_a_knowledge.jsonl",
            "project_b_knowledge.jsonl",
            "merged_knowledge.jsonl"
        ]:
            Path(file).unlink(missing_ok=True)
        print("Cleanup complete.")


if __name__ == "__main__":
    asyncio.run(main())
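Because the export is plain JSONL, knowledge bases can also be merged outside the crawler. The sketch below uses only the standard library and assumes each line is a JSON object with a `url` field, as the analysis phase above reads it. Whether `import_knowledge_base` itself de-duplicates by URL is not shown in this example, so `merge_jsonl_by_url` is a hypothetical helper, not a crawl4ai function.

```python
import json


def merge_jsonl_by_url(sources, destination):
    """Merge JSONL files into `destination`, keeping the first record per URL.

    Records without a `url` field are always kept.
    Returns the number of records written.
    """
    seen = set()
    written = 0
    with open(destination, "w", encoding="utf-8") as out:
        for source in sources:
            with open(source, "r", encoding="utf-8") as handle:
                for line in handle:
                    line = line.strip()
                    if not line:
                        continue  # tolerate blank lines between records
                    record = json.loads(line)
                    url = record.get("url")
                    if url is not None:
                        if url in seen:
                            continue  # duplicate page, keep the earlier copy
                        seen.add(url)
                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
                    written += 1
    return written
```

Keeping the first occurrence means the order of `sources` decides which project's copy of a shared page survives.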