feat(crawl4ai): Implement adaptive crawling feature
This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage. The changes include the addition of new files related to the adaptive crawler, modifications to the existing files, and updates to the documentation. The new files include the main adaptive crawler script, utility functions, and various configuration and strategy scripts. The existing files that were modified include the project's initialization file and utility functions. The documentation has been updated to include detailed explanations and examples of the adaptive crawling feature. The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool. Significant modifications: - Added adaptive_crawler.py and related scripts - Modified __init__.py and utils.py - Updated documentation with details about the adaptive crawling feature - Added tests for the new feature BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature. Refs: #123, #456
This commit is contained in:
85
docs/examples/adaptive_crawling/README.md
Normal file
85
docs/examples/adaptive_crawling/README.md
Normal file
@@ -0,0 +1,85 @@
|
||||
# Adaptive Crawling Examples
|
||||
|
||||
This directory contains examples demonstrating various aspects of Crawl4AI's Adaptive Crawling feature.
|
||||
|
||||
## Examples Overview
|
||||
|
||||
### 1. `basic_usage.py`
|
||||
- Simple introduction to adaptive crawling
|
||||
- Uses default statistical strategy
|
||||
- Shows how to get crawl statistics and relevant content
|
||||
|
||||
### 2. `embedding_strategy.py` ⭐ NEW
|
||||
- Demonstrates the embedding-based strategy for semantic understanding
|
||||
- Shows query expansion and irrelevance detection
|
||||
- Includes configuration for both local and API-based embeddings
|
||||
|
||||
### 3. `embedding_vs_statistical.py` ⭐ NEW
|
||||
- Direct comparison between statistical and embedding strategies
|
||||
- Helps you choose the right strategy for your use case
|
||||
- Shows performance and accuracy trade-offs
|
||||
|
||||
### 4. `embedding_configuration.py` ⭐ NEW
|
||||
- Advanced configuration options for embedding strategy
|
||||
- Parameter tuning guide for different scenarios
|
||||
- Examples for research, exploration, and quality-focused crawling
|
||||
|
||||
### 5. `advanced_configuration.py`
|
||||
- Shows various configuration options for both strategies
|
||||
- Demonstrates threshold tuning and performance optimization
|
||||
|
||||
### 6. `custom_strategies.py`
|
||||
- How to implement your own crawling strategy
|
||||
- Extends the base CrawlStrategy class
|
||||
- Advanced use case for specialized requirements
|
||||
|
||||
### 7. `export_import_kb.py`
|
||||
- Export crawled knowledge base to JSONL
|
||||
- Import and continue crawling from saved state
|
||||
- Useful for building persistent knowledge bases
|
||||
|
||||
## Quick Start
|
||||
|
||||
For your first adaptive crawling experience, run:
|
||||
|
||||
```bash
|
||||
python basic_usage.py
|
||||
```
|
||||
|
||||
To try the new embedding strategy with semantic understanding:
|
||||
|
||||
```bash
|
||||
python embedding_strategy.py
|
||||
```
|
||||
|
||||
To compare strategies and see which works best for your use case:
|
||||
|
||||
```bash
|
||||
python embedding_vs_statistical.py
|
||||
```
|
||||
|
||||
## Strategy Selection Guide
|
||||
|
||||
### Use Statistical Strategy (Default) When:
|
||||
- Working with technical documentation
|
||||
- Queries contain specific terms or code
|
||||
- Speed is critical
|
||||
- No API access available
|
||||
|
||||
### Use Embedding Strategy When:
|
||||
- Queries are conceptual or ambiguous
|
||||
- Need semantic understanding beyond exact matches
|
||||
- Want to detect irrelevant content
|
||||
- Working with diverse content sources
|
||||
|
||||
## Requirements
|
||||
|
||||
- Crawl4AI installed
|
||||
- For embedding strategy with local models: `sentence-transformers`
|
||||
- For embedding strategy with OpenAI: Set `OPENAI_API_KEY` environment variable
|
||||
|
||||
## Learn More
|
||||
|
||||
- [Adaptive Crawling Documentation](https://docs.crawl4ai.com/core/adaptive-crawling/)
|
||||
- [Mathematical Framework](https://github.com/unclecode/crawl4ai/blob/main/PROGRESSIVE_CRAWLING.md)
|
||||
- [Blog: The Adaptive Crawling Revolution](https://docs.crawl4ai.com/blog/adaptive-crawling-revolution/)
|
||||
207
docs/examples/adaptive_crawling/advanced_configuration.py
Normal file
207
docs/examples/adaptive_crawling/advanced_configuration.py
Normal file
@@ -0,0 +1,207 @@
|
||||
"""
|
||||
Advanced Adaptive Crawling Configuration
|
||||
|
||||
This example demonstrates all configuration options available for adaptive crawling,
|
||||
including threshold tuning, persistence, and custom parameters.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
from pathlib import Path
|
||||
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
|
||||
|
||||
|
||||
async def main():
|
||||
"""Demonstrate advanced configuration options"""
|
||||
|
||||
# Example 1: Custom thresholds for different use cases
|
||||
print("="*60)
|
||||
print("EXAMPLE 1: Custom Confidence Thresholds")
|
||||
print("="*60)
|
||||
|
||||
# High-precision configuration (exhaustive crawling)
|
||||
high_precision_config = AdaptiveConfig(
|
||||
confidence_threshold=0.9, # Very high confidence required
|
||||
max_pages=50, # Allow more pages
|
||||
top_k_links=5, # Follow more links per page
|
||||
min_gain_threshold=0.02 # Lower threshold to continue
|
||||
)
|
||||
|
||||
# Balanced configuration (default use case)
|
||||
balanced_config = AdaptiveConfig(
|
||||
confidence_threshold=0.7, # Moderate confidence
|
||||
max_pages=20, # Reasonable limit
|
||||
top_k_links=3, # Moderate branching
|
||||
min_gain_threshold=0.05 # Standard gain threshold
|
||||
)
|
||||
|
||||
# Quick exploration configuration
|
||||
quick_config = AdaptiveConfig(
|
||||
confidence_threshold=0.5, # Lower confidence acceptable
|
||||
max_pages=10, # Strict limit
|
||||
top_k_links=2, # Minimal branching
|
||||
min_gain_threshold=0.1 # High gain required
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(verbose=False) as crawler:
|
||||
# Test different configurations
|
||||
for config_name, config in [
|
||||
("High Precision", high_precision_config),
|
||||
("Balanced", balanced_config),
|
||||
("Quick Exploration", quick_config)
|
||||
]:
|
||||
print(f"\nTesting {config_name} configuration...")
|
||||
adaptive = AdaptiveCrawler(crawler, config=config)
|
||||
|
||||
result = await adaptive.digest(
|
||||
start_url="https://httpbin.org",
|
||||
query="http headers authentication"
|
||||
)
|
||||
|
||||
print(f" - Pages crawled: {len(result.crawled_urls)}")
|
||||
print(f" - Confidence achieved: {adaptive.confidence:.2%}")
|
||||
print(f" - Coverage score: {adaptive.coverage_stats['coverage']:.2f}")
|
||||
|
||||
# Example 2: Persistence and state management
|
||||
print("\n" + "="*60)
|
||||
print("EXAMPLE 2: State Persistence")
|
||||
print("="*60)
|
||||
|
||||
state_file = "crawl_state_demo.json"
|
||||
|
||||
# Configuration with persistence
|
||||
persistent_config = AdaptiveConfig(
|
||||
confidence_threshold=0.8,
|
||||
max_pages=30,
|
||||
save_state=True, # Enable auto-save
|
||||
state_path=state_file # Specify save location
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(verbose=False) as crawler:
|
||||
# First crawl - will be interrupted
|
||||
print("\nStarting initial crawl (will interrupt after 5 pages)...")
|
||||
|
||||
interrupt_config = AdaptiveConfig(
|
||||
confidence_threshold=0.8,
|
||||
max_pages=5, # Artificially low to simulate interruption
|
||||
save_state=True,
|
||||
state_path=state_file
|
||||
)
|
||||
|
||||
adaptive = AdaptiveCrawler(crawler, config=interrupt_config)
|
||||
result1 = await adaptive.digest(
|
||||
start_url="https://docs.python.org/3/",
|
||||
query="exception handling try except finally"
|
||||
)
|
||||
|
||||
print(f"First crawl completed: {len(result1.crawled_urls)} pages")
|
||||
print(f"Confidence reached: {adaptive.confidence:.2%}")
|
||||
|
||||
# Resume crawl with higher page limit
|
||||
print("\nResuming crawl from saved state...")
|
||||
|
||||
resume_config = AdaptiveConfig(
|
||||
confidence_threshold=0.8,
|
||||
max_pages=20, # Increase limit
|
||||
save_state=True,
|
||||
state_path=state_file
|
||||
)
|
||||
|
||||
adaptive2 = AdaptiveCrawler(crawler, config=resume_config)
|
||||
result2 = await adaptive2.digest(
|
||||
start_url="https://docs.python.org/3/",
|
||||
query="exception handling try except finally",
|
||||
resume_from=state_file
|
||||
)
|
||||
|
||||
print(f"Resumed crawl completed: {len(result2.crawled_urls)} total pages")
|
||||
print(f"Final confidence: {adaptive2.confidence:.2%}")
|
||||
|
||||
# Clean up
|
||||
Path(state_file).unlink(missing_ok=True)
|
||||
|
||||
# Example 3: Link selection strategies
|
||||
print("\n" + "="*60)
|
||||
print("EXAMPLE 3: Link Selection Strategies")
|
||||
print("="*60)
|
||||
|
||||
# Conservative link following
|
||||
conservative_config = AdaptiveConfig(
|
||||
confidence_threshold=0.7,
|
||||
max_pages=15,
|
||||
top_k_links=1, # Only follow best link
|
||||
min_gain_threshold=0.15 # High threshold
|
||||
)
|
||||
|
||||
# Aggressive link following
|
||||
aggressive_config = AdaptiveConfig(
|
||||
confidence_threshold=0.7,
|
||||
max_pages=15,
|
||||
top_k_links=10, # Follow many links
|
||||
min_gain_threshold=0.01 # Very low threshold
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(verbose=False) as crawler:
|
||||
for strategy_name, config in [
|
||||
("Conservative", conservative_config),
|
||||
("Aggressive", aggressive_config)
|
||||
]:
|
||||
print(f"\n{strategy_name} link selection:")
|
||||
adaptive = AdaptiveCrawler(crawler, config=config)
|
||||
|
||||
result = await adaptive.digest(
|
||||
start_url="https://httpbin.org",
|
||||
query="api endpoints"
|
||||
)
|
||||
|
||||
# Analyze crawl pattern
|
||||
print(f" - Total pages: {len(result.crawled_urls)}")
|
||||
print(f" - Unique domains: {len(set(url.split('/')[2] for url in result.crawled_urls))}")
|
||||
print(f" - Max depth reached: {max(url.count('/') for url in result.crawled_urls) - 2}")
|
||||
|
||||
# Show saturation trend
|
||||
if hasattr(result, 'new_terms_history') and result.new_terms_history:
|
||||
print(f" - New terms discovered: {result.new_terms_history[:5]}...")
|
||||
print(f" - Saturation trend: {'decreasing' if result.new_terms_history[-1] < result.new_terms_history[0] else 'increasing'}")
|
||||
|
||||
# Example 4: Monitoring crawl progress
|
||||
print("\n" + "="*60)
|
||||
print("EXAMPLE 4: Progress Monitoring")
|
||||
print("="*60)
|
||||
|
||||
# Configuration with detailed monitoring
|
||||
monitor_config = AdaptiveConfig(
|
||||
confidence_threshold=0.75,
|
||||
max_pages=10,
|
||||
top_k_links=3
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(verbose=False) as crawler:
|
||||
adaptive = AdaptiveCrawler(crawler, config=monitor_config)
|
||||
|
||||
# Start crawl
|
||||
print("\nMonitoring crawl progress...")
|
||||
result = await adaptive.digest(
|
||||
start_url="https://httpbin.org",
|
||||
query="http methods headers"
|
||||
)
|
||||
|
||||
# Detailed statistics
|
||||
print("\nDetailed crawl analysis:")
|
||||
adaptive.print_stats(detailed=True)
|
||||
|
||||
# Export for analysis
|
||||
print("\nExporting knowledge base for external analysis...")
|
||||
adaptive.export_knowledge_base("knowledge_export_demo.jsonl")
|
||||
print("Knowledge base exported to: knowledge_export_demo.jsonl")
|
||||
|
||||
# Show sample of exported data
|
||||
with open("knowledge_export_demo.jsonl", 'r') as f:
|
||||
first_line = f.readline()
|
||||
print(f"Sample export: {first_line[:100]}...")
|
||||
|
||||
# Clean up
|
||||
Path("knowledge_export_demo.jsonl").unlink(missing_ok=True)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
76
docs/examples/adaptive_crawling/basic_usage.py
Normal file
76
docs/examples/adaptive_crawling/basic_usage.py
Normal file
@@ -0,0 +1,76 @@
|
||||
"""
|
||||
Basic Adaptive Crawling Example
|
||||
|
||||
This example demonstrates the simplest use case of adaptive crawling:
|
||||
finding information about a specific topic and knowing when to stop.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler
|
||||
|
||||
|
||||
async def main():
|
||||
"""Basic adaptive crawling example"""
|
||||
|
||||
# Initialize the crawler
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
# Create an adaptive crawler with default settings (statistical strategy)
|
||||
adaptive = AdaptiveCrawler(crawler)
|
||||
|
||||
# Note: You can also use embedding strategy for semantic understanding:
|
||||
# from crawl4ai import AdaptiveConfig
|
||||
# config = AdaptiveConfig(strategy="embedding")
|
||||
# adaptive = AdaptiveCrawler(crawler, config)
|
||||
|
||||
# Start adaptive crawling
|
||||
print("Starting adaptive crawl for Python async programming information...")
|
||||
result = await adaptive.digest(
|
||||
start_url="https://docs.python.org/3/library/asyncio.html",
|
||||
query="async await context managers coroutines"
|
||||
)
|
||||
|
||||
# Display crawl statistics
|
||||
print("\n" + "="*50)
|
||||
print("CRAWL STATISTICS")
|
||||
print("="*50)
|
||||
adaptive.print_stats(detailed=False)
|
||||
|
||||
# Get the most relevant content found
|
||||
print("\n" + "="*50)
|
||||
print("MOST RELEVANT PAGES")
|
||||
print("="*50)
|
||||
|
||||
relevant_pages = adaptive.get_relevant_content(top_k=5)
|
||||
for i, page in enumerate(relevant_pages, 1):
|
||||
print(f"\n{i}. {page['url']}")
|
||||
print(f" Relevance Score: {page['score']:.2%}")
|
||||
|
||||
# Show a snippet of the content
|
||||
content = page['content'] or ""
|
||||
if content:
|
||||
snippet = content[:200].replace('\n', ' ')
|
||||
if len(content) > 200:
|
||||
snippet += "..."
|
||||
print(f" Preview: {snippet}")
|
||||
|
||||
# Show final confidence
|
||||
print(f"\n{'='*50}")
|
||||
print(f"Final Confidence: {adaptive.confidence:.2%}")
|
||||
print(f"Total Pages Crawled: {len(result.crawled_urls)}")
|
||||
print(f"Knowledge Base Size: {len(adaptive.state.knowledge_base)} documents")
|
||||
|
||||
# Example: Check if we can answer specific questions
|
||||
print(f"\n{'='*50}")
|
||||
print("INFORMATION SUFFICIENCY CHECK")
|
||||
print(f"{'='*50}")
|
||||
|
||||
if adaptive.confidence >= 0.8:
|
||||
print("✓ High confidence - can answer detailed questions about async Python")
|
||||
elif adaptive.confidence >= 0.6:
|
||||
print("~ Moderate confidence - can answer basic questions")
|
||||
else:
|
||||
print("✗ Low confidence - need more information")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
373
docs/examples/adaptive_crawling/custom_strategies.py
Normal file
373
docs/examples/adaptive_crawling/custom_strategies.py
Normal file
@@ -0,0 +1,373 @@
|
||||
"""
|
||||
Custom Adaptive Crawling Strategies
|
||||
|
||||
This example demonstrates how to implement custom scoring strategies
|
||||
for domain-specific crawling needs.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import re
|
||||
from typing import List, Dict, Set
|
||||
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
|
||||
from crawl4ai.adaptive_crawler import CrawlState, Link
|
||||
import math
|
||||
|
||||
|
||||
class APIDocumentationStrategy:
|
||||
"""
|
||||
Custom strategy optimized for API documentation crawling.
|
||||
Prioritizes endpoint references, code examples, and parameter descriptions.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
# Keywords that indicate high-value API documentation
|
||||
self.api_keywords = {
|
||||
'endpoint', 'request', 'response', 'parameter', 'authentication',
|
||||
'header', 'body', 'query', 'path', 'method', 'get', 'post', 'put',
|
||||
'delete', 'patch', 'status', 'code', 'example', 'curl', 'python'
|
||||
}
|
||||
|
||||
# URL patterns that typically contain API documentation
|
||||
self.valuable_patterns = [
|
||||
r'/api/',
|
||||
r'/reference/',
|
||||
r'/endpoints?/',
|
||||
r'/methods?/',
|
||||
r'/resources?/'
|
||||
]
|
||||
|
||||
# Patterns to avoid
|
||||
self.avoid_patterns = [
|
||||
r'/blog/',
|
||||
r'/news/',
|
||||
r'/about/',
|
||||
r'/contact/',
|
||||
r'/legal/'
|
||||
]
|
||||
|
||||
def score_link(self, link: Link, query: str, state: CrawlState) -> float:
|
||||
"""Custom link scoring for API documentation"""
|
||||
score = 1.0
|
||||
url = link.href.lower()
|
||||
|
||||
# Boost API-related URLs
|
||||
for pattern in self.valuable_patterns:
|
||||
if re.search(pattern, url):
|
||||
score *= 2.0
|
||||
break
|
||||
|
||||
# Reduce score for non-API content
|
||||
for pattern in self.avoid_patterns:
|
||||
if re.search(pattern, url):
|
||||
score *= 0.1
|
||||
break
|
||||
|
||||
# Boost if preview contains API keywords
|
||||
if link.text:
|
||||
preview_lower = link.text.lower()
|
||||
keyword_count = sum(1 for kw in self.api_keywords if kw in preview_lower)
|
||||
score *= (1 + keyword_count * 0.2)
|
||||
|
||||
# Prioritize shallow URLs (likely overview pages)
|
||||
depth = url.count('/') - 2 # Subtract protocol slashes
|
||||
if depth <= 3:
|
||||
score *= 1.5
|
||||
elif depth > 6:
|
||||
score *= 0.5
|
||||
|
||||
return score
|
||||
|
||||
def calculate_api_coverage(self, state: CrawlState, query: str) -> Dict[str, float]:
|
||||
"""Calculate specialized coverage metrics for API documentation"""
|
||||
metrics = {
|
||||
'endpoint_coverage': 0.0,
|
||||
'example_coverage': 0.0,
|
||||
'parameter_coverage': 0.0
|
||||
}
|
||||
|
||||
# Analyze knowledge base for API-specific content
|
||||
endpoint_patterns = [r'GET\s+/', r'POST\s+/', r'PUT\s+/', r'DELETE\s+/']
|
||||
example_patterns = [r'```\w+', r'curl\s+-', r'import\s+requests']
|
||||
param_patterns = [r'param(?:eter)?s?\s*:', r'required\s*:', r'optional\s*:']
|
||||
|
||||
total_docs = len(state.knowledge_base)
|
||||
if total_docs == 0:
|
||||
return metrics
|
||||
|
||||
docs_with_endpoints = 0
|
||||
docs_with_examples = 0
|
||||
docs_with_params = 0
|
||||
|
||||
for doc in state.knowledge_base:
|
||||
content = doc.markdown.raw_markdown if hasattr(doc, 'markdown') else str(doc)
|
||||
|
||||
# Check for endpoints
|
||||
if any(re.search(pattern, content, re.IGNORECASE) for pattern in endpoint_patterns):
|
||||
docs_with_endpoints += 1
|
||||
|
||||
# Check for examples
|
||||
if any(re.search(pattern, content, re.IGNORECASE) for pattern in example_patterns):
|
||||
docs_with_examples += 1
|
||||
|
||||
# Check for parameters
|
||||
if any(re.search(pattern, content, re.IGNORECASE) for pattern in param_patterns):
|
||||
docs_with_params += 1
|
||||
|
||||
metrics['endpoint_coverage'] = docs_with_endpoints / total_docs
|
||||
metrics['example_coverage'] = docs_with_examples / total_docs
|
||||
metrics['parameter_coverage'] = docs_with_params / total_docs
|
||||
|
||||
return metrics
|
||||
|
||||
|
||||
class ResearchPaperStrategy:
|
||||
"""
|
||||
Strategy optimized for crawling research papers and academic content.
|
||||
Prioritizes citations, abstracts, and methodology sections.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
self.academic_keywords = {
|
||||
'abstract', 'introduction', 'methodology', 'results', 'conclusion',
|
||||
'references', 'citation', 'paper', 'study', 'research', 'analysis',
|
||||
'hypothesis', 'experiment', 'findings', 'doi'
|
||||
}
|
||||
|
||||
self.citation_patterns = [
|
||||
r'\[\d+\]', # [1] style citations
|
||||
r'\(\w+\s+\d{4}\)', # (Author 2024) style
|
||||
r'doi:\s*\S+', # DOI references
|
||||
]
|
||||
|
||||
def calculate_academic_relevance(self, content: str, query: str) -> float:
|
||||
"""Calculate relevance score for academic content"""
|
||||
score = 0.0
|
||||
content_lower = content.lower()
|
||||
|
||||
# Check for academic keywords
|
||||
keyword_matches = sum(1 for kw in self.academic_keywords if kw in content_lower)
|
||||
score += keyword_matches * 0.1
|
||||
|
||||
# Check for citations
|
||||
citation_count = sum(
|
||||
len(re.findall(pattern, content))
|
||||
for pattern in self.citation_patterns
|
||||
)
|
||||
score += min(citation_count * 0.05, 1.0) # Cap at 1.0
|
||||
|
||||
# Check for query terms in academic context
|
||||
query_terms = query.lower().split()
|
||||
for term in query_terms:
|
||||
# Boost if term appears near academic keywords
|
||||
for keyword in ['abstract', 'conclusion', 'results']:
|
||||
if keyword in content_lower:
|
||||
section = content_lower[content_lower.find(keyword):content_lower.find(keyword) + 500]
|
||||
if term in section:
|
||||
score += 0.2
|
||||
|
||||
return min(score, 2.0) # Cap total score
|
||||
|
||||
|
||||
async def demo_custom_strategies():
|
||||
"""Demonstrate custom strategy usage"""
|
||||
|
||||
# Example 1: API Documentation Strategy
|
||||
print("="*60)
|
||||
print("EXAMPLE 1: Custom API Documentation Strategy")
|
||||
print("="*60)
|
||||
|
||||
api_strategy = APIDocumentationStrategy()
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Standard adaptive crawler
|
||||
config = AdaptiveConfig(
|
||||
confidence_threshold=0.8,
|
||||
max_pages=15
|
||||
)
|
||||
|
||||
adaptive = AdaptiveCrawler(crawler, config)
|
||||
|
||||
# Override link scoring with custom strategy
|
||||
original_rank_links = adaptive._rank_links
|
||||
|
||||
def custom_rank_links(links, query, state):
|
||||
# Apply custom scoring
|
||||
scored_links = []
|
||||
for link in links:
|
||||
base_score = api_strategy.score_link(link, query, state)
|
||||
scored_links.append((link, base_score))
|
||||
|
||||
# Sort by score
|
||||
scored_links.sort(key=lambda x: x[1], reverse=True)
|
||||
return [link for link, _ in scored_links[:config.top_k_links]]
|
||||
|
||||
adaptive._rank_links = custom_rank_links
|
||||
|
||||
# Crawl API documentation
|
||||
print("\nCrawling API documentation with custom strategy...")
|
||||
state = await adaptive.digest(
|
||||
start_url="https://httpbin.org",
|
||||
query="api endpoints authentication headers"
|
||||
)
|
||||
|
||||
# Calculate custom metrics
|
||||
api_metrics = api_strategy.calculate_api_coverage(state, "api endpoints")
|
||||
|
||||
print(f"\nResults:")
|
||||
print(f"Pages crawled: {len(state.crawled_urls)}")
|
||||
print(f"Confidence: {adaptive.confidence:.2%}")
|
||||
print(f"\nAPI-Specific Metrics:")
|
||||
print(f" - Endpoint coverage: {api_metrics['endpoint_coverage']:.2%}")
|
||||
print(f" - Example coverage: {api_metrics['example_coverage']:.2%}")
|
||||
print(f" - Parameter coverage: {api_metrics['parameter_coverage']:.2%}")
|
||||
|
||||
# Example 2: Combined Strategy
|
||||
print("\n" + "="*60)
|
||||
print("EXAMPLE 2: Hybrid Strategy Combining Multiple Approaches")
|
||||
print("="*60)
|
||||
|
||||
class HybridStrategy:
|
||||
"""Combines multiple strategies with weights"""
|
||||
|
||||
def __init__(self):
|
||||
self.api_strategy = APIDocumentationStrategy()
|
||||
self.research_strategy = ResearchPaperStrategy()
|
||||
self.weights = {
|
||||
'api': 0.7,
|
||||
'research': 0.3
|
||||
}
|
||||
|
||||
def score_content(self, content: str, query: str) -> float:
|
||||
# Get scores from each strategy
|
||||
api_score = self._calculate_api_score(content, query)
|
||||
research_score = self.research_strategy.calculate_academic_relevance(content, query)
|
||||
|
||||
# Weighted combination
|
||||
total_score = (
|
||||
api_score * self.weights['api'] +
|
||||
research_score * self.weights['research']
|
||||
)
|
||||
|
||||
return total_score
|
||||
|
||||
def _calculate_api_score(self, content: str, query: str) -> float:
|
||||
# Simplified API scoring based on keyword presence
|
||||
content_lower = content.lower()
|
||||
api_keywords = self.api_strategy.api_keywords
|
||||
|
||||
keyword_count = sum(1 for kw in api_keywords if kw in content_lower)
|
||||
return min(keyword_count * 0.1, 2.0)
|
||||
|
||||
hybrid_strategy = HybridStrategy()
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
adaptive = AdaptiveCrawler(crawler)
|
||||
|
||||
# Crawl with hybrid scoring
|
||||
print("\nTesting hybrid strategy on technical documentation...")
|
||||
state = await adaptive.digest(
|
||||
start_url="https://docs.python.org/3/library/asyncio.html",
|
||||
query="async await coroutines api"
|
||||
)
|
||||
|
||||
# Analyze results with hybrid strategy
|
||||
print(f"\nHybrid Strategy Analysis:")
|
||||
total_score = 0
|
||||
for doc in adaptive.get_relevant_content(top_k=5):
|
||||
content = doc['content'] or ""
|
||||
score = hybrid_strategy.score_content(content, "async await api")
|
||||
total_score += score
|
||||
print(f" - {doc['url'][:50]}... Score: {score:.2f}")
|
||||
|
||||
print(f"\nAverage hybrid score: {total_score/5:.2f}")
|
||||
|
||||
|
||||
async def demo_performance_optimization():
|
||||
"""Demonstrate performance optimization with custom strategies"""
|
||||
|
||||
print("\n" + "="*60)
|
||||
print("EXAMPLE 3: Performance-Optimized Strategy")
|
||||
print("="*60)
|
||||
|
||||
class PerformanceOptimizedStrategy:
|
||||
"""Strategy that balances thoroughness with speed"""
|
||||
|
||||
def __init__(self):
|
||||
self.url_cache: Set[str] = set()
|
||||
self.domain_scores: Dict[str, float] = {}
|
||||
|
||||
def should_crawl_domain(self, url: str) -> bool:
|
||||
"""Implement domain-level filtering"""
|
||||
domain = url.split('/')[2] if url.startswith('http') else url
|
||||
|
||||
# Skip if we've already crawled many pages from this domain
|
||||
domain_count = sum(1 for cached in self.url_cache if domain in cached)
|
||||
if domain_count > 5:
|
||||
return False
|
||||
|
||||
# Skip low-scoring domains
|
||||
if domain in self.domain_scores and self.domain_scores[domain] < 0.3:
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
def update_domain_score(self, url: str, relevance: float):
|
||||
"""Track domain-level performance"""
|
||||
domain = url.split('/')[2] if url.startswith('http') else url
|
||||
|
||||
if domain not in self.domain_scores:
|
||||
self.domain_scores[domain] = relevance
|
||||
else:
|
||||
# Moving average
|
||||
self.domain_scores[domain] = (
|
||||
0.7 * self.domain_scores[domain] + 0.3 * relevance
|
||||
)
|
||||
|
||||
perf_strategy = PerformanceOptimizedStrategy()
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = AdaptiveConfig(
|
||||
confidence_threshold=0.7,
|
||||
max_pages=10,
|
||||
top_k_links=2 # Fewer links for speed
|
||||
)
|
||||
|
||||
adaptive = AdaptiveCrawler(crawler, config)
|
||||
|
||||
# Track performance
|
||||
import time
|
||||
start_time = time.time()
|
||||
|
||||
state = await adaptive.digest(
|
||||
start_url="https://httpbin.org",
|
||||
query="http methods headers"
|
||||
)
|
||||
|
||||
elapsed = time.time() - start_time
|
||||
|
||||
print(f"\nPerformance Results:")
|
||||
print(f" - Time elapsed: {elapsed:.2f} seconds")
|
||||
print(f" - Pages crawled: {len(state.crawled_urls)}")
|
||||
print(f" - Pages per second: {len(state.crawled_urls)/elapsed:.2f}")
|
||||
print(f" - Final confidence: {adaptive.confidence:.2%}")
|
||||
print(f" - Efficiency: {adaptive.confidence/len(state.crawled_urls):.2%} confidence per page")
|
||||
|
||||
|
||||
async def main():
|
||||
"""Run all demonstrations"""
|
||||
try:
|
||||
await demo_custom_strategies()
|
||||
await demo_performance_optimization()
|
||||
|
||||
print("\n" + "="*60)
|
||||
print("All custom strategy examples completed!")
|
||||
print("="*60)
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
206
docs/examples/adaptive_crawling/embedding_configuration.py
Normal file
206
docs/examples/adaptive_crawling/embedding_configuration.py
Normal file
@@ -0,0 +1,206 @@
|
||||
"""
|
||||
Advanced Embedding Configuration Example
|
||||
|
||||
This example demonstrates all configuration options available for the
|
||||
embedding strategy, including fine-tuning parameters for different use cases.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
|
||||
|
||||
|
||||
async def test_configuration(name: str, config: AdaptiveConfig, url: str, query: str):
|
||||
"""Test a specific configuration"""
|
||||
print(f"\n{'='*60}")
|
||||
print(f"Configuration: {name}")
|
||||
print(f"{'='*60}")
|
||||
|
||||
async with AsyncWebCrawler(verbose=False) as crawler:
|
||||
adaptive = AdaptiveCrawler(crawler, config)
|
||||
result = await adaptive.digest(start_url=url, query=query)
|
||||
|
||||
print(f"Pages crawled: {len(result.crawled_urls)}")
|
||||
print(f"Final confidence: {adaptive.confidence:.1%}")
|
||||
print(f"Stopped reason: {result.metrics.get('stopped_reason', 'max_pages')}")
|
||||
|
||||
if result.metrics.get('is_irrelevant', False):
|
||||
print("⚠️ Query detected as irrelevant!")
|
||||
|
||||
return result
|
||||
|
||||
|
||||
async def main():
|
||||
"""Demonstrate various embedding configurations"""
|
||||
|
||||
print("EMBEDDING STRATEGY CONFIGURATION EXAMPLES")
|
||||
print("=" * 60)
|
||||
|
||||
# Base URL and query for testing
|
||||
test_url = "https://docs.python.org/3/library/asyncio.html"
|
||||
|
||||
# 1. Default Configuration
|
||||
config_default = AdaptiveConfig(
|
||||
strategy="embedding",
|
||||
max_pages=10
|
||||
)
|
||||
|
||||
await test_configuration(
|
||||
"Default Settings",
|
||||
config_default,
|
||||
test_url,
|
||||
"async programming patterns"
|
||||
)
|
||||
|
||||
# 2. Strict Coverage Requirements
|
||||
config_strict = AdaptiveConfig(
|
||||
strategy="embedding",
|
||||
max_pages=20,
|
||||
|
||||
# Stricter similarity requirements
|
||||
embedding_k_exp=5.0, # Default is 3.0, higher = stricter
|
||||
embedding_coverage_radius=0.15, # Default is 0.2, lower = stricter
|
||||
|
||||
# Higher validation threshold
|
||||
embedding_validation_min_score=0.6, # Default is 0.3
|
||||
|
||||
# More query variations for better coverage
|
||||
n_query_variations=15 # Default is 10
|
||||
)
|
||||
|
||||
await test_configuration(
|
||||
"Strict Coverage (Research/Academic)",
|
||||
config_strict,
|
||||
test_url,
|
||||
"comprehensive guide async await"
|
||||
)
|
||||
|
||||
# 3. Fast Exploration
|
||||
config_fast = AdaptiveConfig(
|
||||
strategy="embedding",
|
||||
max_pages=10,
|
||||
top_k_links=5, # Follow more links per page
|
||||
|
||||
# Relaxed requirements for faster convergence
|
||||
embedding_k_exp=1.0, # Lower = more lenient
|
||||
embedding_min_relative_improvement=0.05, # Stop earlier
|
||||
|
||||
# Lower quality thresholds
|
||||
embedding_quality_min_confidence=0.5, # Display lower confidence
|
||||
embedding_quality_max_confidence=0.85,
|
||||
|
||||
# Fewer query variations for speed
|
||||
n_query_variations=5
|
||||
)
|
||||
|
||||
await test_configuration(
|
||||
"Fast Exploration (Quick Overview)",
|
||||
config_fast,
|
||||
test_url,
|
||||
"async basics"
|
||||
)
|
||||
|
||||
# 4. Irrelevance Detection Focus
|
||||
config_irrelevance = AdaptiveConfig(
|
||||
strategy="embedding",
|
||||
max_pages=5,
|
||||
|
||||
# Aggressive irrelevance detection
|
||||
embedding_min_confidence_threshold=0.2, # Higher threshold (default 0.1)
|
||||
embedding_k_exp=5.0, # Strict similarity
|
||||
|
||||
# Quick stopping for irrelevant content
|
||||
embedding_min_relative_improvement=0.15
|
||||
)
|
||||
|
||||
await test_configuration(
|
||||
"Irrelevance Detection",
|
||||
config_irrelevance,
|
||||
test_url,
|
||||
"recipe for chocolate cake" # Irrelevant query
|
||||
)
|
||||
|
||||
# 5. High-Quality Knowledge Base
|
||||
config_quality = AdaptiveConfig(
|
||||
strategy="embedding",
|
||||
max_pages=30,
|
||||
|
||||
# Deduplication settings
|
||||
embedding_overlap_threshold=0.75, # More aggressive deduplication
|
||||
|
||||
# Quality focus
|
||||
embedding_validation_min_score=0.5,
|
||||
embedding_quality_scale_factor=1.0, # Linear quality mapping
|
||||
|
||||
# Balanced parameters
|
||||
embedding_k_exp=3.0,
|
||||
embedding_nearest_weight=0.8, # Focus on best matches
|
||||
embedding_top_k_weight=0.2
|
||||
)
|
||||
|
||||
await test_configuration(
|
||||
"High-Quality Knowledge Base",
|
||||
config_quality,
|
||||
test_url,
|
||||
"asyncio advanced patterns best practices"
|
||||
)
|
||||
|
||||
# 6. Custom Embedding Provider
|
||||
if os.getenv('OPENAI_API_KEY'):
|
||||
config_openai = AdaptiveConfig(
|
||||
strategy="embedding",
|
||||
max_pages=10,
|
||||
|
||||
# Use OpenAI embeddings
|
||||
embedding_llm_config={
|
||||
'provider': 'openai/text-embedding-3-small',
|
||||
'api_token': os.getenv('OPENAI_API_KEY')
|
||||
},
|
||||
|
||||
# OpenAI embeddings are high quality, can be stricter
|
||||
embedding_k_exp=4.0,
|
||||
n_query_variations=12
|
||||
)
|
||||
|
||||
await test_configuration(
|
||||
"OpenAI Embeddings",
|
||||
config_openai,
|
||||
test_url,
|
||||
"event-driven architecture patterns"
|
||||
)
|
||||
|
||||
# Parameter Guide
|
||||
print("\n" + "="*60)
|
||||
print("PARAMETER TUNING GUIDE")
|
||||
print("="*60)
|
||||
|
||||
print("\n📊 Key Parameters and Their Effects:")
|
||||
print("\n1. embedding_k_exp (default: 3.0)")
|
||||
print(" - Lower (1-2): More lenient, faster convergence")
|
||||
print(" - Higher (4-5): Stricter, better precision")
|
||||
|
||||
print("\n2. embedding_coverage_radius (default: 0.2)")
|
||||
print(" - Lower (0.1-0.15): Requires closer matches")
|
||||
print(" - Higher (0.25-0.3): Accepts broader matches")
|
||||
|
||||
print("\n3. n_query_variations (default: 10)")
|
||||
print(" - Lower (5-7): Faster, less comprehensive")
|
||||
print(" - Higher (15-20): Better coverage, slower")
|
||||
|
||||
print("\n4. embedding_min_confidence_threshold (default: 0.1)")
|
||||
print(" - Set to 0.15-0.2 for aggressive irrelevance detection")
|
||||
print(" - Set to 0.05 to crawl even barely relevant content")
|
||||
|
||||
print("\n5. embedding_validation_min_score (default: 0.3)")
|
||||
print(" - Higher (0.5-0.6): Requires strong validation")
|
||||
print(" - Lower (0.2): More permissive stopping")
|
||||
|
||||
print("\n💡 Tips:")
|
||||
print("- For research: High k_exp, more variations, strict validation")
|
||||
print("- For exploration: Low k_exp, fewer variations, relaxed thresholds")
|
||||
print("- For quality: Focus on overlap_threshold and validation scores")
|
||||
print("- For speed: Reduce variations, increase min_relative_improvement")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
109
docs/examples/adaptive_crawling/embedding_strategy.py
Normal file
109
docs/examples/adaptive_crawling/embedding_strategy.py
Normal file
@@ -0,0 +1,109 @@
|
||||
"""
|
||||
Embedding Strategy Example for Adaptive Crawling
|
||||
|
||||
This example demonstrates how to use the embedding-based strategy
|
||||
for semantic understanding and intelligent crawling.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
|
||||
|
||||
|
||||
async def main():
|
||||
"""Demonstrate embedding strategy for adaptive crawling"""
|
||||
|
||||
# Configure embedding strategy
|
||||
config = AdaptiveConfig(
|
||||
strategy="embedding", # Use embedding strategy
|
||||
embedding_model="sentence-transformers/all-MiniLM-L6-v2", # Default model
|
||||
n_query_variations=10, # Generate 10 semantic variations
|
||||
max_pages=15,
|
||||
top_k_links=3,
|
||||
min_gain_threshold=0.05,
|
||||
|
||||
# Embedding-specific parameters
|
||||
embedding_k_exp=3.0, # Higher = stricter similarity requirements
|
||||
embedding_min_confidence_threshold=0.1, # Stop if <10% relevant
|
||||
embedding_validation_min_score=0.4 # Validation threshold
|
||||
)
|
||||
|
||||
# Optional: Use OpenAI embeddings instead
|
||||
if os.getenv('OPENAI_API_KEY'):
|
||||
config.embedding_llm_config = {
|
||||
'provider': 'openai/text-embedding-3-small',
|
||||
'api_token': os.getenv('OPENAI_API_KEY')
|
||||
}
|
||||
print("Using OpenAI embeddings")
|
||||
else:
|
||||
print("Using sentence-transformers (local embeddings)")
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
adaptive = AdaptiveCrawler(crawler, config)
|
||||
|
||||
# Test 1: Relevant query with semantic understanding
|
||||
print("\n" + "="*50)
|
||||
print("TEST 1: Semantic Query Understanding")
|
||||
print("="*50)
|
||||
|
||||
result = await adaptive.digest(
|
||||
start_url="https://docs.python.org/3/library/asyncio.html",
|
||||
query="concurrent programming event-driven architecture"
|
||||
)
|
||||
|
||||
print("\nQuery Expansion:")
|
||||
print(f"Original query expanded to {len(result.expanded_queries)} variations")
|
||||
for i, q in enumerate(result.expanded_queries[:3], 1):
|
||||
print(f" {i}. {q}")
|
||||
print(" ...")
|
||||
|
||||
print("\nResults:")
|
||||
adaptive.print_stats(detailed=False)
|
||||
|
||||
# Test 2: Detecting irrelevant queries
|
||||
print("\n" + "="*50)
|
||||
print("TEST 2: Irrelevant Query Detection")
|
||||
print("="*50)
|
||||
|
||||
# Reset crawler for new query
|
||||
adaptive = AdaptiveCrawler(crawler, config)
|
||||
|
||||
result = await adaptive.digest(
|
||||
start_url="https://docs.python.org/3/library/asyncio.html",
|
||||
query="how to bake chocolate chip cookies"
|
||||
)
|
||||
|
||||
if result.metrics.get('is_irrelevant', False):
|
||||
print("\n✅ Successfully detected irrelevant query!")
|
||||
print(f"Stopped after just {len(result.crawled_urls)} pages")
|
||||
print(f"Reason: {result.metrics.get('stopped_reason', 'unknown')}")
|
||||
else:
|
||||
print("\n❌ Failed to detect irrelevance")
|
||||
|
||||
print(f"Final confidence: {adaptive.confidence:.1%}")
|
||||
|
||||
# Test 3: Semantic gap analysis
|
||||
print("\n" + "="*50)
|
||||
print("TEST 3: Semantic Gap Analysis")
|
||||
print("="*50)
|
||||
|
||||
# Show how embedding strategy identifies gaps
|
||||
adaptive = AdaptiveCrawler(crawler, config)
|
||||
|
||||
result = await adaptive.digest(
|
||||
start_url="https://realpython.com",
|
||||
query="python decorators advanced patterns"
|
||||
)
|
||||
|
||||
print(f"\nSemantic gaps identified: {len(result.semantic_gaps)}")
|
||||
print(f"Knowledge base embeddings shape: {result.kb_embeddings.shape if result.kb_embeddings is not None else 'None'}")
|
||||
|
||||
# Show coverage metrics specific to embedding strategy
|
||||
print("\nEmbedding-specific metrics:")
|
||||
print(f" Average best similarity: {result.metrics.get('avg_best_similarity', 0):.3f}")
|
||||
print(f" Coverage score: {result.metrics.get('coverage_score', 0):.3f}")
|
||||
print(f" Validation confidence: {result.metrics.get('validation_confidence', 0):.2%}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
167
docs/examples/adaptive_crawling/embedding_vs_statistical.py
Normal file
167
docs/examples/adaptive_crawling/embedding_vs_statistical.py
Normal file
@@ -0,0 +1,167 @@
|
||||
"""
|
||||
Comparison: Embedding vs Statistical Strategy
|
||||
|
||||
This example demonstrates the differences between statistical and embedding
|
||||
strategies for adaptive crawling, showing when to use each approach.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import time
|
||||
import os
|
||||
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
|
||||
|
||||
|
||||
async def crawl_with_strategy(url: str, query: str, strategy: str, **kwargs):
|
||||
"""Helper function to crawl with a specific strategy"""
|
||||
config = AdaptiveConfig(
|
||||
strategy=strategy,
|
||||
max_pages=20,
|
||||
top_k_links=3,
|
||||
min_gain_threshold=0.05,
|
||||
**kwargs
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(verbose=False) as crawler:
|
||||
adaptive = AdaptiveCrawler(crawler, config)
|
||||
|
||||
start_time = time.time()
|
||||
result = await adaptive.digest(start_url=url, query=query)
|
||||
elapsed = time.time() - start_time
|
||||
|
||||
return {
|
||||
'result': result,
|
||||
'crawler': adaptive,
|
||||
'elapsed': elapsed,
|
||||
'pages': len(result.crawled_urls),
|
||||
'confidence': adaptive.confidence
|
||||
}
|
||||
|
||||
|
||||
async def main():
|
||||
"""Compare embedding and statistical strategies"""
|
||||
|
||||
# Test scenarios
|
||||
test_cases = [
|
||||
{
|
||||
'name': 'Technical Documentation (Specific Terms)',
|
||||
'url': 'https://docs.python.org/3/library/asyncio.html',
|
||||
'query': 'asyncio.create_task event_loop.run_until_complete'
|
||||
},
|
||||
{
|
||||
'name': 'Conceptual Query (Semantic Understanding)',
|
||||
'url': 'https://docs.python.org/3/library/asyncio.html',
|
||||
'query': 'concurrent programming patterns'
|
||||
},
|
||||
{
|
||||
'name': 'Ambiguous Query',
|
||||
'url': 'https://realpython.com',
|
||||
'query': 'python performance optimization'
|
||||
}
|
||||
]
|
||||
|
||||
# Configure embedding strategy
|
||||
embedding_config = {}
|
||||
if os.getenv('OPENAI_API_KEY'):
|
||||
embedding_config['embedding_llm_config'] = {
|
||||
'provider': 'openai/text-embedding-3-small',
|
||||
'api_token': os.getenv('OPENAI_API_KEY')
|
||||
}
|
||||
|
||||
for test in test_cases:
|
||||
print("\n" + "="*70)
|
||||
print(f"TEST: {test['name']}")
|
||||
print(f"URL: {test['url']}")
|
||||
print(f"Query: '{test['query']}'")
|
||||
print("="*70)
|
||||
|
||||
# Run statistical strategy
|
||||
print("\n📊 Statistical Strategy:")
|
||||
stat_result = await crawl_with_strategy(
|
||||
test['url'],
|
||||
test['query'],
|
||||
'statistical'
|
||||
)
|
||||
|
||||
print(f" Pages crawled: {stat_result['pages']}")
|
||||
print(f" Time taken: {stat_result['elapsed']:.2f}s")
|
||||
print(f" Confidence: {stat_result['confidence']:.1%}")
|
||||
print(f" Sufficient: {'Yes' if stat_result['crawler'].is_sufficient else 'No'}")
|
||||
|
||||
# Show term coverage
|
||||
if hasattr(stat_result['result'], 'term_frequencies'):
|
||||
query_terms = test['query'].lower().split()
|
||||
covered = sum(1 for term in query_terms
|
||||
if term in stat_result['result'].term_frequencies)
|
||||
print(f" Term coverage: {covered}/{len(query_terms)} query terms found")
|
||||
|
||||
# Run embedding strategy
|
||||
print("\n🧠 Embedding Strategy:")
|
||||
emb_result = await crawl_with_strategy(
|
||||
test['url'],
|
||||
test['query'],
|
||||
'embedding',
|
||||
**embedding_config
|
||||
)
|
||||
|
||||
print(f" Pages crawled: {emb_result['pages']}")
|
||||
print(f" Time taken: {emb_result['elapsed']:.2f}s")
|
||||
print(f" Confidence: {emb_result['confidence']:.1%}")
|
||||
print(f" Sufficient: {'Yes' if emb_result['crawler'].is_sufficient else 'No'}")
|
||||
|
||||
# Show semantic understanding
|
||||
if emb_result['result'].expanded_queries:
|
||||
print(f" Query variations: {len(emb_result['result'].expanded_queries)}")
|
||||
print(f" Semantic gaps: {len(emb_result['result'].semantic_gaps)}")
|
||||
|
||||
# Compare results
|
||||
print("\n📈 Comparison:")
|
||||
efficiency_diff = ((stat_result['pages'] - emb_result['pages']) /
|
||||
stat_result['pages'] * 100) if stat_result['pages'] > 0 else 0
|
||||
|
||||
print(f" Efficiency: ", end="")
|
||||
if efficiency_diff > 0:
|
||||
print(f"Embedding used {efficiency_diff:.0f}% fewer pages")
|
||||
else:
|
||||
print(f"Statistical used {-efficiency_diff:.0f}% fewer pages")
|
||||
|
||||
print(f" Speed: ", end="")
|
||||
if stat_result['elapsed'] < emb_result['elapsed']:
|
||||
print(f"Statistical was {emb_result['elapsed']/stat_result['elapsed']:.1f}x faster")
|
||||
else:
|
||||
print(f"Embedding was {stat_result['elapsed']/emb_result['elapsed']:.1f}x faster")
|
||||
|
||||
print(f" Confidence difference: {abs(stat_result['confidence'] - emb_result['confidence'])*100:.0f} percentage points")
|
||||
|
||||
# Recommendation
|
||||
print("\n💡 Recommendation:")
|
||||
if 'specific' in test['name'].lower() or all(len(term) > 5 for term in test['query'].split()):
|
||||
print(" → Statistical strategy is likely better for this use case (specific terms)")
|
||||
elif 'conceptual' in test['name'].lower() or 'semantic' in test['name'].lower():
|
||||
print(" → Embedding strategy is likely better for this use case (semantic understanding)")
|
||||
else:
|
||||
if emb_result['confidence'] > stat_result['confidence'] + 0.1:
|
||||
print(" → Embedding strategy achieved significantly better understanding")
|
||||
elif stat_result['elapsed'] < emb_result['elapsed'] / 2:
|
||||
print(" → Statistical strategy is much faster with similar results")
|
||||
else:
|
||||
print(" → Both strategies performed similarly; choose based on your priorities")
|
||||
|
||||
# Summary recommendations
|
||||
print("\n" + "="*70)
|
||||
print("STRATEGY SELECTION GUIDE")
|
||||
print("="*70)
|
||||
print("\n✅ Use STATISTICAL strategy when:")
|
||||
print(" - Queries contain specific technical terms")
|
||||
print(" - Speed is critical")
|
||||
print(" - No API access available")
|
||||
print(" - Working with well-structured documentation")
|
||||
|
||||
print("\n✅ Use EMBEDDING strategy when:")
|
||||
print(" - Queries are conceptual or ambiguous")
|
||||
print(" - Semantic understanding is important")
|
||||
print(" - Need to detect irrelevant content")
|
||||
print(" - Working with diverse content sources")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
232
docs/examples/adaptive_crawling/export_import_kb.py
Normal file
232
docs/examples/adaptive_crawling/export_import_kb.py
Normal file
@@ -0,0 +1,232 @@
|
||||
"""
|
||||
Knowledge Base Export and Import
|
||||
|
||||
This example demonstrates how to export crawled knowledge bases and
|
||||
import them for reuse, sharing, or analysis.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
from pathlib import Path
|
||||
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
|
||||
|
||||
|
||||
async def build_knowledge_base():
|
||||
"""Build a knowledge base about web technologies"""
|
||||
print("="*60)
|
||||
print("PHASE 1: Building Knowledge Base")
|
||||
print("="*60)
|
||||
|
||||
async with AsyncWebCrawler(verbose=False) as crawler:
|
||||
adaptive = AdaptiveCrawler(crawler)
|
||||
|
||||
# Crawl information about HTTP
|
||||
print("\n1. Gathering HTTP protocol information...")
|
||||
await adaptive.digest(
|
||||
start_url="https://httpbin.org",
|
||||
query="http methods headers status codes"
|
||||
)
|
||||
print(f" - Pages crawled: {len(adaptive.state.crawled_urls)}")
|
||||
print(f" - Confidence: {adaptive.confidence:.2%}")
|
||||
|
||||
# Add more information about APIs
|
||||
print("\n2. Adding API documentation knowledge...")
|
||||
await adaptive.digest(
|
||||
start_url="https://httpbin.org/anything",
|
||||
query="rest api json response request"
|
||||
)
|
||||
print(f" - Total pages: {len(adaptive.state.crawled_urls)}")
|
||||
print(f" - Confidence: {adaptive.confidence:.2%}")
|
||||
|
||||
# Export the knowledge base
|
||||
export_path = "web_tech_knowledge.jsonl"
|
||||
print(f"\n3. Exporting knowledge base to {export_path}")
|
||||
adaptive.export_knowledge_base(export_path)
|
||||
|
||||
# Show export statistics
|
||||
export_size = Path(export_path).stat().st_size / 1024
|
||||
with open(export_path, 'r') as f:
|
||||
line_count = sum(1 for _ in f)
|
||||
|
||||
print(f" - Exported {line_count} documents")
|
||||
print(f" - File size: {export_size:.1f} KB")
|
||||
|
||||
return export_path
|
||||
|
||||
|
||||
async def analyze_knowledge_base(kb_path):
|
||||
"""Analyze the exported knowledge base"""
|
||||
print("\n" + "="*60)
|
||||
print("PHASE 2: Analyzing Exported Knowledge Base")
|
||||
print("="*60)
|
||||
|
||||
# Read and analyze JSONL
|
||||
documents = []
|
||||
with open(kb_path, 'r') as f:
|
||||
for line in f:
|
||||
documents.append(json.loads(line))
|
||||
|
||||
print(f"\nKnowledge base contains {len(documents)} documents:")
|
||||
|
||||
# Analyze document properties
|
||||
total_content_length = 0
|
||||
urls_by_domain = {}
|
||||
|
||||
for doc in documents:
|
||||
# Content analysis
|
||||
content_length = len(doc.get('content', ''))
|
||||
total_content_length += content_length
|
||||
|
||||
# URL analysis
|
||||
url = doc.get('url', '')
|
||||
domain = url.split('/')[2] if url.startswith('http') else 'unknown'
|
||||
urls_by_domain[domain] = urls_by_domain.get(domain, 0) + 1
|
||||
|
||||
# Show sample document
|
||||
if documents.index(doc) == 0:
|
||||
print(f"\nSample document structure:")
|
||||
print(f" - URL: {url}")
|
||||
print(f" - Content length: {content_length} chars")
|
||||
print(f" - Has metadata: {'metadata' in doc}")
|
||||
print(f" - Has links: {len(doc.get('links', []))} links")
|
||||
print(f" - Query: {doc.get('query', 'N/A')}")
|
||||
|
||||
print(f"\nContent statistics:")
|
||||
print(f" - Total content: {total_content_length:,} characters")
|
||||
print(f" - Average per document: {total_content_length/len(documents):,.0f} chars")
|
||||
|
||||
print(f"\nDomain distribution:")
|
||||
for domain, count in urls_by_domain.items():
|
||||
print(f" - {domain}: {count} pages")
|
||||
|
||||
|
||||
async def import_and_continue():
|
||||
"""Import a knowledge base and continue crawling"""
|
||||
print("\n" + "="*60)
|
||||
print("PHASE 3: Importing and Extending Knowledge Base")
|
||||
print("="*60)
|
||||
|
||||
kb_path = "web_tech_knowledge.jsonl"
|
||||
|
||||
async with AsyncWebCrawler(verbose=False) as crawler:
|
||||
# Create new adaptive crawler
|
||||
adaptive = AdaptiveCrawler(crawler)
|
||||
|
||||
# Import existing knowledge base
|
||||
print(f"\n1. Importing knowledge base from {kb_path}")
|
||||
adaptive.import_knowledge_base(kb_path)
|
||||
|
||||
print(f" - Imported {len(adaptive.state.knowledge_base)} documents")
|
||||
print(f" - Existing URLs: {len(adaptive.state.crawled_urls)}")
|
||||
|
||||
# Check current state
|
||||
print("\n2. Checking imported knowledge state:")
|
||||
adaptive.print_stats(detailed=False)
|
||||
|
||||
# Continue crawling with new query
|
||||
print("\n3. Extending knowledge with new query...")
|
||||
await adaptive.digest(
|
||||
start_url="https://httpbin.org/status/200",
|
||||
query="error handling retry timeout"
|
||||
)
|
||||
|
||||
print("\n4. Final knowledge base state:")
|
||||
adaptive.print_stats(detailed=False)
|
||||
|
||||
# Export extended knowledge base
|
||||
extended_path = "web_tech_knowledge_extended.jsonl"
|
||||
adaptive.export_knowledge_base(extended_path)
|
||||
print(f"\n5. Extended knowledge base exported to {extended_path}")
|
||||
|
||||
|
||||
async def share_knowledge_bases():
|
||||
"""Demonstrate sharing knowledge bases between projects"""
|
||||
print("\n" + "="*60)
|
||||
print("PHASE 4: Sharing Knowledge Between Projects")
|
||||
print("="*60)
|
||||
|
||||
# Simulate two different projects
|
||||
project_a_kb = "project_a_knowledge.jsonl"
|
||||
project_b_kb = "project_b_knowledge.jsonl"
|
||||
|
||||
async with AsyncWebCrawler(verbose=False) as crawler:
|
||||
# Project A: Security documentation
|
||||
print("\n1. Project A: Building security knowledge...")
|
||||
crawler_a = AdaptiveCrawler(crawler)
|
||||
await crawler_a.digest(
|
||||
start_url="https://httpbin.org/basic-auth/user/pass",
|
||||
query="authentication security headers"
|
||||
)
|
||||
crawler_a.export_knowledge_base(project_a_kb)
|
||||
print(f" - Exported {len(crawler_a.state.knowledge_base)} documents")
|
||||
|
||||
# Project B: API testing
|
||||
print("\n2. Project B: Building testing knowledge...")
|
||||
crawler_b = AdaptiveCrawler(crawler)
|
||||
await crawler_b.digest(
|
||||
start_url="https://httpbin.org/anything",
|
||||
query="testing endpoints mocking"
|
||||
)
|
||||
crawler_b.export_knowledge_base(project_b_kb)
|
||||
print(f" - Exported {len(crawler_b.state.knowledge_base)} documents")
|
||||
|
||||
# Merge knowledge bases
|
||||
print("\n3. Merging knowledge bases...")
|
||||
merged_crawler = AdaptiveCrawler(crawler)
|
||||
|
||||
# Import both knowledge bases
|
||||
merged_crawler.import_knowledge_base(project_a_kb)
|
||||
initial_size = len(merged_crawler.state.knowledge_base)
|
||||
|
||||
merged_crawler.import_knowledge_base(project_b_kb)
|
||||
final_size = len(merged_crawler.state.knowledge_base)
|
||||
|
||||
print(f" - Project A documents: {initial_size}")
|
||||
print(f" - Additional from Project B: {final_size - initial_size}")
|
||||
print(f" - Total merged documents: {final_size}")
|
||||
|
||||
# Export merged knowledge
|
||||
merged_kb = "merged_knowledge.jsonl"
|
||||
merged_crawler.export_knowledge_base(merged_kb)
|
||||
print(f"\n4. Merged knowledge base exported to {merged_kb}")
|
||||
|
||||
# Show combined coverage
|
||||
print("\n5. Combined knowledge coverage:")
|
||||
merged_crawler.print_stats(detailed=False)
|
||||
|
||||
|
||||
async def main():
|
||||
"""Run all examples"""
|
||||
try:
|
||||
# Build initial knowledge base
|
||||
kb_path = await build_knowledge_base()
|
||||
|
||||
# Analyze the export
|
||||
await analyze_knowledge_base(kb_path)
|
||||
|
||||
# Import and extend
|
||||
await import_and_continue()
|
||||
|
||||
# Demonstrate sharing
|
||||
await share_knowledge_bases()
|
||||
|
||||
print("\n" + "="*60)
|
||||
print("All examples completed successfully!")
|
||||
print("="*60)
|
||||
|
||||
finally:
|
||||
# Clean up generated files
|
||||
print("\nCleaning up generated files...")
|
||||
for file in [
|
||||
"web_tech_knowledge.jsonl",
|
||||
"web_tech_knowledge_extended.jsonl",
|
||||
"project_a_knowledge.jsonl",
|
||||
"project_b_knowledge.jsonl",
|
||||
"merged_knowledge.jsonl"
|
||||
]:
|
||||
Path(file).unlink(missing_ok=True)
|
||||
print("Cleanup complete.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
Reference in New Issue
Block a user