feat(crawl4ai): Implement adaptive crawling feature

This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage.

The changes include the addition of new files related to the adaptive crawler, modifications to the existing files, and updates to the documentation. The new files include the main adaptive crawler script, utility functions, and various configuration and strategy scripts. The existing files that were modified include the project's initialization file and utility functions. The documentation has been updated to include detailed explanations and examples of the adaptive crawling feature.

The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool.

Significant modifications:
- Added adaptive_crawler.py and related scripts
- Modified __init__.py and utils.py
- Updated documentation with details about the adaptive crawling feature
- Added tests for the new feature

BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature.

Refs: #123, #456
347
docs/md_v2/core/adaptive-crawling.md
Normal file
@@ -0,0 +1,347 @@
# Adaptive Web Crawling

## Introduction

Traditional web crawlers follow predetermined patterns and crawl pages without knowing when they have gathered enough information. **Adaptive Crawling** addresses this by scoring how well the pages collected so far answer a query and stopping once that score is high enough.

Think of it like research: when you're looking for information, you don't read every book in the library. You stop when you've found sufficient information to answer your question. Adaptive Crawling applies the same idea to web crawling.

## Key Concepts

### The Problem It Solves

When crawling websites for specific information, you face two challenges:
1. **Under-crawling**: Stopping too early and missing crucial information
2. **Over-crawling**: Wasting resources by crawling irrelevant pages

Adaptive Crawling solves both by using a three-layer scoring system that determines when you have "enough" information.

### How It Works

The AdaptiveCrawler uses three metrics to measure information sufficiency:

- **Coverage**: How well your collected pages cover the query terms
- **Consistency**: Whether the information is coherent across pages
- **Saturation**: Detecting when new pages aren't adding new information

When these metrics indicate sufficient information has been gathered, crawling stops automatically.
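
Of the three metrics, saturation is the easiest to illustrate in isolation. The following is a toy sketch, not crawl4ai's actual scoring: it assumes saturation can be approximated as the share of terms on the newest page that earlier pages already contained. The `tokenize` and `saturation` helpers are hypothetical names introduced only for this example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def saturation(pages: list[str]) -> float:
    """Share of the newest page's terms that earlier pages already contained.

    Returns 0.0 while fewer than two pages exist (nothing to compare against)
    and 1.0 when the newest page adds no new terms at all.
    """
    if len(pages) < 2:
        return 0.0
    seen = set().union(*(tokenize(page) for page in pages[:-1]))
    latest = tokenize(pages[-1])
    if not latest:
        return 1.0
    new_terms = latest - seen
    return 1.0 - len(new_terms) / len(latest)


pages = [
    "async context managers use aenter",
    "async context managers use aexit",
]
# Only "aexit" is new on the second page, so four of its five terms were already seen.
print(f"saturation: {saturation(pages):.2f}")
```

A value close to 1.0 suggests that further pages are mostly repeating what has already been collected.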

## Quick Start

### Basic Usage

```python
import asyncio

from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Create an adaptive crawler
        adaptive = AdaptiveCrawler(crawler)

        # Start crawling with a query
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View statistics
        adaptive.print_stats()

        # Get the most relevant content
        relevant_pages = adaptive.get_relevant_content(top_k=5)
        for page in relevant_pages:
            print(f"- {page['url']} (score: {page['score']:.2f})")

# Run the coroutine from synchronous code
asyncio.run(main())
```

### Configuration Options

```python
from crawl4ai import AdaptiveConfig, AdaptiveCrawler

config = AdaptiveConfig(
    confidence_threshold=0.7,    # Stop when 70% confident (default: 0.8)
    max_pages=20,                # Maximum pages to crawl (default: 50)
    top_k_links=3,               # Links to follow per page (default: 5)
    min_gain_threshold=0.05      # Minimum expected gain to continue (default: 0.1)
)

# `crawler` is the AsyncWebCrawler instance from Basic Usage
adaptive = AdaptiveCrawler(crawler, config=config)
```
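
To make the interaction between these options concrete, here is a minimal sketch of a stopping check. It is an assumption about how the thresholds could combine, not the library's implementation: `StopSettings` and `should_stop` are hypothetical names, and the order in which the conditions are tested is chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StopSettings:
    """Hypothetical stand-in that mirrors three of the options shown above."""
    confidence_threshold: float = 0.8
    max_pages: int = 50
    min_gain_threshold: float = 0.1


def should_stop(
    confidence: float,
    pages_crawled: int,
    best_expected_gain: float,
    settings: StopSettings,
) -> Optional[str]:
    """Return the reason to stop crawling, or None to keep going."""
    if confidence >= settings.confidence_threshold:
        return "confidence_reached"
    if pages_crawled >= settings.max_pages:
        return "max_pages_reached"
    if best_expected_gain < settings.min_gain_threshold:
        return "gain_too_low"
    return None


settings = StopSettings(confidence_threshold=0.7, max_pages=20, min_gain_threshold=0.05)
print(should_stop(0.75, 4, 0.30, settings))   # confident enough
print(should_stop(0.50, 20, 0.30, settings))  # page budget exhausted
print(should_stop(0.50, 4, 0.01, settings))   # no remaining link looks worthwhile
print(should_stop(0.50, 4, 0.30, settings))   # keep crawling
```

Read this way, `max_pages` acts as a hard budget, while `confidence_threshold` and `min_gain_threshold` express "good enough" and "not worth continuing" respectively.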

## Crawling Strategies

Adaptive Crawling supports two distinct strategies for determining information sufficiency:

### Statistical Strategy (Default)

The statistical strategy uses pure information theory and term-based analysis:

- **Fast and efficient** - No API calls or model loading
- **Term-based coverage** - Analyzes query term presence and distribution
- **No external dependencies** - Works offline
- **Best for**: Well-defined queries with specific terminology

```python
# Default configuration uses statistical strategy
config = AdaptiveConfig(
    strategy="statistical",  # This is the default
    confidence_threshold=0.8
)
```
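
As a rough illustration of "query term presence and distribution", the sketch below computes, for each query term, the fraction of collected pages that contain it. This is a simplified assumption, not the statistical strategy's real formula; `tokenize`, `term_coverage`, and `overall_coverage` are hypothetical helpers.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def term_coverage(query: str, pages: list[str]) -> dict[str, float]:
    """For each query term, the fraction of pages that contain it."""
    page_terms = [tokenize(page) for page in pages]
    coverage = {}
    for term in sorted(tokenize(query)):
        hits = sum(term in terms for terms in page_terms)
        coverage[term] = hits / len(pages) if pages else 0.0
    return coverage


def overall_coverage(per_term: dict[str, float]) -> float:
    """Share of query terms that appear on at least one page."""
    if not per_term:
        return 0.0
    return sum(1 for share in per_term.values() if share > 0) / len(per_term)


pages = [
    "Async functions and the await keyword",
    "Context managers and the with statement",
    "Async context managers with aenter and aexit",
]
per_term = term_coverage("async context managers", pages)
for term, share in per_term.items():
    print(f"{term}: found on {share:.0%} of pages")
print(f"overall: {overall_coverage(per_term):.0%}")
```

Because the comparison is literal, a page that discusses the topic with different vocabulary contributes nothing here, which is the gap the embedding strategy is meant to address.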

### Embedding Strategy

The embedding strategy uses semantic embeddings for deeper understanding:

- **Semantic understanding** - Captures meaning beyond exact term matches
- **Query expansion** - Automatically generates query variations
- **Gap-driven selection** - Identifies semantic gaps in knowledge
- **Validation-based stopping** - Uses held-out queries to validate coverage
- **Best for**: Complex queries, ambiguous topics, conceptual understanding

```python
# Configure embedding strategy
config = AdaptiveConfig(
    strategy="embedding",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # Default
    n_query_variations=10,  # Generate 10 query variations
    embedding_min_confidence_threshold=0.1  # Stop if completely irrelevant
)

# With custom embedding provider (e.g., OpenAI)
config = AdaptiveConfig(
    strategy="embedding",
    embedding_llm_config={
        'provider': 'openai/text-embedding-3-small',
        'api_token': 'your-api-key'
    }
)
```

### Strategy Comparison

| Feature | Statistical | Embedding |
|---------|------------|-----------|
| **Speed** | Very fast | Moderate (API calls) |
| **Cost** | Free | Depends on provider |
| **Accuracy** | Good for exact terms | Excellent for concepts |
| **Dependencies** | None | Embedding model/API |
| **Query Understanding** | Literal | Semantic |
| **Best Use Case** | Technical docs, specific terms | Research, broad topics |

### Embedding Strategy Configuration

The embedding strategy offers fine-tuned control through several parameters:

```python
config = AdaptiveConfig(
    strategy="embedding",

    # Model configuration
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_llm_config=None,  # Use for API-based embeddings

    # Query expansion
    n_query_variations=10,  # Number of query variations to generate

    # Coverage parameters
    embedding_coverage_radius=0.2,  # Distance threshold for coverage
    embedding_k_exp=3.0,  # Exponential decay factor (higher = stricter)

    # Stopping criteria
    embedding_min_relative_improvement=0.1,  # Min improvement to continue
    embedding_validation_min_score=0.3,  # Min validation score
    embedding_min_confidence_threshold=0.1,  # Below this = irrelevant

    # Link selection
    embedding_overlap_threshold=0.85,  # Similarity for deduplication

    # Display confidence mapping
    embedding_quality_min_confidence=0.7,  # Min displayed confidence
    embedding_quality_max_confidence=0.95  # Max displayed confidence
)
```
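
The two coverage parameters are described only as a "distance threshold" and an "exponential decay factor". One plausible reading, sketched below, is that each query variation is scored by its cosine distance to the nearest page embedding, counted as covered when that distance is within the radius, and weighted by `exp(-k_exp * distance)`. This is an assumption made for illustration, not the library's confirmed formula; `cosine_distance` and `coverage_scores` are hypothetical helpers, and the two-dimensional vectors stand in for real embeddings.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 minus cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def coverage_scores(
    query_vecs: list[list[float]],
    page_vecs: list[list[float]],
    radius: float = 0.2,
    k_exp: float = 3.0,
) -> tuple[float, float]:
    """Return (fraction of queries within the radius, mean exponentially decayed score)."""
    nearest = [min(cosine_distance(q, p) for p in page_vecs) for q in query_vecs]
    covered = sum(d <= radius for d in nearest) / len(nearest)
    decayed = sum(math.exp(-k_exp * d) for d in nearest) / len(nearest)
    return covered, decayed


# Toy embeddings: the first query variation matches the page exactly,
# the second is orthogonal to it.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
page_vecs = [[1.0, 0.0]]
covered, decayed = coverage_scores(query_vecs, page_vecs)
print(f"covered: {covered:.2f}, decayed score: {decayed:.3f}")
```

Under this reading, raising `k_exp` makes the decayed score fall off faster with distance, which matches the "higher = stricter" comment.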

### Handling Irrelevant Queries

The embedding strategy can detect when a query is completely unrelated to the content:

```python
# This will stop quickly with low confidence
result = await adaptive.digest(
    start_url="https://docs.python.org/3/",
    query="how to cook pasta"  # Irrelevant to Python docs
)

# Check if query was irrelevant
if result.metrics.get('is_irrelevant', False):
    print("Query is unrelated to the content!")
```

## When to Use Adaptive Crawling

### Perfect For:
- **Research Tasks**: Finding comprehensive information about a topic
- **Question Answering**: Gathering sufficient context to answer specific queries
- **Knowledge Base Building**: Creating focused datasets for AI/ML applications
- **Competitive Intelligence**: Collecting complete information about specific products/features

### Not Recommended For:
- **Full Site Archiving**: When you need every page regardless of content
- **Structured Data Extraction**: When targeting specific, known page patterns
- **Real-time Monitoring**: When you need continuous updates

## Understanding the Output

### Confidence Score

The confidence score (0-1) indicates how sufficient the gathered information is:
- **0.0-0.3**: Insufficient information, needs more crawling
- **0.3-0.6**: Partial information, may answer basic queries
- **0.6-0.8**: Good coverage, can answer most queries
- **0.8-1.0**: Excellent coverage, comprehensive information
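
The bands above can be turned into a small helper for logging or reporting. This is a sketch only: `describe_confidence` is a hypothetical function, and because the listed ranges share their endpoints, it assumes a boundary value such as 0.3 belongs to the higher band.

```python
def describe_confidence(score: float) -> str:
    """Map a confidence score in [0, 1] to the band names used in this guide."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be between 0 and 1, got {score}")
    if score < 0.3:
        return "insufficient"
    if score < 0.6:
        return "partial"
    if score < 0.8:
        return "good"
    return "excellent"


for score in (0.1, 0.3, 0.75, 0.8):
    print(f"{score:.2f} -> {describe_confidence(score)}")
```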

### Statistics Display

```python
adaptive.print_stats(detailed=False)  # Summary table
adaptive.print_stats(detailed=True)   # Detailed metrics
```

The summary shows:
- Pages crawled vs. confidence achieved
- Coverage, consistency, and saturation scores
- Crawling efficiency metrics

## Persistence and Resumption

### Saving Progress

```python
config = AdaptiveConfig(
    save_state=True,
    state_path="my_crawl_state.json"
)

# Pass the config to the adaptive crawler so the state settings take effect
adaptive = AdaptiveCrawler(crawler, config=config)

# Crawl will auto-save progress
result = await adaptive.digest(start_url, query)
```

### Resuming a Crawl

```python
# Resume from saved state
result = await adaptive.digest(
    start_url,
    query,
    resume_from="my_crawl_state.json"
)
```

### Exporting Knowledge Base

```python
# Export collected pages to JSONL
adaptive.export_knowledge_base("knowledge_base.jsonl")

# Import into another session
new_adaptive = AdaptiveCrawler(crawler)
new_adaptive.import_knowledge_base("knowledge_base.jsonl")
```
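
Since JSONL stores one JSON object per line, an exported file can also be inspected with the standard library alone. The sketch below makes no claim about the fields crawl4ai writes: the `url` key in the sample data is a made-up placeholder, and `load_jsonl` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records


# Demonstrate on a throwaway file with placeholder records.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "knowledge_base.jsonl"
    sample.write_text(
        '{"url": "https://example.com/a"}\n\n{"url": "https://example.com/b"}\n',
        encoding="utf-8",
    )
    records = load_jsonl(sample)

print(f"loaded {len(records)} records")
```

Reading the file line by line keeps memory use proportional to a single record at a time while parsing, which helps when a knowledge base grows large.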

## Best Practices

### 1. Query Formulation
- Use specific, descriptive queries
- Include key terms you expect to find
- Avoid overly broad queries

### 2. Threshold Tuning
- Start with default (0.8) for general use
- Lower to 0.6-0.7 for exploratory crawling
- Raise to 0.9+ for exhaustive coverage

### 3. Performance Optimization
- Use appropriate `max_pages` limits
- Adjust `top_k_links` based on site structure
- Enable caching for repeat crawls

### 4. Link Selection
- The crawler prioritizes links based on:
  - Relevance to query
  - Expected information gain
  - URL structure and depth
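
The three link-selection criteria could be combined in many ways. The sketch below shows one hypothetical combination, purely for intuition: relevance multiplied by expected gain, minus a small penalty per URL path segment. The `Candidate` class, the scoring formula, and the `depth_penalty` value are all assumptions and do not describe crawl4ai's internal ranking.

```python
import heapq
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Candidate:
    """Hypothetical link candidate with precomputed scores in [0, 1]."""
    url: str
    relevance: float
    expected_gain: float


def url_depth(url: str) -> int:
    """Number of non-empty path segments in the URL."""
    return len([segment for segment in urlparse(url).path.split("/") if segment])


def link_score(candidate: Candidate, depth_penalty: float = 0.05) -> float:
    """Illustrative score: relevance times gain, minus a penalty per path segment."""
    return candidate.relevance * candidate.expected_gain - depth_penalty * url_depth(candidate.url)


def select_links(candidates: list[Candidate], top_k: int = 3) -> list[Candidate]:
    """Keep the top_k highest-scoring candidates, best first."""
    return heapq.nlargest(top_k, candidates, key=link_score)


candidates = [
    Candidate("https://docs.example.com/blog", relevance=0.2, expected_gain=0.5),
    Candidate("https://docs.example.com/a/b/c/d/e", relevance=0.9, expected_gain=0.8),
    Candidate("https://docs.example.com/async/context", relevance=0.9, expected_gain=0.8),
]
for link in select_links(candidates, top_k=2):
    print(f"{link_score(link):.2f}  {link.url}")
```

The `top_k` argument plays the role that `top_k_links` plays in the configuration: it caps how many links are followed from each page.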

## Examples

### Research Assistant

```python
# Gather information about a programming concept
result = await adaptive.digest(
    start_url="https://realpython.com",
    query="python decorators implementation patterns"
)

# Get the most relevant excerpts
for doc in adaptive.get_relevant_content(top_k=3):
    print(f"\nFrom: {doc['url']}")
    print(f"Relevance: {doc['score']:.2%}")
    print(doc['content'][:500] + "...")
```

### Knowledge Base Builder

```python
# Build a focused knowledge base about machine learning
queries = [
    "supervised learning algorithms",
    "neural network architectures",
    "model evaluation metrics"
]

for query in queries:
    await adaptive.digest(
        start_url="https://scikit-learn.org/stable/",
        query=query
    )

# Export combined knowledge base
adaptive.export_knowledge_base("ml_knowledge.jsonl")
```

### API Documentation Crawler

```python
# Intelligently crawl API documentation
config = AdaptiveConfig(
    confidence_threshold=0.85,  # Higher threshold for completeness
    max_pages=30
)

# Pass the config by keyword, as in Configuration Options
adaptive = AdaptiveCrawler(crawler, config=config)
result = await adaptive.digest(
    start_url="https://api.example.com/docs",
    query="authentication endpoints rate limits"
)
```

## Next Steps

- Learn about [Advanced Adaptive Strategies](../advanced/adaptive-strategies.md)
- Explore the [AdaptiveCrawler API Reference](../api/adaptive-crawler.md)
- See more [Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples/adaptive_crawling)

## FAQ

**Q: How is this different from traditional crawling?**
A: Traditional crawling follows fixed patterns (BFS/DFS). Adaptive crawling makes intelligent decisions about which links to follow and when to stop based on information gain.

**Q: Can I use this with JavaScript-heavy sites?**
A: Yes. AdaptiveCrawler runs on top of the AsyncWebCrawler instance you pass to it, so that crawler's capabilities, including JavaScript execution, remain available.

**Q: How does it handle large websites?**
A: The algorithm naturally limits crawling to relevant sections. Use `max_pages` as a safety limit.

**Q: Can I customize the scoring algorithms?**
A: Advanced users can implement custom strategies. See [Adaptive Strategies](../advanced/adaptive-strategies.md).