feat(crawl4ai): Implement adaptive crawling feature
This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage.

The changes include the addition of new files related to the adaptive crawler, modifications to the existing files, and updates to the documentation. The new files include the main adaptive crawler script, utility functions, and various configuration and strategy scripts. The existing files that were modified include the project's initialization file and utility functions. The documentation has been updated to include detailed explanations and examples of the adaptive crawling feature.

The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool.

Significant modifications:
- Added adaptive_crawler.py and related scripts
- Modified __init__.py and utils.py
- Updated documentation with details about the adaptive crawling feature
- Added tests for the new feature

BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature.

Refs: #123, #456
432
docs/md_v2/advanced/adaptive-strategies.md
Normal file
@@ -0,0 +1,432 @@
# Advanced Adaptive Strategies

## Overview

While the default adaptive crawling configuration works well for most use cases, understanding the underlying strategies and scoring mechanisms allows you to fine-tune the crawler for specific domains and requirements.

## The Three-Layer Scoring System

### 1. Coverage Score

Coverage measures how comprehensively your knowledge base covers the query terms and related concepts.

#### Mathematical Foundation

```text
Coverage(K, Q) = (1 / |Q|) × Σ(t ∈ Q) score(t, K)

where score(t, K) = doc_coverage(t) × (1 + freq_boost(t))
```

Here `K` is the knowledge base (the pages crawled so far) and `Q` is the set of query terms.

#### Components

- **Document Coverage**: Fraction of documents containing the term
- **Frequency Boost**: Logarithmic bonus for term frequency
- **Query Decomposition**: Handles multi-word queries intelligently

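The coverage formula can be sketched in a few lines of plain Python. This is a minimal illustration rather than the library's implementation: the whitespace tokenizer, the `log1p(...) / 10` scaling of the frequency boost, and the clamp to 1.0 are all assumptions made for the example, and `tokenize` and `coverage_score` are hypothetical names.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def coverage_score(query: str, documents: List[str]) -> float:
    """Average per-term score: doc_coverage(t) * (1 + freq_boost(t))."""
    query_terms = set(tokenize(query))
    if not query_terms or not documents:
        return 0.0

    doc_counts = [Counter(tokenize(doc)) for doc in documents]
    total = 0.0
    for term in query_terms:
        # Fraction of documents that mention the term at least once
        containing = sum(1 for counts in doc_counts if term in counts)
        doc_coverage = containing / len(doc_counts)
        # Logarithmic bonus for repeated mentions (the /10 scaling is assumed)
        occurrences = sum(counts[term] for counts in doc_counts)
        freq_boost = math.log1p(occurrences) / 10
        total += doc_coverage * (1 + freq_boost)

    # Clamp so the result stays comparable with a 0-1 confidence threshold
    return min(1.0, total / len(query_terms))


docs = ["async context managers in python", "python decorators"]
print(f"{coverage_score('python async', docs):.3f}")
```

A term that appears in every document contributes slightly more than 1 before clamping, while a term missing from the knowledge base contributes 0 and pulls the average down.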
#### Tuning Coverage

```python
# For technical documentation with specific terminology
config = AdaptiveConfig(
    confidence_threshold=0.85,  # Require high coverage
    top_k_links=5               # Cast wider net
)

# For general topics with synonyms
config = AdaptiveConfig(
    confidence_threshold=0.6,   # Lower threshold
    top_k_links=2               # More focused
)
```

### 2. Consistency Score

Consistency evaluates whether the information across pages is coherent and non-contradictory.

#### How It Works

1. Extracts key statements from each document
2. Compares statements across documents
3. Measures agreement vs. contradiction
4. Returns a normalized score (0-1)

#### Practical Impact

- **High consistency (>0.8)**: Information is reliable and coherent
- **Medium consistency (0.5-0.8)**: Some variation, but generally aligned
- **Low consistency (<0.5)**: Conflicting information, need more sources

### 3. Saturation Score

Saturation detects when new pages stop providing novel information.

#### Detection Algorithm

```python
# Tracks new unique terms per page
new_terms_page_1 = 50
new_terms_page_2 = 30  # 60% of first
new_terms_page_3 = 15  # 50% of second
new_terms_page_4 = 5   # 33% of third
# Saturation detected: rapidly diminishing returns
```

#### Configuration

```python
config = AdaptiveConfig(
    min_gain_threshold=0.1  # Stop if <10% new information
)
```

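To make the saturation idea concrete, here is a small sketch that counts new unique terms per page and compares the latest page's contribution against a gain threshold. It is an assumed interpretation, not the library's code: "gain" is taken to mean the latest page's new terms divided by all terms seen before it, and `new_term_counts` and `is_saturated` are hypothetical helper names.

```python
from typing import List


def new_term_counts(pages: List[str]) -> List[int]:
    """Number of previously unseen terms contributed by each page, in order."""
    seen = set()
    counts = []
    for text in pages:
        terms = set(text.lower().split())
        counts.append(len(terms - seen))
        seen |= terms
    return counts


def is_saturated(counts: List[int], min_gain_threshold: float = 0.1) -> bool:
    """True when the latest page added less than the threshold share of new terms.

    Assumption: gain = new terms on the latest page / terms known before it.
    """
    if len(counts) < 2:
        return False
    known_before = sum(counts[:-1])
    if known_before == 0:
        return False
    return counts[-1] / known_before < min_gain_threshold


# The per-page counts from the example above
print(is_saturated([50, 30, 15]))     # third page still adds 15/80 of new terms
print(is_saturated([50, 30, 15, 5]))  # fourth page adds only 5/95
```

Under this interpretation the crawl would continue after the third page and stop after the fourth, which matches the "rapidly diminishing returns" reading of the example.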
## Link Ranking Algorithm

### Expected Information Gain

Each uncrawled link is scored based on:

```text
ExpectedGain(link) = Relevance × Novelty × Authority
```

#### 1. Relevance Scoring

Uses the BM25 algorithm on the link preview text:

```python
relevance = BM25(link.preview_text, query)
```

Factors:
- Term frequency in preview
- Inverse document frequency
- Preview length normalization

#### 2. Novelty Estimation

Measures how different the link appears from already-crawled content:

```python
novelty = 1 - max_similarity(preview, knowledge_base)
```

Prevents crawling duplicate or highly similar pages.

#### 3. Authority Calculation

URL structure and domain analysis:

```python
authority = f(domain_rank, url_depth, url_structure)
```

Factors:
- Domain reputation
- URL depth (fewer slashes = higher authority)
- Clean URL structure

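The three factors can be combined in a short, self-contained sketch. The stand-ins below are deliberately simpler than what the text describes and are assumptions for illustration only: relevance uses plain query-term overlap instead of BM25, novelty uses Jaccard similarity, and authority looks only at URL depth. All function names are hypothetical.

```python
from typing import List, Set
from urllib.parse import urlparse


def terms(text: str) -> Set[str]:
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relevance(preview: str, query: str) -> float:
    """Share of query terms found in the preview (stand-in for BM25)."""
    q = terms(query)
    return len(q & terms(preview)) / len(q) if q else 0.0


def novelty(preview: str, crawled_texts: List[str]) -> float:
    """1 - max similarity to anything already in the knowledge base."""
    p = terms(preview)
    return 1.0 - max((jaccard(p, terms(t)) for t in crawled_texts), default=0.0)


def authority(url: str) -> float:
    """Shallower paths score higher; domain reputation is ignored here."""
    depth = len([seg for seg in urlparse(url).path.split("/") if seg])
    return 1.0 / (1 + depth)


def expected_gain(url: str, preview: str, query: str, crawled_texts: List[str]) -> float:
    return relevance(preview, query) * novelty(preview, crawled_texts) * authority(url)


print(expected_gain(
    "https://example.com/docs",
    "oauth2 token guide",
    "oauth2 jwt",
    ["intro page"],
))
```

Because the factors are multiplied, a link that is irrelevant, already covered, or buried deep in the site ends up with a low score even if the other two factors are high.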
### Custom Link Scoring

```python
class CustomLinkScorer:
    def score(self, link: Link, query: str, state: CrawlState) -> float:
        # Prioritize specific URL patterns
        if "/api/reference/" in link.href:
            return 2.0  # Double the score

        # Deprioritize certain sections
        if "/archive/" in link.href:
            return 0.1  # Reduce score by 90%

        # Default scoring
        return 1.0

# Use with adaptive crawler
adaptive = AdaptiveCrawler(
    crawler,
    config=config,
    link_scorer=CustomLinkScorer()
)
```

## Domain-Specific Configurations

### Technical Documentation

```python
tech_doc_config = AdaptiveConfig(
    confidence_threshold=0.85,
    max_pages=30,
    top_k_links=3,
    min_gain_threshold=0.05  # Keep crawling for small gains
)
```

Rationale:
- High threshold ensures comprehensive coverage
- Lower gain threshold captures edge cases
- Moderate link following for depth

### News & Articles

```python
news_config = AdaptiveConfig(
    confidence_threshold=0.6,
    max_pages=10,
    top_k_links=5,
    min_gain_threshold=0.15  # Stop quickly on repetition
)
```

Rationale:
- Lower threshold (articles often repeat information)
- Higher gain threshold (avoid duplicate stories)
- More links per page (explore different perspectives)

### E-commerce

```python
ecommerce_config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=2,
    min_gain_threshold=0.1
)
```

Rationale:
- Balanced threshold for product variations
- Focused link following (avoid infinite products)
- Standard gain threshold

### Research & Academic

```python
research_config = AdaptiveConfig(
    confidence_threshold=0.9,
    max_pages=50,
    top_k_links=4,
    min_gain_threshold=0.02  # Very low - capture citations
)
```

Rationale:
- Very high threshold for completeness
- Many pages allowed for thorough research
- Very low gain threshold to capture references

## Performance Optimization

### Memory Management

```python
# For large crawls, persist state to disk
config = AdaptiveConfig(
    max_pages=100,
    save_state=True,
    state_path="large_crawl.json"
)

# Periodically clean state
# (get_top_relevant is a placeholder for your own ranking helper)
if len(state.knowledge_base) > 1000:
    # Keep only most relevant
    state.knowledge_base = get_top_relevant(state.knowledge_base, 500)
```

### Parallel Processing

```python
import asyncio

# Use multiple start points
start_urls = [
    "https://docs.example.com/intro",
    "https://docs.example.com/api",
    "https://docs.example.com/guides"
]

# Crawl in parallel
tasks = [
    adaptive.digest(url, query)
    for url in start_urls
]
results = await asyncio.gather(*tasks)
```

### Caching Strategy

```python
# Enable caching for repeated crawls
async with AsyncWebCrawler(
    config=BrowserConfig(
        cache_mode=CacheMode.ENABLED
    )
) as crawler:
    adaptive = AdaptiveCrawler(crawler, config)
```

## Debugging & Analysis

### Enable Verbose Logging

```python
import logging

logging.basicConfig(level=logging.DEBUG)
adaptive = AdaptiveCrawler(crawler, config, verbose=True)
```

### Analyze Crawl Patterns

```python
# After crawling
state = await adaptive.digest(start_url, query)

# Analyze link selection
print("Link selection order:")
for i, url in enumerate(state.crawl_order):
    print(f"{i+1}. {url}")

# Analyze term discovery
print("\nTerm discovery rate:")
for i, new_terms in enumerate(state.new_terms_history):
    print(f"Page {i+1}: {new_terms} new terms")

# Analyze score progression
print("\nScore progression:")
print(f"Coverage: {state.metrics['coverage_history']}")
print(f"Saturation: {state.metrics['saturation_history']}")
```

### Export for Analysis

```python
# Export detailed metrics
import json

metrics = {
    "query": query,
    "total_pages": len(state.crawled_urls),
    "confidence": adaptive.confidence,
    "coverage_stats": adaptive.coverage_stats,
    "crawl_order": state.crawl_order,
    "term_frequencies": dict(state.term_frequencies),
    "new_terms_history": state.new_terms_history
}

with open("crawl_analysis.json", "w") as f:
    json.dump(metrics, f, indent=2)
```

## Custom Strategies

### Implementing a Custom Strategy

```python
from typing import List

from crawl4ai.adaptive_crawler import BaseStrategy

class DomainSpecificStrategy(BaseStrategy):
    def calculate_coverage(self, state: CrawlState) -> float:
        # Custom coverage calculation
        # e.g., weight certain terms more heavily
        pass

    def calculate_consistency(self, state: CrawlState) -> float:
        # Custom consistency logic
        # e.g., domain-specific validation
        pass

    def rank_links(self, links: List[Link], state: CrawlState) -> List[Link]:
        # Custom link ranking
        # e.g., prioritize specific URL patterns
        pass

# Use custom strategy
adaptive = AdaptiveCrawler(
    crawler,
    config=config,
    strategy=DomainSpecificStrategy()
)
```

### Combining Strategies

```python
class HybridStrategy(BaseStrategy):
    def __init__(self):
        self.strategies = [
            TechnicalDocStrategy(),
            SemanticSimilarityStrategy(),
            URLPatternStrategy()
        ]

    def calculate_confidence(self, state: CrawlState) -> float:
        # Weighted combination of strategies
        scores = [s.calculate_confidence(state) for s in self.strategies]
        weights = [0.5, 0.3, 0.2]
        return sum(s * w for s, w in zip(scores, weights))
```

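The weighted combination used by a hybrid strategy can be tried in isolation. The sketch below is a standalone helper, not part of crawl4ai: `combine_confidence` is a hypothetical name, and normalizing by the weight total is an added safeguard so the result stays in the 0-1 range even if the weights do not sum to 1.

```python
from typing import Sequence


def combine_confidence(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-strategy confidence scores."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(s * w for s, w in zip(scores, weights))
    # Dividing by the total keeps the result in [0, 1] for scores in [0, 1]
    return weighted / total_weight


# Three strategies reporting 0.9, 0.6 and 0.3 with weights 0.5 / 0.3 / 0.2
print(round(combine_confidence([0.9, 0.6, 0.3], [0.5, 0.3, 0.2]), 2))
```

Validating lengths up front also avoids a silent failure mode of `zip`, which would otherwise drop the extra scores or weights without warning.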
## Best Practices

### 1. Start Conservative

Begin with default settings and adjust based on results:

```python
# Start with defaults
result = await adaptive.digest(url, query)

# Analyze and adjust
if adaptive.confidence < 0.7:
    config.max_pages += 10
    config.confidence_threshold -= 0.1
```

### 2. Monitor Resource Usage

```python
import psutil

# Check memory before large crawls
memory_percent = psutil.virtual_memory().percent
if memory_percent > 80:
    config.max_pages = min(config.max_pages, 20)
```

### 3. Use Domain Knowledge

```python
# For API documentation
if "api" in start_url:
    config.top_k_links = 2  # APIs have clear structure

# For blogs
if "blog" in start_url:
    config.min_gain_threshold = 0.2  # Avoid similar posts
```

### 4. Validate Results

```python
# Always validate the knowledge base
relevant_content = adaptive.get_relevant_content(top_k=10)

# Check coverage
query_terms = set(query.lower().split())
covered_terms = set()

for doc in relevant_content:
    content_lower = doc['content'].lower()
    for term in query_terms:
        if term in content_lower:
            covered_terms.add(term)

# Guard against an empty query to avoid division by zero
coverage_ratio = len(covered_terms) / len(query_terms) if query_terms else 0.0
print(f"Query term coverage: {coverage_ratio:.0%}")
```

## Next Steps

- Explore [Custom Strategy Implementation](../tutorials/custom-adaptive-strategies.md)
- Learn about [Knowledge Base Management](../tutorials/knowledge-base-management.md)
- See [Performance Benchmarks](../benchmarks/adaptive-performance.md)

244
docs/md_v2/api/adaptive-crawler.md
Normal file
@@ -0,0 +1,244 @@
# AdaptiveCrawler

The `AdaptiveCrawler` class implements intelligent web crawling that automatically determines when sufficient information has been gathered to answer a query. It uses a three-layer scoring system to evaluate coverage, consistency, and saturation.

## Constructor

```python
AdaptiveCrawler(
    crawler: AsyncWebCrawler,
    config: Optional[AdaptiveConfig] = None
)
```

### Parameters

- **crawler** (`AsyncWebCrawler`): The underlying web crawler instance to use for fetching pages
- **config** (`Optional[AdaptiveConfig]`): Configuration settings for adaptive crawling behavior. If not provided, uses default settings.

## Primary Method

### digest()

The main method that performs adaptive crawling starting from a URL with a specific query.

```python
async def digest(
    start_url: str,
    query: str,
    resume_from: Optional[Union[str, Path]] = None
) -> CrawlState
```

#### Parameters

- **start_url** (`str`): The starting URL for crawling
- **query** (`str`): The search query that guides the crawling process
- **resume_from** (`Optional[Union[str, Path]]`): Path to a saved state file to resume from

#### Returns

- **CrawlState**: The final crawl state containing all crawled URLs, knowledge base, and metrics

#### Example

```python
async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler)
    state = await adaptive.digest(
        start_url="https://docs.python.org",
        query="async context managers"
    )
```

## Properties

### confidence

Current confidence score (0-1) indicating information sufficiency.

```python
@property
def confidence(self) -> float
```

### coverage_stats

Dictionary containing detailed coverage statistics.

```python
@property
def coverage_stats(self) -> Dict[str, float]
```

Returns:
- **coverage**: Query term coverage score
- **consistency**: Information consistency score
- **saturation**: Content saturation score
- **confidence**: Overall confidence score

### is_sufficient

Boolean indicating whether sufficient information has been gathered.

```python
@property
def is_sufficient(self) -> bool
```

### state

Access to the current crawl state.

```python
@property
def state(self) -> CrawlState
```

## Methods

### get_relevant_content()

Retrieve the most relevant content from the knowledge base.

```python
def get_relevant_content(
    self,
    top_k: int = 5
) -> List[Dict[str, Any]]
```

#### Parameters

- **top_k** (`int`): Number of top relevant documents to return (default: 5)

#### Returns

List of dictionaries containing:
- **url**: The URL of the page
- **content**: The page content
- **score**: Relevance score
- **metadata**: Additional page metadata

### print_stats()

Display crawl statistics in formatted output.

```python
def print_stats(
    self,
    detailed: bool = False
) -> None
```

#### Parameters

- **detailed** (`bool`): If True, shows detailed metrics with colors. If False, shows summary table.

### export_knowledge_base()

Export the collected knowledge base to a JSONL file.

```python
def export_knowledge_base(
    self,
    path: Union[str, Path]
) -> None
```

#### Parameters

- **path** (`Union[str, Path]`): Output file path for JSONL export

#### Example

```python
adaptive.export_knowledge_base("my_knowledge.jsonl")
```

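Since the export is JSONL (one JSON object per line), the file can be inspected with the standard library alone. The sketch below only assumes that line-per-record layout; the fields inside each record are not documented here, so the helper makes no assumptions about them, and `load_jsonl` is a hypothetical name.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Union


def load_jsonl(path: Union[str, Path]) -> List[Dict[str, Any]]:
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For example, `len(load_jsonl("my_knowledge.jsonl"))` gives the number of exported records, and printing the keys of the first record shows which fields the export actually contains.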
### import_knowledge_base()

Import a previously exported knowledge base.

```python
def import_knowledge_base(
    self,
    path: Union[str, Path]
) -> None
```

#### Parameters

- **path** (`Union[str, Path]`): Path to JSONL file to import

## Configuration

The `AdaptiveConfig` class controls the behavior of adaptive crawling:

```python
@dataclass
class AdaptiveConfig:
    confidence_threshold: float = 0.8   # Stop when confidence reaches this
    max_pages: int = 50                 # Maximum pages to crawl
    top_k_links: int = 5                # Links to follow per page
    min_gain_threshold: float = 0.1     # Minimum expected gain to continue
    save_state: bool = False            # Auto-save crawl state
    state_path: Optional[str] = None    # Path for state persistence
```

### Example with Custom Config

```python
config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=3
)

adaptive = AdaptiveCrawler(crawler, config=config)
```

## Complete Example

```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig

async def main():
    # Configure adaptive crawling
    config = AdaptiveConfig(
        confidence_threshold=0.75,
        max_pages=15,
        save_state=True,
        state_path="my_crawl.json"
    )

    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config)

        # Start crawling
        state = await adaptive.digest(
            start_url="https://example.com/docs",
            query="authentication oauth2 jwt"
        )

        # Check results
        print(f"Confidence achieved: {adaptive.confidence:.0%}")
        adaptive.print_stats()

        # Get most relevant pages
        for page in adaptive.get_relevant_content(top_k=3):
            print(f"- {page['url']} (score: {page['score']:.2f})")

        # Export for later use
        adaptive.export_knowledge_base("auth_knowledge.jsonl")

if __name__ == "__main__":
    asyncio.run(main())
```

## See Also

- [digest() Method Reference](digest.md)
- [Adaptive Crawling Guide](../core/adaptive-crawling.md)
- [Advanced Adaptive Strategies](../advanced/adaptive-strategies.md)

181
docs/md_v2/api/digest.md
Normal file
@@ -0,0 +1,181 @@
# digest()

The `digest()` method is the primary interface for adaptive web crawling. It intelligently crawls websites starting from a given URL, guided by a query, and automatically determines when sufficient information has been gathered.

## Method Signature

```python
async def digest(
    start_url: str,
    query: str,
    resume_from: Optional[Union[str, Path]] = None
) -> CrawlState
```

## Parameters

### start_url
- **Type**: `str`
- **Required**: Yes
- **Description**: The starting URL for the crawl. This should be a valid HTTP/HTTPS URL that serves as the entry point for information gathering.

### query
- **Type**: `str`
- **Required**: Yes
- **Description**: The search query that guides the crawling process. This should contain key terms related to the information you're seeking. The crawler uses this to evaluate relevance and determine which links to follow.

### resume_from
- **Type**: `Optional[Union[str, Path]]`
- **Default**: `None`
- **Description**: Path to a previously saved crawl state file. When provided, the crawler resumes from the saved state instead of starting fresh.

## Return Value

Returns a `CrawlState` object containing:

- **crawled_urls** (`Set[str]`): All URLs that have been crawled
- **knowledge_base** (`List[CrawlResult]`): Collection of crawled pages with content
- **pending_links** (`List[Link]`): Links discovered but not yet crawled
- **metrics** (`Dict[str, float]`): Performance and quality metrics
- **query** (`str`): The original query
- Additional statistical information for scoring

## How It Works

The `digest()` method implements an intelligent crawling algorithm:

1. **Initial Crawl**: Starts from the provided URL
2. **Link Analysis**: Evaluates all discovered links for relevance
3. **Scoring**: Uses three metrics to assess information sufficiency:
   - **Coverage**: How well the query terms are covered
   - **Consistency**: Information coherence across pages
   - **Saturation**: Diminishing returns detection
4. **Adaptive Selection**: Chooses the most promising links to follow
5. **Stopping Decision**: Automatically stops when the confidence threshold is reached

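The five steps can be traced on a toy, in-memory "site" without any network access. This is a heavily simplified sketch, not the real `digest()`: confidence is reduced to query-term coverage alone, links are ranked by query-term overlap with their anchor text, and the `site` dictionary layout and the `toy_digest` name are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def toy_digest(
    site: Dict[str, dict],
    start_url: str,
    query: str,
    confidence_threshold: float = 0.8,
    max_pages: int = 10,
    top_k_links: int = 2,
) -> Tuple[List[str], float]:
    query_terms = set(query.lower().split())
    crawled: List[str] = []
    seen_terms = set()
    frontier = [start_url]
    confidence = 0.0

    while frontier and len(crawled) < max_pages:
        url = frontier.pop(0)
        if url in crawled or url not in site:
            continue

        # Steps 1-2: "crawl" the page and collect its links
        page = site[url]
        crawled.append(url)
        seen_terms |= set(page["text"].lower().split())

        # Step 3: score (coverage only in this sketch)
        confidence = len(query_terms & seen_terms) / len(query_terms) if query_terms else 0.0

        # Step 5: stop once the threshold is reached
        if confidence >= confidence_threshold:
            break

        # Step 4: follow only the most promising links
        scored = [
            (len(query_terms & set(anchor.lower().split())), link)
            for link, anchor in page["links"].items()
        ]
        relevant = [item for item in scored if item[0] > 0]
        relevant.sort(key=lambda item: item[0], reverse=True)
        frontier.extend(link for _, link in relevant[:top_k_links])

    return crawled, confidence


site = {
    "/": {"text": "welcome to the docs",
          "links": {"/auth": "oauth2 authentication", "/blog": "company news", "/jwt": "jwt tokens"}},
    "/auth": {"text": "oauth2 flows explained", "links": {}},
    "/jwt": {"text": "jwt signing", "links": {}},
    "/blog": {"text": "news", "links": {}},
}
print(toy_digest(site, "/", "oauth2 jwt", confidence_threshold=1.0))
```

In this run the `/blog` link is never followed because its anchor text shares no terms with the query, which is the behavior the "Adaptive Selection" step is meant to capture.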
## Examples

### Basic Usage

```python
async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler)

    state = await adaptive.digest(
        start_url="https://docs.python.org/3/",
        query="async await context managers"
    )

    print(f"Crawled {len(state.crawled_urls)} pages")
    print(f"Confidence: {adaptive.confidence:.0%}")
```

### With Configuration

```python
config = AdaptiveConfig(
    confidence_threshold=0.9,  # Require high confidence
    max_pages=30,              # Allow more pages
    top_k_links=3              # Follow top 3 links per page
)

adaptive = AdaptiveCrawler(crawler, config=config)

state = await adaptive.digest(
    start_url="https://api.example.com/docs",
    query="authentication endpoints rate limits"
)
```

### Resuming a Previous Crawl

```python
# First crawl - may be interrupted
state1 = await adaptive.digest(
    start_url="https://example.com",
    query="machine learning algorithms"
)

# Save state (if not auto-saved)
state1.save("ml_crawl_state.json")

# Later, resume from saved state
state2 = await adaptive.digest(
    start_url="https://example.com",
    query="machine learning algorithms",
    resume_from="ml_crawl_state.json"
)
```

### With Progress Monitoring

```python
state = await adaptive.digest(
    start_url="https://docs.example.com",
    query="api reference"
)

# Monitor progress
print(f"Pages crawled: {len(state.crawled_urls)}")
print(f"New terms discovered: {state.new_terms_history}")
print(f"Final confidence: {adaptive.confidence:.2%}")

# View detailed statistics
adaptive.print_stats(detailed=True)
```

## Query Best Practices

1. **Be Specific**: Use descriptive terms that appear in target content
   ```python
   # Good
   query = "python async context managers implementation"

   # Too broad
   query = "python programming"
   ```

2. **Include Key Terms**: Add technical terms you expect to find
   ```python
   query = "oauth2 jwt refresh tokens authorization"
   ```

3. **Multiple Concepts**: Combine related concepts for comprehensive coverage
   ```python
   query = "rest api pagination sorting filtering"
   ```

## Performance Considerations

- **Initial URL**: Choose a page with good navigation (e.g., documentation index)
- **Query Length**: 3-8 terms typically work best
- **Link Density**: Sites with clear navigation crawl more efficiently
- **Caching**: Enable caching for repeated crawls of the same domain

## Error Handling

```python
try:
    state = await adaptive.digest(
        start_url="https://example.com",
        query="search terms"
    )
except Exception as e:
    print(f"Crawl failed: {e}")
    # State is auto-saved if save_state=True in config
```

## Stopping Conditions

The crawl stops when any of these conditions is met:

1. **Confidence Threshold**: Reached the configured confidence level
2. **Page Limit**: Crawled the maximum number of pages
3. **Diminishing Returns**: Expected information gain below threshold
4. **No Relevant Links**: No promising links remain to follow

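The four stopping conditions can be expressed as a single check. The function below is an illustrative sketch, not the crawler's internal logic: `stop_reason` and its argument names are hypothetical, the order in which the conditions are tested is an assumption, and the default values simply mirror the `AdaptiveConfig` defaults documented for this feature (0.8, 50, 0.1).

```python
from typing import Optional


def stop_reason(
    confidence: float,
    pages_crawled: int,
    pending_links: int,
    best_expected_gain: float,
    confidence_threshold: float = 0.8,
    max_pages: int = 50,
    min_gain_threshold: float = 0.1,
) -> Optional[str]:
    """Return the name of the first stopping condition that holds, else None."""
    if confidence >= confidence_threshold:
        return "confidence_threshold"
    if pages_crawled >= max_pages:
        return "page_limit"
    if pending_links == 0:
        return "no_relevant_links"
    if best_expected_gain < min_gain_threshold:
        return "diminishing_returns"
    return None


print(stop_reason(confidence=0.55, pages_crawled=12, pending_links=4, best_expected_gain=0.03))
```

Returning the reason rather than a bare boolean makes it easier to log why a crawl ended, which helps when tuning thresholds.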
## See Also

- [AdaptiveCrawler Class](adaptive-crawler.md)
- [Adaptive Crawling Guide](../core/adaptive-crawling.md)
- [Configuration Options](../core/adaptive-crawling.md#configuration-options)

369
docs/md_v2/blog/articles/adaptive-crawling-revolution.md
Normal file
@@ -0,0 +1,369 @@
# Adaptive Crawling: Building Dynamic Knowledge That Grows on Demand

*Published on January 29, 2025 • 8 min read*

*By [unclecode](https://x.com/unclecode) • Follow me on [X/Twitter](https://x.com/unclecode) for more web scraping insights*

---

## The Knowledge Capacitor

Imagine a capacitor that stores energy, releasing it precisely when needed. Now imagine that for information. That's Adaptive Crawling, a term I coined to describe a fundamentally different approach to web crawling. Instead of the brute force of traditional deep crawling, we build knowledge dynamically, growing it based on queries and circumstances, like a living organism responding to its environment.

This isn't just another crawling optimization. It's a paradigm shift from "crawl everything, hope for the best" to "crawl intelligently, know when to stop."

## Why I Built This

I've watched too many startups burn through resources with a dangerous misconception: that LLMs make everything efficient. They don't. They make things *possible*, not necessarily *smart*. When you combine brute-force crawling with LLM processing, you're not just wasting time; you're hemorrhaging money on tokens, compute, and opportunity cost.

Consider this reality:
- **Traditional deep crawling**: 500 pages → 50 useful → $15 in LLM tokens → 2 hours wasted
- **Adaptive crawling**: 15 pages → 14 useful → $2 in tokens → 10 minutes → **7.5x cost reduction**

But it's not about crawling less. It's about crawling *right*.

## The Information Theory Foundation

<div style="background-color: #1a1a1c; border: 1px solid #3f3f44; padding: 20px; margin: 20px 0;">

### 🧮 **Pure Statistics, No Magic**

My first principle was crucial: start with classic statistical approaches. No embeddings. No LLMs. Just pure information theory:

```python
# Information gain calculation - the heart of adaptive crawling
def calculate_information_gain(new_page, knowledge_base):
    new_terms = extract_terms(new_page) - existing_terms(knowledge_base)
    overlap = calculate_overlap(new_page, knowledge_base)

    # High gain = many new terms + low overlap
    gain = len(new_terms) / (1 + overlap)
    return gain
```

This isn't regression to older methods; it's recognition that we've forgotten powerful, efficient solutions in our rush to apply LLMs everywhere.

</div>

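The snippet in this section leaves `extract_terms`, `existing_terms`, and `calculate_overlap` undefined. A runnable version under simple assumptions might look like the following, where terms are lowercase whitespace tokens and overlap is the number of page terms already present in the knowledge base. These helper definitions are illustrative guesses, not the actual crawl4ai internals.

```python
from typing import List, Set


def extract_terms(text: str) -> Set[str]:
    """Unique lowercase tokens in a page (assumed tokenization)."""
    return set(text.lower().split())


def calculate_information_gain(new_page: str, knowledge_base: List[str]) -> float:
    # All terms seen so far across the knowledge base
    existing = set().union(*(extract_terms(page) for page in knowledge_base))
    page_terms = extract_terms(new_page)

    new_terms = page_terms - existing
    overlap = len(page_terms & existing)  # assumed definition of overlap

    # High gain = many new terms + low overlap
    return len(new_terms) / (1 + overlap)


kb = ["oauth2 login flow"]
print(calculate_information_gain("oauth2 refresh tokens", kb))  # two new terms, one shared
print(calculate_information_gain("oauth2 login flow", kb))      # nothing new
```

A page that only repeats known terms scores 0, so a crawler ranking candidates by this value would naturally drift away from duplicate content.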
## The A* of Web Crawling

Adaptive crawling implements what I call "information scenting", like A* pathfinding but for knowledge acquisition. Each link is evaluated not randomly, but by its probability of contributing meaningful information toward answering current and future queries.

<div style="display: flex; align-items: center; background-color: #3f3f44; padding: 20px; margin: 20px 0; border-left: 4px solid #09b5a5;">
<div style="font-size: 48px; margin-right: 20px;">🎯</div>
<div>
<strong>The Scenting Algorithm:</strong><br>
From available links, we select those with highest information gain. It's not about following every path; it's about following the <em>right</em> paths. Like a bloodhound following the strongest scent to its target.
</div>
</div>

## The Three Pillars of Intelligence

### 1. Coverage: The Breadth Sensor
Measures how well your knowledge spans the query space. Not just "do we have pages?" but "do we have the RIGHT pages?"

### 2. Consistency: The Coherence Detector
Information from multiple sources should align. When pages agree, confidence rises. When they conflict, we need more data.

### 3. Saturation: The Efficiency Guardian
The most crucial metric. When new pages stop adding information, we stop crawling. Simple. Powerful. Ignored by everyone else.

## Real Impact: Time, Money, and Sanity
|
||||
|
||||
Let me show you what this means for your bottom line:
|
||||
|
||||
### Building a Customer Support Knowledge Base
|
||||
|
||||
**Traditional Approach:**
|
||||
```python
# Crawl entire documentation site
results = await crawler.crawl_bfs("https://docs.company.com", max_depth=5)
# Result: 1,200 pages, 18 hours, $150 in API costs
# Useful content: ~100 pages scattered throughout
```
|
||||
|
||||
**Adaptive Approach:**
|
||||
```python
# Grow knowledge based on actual support queries
knowledge = await adaptive.digest(
    start_url="https://docs.company.com",
    query="payment processing errors refund policies"
)
# Result: 45 pages, 12 minutes, $8 in API costs
# Useful content: 42 pages, all relevant
```
|
||||
|
||||
**Savings: 99% time reduction, 95% cost reduction, 100% more sanity**
|
||||
|
||||
## The Dynamic Growth Pattern
|
||||
|
||||
<div style="text-align: center; padding: 40px; background-color: #1a1a1c; border: 1px dashed #3f3f44; margin: 30px 0;">
|
||||
<div style="font-size: 24px; color: #09b5a5; margin-bottom: 10px;">
|
||||
Knowledge grows like crystals in a supersaturated solution
|
||||
</div>
|
||||
<div style="color: #a3abba;">
|
||||
Add a query (seed), and relevant information crystallizes around it.<br>
|
||||
Change the query, and the knowledge structure adapts.
|
||||
</div>
|
||||
</div>
|
||||
|
||||
This is the beauty of adaptive crawling: your knowledge base becomes a living entity that grows based on actual needs, not hypothetical completeness.
|
||||
|
||||
## Why "Adaptive"?
|
||||
|
||||
I specifically chose "Adaptive" because it captures the essence: the system adapts to what it finds. Dense technical documentation might need 20 pages for confidence. A simple FAQ might need just 5. The crawler doesn't follow a recipe—it reads the room and adjusts.
|
||||
|
||||
This is my term, my concept, and I have extensive plans for its evolution.
|
||||
|
||||
## The Progressive Roadmap
|
||||
|
||||
This is just the beginning. My roadmap for Adaptive Crawling:
|
||||
|
||||
### Phase 1 (Current): Statistical Foundation
|
||||
- Pure information theory approach
|
||||
- No dependencies on expensive models
|
||||
- Proven efficiency gains
|
||||
|
||||
### Phase 2 (Now Available): Embedding Enhancement
|
||||
- Semantic understanding layered onto statistical base
|
||||
- Still efficient, now even smarter
|
||||
- Optional, not required
|
||||
|
||||
### Phase 3 (Future): LLM Integration
|
||||
- LLMs for complex reasoning tasks only
|
||||
- Used surgically, not wastefully
|
||||
- Always with statistical foundation underneath
|
||||
|
||||
## The Efficiency Revolution
|
||||
|
||||
<div style="background-color: #1a1a1c; border: 1px solid #3f3f44; padding: 20px; margin: 20px 0;">
|
||||
|
||||
### 💰 **The Economics of Intelligence**
|
||||
|
||||
For a typical SaaS documentation crawl:
|
||||
|
||||
**Traditional Deep Crawling:**
|
||||
- Pages crawled: 1,000
|
||||
- Useful pages: 80
|
||||
- Time spent: 3 hours
|
||||
- LLM tokens used: 2.5M
|
||||
- Cost: $75
|
||||
- Efficiency: 8%
|
||||
|
||||
**Adaptive Crawling:**
|
||||
- Pages crawled: 95
|
||||
- Useful pages: 88
|
||||
- Time spent: 15 minutes
|
||||
- LLM tokens used: 200K
|
||||
- Cost: $6
|
||||
- Efficiency: 93%
|
||||
|
||||
**That's not optimization. That's transformation.**
|
||||
|
||||
</div>
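
The efficiency figures in the box above are simply useful pages divided by pages crawled. The short sketch below reproduces that arithmetic from the numbers quoted there and adds cost per useful page; the `CrawlRun` class is a hypothetical helper written for this example.

```python
from dataclasses import dataclass


@dataclass
class CrawlRun:
    pages_crawled: int
    useful_pages: int
    cost_usd: float

    @property
    def efficiency(self) -> float:
        """Useful pages as a fraction of all pages crawled."""
        return self.useful_pages / self.pages_crawled

    @property
    def cost_per_useful_page(self) -> float:
        """Total cost divided by the number of useful pages."""
        return self.cost_usd / self.useful_pages


# Figures taken from the comparison above
traditional = CrawlRun(pages_crawled=1000, useful_pages=80, cost_usd=75)
adaptive = CrawlRun(pages_crawled=95, useful_pages=88, cost_usd=6)

for name, run in [("traditional", traditional), ("adaptive", adaptive)]:
    print(f"{name}: {run.efficiency:.0%} efficient, "
          f"${run.cost_per_useful_page:.2f} per useful page")
```

Cost per useful page is often the more telling number, since it folds wasted crawling directly into the price of each page you actually keep.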
|
||||
|
||||
## Missing the Forest for the Trees
|
||||
|
||||
The startup world has a dangerous blind spot. We're so enamored with LLMs that we forget: just because you CAN process everything with an LLM doesn't mean you SHOULD.
|
||||
|
||||
Classic NLP and statistical methods can:
|
||||
- Filter irrelevant content before it reaches LLMs
|
||||
- Identify patterns without expensive inference
|
||||
- Make intelligent decisions in microseconds
|
||||
- Scale without breaking the bank
|
||||
|
||||
Adaptive crawling proves this. It uses battle-tested information theory to make smart decisions BEFORE expensive processing.
|
||||
|
||||
## Your Knowledge, On Demand
|
||||
|
||||
```python
# Monday: Customer asks about authentication
auth_knowledge = await adaptive.digest(
    "https://docs.api.com",
    "oauth jwt authentication"
)

# Tuesday: They ask about rate limiting
# The crawler adapts, builds on existing knowledge
rate_limit_knowledge = await adaptive.digest(
    "https://docs.api.com",
    "rate limiting throttling quotas"
)

# Your knowledge base grows intelligently, not indiscriminately
```
|
||||
|
||||
## The Competitive Edge
|
||||
|
||||
Companies using adaptive crawling will have:
|
||||
- **90% lower crawling costs**
|
||||
- **Knowledge bases that actually answer questions**
|
||||
- **Update cycles in minutes, not days**
|
||||
- **Happy customers who find answers fast**
|
||||
- **Engineers who sleep at night**
|
||||
|
||||
Those still using brute force? They'll wonder why their infrastructure costs keep rising while their customers keep complaining.
|
||||
|
||||
## The Embedding Evolution (Now Available!)
|
||||
|
||||
<div style="background-color: #1a1a1c; border: 1px solid #3f3f44; padding: 20px; margin: 20px 0;">
|
||||
|
||||
### 🧠 **Semantic Understanding Without the Cost**
|
||||
|
||||
The embedding strategy brings semantic intelligence while maintaining efficiency:
|
||||
|
||||
```python
# Statistical strategy - great for exact terms
config_statistical = AdaptiveConfig(
    strategy="statistical"  # Default
)

# Embedding strategy - understands concepts
config_embedding = AdaptiveConfig(
    strategy="embedding",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    n_query_variations=10
)
```
|
||||
|
||||
**The magic**: It automatically expands your query into semantic variations, maps the coverage space, and identifies gaps to fill intelligently.
|
||||
|
||||
</div>
|
||||
|
||||
### Real-World Comparison
|
||||
|
||||
<div style="display: flex; gap: 20px; margin: 20px 0;">
|
||||
<div style="flex: 1; background-color: #1a1a1c; border: 1px solid #3f3f44; padding: 20px;">
|
||||
|
||||
**Query**: "authentication oauth"
|
||||
|
||||
**Statistical Strategy**:
|
||||
- Searches for exact terms
|
||||
- 12 pages crawled
|
||||
- 78% confidence
|
||||
- Fast but literal
|
||||
|
||||
</div>
|
||||
<div style="flex: 1; background-color: #1a1a1c; border: 1px solid #09b5a5; padding: 20px;">
|
||||
|
||||
**Embedding Strategy**:
|
||||
- Understands "auth", "login", "SSO"
|
||||
- 8 pages crawled
|
||||
- 92% confidence
|
||||
- Semantic comprehension
|
||||
|
||||
</div>
|
||||
</div>
|
||||
|
||||
### Detecting Irrelevance
|
||||
|
||||
One killer feature: the embedding strategy knows when to give up:
|
||||
|
||||
```python
# Crawling Python docs with a cooking query
result = await adaptive.digest(
    start_url="https://docs.python.org/3/",
    query="how to make spaghetti carbonara"
)

# System detects irrelevance and stops
# Confidence: 5% (below threshold)
# Pages crawled: 2
# Stopped reason: "below_minimum_relevance_threshold"
```
|
||||
|
||||
No more crawling hundreds of pages hoping to find something that doesn't exist!
|
||||
|
||||
## Try It Yourself
|
||||
|
||||
```python
import asyncio

from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig


async def main():
    async with AsyncWebCrawler() as crawler:
        # Choose your strategy
        config = AdaptiveConfig(
            strategy="embedding",  # or "statistical"
            embedding_min_confidence_threshold=0.1  # Stop if irrelevant
        )

        adaptive = AdaptiveCrawler(crawler, config)

        # Watch intelligence at work
        result = await adaptive.digest(
            start_url="https://your-docs.com",
            query="your users' actual questions"
        )

        # See the efficiency
        adaptive.print_stats()
        print(f"Found {adaptive.confidence:.0%} of needed information")
        print(f"In just {len(result.crawled_urls)} pages")
        # 1000 is the brute-force page count assumed in this article
        print(f"Saving you {1000 - len(result.crawled_urls)} unnecessary crawls")


if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
## A Personal Note
|
||||
|
||||
I created Adaptive Crawling because I was tired of watching smart people make inefficient choices. We have incredibly powerful statistical tools that we've forgotten in our rush toward LLMs. This is my attempt to bring balance back to the Force.
|
||||
|
||||
This is not just a feature. It's a philosophy: **Grow knowledge on demand. Stop when you have enough. Save time, money, and computational resources for what really matters.**
|
||||
|
||||
## The Future is Adaptive
|
||||
|
||||
<div style="text-align: center; padding: 40px; background-color: #1a1a1c; border: 1px dashed #3f3f44; margin: 30px 0;">
|
||||
<div style="font-size: 24px; color: #09b5a5; margin-bottom: 10px;">
|
||||
Traditional Crawling: Drinking from a firehose<br>
|
||||
Adaptive Crawling: Sipping exactly what you need
|
||||
</div>
|
||||
<div style="color: #a3abba;">
|
||||
The future of web crawling isn't about processing more data.<br>
|
||||
It's about processing the <em>right</em> data.
|
||||
</div>
|
||||
</div>
|
||||
|
||||
Join me in making web crawling intelligent, efficient, and actually useful. Because in the age of information overload, the winners won't be those who collect the most data—they'll be those who collect the *right* data.
|
||||
|
||||
---
|
||||
|
||||
*Adaptive Crawling is now part of Crawl4AI. [Get started with the documentation](/core/adaptive-crawling/) or [dive into the mathematical framework](https://github.com/unclecode/crawl4ai/blob/main/PROGRESSIVE_CRAWLING.md). For updates on my work in information theory and efficient AI, follow me on [X/Twitter](https://x.com/unclecode).*
|
||||
|
||||
<style>
/* Custom styles for this article */
.markdown-body pre {
    background-color: #1e1e1e !important;
    border: 1px solid #3f3f44;
}

.markdown-body code {
    background-color: #3f3f44;
    color: #50ffff;
    padding: 2px 6px;
    border-radius: 3px;
}

.markdown-body pre code {
    background-color: transparent;
    color: #e8e9ed;
    padding: 0;
}

.markdown-body blockquote {
    border-left: 4px solid #09b5a5;
    background-color: #1a1a1c;
    padding: 15px 20px;
    margin: 20px 0;
}

.markdown-body h2 {
    color: #50ffff;
    border-bottom: 1px dashed #3f3f44;
    padding-bottom: 10px;
}

.markdown-body h3 {
    color: #09b5a5;
}

.markdown-body strong {
    color: #50ffff;
}
</style>
|
||||
@@ -2,6 +2,22 @@
|
||||
|
||||
Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical insights, and updates about the project. Whether you're looking for the latest improvements or want to dive deep into web crawling techniques, this is the place.
|
||||
|
||||
## Featured Articles
|
||||
|
||||
### [When to Stop Crawling: The Art of Knowing "Enough"](articles/adaptive-crawling-revolution.md)
|
||||
*January 29, 2025*
|
||||
|
||||
Traditional crawlers are like tourists with unlimited time—they'll visit every street, every alley, every dead end. But what if your crawler could think like a researcher with a deadline? Discover how Adaptive Crawling revolutionizes web scraping by knowing when to stop. Learn about the three-layer intelligence system that evaluates coverage, consistency, and saturation to build focused knowledge bases instead of endless page collections.
|
||||
|
||||
[Read the full article →](articles/adaptive-crawling-revolution.md)
|
||||
|
||||
### [The LLM Context Protocol: Why Your AI Assistant Needs Memory, Reasoning, and Examples](articles/llm-context-revolution.md)
|
||||
*January 24, 2025*
|
||||
|
||||
Ever wondered why your AI coding assistant struggles with your library despite comprehensive documentation? This article introduces the three-dimensional context protocol that transforms how AI understands code. Learn why memory, reasoning, and examples together create wisdom—not just information.
|
||||
|
||||
[Read the full article →](articles/llm-context-revolution.md)
|
||||
|
||||
## Latest Release
|
||||
|
||||
|
||||
|
||||
347
docs/md_v2/core/adaptive-crawling.md
Normal file
@@ -0,0 +1,347 @@
|
||||
# Adaptive Web Crawling
|
||||
|
||||
## Introduction
|
||||
|
||||
Traditional web crawlers follow predetermined patterns, crawling pages blindly without knowing when they've gathered enough information. **Adaptive Crawling** changes this paradigm by introducing intelligence into the crawling process.
|
||||
|
||||
Think of it like research: when you're looking for information, you don't read every book in the library. You stop when you've found sufficient information to answer your question. That's exactly what Adaptive Crawling does for web scraping.
|
||||
|
||||
## Key Concepts
|
||||
|
||||
### The Problem It Solves
|
||||
|
||||
When crawling websites for specific information, you face two challenges:
|
||||
1. **Under-crawling**: Stopping too early and missing crucial information
|
||||
2. **Over-crawling**: Wasting resources by crawling irrelevant pages
|
||||
|
||||
Adaptive Crawling solves both by using a three-layer scoring system that determines when you have "enough" information.
|
||||
|
||||
### How It Works
|
||||
|
||||
The AdaptiveCrawler uses three metrics to measure information sufficiency:
|
||||
|
||||
- **Coverage**: How well your collected pages cover the query terms
|
||||
- **Consistency**: Whether the information is coherent across pages
|
||||
- **Saturation**: Detecting when new pages aren't adding new information
|
||||
|
||||
When these metrics indicate sufficient information has been gathered, crawling stops automatically.
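
As a rough illustration of that stopping rule, the sketch below combines the three scores and checks them against a threshold and a page budget. The geometric mean is an assumed combination chosen for this example (one weak score drags the result down); the source does not state how Crawl4AI weighs the metrics. The default values mirror the defaults quoted in the Configuration Options section.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    coverage: float
    consistency: float
    saturation: float


def confidence(scores: Scores) -> float:
    """Combine the three scores; the geometric mean is an assumed choice."""
    product = scores.coverage * scores.consistency * scores.saturation
    return product ** (1 / 3)


def should_stop(scores: Scores, pages_crawled: int,
                confidence_threshold: float = 0.8, max_pages: int = 50) -> bool:
    """Stop once confident enough or once the page budget is spent."""
    return confidence(scores) >= confidence_threshold or pages_crawled >= max_pages


print(should_stop(Scores(0.9, 0.9, 0.9), pages_crawled=5))   # confident enough
print(should_stop(Scores(0.9, 0.9, 0.1), pages_crawled=5))   # saturation too low
print(should_stop(Scores(0.1, 0.1, 0.1), pages_crawled=50))  # budget exhausted
```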
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```python
import asyncio

from crawl4ai import AsyncWebCrawler, AdaptiveCrawler


async def main():
    async with AsyncWebCrawler() as crawler:
        # Create an adaptive crawler
        adaptive = AdaptiveCrawler(crawler)

        # Start crawling with a query
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View statistics
        adaptive.print_stats()

        # Get the most relevant content
        relevant_pages = adaptive.get_relevant_content(top_k=5)
        for page in relevant_pages:
            print(f"- {page['url']} (score: {page['score']:.2f})")


if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
### Configuration Options
|
||||
|
||||
```python
from crawl4ai import AdaptiveConfig

config = AdaptiveConfig(
    confidence_threshold=0.7,  # Stop when 70% confident (default: 0.8)
    max_pages=20,              # Maximum pages to crawl (default: 50)
    top_k_links=3,             # Links to follow per page (default: 5)
    min_gain_threshold=0.05    # Minimum expected gain to continue (default: 0.1)
)

adaptive = AdaptiveCrawler(crawler, config=config)
```
|
||||
|
||||
## Crawling Strategies
|
||||
|
||||
Adaptive Crawling supports two distinct strategies for determining information sufficiency:
|
||||
|
||||
### Statistical Strategy (Default)
|
||||
|
||||
The statistical strategy uses pure information theory and term-based analysis:
|
||||
|
||||
- **Fast and efficient** - No API calls or model loading
|
||||
- **Term-based coverage** - Analyzes query term presence and distribution
|
||||
- **No external dependencies** - Works offline
|
||||
- **Best for**: Well-defined queries with specific terminology
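
To give a feel for term-based coverage, here is a deliberately simple sketch: the fraction of query terms that appear in at least one collected document. The tokenizer and the `term_coverage` function are assumptions for illustration; the real strategy is likely more elaborate (for example, weighting terms by how rare they are).

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def term_coverage(query: str, documents: list[str]) -> float:
    """Fraction of query terms that appear in at least one document."""
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    seen = set().union(*(tokenize(doc) for doc in documents))
    return len(query_terms & seen) / len(query_terms)


docs = [
    "An async function can use async with to enter a context.",
    "Context managers define __aenter__ and __aexit__.",
]
print(term_coverage("async context managers", docs))
print(term_coverage("async generators", docs))
```

The second query is only half covered because no document mentions generators, which is the kind of gap that would keep the crawl going.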
|
||||
|
||||
```python
# Default configuration uses statistical strategy
config = AdaptiveConfig(
    strategy="statistical",  # This is the default
    confidence_threshold=0.8
)
```
|
||||
|
||||
### Embedding Strategy
|
||||
|
||||
The embedding strategy uses semantic embeddings for deeper understanding:
|
||||
|
||||
- **Semantic understanding** - Captures meaning beyond exact term matches
|
||||
- **Query expansion** - Automatically generates query variations
|
||||
- **Gap-driven selection** - Identifies semantic gaps in knowledge
|
||||
- **Validation-based stopping** - Uses held-out queries to validate coverage
|
||||
- **Best for**: Complex queries, ambiguous topics, conceptual understanding
|
||||
|
||||
```python
# Configure embedding strategy
config = AdaptiveConfig(
    strategy="embedding",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # Default
    n_query_variations=10,  # Generate 10 query variations
    embedding_min_confidence_threshold=0.1  # Stop if completely irrelevant
)

# With custom embedding provider (e.g., OpenAI)
config = AdaptiveConfig(
    strategy="embedding",
    embedding_llm_config={
        'provider': 'openai/text-embedding-3-small',
        'api_token': 'your-api-key'
    }
)
```
|
||||
|
||||
### Strategy Comparison
|
||||
|
||||
| Feature | Statistical | Embedding |
|---------|------------|-----------|
| **Speed** | Very fast | Moderate (API calls) |
| **Cost** | Free | Depends on provider |
| **Accuracy** | Good for exact terms | Excellent for concepts |
| **Dependencies** | None | Embedding model/API |
| **Query Understanding** | Literal | Semantic |
| **Best Use Case** | Technical docs, specific terms | Research, broad topics |
|
||||
|
||||
### Embedding Strategy Configuration
|
||||
|
||||
The embedding strategy offers fine-tuned control through several parameters:
|
||||
|
||||
```python
config = AdaptiveConfig(
    strategy="embedding",

    # Model configuration
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_llm_config=None,  # Use for API-based embeddings

    # Query expansion
    n_query_variations=10,  # Number of query variations to generate

    # Coverage parameters
    embedding_coverage_radius=0.2,  # Distance threshold for coverage
    embedding_k_exp=3.0,            # Exponential decay factor (higher = stricter)

    # Stopping criteria
    embedding_min_relative_improvement=0.1,  # Min improvement to continue
    embedding_validation_min_score=0.3,      # Min validation score
    embedding_min_confidence_threshold=0.1,  # Below this = irrelevant

    # Link selection
    embedding_overlap_threshold=0.85,  # Similarity for deduplication

    # Display confidence mapping
    embedding_quality_min_confidence=0.7,  # Min displayed confidence
    embedding_quality_max_confidence=0.95  # Max displayed confidence
)
```
|
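
To build intuition for the coverage parameters, the sketch below shows one plausible reading of a coverage radius and an exponential decay factor, using toy 2-D vectors in place of real embeddings. The formulas and function names are assumptions for illustration only; they are not taken from Crawl4AI's source.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def coverage_score(query_vecs, doc_vecs, k_exp: float = 3.0) -> float:
    """Mean of exp(-k_exp * distance to the nearest document) over query points."""
    if not query_vecs or not doc_vecs:
        return 0.0
    total = 0.0
    for q in query_vecs:
        nearest = min(cosine_distance(q, d) for d in doc_vecs)
        total += math.exp(-k_exp * nearest)
    return total / len(query_vecs)


def uncovered(query_vecs, doc_vecs, radius: float = 0.2) -> list[int]:
    """Indices of query points with no document within the coverage radius."""
    return [i for i, q in enumerate(query_vecs)
            if all(cosine_distance(q, d) > radius for d in doc_vecs)]


queries = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-D stand-ins for query variations
documents = [[1.0, 0.0]]            # only the first query direction is covered
print(coverage_score(queries, documents))
print(uncovered(queries, documents))
```

Under this reading, a higher `k_exp` makes distant documents count for less, and the uncovered indices are the semantic gaps a gap-driven link selector would try to fill next.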
||||
|
||||
### Handling Irrelevant Queries
|
||||
|
||||
The embedding strategy can detect when a query is completely unrelated to the content:
|
||||
|
||||
```python
# This will stop quickly with low confidence
result = await adaptive.digest(
    start_url="https://docs.python.org/3/",
    query="how to cook pasta"  # Irrelevant to Python docs
)

# Check if query was irrelevant
if result.metrics.get('is_irrelevant', False):
    print("Query is unrelated to the content!")
```
|
||||
|
||||
## When to Use Adaptive Crawling
|
||||
|
||||
### Perfect For:
|
||||
- **Research Tasks**: Finding comprehensive information about a topic
|
||||
- **Question Answering**: Gathering sufficient context to answer specific queries
|
||||
- **Knowledge Base Building**: Creating focused datasets for AI/ML applications
|
||||
- **Competitive Intelligence**: Collecting complete information about specific products/features
|
||||
|
||||
### Not Recommended For:
|
||||
- **Full Site Archiving**: When you need every page regardless of content
|
||||
- **Structured Data Extraction**: When targeting specific, known page patterns
|
||||
- **Real-time Monitoring**: When you need continuous updates
|
||||
|
||||
## Understanding the Output
|
||||
|
||||
### Confidence Score
|
||||
|
||||
The confidence score (0-1) indicates how sufficient the gathered information is:
|
||||
- **0.0-0.3**: Insufficient information, needs more crawling
|
||||
- **0.3-0.6**: Partial information, may answer basic queries
|
||||
- **0.6-0.8**: Good coverage, can answer most queries
|
||||
- **0.8-1.0**: Excellent coverage, comprehensive information
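
If you want to act on these bands in code, a small helper like the following can translate a score into a label. This is a hypothetical convenience function, not part of the library; since the listed ranges share their endpoints, it assumes each boundary value belongs to the higher band.

```python
def describe_confidence(score: float) -> str:
    """Map a confidence score in [0, 1] to the bands listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if score < 0.3:
        return "insufficient"
    if score < 0.6:
        return "partial"
    if score < 0.8:
        return "good"
    return "excellent"


for value in (0.1, 0.45, 0.7, 0.92):
    print(f"{value:.2f} -> {describe_confidence(value)}")
```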
|
||||
|
||||
### Statistics Display
|
||||
|
||||
```python
adaptive.print_stats(detailed=False)  # Summary table
adaptive.print_stats(detailed=True)   # Detailed metrics
```
|
||||
|
||||
The summary shows:
|
||||
- Pages crawled vs. confidence achieved
|
||||
- Coverage, consistency, and saturation scores
|
||||
- Crawling efficiency metrics
|
||||
|
||||
## Persistence and Resumption
|
||||
|
||||
### Saving Progress
|
||||
|
||||
```python
config = AdaptiveConfig(
    save_state=True,
    state_path="my_crawl_state.json"
)
adaptive = AdaptiveCrawler(crawler, config)

# Crawl will auto-save progress
result = await adaptive.digest(start_url, query)
```
|
||||
|
||||
### Resuming a Crawl
|
||||
|
||||
```python
# Resume from saved state
result = await adaptive.digest(
    start_url,
    query,
    resume_from="my_crawl_state.json"
)
```
|
||||
|
||||
### Exporting Knowledge Base
|
||||
|
||||
```python
# Export collected pages to JSONL
adaptive.export_knowledge_base("knowledge_base.jsonl")

# Import into another session
new_adaptive = AdaptiveCrawler(crawler)
new_adaptive.import_knowledge_base("knowledge_base.jsonl")
```
|
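
JSONL means one JSON object per line, which makes the exported file easy to inspect or post-process with standard tools. The sketch below shows a generic JSONL round trip using only the standard library; the `url` and `content` fields are assumed for illustration and may not match the exact schema the exporter writes.

```python
import json
import os
import tempfile


def write_jsonl(path: str, records: list[dict]) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: str) -> list[dict]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical records standing in for crawled pages
pages = [
    {"url": "https://example.com/a", "content": "first page"},
    {"url": "https://example.com/b", "content": "second page"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "knowledge_base.jsonl")
    write_jsonl(path, pages)
    restored = read_jsonl(path)

print(restored == pages)
```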
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Query Formulation
|
||||
- Use specific, descriptive queries
|
||||
- Include key terms you expect to find
|
||||
- Avoid overly broad queries
|
||||
|
||||
### 2. Threshold Tuning
|
||||
- Start with default (0.8) for general use
|
||||
- Lower to 0.6-0.7 for exploratory crawling
|
||||
- Raise to 0.9+ for exhaustive coverage
|
||||
|
||||
### 3. Performance Optimization
|
||||
- Use appropriate `max_pages` limits
|
||||
- Adjust `top_k_links` based on site structure
|
||||
- Enable caching for repeat crawls
|
||||
|
||||
### 4. Link Selection
|
||||
- The crawler prioritizes links based on:
    - Relevance to query
    - Expected information gain
    - URL structure and depth
|
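
As a sketch of how those three signals might be blended into a single priority, the example below averages relevance and expected gain, then subtracts a small penalty per URL path segment. The equal weights, the `depth_penalty` value, and the function names are all assumptions made for illustration; the source does not specify the actual formula.

```python
from urllib.parse import urlparse


def url_depth(url: str) -> int:
    """Number of non-empty path segments in a URL."""
    return len([part for part in urlparse(url).path.split("/") if part])


def link_score(relevance: float, expected_gain: float, url: str,
               depth_penalty: float = 0.05) -> float:
    """Blend relevance and expected gain, then discount deep URLs."""
    base = 0.5 * relevance + 0.5 * expected_gain
    return max(0.0, base - depth_penalty * url_depth(url))


print(url_depth("https://docs.example.com/api/auth/oauth"))
print(link_score(0.8, 0.6, "https://docs.example.com/api"))
```

A penalty of this kind favors pages near the top of a documentation tree, which tend to be overviews; whether that is desirable depends on the site structure.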
||||
|
||||
## Examples
|
||||
|
||||
### Research Assistant
|
||||
|
||||
```python
# Gather information about a programming concept
result = await adaptive.digest(
    start_url="https://realpython.com",
    query="python decorators implementation patterns"
)

# Get the most relevant excerpts
for doc in adaptive.get_relevant_content(top_k=3):
    print(f"\nFrom: {doc['url']}")
    print(f"Relevance: {doc['score']:.2%}")
    print(doc['content'][:500] + "...")
```
|
||||
|
||||
### Knowledge Base Builder
|
||||
|
||||
```python
# Build a focused knowledge base about machine learning
queries = [
    "supervised learning algorithms",
    "neural network architectures",
    "model evaluation metrics"
]

for query in queries:
    await adaptive.digest(
        start_url="https://scikit-learn.org/stable/",
        query=query
    )

# Export combined knowledge base
adaptive.export_knowledge_base("ml_knowledge.jsonl")
```
|
||||
|
||||
### API Documentation Crawler
|
||||
|
||||
```python
# Intelligently crawl API documentation
config = AdaptiveConfig(
    confidence_threshold=0.85,  # Higher threshold for completeness
    max_pages=30
)

adaptive = AdaptiveCrawler(crawler, config)
result = await adaptive.digest(
    start_url="https://api.example.com/docs",
    query="authentication endpoints rate limits"
)
```
|
||||
|
||||
## Next Steps
|
||||
|
||||
- Learn about [Advanced Adaptive Strategies](../advanced/adaptive-strategies.md)
|
||||
- Explore the [AdaptiveCrawler API Reference](../api/adaptive-crawler.md)
|
||||
- See more [Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples/adaptive_crawling)
|
||||
|
||||
## FAQ
|
||||
|
||||
**Q: How is this different from traditional crawling?**
|
||||
A: Traditional crawling follows fixed patterns (BFS/DFS). Adaptive crawling makes intelligent decisions about which links to follow and when to stop based on information gain.
|
||||
|
||||
**Q: Can I use this with JavaScript-heavy sites?**
|
||||
A: Yes! AdaptiveCrawler wraps an AsyncWebCrawler instance, so every capability of the underlying crawler, including JavaScript execution, remains available.
|
||||
|
||||
**Q: How does it handle large websites?**
|
||||
A: The algorithm naturally limits crawling to relevant sections. Use `max_pages` as a safety limit.
|
||||
|
||||
**Q: Can I customize the scoring algorithms?**
|
||||
A: Advanced users can implement custom strategies. See [Adaptive Strategies](../advanced/adaptive-strategies.md).
|
||||
@@ -28,7 +28,11 @@ This page provides a comprehensive list of example scripts that demonstrate vari
|
||||
| Example | Description | Link |
|---------|-------------|------|
| Deep Crawling | An extensive tutorial on deep crawling capabilities, demonstrating BFS and BestFirst strategies, stream vs. non-stream execution, filters, scorers, and advanced configurations. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/deepcrawl_example.py) |
| Virtual Scroll | Comprehensive examples for handling virtualized scrolling on sites like Twitter, Instagram. Demonstrates different scrolling scenarios with local test server. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/virtual_scroll_example.py) |
| Adaptive Crawling | Demonstrates intelligent crawling that automatically determines when sufficient information has been gathered. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/adaptive_crawling/) |
| Dispatcher | Shows how to use the crawl dispatcher for advanced workload management. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/dispatcher_example.py) |
| Storage State | Tutorial on managing browser storage state for persistence. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/storage_state_tutorial.md) |
| Network Console Capture | Demonstrates how to capture and analyze network requests and console logs. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/network_console_capture_example.py) |
|
||||
|
||||
@@ -272,7 +272,43 @@ if __name__ == "__main__":
|
||||
|
||||
---
|
||||
|
||||
## 7. Adaptive Crawling (New!)
|
||||
|
||||
Crawl4AI now includes intelligent adaptive crawling that automatically determines when sufficient information has been gathered. Here's a quick example:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler


async def adaptive_example():
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler)

        # Start adaptive crawling
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View results
        adaptive.print_stats()
        print(f"Crawled {len(result.crawled_urls)} pages")
        print(f"Achieved {adaptive.confidence:.0%} confidence")


if __name__ == "__main__":
    asyncio.run(adaptive_example())
```
|
||||
|
||||
**What's special about adaptive crawling?**
|
||||
- **Automatic stopping**: Stops when sufficient information is gathered
|
||||
- **Intelligent link selection**: Follows only relevant links
|
||||
- **Confidence scoring**: Know how complete your information is
|
||||
|
||||
[Learn more about Adaptive Crawling →](adaptive-crawling.md)
|
||||
|
||||
---
|
||||
|
||||
## 8. Multi-URL Concurrency (Preview)
|
||||
|
||||
If you need to crawl multiple URLs in **parallel**, you can use `arun_many()`. By default, Crawl4AI employs a **MemoryAdaptiveDispatcher**, automatically adjusting concurrency based on system resources. Here’s a quick glimpse:
|
||||
|
||||
|
||||
@@ -48,6 +48,12 @@ Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant
|
||||
|
||||
> **Note**: If you're looking for the old documentation, you can access it [here](https://old.docs.crawl4ai.com).
|
||||
|
||||
## 🎯 New: Adaptive Web Crawling
|
||||
|
||||
Crawl4AI now features intelligent adaptive crawling that knows when to stop! Using advanced information foraging algorithms, it determines when sufficient information has been gathered to answer your query.
|
||||
|
||||
[Learn more about Adaptive Crawling →](core/adaptive-crawling.md)
|
||||
|
||||
|
||||
## Quick Start
|
||||
|
||||
|
||||