Files
crawl4ai/PROGRESSIVE_CRAWLING.md
UncleCode 1a73fb60db feat(crawl4ai): Implement adaptive crawling feature
This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage.

The changes include the addition of new files related to the adaptive crawler, modifications to the existing files, and updates to the documentation. The new files include the main adaptive crawler script, utility functions, and various configuration and strategy scripts. The existing files that were modified include the project's initialization file and utility functions. The documentation has been updated to include detailed explanations and examples of the adaptive crawling feature.

The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool.

Significant modifications:
- Added adaptive_crawler.py and related scripts
- Modified __init__.py and utils.py
- Updated documentation with details about the adaptive crawling feature
- Added tests for the new feature

BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature.

Refs: #123, #456
2025-07-04 15:16:53 +08:00

320 lines
10 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Progressive Web Crawling with Adaptive Information Foraging
## Abstract
This paper presents a novel approach to web crawling that adaptively determines when sufficient information has been gathered to answer a given query. Unlike traditional exhaustive crawling methods, our Progressive Information Sufficiency (PIS) framework uses statistical measures to balance information completeness against crawling efficiency. We introduce a multi-strategy architecture supporting pure statistical, embedding-enhanced, and LLM-assisted approaches, with theoretical guarantees on convergence and practical evaluation methods using synthetic datasets.
## 1. Introduction
Traditional web crawling approaches follow predetermined patterns (breadth-first, depth-first) without consideration for information sufficiency. This work addresses the fundamental question: *"When do we have enough information to answer a query and similar queries in its domain?"*
We formalize this as an optimal stopping problem in information foraging, introducing metrics for coverage, consistency, and saturation that enable crawlers to make intelligent decisions about when to stop crawling and which links to follow.
## 2. Problem Formulation
### 2.1 Definitions
Let:
- **K** = {d₁, d₂, ..., dₙ} be the current knowledge base (crawled documents)
- **Q** be the user query
- **L** = {l₁, l₂, ..., lₘ} be available links with preview metadata
- **θ** be the confidence threshold for information sufficiency
### 2.2 Objectives
1. **Minimize** |K| (number of crawled pages)
2. **Maximize** P(answers(Q) | K) (probability of answering Q given K)
3. **Ensure** coverage of Q's domain (similar queries)
## 3. Mathematical Framework
### 3.1 Information Sufficiency Metric
We define Information Sufficiency as:
```
IS(K, Q) = min(Coverage(K, Q), Consistency(K, Q), 1 - Redundancy(K)) × DomainCoverage(K, Q)
```
### 3.2 Coverage Score
Coverage measures how well current knowledge covers query terms and related concepts:
```
Coverage(K, Q) = Σ(t ∈ Q) log(df(t, K) + 1) × idf(t) / |Q|
```
Where:
- df(t, K) = document frequency of term t in knowledge base K
- idf(t) = inverse document frequency weight
### 3.3 Consistency Score
Consistency measures information coherence across documents:
```
Consistency(K, Q) = 1 - Var(answers from random subsets of K)
```
This captures the principle that sufficient knowledge should provide stable answers regardless of document subset.
### 3.4 Saturation Score
Saturation detects diminishing returns:
```
Saturation(K) = 1 - (ΔInfo(Kₙ) / ΔInfo(K₁))
```
Where ΔInfo represents marginal information gain from the nth crawl.
### 3.5 Link Value Prediction
Expected information gain from uncrawled links:
```
ExpectedGain(l) = Relevance(l, Q) × Novelty(l, K) × Authority(l)
```
Components:
- **Relevance**: BM25(preview_text, Q)
- **Novelty**: 1 - max_similarity(preview, K)
- **Authority**: f(url_structure, domain_metrics)
## 4. Algorithmic Approach
### 4.1 Progressive Crawling Algorithm
```
Algorithm: ProgressiveCrawl(start_url, query, θ)
K ← ∅
crawled ← {start_url}
pending ← extract_links(crawl(start_url))
while IS(K, Q) < θ and |crawled| < max_pages:
candidates ← rank_by_expected_gain(pending, Q, K)
if max(ExpectedGain(candidates)) < min_gain:
break // Diminishing returns
to_crawl ← top_k(candidates)
new_docs ← parallel_crawl(to_crawl)
K ← K new_docs
crawled ← crawled to_crawl
pending ← extract_new_links(new_docs) - crawled
return K
```
### 4.2 Stopping Criteria
Crawling terminates when:
1. IS(K, Q) ≥ θ (sufficient information)
2. d(IS)/d(crawls) < ε (plateau reached)
3. |crawled| ≥ max_pages (resource limit)
4. max(ExpectedGain) < min_gain (no promising links)
## 5. Multi-Strategy Architecture
### 5.1 Strategy Pattern Design
```
AbstractStrategy
├── StatisticalStrategy (no LLM, no embeddings)
├── EmbeddingStrategy (with semantic similarity)
└── LLMStrategy (with language model assistance)
```
### 5.2 Statistical Strategy
Pure statistical approach using:
- BM25 for relevance scoring
- Term frequency analysis for coverage
- Graph structure for authority
- No external models required
**Advantages**: Fast, no API costs, works offline
**Best for**: Technical documentation, specific terminology
### 5.3 Embedding Strategy (Implemented)
Semantic understanding through embeddings:
- Query expansion into semantic variations
- Coverage mapping in embedding space
- Gap-driven link selection
- Validation-based stopping criteria
**Mathematical Framework**:
```
Coverage(K, Q) = mean(max_similarity(q, K) for q in Q_expanded)
Gap(q) = 1 - max_similarity(q, K)
LinkScore(l) = Σ(Gap(q) × relevance(l, q)) × (1 - redundancy(l, K))
```
**Key Parameters**:
- `embedding_k_exp`: Exponential decay factor for distance-to-score mapping
- `embedding_coverage_radius`: Distance threshold for query coverage
- `embedding_min_confidence_threshold`: Minimum relevance threshold
**Advantages**: Semantic understanding, handles ambiguity, detects irrelevance
**Best for**: Research queries, conceptual topics, diverse content
### 5.4 Progressive Enhancement Path
1. **Level 0**: Statistical only (implemented)
2. **Level 1**: + Embeddings for semantic similarity (implemented)
3. **Level 2**: + LLM for query understanding (future)
## 6. Evaluation Methodology
### 6.1 Synthetic Dataset Generation
Using LLM to create evaluation data:
```python
def generate_synthetic_dataset(domain_url):
# 1. Fully crawl domain
full_knowledge = exhaustive_crawl(domain_url)
# 2. Generate answerable queries
queries = llm_generate_queries(full_knowledge)
# 3. Create query variations
for q in queries:
variations = generate_variations(q) # synonyms, sub/super queries
return queries, variations, full_knowledge
```
### 6.2 Evaluation Metrics
1. **Efficiency**: Information gained / Pages crawled
2. **Completeness**: Answerable queries / Total queries
3. **Redundancy**: 1 - (Unique information / Total information)
4. **Convergence Rate**: Pages to 95% completeness
### 6.3 Ablation Studies
- Impact of each score component (coverage, consistency, saturation)
- Sensitivity to threshold parameters
- Performance across different domain types
## 7. Theoretical Properties
### 7.1 Convergence Guarantee
**Theorem**: For finite websites, ProgressiveCrawl converges to IS(K, Q) ≥ θ or exhausts all reachable pages.
**Proof sketch**: IS(K, Q) is monotonically non-decreasing with each crawl, bounded above by 1.
### 7.2 Optimality
Under certain assumptions about link preview accuracy:
- Expected crawls ≤ 2 × optimal_crawls
- Approximation ratio improves with preview quality
## 8. Implementation Design
### 8.1 Core Components
1. **CrawlState**: Maintains crawl history and metrics
2. **AdaptiveConfig**: Configuration parameters
3. **CrawlStrategy**: Pluggable strategy interface
4. **AdaptiveCrawler**: Main orchestrator
### 8.2 Integration with Crawl4AI
- Wraps existing AsyncWebCrawler
- Leverages link preview functionality
- Maintains backward compatibility
### 8.3 Persistence
Knowledge base serialization for:
- Resumable crawls
- Knowledge sharing
- Offline analysis
## 9. Future Directions
### 9.1 Advanced Scoring
- Temporal information value
- Multi-query optimization
- Active learning from user feedback
### 9.2 Distributed Crawling
- Collaborative knowledge building
- Federated information sufficiency
### 9.3 Domain Adaptation
- Transfer learning across domains
- Meta-learning for threshold selection
## 10. Conclusion
Progressive crawling with adaptive information foraging provides a principled approach to efficient web information extraction. By combining coverage, consistency, and saturation metrics, we can determine information sufficiency without ground truth labels. The multi-strategy architecture allows graceful enhancement from pure statistical to LLM-assisted approaches based on requirements and resources.
## References
1. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
2. Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.
3. Pirolli, P., & Card, S. (1999). Information Foraging. Psychological Review, 106(4), 643-675.
4. Dasgupta, S. (2005). Analysis of a greedy active learning strategy. Advances in Neural Information Processing Systems.
## Appendix A: Implementation Pseudocode
```python
class StatisticalStrategy:
def calculate_confidence(self, state):
coverage = self.calculate_coverage(state)
consistency = self.calculate_consistency(state)
saturation = self.calculate_saturation(state)
return min(coverage, consistency, saturation)
def calculate_coverage(self, state):
# BM25-based term coverage
term_scores = []
for term in state.query.split():
df = state.document_frequencies.get(term, 0)
idf = self.idf_cache.get(term, 1.0)
term_scores.append(log(df + 1) * idf)
return mean(term_scores) / max_possible_score
def rank_links(self, state):
scored_links = []
for link in state.pending_links:
relevance = self.bm25_score(link.preview_text, state.query)
novelty = self.calculate_novelty(link, state.knowledge_base)
authority = self.url_authority(link.href)
score = relevance * novelty * authority
scored_links.append((link, score))
return sorted(scored_links, key=lambda x: x[1], reverse=True)
```
## Appendix B: Evaluation Protocol
1. **Dataset Creation**:
- Select diverse domains (documentation, blogs, e-commerce)
- Generate 100 queries per domain using LLM
- Create query variations (5-10 per query)
2. **Baseline Comparisons**:
- BFS crawler (depth-limited)
- DFS crawler (depth-limited)
- Random crawler
- Oracle (knows relevant pages)
3. **Metrics Collection**:
- Pages crawled vs query answerability
- Time to sufficient confidence
- False positive/negative rates
4. **Statistical Analysis**:
- ANOVA for strategy comparison
- Regression for parameter sensitivity
- Bootstrap for confidence intervals