This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage. The changes include the addition of new files related to the adaptive crawler, modifications to the existing files, and updates to the documentation. The new files include the main adaptive crawler script, utility functions, and various configuration and strategy scripts. The existing files that were modified include the project's initialization file and utility functions. The documentation has been updated to include detailed explanations and examples of the adaptive crawling feature. The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool. Significant modifications: - Added adaptive_crawler.py and related scripts - Modified __init__.py and utils.py - Updated documentation with details about the adaptive crawling feature - Added tests for the new feature BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature. Refs: #123, #456
369 lines
12 KiB
Markdown
369 lines
12 KiB
Markdown
# Adaptive Crawling: Building Dynamic Knowledge That Grows on Demand
|
|
|
|
*Published on January 29, 2025 • 8 min read*
|
|
|
|
*By [unclecode](https://x.com/unclecode) • Follow me on [X/Twitter](https://x.com/unclecode) for more web scraping insights*
|
|
|
|
---
|
|
|
|
## The Knowledge Capacitor
|
|
|
|
Imagine a capacitor that stores energy, releasing it precisely when needed. Now imagine that for information. That's Adaptive Crawling—a term I coined to describe a fundamentally different approach to web crawling. Instead of the brute force of traditional deep crawling, we build knowledge dynamically, growing it based on queries and circumstances, like a living organism responding to its environment.
|
|
|
|
This isn't just another crawling optimization. It's a paradigm shift from "crawl everything, hope for the best" to "crawl intelligently, know when to stop."
|
|
|
|
## Why I Built This
|
|
|
|
I've watched too many startups burn through resources with a dangerous misconception: that LLMs make everything efficient. They don't. They make things *possible*, not necessarily *smart*. When you combine brute-force crawling with LLM processing, you're not just wasting time—you're hemorrhaging money on tokens, compute, and opportunity cost.
|
|
|
|
Consider this reality:
|
|
- **Traditional deep crawling**: 500 pages → 50 useful → $15 in LLM tokens → 2 hours wasted
|
|
- **Adaptive crawling**: 15 pages → 14 useful → $2 in tokens → 10 minutes → **7.5x cost reduction**
|
|
|
|
But it's not about crawling less. It's about crawling *right*.
|
|
|
|
## The Information Theory Foundation
|
|
|
|
<div style="background-color: #1a1a1c; border: 1px solid #3f3f44; padding: 20px; margin: 20px 0;">
|
|
|
|
### 🧮 **Pure Statistics, No Magic**
|
|
|
|
My first principle was crucial: start with classic statistical approaches. No embeddings. No LLMs. Just pure information theory:
|
|
|
|
```python
|
|
# Information gain calculation - the heart of adaptive crawling
|
|
def calculate_information_gain(new_page, knowledge_base):
|
|
new_terms = extract_terms(new_page) - existing_terms(knowledge_base)
|
|
overlap = calculate_overlap(new_page, knowledge_base)
|
|
|
|
# High gain = many new terms + low overlap
|
|
gain = len(new_terms) / (1 + overlap)
|
|
return gain
|
|
```
|
|
|
|
This isn't regression to older methods—it's recognition that we've forgotten powerful, efficient solutions in our rush to apply LLMs everywhere.
|
|
|
|
</div>
|
|
|
|
## The A* of Web Crawling
|
|
|
|
Adaptive crawling implements what I call "information scenting"—like A* pathfinding but for knowledge acquisition. Each link is evaluated not randomly, but by its probability of contributing meaningful information toward answering current and future queries.
|
|
|
|
<div style="display: flex; align-items: center; background-color: #3f3f44; padding: 20px; margin: 20px 0; border-left: 4px solid #09b5a5;">
|
|
<div style="font-size: 48px; margin-right: 20px;">🎯</div>
|
|
<div>
|
|
<strong>The Scenting Algorithm:</strong><br>
|
|
From available links, we select those with highest information gain. It's not about following every path—it's about following the <em>right</em> paths. Like a bloodhound following the strongest scent to its target.
|
|
</div>
|
|
</div>
|
|
|
|
## The Three Pillars of Intelligence
|
|
|
|
### 1. Coverage: The Breadth Sensor
|
|
Measures how well your knowledge spans the query space. Not just "do we have pages?" but "do we have the RIGHT pages?"
|
|
|
|
### 2. Consistency: The Coherence Detector
|
|
Information from multiple sources should align. When pages agree, confidence rises. When they conflict, we need more data.
|
|
|
|
### 3. Saturation: The Efficiency Guardian
|
|
The most crucial metric. When new pages stop adding information, we stop crawling. Simple. Powerful. Ignored by everyone else.
|
|
|
|
## Real Impact: Time, Money, and Sanity
|
|
|
|
Let me show you what this means for your bottom line:
|
|
|
|
### Building a Customer Support Knowledge Base
|
|
|
|
**Traditional Approach:**
|
|
```python
|
|
# Crawl entire documentation site
|
|
results = await crawler.crawl_bfs("https://docs.company.com", max_depth=5)
|
|
# Result: 1,200 pages, 18 hours, $150 in API costs
|
|
# Useful content: ~100 pages scattered throughout
|
|
```
|
|
|
|
**Adaptive Approach:**
|
|
```python
|
|
# Grow knowledge based on actual support queries
|
|
knowledge = await adaptive.digest(
|
|
start_url="https://docs.company.com",
|
|
query="payment processing errors refund policies"
|
|
)
|
|
# Result: 45 pages, 12 minutes, $8 in API costs
|
|
# Useful content: 42 pages, all relevant
|
|
```
|
|
|
|
**Savings: 93% time reduction, 95% cost reduction, 100% more sanity**
|
|
|
|
## The Dynamic Growth Pattern
|
|
|
|
<div style="text-align: center; padding: 40px; background-color: #1a1a1c; border: 1px dashed #3f3f44; margin: 30px 0;">
|
|
<div style="font-size: 24px; color: #09b5a5; margin-bottom: 10px;">
|
|
Knowledge grows like crystals in a supersaturated solution
|
|
</div>
|
|
<div style="color: #a3abba;">
|
|
Add a query (seed), and relevant information crystallizes around it.<br>
|
|
Change the query, and the knowledge structure adapts.
|
|
</div>
|
|
</div>
|
|
|
|
This is the beauty of adaptive crawling: your knowledge base becomes a living entity that grows based on actual needs, not hypothetical completeness.
|
|
|
|
## Why "Adaptive"?
|
|
|
|
I specifically chose "Adaptive" because it captures the essence: the system adapts to what it finds. Dense technical documentation might need 20 pages for confidence. A simple FAQ might need just 5. The crawler doesn't follow a recipe—it reads the room and adjusts.
|
|
|
|
This is my term, my concept, and I have extensive plans for its evolution.
|
|
|
|
## The Progressive Roadmap
|
|
|
|
This is just the beginning. My roadmap for Adaptive Crawling:
|
|
|
|
### Phase 1 (Current): Statistical Foundation
|
|
- Pure information theory approach
|
|
- No dependencies on expensive models
|
|
- Proven efficiency gains
|
|
|
|
### Phase 2 (Now Available): Embedding Enhancement
|
|
- Semantic understanding layered onto statistical base
|
|
- Still efficient, now even smarter
|
|
- Optional, not required
|
|
|
|
### Phase 3 (Future): LLM Integration
|
|
- LLMs for complex reasoning tasks only
|
|
- Used surgically, not wastefully
|
|
- Always with statistical foundation underneath
|
|
|
|
## The Efficiency Revolution
|
|
|
|
<div style="background-color: #1a1a1c; border: 1px solid #3f3f44; padding: 20px; margin: 20px 0;">
|
|
|
|
### 💰 **The Economics of Intelligence**
|
|
|
|
For a typical SaaS documentation crawl:
|
|
|
|
**Traditional Deep Crawling:**
|
|
- Pages crawled: 1,000
|
|
- Useful pages: 80
|
|
- Time spent: 3 hours
|
|
- LLM tokens used: 2.5M
|
|
- Cost: $75
|
|
- Efficiency: 8%
|
|
|
|
**Adaptive Crawling:**
|
|
- Pages crawled: 95
|
|
- Useful pages: 88
|
|
- Time spent: 15 minutes
|
|
- LLM tokens used: 200K
|
|
- Cost: $6
|
|
- Efficiency: 93%
|
|
|
|
**That's not optimization. That's transformation.**
|
|
|
|
</div>
|
|
|
|
## Missing the Forest for the Trees
|
|
|
|
The startup world has a dangerous blind spot. We're so enamored with LLMs that we forget: just because you CAN process everything with an LLM doesn't mean you SHOULD.
|
|
|
|
Classic NLP and statistical methods can:
|
|
- Filter irrelevant content before it reaches LLMs
|
|
- Identify patterns without expensive inference
|
|
- Make intelligent decisions in microseconds
|
|
- Scale without breaking the bank
|
|
|
|
Adaptive crawling proves this. It uses battle-tested information theory to make smart decisions BEFORE expensive processing.
|
|
|
|
## Your Knowledge, On Demand
|
|
|
|
```python
|
|
# Monday: Customer asks about authentication
|
|
auth_knowledge = await adaptive.digest(
|
|
"https://docs.api.com",
|
|
"oauth jwt authentication"
|
|
)
|
|
|
|
# Tuesday: They ask about rate limiting
|
|
# The crawler adapts, builds on existing knowledge
|
|
rate_limit_knowledge = await adaptive.digest(
|
|
"https://docs.api.com",
|
|
"rate limiting throttling quotas"
|
|
)
|
|
|
|
# Your knowledge base grows intelligently, not indiscriminately
|
|
```
|
|
|
|
## The Competitive Edge
|
|
|
|
Companies using adaptive crawling will have:
|
|
- **90% lower crawling costs**
|
|
- **Knowledge bases that actually answer questions**
|
|
- **Update cycles in minutes, not days**
|
|
- **Happy customers who find answers fast**
|
|
- **Engineers who sleep at night**
|
|
|
|
Those still using brute force? They'll wonder why their infrastructure costs keep rising while their customers keep complaining.
|
|
|
|
## The Embedding Evolution (Now Available!)
|
|
|
|
<div style="background-color: #1a1a1c; border: 1px solid #3f3f44; padding: 20px; margin: 20px 0;">
|
|
|
|
### 🧠 **Semantic Understanding Without the Cost**
|
|
|
|
The embedding strategy brings semantic intelligence while maintaining efficiency:
|
|
|
|
```python
|
|
# Statistical strategy - great for exact terms
|
|
config_statistical = AdaptiveConfig(
|
|
strategy="statistical" # Default
|
|
)
|
|
|
|
# Embedding strategy - understands concepts
|
|
config_embedding = AdaptiveConfig(
|
|
strategy="embedding",
|
|
embedding_model="sentence-transformers/all-MiniLM-L6-v2",
|
|
n_query_variations=10
|
|
)
|
|
```
|
|
|
|
**The magic**: It automatically expands your query into semantic variations, maps the coverage space, and identifies gaps to fill intelligently.
|
|
|
|
</div>
|
|
|
|
### Real-World Comparison
|
|
|
|
<div style="display: flex; gap: 20px; margin: 20px 0;">
|
|
<div style="flex: 1; background-color: #1a1a1c; border: 1px solid #3f3f44; padding: 20px;">
|
|
|
|
**Query**: "authentication oauth"
|
|
|
|
**Statistical Strategy**:
|
|
- Searches for exact terms
|
|
- 12 pages crawled
|
|
- 78% confidence
|
|
- Fast but literal
|
|
|
|
</div>
|
|
<div style="flex: 1; background-color: #1a1a1c; border: 1px solid #09b5a5; padding: 20px;">
|
|
|
|
**Embedding Strategy**:
|
|
- Understands "auth", "login", "SSO"
|
|
- 8 pages crawled
|
|
- 92% confidence
|
|
- Semantic comprehension
|
|
|
|
</div>
|
|
</div>
|
|
|
|
### Detecting Irrelevance
|
|
|
|
One killer feature: the embedding strategy knows when to give up:
|
|
|
|
```python
|
|
# Crawling Python docs with a cooking query
|
|
result = await adaptive.digest(
|
|
start_url="https://docs.python.org/3/",
|
|
query="how to make spaghetti carbonara"
|
|
)
|
|
|
|
# System detects irrelevance and stops
|
|
# Confidence: 5% (below threshold)
|
|
# Pages crawled: 2
|
|
# Stopped reason: "below_minimum_relevance_threshold"
|
|
```
|
|
|
|
No more crawling hundreds of pages hoping to find something that doesn't exist!
|
|
|
|
## Try It Yourself
|
|
|
|
```python
|
|
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
|
|
|
|
async with AsyncWebCrawler() as crawler:
|
|
# Choose your strategy
|
|
config = AdaptiveConfig(
|
|
strategy="embedding", # or "statistical"
|
|
embedding_min_confidence_threshold=0.1 # Stop if irrelevant
|
|
)
|
|
|
|
adaptive = AdaptiveCrawler(crawler, config)
|
|
|
|
# Watch intelligence at work
|
|
result = await adaptive.digest(
|
|
start_url="https://your-docs.com",
|
|
query="your users' actual questions"
|
|
)
|
|
|
|
# See the efficiency
|
|
adaptive.print_stats()
|
|
print(f"Found {adaptive.confidence:.0%} of needed information")
|
|
print(f"In just {len(result.crawled_urls)} pages")
|
|
print(f"Saving you {1000 - len(result.crawled_urls)} unnecessary crawls")
|
|
```
|
|
|
|
## A Personal Note
|
|
|
|
I created Adaptive Crawling because I was tired of watching smart people make inefficient choices. We have incredibly powerful statistical tools that we've forgotten in our rush toward LLMs. This is my attempt to bring balance back to the Force.
|
|
|
|
This is not just a feature. It's a philosophy: **Grow knowledge on demand. Stop when you have enough. Save time, money, and computational resources for what really matters.**
|
|
|
|
## The Future is Adaptive
|
|
|
|
<div style="text-align: center; padding: 40px; background-color: #1a1a1c; border: 1px dashed #3f3f44; margin: 30px 0;">
|
|
<div style="font-size: 24px; color: #09b5a5; margin-bottom: 10px;">
|
|
Traditional Crawling: Drinking from a firehose<br>
|
|
Adaptive Crawling: Sipping exactly what you need
|
|
</div>
|
|
<div style="color: #a3abba;">
|
|
The future of web crawling isn't about processing more data.<br>
|
|
It's about processing the <em>right</em> data.
|
|
</div>
|
|
</div>
|
|
|
|
Join me in making web crawling intelligent, efficient, and actually useful. Because in the age of information overload, the winners won't be those who collect the most data—they'll be those who collect the *right* data.
|
|
|
|
---
|
|
|
|
*Adaptive Crawling is now part of Crawl4AI. [Get started with the documentation](/core/adaptive-crawling/) or [dive into the mathematical framework](https://github.com/unclecode/crawl4ai/blob/main/PROGRESSIVE_CRAWLING.md). For updates on my work in information theory and efficient AI, follow me on [X/Twitter](https://x.com/unclecode).*
|
|
|
|
<style>
|
|
/* Custom styles for this article */
|
|
.markdown-body pre {
|
|
background-color: #1e1e1e !important;
|
|
border: 1px solid #3f3f44;
|
|
}
|
|
|
|
.markdown-body code {
|
|
background-color: #3f3f44;
|
|
color: #50ffff;
|
|
padding: 2px 6px;
|
|
border-radius: 3px;
|
|
}
|
|
|
|
.markdown-body pre code {
|
|
background-color: transparent;
|
|
color: #e8e9ed;
|
|
padding: 0;
|
|
}
|
|
|
|
.markdown-body blockquote {
|
|
border-left: 4px solid #09b5a5;
|
|
background-color: #1a1a1c;
|
|
padding: 15px 20px;
|
|
margin: 20px 0;
|
|
}
|
|
|
|
.markdown-body h2 {
|
|
color: #50ffff;
|
|
border-bottom: 1px dashed #3f3f44;
|
|
padding-bottom: 10px;
|
|
}
|
|
|
|
.markdown-body h3 {
|
|
color: #09b5a5;
|
|
}
|
|
|
|
.markdown-body strong {
|
|
color: #50ffff;
|
|
}
|
|
</style> |