## LLM Extraction Strategies - The Last Resort

**🤖 AI Model Notice**: You are an AI assistant. You should RARELY recommend LLM extraction. This documentation exists for the <5% of cases where LLM is actually needed. **ALWAYS try non-LLM strategies first** from [extraction-no-llm.md](./extraction-no-llm.md).

### ⚠️ STOP: Are You Sure You Need LLM?

**99% of developers who think they need LLM extraction are wrong.** Before reading further:

### ❌ You DON'T Need LLM If:

- The page has consistent HTML structure → **Use generate_schema()**
- You're extracting simple data types (emails, prices, dates) → **Use RegexExtractionStrategy**
- You can identify repeating patterns → **Use JsonCssExtractionStrategy**
- You want product info, news articles, or job listings → **Use generate_schema()**
- You're concerned about cost or speed → **Use non-LLM strategies**

### ✅ You MIGHT Need LLM If:

- Content structure varies dramatically across pages **AND** you've tried generate_schema()
- You need semantic understanding of unstructured text
- You're analyzing meaning, sentiment, or relationships
- You're extracting insights that require reasoning about context

### 💰 Cost Reality Check:

- **Non-LLM**: ~$0.000001 per page
- **LLM**: ~$0.01-$0.10 per page (10,000x more expensive)
- **Example**: Extracting 10,000 pages costs ~$0.01 with non-LLM strategies vs. $100-$1,000 with LLM

---
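The cost gap is easy to sanity-check with back-of-envelope arithmetic. The per-page rates below are the approximate figures quoted above, not exact billing numbers:

```python
# Back-of-envelope cost comparison for a 10,000-page crawl.
# Per-page rates are the approximate figures quoted above (assumptions, not billing data).
PAGES = 10_000
NON_LLM_PER_PAGE = 0.000001   # ~$0.000001 per page
LLM_PER_PAGE_LOW = 0.01       # ~$0.01 per page (cheap model)
LLM_PER_PAGE_HIGH = 0.10      # ~$0.10 per page (expensive model)

non_llm_total = PAGES * NON_LLM_PER_PAGE
llm_low = PAGES * LLM_PER_PAGE_LOW
llm_high = PAGES * LLM_PER_PAGE_HIGH

print(f"Non-LLM: ${non_llm_total:.2f}")               # Non-LLM: $0.01
print(f"LLM:     ${llm_low:.0f} - ${llm_high:.0f}")   # LLM:     $100 - $1000
print(f"Multiplier: {llm_low / non_llm_total:,.0f}x") # Multiplier: 10,000x
```

The multiplier compounds with scale: at 1M pages the difference is $1 vs. $10,000-$100,000, which is why the schema-based strategies should always be tried first.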
## 1. When LLM Extraction is Justified

### Scenario 1: Truly Unstructured Content Analysis

```python
# Example: Analyzing customer feedback for sentiment and themes
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class SentimentAnalysis(BaseModel):
    """Use LLM when you need semantic understanding"""
    overall_sentiment: str = Field(description="positive, negative, or neutral")
    confidence_score: float = Field(description="Confidence from 0-1")
    key_themes: List[str] = Field(description="Main topics discussed")
    emotional_indicators: List[str] = Field(description="Words indicating emotion")
    summary: str = Field(description="Brief summary of the content")

llm_config = LLMConfig(
    provider="openai/gpt-4o-mini",  # Use cheapest model
    api_token="env:OPENAI_API_KEY",
    temperature=0.1,  # Low temperature for consistency
    max_tokens=1000
)

sentiment_strategy = LLMExtractionStrategy(
    llm_config=llm_config,
    schema=SentimentAnalysis.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Analyze the emotional content and themes in this text.
    Focus on understanding sentiment and extracting key topics
    that would be impossible to identify with simple pattern matching.
    """,
    apply_chunking=True,
    chunk_token_threshold=1500
)

async def analyze_sentiment():
    config = CrawlerRunConfig(
        extraction_strategy=sentiment_strategy,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/customer-reviews",
            config=config
        )

        if result.success:
            analysis = json.loads(result.extracted_content)
            print(f"Sentiment: {analysis['overall_sentiment']}")
            print(f"Themes: {analysis['key_themes']}")

asyncio.run(analyze_sentiment())
```
### Scenario 2: Complex Knowledge Extraction

```python
# Example: Building knowledge graphs from unstructured content
class Entity(BaseModel):
    name: str = Field(description="Entity name")
    type: str = Field(description="person, organization, location, concept")
    description: str = Field(description="Brief description")

class Relationship(BaseModel):
    source: str = Field(description="Source entity")
    target: str = Field(description="Target entity")
    relationship: str = Field(description="Type of relationship")
    confidence: float = Field(description="Confidence score 0-1")

class KnowledgeGraph(BaseModel):
    entities: List[Entity] = Field(description="All entities found")
    relationships: List[Relationship] = Field(description="Relationships between entities")
    main_topic: str = Field(description="Primary topic of the content")

knowledge_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="anthropic/claude-3-5-sonnet-20240620",  # Better for complex reasoning
        api_token="env:ANTHROPIC_API_KEY",
        max_tokens=4000
    ),
    schema=KnowledgeGraph.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Extract entities and their relationships from the content.
    Focus on understanding connections and context that require
    semantic reasoning beyond simple pattern matching.
    """,
    input_format="html",  # Preserve structure
    apply_chunking=True
)
```
### Scenario 3: Content Summarization and Insights

```python
# Example: Research paper analysis
class ResearchInsights(BaseModel):
    title: str = Field(description="Paper title")
    abstract_summary: str = Field(description="Summary of abstract")
    key_findings: List[str] = Field(description="Main research findings")
    methodology: str = Field(description="Research methodology used")
    limitations: List[str] = Field(description="Study limitations")
    practical_applications: List[str] = Field(description="Real-world applications")
    citations_count: int = Field(description="Number of citations", default=0)

research_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o",  # Use powerful model for complex analysis
        api_token="env:OPENAI_API_KEY",
        temperature=0.2,
        max_tokens=2000
    ),
    schema=ResearchInsights.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Analyze this research paper and extract key insights.
    Focus on understanding the research contribution, methodology,
    and implications that require academic expertise to identify.
    """,
    apply_chunking=True,
    chunk_token_threshold=2000,
    overlap_rate=0.15  # More overlap for academic content
)
```
---

## 2. LLM Configuration Best Practices

### Cost Optimization

```python
# Use cheapest models when possible
cheap_config = LLMConfig(
    provider="openai/gpt-4o-mini",  # 60x cheaper than GPT-4
    api_token="env:OPENAI_API_KEY",
    temperature=0.0,  # Deterministic output
    max_tokens=800  # Limit output length
)

# Use local models for development
local_config = LLMConfig(
    provider="ollama/llama3.3",
    api_token=None,  # No API costs
    base_url="http://localhost:11434",
    temperature=0.1
)

# Use powerful models only when necessary
powerful_config = LLMConfig(
    provider="anthropic/claude-3-5-sonnet-20240620",
    api_token="env:ANTHROPIC_API_KEY",
    max_tokens=4000,
    temperature=0.1
)
```
### Provider Selection Guide

```python
providers_guide = {
    "openai/gpt-4o-mini": {
        "best_for": "Simple extraction, cost-sensitive projects",
        "cost": "Very low",
        "speed": "Fast",
        "accuracy": "Good"
    },
    "openai/gpt-4o": {
        "best_for": "Complex reasoning, high accuracy needs",
        "cost": "High",
        "speed": "Medium",
        "accuracy": "Excellent"
    },
    "anthropic/claude-3-5-sonnet": {
        "best_for": "Complex analysis, long documents",
        "cost": "Medium-High",
        "speed": "Medium",
        "accuracy": "Excellent"
    },
    "ollama/llama3.3": {
        "best_for": "Development, no API costs",
        "cost": "Free (self-hosted)",
        "speed": "Variable",
        "accuracy": "Good"
    },
    "groq/llama3-70b-8192": {
        "best_for": "Fast inference, open source",
        "cost": "Low",
        "speed": "Very fast",
        "accuracy": "Good"
    }
}

def choose_provider(complexity, budget, speed_requirement):
    """Choose optimal provider based on requirements"""
    if budget == "minimal":
        return "ollama/llama3.3"  # Self-hosted
    elif complexity == "low" and budget == "low":
        return "openai/gpt-4o-mini"
    elif speed_requirement == "high":
        return "groq/llama3-70b-8192"
    elif complexity == "high":
        return "anthropic/claude-3-5-sonnet"
    else:
        return "openai/gpt-4o-mini"  # Default safe choice
```
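Because the branches are checked in order, budget and speed constraints take precedence over complexity. A couple of calls make that precedence concrete (the selector is restated here so the snippet runs standalone):

```python
# Restated from the guide above so this snippet is self-contained.
def choose_provider(complexity, budget, speed_requirement):
    if budget == "minimal":
        return "ollama/llama3.3"
    elif complexity == "low" and budget == "low":
        return "openai/gpt-4o-mini"
    elif speed_requirement == "high":
        return "groq/llama3-70b-8192"
    elif complexity == "high":
        return "anthropic/claude-3-5-sonnet"
    else:
        return "openai/gpt-4o-mini"

# A "minimal" budget wins even for a high-complexity task:
print(choose_provider("high", "minimal", "low"))   # ollama/llama3.3
# A "high" speed requirement wins over complexity:
print(choose_provider("high", "medium", "high"))   # groq/llama3-70b-8192
# Only then does complexity pick the powerful model:
print(choose_provider("high", "medium", "low"))    # anthropic/claude-3-5-sonnet
```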
---

## 3. Advanced LLM Extraction Patterns

### Block-Based Extraction (Unstructured Content)

```python
# When structure is too varied for schemas
block_strategy = LLMExtractionStrategy(
    llm_config=cheap_config,
    extraction_type="block",  # Extract free-form content blocks
    instruction="""
    Extract meaningful content blocks from this page.
    Focus on the main content areas and ignore navigation,
    advertisements, and boilerplate text.
    """,
    apply_chunking=True,
    chunk_token_threshold=1200,
    input_format="fit_markdown"  # Use cleaned content
)

async def extract_content_blocks():
    config = CrawlerRunConfig(
        extraction_strategy=block_strategy,
        word_count_threshold=50,  # Filter short content
        excluded_tags=['nav', 'footer', 'aside', 'advertisement']
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/article",
            config=config
        )

        if result.success:
            blocks = json.loads(result.extracted_content)
            for block in blocks:
                print(f"Block: {block['content'][:100]}...")
```
### Chunked Processing for Large Content

```python
# Handle large documents efficiently
large_content_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",
        api_token="env:OPENAI_API_KEY"
    ),
    schema=YourModel.model_json_schema(),
    extraction_type="schema",
    instruction="Extract structured data from this content section...",

    # Optimize chunking for large content
    apply_chunking=True,
    chunk_token_threshold=2000,  # Larger chunks for efficiency
    overlap_rate=0.1,  # Minimal overlap to reduce costs
    input_format="fit_markdown"  # Use cleaned content
)
```
### Multi-Model Validation

```python
# Use multiple models for critical extractions
async def multi_model_extraction():
    """Use multiple LLMs for validation of critical data"""

    models = [
        LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
        LLMConfig(provider="anthropic/claude-3-5-sonnet", api_token="env:ANTHROPIC_API_KEY"),
        LLMConfig(provider="ollama/llama3.3", api_token=None)
    ]

    results = []

    for i, llm_config in enumerate(models):
        strategy = LLMExtractionStrategy(
            llm_config=llm_config,
            schema=YourModel.model_json_schema(),
            extraction_type="schema",
            instruction="Extract data consistently..."
        )

        config = CrawlerRunConfig(extraction_strategy=strategy)

        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url="https://example.com", config=config)
            if result.success:
                data = json.loads(result.extracted_content)
                results.append(data)
                print(f"Model {i+1} extracted {len(data)} items")

    # Compare results for consistency
    if len(set(str(r) for r in results)) == 1:
        print("✅ All models agree")
        return results[0]
    else:
        print("⚠️ Models disagree - manual review needed")
        return results

# Use for critical business data only
critical_result = await multi_model_extraction()
```
---

## 4. Hybrid Approaches - Best of Both Worlds

### Fast Pre-filtering + LLM Analysis

```python
async def hybrid_extraction():
    """
    1. Use fast non-LLM strategies for basic extraction
    2. Use LLM only for complex analysis of filtered content
    """

    # Step 1: Fast extraction of structured data
    basic_schema = {
        "name": "Articles",
        "baseSelector": "article",
        "fields": [
            {"name": "title", "selector": "h1, h2", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"},
            {"name": "author", "selector": ".author", "type": "text"}
        ]
    }

    basic_strategy = JsonCssExtractionStrategy(basic_schema)
    basic_config = CrawlerRunConfig(extraction_strategy=basic_strategy)

    # Step 2: LLM analysis only on filtered content
    analysis_strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        schema={
            "type": "object",
            "properties": {
                "sentiment": {"type": "string"},
                "key_topics": {"type": "array", "items": {"type": "string"}},
                "summary": {"type": "string"}
            }
        },
        extraction_type="schema",
        instruction="Analyze sentiment and extract key topics from this article"
    )

    async with AsyncWebCrawler() as crawler:
        # Fast extraction first
        basic_result = await crawler.arun(
            url="https://example.com/articles",
            config=basic_config
        )

        articles = json.loads(basic_result.extracted_content)

        # LLM analysis only on important articles
        analyzed_articles = []
        for article in articles[:5]:  # Limit to reduce costs
            if len(article.get('content', '')) > 500:  # Only analyze substantial content
                analysis_config = CrawlerRunConfig(extraction_strategy=analysis_strategy)

                # Analyze individual article content
                raw_url = f"raw://{article['content']}"
                analysis_result = await crawler.arun(url=raw_url, config=analysis_config)

                if analysis_result.success:
                    analysis = json.loads(analysis_result.extracted_content)
                    article.update(analysis)

            analyzed_articles.append(article)

    return analyzed_articles

# Hybrid approach: fast + smart
result = await hybrid_extraction()
```
### Schema Generation + LLM Fallback

```python
async def smart_fallback_extraction():
    """
    1. Try generate_schema() first (one-time LLM cost)
    2. Use the generated schema for fast extraction
    3. Use LLM only if schema extraction fails
    """

    cache_file = Path("./schemas/fallback_schema.json")

    # Try cached schema first
    if cache_file.exists():
        schema = json.load(cache_file.open())
        schema_strategy = JsonCssExtractionStrategy(schema)

        config = CrawlerRunConfig(extraction_strategy=schema_strategy)

        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url="https://example.com", config=config)

            if result.success and result.extracted_content:
                data = json.loads(result.extracted_content)
                if data:  # Schema worked
                    print("✅ Schema extraction successful (fast & cheap)")
                    return data

    # Fallback to LLM if schema failed
    print("⚠️ Schema failed, falling back to LLM (slow & expensive)")

    llm_strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        extraction_type="block",
        instruction="Extract all meaningful data from this page"
    )

    fallback_config = CrawlerRunConfig(extraction_strategy=llm_strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=fallback_config)

        if result.success:
            print("✅ LLM extraction successful")
            return json.loads(result.extracted_content)

# Intelligent fallback system
result = await smart_fallback_extraction()
```
---

## 5. Cost Management and Monitoring

### Token Usage Tracking

```python
class ExtractionCostTracker:
    def __init__(self):
        self.total_cost = 0.0
        self.total_tokens = 0
        self.extractions = 0

    def track_llm_extraction(self, strategy, result):
        """Track costs from LLM extraction"""
        if hasattr(strategy, 'usage_tracker') and strategy.usage_tracker:
            usage = strategy.usage_tracker

            # Estimate costs (approximate rates)
            cost_per_1k_tokens = {
                "gpt-4o-mini": 0.0015,
                "gpt-4o": 0.03,
                "claude-3-5-sonnet": 0.015,
                "ollama": 0.0  # Self-hosted
            }

            provider = strategy.llm_config.provider.split('/')[1]
            rate = cost_per_1k_tokens.get(provider, 0.01)

            tokens = usage.total_tokens
            cost = (tokens / 1000) * rate

            self.total_cost += cost
            self.total_tokens += tokens
            self.extractions += 1

            print(f"💰 Extraction cost: ${cost:.4f} ({tokens} tokens)")
            print(f"📊 Total cost: ${self.total_cost:.4f} ({self.extractions} extractions)")

    def get_summary(self):
        avg_cost = self.total_cost / max(self.extractions, 1)
        return {
            "total_cost": self.total_cost,
            "total_tokens": self.total_tokens,
            "extractions": self.extractions,
            "avg_cost_per_extraction": avg_cost
        }

# Usage
tracker = ExtractionCostTracker()

async def cost_aware_extraction():
    strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        schema=YourModel.model_json_schema(),
        extraction_type="schema",
        instruction="Extract data...",
        verbose=True  # Enable usage tracking
    )

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)

        # Track costs
        tracker.track_llm_extraction(strategy, result)

        return result

# Monitor costs across multiple extractions
for url in urls:
    await cost_aware_extraction()

print(f"Final summary: {tracker.get_summary()}")
```
### Budget Controls

```python
class BudgetController:
    def __init__(self, daily_budget=10.0):
        self.daily_budget = daily_budget
        self.current_spend = 0.0
        self.extraction_count = 0

    def can_extract(self, estimated_cost=0.01):
        """Check if extraction is within budget"""
        if self.current_spend + estimated_cost > self.daily_budget:
            print(f"❌ Budget exceeded: ${self.current_spend:.2f} + ${estimated_cost:.2f} > ${self.daily_budget}")
            return False
        return True

    def record_extraction(self, actual_cost):
        """Record actual extraction cost"""
        self.current_spend += actual_cost
        self.extraction_count += 1

        remaining = self.daily_budget - self.current_spend
        print(f"💰 Budget remaining: ${remaining:.2f}")

budget = BudgetController(daily_budget=5.0)  # $5 daily limit

async def budget_controlled_extraction(url):
    if not budget.can_extract():
        print("⏸️ Extraction paused due to budget limit")
        return None

    # Proceed with extraction...
    strategy = LLMExtractionStrategy(llm_config=cheap_config, ...)
    result = await extract_with_strategy(url, strategy)

    # Record actual cost
    actual_cost = calculate_cost(strategy.usage_tracker)
    budget.record_extraction(actual_cost)

    return result

# Safe extraction with budget controls
results = []
for url in urls:
    result = await budget_controlled_extraction(url)
    if result:
        results.append(result)
```
---

## 6. Performance Optimization for LLM Extraction

### Batch Processing

```python
async def batch_llm_extraction():
    """Process multiple pages efficiently"""

    # Collect content first (fast)
    urls = ["https://example.com/page1", "https://example.com/page2"]
    contents = []

    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            if result.success:
                contents.append({
                    "url": url,
                    "content": result.fit_markdown[:2000]  # Limit content
                })

        # Process in batches (reduce LLM calls)
        batch_content = "\n\n---PAGE SEPARATOR---\n\n".join([
            f"URL: {c['url']}\n{c['content']}" for c in contents
        ])

        strategy = LLMExtractionStrategy(
            llm_config=cheap_config,
            extraction_type="block",
            instruction="""
            Extract data from multiple pages separated by '---PAGE SEPARATOR---'.
            Return results for each page in order.
            """,
            apply_chunking=True
        )

        # Single LLM call for multiple pages (still inside the crawler context)
        raw_url = f"raw://{batch_content}"
        result = await crawler.arun(url=raw_url, config=CrawlerRunConfig(extraction_strategy=strategy))

        return json.loads(result.extracted_content)

# Batch processing reduces LLM calls
batch_results = await batch_llm_extraction()
```
### Caching LLM Results

```python
import hashlib
from pathlib import Path

class LLMResultCache:
    def __init__(self, cache_dir="./llm_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def get_cache_key(self, url, instruction, schema):
        """Generate cache key from extraction parameters"""
        content = f"{url}:{instruction}:{str(schema)}"
        return hashlib.md5(content.encode()).hexdigest()

    def get_cached_result(self, cache_key):
        """Get cached result if available"""
        cache_file = self.cache_dir / f"{cache_key}.json"
        if cache_file.exists():
            return json.load(cache_file.open())
        return None

    def cache_result(self, cache_key, result):
        """Cache extraction result"""
        cache_file = self.cache_dir / f"{cache_key}.json"
        json.dump(result, cache_file.open("w"), indent=2)

cache = LLMResultCache()

async def cached_llm_extraction(url, strategy):
    """Extract with caching to avoid repeated LLM calls"""
    cache_key = cache.get_cache_key(
        url,
        strategy.instruction,
        str(strategy.schema)
    )

    # Check cache first
    cached_result = cache.get_cached_result(cache_key)
    if cached_result:
        print("✅ Using cached result (FREE)")
        return cached_result

    # Extract if not cached
    print("🔄 Extracting with LLM (PAID)")
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)

        if result.success:
            data = json.loads(result.extracted_content)
            cache.cache_result(cache_key, data)
            return data

# Cached extraction avoids repeated costs
result = await cached_llm_extraction(url, strategy)
```
---

## 7. Error Handling and Quality Control

### Validation and Retry Logic

```python
async def robust_llm_extraction():
    """Implement validation and retry for LLM extraction"""

    max_retries = 3
    strategies = [
        # Try cheap model first
        LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
            schema=YourModel.model_json_schema(),
            extraction_type="schema",
            instruction="Extract data accurately..."
        ),
        # Fallback to better model
        LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o", api_token="env:OPENAI_API_KEY"),
            schema=YourModel.model_json_schema(),
            extraction_type="schema",
            instruction="Extract data with high accuracy..."
        )
    ]

    for strategy_idx, strategy in enumerate(strategies):
        for attempt in range(max_retries):
            try:
                config = CrawlerRunConfig(extraction_strategy=strategy)

                async with AsyncWebCrawler() as crawler:
                    result = await crawler.arun(url="https://example.com", config=config)

                    if result.success and result.extracted_content:
                        data = json.loads(result.extracted_content)

                        # Validate result quality
                        if validate_extraction_quality(data):
                            print(f"✅ Success with strategy {strategy_idx+1}, attempt {attempt+1}")
                            return data
                        else:
                            print("⚠️ Poor quality result, retrying...")
                            continue

            except Exception as e:
                print(f"❌ Attempt {attempt+1} failed: {e}")
                if attempt == max_retries - 1:
                    print(f"❌ Strategy {strategy_idx+1} failed completely")

    print("❌ All strategies and retries failed")
    return None

def validate_extraction_quality(data):
    """Validate that LLM extraction meets quality standards"""
    if not data or not isinstance(data, (list, dict)):
        return False

    # Check for common LLM extraction issues
    if isinstance(data, list):
        if len(data) == 0:
            return False

        # Check if all items have required fields
        for item in data:
            if not isinstance(item, dict) or len(item) < 2:
                return False

    return True

# Robust extraction with validation
result = await robust_llm_extraction()
```
---

## 8. Migration from LLM to Non-LLM

### Pattern Analysis for Schema Generation

```python
async def analyze_llm_results_for_schema():
    """
    Analyze LLM extraction results to create non-LLM schemas.
    Use this to transition from expensive LLM to cheap schema extraction.
    """

    # Step 1: Use LLM on sample pages to understand structure
    llm_strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        extraction_type="block",
        instruction="Extract all structured data from this page"
    )

    sample_urls = ["https://example.com/page1", "https://example.com/page2"]
    llm_results = []

    async with AsyncWebCrawler() as crawler:
        for url in sample_urls:
            config = CrawlerRunConfig(extraction_strategy=llm_strategy)
            result = await crawler.arun(url=url, config=config)

            if result.success:
                llm_results.append({
                    "url": url,
                    "html": result.cleaned_html,
                    "extracted": json.loads(result.extracted_content)
                })

    # Step 2: Analyze patterns in LLM results
    print("🔍 Analyzing LLM extraction patterns...")

    # Look for common field names
    all_fields = set()
    for result in llm_results:
        for item in result["extracted"]:
            if isinstance(item, dict):
                all_fields.update(item.keys())

    print(f"Common fields found: {all_fields}")

    # Step 3: Generate schema based on patterns
    if llm_results:
        schema = JsonCssExtractionStrategy.generate_schema(
            html=llm_results[0]["html"],
            target_json_example=json.dumps(llm_results[0]["extracted"][0], indent=2),
            llm_config=cheap_config
        )

        # Save schema for future use
        with open("generated_schema.json", "w") as f:
            json.dump(schema, f, indent=2)

        print("✅ Schema generated from LLM analysis")
        return schema

# Generate schema from LLM patterns, then use schema for all future extractions
schema = await analyze_llm_results_for_schema()
fast_strategy = JsonCssExtractionStrategy(schema)
```
---

## 9. Summary: When LLM is Actually Needed

### ✅ Valid LLM Use Cases (Rare):

1. **Sentiment analysis** and emotional understanding
2. **Knowledge graph extraction** requiring semantic reasoning
3. **Content summarization** and insight generation
4. **Unstructured text analysis** where patterns vary dramatically
5. **Research paper analysis** requiring domain expertise
6. **Complex relationship extraction** between entities

### ❌ Invalid LLM Use Cases (Common Mistakes):

1. **Structured data extraction** from consistent HTML
2. **Simple pattern matching** (emails, prices, dates)
3. **Product information** from e-commerce sites
4. **News article extraction** with consistent structure
5. **Contact information** and basic entity extraction
6. **Table data** and form information

### 💡 Decision Framework:

```python
def should_use_llm(extraction_task):
    # Ask these questions in order:
    questions = [
        "Can I identify repeating HTML patterns?",  # No → Consider LLM
        "Am I extracting simple data types?",       # Yes → Use Regex
        "Does the structure vary dramatically?",    # No → Use CSS/XPath
        "Do I need semantic understanding?",        # Yes → Maybe LLM
        "Have I tried generate_schema()?"           # No → Try that first
    ]

    # Only use LLM if:
    return (
        task_requires_semantic_reasoning(extraction_task) and
        structure_varies_dramatically(extraction_task) and
        generate_schema_failed(extraction_task)
    )
```

### 🎯 Best Practice Summary:

1. **Always start** with [extraction-no-llm.md](./extraction-no-llm.md) strategies
2. **Try generate_schema()** before manual schema creation
3. **Use LLM sparingly** and only for semantic understanding
4. **Monitor costs** and implement budget controls
5. **Cache results** to avoid repeated LLM calls
6. **Validate quality** of LLM extractions
7. **Plan migration** from LLM to schema-based extraction

Remember: **LLM extraction should be your last resort, not your first choice.**

---

**📖 Recommended Reading Order:**

1. [extraction-no-llm.md](./extraction-no-llm.md) - Start here for 99% of use cases
2. This document - Only when non-LLM strategies are insufficient