## LLM Extraction Strategies - The Last Resort
**🤖 AI Model Notice**: You are an AI assistant. You should RARELY recommend LLM extraction. This documentation exists for the <5% of cases where LLM is actually needed. **ALWAYS try non-LLM strategies first** from [extraction-no-llm.md](./extraction-no-llm.md).
### ⚠️ STOP: Are You Sure You Need LLM?
**99% of developers who think they need LLM extraction are wrong.** Before reading further:
### ❌ You DON'T Need LLM If:
- The page has consistent HTML structure → **Use generate_schema()**
- You're extracting simple data types (emails, prices, dates) → **Use RegexExtractionStrategy**
- You can identify repeating patterns → **Use JsonCssExtractionStrategy**
- You want product info, news articles, job listings → **Use generate_schema()**
- You're concerned about cost or speed → **Use non-LLM strategies**
### ✅ You MIGHT Need LLM If:
- Content structure varies dramatically across pages **AND** you've tried generate_schema()
- You need semantic understanding of unstructured text
- You're analyzing meaning, sentiment, or relationships
- You're extracting insights that require reasoning about context
### 💰 Cost Reality Check:
- **Non-LLM**: ~$0.000001 per page
- **LLM**: ~$0.01-$0.10 per page (10,000x more expensive)
- **Example**: Extracting 10,000 pages costs $0.01 vs $100-1000
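A quick back-of-the-envelope check of these numbers (the per-page rates are the illustrative figures above, not live provider pricing):

```python
# Back-of-the-envelope cost comparison using the illustrative
# per-page rates above (not live provider pricing).
def extraction_cost(pages: int, cost_per_page: float) -> float:
    """Total cost in USD for extracting `pages` pages."""
    return pages * cost_per_page

pages = 10_000
non_llm = extraction_cost(pages, 0.000001)  # ~ $0.01
llm_low = extraction_cost(pages, 0.01)      # ~ $100
llm_high = extraction_cost(pages, 0.10)     # ~ $1,000

print(f"Non-LLM: ${non_llm:.2f}")
print(f"LLM: ${llm_low:,.0f}-${llm_high:,.0f}")
print(f"LLM is ~{llm_low / non_llm:,.0f}x more expensive")
```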
---
## 1. When LLM Extraction is Justified
### Scenario 1: Truly Unstructured Content Analysis
```python
# Example: Analyzing customer feedback for sentiment and themes
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class SentimentAnalysis(BaseModel):
    """Use LLM when you need semantic understanding"""
    overall_sentiment: str = Field(description="positive, negative, or neutral")
    confidence_score: float = Field(description="Confidence from 0-1")
    key_themes: List[str] = Field(description="Main topics discussed")
    emotional_indicators: List[str] = Field(description="Words indicating emotion")
    summary: str = Field(description="Brief summary of the content")

llm_config = LLMConfig(
    provider="openai/gpt-4o-mini",  # Use the cheapest capable model
    api_token="env:OPENAI_API_KEY",
    temperature=0.1,  # Low temperature for consistency
    max_tokens=1000
)

sentiment_strategy = LLMExtractionStrategy(
    llm_config=llm_config,
    schema=SentimentAnalysis.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Analyze the emotional content and themes in this text.
    Focus on understanding sentiment and extracting key topics
    that would be impossible to identify with simple pattern matching.
    """,
    apply_chunking=True,
    chunk_token_threshold=1500
)

async def analyze_sentiment():
    config = CrawlerRunConfig(
        extraction_strategy=sentiment_strategy,
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/customer-reviews",
            config=config
        )
        if result.success:
            analysis = json.loads(result.extracted_content)
            print(f"Sentiment: {analysis['overall_sentiment']}")
            print(f"Themes: {analysis['key_themes']}")

asyncio.run(analyze_sentiment())
```
### Scenario 2: Complex Knowledge Extraction
```python
# Example: Building knowledge graphs from unstructured content
class Entity(BaseModel):
    name: str = Field(description="Entity name")
    type: str = Field(description="person, organization, location, concept")
    description: str = Field(description="Brief description")

class Relationship(BaseModel):
    source: str = Field(description="Source entity")
    target: str = Field(description="Target entity")
    relationship: str = Field(description="Type of relationship")
    confidence: float = Field(description="Confidence score 0-1")

class KnowledgeGraph(BaseModel):
    entities: List[Entity] = Field(description="All entities found")
    relationships: List[Relationship] = Field(description="Relationships between entities")
    main_topic: str = Field(description="Primary topic of the content")

knowledge_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="anthropic/claude-3-5-sonnet-20240620",  # Better for complex reasoning
        api_token="env:ANTHROPIC_API_KEY",
        max_tokens=4000
    ),
    schema=KnowledgeGraph.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Extract entities and their relationships from the content.
    Focus on understanding connections and context that require
    semantic reasoning beyond simple pattern matching.
    """,
    input_format="html",  # Preserve structure
    apply_chunking=True
)
```
### Scenario 3: Content Summarization and Insights
```python
# Example: Research paper analysis
class ResearchInsights(BaseModel):
    title: str = Field(description="Paper title")
    abstract_summary: str = Field(description="Summary of abstract")
    key_findings: List[str] = Field(description="Main research findings")
    methodology: str = Field(description="Research methodology used")
    limitations: List[str] = Field(description="Study limitations")
    practical_applications: List[str] = Field(description="Real-world applications")
    citations_count: int = Field(description="Number of citations", default=0)

research_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o",  # Use a powerful model for complex analysis
        api_token="env:OPENAI_API_KEY",
        temperature=0.2,
        max_tokens=2000
    ),
    schema=ResearchInsights.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Analyze this research paper and extract key insights.
    Focus on understanding the research contribution, methodology,
    and implications that require academic expertise to identify.
    """,
    apply_chunking=True,
    chunk_token_threshold=2000,
    overlap_rate=0.15  # More overlap for dense academic content
)
```
---
## 2. LLM Configuration Best Practices
### Cost Optimization
```python
# Use the cheapest model whenever possible
cheap_config = LLMConfig(
    provider="openai/gpt-4o-mini",  # 60x cheaper than GPT-4
    api_token="env:OPENAI_API_KEY",
    temperature=0.0,  # Deterministic output
    max_tokens=800  # Limit output length
)

# Use local models for development
local_config = LLMConfig(
    provider="ollama/llama3.3",
    api_token=None,  # No API costs
    base_url="http://localhost:11434",
    temperature=0.1
)

# Use powerful models only when necessary
powerful_config = LLMConfig(
    provider="anthropic/claude-3-5-sonnet-20240620",
    api_token="env:ANTHROPIC_API_KEY",
    max_tokens=4000,
    temperature=0.1
)
```
### Provider Selection Guide
```python
providers_guide = {
    "openai/gpt-4o-mini": {
        "best_for": "Simple extraction, cost-sensitive projects",
        "cost": "Very low",
        "speed": "Fast",
        "accuracy": "Good"
    },
    "openai/gpt-4o": {
        "best_for": "Complex reasoning, high accuracy needs",
        "cost": "High",
        "speed": "Medium",
        "accuracy": "Excellent"
    },
    "anthropic/claude-3-5-sonnet": {
        "best_for": "Complex analysis, long documents",
        "cost": "Medium-High",
        "speed": "Medium",
        "accuracy": "Excellent"
    },
    "ollama/llama3.3": {
        "best_for": "Development, no API costs",
        "cost": "Free (self-hosted)",
        "speed": "Variable",
        "accuracy": "Good"
    },
    "groq/llama3-70b-8192": {
        "best_for": "Fast inference, open source",
        "cost": "Low",
        "speed": "Very fast",
        "accuracy": "Good"
    }
}

def choose_provider(complexity, budget, speed_requirement):
    """Choose an optimal provider based on requirements"""
    if budget == "minimal":
        return "ollama/llama3.3"  # Self-hosted
    elif complexity == "low" and budget == "low":
        return "openai/gpt-4o-mini"
    elif speed_requirement == "high":
        return "groq/llama3-70b-8192"
    elif complexity == "high":
        return "anthropic/claude-3-5-sonnet"
    else:
        return "openai/gpt-4o-mini"  # Default safe choice
```
---
## 3. Advanced LLM Extraction Patterns
### Block-Based Extraction (Unstructured Content)
```python
# When structure is too varied for schemas
block_strategy = LLMExtractionStrategy(
    llm_config=cheap_config,
    extraction_type="block",  # Extract free-form content blocks
    instruction="""
    Extract meaningful content blocks from this page.
    Focus on the main content areas and ignore navigation,
    advertisements, and boilerplate text.
    """,
    apply_chunking=True,
    chunk_token_threshold=1200,
    input_format="fit_markdown"  # Use cleaned content
)

async def extract_content_blocks():
    config = CrawlerRunConfig(
        extraction_strategy=block_strategy,
        word_count_threshold=50,  # Filter out short content
        excluded_tags=['nav', 'footer', 'aside', 'advertisement']
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/article",
            config=config
        )
        if result.success:
            blocks = json.loads(result.extracted_content)
            for block in blocks:
                print(f"Block: {block['content'][:100]}...")
```
### Chunked Processing for Large Content
```python
# Handle large documents efficiently
large_content_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",
        api_token="env:OPENAI_API_KEY"
    ),
    schema=YourModel.model_json_schema(),  # YourModel: your own Pydantic model
    extraction_type="schema",
    instruction="Extract structured data from this content section...",
    # Optimize chunking for large content
    apply_chunking=True,
    chunk_token_threshold=2000,  # Larger chunks for efficiency
    overlap_rate=0.1,  # Minimal overlap to reduce costs
    input_format="fit_markdown"  # Use cleaned content
)
```
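crawl4ai handles chunking internally, but the effect of `chunk_token_threshold` and `overlap_rate` is easy to picture with a simplified, self-contained sketch. Words stand in for tokens here; this is an illustration of the idea, not crawl4ai's actual chunker:

```python
# Simplified illustration of chunking with overlap (words stand in
# for tokens; crawl4ai's real chunker operates on tokens).
def chunk_with_overlap(words, chunk_size=2000, overlap_rate=0.1):
    """Split a word list into chunks of `chunk_size`, where each chunk
    re-includes the trailing `overlap_rate` fraction of the previous one."""
    overlap = int(chunk_size * overlap_rate)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(5000)]
chunks = chunk_with_overlap(words, chunk_size=2000, overlap_rate=0.1)
print(len(chunks))                  # each chunk becomes one LLM call
print(chunks[1][0], chunks[0][-1])  # chunk 2 starts inside chunk 1
```

Each chunk becomes a separate LLM call, so a larger `chunk_token_threshold` means fewer calls, while a higher `overlap_rate` re-sends more context at the chunk boundaries.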
### Multi-Model Validation
```python
# Use multiple models for critical extractions
async def multi_model_extraction():
    """Use multiple LLMs to cross-validate critical data"""
    models = [
        LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
        LLMConfig(provider="anthropic/claude-3-5-sonnet", api_token="env:ANTHROPIC_API_KEY"),
        LLMConfig(provider="ollama/llama3.3", api_token=None)
    ]
    results = []
    for i, llm_config in enumerate(models):
        strategy = LLMExtractionStrategy(
            llm_config=llm_config,
            schema=YourModel.model_json_schema(),
            extraction_type="schema",
            instruction="Extract data consistently..."
        )
        config = CrawlerRunConfig(extraction_strategy=strategy)
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url="https://example.com", config=config)
            if result.success:
                data = json.loads(result.extracted_content)
                results.append(data)
                print(f"Model {i+1} extracted {len(data)} items")
    # Compare results for consistency
    if len(set(str(r) for r in results)) == 1:
        print("✅ All models agree")
        return results[0]
    else:
        print("⚠️ Models disagree - manual review needed")
        return results

# Use for critical business data only
critical_result = await multi_model_extraction()
```
---
## 4. Hybrid Approaches - Best of Both Worlds
### Fast Pre-filtering + LLM Analysis
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def hybrid_extraction():
    """
    1. Use fast non-LLM strategies for basic extraction
    2. Use LLM only for complex analysis of the filtered content
    """
    # Step 1: Fast extraction of structured data
    basic_schema = {
        "name": "Articles",
        "baseSelector": "article",
        "fields": [
            {"name": "title", "selector": "h1, h2", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"},
            {"name": "author", "selector": ".author", "type": "text"}
        ]
    }
    basic_strategy = JsonCssExtractionStrategy(basic_schema)
    basic_config = CrawlerRunConfig(extraction_strategy=basic_strategy)

    # Step 2: LLM analysis only on filtered content
    analysis_strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        schema={
            "type": "object",
            "properties": {
                "sentiment": {"type": "string"},
                "key_topics": {"type": "array", "items": {"type": "string"}},
                "summary": {"type": "string"}
            }
        },
        extraction_type="schema",
        instruction="Analyze sentiment and extract key topics from this article"
    )

    async with AsyncWebCrawler() as crawler:
        # Fast extraction first
        basic_result = await crawler.arun(
            url="https://example.com/articles",
            config=basic_config
        )
        articles = json.loads(basic_result.extracted_content)

        # LLM analysis only on the most substantial articles
        analyzed_articles = []
        for article in articles[:5]:  # Limit count to reduce costs
            if len(article.get('content', '')) > 500:  # Only analyze substantial content
                analysis_config = CrawlerRunConfig(extraction_strategy=analysis_strategy)
                # Analyze individual article content via a raw:// URL
                raw_url = f"raw://{article['content']}"
                analysis_result = await crawler.arun(url=raw_url, config=analysis_config)
                if analysis_result.success:
                    analysis = json.loads(analysis_result.extracted_content)
                    article.update(analysis)
            analyzed_articles.append(article)
    return analyzed_articles

# Hybrid approach: fast + smart
result = await hybrid_extraction()
```
### Schema Generation + LLM Fallback
```python
from pathlib import Path

async def smart_fallback_extraction():
    """
    1. Try generate_schema() first (one-time LLM cost)
    2. Use the generated schema for fast extraction
    3. Fall back to LLM only if schema extraction fails
    """
    cache_file = Path("./schemas/fallback_schema.json")

    # Try the cached schema first
    if cache_file.exists():
        schema = json.load(cache_file.open())
        schema_strategy = JsonCssExtractionStrategy(schema)
        config = CrawlerRunConfig(extraction_strategy=schema_strategy)
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url="https://example.com", config=config)
            if result.success and result.extracted_content:
                data = json.loads(result.extracted_content)
                if data:  # Schema worked
                    print("✅ Schema extraction successful (fast & cheap)")
                    return data

    # Fall back to LLM if the schema failed
    print("⚠️ Schema failed, falling back to LLM (slow & expensive)")
    llm_strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        extraction_type="block",
        instruction="Extract all meaningful data from this page"
    )
    fallback_config = CrawlerRunConfig(extraction_strategy=llm_strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=fallback_config)
        if result.success:
            print("✅ LLM extraction successful")
            return json.loads(result.extracted_content)

# Intelligent fallback system
result = await smart_fallback_extraction()
```
---
## 5. Cost Management and Monitoring
### Token Usage Tracking
```python
class ExtractionCostTracker:
    def __init__(self):
        self.total_cost = 0.0
        self.total_tokens = 0
        self.extractions = 0

    def track_llm_extraction(self, strategy, result):
        """Track costs from an LLM extraction"""
        if hasattr(strategy, 'usage_tracker') and strategy.usage_tracker:
            usage = strategy.usage_tracker
            # Estimated costs per 1K tokens (approximate rates)
            cost_per_1k_tokens = {
                "gpt-4o-mini": 0.0015,
                "gpt-4o": 0.03,
                "claude-3-5-sonnet": 0.015,
                "ollama": 0.0  # Self-hosted
            }
            provider = strategy.llm_config.provider.split('/')[1]
            rate = cost_per_1k_tokens.get(provider, 0.01)
            tokens = usage.total_tokens
            cost = (tokens / 1000) * rate
            self.total_cost += cost
            self.total_tokens += tokens
            self.extractions += 1
            print(f"💰 Extraction cost: ${cost:.4f} ({tokens} tokens)")
            print(f"📊 Total cost: ${self.total_cost:.4f} ({self.extractions} extractions)")

    def get_summary(self):
        avg_cost = self.total_cost / max(self.extractions, 1)
        return {
            "total_cost": self.total_cost,
            "total_tokens": self.total_tokens,
            "extractions": self.extractions,
            "avg_cost_per_extraction": avg_cost
        }

# Usage
tracker = ExtractionCostTracker()

async def cost_aware_extraction():
    strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        schema=YourModel.model_json_schema(),
        extraction_type="schema",
        instruction="Extract data...",
        verbose=True  # Enable usage tracking
    )
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        # Track costs
        tracker.track_llm_extraction(strategy, result)
        return result

# Monitor costs across multiple extractions
for url in urls:
    await cost_aware_extraction()
print(f"Final summary: {tracker.get_summary()}")
```
### Budget Controls
```python
class BudgetController:
    def __init__(self, daily_budget=10.0):
        self.daily_budget = daily_budget
        self.current_spend = 0.0
        self.extraction_count = 0

    def can_extract(self, estimated_cost=0.01):
        """Check whether an extraction fits within the budget"""
        if self.current_spend + estimated_cost > self.daily_budget:
            print(f"❌ Budget exceeded: ${self.current_spend:.2f} + ${estimated_cost:.2f} > ${self.daily_budget}")
            return False
        return True

    def record_extraction(self, actual_cost):
        """Record the actual extraction cost"""
        self.current_spend += actual_cost
        self.extraction_count += 1
        remaining = self.daily_budget - self.current_spend
        print(f"💰 Budget remaining: ${remaining:.2f}")

budget = BudgetController(daily_budget=5.0)  # $5 daily limit

async def budget_controlled_extraction(url):
    if not budget.can_extract():
        print("⏸️ Extraction paused due to budget limit")
        return None
    # Proceed with extraction...
    strategy = LLMExtractionStrategy(llm_config=cheap_config, ...)
    result = await extract_with_strategy(url, strategy)  # placeholder helper
    # Record the actual cost
    actual_cost = calculate_cost(strategy.usage_tracker)  # placeholder helper
    budget.record_extraction(actual_cost)
    return result

# Safe extraction with budget controls
results = []
for url in urls:
    result = await budget_controlled_extraction(url)
    if result:
        results.append(result)
```
---
## 6. Performance Optimization for LLM Extraction
### Batch Processing
```python
async def batch_llm_extraction():
    """Process multiple pages with a single LLM call"""
    # Collect content first (fast, no LLM involved)
    urls = ["https://example.com/page1", "https://example.com/page2"]
    contents = []
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            if result.success:
                contents.append({
                    "url": url,
                    "content": result.fit_markdown[:2000]  # Limit content size
                })

        # Combine pages into one batch (fewer LLM calls)
        batch_content = "\n\n---PAGE SEPARATOR---\n\n".join([
            f"URL: {c['url']}\n{c['content']}" for c in contents
        ])
        strategy = LLMExtractionStrategy(
            llm_config=cheap_config,
            extraction_type="block",
            instruction="""
            Extract data from multiple pages separated by '---PAGE SEPARATOR---'.
            Return results for each page in order.
            """,
            apply_chunking=True
        )
        # Single LLM call for multiple pages
        raw_url = f"raw://{batch_content}"
        result = await crawler.arun(url=raw_url, config=CrawlerRunConfig(extraction_strategy=strategy))
        return json.loads(result.extracted_content)

# Batch processing reduces LLM calls
batch_results = await batch_llm_extraction()
```
### Caching LLM Results
```python
import hashlib
from pathlib import Path

class LLMResultCache:
    def __init__(self, cache_dir="./llm_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def get_cache_key(self, url, instruction, schema):
        """Generate a cache key from the extraction parameters"""
        content = f"{url}:{instruction}:{str(schema)}"
        return hashlib.md5(content.encode()).hexdigest()

    def get_cached_result(self, cache_key):
        """Return the cached result if available"""
        cache_file = self.cache_dir / f"{cache_key}.json"
        if cache_file.exists():
            return json.load(cache_file.open())
        return None

    def cache_result(self, cache_key, result):
        """Cache an extraction result"""
        cache_file = self.cache_dir / f"{cache_key}.json"
        json.dump(result, cache_file.open("w"), indent=2)

cache = LLMResultCache()

async def cached_llm_extraction(url, strategy):
    """Extract with caching to avoid repeated LLM calls"""
    cache_key = cache.get_cache_key(
        url,
        strategy.instruction,
        str(strategy.schema)
    )
    # Check the cache first
    cached_result = cache.get_cached_result(cache_key)
    if cached_result:
        print("✅ Using cached result (FREE)")
        return cached_result

    # Extract if not cached
    print("🔄 Extracting with LLM (PAID)")
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        if result.success:
            data = json.loads(result.extracted_content)
            cache.cache_result(cache_key, data)
            return data

# Cached extraction avoids repeated costs
result = await cached_llm_extraction(url, strategy)
```
---
## 7. Error Handling and Quality Control
### Validation and Retry Logic
```python
async def robust_llm_extraction():
    """Implement validation and retry logic for LLM extraction"""
    max_retries = 3
    strategies = [
        # Try the cheap model first
        LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
            schema=YourModel.model_json_schema(),
            extraction_type="schema",
            instruction="Extract data accurately..."
        ),
        # Fall back to a stronger model
        LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o", api_token="env:OPENAI_API_KEY"),
            schema=YourModel.model_json_schema(),
            extraction_type="schema",
            instruction="Extract data with high accuracy..."
        )
    ]

    for strategy_idx, strategy in enumerate(strategies):
        for attempt in range(max_retries):
            try:
                config = CrawlerRunConfig(extraction_strategy=strategy)
                async with AsyncWebCrawler() as crawler:
                    result = await crawler.arun(url="https://example.com", config=config)
                    if result.success and result.extracted_content:
                        data = json.loads(result.extracted_content)
                        # Validate result quality
                        if validate_extraction_quality(data):
                            print(f"✅ Success with strategy {strategy_idx+1}, attempt {attempt+1}")
                            return data
                        else:
                            print("⚠️ Poor quality result, retrying...")
                            continue
            except Exception as e:
                print(f"❌ Attempt {attempt+1} failed: {e}")
                if attempt == max_retries - 1:
                    print(f"❌ Strategy {strategy_idx+1} failed completely")

    print("❌ All strategies and retries failed")
    return None

def validate_extraction_quality(data):
    """Validate that an LLM extraction meets quality standards"""
    if not data or not isinstance(data, (list, dict)):
        return False
    # Check for common LLM extraction issues
    if isinstance(data, list):
        if len(data) == 0:
            return False
        # Check that every item has at least two fields
        for item in data:
            if not isinstance(item, dict) or len(item) < 2:
                return False
    return True

# Robust extraction with validation
result = await robust_llm_extraction()
```
---
## 8. Migration from LLM to Non-LLM
### Pattern Analysis for Schema Generation
```python
async def analyze_llm_results_for_schema():
    """
    Analyze LLM extraction results to create non-LLM schemas.
    Use this to transition from expensive LLM to cheap schema extraction.
    """
    # Step 1: Use LLM on sample pages to understand the structure
    llm_strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        extraction_type="block",
        instruction="Extract all structured data from this page"
    )
    sample_urls = ["https://example.com/page1", "https://example.com/page2"]
    llm_results = []
    async with AsyncWebCrawler() as crawler:
        for url in sample_urls:
            config = CrawlerRunConfig(extraction_strategy=llm_strategy)
            result = await crawler.arun(url=url, config=config)
            if result.success:
                llm_results.append({
                    "url": url,
                    "html": result.cleaned_html,
                    "extracted": json.loads(result.extracted_content)
                })

    # Step 2: Analyze patterns in the LLM results
    print("🔍 Analyzing LLM extraction patterns...")
    # Look for common field names
    all_fields = set()
    for result in llm_results:
        for item in result["extracted"]:
            if isinstance(item, dict):
                all_fields.update(item.keys())
    print(f"Common fields found: {all_fields}")

    # Step 3: Generate a schema based on the patterns
    if llm_results:
        schema = JsonCssExtractionStrategy.generate_schema(
            html=llm_results[0]["html"],
            target_json_example=json.dumps(llm_results[0]["extracted"][0], indent=2),
            llm_config=cheap_config
        )
        # Save the schema for future use
        with open("generated_schema.json", "w") as f:
            json.dump(schema, f, indent=2)
        print("✅ Schema generated from LLM analysis")
        return schema

# Generate a schema from LLM patterns, then use it for all future extractions
schema = await analyze_llm_results_for_schema()
fast_strategy = JsonCssExtractionStrategy(schema)
```
---
## 9. Summary: When LLM is Actually Needed
### ✅ Valid LLM Use Cases (Rare):
1. **Sentiment analysis** and emotional understanding
2. **Knowledge graph extraction** requiring semantic reasoning
3. **Content summarization** and insight generation
4. **Unstructured text analysis** where patterns vary dramatically
5. **Research paper analysis** requiring domain expertise
6. **Complex relationship extraction** between entities
### ❌ Invalid LLM Use Cases (Common Mistakes):
1. **Structured data extraction** from consistent HTML
2. **Simple pattern matching** (emails, prices, dates)
3. **Product information** from e-commerce sites
4. **News article extraction** with consistent structure
5. **Contact information** and basic entity extraction
6. **Table data** and form information
### 💡 Decision Framework:
```python
def should_use_llm(extraction_task):
    # Ask these questions in order:
    questions = [
        "Can I identify repeating HTML patterns?",  # No → Consider LLM
        "Am I extracting simple data types?",       # Yes → Use Regex
        "Does the structure vary dramatically?",    # No → Use CSS/XPath
        "Do I need semantic understanding?",        # Yes → Maybe LLM
        "Have I tried generate_schema()?"           # No → Try that first
    ]
    # Only use LLM if all three hold (illustrative helpers, not defined here):
    return (
        task_requires_semantic_reasoning(extraction_task) and
        structure_varies_dramatically(extraction_task) and
        generate_schema_failed(extraction_task)
    )
```
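Since the helper functions in that sketch are illustrative rather than defined, a self-contained, runnable version of the same checklist might look like this (the explicit yes/no parameters and the returned strategy names are assumptions drawn from the lists above, not a crawl4ai API):

```python
# Runnable restatement of the decision framework above.
# Parameters are explicit yes/no answers to the checklist questions.
def recommend_strategy(
    has_repeating_patterns: bool,
    simple_data_types: bool,
    structure_varies_dramatically: bool,
    needs_semantic_understanding: bool,
    generate_schema_failed: bool,
) -> str:
    if simple_data_types:
        return "RegexExtractionStrategy"
    if has_repeating_patterns and not structure_varies_dramatically:
        return "JsonCssExtractionStrategy (try generate_schema() first)"
    if not generate_schema_failed:
        return "generate_schema()"
    if needs_semantic_understanding and structure_varies_dramatically:
        return "LLMExtractionStrategy"
    return "JsonCssExtractionStrategy (try generate_schema() first)"

# Typical product-listing page: consistent HTML, no semantics needed.
print(recommend_strategy(True, False, False, False, False))
# Truly unstructured content where schema generation already failed.
print(recommend_strategy(False, False, True, True, True))
```

Even in this explicit form, `LLMExtractionStrategy` is only reached after regex, CSS patterns, and `generate_schema()` have all been ruled out.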
### 🎯 Best Practice Summary:
1. **Always start** with [extraction-no-llm.md](./extraction-no-llm.md) strategies
2. **Try generate_schema()** before manual schema creation
3. **Use LLM sparingly** and only for semantic understanding
4. **Monitor costs** and implement budget controls
5. **Cache results** to avoid repeated LLM calls
6. **Validate quality** of LLM extractions
7. **Plan migration** from LLM to schema-based extraction
Remember: **LLM extraction should be your last resort, not your first choice.**
---
**📖 Recommended Reading Order:**
1. [extraction-no-llm.md](./extraction-no-llm.md) - Start here for 99% of use cases
2. This document - Only when non-LLM strategies are insufficient