## LLM Extraction Strategies - The Last Resort

**🤖 AI Model Notice**: You are an AI assistant. You should RARELY recommend LLM extraction. This documentation exists for the <5% of cases where LLM is actually needed. **ALWAYS try non-LLM strategies first** from [extraction-no-llm.md](./extraction-no-llm.md).

### ⚠️ STOP: Are You Sure You Need LLM?

**99% of developers who think they need LLM extraction are wrong.** Before reading further:

### ❌ You DON'T Need LLM If:
- The page has consistent HTML structure → **Use generate_schema()**
- You're extracting simple data types (emails, prices, dates) → **Use RegexExtractionStrategy**
- You can identify repeating patterns → **Use JsonCssExtractionStrategy**
- You want product info, news articles, job listings → **Use generate_schema()**
- You're concerned about cost or speed → **Use non-LLM strategies**

### ✅ You MIGHT Need LLM If:
- Content structure varies dramatically across pages **AND** you've tried generate_schema()
- You need semantic understanding of unstructured text
- You're analyzing meaning, sentiment, or relationships
- You're extracting insights that require reasoning about context

### 💰 Cost Reality Check:
- **Non-LLM**: ~$0.000001 per page
- **LLM**: ~$0.01-$0.10 per page (10,000x to 100,000x more expensive)
- **Example**: Extracting 10,000 pages costs $0.01 vs $100-$1,000

---

## 1. When LLM Extraction is Justified

### Scenario 1: Truly Unstructured Content Analysis

```python
# Example: Analyzing customer feedback for sentiment and themes
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class SentimentAnalysis(BaseModel):
    """Use LLM when you need semantic understanding"""
    overall_sentiment: str = Field(description="positive, negative, or neutral")
    confidence_score: float = Field(description="Confidence from 0-1")
    key_themes: List[str] = Field(description="Main topics discussed")
    emotional_indicators: List[str] = Field(description="Words indicating emotion")
    summary: str = Field(description="Brief summary of the content")

llm_config = LLMConfig(
    provider="openai/gpt-4o-mini",  # Use cheapest model
    api_token="env:OPENAI_API_KEY",
    temperature=0.1,  # Low temperature for consistency
    max_tokens=1000
)

sentiment_strategy = LLMExtractionStrategy(
    llm_config=llm_config,
    schema=SentimentAnalysis.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Analyze the emotional content and themes in this text.
    Focus on understanding sentiment and extracting key topics
    that would be impossible to identify with simple pattern matching.
    """,
    apply_chunking=True,
    chunk_token_threshold=1500
)

async def analyze_sentiment():
    config = CrawlerRunConfig(
        extraction_strategy=sentiment_strategy,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/customer-reviews",
            config=config
        )

        if result.success:
            analysis = json.loads(result.extracted_content)
            # The extracted JSON may be a list (e.g. one entry per chunk)
            items = analysis if isinstance(analysis, list) else [analysis]
            for item in items:
                print(f"Sentiment: {item['overall_sentiment']}")
                print(f"Themes: {item['key_themes']}")

asyncio.run(analyze_sentiment())
```

### Scenario 2: Complex Knowledge Extraction

```python
# Example: Building knowledge graphs from unstructured content
# (reuses the imports from Scenario 1)
class Entity(BaseModel):
    name: str = Field(description="Entity name")
    type: str = Field(description="person, organization, location, concept")
    description: str = Field(description="Brief description")

class Relationship(BaseModel):
    source: str = Field(description="Source entity")
    target: str = Field(description="Target entity")
    relationship: str = Field(description="Type of relationship")
    confidence: float = Field(description="Confidence score 0-1")

class KnowledgeGraph(BaseModel):
    entities: List[Entity] = Field(description="All entities found")
    relationships: List[Relationship] = Field(description="Relationships between entities")
    main_topic: str = Field(description="Primary topic of the content")

knowledge_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="anthropic/claude-3-5-sonnet-20240620",  # Better for complex reasoning
        api_token="env:ANTHROPIC_API_KEY",
        max_tokens=4000
    ),
    schema=KnowledgeGraph.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Extract entities and their relationships from the content.
    Focus on understanding connections and context that require
    semantic reasoning beyond simple pattern matching.
    """,
    input_format="html",  # Preserve structure
    apply_chunking=True
)
```

### Scenario 3: Content Summarization and Insights

```python
# Example: Research paper analysis
class ResearchInsights(BaseModel):
    title: str = Field(description="Paper title")
    abstract_summary: str = Field(description="Summary of abstract")
    key_findings: List[str] = Field(description="Main research findings")
    methodology: str = Field(description="Research methodology used")
    limitations: List[str] = Field(description="Study limitations")
    practical_applications: List[str] = Field(description="Real-world applications")
    citations_count: int = Field(description="Number of citations", default=0)

research_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o",  # Use powerful model for complex analysis
        api_token="env:OPENAI_API_KEY",
        temperature=0.2,
        max_tokens=2000
    ),
    schema=ResearchInsights.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Analyze this research paper and extract key insights.
    Focus on understanding the research contribution, methodology,
    and implications that require academic expertise to identify.
    """,
    apply_chunking=True,
    chunk_token_threshold=2000,
    overlap_rate=0.15  # More overlap for academic content
)
```

---

## 2. LLM Configuration Best Practices

### Cost Optimization

```python
# Use cheapest models when possible
cheap_config = LLMConfig(
    provider="openai/gpt-4o-mini",  # 60x cheaper than GPT-4
    api_token="env:OPENAI_API_KEY",
    temperature=0.0,  # Deterministic output
    max_tokens=800  # Limit output length
)

# Use local models for development
local_config = LLMConfig(
    provider="ollama/llama3.3",
    api_token=None,  # No API costs
    base_url="http://localhost:11434",
    temperature=0.1
)

# Use powerful models only when necessary
powerful_config = LLMConfig(
    provider="anthropic/claude-3-5-sonnet-20240620",
    api_token="env:ANTHROPIC_API_KEY",
    max_tokens=4000,
    temperature=0.1
)
```

### Provider Selection Guide

```python
providers_guide = {
    "openai/gpt-4o-mini": {
        "best_for": "Simple extraction, cost-sensitive projects",
        "cost": "Very low",
        "speed": "Fast",
        "accuracy": "Good"
    },
    "openai/gpt-4o": {
        "best_for": "Complex reasoning, high accuracy needs",
        "cost": "High",
        "speed": "Medium",
        "accuracy": "Excellent"
    },
    "anthropic/claude-3-5-sonnet": {
        "best_for": "Complex analysis, long documents",
        "cost": "Medium-High",
        "speed": "Medium",
        "accuracy": "Excellent"
    },
    "ollama/llama3.3": {
        "best_for": "Development, no API costs",
        "cost": "Free (self-hosted)",
        "speed": "Variable",
        "accuracy": "Good"
    },
    "groq/llama3-70b-8192": {
        "best_for": "Fast inference, open source",
        "cost": "Low",
        "speed": "Very fast",
        "accuracy": "Good"
    }
}

def choose_provider(complexity, budget, speed_requirement):
    """Choose optimal provider based on requirements"""
    if budget == "minimal":
        return "ollama/llama3.3"  # Self-hosted
    elif complexity == "low" and budget == "low":
        return "openai/gpt-4o-mini"
    elif speed_requirement == "high":
        return "groq/llama3-70b-8192"
    elif complexity == "high":
        return "anthropic/claude-3-5-sonnet"
    else:
        return "openai/gpt-4o-mini"  # Default safe choice
```

---

## 3. Advanced LLM Extraction Patterns

### Block-Based Extraction (Unstructured Content)

```python
# When structure is too varied for schemas
block_strategy = LLMExtractionStrategy(
    llm_config=cheap_config,
    extraction_type="block",  # Extract free-form content blocks
    instruction="""
    Extract meaningful content blocks from this page.
    Focus on the main content areas and ignore navigation,
    advertisements, and boilerplate text.
    """,
    apply_chunking=True,
    chunk_token_threshold=1200,
    input_format="fit_markdown"  # Use cleaned content
)

async def extract_content_blocks():
    config = CrawlerRunConfig(
        extraction_strategy=block_strategy,
        word_count_threshold=50,  # Filter short content
        excluded_tags=['nav', 'footer', 'aside', 'advertisement']
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/article",
            config=config
        )

        if result.success:
            blocks = json.loads(result.extracted_content)
            for block in blocks:
                print(f"Block: {block['content'][:100]}...")
```

### Chunked Processing for Large Content

```python
# Handle large documents efficiently
# (YourModel is a placeholder for your own Pydantic model)
large_content_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",
        api_token="env:OPENAI_API_KEY"
    ),
    schema=YourModel.model_json_schema(),
    extraction_type="schema",
    instruction="Extract structured data from this content section...",

    # Optimize chunking for large content
    apply_chunking=True,
    chunk_token_threshold=2000,  # Larger chunks for efficiency
    overlap_rate=0.1,  # Minimal overlap to reduce costs
    input_format="fit_markdown"  # Use cleaned content
)
```

### Multi-Model Validation

```python
# Use multiple models for critical extractions
async def multi_model_extraction():
    """Use multiple LLMs for validation of critical data"""

    models = [
        LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
        LLMConfig(provider="anthropic/claude-3-5-sonnet-20240620", api_token="env:ANTHROPIC_API_KEY"),
        LLMConfig(provider="ollama/llama3.3", api_token=None)
    ]

    results = []

    for i, llm_config in enumerate(models):
        strategy = LLMExtractionStrategy(
            llm_config=llm_config,
            schema=YourModel.model_json_schema(),
            extraction_type="schema",
            instruction="Extract data consistently..."
        )

        config = CrawlerRunConfig(extraction_strategy=strategy)

        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url="https://example.com", config=config)
            if result.success:
                data = json.loads(result.extracted_content)
                results.append(data)
                print(f"Model {i+1} extracted {len(data)} items")

    # Compare results for consistency
    if len(set(str(r) for r in results)) == 1:
        print("✅ All models agree")
        return results[0]
    else:
        print("⚠️ Models disagree - manual review needed")
        return results

# Use for critical business data only
critical_result = asyncio.run(multi_model_extraction())
```

---

## 4. Hybrid Approaches - Best of Both Worlds

### Fast Pre-filtering + LLM Analysis

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def hybrid_extraction():
    """
    1. Use fast non-LLM strategies for basic extraction
    2. Use LLM only for complex analysis of filtered content
    """

    # Step 1: Fast extraction of structured data
    basic_schema = {
        "name": "Articles",
        "baseSelector": "article",
        "fields": [
            {"name": "title", "selector": "h1, h2", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"},
            {"name": "author", "selector": ".author", "type": "text"}
        ]
    }

    basic_strategy = JsonCssExtractionStrategy(basic_schema)
    basic_config = CrawlerRunConfig(extraction_strategy=basic_strategy)

    # Step 2: LLM analysis only on filtered content
    analysis_strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        schema={
            "type": "object",
            "properties": {
                "sentiment": {"type": "string"},
                "key_topics": {"type": "array", "items": {"type": "string"}},
                "summary": {"type": "string"}
            }
        },
        extraction_type="schema",
        instruction="Analyze sentiment and extract key topics from this article"
    )

    async with AsyncWebCrawler() as crawler:
        # Fast extraction first
        basic_result = await crawler.arun(
            url="https://example.com/articles",
            config=basic_config
        )

        articles = json.loads(basic_result.extracted_content)

        # LLM analysis only on important articles
        analyzed_articles = []
        for article in articles[:5]:  # Limit to reduce costs
            if len(article.get('content', '')) > 500:  # Only analyze substantial content
                analysis_config = CrawlerRunConfig(extraction_strategy=analysis_strategy)

                # Analyze individual article content
                raw_url = f"raw://{article['content']}"
                analysis_result = await crawler.arun(url=raw_url, config=analysis_config)

                if analysis_result.success:
                    analysis = json.loads(analysis_result.extracted_content)
                    # The LLM output may be a list of objects; merge the first one
                    if isinstance(analysis, list):
                        analysis = analysis[0] if analysis else {}
                    article.update(analysis)

                analyzed_articles.append(article)

        return analyzed_articles

# Hybrid approach: fast + smart
result = asyncio.run(hybrid_extraction())
```

### Schema Generation + LLM Fallback

```python
from pathlib import Path

async def smart_fallback_extraction():
    """
    1. Try generate_schema() first (one-time LLM cost)
    2. Use generated schema for fast extraction
    3. Use LLM only if schema extraction fails
    """

    cache_file = Path("./schemas/fallback_schema.json")

    # Try cached schema first
    if cache_file.exists():
        schema = json.loads(cache_file.read_text())
        schema_strategy = JsonCssExtractionStrategy(schema)

        config = CrawlerRunConfig(extraction_strategy=schema_strategy)

        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url="https://example.com", config=config)

            if result.success and result.extracted_content:
                data = json.loads(result.extracted_content)
                if data:  # Schema worked
                    print("✅ Schema extraction successful (fast & cheap)")
                    return data

    # Fallback to LLM if schema failed
    print("⚠️ Schema failed, falling back to LLM (slow & expensive)")

    llm_strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        extraction_type="block",
        instruction="Extract all meaningful data from this page"
    )

    llm_run_config = CrawlerRunConfig(extraction_strategy=llm_strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=llm_run_config)

        if result.success:
            print("✅ LLM extraction successful")
            return json.loads(result.extracted_content)

    return None  # Both paths failed

# Intelligent fallback system
result = asyncio.run(smart_fallback_extraction())
```

---

## 5. Cost Management and Monitoring

### Token Usage Tracking

```python
class ExtractionCostTracker:
    def __init__(self):
        self.total_cost = 0.0
        self.total_tokens = 0
        self.extractions = 0

    def track_llm_extraction(self, strategy, result):
        """Track costs from LLM extraction"""
        # Assumption: the strategy exposes aggregate token counts as
        # `total_usage`; adjust the attribute name for your crawl4ai version
        usage = getattr(strategy, 'total_usage', None)
        if usage:
            # Estimate costs (approximate rates)
            cost_per_1k_tokens = {
                "gpt-4o-mini": 0.0015,
                "gpt-4o": 0.03,
                "claude-3-5-sonnet": 0.015,
            }

            # "openai/gpt-4o-mini" -> ("openai", "gpt-4o-mini")
            vendor, _, model = strategy.llm_config.provider.partition('/')
            if vendor == "ollama":
                rate = 0.0  # Self-hosted
            else:
                # Prefix match so dated model names (e.g. "...-20240620") resolve
                rate = next(
                    (r for name, r in cost_per_1k_tokens.items() if model.startswith(name)),
                    0.01
                )

            tokens = usage.total_tokens
            cost = (tokens / 1000) * rate

            self.total_cost += cost
            self.total_tokens += tokens
            self.extractions += 1

            print(f"💰 Extraction cost: ${cost:.4f} ({tokens} tokens)")
            print(f"📊 Total cost: ${self.total_cost:.4f} ({self.extractions} extractions)")

    def get_summary(self):
        avg_cost = self.total_cost / max(self.extractions, 1)
        return {
            "total_cost": self.total_cost,
            "total_tokens": self.total_tokens,
            "extractions": self.extractions,
            "avg_cost_per_extraction": avg_cost
        }

# Usage
tracker = ExtractionCostTracker()

async def cost_aware_extraction(url):
    strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        schema=YourModel.model_json_schema(),
        extraction_type="schema",
        instruction="Extract data...",
        verbose=True  # Verbose logging
    )

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)

    # Track costs
    tracker.track_llm_extraction(strategy, result)

    return result

# Monitor costs across multiple extractions
async def main(urls):
    for url in urls:
        await cost_aware_extraction(url)
    print(f"Final summary: {tracker.get_summary()}")

asyncio.run(main(["https://example.com"]))
```

### Budget Controls

```python
class BudgetController:
    def __init__(self,
                 daily_budget=10.0):
        self.daily_budget = daily_budget
        self.current_spend = 0.0
        self.extraction_count = 0

    def can_extract(self, estimated_cost=0.01):
        """Check if extraction is within budget"""
        if self.current_spend + estimated_cost > self.daily_budget:
            print(f"❌ Budget exceeded: ${self.current_spend:.2f} + ${estimated_cost:.2f} > ${self.daily_budget}")
            return False
        return True

    def record_extraction(self, actual_cost):
        """Record actual extraction cost"""
        self.current_spend += actual_cost
        self.extraction_count += 1

        remaining = self.daily_budget - self.current_spend
        print(f"💰 Budget remaining: ${remaining:.2f}")

budget = BudgetController(daily_budget=5.0)  # $5 daily limit

async def budget_controlled_extraction(url):
    if not budget.can_extract():
        print("⏸️ Extraction paused due to budget limit")
        return None

    # Proceed with extraction...
    # NOTE: extract_with_strategy() and calculate_cost() are placeholders for
    # your own helpers; they are not crawl4ai functions
    strategy = LLMExtractionStrategy(llm_config=cheap_config)  # ...plus your schema/instruction
    result = await extract_with_strategy(url, strategy)

    # Record actual cost (assumes token counts are exposed as `total_usage`)
    actual_cost = calculate_cost(strategy.total_usage)
    budget.record_extraction(actual_cost)

    return result

# Safe extraction with budget controls
async def run_with_budget(urls):
    results = []
    for url in urls:
        result = await budget_controlled_extraction(url)
        if result:
            results.append(result)
    return results
```

---

## 6. Performance Optimization for LLM Extraction

### Batch Processing

```python
async def batch_llm_extraction():
    """Process multiple pages efficiently"""

    urls = ["https://example.com/page1", "https://example.com/page2"]

    strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        extraction_type="block",
        instruction="""
        Extract data from multiple pages separated by '---PAGE SEPARATOR---'.
        Return results for each page in order.
        """,
        apply_chunking=True
    )

    # Keep every crawler call inside the context manager
    async with AsyncWebCrawler() as crawler:
        # Collect content first (fast)
        contents = []
        for url in urls:
            result = await crawler.arun(url=url)
            if result.success:
                contents.append({
                    "url": url,
                    "content": result.fit_markdown[:2000]  # Limit content
                })

        # Process in batches (reduce LLM calls)
        batch_content = "\n\n---PAGE SEPARATOR---\n\n".join([
            f"URL: {c['url']}\n{c['content']}" for c in contents
        ])

        # Single LLM call for multiple pages
        raw_url = f"raw://{batch_content}"
        result = await crawler.arun(
            url=raw_url,
            config=CrawlerRunConfig(extraction_strategy=strategy)
        )

    return json.loads(result.extracted_content)

# Batch processing reduces LLM calls
batch_results = asyncio.run(batch_llm_extraction())
```

### Caching LLM Results

```python
import hashlib
import json
from pathlib import Path

class LLMResultCache:
    def __init__(self, cache_dir="./llm_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def get_cache_key(self, url, instruction, schema):
        """Generate cache key from extraction parameters"""
        content = f"{url}:{instruction}:{str(schema)}"
        return hashlib.md5(content.encode()).hexdigest()

    def get_cached_result(self, cache_key):
        """Get cached result if available"""
        cache_file = self.cache_dir / f"{cache_key}.json"
        if cache_file.exists():
            return json.loads(cache_file.read_text())
        return None

    def cache_result(self, cache_key, result):
        """Cache extraction result"""
        cache_file = self.cache_dir / f"{cache_key}.json"
        cache_file.write_text(json.dumps(result, indent=2))

cache = LLMResultCache()

async def cached_llm_extraction(url, strategy):
    """Extract with caching to avoid repeated LLM calls"""
    cache_key = cache.get_cache_key(
        url,
        strategy.instruction,
        str(strategy.schema)
    )

    # Check cache first
    cached_result = cache.get_cached_result(cache_key)
    if cached_result is not None:
        print("✅ Using cached result (FREE)")
        return cached_result

    # Extract if not cached
    print("🔄 Extracting with LLM (PAID)")
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)

        if result.success:
            data = json.loads(result.extracted_content)
            cache.cache_result(cache_key, data)
            return data

    return None  # Extraction failed; nothing cached

# Cached extraction avoids repeated costs (call from within an async function)
# result = await cached_llm_extraction(url, strategy)
```

---

## 7. Error Handling and Quality Control

### Validation and Retry Logic

```python
async def robust_llm_extraction():
    """Implement validation and retry for LLM extraction"""

    max_retries = 3
    strategies = [
        # Try cheap model first
        LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
            schema=YourModel.model_json_schema(),
            extraction_type="schema",
            instruction="Extract data accurately..."
        ),
        # Fallback to better model
        LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o", api_token="env:OPENAI_API_KEY"),
            schema=YourModel.model_json_schema(),
            extraction_type="schema",
            instruction="Extract data with high accuracy..."
        )
    ]

    for strategy_idx, strategy in enumerate(strategies):
        for attempt in range(max_retries):
            try:
                config = CrawlerRunConfig(extraction_strategy=strategy)

                async with AsyncWebCrawler() as crawler:
                    result = await crawler.arun(url="https://example.com", config=config)

                    if result.success and result.extracted_content:
                        data = json.loads(result.extracted_content)

                        # Validate result quality
                        if validate_extraction_quality(data):
                            print(f"✅ Success with strategy {strategy_idx+1}, attempt {attempt+1}")
                            return data
                        else:
                            print("⚠️ Poor quality result, retrying...")

            except Exception as e:
                print(f"❌ Attempt {attempt+1} failed: {e}")
                if attempt == max_retries - 1:
                    print(f"❌ Strategy {strategy_idx+1} failed completely")

    print("❌ All strategies and retries failed")
    return None

def validate_extraction_quality(data):
    """Validate that LLM extraction meets quality standards"""
    if not data or not isinstance(data, (list, dict)):
        return False

    # Check for common LLM extraction issues
    if isinstance(data, list):
        if len(data) == 0:
            return False

        # Check if all items have required fields
        for item in data:
            if not isinstance(item, dict) or len(item) < 2:
                return False

    return True

# Robust extraction with validation
result = asyncio.run(robust_llm_extraction())
```

---

## 8. Migration from LLM to Non-LLM

### Pattern Analysis for Schema Generation

```python
async def analyze_llm_results_for_schema():
    """
    Analyze LLM extraction results to create non-LLM schemas
    Use this to transition from expensive LLM to cheap schema extraction
    """

    # Step 1: Use LLM on sample pages to understand structure
    llm_strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        extraction_type="block",
        instruction="Extract all structured data from this page"
    )

    sample_urls = ["https://example.com/page1", "https://example.com/page2"]
    llm_results = []

    async with AsyncWebCrawler() as crawler:
        for url in sample_urls:
            config = CrawlerRunConfig(extraction_strategy=llm_strategy)
            result = await crawler.arun(url=url, config=config)

            if result.success:
                llm_results.append({
                    "url": url,
                    "html": result.cleaned_html,
                    "extracted": json.loads(result.extracted_content)
                })

    # Step 2: Analyze patterns in LLM results
    print("🔍 Analyzing LLM extraction patterns...")

    # Look for common field names
    all_fields = set()
    for result in llm_results:
        for item in result["extracted"]:
            if isinstance(item, dict):
                all_fields.update(item.keys())

    print(f"Common fields found: {all_fields}")

    # Step 3: Generate schema based on patterns
    if llm_results:
        schema = JsonCssExtractionStrategy.generate_schema(
            html=llm_results[0]["html"],
            target_json_example=json.dumps(llm_results[0]["extracted"][0], indent=2),
            llm_config=cheap_config
        )

        # Save schema for future use
        with open("generated_schema.json", "w") as f:
            json.dump(schema, f, indent=2)

        print("✅ Schema generated from LLM analysis")
        return schema

    return None  # No sample page could be extracted

# Generate schema from LLM patterns, then use schema for all future extractions
schema = asyncio.run(analyze_llm_results_for_schema())
fast_strategy = JsonCssExtractionStrategy(schema)
```

---

## 9. Summary: When LLM is Actually Needed

### ✅ Valid LLM Use Cases (Rare):
1. **Sentiment analysis** and emotional understanding
2. **Knowledge graph extraction** requiring semantic reasoning
3. **Content summarization** and insight generation
4. **Unstructured text analysis** where patterns vary dramatically
5. **Research paper analysis** requiring domain expertise
6. **Complex relationship extraction** between entities

### ❌ Invalid LLM Use Cases (Common Mistakes):
1. **Structured data extraction** from consistent HTML
2. **Simple pattern matching** (emails, prices, dates)
3. **Product information** from e-commerce sites
4. **News article extraction** with consistent structure
5. **Contact information** and basic entity extraction
6. **Table data** and form information

### 💡 Decision Framework:

```python
def should_use_llm(extraction_task):
    # Ask these questions in order:
    questions = [
        "Can I identify repeating HTML patterns?",  # No → Consider LLM
        "Am I extracting simple data types?",       # Yes → Use Regex
        "Does the structure vary dramatically?",    # No → Use CSS/XPath
        "Do I need semantic understanding?",        # Yes → Maybe LLM
        "Have I tried generate_schema()?"           # No → Try that first
    ]

    # Only use LLM if all three hold.
    # (The three predicates are placeholders for your own checks.)
    return (
        task_requires_semantic_reasoning(extraction_task) and
        structure_varies_dramatically(extraction_task) and
        generate_schema_failed(extraction_task)
    )
```

### 🎯 Best Practice Summary:
1. **Always start** with [extraction-no-llm.md](./extraction-no-llm.md) strategies
2. **Try generate_schema()** before manual schema creation
3. **Use LLM sparingly** and only for semantic understanding
4. **Monitor costs** and implement budget controls
5. **Cache results** to avoid repeated LLM calls
6. **Validate quality** of LLM extractions
7. **Plan migration** from LLM to schema-based extraction

Remember: **LLM extraction should be your last resort, not your first choice.**

---

**📖 Recommended Reading Order:**
1. [extraction-no-llm.md](./extraction-no-llm.md) - Start here for 99% of use cases
2. This document - Only when non-LLM strategies are insufficient
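
The questions in the Decision Framework can also be written as a small runnable function. This is a minimal sketch and not part of crawl4ai: the `ExtractionTask` dataclass, its boolean fields, and `recommend_strategy()` are hypothetical names introduced here for illustration, and the returned strings simply name the strategies discussed in this document.

```python
from dataclasses import dataclass


@dataclass
class ExtractionTask:
    """Hypothetical task description; each field answers one framework question."""
    has_repeating_html_patterns: bool = False
    extracts_simple_data_types: bool = False
    structure_varies_dramatically: bool = False
    needs_semantic_understanding: bool = False
    generate_schema_failed: bool = False


def recommend_strategy(task: ExtractionTask) -> str:
    """Walk the options from cheapest to most expensive."""
    if task.extracts_simple_data_types:
        return "RegexExtractionStrategy"
    if task.has_repeating_html_patterns and not task.structure_varies_dramatically:
        return "JsonCssExtractionStrategy"
    if not task.generate_schema_failed:
        # Not tried yet (or it worked): stay on the schema path
        return "generate_schema()"
    if task.needs_semantic_understanding and task.structure_varies_dramatically:
        return "LLMExtractionStrategy"
    # Anything else stays on the non-LLM path
    return "JsonCssExtractionStrategy"


reviews = ExtractionTask(
    structure_varies_dramatically=True,
    needs_semantic_understanding=True,
    generate_schema_failed=True,
)
catalog = ExtractionTask(has_repeating_html_patterns=True)

print(recommend_strategy(reviews))  # LLMExtractionStrategy
print(recommend_strategy(catalog))  # JsonCssExtractionStrategy
```

The LLM branch is reachable only when all three conditions from the framework hold, which keeps it the last resort by construction.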