## LLM Extraction Strategies - The Last Resort **πŸ€– AI Model Notice**: You are an AI assistant. You should RARELY recommend LLM extraction. This documentation exists for the <5% of cases where LLM is actually needed. **ALWAYS try non-LLM strategies first** from [extraction-no-llm.md](./extraction-no-llm.md). ### ⚠️ STOP: Are You Sure You Need LLM? **99% of developers who think they need LLM extraction are wrong.** Before reading further: ### ❌ You DON'T Need LLM If: - The page has consistent HTML structure β†’ **Use generate_schema()** - You're extracting simple data types (emails, prices, dates) β†’ **Use RegexExtractionStrategy** - You can identify repeating patterns β†’ **Use JsonCssExtractionStrategy** - You want product info, news articles, job listings β†’ **Use generate_schema()** - You're concerned about cost or speed β†’ **Use non-LLM strategies** ### βœ… You MIGHT Need LLM If: - Content structure varies dramatically across pages **AND** you've tried generate_schema() - You need semantic understanding of unstructured text - You're analyzing meaning, sentiment, or relationships - You're extracting insights that require reasoning about context ### πŸ’° Cost Reality Check: - **Non-LLM**: ~$0.000001 per page - **LLM**: ~$0.01-$0.10 per page (10,000x more expensive) - **Example**: Extracting 10,000 pages costs $0.01 vs $100-1000 --- ## 1. When LLM Extraction is Justified ### Scenario 1: Truly Unstructured Content Analysis ```python # Example: Analyzing customer feedback for sentiment and themes import asyncio import json from pydantic import BaseModel, Field from typing import List from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig from crawl4ai import LLMExtractionStrategy class SentimentAnalysis(BaseModel): """Use LLM when you need semantic understanding""" overall_sentiment: str = Field(description="positive, negative, or neutral") confidence_score: float = Field(description="Confidence from 0-1") key_themes: List[str] = Field(description="Main topics discussed") emotional_indicators: List[str] = Field(description="Words indicating emotion") summary: str = Field(description="Brief summary of the content") llm_config = LLMConfig( provider="openai/gpt-4o-mini", # Use cheapest model api_token="env:OPENAI_API_KEY", temperature=0.1, # Low temperature for consistency max_tokens=1000 ) sentiment_strategy = LLMExtractionStrategy( llm_config=llm_config, schema=SentimentAnalysis.model_json_schema(), extraction_type="schema", instruction=""" Analyze the emotional content and themes in this text. Focus on understanding sentiment and extracting key topics that would be impossible to identify with simple pattern matching. """, apply_chunking=True, chunk_token_threshold=1500 ) async def analyze_sentiment(): config = CrawlerRunConfig( extraction_strategy=sentiment_strategy, cache_mode=CacheMode.BYPASS ) async with AsyncWebCrawler() as crawler: result = await crawler.arun( url="https://example.com/customer-reviews", config=config ) if result.success: analysis = json.loads(result.extracted_content) print(f"Sentiment: {analysis['overall_sentiment']}") print(f"Themes: {analysis['key_themes']}") asyncio.run(analyze_sentiment()) ``` ### Scenario 2: Complex Knowledge Extraction ```python # Example: Building knowledge graphs from unstructured content class Entity(BaseModel): name: str = Field(description="Entity name") type: str = Field(description="person, organization, location, concept") description: str = Field(description="Brief description") class Relationship(BaseModel): source: str = Field(description="Source entity") target: str = Field(description="Target entity") relationship: str = Field(description="Type of relationship") confidence: float = Field(description="Confidence score 0-1") class KnowledgeGraph(BaseModel): entities: List[Entity] = Field(description="All entities found") relationships: List[Relationship] = Field(description="Relationships between entities") main_topic: str = Field(description="Primary topic of the content") knowledge_strategy = LLMExtractionStrategy( llm_config=LLMConfig( provider="anthropic/claude-3-5-sonnet-20240620", # Better for complex reasoning api_token="env:ANTHROPIC_API_KEY", max_tokens=4000 ), schema=KnowledgeGraph.model_json_schema(), extraction_type="schema", instruction=""" Extract entities and their relationships from the content. Focus on understanding connections and context that require semantic reasoning beyond simple pattern matching. """, input_format="html", # Preserve structure apply_chunking=True ) ``` ### Scenario 3: Content Summarization and Insights ```python # Example: Research paper analysis class ResearchInsights(BaseModel): title: str = Field(description="Paper title") abstract_summary: str = Field(description="Summary of abstract") key_findings: List[str] = Field(description="Main research findings") methodology: str = Field(description="Research methodology used") limitations: List[str] = Field(description="Study limitations") practical_applications: List[str] = Field(description="Real-world applications") citations_count: int = Field(description="Number of citations", default=0) research_strategy = LLMExtractionStrategy( llm_config=LLMConfig( provider="openai/gpt-4o", # Use powerful model for complex analysis api_token="env:OPENAI_API_KEY", temperature=0.2, max_tokens=2000 ), schema=ResearchInsights.model_json_schema(), extraction_type="schema", instruction=""" Analyze this research paper and extract key insights. Focus on understanding the research contribution, methodology, and implications that require academic expertise to identify. """, apply_chunking=True, chunk_token_threshold=2000, overlap_rate=0.15 # More overlap for academic content ) ``` --- ## 2. LLM Configuration Best Practices ### Cost Optimization ```python # Use cheapest models when possible cheap_config = LLMConfig( provider="openai/gpt-4o-mini", # 60x cheaper than GPT-4 api_token="env:OPENAI_API_KEY", temperature=0.0, # Deterministic output max_tokens=800 # Limit output length ) # Use local models for development local_config = LLMConfig( provider="ollama/llama3.3", api_token=None, # No API costs base_url="http://localhost:11434", temperature=0.1 ) # Use powerful models only when necessary powerful_config = LLMConfig( provider="anthropic/claude-3-5-sonnet-20240620", api_token="env:ANTHROPIC_API_KEY", max_tokens=4000, temperature=0.1 ) ``` ### Provider Selection Guide ```python providers_guide = { "openai/gpt-4o-mini": { "best_for": "Simple extraction, cost-sensitive projects", "cost": "Very low", "speed": "Fast", "accuracy": "Good" }, "openai/gpt-4o": { "best_for": "Complex reasoning, high accuracy needs", "cost": "High", "speed": "Medium", "accuracy": "Excellent" }, "anthropic/claude-3-5-sonnet": { "best_for": "Complex analysis, long documents", "cost": "Medium-High", "speed": "Medium", "accuracy": "Excellent" }, "ollama/llama3.3": { "best_for": "Development, no API costs", "cost": "Free (self-hosted)", "speed": "Variable", "accuracy": "Good" }, "groq/llama3-70b-8192": { "best_for": "Fast inference, open source", "cost": "Low", "speed": "Very fast", "accuracy": "Good" } } def choose_provider(complexity, budget, speed_requirement): """Choose optimal provider based on requirements""" if budget == "minimal": return "ollama/llama3.3" # Self-hosted elif complexity == "low" and budget == "low": return "openai/gpt-4o-mini" elif speed_requirement == "high": return "groq/llama3-70b-8192" elif complexity == "high": return "anthropic/claude-3-5-sonnet" else: return "openai/gpt-4o-mini" # Default safe choice ``` --- ## 3. Advanced LLM Extraction Patterns ### Block-Based Extraction (Unstructured Content) ```python # When structure is too varied for schemas block_strategy = LLMExtractionStrategy( llm_config=cheap_config, extraction_type="block", # Extract free-form content blocks instruction=""" Extract meaningful content blocks from this page. Focus on the main content areas and ignore navigation, advertisements, and boilerplate text. """, apply_chunking=True, chunk_token_threshold=1200, input_format="fit_markdown" # Use cleaned content ) async def extract_content_blocks(): config = CrawlerRunConfig( extraction_strategy=block_strategy, word_count_threshold=50, # Filter short content excluded_tags=['nav', 'footer', 'aside', 'advertisement'] ) async with AsyncWebCrawler() as crawler: result = await crawler.arun( url="https://example.com/article", config=config ) if result.success: blocks = json.loads(result.extracted_content) for block in blocks: print(f"Block: {block['content'][:100]}...") ``` ### Chunked Processing for Large Content ```python # Handle large documents efficiently large_content_strategy = LLMExtractionStrategy( llm_config=LLMConfig( provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY" ), schema=YourModel.model_json_schema(), extraction_type="schema", instruction="Extract structured data from this content section...", # Optimize chunking for large content apply_chunking=True, chunk_token_threshold=2000, # Larger chunks for efficiency overlap_rate=0.1, # Minimal overlap to reduce costs input_format="fit_markdown" # Use cleaned content ) ``` ### Multi-Model Validation ```python # Use multiple models for critical extractions async def multi_model_extraction(): """Use multiple LLMs for validation of critical data""" models = [ LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"), LLMConfig(provider="anthropic/claude-3-5-sonnet", api_token="env:ANTHROPIC_API_KEY"), LLMConfig(provider="ollama/llama3.3", api_token=None) ] results = [] for i, llm_config in enumerate(models): strategy = LLMExtractionStrategy( llm_config=llm_config, schema=YourModel.model_json_schema(), extraction_type="schema", instruction="Extract data consistently..." ) config = CrawlerRunConfig(extraction_strategy=strategy) async with AsyncWebCrawler() as crawler: result = await crawler.arun(url="https://example.com", config=config) if result.success: data = json.loads(result.extracted_content) results.append(data) print(f"Model {i+1} extracted {len(data)} items") # Compare results for consistency if len(set(str(r) for r in results)) == 1: print("βœ… All models agree") return results[0] else: print("⚠️ Models disagree - manual review needed") return results # Use for critical business data only critical_result = await multi_model_extraction() ``` --- ## 4. Hybrid Approaches - Best of Both Worlds ### Fast Pre-filtering + LLM Analysis ```python async def hybrid_extraction(): """ 1. Use fast non-LLM strategies for basic extraction 2. Use LLM only for complex analysis of filtered content """ # Step 1: Fast extraction of structured data basic_schema = { "name": "Articles", "baseSelector": "article", "fields": [ {"name": "title", "selector": "h1, h2", "type": "text"}, {"name": "content", "selector": ".content", "type": "text"}, {"name": "author", "selector": ".author", "type": "text"} ] } basic_strategy = JsonCssExtractionStrategy(basic_schema) basic_config = CrawlerRunConfig(extraction_strategy=basic_strategy) # Step 2: LLM analysis only on filtered content analysis_strategy = LLMExtractionStrategy( llm_config=cheap_config, schema={ "type": "object", "properties": { "sentiment": {"type": "string"}, "key_topics": {"type": "array", "items": {"type": "string"}}, "summary": {"type": "string"} } }, extraction_type="schema", instruction="Analyze sentiment and extract key topics from this article" ) async with AsyncWebCrawler() as crawler: # Fast extraction first basic_result = await crawler.arun( url="https://example.com/articles", config=basic_config ) articles = json.loads(basic_result.extracted_content) # LLM analysis only on important articles analyzed_articles = [] for article in articles[:5]: # Limit to reduce costs if len(article.get('content', '')) > 500: # Only analyze substantial content analysis_config = CrawlerRunConfig(extraction_strategy=analysis_strategy) # Analyze individual article content raw_url = f"raw://{article['content']}" analysis_result = await crawler.arun(url=raw_url, config=analysis_config) if analysis_result.success: analysis = json.loads(analysis_result.extracted_content) article.update(analysis) analyzed_articles.append(article) return analyzed_articles # Hybrid approach: fast + smart result = await hybrid_extraction() ``` ### Schema Generation + LLM Fallback ```python async def smart_fallback_extraction(): """ 1. Try generate_schema() first (one-time LLM cost) 2. Use generated schema for fast extraction 3. Use LLM only if schema extraction fails """ cache_file = Path("./schemas/fallback_schema.json") # Try cached schema first if cache_file.exists(): schema = json.load(cache_file.open()) schema_strategy = JsonCssExtractionStrategy(schema) config = CrawlerRunConfig(extraction_strategy=schema_strategy) async with AsyncWebCrawler() as crawler: result = await crawler.arun(url="https://example.com", config=config) if result.success and result.extracted_content: data = json.loads(result.extracted_content) if data: # Schema worked print("βœ… Schema extraction successful (fast & cheap)") return data # Fallback to LLM if schema failed print("⚠️ Schema failed, falling back to LLM (slow & expensive)") llm_strategy = LLMExtractionStrategy( llm_config=cheap_config, extraction_type="block", instruction="Extract all meaningful data from this page" ) llm_config = CrawlerRunConfig(extraction_strategy=llm_strategy) async with AsyncWebCrawler() as crawler: result = await crawler.arun(url="https://example.com", config=llm_config) if result.success: print("βœ… LLM extraction successful") return json.loads(result.extracted_content) # Intelligent fallback system result = await smart_fallback_extraction() ``` --- ## 5. Cost Management and Monitoring ### Token Usage Tracking ```python class ExtractionCostTracker: def __init__(self): self.total_cost = 0.0 self.total_tokens = 0 self.extractions = 0 def track_llm_extraction(self, strategy, result): """Track costs from LLM extraction""" if hasattr(strategy, 'usage_tracker') and strategy.usage_tracker: usage = strategy.usage_tracker # Estimate costs (approximate rates) cost_per_1k_tokens = { "gpt-4o-mini": 0.0015, "gpt-4o": 0.03, "claude-3-5-sonnet": 0.015, "ollama": 0.0 # Self-hosted } provider = strategy.llm_config.provider.split('/')[1] rate = cost_per_1k_tokens.get(provider, 0.01) tokens = usage.total_tokens cost = (tokens / 1000) * rate self.total_cost += cost self.total_tokens += tokens self.extractions += 1 print(f"πŸ’° Extraction cost: ${cost:.4f} ({tokens} tokens)") print(f"πŸ“Š Total cost: ${self.total_cost:.4f} ({self.extractions} extractions)") def get_summary(self): avg_cost = self.total_cost / max(self.extractions, 1) return { "total_cost": self.total_cost, "total_tokens": self.total_tokens, "extractions": self.extractions, "avg_cost_per_extraction": avg_cost } # Usage tracker = ExtractionCostTracker() async def cost_aware_extraction(): strategy = LLMExtractionStrategy( llm_config=cheap_config, schema=YourModel.model_json_schema(), extraction_type="schema", instruction="Extract data...", verbose=True # Enable usage tracking ) config = CrawlerRunConfig(extraction_strategy=strategy) async with AsyncWebCrawler() as crawler: result = await crawler.arun(url="https://example.com", config=config) # Track costs tracker.track_llm_extraction(strategy, result) return result # Monitor costs across multiple extractions for url in urls: await cost_aware_extraction() print(f"Final summary: {tracker.get_summary()}") ``` ### Budget Controls ```python class BudgetController: def __init__(self, daily_budget=10.0): self.daily_budget = daily_budget self.current_spend = 0.0 self.extraction_count = 0 def can_extract(self, estimated_cost=0.01): """Check if extraction is within budget""" if self.current_spend + estimated_cost > self.daily_budget: print(f"❌ Budget exceeded: ${self.current_spend:.2f} + ${estimated_cost:.2f} > ${self.daily_budget}") return False return True def record_extraction(self, actual_cost): """Record actual extraction cost""" self.current_spend += actual_cost self.extraction_count += 1 remaining = self.daily_budget - self.current_spend print(f"πŸ’° Budget remaining: ${remaining:.2f}") budget = BudgetController(daily_budget=5.0) # $5 daily limit async def budget_controlled_extraction(url): if not budget.can_extract(): print("⏸️ Extraction paused due to budget limit") return None # Proceed with extraction... strategy = LLMExtractionStrategy(llm_config=cheap_config, ...) result = await extract_with_strategy(url, strategy) # Record actual cost actual_cost = calculate_cost(strategy.usage_tracker) budget.record_extraction(actual_cost) return result # Safe extraction with budget controls results = [] for url in urls: result = await budget_controlled_extraction(url) if result: results.append(result) ``` --- ## 6. Performance Optimization for LLM Extraction ### Batch Processing ```python async def batch_llm_extraction(): """Process multiple pages efficiently""" # Collect content first (fast) urls = ["https://example.com/page1", "https://example.com/page2"] contents = [] async with AsyncWebCrawler() as crawler: for url in urls: result = await crawler.arun(url=url) if result.success: contents.append({ "url": url, "content": result.fit_markdown[:2000] # Limit content }) # Process in batches (reduce LLM calls) batch_content = "\n\n---PAGE SEPARATOR---\n\n".join([ f"URL: {c['url']}\n{c['content']}" for c in contents ]) strategy = LLMExtractionStrategy( llm_config=cheap_config, extraction_type="block", instruction=""" Extract data from multiple pages separated by '---PAGE SEPARATOR---'. Return results for each page in order. """, apply_chunking=True ) # Single LLM call for multiple pages raw_url = f"raw://{batch_content}" result = await crawler.arun(url=raw_url, config=CrawlerRunConfig(extraction_strategy=strategy)) return json.loads(result.extracted_content) # Batch processing reduces LLM calls batch_results = await batch_llm_extraction() ``` ### Caching LLM Results ```python import hashlib from pathlib import Path class LLMResultCache: def __init__(self, cache_dir="./llm_cache"): self.cache_dir = Path(cache_dir) self.cache_dir.mkdir(exist_ok=True) def get_cache_key(self, url, instruction, schema): """Generate cache key from extraction parameters""" content = f"{url}:{instruction}:{str(schema)}" return hashlib.md5(content.encode()).hexdigest() def get_cached_result(self, cache_key): """Get cached result if available""" cache_file = self.cache_dir / f"{cache_key}.json" if cache_file.exists(): return json.load(cache_file.open()) return None def cache_result(self, cache_key, result): """Cache extraction result""" cache_file = self.cache_dir / f"{cache_key}.json" json.dump(result, cache_file.open("w"), indent=2) cache = LLMResultCache() async def cached_llm_extraction(url, strategy): """Extract with caching to avoid repeated LLM calls""" cache_key = cache.get_cache_key( url, strategy.instruction, str(strategy.schema) ) # Check cache first cached_result = cache.get_cached_result(cache_key) if cached_result: print("βœ… Using cached result (FREE)") return cached_result # Extract if not cached print("πŸ”„ Extracting with LLM (PAID)") config = CrawlerRunConfig(extraction_strategy=strategy) async with AsyncWebCrawler() as crawler: result = await crawler.arun(url=url, config=config) if result.success: data = json.loads(result.extracted_content) cache.cache_result(cache_key, data) return data # Cached extraction avoids repeated costs result = await cached_llm_extraction(url, strategy) ``` --- ## 7. Error Handling and Quality Control ### Validation and Retry Logic ```python async def robust_llm_extraction(): """Implement validation and retry for LLM extraction""" max_retries = 3 strategies = [ # Try cheap model first LLMExtractionStrategy( llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"), schema=YourModel.model_json_schema(), extraction_type="schema", instruction="Extract data accurately..." ), # Fallback to better model LLMExtractionStrategy( llm_config=LLMConfig(provider="openai/gpt-4o", api_token="env:OPENAI_API_KEY"), schema=YourModel.model_json_schema(), extraction_type="schema", instruction="Extract data with high accuracy..." ) ] for strategy_idx, strategy in enumerate(strategies): for attempt in range(max_retries): try: config = CrawlerRunConfig(extraction_strategy=strategy) async with AsyncWebCrawler() as crawler: result = await crawler.arun(url="https://example.com", config=config) if result.success and result.extracted_content: data = json.loads(result.extracted_content) # Validate result quality if validate_extraction_quality(data): print(f"βœ… Success with strategy {strategy_idx+1}, attempt {attempt+1}") return data else: print(f"⚠️ Poor quality result, retrying...") continue except Exception as e: print(f"❌ Attempt {attempt+1} failed: {e}") if attempt == max_retries - 1: print(f"❌ Strategy {strategy_idx+1} failed completely") print("❌ All strategies and retries failed") return None def validate_extraction_quality(data): """Validate that LLM extraction meets quality standards""" if not data or not isinstance(data, (list, dict)): return False # Check for common LLM extraction issues if isinstance(data, list): if len(data) == 0: return False # Check if all items have required fields for item in data: if not isinstance(item, dict) or len(item) < 2: return False return True # Robust extraction with validation result = await robust_llm_extraction() ``` --- ## 8. Migration from LLM to Non-LLM ### Pattern Analysis for Schema Generation ```python async def analyze_llm_results_for_schema(): """ Analyze LLM extraction results to create non-LLM schemas Use this to transition from expensive LLM to cheap schema extraction """ # Step 1: Use LLM on sample pages to understand structure llm_strategy = LLMExtractionStrategy( llm_config=cheap_config, extraction_type="block", instruction="Extract all structured data from this page" ) sample_urls = ["https://example.com/page1", "https://example.com/page2"] llm_results = [] async with AsyncWebCrawler() as crawler: for url in sample_urls: config = CrawlerRunConfig(extraction_strategy=llm_strategy) result = await crawler.arun(url=url, config=config) if result.success: llm_results.append({ "url": url, "html": result.cleaned_html, "extracted": json.loads(result.extracted_content) }) # Step 2: Analyze patterns in LLM results print("πŸ” Analyzing LLM extraction patterns...") # Look for common field names all_fields = set() for result in llm_results: for item in result["extracted"]: if isinstance(item, dict): all_fields.update(item.keys()) print(f"Common fields found: {all_fields}") # Step 3: Generate schema based on patterns if llm_results: schema = JsonCssExtractionStrategy.generate_schema( html=llm_results[0]["html"], target_json_example=json.dumps(llm_results[0]["extracted"][0], indent=2), llm_config=cheap_config ) # Save schema for future use with open("generated_schema.json", "w") as f: json.dump(schema, f, indent=2) print("βœ… Schema generated from LLM analysis") return schema # Generate schema from LLM patterns, then use schema for all future extractions schema = await analyze_llm_results_for_schema() fast_strategy = JsonCssExtractionStrategy(schema) ``` --- ## 9. Summary: When LLM is Actually Needed ### βœ… Valid LLM Use Cases (Rare): 1. **Sentiment analysis** and emotional understanding 2. **Knowledge graph extraction** requiring semantic reasoning 3. **Content summarization** and insight generation 4. **Unstructured text analysis** where patterns vary dramatically 5. **Research paper analysis** requiring domain expertise 6. **Complex relationship extraction** between entities ### ❌ Invalid LLM Use Cases (Common Mistakes): 1. **Structured data extraction** from consistent HTML 2. **Simple pattern matching** (emails, prices, dates) 3. **Product information** from e-commerce sites 4. **News article extraction** with consistent structure 5. **Contact information** and basic entity extraction 6. **Table data** and form information ### πŸ’‘ Decision Framework: ```python def should_use_llm(extraction_task): # Ask these questions in order: questions = [ "Can I identify repeating HTML patterns?", # No β†’ Consider LLM "Am I extracting simple data types?", # Yes β†’ Use Regex "Does the structure vary dramatically?", # No β†’ Use CSS/XPath "Do I need semantic understanding?", # Yes β†’ Maybe LLM "Have I tried generate_schema()?" # No β†’ Try that first ] # Only use LLM if: return ( task_requires_semantic_reasoning(extraction_task) and structure_varies_dramatically(extraction_task) and generate_schema_failed(extraction_task) ) ``` ### 🎯 Best Practice Summary: 1. **Always start** with [extraction-no-llm.md](./extraction-no-llm.md) strategies 2. **Try generate_schema()** before manual schema creation 3. **Use LLM sparingly** and only for semantic understanding 4. **Monitor costs** and implement budget controls 5. **Cache results** to avoid repeated LLM calls 6. **Validate quality** of LLM extractions 7. **Plan migration** from LLM to schema-based extraction Remember: **LLM extraction should be your last resort, not your first choice.** --- **πŸ“– Recommended Reading Order:** 1. [extraction-no-llm.md](./extraction-no-llm.md) - Start here for 99% of use cases 2. This document - Only when non-LLM strategies are insufficient