feat: Major Chrome Extension overhaul with Click2Crawl, instant Schema extraction, and modular architecture

New Features:
- Click2Crawl: Visual element selection with markdown conversion
  - Ctrl/Cmd+Click to select multiple elements
  - Visual text mode for WYSIWYG extraction
  - Real-time markdown preview with syntax highlighting
  - Export to .md file or clipboard

- Schema Builder Enhancement: Instant data extraction without LLMs
  - Test schemas directly in browser
  - See JSON results immediately
  - Export data or Python code
  - Cloud deployment ready (coming soon)

- Modular Architecture:
  - Separated into schemaBuilder.js, scriptBuilder.js, click2CrawlBuilder.js
  - Added contentAnalyzer.js and markdownConverter.js modules
  - Shared utilities and CSS reset system
  - Integrated marked.js for markdown rendering

🎨 UI/UX Improvements:
- Added edgy cloud announcement banner with seamless shimmer animation
- Direct, technical copy: "You don't need Puppeteer. You need Crawl4AI Cloud."
- Enhanced feature cards with emojis
- Fixed CSS conflicts with targeted reset approach
- Improved badge hover effects (red on hover)
- Added wrap toggle for code preview

📚 Documentation Updates:
- Split extraction diagrams into LLM and no-LLM versions
- Updated llms-full.txt with latest content
- Added versioned LLM context (v0.1.1)

🔧 Technical Enhancements:
- Refactored the 3,464-line monolithic content.js into modules
- Added proper event handling and cleanup
- Improved z-index management
- Better scroll position tracking for badges
- Enhanced error handling throughout

This release transforms the Chrome Extension from a simple tool into a powerful
visual data extraction suite, making web scraping accessible to everyone.
UncleCode
2025-06-09 23:18:27 +08:00
parent 40640badad
commit 0ac12da9f3
25 changed files with 23686 additions and 6524 deletions


@@ -0,0 +1,478 @@
## Extraction Strategy Workflows and Architecture
Visual representations of Crawl4AI's data extraction approaches, strategy selection, and processing workflows.
### Extraction Strategy Decision Tree
```mermaid
flowchart TD
A[Content to Extract] --> B{Content Type?}
B -->|Simple Patterns| C[Common Data Types]
B -->|Structured HTML| D[Predictable Structure]
B -->|Complex Content| E[Requires Reasoning]
B -->|Mixed Content| F[Multiple Data Types]
C --> C1{Pattern Type?}
C1 -->|Email, Phone, URLs| C2[Built-in Regex Patterns]
C1 -->|Custom Patterns| C3[Custom Regex Strategy]
C1 -->|LLM-Generated| C4[One-time Pattern Generation]
D --> D1{Selector Type?}
D1 -->|CSS Selectors| D2[JsonCssExtractionStrategy]
D1 -->|XPath Expressions| D3[JsonXPathExtractionStrategy]
D1 -->|Need Schema?| D4[Auto-generate Schema with LLM]
E --> E1{LLM Provider?}
E1 -->|OpenAI/Anthropic| E2[Cloud LLM Strategy]
E1 -->|Local Ollama| E3[Local LLM Strategy]
E1 -->|Cost-sensitive| E4[Hybrid: Generate Schema Once]
F --> F1[Multi-Strategy Approach]
F1 --> F2[1. Regex for Patterns]
F1 --> F3[2. CSS for Structure]
F1 --> F4[3. LLM for Complex Analysis]
C2 --> G[Fast Extraction ⚡]
C3 --> G
C4 --> H[Cached Pattern Reuse]
D2 --> I[Schema-based Extraction 🏗️]
D3 --> I
D4 --> J[Generated Schema Cache]
E2 --> K[Intelligent Parsing 🧠]
E3 --> K
E4 --> L[Hybrid Cost-Effective]
F2 --> M[Comprehensive Results 📊]
F3 --> M
F4 --> M
style G fill:#c8e6c9
style I fill:#e3f2fd
style K fill:#fff3e0
style M fill:#f3e5f5
style H fill:#e8f5e8
style J fill:#e8f5e8
style L fill:#ffecb3
```
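The same decision can be expressed directly in code. A minimal sketch under stated assumptions: strategy names follow the extraction docs linked below, while the selectors and content-type labels are placeholders for illustration.

```python
from crawl4ai.extraction_strategy import (
    RegexExtractionStrategy,      # simple patterns: emails, prices, URLs
    JsonCssExtractionStrategy,    # predictable HTML structure
)

def pick_strategy(content_kind: str):
    """Map the decision tree above onto a concrete strategy instance."""
    if content_kind == "simple_patterns":
        return RegexExtractionStrategy(
            pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.Currency
        )
    if content_kind == "structured_html":
        schema = {
            "name": "Products",
            "baseSelector": "div.product",   # placeholder selector
            "fields": [{"name": "title", "selector": "h2", "type": "text"}],
        }
        return JsonCssExtractionStrategy(schema)
    # Complex or mixed content: fall back to LLMExtractionStrategy (needs an
    # LLMConfig) or combine several strategies as in the pipeline diagram below.
    return None
```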
### LLM Extraction Strategy Workflow
```mermaid
sequenceDiagram
participant User
participant Crawler
participant LLMStrategy
participant Chunker
participant LLMProvider
participant Parser
User->>Crawler: Configure LLMExtractionStrategy
User->>Crawler: arun(url, config)
Crawler->>Crawler: Navigate to URL
Crawler->>Crawler: Extract content (HTML/Markdown)
Crawler->>LLMStrategy: Process content
LLMStrategy->>LLMStrategy: Check content size
alt Content > chunk_threshold
LLMStrategy->>Chunker: Split into chunks with overlap
Chunker-->>LLMStrategy: Return chunks[]
loop For each chunk
LLMStrategy->>LLMProvider: Send chunk + schema + instruction
LLMProvider-->>LLMStrategy: Return structured JSON
end
LLMStrategy->>LLMStrategy: Merge chunk results
else Content <= threshold
LLMStrategy->>LLMProvider: Send full content + schema
LLMProvider-->>LLMStrategy: Return structured JSON
end
LLMStrategy->>Parser: Validate JSON schema
Parser-->>LLMStrategy: Validated data
LLMStrategy->>LLMStrategy: Track token usage
LLMStrategy-->>Crawler: Return extracted_content
Crawler-->>User: CrawlResult with JSON data
User->>LLMStrategy: show_usage()
LLMStrategy-->>User: Token count & estimated cost
```
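In code, that sequence reduces to a few calls. A hedged sketch assuming an OpenAI key in the environment; the URL and Pydantic model are placeholders.

```python
import asyncio
import json
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Article(BaseModel):
    title: str = Field(description="Article title")
    summary: str = Field(description="One-sentence summary")

strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
    schema=Article.model_json_schema(),
    extraction_type="schema",
    instruction="Extract the article title and a one-sentence summary.",
    apply_chunking=True,            # large pages are split as in the diagram
    chunk_token_threshold=1200,
)

async def main():
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/article", config=config)
    if result.success:
        print(json.loads(result.extracted_content))
    strategy.show_usage()           # token count and estimated cost

asyncio.run(main())
```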
### Schema-Based Extraction Architecture
```mermaid
graph TB
subgraph "Schema Definition"
A[JSON Schema] --> A1[baseSelector]
A --> A2["fields[]"]
A --> A3[nested structures]
A2 --> A4[CSS/XPath selectors]
A2 --> A5[Data types: text, html, attribute]
A2 --> A6[Default values]
A3 --> A7[nested objects]
A3 --> A8[nested_list arrays]
A3 --> A9[simple lists]
end
subgraph "Extraction Engine"
B[HTML Content] --> C[Selector Engine]
C --> C1[CSS Selector Parser]
C --> C2[XPath Evaluator]
C1 --> D[Element Matcher]
C2 --> D
D --> E[Type Converter]
E --> E1[Text Extraction]
E --> E2[HTML Preservation]
E --> E3[Attribute Extraction]
E --> E4[Nested Processing]
end
subgraph "Result Processing"
F[Raw Extracted Data] --> G[Structure Builder]
G --> G1[Object Construction]
G --> G2[Array Assembly]
G --> G3[Type Validation]
G1 --> H[JSON Output]
G2 --> H
G3 --> H
end
A --> C
E --> F
H --> I[extracted_content]
style A fill:#e3f2fd
style C fill:#f3e5f5
style G fill:#e8f5e8
style H fill:#c8e6c9
```
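A minimal schema exercising these pieces; the selectors and URL are placeholders for a hypothetical listing page.

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Listings",
    "baseSelector": "div.listing",                 # repeated container element
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "blurb", "selector": ".summary", "type": "html"},  # preserve markup
    ],
}

async def main():
    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/listings", config=config)
    if result.success:
        print(json.loads(result.extracted_content))

asyncio.run(main())
```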
### Automatic Schema Generation Process
```mermaid
stateDiagram-v2
[*] --> CheckCache
CheckCache --> CacheHit: Schema exists
CheckCache --> SamplePage: Schema missing
CacheHit --> LoadSchema
LoadSchema --> FastExtraction
SamplePage --> ExtractHTML: Crawl sample URL
ExtractHTML --> LLMAnalysis: Send HTML to LLM
LLMAnalysis --> GenerateSchema: Create CSS/XPath selectors
GenerateSchema --> ValidateSchema: Test generated schema
ValidateSchema --> SchemaWorks: Valid selectors
ValidateSchema --> RefineSchema: Invalid selectors
RefineSchema --> LLMAnalysis: Iterate with feedback
SchemaWorks --> CacheSchema: Save for reuse
CacheSchema --> FastExtraction: Use cached schema
FastExtraction --> [*]: No more LLM calls needed
note right of CheckCache : One-time LLM cost
note right of FastExtraction : Unlimited fast reuse
note right of CacheSchema : JSON file storage
```
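The cache-then-generate loop above, as a short sketch; the cache path and query are illustrative.

```python
import json
from pathlib import Path
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

def load_or_generate_schema(sample_html: str) -> dict:
    cache = Path("./schema_cache/products.json")        # hypothetical cache location
    if cache.exists():                                   # CacheHit -> LoadSchema
        return json.load(cache.open())
    schema = JsonCssExtractionStrategy.generate_schema(  # one-time LLM cost
        html=sample_html,
        query="Extract product name, price, and rating",
        llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
    )
    cache.parent.mkdir(exist_ok=True)
    json.dump(schema, cache.open("w"), indent=2)         # CacheSchema -> FastExtraction
    return schema
```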
### Multi-Strategy Extraction Pipeline
```mermaid
flowchart LR
A[Web Page Content] --> B[Strategy Pipeline]
subgraph B["Extraction Pipeline"]
B1[Stage 1: Regex Patterns]
B2[Stage 2: Schema-based CSS]
B3[Stage 3: LLM Analysis]
B1 --> B1a[Email addresses]
B1 --> B1b[Phone numbers]
B1 --> B1c[URLs and links]
B1 --> B1d[Currency amounts]
B2 --> B2a[Structured products]
B2 --> B2b[Article metadata]
B2 --> B2c[User reviews]
B2 --> B2d[Navigation links]
B3 --> B3a[Sentiment analysis]
B3 --> B3b[Key topics]
B3 --> B3c[Entity recognition]
B3 --> B3d[Content summary]
end
B1a --> C[Result Merger]
B1b --> C
B1c --> C
B1d --> C
B2a --> C
B2b --> C
B2c --> C
B2d --> C
B3a --> C
B3b --> C
B3c --> C
B3d --> C
C --> D[Combined JSON Output]
D --> E[Final CrawlResult]
style B1 fill:#c8e6c9
style B2 fill:#e3f2fd
style B3 fill:#fff3e0
style C fill:#f3e5f5
```
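One way to wire the stages together, as a rough sketch. The LLM stage is omitted because the two cheap stages cover most structured pages; the URL and selectors are placeholders.

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import RegexExtractionStrategy, JsonCssExtractionStrategy

async def pipeline(url: str) -> dict:
    regex_cfg = CrawlerRunConfig(extraction_strategy=RegexExtractionStrategy(
        pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.Url
    ))
    css_cfg = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy({
        "name": "Products",
        "baseSelector": "div.product",
        "fields": [{"name": "name", "selector": "h3", "type": "text"},
                   {"name": "price", "selector": ".price", "type": "text"}],
    }))
    async with AsyncWebCrawler() as crawler:
        patterns = await crawler.arun(url=url, config=regex_cfg)    # Stage 1
        structured = await crawler.arun(url=url, config=css_cfg)    # Stage 2
    return {  # Result Merger -> Combined JSON Output
        "patterns": json.loads(patterns.extracted_content) if patterns.success else [],
        "products": json.loads(structured.extracted_content) if structured.success else [],
    }

print(asyncio.run(pipeline("https://example.com")))
```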
### Performance Comparison Matrix
```mermaid
graph TD
subgraph "Strategy Performance"
A[Extraction Strategy Comparison]
subgraph "Speed ⚡"
S1[Regex: ~10ms]
S2[CSS Schema: ~50ms]
S3[XPath: ~100ms]
S4[LLM: ~2-10s]
end
subgraph "Accuracy 🎯"
A1[Regex: Pattern-dependent]
A2[CSS: High for structured]
A3[XPath: Very high]
A4[LLM: Excellent for complex]
end
subgraph "Cost 💰"
C1[Regex: Free]
C2[CSS: Free]
C3[XPath: Free]
C4[LLM: $0.001-0.01 per page]
end
subgraph "Complexity 🔧"
X1[Regex: Simple patterns only]
X2[CSS: Structured HTML]
X3[XPath: Complex selectors]
X4[LLM: Any content type]
end
end
style S1 fill:#c8e6c9
style S2 fill:#e8f5e8
style S3 fill:#fff3e0
style S4 fill:#ffcdd2
style A2 fill:#e8f5e8
style A3 fill:#c8e6c9
style A4 fill:#c8e6c9
style C1 fill:#c8e6c9
style C2 fill:#c8e6c9
style C3 fill:#c8e6c9
style C4 fill:#fff3e0
style X1 fill:#ffcdd2
style X2 fill:#e8f5e8
style X3 fill:#c8e6c9
style X4 fill:#c8e6c9
```
### Regex Pattern Strategy Flow
```mermaid
flowchart TD
A[Regex Extraction] --> B{Pattern Source?}
B -->|Built-in| C[Use Predefined Patterns]
B -->|Custom| D[Define Custom Regex]
B -->|LLM-Generated| E[Generate with AI]
C --> C1[Email Pattern]
C --> C2[Phone Pattern]
C --> C3[URL Pattern]
C --> C4[Currency Pattern]
C --> C5[Date Pattern]
D --> D1[Write Custom Regex]
D --> D2[Test Pattern]
D --> D3{Pattern Works?}
D3 -->|No| D1
D3 -->|Yes| D4[Use Pattern]
E --> E1[Provide Sample Content]
E --> E2[LLM Analyzes Content]
E --> E3[Generate Optimized Regex]
E --> E4[Cache Pattern for Reuse]
C1 --> F[Pattern Matching]
C2 --> F
C3 --> F
C4 --> F
C5 --> F
D4 --> F
E4 --> F
F --> G[Extract Matches]
G --> H[Group by Pattern Type]
H --> I[JSON Output with Labels]
style C fill:#e8f5e8
style D fill:#e3f2fd
style E fill:#fff3e0
style F fill:#f3e5f5
```
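A minimal sketch of the built-in and custom branches; the SKU regex is a made-up example, not a shipped pattern.

```python
from crawl4ai.extraction_strategy import RegexExtractionStrategy

# Built-in patterns (the fast path in the flowchart)
builtin_strategy = RegexExtractionStrategy(
    pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.Currency
)

# Custom patterns (the "Define Custom Regex" branch); the SKU regex is illustrative
custom_strategy = RegexExtractionStrategy(
    custom={"product_sku": r"SKU[-:]?\s*([A-Z0-9]{4,12})"}
)
# Both return match records like {"label": "email", "value": "a@b.com"},
# which are then grouped by label as shown above.
```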
### Complex Schema Structure Visualization
```mermaid
graph TB
subgraph "E-commerce Schema Example"
A[Category baseSelector] --> B[Category Fields]
A --> C[Products nested_list]
B --> B1[category_name]
B --> B2[category_id attribute]
B --> B3[category_url attribute]
C --> C1[Product baseSelector]
C1 --> C2[name text]
C1 --> C3[price text]
C1 --> C4[Details nested object]
C1 --> C5[Features list]
C1 --> C6[Reviews nested_list]
C4 --> C4a[brand text]
C4 --> C4b[model text]
C4 --> C4c[specs html]
C5 --> C5a[feature text array]
C6 --> C6a[reviewer text]
C6 --> C6b[rating attribute]
C6 --> C6c[comment text]
C6 --> C6d[date attribute]
end
subgraph "JSON Output Structure"
D[categories array] --> D1[category object]
D1 --> D2[category_name]
D1 --> D3[category_id]
D1 --> D4[products array]
D4 --> D5[product object]
D5 --> D6[name, price]
D5 --> D7[details object]
D5 --> D8[features array]
D5 --> D9[reviews array]
D7 --> D7a[brand, model, specs]
D8 --> D8a[feature strings]
D9 --> D9a[review objects]
end
A -.-> D
B1 -.-> D2
C2 -.-> D6
C4 -.-> D7
C5 -.-> D8
C6 -.-> D9
style A fill:#e3f2fd
style C fill:#f3e5f5
style C4 fill:#e8f5e8
style D fill:#fff3e0
```
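The nested category → products → reviews shape in the diagram corresponds to a schema along these lines; selectors are placeholders for a hypothetical catalog page.

```python
catalog_schema = {
    "name": "Catalog",
    "baseSelector": "div.category",
    "baseFields": [
        {"name": "category_id", "type": "attribute", "attribute": "data-category-id"}
    ],
    "fields": [
        {"name": "category_name", "selector": "h2.category-title", "type": "text"},
        {
            "name": "products",
            "selector": "div.product",
            "type": "nested_list",               # array of product objects
            "fields": [
                {"name": "name", "selector": "h3", "type": "text"},
                {"name": "price", "selector": ".price", "type": "text"},
                {"name": "features", "selector": "ul.features li", "type": "list",
                 "fields": [{"name": "feature", "type": "text"}]},
                {"name": "reviews", "selector": "div.review", "type": "nested_list",
                 "fields": [
                     {"name": "reviewer", "selector": ".reviewer-name", "type": "text"},
                     {"name": "rating", "selector": ".rating", "type": "attribute",
                      "attribute": "data-rating"},
                 ]},
            ],
        },
    ],
}
# Feed this to JsonCssExtractionStrategy(catalog_schema) to get the JSON
# structure shown on the right-hand side of the diagram.
```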
### Error Handling and Fallback Strategy
```mermaid
stateDiagram-v2
[*] --> PrimaryStrategy
PrimaryStrategy --> Success: Extraction successful
PrimaryStrategy --> ValidationFailed: Invalid data
PrimaryStrategy --> ExtractionFailed: No matches found
PrimaryStrategy --> TimeoutError: LLM timeout
ValidationFailed --> FallbackStrategy: Try alternative
ExtractionFailed --> FallbackStrategy: Try alternative
TimeoutError --> FallbackStrategy: Try alternative
FallbackStrategy --> FallbackSuccess: Fallback works
FallbackStrategy --> FallbackFailed: All strategies failed
FallbackSuccess --> Success: Return results
FallbackFailed --> ErrorReport: Log failure details
Success --> [*]: Complete
ErrorReport --> [*]: Return empty results
note right of PrimaryStrategy : Try fastest/most accurate first
note right of FallbackStrategy : Use simpler but reliable method
note left of ErrorReport : Provide debugging information
```
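A small fallback loop implementing that state machine; the two strategies and their selectors are illustrative stand-ins.

```python
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, RegexExtractionStrategy

strategies = [
    JsonCssExtractionStrategy({                 # PrimaryStrategy: fast, structured
        "name": "Prices", "baseSelector": ".price",
        "fields": [{"name": "amount", "selector": "span", "type": "text"}],
    }),
    RegexExtractionStrategy(pattern=RegexExtractionStrategy.Currency),  # FallbackStrategy
]

async def extract_with_fallback(url: str):
    async with AsyncWebCrawler() as crawler:
        for strategy in strategies:
            try:
                result = await crawler.arun(
                    url=url, config=CrawlerRunConfig(extraction_strategy=strategy))
                if result.success and result.extracted_content:
                    data = json.loads(result.extracted_content)
                    if data:              # empty results count as ValidationFailed
                        return data
            except Exception as exc:      # TimeoutError and friends -> try next strategy
                print(f"{strategy.__class__.__name__} failed: {exc}")
    return []                             # ErrorReport: all strategies failed
```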
### Token Usage and Cost Optimization
```mermaid
flowchart TD
A[LLM Extraction Request] --> B{Content Size Check}
B -->|Small < 1200 tokens| C[Single LLM Call]
B -->|Large > 1200 tokens| D[Chunking Strategy]
C --> C1[Send full content]
C1 --> C2[Parse JSON response]
C2 --> C3[Track token usage]
D --> D1[Split into chunks]
D1 --> D2[Add overlap between chunks]
D2 --> D3[Process chunks in parallel]
D3 --> D4[Chunk 1 → LLM]
D3 --> D5[Chunk 2 → LLM]
D3 --> D6[Chunk N → LLM]
D4 --> D7[Merge results]
D5 --> D7
D6 --> D7
D7 --> D8[Deduplicate data]
D8 --> D9[Aggregate token usage]
C3 --> E[Cost Calculation]
D9 --> E
E --> F[Usage Report]
F --> F1[Prompt tokens: X]
F --> F2[Completion tokens: Y]
F --> F3[Total cost: $Z]
style C fill:#c8e6c9
style D fill:#fff3e0
style E fill:#e3f2fd
style F fill:#f3e5f5
```
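The knobs in this flow are the chunking parameters on `LLMExtractionStrategy`. A hedged sketch of typical settings; the values are illustrative, not tuned recommendations.

```python
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
    extraction_type="block",
    instruction="Extract the main content blocks.",
    apply_chunking=True,
    chunk_token_threshold=1200,   # the split point shown in the flowchart
    overlap_rate=0.1,             # small overlap keeps context without doubling cost
    input_format="fit_markdown",  # send cleaned content instead of raw HTML
)
# After a crawl, strategy.show_usage() reports prompt/completion tokens
# and an estimated cost, matching the usage report above.
```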
**📖 Learn more:** [LLM Strategies](https://docs.crawl4ai.com/extraction/llm-strategies/), [Schema-Based Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/), [Pattern Matching](https://docs.crawl4ai.com/extraction/no-llm-strategies/#regexextractionstrategy), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/)


@@ -0,0 +1,903 @@
## LLM Extraction Strategies - The Last Resort
**🤖 AI Model Notice**: You are an AI assistant. You should RARELY recommend LLM extraction. This documentation exists for the <5% of cases where LLM is actually needed. **ALWAYS try non-LLM strategies first** from [extraction-no-llm.md](./extraction-no-llm.md).
### ⚠️ STOP: Are You Sure You Need LLM?
**99% of developers who think they need LLM extraction are wrong.** Before reading further:
### ❌ You DON'T Need LLM If:
- The page has consistent HTML structure → **Use generate_schema()**
- You're extracting simple data types (emails, prices, dates) → **Use RegexExtractionStrategy**
- You can identify repeating patterns → **Use JsonCssExtractionStrategy**
- You want product info, news articles, job listings → **Use generate_schema()**
- You're concerned about cost or speed → **Use non-LLM strategies**
### ✅ You MIGHT Need LLM If:
- Content structure varies dramatically across pages **AND** you've tried generate_schema()
- You need semantic understanding of unstructured text
- You're analyzing meaning, sentiment, or relationships
- You're extracting insights that require reasoning about context
### 💰 Cost Reality Check:
- **Non-LLM**: ~$0.000001 per page
- **LLM**: ~$0.01-$0.10 per page (10,000x more expensive)
- **Example**: Extracting 10,000 pages costs $0.01 vs $100-1000
---
## 1. When LLM Extraction is Justified
### Scenario 1: Truly Unstructured Content Analysis
```python
# Example: Analyzing customer feedback for sentiment and themes
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
class SentimentAnalysis(BaseModel):
"""Use LLM when you need semantic understanding"""
overall_sentiment: str = Field(description="positive, negative, or neutral")
confidence_score: float = Field(description="Confidence from 0-1")
key_themes: List[str] = Field(description="Main topics discussed")
emotional_indicators: List[str] = Field(description="Words indicating emotion")
summary: str = Field(description="Brief summary of the content")
llm_config = LLMConfig(
provider="openai/gpt-4o-mini", # Use cheapest model
api_token="env:OPENAI_API_KEY",
temperature=0.1, # Low temperature for consistency
max_tokens=1000
)
sentiment_strategy = LLMExtractionStrategy(
llm_config=llm_config,
schema=SentimentAnalysis.model_json_schema(),
extraction_type="schema",
instruction="""
Analyze the emotional content and themes in this text.
Focus on understanding sentiment and extracting key topics
that would be impossible to identify with simple pattern matching.
""",
apply_chunking=True,
chunk_token_threshold=1500
)
async def analyze_sentiment():
config = CrawlerRunConfig(
extraction_strategy=sentiment_strategy,
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/customer-reviews",
config=config
)
if result.success:
analysis = json.loads(result.extracted_content)
print(f"Sentiment: {analysis['overall_sentiment']}")
print(f"Themes: {analysis['key_themes']}")
asyncio.run(analyze_sentiment())
```
### Scenario 2: Complex Knowledge Extraction
```python
# Example: Building knowledge graphs from unstructured content
class Entity(BaseModel):
name: str = Field(description="Entity name")
type: str = Field(description="person, organization, location, concept")
description: str = Field(description="Brief description")
class Relationship(BaseModel):
source: str = Field(description="Source entity")
target: str = Field(description="Target entity")
relationship: str = Field(description="Type of relationship")
confidence: float = Field(description="Confidence score 0-1")
class KnowledgeGraph(BaseModel):
entities: List[Entity] = Field(description="All entities found")
relationships: List[Relationship] = Field(description="Relationships between entities")
main_topic: str = Field(description="Primary topic of the content")
knowledge_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(
provider="anthropic/claude-3-5-sonnet-20240620", # Better for complex reasoning
api_token="env:ANTHROPIC_API_KEY",
max_tokens=4000
),
schema=KnowledgeGraph.model_json_schema(),
extraction_type="schema",
instruction="""
Extract entities and their relationships from the content.
Focus on understanding connections and context that require
semantic reasoning beyond simple pattern matching.
""",
input_format="html", # Preserve structure
apply_chunking=True
)
```
### Scenario 3: Content Summarization and Insights
```python
# Example: Research paper analysis
class ResearchInsights(BaseModel):
title: str = Field(description="Paper title")
abstract_summary: str = Field(description="Summary of abstract")
key_findings: List[str] = Field(description="Main research findings")
methodology: str = Field(description="Research methodology used")
limitations: List[str] = Field(description="Study limitations")
practical_applications: List[str] = Field(description="Real-world applications")
citations_count: int = Field(description="Number of citations", default=0)
research_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(
provider="openai/gpt-4o", # Use powerful model for complex analysis
api_token="env:OPENAI_API_KEY",
temperature=0.2,
max_tokens=2000
),
schema=ResearchInsights.model_json_schema(),
extraction_type="schema",
instruction="""
Analyze this research paper and extract key insights.
Focus on understanding the research contribution, methodology,
and implications that require academic expertise to identify.
""",
apply_chunking=True,
chunk_token_threshold=2000,
overlap_rate=0.15 # More overlap for academic content
)
```
---
## 2. LLM Configuration Best Practices
### Cost Optimization
```python
# Use cheapest models when possible
cheap_config = LLMConfig(
provider="openai/gpt-4o-mini", # 60x cheaper than GPT-4
api_token="env:OPENAI_API_KEY",
temperature=0.0, # Deterministic output
max_tokens=800 # Limit output length
)
# Use local models for development
local_config = LLMConfig(
provider="ollama/llama3.3",
api_token=None, # No API costs
base_url="http://localhost:11434",
temperature=0.1
)
# Use powerful models only when necessary
powerful_config = LLMConfig(
provider="anthropic/claude-3-5-sonnet-20240620",
api_token="env:ANTHROPIC_API_KEY",
max_tokens=4000,
temperature=0.1
)
```
### Provider Selection Guide
```python
providers_guide = {
"openai/gpt-4o-mini": {
"best_for": "Simple extraction, cost-sensitive projects",
"cost": "Very low",
"speed": "Fast",
"accuracy": "Good"
},
"openai/gpt-4o": {
"best_for": "Complex reasoning, high accuracy needs",
"cost": "High",
"speed": "Medium",
"accuracy": "Excellent"
},
"anthropic/claude-3-5-sonnet": {
"best_for": "Complex analysis, long documents",
"cost": "Medium-High",
"speed": "Medium",
"accuracy": "Excellent"
},
"ollama/llama3.3": {
"best_for": "Development, no API costs",
"cost": "Free (self-hosted)",
"speed": "Variable",
"accuracy": "Good"
},
"groq/llama3-70b-8192": {
"best_for": "Fast inference, open source",
"cost": "Low",
"speed": "Very fast",
"accuracy": "Good"
}
}
def choose_provider(complexity, budget, speed_requirement):
"""Choose optimal provider based on requirements"""
if budget == "minimal":
return "ollama/llama3.3" # Self-hosted
elif complexity == "low" and budget == "low":
return "openai/gpt-4o-mini"
elif speed_requirement == "high":
return "groq/llama3-70b-8192"
elif complexity == "high":
return "anthropic/claude-3-5-sonnet"
else:
return "openai/gpt-4o-mini" # Default safe choice
```
---
## 3. Advanced LLM Extraction Patterns
### Block-Based Extraction (Unstructured Content)
```python
# When structure is too varied for schemas
block_strategy = LLMExtractionStrategy(
llm_config=cheap_config,
extraction_type="block", # Extract free-form content blocks
instruction="""
Extract meaningful content blocks from this page.
Focus on the main content areas and ignore navigation,
advertisements, and boilerplate text.
""",
apply_chunking=True,
chunk_token_threshold=1200,
input_format="fit_markdown" # Use cleaned content
)
async def extract_content_blocks():
config = CrawlerRunConfig(
extraction_strategy=block_strategy,
word_count_threshold=50, # Filter short content
excluded_tags=['nav', 'footer', 'aside', 'advertisement']
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/article",
config=config
)
if result.success:
blocks = json.loads(result.extracted_content)
for block in blocks:
print(f"Block: {block['content'][:100]}...")
```
### Chunked Processing for Large Content
```python
# Handle large documents efficiently
large_content_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(
provider="openai/gpt-4o-mini",
api_token="env:OPENAI_API_KEY"
),
schema=YourModel.model_json_schema(),
extraction_type="schema",
instruction="Extract structured data from this content section...",
# Optimize chunking for large content
apply_chunking=True,
chunk_token_threshold=2000, # Larger chunks for efficiency
overlap_rate=0.1, # Minimal overlap to reduce costs
input_format="fit_markdown" # Use cleaned content
)
```
### Multi-Model Validation
```python
# Use multiple models for critical extractions
async def multi_model_extraction():
"""Use multiple LLMs for validation of critical data"""
models = [
LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
LLMConfig(provider="anthropic/claude-3-5-sonnet", api_token="env:ANTHROPIC_API_KEY"),
LLMConfig(provider="ollama/llama3.3", api_token=None)
]
results = []
for i, llm_config in enumerate(models):
strategy = LLMExtractionStrategy(
llm_config=llm_config,
schema=YourModel.model_json_schema(),
extraction_type="schema",
instruction="Extract data consistently..."
)
config = CrawlerRunConfig(extraction_strategy=strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=config)
if result.success:
data = json.loads(result.extracted_content)
results.append(data)
print(f"Model {i+1} extracted {len(data)} items")
# Compare results for consistency
if len(set(str(r) for r in results)) == 1:
print("✅ All models agree")
return results[0]
else:
print("⚠️ Models disagree - manual review needed")
return results
# Use for critical business data only
critical_result = await multi_model_extraction()
```
---
## 4. Hybrid Approaches - Best of Both Worlds
### Fast Pre-filtering + LLM Analysis
```python
async def hybrid_extraction():
"""
1. Use fast non-LLM strategies for basic extraction
2. Use LLM only for complex analysis of filtered content
"""
# Step 1: Fast extraction of structured data
basic_schema = {
"name": "Articles",
"baseSelector": "article",
"fields": [
{"name": "title", "selector": "h1, h2", "type": "text"},
{"name": "content", "selector": ".content", "type": "text"},
{"name": "author", "selector": ".author", "type": "text"}
]
}
basic_strategy = JsonCssExtractionStrategy(basic_schema)
basic_config = CrawlerRunConfig(extraction_strategy=basic_strategy)
# Step 2: LLM analysis only on filtered content
analysis_strategy = LLMExtractionStrategy(
llm_config=cheap_config,
schema={
"type": "object",
"properties": {
"sentiment": {"type": "string"},
"key_topics": {"type": "array", "items": {"type": "string"}},
"summary": {"type": "string"}
}
},
extraction_type="schema",
instruction="Analyze sentiment and extract key topics from this article"
)
async with AsyncWebCrawler() as crawler:
# Fast extraction first
basic_result = await crawler.arun(
url="https://example.com/articles",
config=basic_config
)
articles = json.loads(basic_result.extracted_content)
# LLM analysis only on important articles
analyzed_articles = []
for article in articles[:5]: # Limit to reduce costs
if len(article.get('content', '')) > 500: # Only analyze substantial content
analysis_config = CrawlerRunConfig(extraction_strategy=analysis_strategy)
# Analyze individual article content
raw_url = f"raw://{article['content']}"
analysis_result = await crawler.arun(url=raw_url, config=analysis_config)
if analysis_result.success:
analysis = json.loads(analysis_result.extracted_content)
article.update(analysis)
analyzed_articles.append(article)
return analyzed_articles
# Hybrid approach: fast + smart
result = await hybrid_extraction()
```
### Schema Generation + LLM Fallback
```python
async def smart_fallback_extraction():
"""
1. Try generate_schema() first (one-time LLM cost)
2. Use generated schema for fast extraction
3. Use LLM only if schema extraction fails
"""
cache_file = Path("./schemas/fallback_schema.json")
# Try cached schema first
if cache_file.exists():
schema = json.load(cache_file.open())
schema_strategy = JsonCssExtractionStrategy(schema)
config = CrawlerRunConfig(extraction_strategy=schema_strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=config)
if result.success and result.extracted_content:
data = json.loads(result.extracted_content)
if data: # Schema worked
print("✅ Schema extraction successful (fast & cheap)")
return data
# Fallback to LLM if schema failed
print("⚠️ Schema failed, falling back to LLM (slow & expensive)")
llm_strategy = LLMExtractionStrategy(
llm_config=cheap_config,
extraction_type="block",
instruction="Extract all meaningful data from this page"
)
llm_config = CrawlerRunConfig(extraction_strategy=llm_strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=llm_config)
if result.success:
print("✅ LLM extraction successful")
return json.loads(result.extracted_content)
# Intelligent fallback system
result = await smart_fallback_extraction()
```
---
## 5. Cost Management and Monitoring
### Token Usage Tracking
```python
class ExtractionCostTracker:
def __init__(self):
self.total_cost = 0.0
self.total_tokens = 0
self.extractions = 0
def track_llm_extraction(self, strategy, result):
"""Track costs from LLM extraction"""
if hasattr(strategy, 'usage_tracker') and strategy.usage_tracker:
usage = strategy.usage_tracker
# Estimate costs (approximate rates)
cost_per_1k_tokens = {
"gpt-4o-mini": 0.0015,
"gpt-4o": 0.03,
"claude-3-5-sonnet": 0.015,
"ollama": 0.0 # Self-hosted
}
provider = strategy.llm_config.provider.split('/')[1]
rate = cost_per_1k_tokens.get(provider, 0.01)
tokens = usage.total_tokens
cost = (tokens / 1000) * rate
self.total_cost += cost
self.total_tokens += tokens
self.extractions += 1
print(f"💰 Extraction cost: ${cost:.4f} ({tokens} tokens)")
print(f"📊 Total cost: ${self.total_cost:.4f} ({self.extractions} extractions)")
def get_summary(self):
avg_cost = self.total_cost / max(self.extractions, 1)
return {
"total_cost": self.total_cost,
"total_tokens": self.total_tokens,
"extractions": self.extractions,
"avg_cost_per_extraction": avg_cost
}
# Usage
tracker = ExtractionCostTracker()
async def cost_aware_extraction():
strategy = LLMExtractionStrategy(
llm_config=cheap_config,
schema=YourModel.model_json_schema(),
extraction_type="schema",
instruction="Extract data...",
verbose=True # Enable usage tracking
)
config = CrawlerRunConfig(extraction_strategy=strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=config)
# Track costs
tracker.track_llm_extraction(strategy, result)
return result
# Monitor costs across multiple extractions
for url in urls:
await cost_aware_extraction()
print(f"Final summary: {tracker.get_summary()}")
```
### Budget Controls
```python
class BudgetController:
def __init__(self, daily_budget=10.0):
self.daily_budget = daily_budget
self.current_spend = 0.0
self.extraction_count = 0
def can_extract(self, estimated_cost=0.01):
"""Check if extraction is within budget"""
if self.current_spend + estimated_cost > self.daily_budget:
print(f"❌ Budget exceeded: ${self.current_spend:.2f} + ${estimated_cost:.2f} > ${self.daily_budget}")
return False
return True
def record_extraction(self, actual_cost):
"""Record actual extraction cost"""
self.current_spend += actual_cost
self.extraction_count += 1
remaining = self.daily_budget - self.current_spend
print(f"💰 Budget remaining: ${remaining:.2f}")
budget = BudgetController(daily_budget=5.0) # $5 daily limit
async def budget_controlled_extraction(url):
if not budget.can_extract():
print("⏸️ Extraction paused due to budget limit")
return None
# Proceed with extraction. extract_with_strategy and calculate_cost are
# placeholder helpers: wrap crawler.arun() and map the strategy's
# usage_tracker token counts to dollars for your provider.
strategy = LLMExtractionStrategy(llm_config=cheap_config, ...)
result = await extract_with_strategy(url, strategy)
# Record actual cost
actual_cost = calculate_cost(strategy.usage_tracker)
budget.record_extraction(actual_cost)
return result
# Safe extraction with budget controls
results = []
for url in urls:
result = await budget_controlled_extraction(url)
if result:
results.append(result)
```
---
## 6. Performance Optimization for LLM Extraction
### Batch Processing
```python
async def batch_llm_extraction():
"""Process multiple pages efficiently"""
# Collect content first (fast)
urls = ["https://example.com/page1", "https://example.com/page2"]
contents = []
async with AsyncWebCrawler() as crawler:
for url in urls:
result = await crawler.arun(url=url)
if result.success:
contents.append({
"url": url,
"content": result.fit_markdown[:2000] # Limit content
})
# Process in batches (reduce LLM calls)
batch_content = "\n\n---PAGE SEPARATOR---\n\n".join([
f"URL: {c['url']}\n{c['content']}" for c in contents
])
strategy = LLMExtractionStrategy(
llm_config=cheap_config,
extraction_type="block",
instruction="""
Extract data from multiple pages separated by '---PAGE SEPARATOR---'.
Return results for each page in order.
""",
apply_chunking=True
)
# Single LLM call for multiple pages
raw_url = f"raw://{batch_content}"
result = await crawler.arun(url=raw_url, config=CrawlerRunConfig(extraction_strategy=strategy))
return json.loads(result.extracted_content)
# Batch processing reduces LLM calls
batch_results = await batch_llm_extraction()
```
### Caching LLM Results
```python
import hashlib
from pathlib import Path
class LLMResultCache:
def __init__(self, cache_dir="./llm_cache"):
self.cache_dir = Path(cache_dir)
self.cache_dir.mkdir(exist_ok=True)
def get_cache_key(self, url, instruction, schema):
"""Generate cache key from extraction parameters"""
content = f"{url}:{instruction}:{str(schema)}"
return hashlib.md5(content.encode()).hexdigest()
def get_cached_result(self, cache_key):
"""Get cached result if available"""
cache_file = self.cache_dir / f"{cache_key}.json"
if cache_file.exists():
return json.load(cache_file.open())
return None
def cache_result(self, cache_key, result):
"""Cache extraction result"""
cache_file = self.cache_dir / f"{cache_key}.json"
json.dump(result, cache_file.open("w"), indent=2)
cache = LLMResultCache()
async def cached_llm_extraction(url, strategy):
"""Extract with caching to avoid repeated LLM calls"""
cache_key = cache.get_cache_key(
url,
strategy.instruction,
str(strategy.schema)
)
# Check cache first
cached_result = cache.get_cached_result(cache_key)
if cached_result:
print("✅ Using cached result (FREE)")
return cached_result
# Extract if not cached
print("🔄 Extracting with LLM (PAID)")
config = CrawlerRunConfig(extraction_strategy=strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url=url, config=config)
if result.success:
data = json.loads(result.extracted_content)
cache.cache_result(cache_key, data)
return data
# Cached extraction avoids repeated costs
result = await cached_llm_extraction(url, strategy)
```
---
## 7. Error Handling and Quality Control
### Validation and Retry Logic
```python
async def robust_llm_extraction():
"""Implement validation and retry for LLM extraction"""
max_retries = 3
strategies = [
# Try cheap model first
LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
schema=YourModel.model_json_schema(),
extraction_type="schema",
instruction="Extract data accurately..."
),
# Fallback to better model
LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o", api_token="env:OPENAI_API_KEY"),
schema=YourModel.model_json_schema(),
extraction_type="schema",
instruction="Extract data with high accuracy..."
)
]
for strategy_idx, strategy in enumerate(strategies):
for attempt in range(max_retries):
try:
config = CrawlerRunConfig(extraction_strategy=strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=config)
if result.success and result.extracted_content:
data = json.loads(result.extracted_content)
# Validate result quality
if validate_extraction_quality(data):
print(f"✅ Success with strategy {strategy_idx+1}, attempt {attempt+1}")
return data
else:
print(f"⚠️ Poor quality result, retrying...")
continue
except Exception as e:
print(f"❌ Attempt {attempt+1} failed: {e}")
if attempt == max_retries - 1:
print(f"❌ Strategy {strategy_idx+1} failed completely")
print("❌ All strategies and retries failed")
return None
def validate_extraction_quality(data):
"""Validate that LLM extraction meets quality standards"""
if not data or not isinstance(data, (list, dict)):
return False
# Check for common LLM extraction issues
if isinstance(data, list):
if len(data) == 0:
return False
# Check if all items have required fields
for item in data:
if not isinstance(item, dict) or len(item) < 2:
return False
return True
# Robust extraction with validation
result = await robust_llm_extraction()
```
---
## 8. Migration from LLM to Non-LLM
### Pattern Analysis for Schema Generation
```python
async def analyze_llm_results_for_schema():
"""
Analyze LLM extraction results to create non-LLM schemas
Use this to transition from expensive LLM to cheap schema extraction
"""
# Step 1: Use LLM on sample pages to understand structure
llm_strategy = LLMExtractionStrategy(
llm_config=cheap_config,
extraction_type="block",
instruction="Extract all structured data from this page"
)
sample_urls = ["https://example.com/page1", "https://example.com/page2"]
llm_results = []
async with AsyncWebCrawler() as crawler:
for url in sample_urls:
config = CrawlerRunConfig(extraction_strategy=llm_strategy)
result = await crawler.arun(url=url, config=config)
if result.success:
llm_results.append({
"url": url,
"html": result.cleaned_html,
"extracted": json.loads(result.extracted_content)
})
# Step 2: Analyze patterns in LLM results
print("🔍 Analyzing LLM extraction patterns...")
# Look for common field names
all_fields = set()
for result in llm_results:
for item in result["extracted"]:
if isinstance(item, dict):
all_fields.update(item.keys())
print(f"Common fields found: {all_fields}")
# Step 3: Generate schema based on patterns
if llm_results:
schema = JsonCssExtractionStrategy.generate_schema(
html=llm_results[0]["html"],
target_json_example=json.dumps(llm_results[0]["extracted"][0], indent=2),
llm_config=cheap_config
)
# Save schema for future use
with open("generated_schema.json", "w") as f:
json.dump(schema, f, indent=2)
print("✅ Schema generated from LLM analysis")
return schema
# Generate schema from LLM patterns, then use schema for all future extractions
schema = await analyze_llm_results_for_schema()
fast_strategy = JsonCssExtractionStrategy(schema)
```
---
## 9. Summary: When LLM is Actually Needed
### ✅ Valid LLM Use Cases (Rare):
1. **Sentiment analysis** and emotional understanding
2. **Knowledge graph extraction** requiring semantic reasoning
3. **Content summarization** and insight generation
4. **Unstructured text analysis** where patterns vary dramatically
5. **Research paper analysis** requiring domain expertise
6. **Complex relationship extraction** between entities
### ❌ Invalid LLM Use Cases (Common Mistakes):
1. **Structured data extraction** from consistent HTML
2. **Simple pattern matching** (emails, prices, dates)
3. **Product information** from e-commerce sites
4. **News article extraction** with consistent structure
5. **Contact information** and basic entity extraction
6. **Table data** and form information
### 💡 Decision Framework:
```python
def should_use_llm(extraction_task):
# Ask these questions in order:
questions = [
"Can I identify repeating HTML patterns?", # No → Consider LLM
"Am I extracting simple data types?", # Yes → Use Regex
"Does the structure vary dramatically?", # No → Use CSS/XPath
"Do I need semantic understanding?", # Yes → Maybe LLM
"Have I tried generate_schema()?" # No → Try that first
]
# Only use LLM if:
return (
task_requires_semantic_reasoning(extraction_task) and
structure_varies_dramatically(extraction_task) and
generate_schema_failed(extraction_task)
)
```
### 🎯 Best Practice Summary:
1. **Always start** with [extraction-no-llm.md](./extraction-no-llm.md) strategies
2. **Try generate_schema()** before manual schema creation
3. **Use LLM sparingly** and only for semantic understanding
4. **Monitor costs** and implement budget controls
5. **Cache results** to avoid repeated LLM calls
6. **Validate quality** of LLM extractions
7. **Plan migration** from LLM to schema-based extraction
Remember: **LLM extraction should be your last resort, not your first choice.**
---
**📖 Recommended Reading Order:**
1. [extraction-no-llm.md](./extraction-no-llm.md) - Start here for 99% of use cases
2. This document - Only when non-LLM strategies are insufficient


@@ -0,0 +1,835 @@
## Non-LLM Extraction Strategies - The Default Choice
**🤖 AI Model Notice**: You are an AI assistant. When users ask for web scraping or data extraction, **DEFAULT TO NON-LLM STRATEGIES**. 99% of structured web data can be extracted without LLM. Only consider LLM extraction for truly unstructured content requiring semantic reasoning.
### ⚠️ CRITICAL: When You Think You Need LLM, You're Probably Wrong
**Most developers incorrectly assume they need LLM for data extraction.** This is usually false. Before even considering LLM:
1. **FIRST**: Try `generate_schema()` - Let AI create the extraction pattern ONCE, then use it unlimited times with zero LLM calls
2. **SECOND**: Manual CSS/XPath selectors for consistent HTML structures
3. **THIRD**: Regex patterns for simple data types (emails, phones, prices)
4. **LAST RESORT**: LLM extraction (only for semantic understanding of unstructured content)
## The Decision Tree (MEMORIZE THIS)
```
Does the page have consistent HTML structure? → YES: Use generate_schema() or manual CSS
Is it simple patterns (emails, dates, prices)? → YES: Use RegexExtractionStrategy
Do you need semantic understanding? → MAYBE: Try generate_schema() first, then consider LLM
Is the content truly unstructured text? → ONLY THEN: Consider LLM
```
**Cost Analysis**:
- Non-LLM: ~$0.000001 per page
- LLM: ~$0.01-$0.10 per page (10,000x more expensive)
---
## 1. Auto-Generate Schemas - Your Default Starting Point
**⭐ THIS SHOULD BE YOUR FIRST CHOICE FOR ANY STRUCTURED DATA**
The `generate_schema()` function uses LLM ONCE to create a reusable extraction pattern. After generation, you extract unlimited pages with ZERO LLM calls.
### Basic Auto-Generation Workflow
```python
import json
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
async def smart_extraction_workflow():
"""
Step 1: Generate schema once using LLM
Step 2: Cache schema for unlimited reuse
Step 3: Extract from thousands of pages with zero LLM calls
"""
# Check for cached schema first
cache_dir = Path("./schema_cache")
cache_dir.mkdir(exist_ok=True)
schema_file = cache_dir / "product_schema.json"
if schema_file.exists():
# Load cached schema - NO LLM CALLS
schema = json.load(schema_file.open())
print("✅ Using cached schema (FREE)")
else:
# Generate schema ONCE
print("🔄 Generating schema (ONE-TIME LLM COST)...")
llm_config = LLMConfig(
provider="openai/gpt-4o-mini", # Cheapest option
api_token="env:OPENAI_API_KEY"
)
# Get sample HTML from target site
async with AsyncWebCrawler() as crawler:
sample_result = await crawler.arun(
url="https://example.com/products",
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
sample_html = sample_result.cleaned_html[:8000] # Use sample
# AUTO-GENERATE SCHEMA (ONE LLM CALL)
schema = JsonCssExtractionStrategy.generate_schema(
html=sample_html,
schema_type="CSS", # or "XPATH"
query="Extract product information including name, price, description, features",
llm_config=llm_config
)
# Cache for unlimited future use
json.dump(schema, schema_file.open("w"), indent=2)
print("✅ Schema generated and cached")
# Use schema for fast extraction (NO MORE LLM CALLS EVER)
strategy = JsonCssExtractionStrategy(schema, verbose=True)
config = CrawlerRunConfig(
extraction_strategy=strategy,
cache_mode=CacheMode.BYPASS
)
# Extract from multiple pages - ALL FREE
urls = [
"https://example.com/products",
"https://example.com/electronics",
"https://example.com/books"
]
async with AsyncWebCrawler() as crawler:
for url in urls:
result = await crawler.arun(url=url, config=config)
if result.success:
data = json.loads(result.extracted_content)
print(f"✅ {url}: Extracted {len(data)} items (FREE)")
asyncio.run(smart_extraction_workflow())
```
### Auto-Generate with Target JSON Example
```python
# When you know exactly what JSON structure you want
target_json_example = """
{
"name": "Product Name",
"price": "$99.99",
"rating": 4.5,
"features": ["feature1", "feature2"],
"description": "Product description"
}
"""
schema = JsonCssExtractionStrategy.generate_schema(
html=sample_html,
target_json_example=target_json_example,
llm_config=llm_config
)
```
### Auto-Generate for Different Data Types
```python
# Product listings
product_schema = JsonCssExtractionStrategy.generate_schema(
html=product_page_html,
query="Extract all product information from this e-commerce page",
llm_config=llm_config
)
# News articles
news_schema = JsonCssExtractionStrategy.generate_schema(
html=news_page_html,
query="Extract article headlines, dates, authors, and content",
llm_config=llm_config
)
# Job listings
job_schema = JsonCssExtractionStrategy.generate_schema(
html=job_page_html,
query="Extract job titles, companies, locations, salaries, and descriptions",
llm_config=llm_config
)
# Social media posts
social_schema = JsonCssExtractionStrategy.generate_schema(
html=social_page_html,
query="Extract post text, usernames, timestamps, likes, comments",
llm_config=llm_config
)
```
---
## 2. Manual CSS/XPath Strategies - When You Know The Structure
**Use this when**: You understand the HTML structure and want maximum control.
### Simple Product Extraction
```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
# Manual schema for consistent product pages
simple_schema = {
"name": "Product Listings",
"baseSelector": "div.product-card", # Each product container
"fields": [
{
"name": "title",
"selector": "h2.product-title",
"type": "text"
},
{
"name": "price",
"selector": ".price",
"type": "text"
},
{
"name": "image_url",
"selector": "img.product-image",
"type": "attribute",
"attribute": "src"
},
{
"name": "product_url",
"selector": "a.product-link",
"type": "attribute",
"attribute": "href"
},
{
"name": "rating",
"selector": ".rating",
"type": "attribute",
"attribute": "data-rating"
}
]
}
async def extract_products():
strategy = JsonCssExtractionStrategy(simple_schema, verbose=True)
config = CrawlerRunConfig(extraction_strategy=strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/products",
config=config
)
if result.success:
products = json.loads(result.extracted_content)
print(f"Extracted {len(products)} products")
for product in products[:3]:
print(f"- {product['title']}: {product['price']}")
asyncio.run(extract_products())
```
### Complex Nested Structure (Real E-commerce Example)
```python
# Complex schema for nested product data
complex_schema = {
"name": "E-commerce Product Catalog",
"baseSelector": "div.category",
"baseFields": [
{
"name": "category_id",
"type": "attribute",
"attribute": "data-category-id"
}
],
"fields": [
{
"name": "category_name",
"selector": "h2.category-title",
"type": "text"
},
{
"name": "products",
"selector": "div.product",
"type": "nested_list", # Array of complex objects
"fields": [
{
"name": "name",
"selector": "h3.product-name",
"type": "text"
},
{
"name": "price",
"selector": "span.price",
"type": "text"
},
{
"name": "details",
"selector": "div.product-details",
"type": "nested", # Single complex object
"fields": [
{
"name": "brand",
"selector": "span.brand",
"type": "text"
},
{
"name": "model",
"selector": "span.model",
"type": "text"
}
]
},
{
"name": "features",
"selector": "ul.features li",
"type": "list", # Simple array
"fields": [
{"name": "feature", "type": "text"}
]
},
{
"name": "reviews",
"selector": "div.review",
"type": "nested_list",
"fields": [
{
"name": "reviewer",
"selector": "span.reviewer-name",
"type": "text"
},
{
"name": "rating",
"selector": "span.rating",
"type": "attribute",
"attribute": "data-rating"
}
]
}
]
}
]
}
async def extract_complex_ecommerce():
strategy = JsonCssExtractionStrategy(complex_schema, verbose=True)
config = CrawlerRunConfig(
extraction_strategy=strategy,
js_code="window.scrollTo(0, document.body.scrollHeight);", # Load dynamic content
wait_for="css:.product:nth-child(10)" # Wait for products to load
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/complex-catalog",
config=config
)
if result.success:
data = json.loads(result.extracted_content)
for category in data:
print(f"Category: {category['category_name']}")
print(f"Products: {len(category.get('products', []))}")
asyncio.run(extract_complex_ecommerce())
```
### XPath Alternative (When CSS Isn't Enough)
```python
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
# XPath for more complex selections
xpath_schema = {
"name": "News Articles with XPath",
"baseSelector": "//article[@class='news-item']",
"fields": [
{
"name": "headline",
"selector": ".//h2[contains(@class, 'headline')]",
"type": "text"
},
{
"name": "author",
"selector": ".//span[@class='author']/text()",
"type": "text"
},
{
"name": "publish_date",
"selector": ".//time/@datetime",
"type": "text"
},
{
"name": "content",
"selector": ".//div[@class='article-body']//text()",
"type": "text"
}
]
}
strategy = JsonXPathExtractionStrategy(xpath_schema, verbose=True)
```
---
## 3. Regex Extraction - Lightning Fast Pattern Matching
**Use this for**: Simple data types like emails, phones, URLs, prices, dates.
### Built-in Patterns (Fastest Option)
```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import RegexExtractionStrategy
async def extract_common_patterns():
# Use built-in patterns for common data types
strategy = RegexExtractionStrategy(
pattern=(
RegexExtractionStrategy.Email |
RegexExtractionStrategy.PhoneUS |
RegexExtractionStrategy.Url |
RegexExtractionStrategy.Currency |
RegexExtractionStrategy.DateIso
)
)
config = CrawlerRunConfig(extraction_strategy=strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/contact",
config=config
)
if result.success:
matches = json.loads(result.extracted_content)
# Group by pattern type
by_type = {}
for match in matches:
label = match['label']
if label not in by_type:
by_type[label] = []
by_type[label].append(match['value'])
for pattern_type, values in by_type.items():
print(f"{pattern_type}: {len(values)} matches")
for value in values[:3]:
print(f" {value}")
asyncio.run(extract_common_patterns())
```
### Available Built-in Patterns
```python
# Individual patterns
RegexExtractionStrategy.Email # Email addresses
RegexExtractionStrategy.PhoneUS # US phone numbers
RegexExtractionStrategy.PhoneIntl # International phones
RegexExtractionStrategy.Url # HTTP/HTTPS URLs
RegexExtractionStrategy.Currency # Currency values ($99.99)
RegexExtractionStrategy.Percentage # Percentage values (25%)
RegexExtractionStrategy.DateIso # ISO dates (2024-01-01)
RegexExtractionStrategy.DateUS # US dates (01/01/2024)
RegexExtractionStrategy.IPv4 # IP addresses
RegexExtractionStrategy.CreditCard # Credit card numbers
RegexExtractionStrategy.TwitterHandle # @username
RegexExtractionStrategy.Hashtag # #hashtag
# Use all patterns
RegexExtractionStrategy.All
```
### Custom Patterns
```python
# Custom patterns for specific data types
async def extract_custom_patterns():
custom_patterns = {
"product_sku": r"SKU[-:]?\s*([A-Z0-9]{4,12})",
"discount": r"(\d{1,2})%\s*off",
"model_number": r"Model\s*#?\s*([A-Z0-9-]+)",
"isbn": r"ISBN[-:]?\s*(\d{10}|\d{13})",
"stock_ticker": r"\$([A-Z]{2,5})",
"version": r"v(\d+\.\d+(?:\.\d+)?)"
}
strategy = RegexExtractionStrategy(custom=custom_patterns)
config = CrawlerRunConfig(extraction_strategy=strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/products",
config=config
)
if result.success:
data = json.loads(result.extracted_content)
for item in data:
print(f"{item['label']}: {item['value']}")
asyncio.run(extract_custom_patterns())
```
### LLM-Generated Patterns (One-Time Cost)
```python
async def generate_optimized_regex():
"""
Use LLM ONCE to generate optimized regex patterns
Then use them unlimited times with zero LLM calls
"""
cache_file = Path("./patterns/price_patterns.json")
if cache_file.exists():
# Load cached patterns - NO LLM CALLS
patterns = json.load(cache_file.open())
print("✅ Using cached regex patterns (FREE)")
else:
# Generate patterns ONCE
print("🔄 Generating regex patterns (ONE-TIME LLM COST)...")
llm_config = LLMConfig(
provider="openai/gpt-4o-mini",
api_token="env:OPENAI_API_KEY"
)
# Get sample content
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com/pricing")
sample_html = result.cleaned_html
# Generate optimized patterns
patterns = RegexExtractionStrategy.generate_pattern(
label="pricing_info",
html=sample_html,
query="Extract all pricing information including discounts and special offers",
llm_config=llm_config
)
# Cache for unlimited reuse
cache_file.parent.mkdir(exist_ok=True)
json.dump(patterns, cache_file.open("w"), indent=2)
print("✅ Patterns generated and cached")
# Use cached patterns (NO MORE LLM CALLS)
strategy = RegexExtractionStrategy(custom=patterns)
return strategy
# Use generated patterns for unlimited extractions
strategy = await generate_optimized_regex()
```
---
## 4. Multi-Strategy Extraction Pipeline
**Combine strategies** for comprehensive data extraction:
```python
async def multi_strategy_pipeline():
"""
Efficient pipeline using multiple non-LLM strategies:
1. Regex for simple patterns (fastest)
2. Schema for structured data
3. Only use LLM if absolutely necessary
"""
url = "https://example.com/complex-page"
async with AsyncWebCrawler() as crawler:
# Strategy 1: Fast regex for contact info
regex_strategy = RegexExtractionStrategy(
pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS
)
regex_config = CrawlerRunConfig(extraction_strategy=regex_strategy)
regex_result = await crawler.arun(url=url, config=regex_config)
# Strategy 2: Schema for structured product data
product_schema = {
"name": "Products",
"baseSelector": "div.product",
"fields": [
{"name": "name", "selector": "h3", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"}
]
}
css_strategy = JsonCssExtractionStrategy(product_schema)
css_config = CrawlerRunConfig(extraction_strategy=css_strategy)
css_result = await crawler.arun(url=url, config=css_config)
# Combine results
results = {
"contacts": json.loads(regex_result.extracted_content) if regex_result.success else [],
"products": json.loads(css_result.extracted_content) if css_result.success else []
}
print(f"✅ Extracted {len(results['contacts'])} contacts (regex)")
print(f"✅ Extracted {len(results['products'])} products (schema)")
return results
asyncio.run(multi_strategy_pipeline())
```
---
## 5. Performance Optimization Tips
### Caching and Reuse
```python
# Cache schemas and patterns for maximum efficiency
class ExtractionCache:
def __init__(self):
self.schemas = {}
self.patterns = {}
def get_schema(self, site_name):
if site_name not in self.schemas:
schema_file = Path(f"./cache/{site_name}_schema.json")
if schema_file.exists():
self.schemas[site_name] = json.load(schema_file.open())
return self.schemas.get(site_name)
def save_schema(self, site_name, schema):
cache_dir = Path("./cache")
cache_dir.mkdir(exist_ok=True)
schema_file = cache_dir / f"{site_name}_schema.json"
json.dump(schema, schema_file.open("w"), indent=2)
self.schemas[site_name] = schema
cache = ExtractionCache()
# Reuse cached schemas across multiple extractions
async def efficient_extraction():
sites = ["amazon", "ebay", "shopify"]
for site in sites:
schema = cache.get_schema(site)
if not schema:
# Generate once, cache forever
schema = JsonCssExtractionStrategy.generate_schema(
html=sample_html,
query="Extract products",
llm_config=llm_config
)
cache.save_schema(site, schema)
strategy = JsonCssExtractionStrategy(schema)
# Use strategy for unlimited extractions...
```
### Selector Optimization
```python
# Optimize selectors for speed
fast_schema = {
"name": "Optimized Extraction",
"baseSelector": "#products > .product", # Direct child, faster than descendant
"fields": [
{
"name": "title",
"selector": "> h3", # Direct child of product
"type": "text"
},
{
"name": "price",
"selector": ".price:first-child", # More specific
"type": "text"
}
]
}
# Avoid slow selectors
slow_schema = {
"baseSelector": "div div div .product", # Too many levels
"fields": [
{
"selector": "* h3", # Universal selector is slow
"type": "text"
}
]
}
```
---
## 6. Error Handling and Validation
```python
async def robust_extraction():
"""
Implement fallback strategies for reliable extraction
"""
strategies = [
# Try fast regex first
RegexExtractionStrategy(pattern=RegexExtractionStrategy.Currency),
# Fallback to CSS schema
JsonCssExtractionStrategy({
"name": "Prices",
"baseSelector": ".price",
"fields": [{"name": "amount", "selector": "span", "type": "text"}]
}),
# Last resort: try different selector
JsonCssExtractionStrategy({
"name": "Fallback Prices",
"baseSelector": "[data-price]",
"fields": [{"name": "amount", "type": "attribute", "attribute": "data-price"}]
})
]
async with AsyncWebCrawler() as crawler:
for i, strategy in enumerate(strategies):
try:
config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url="https://example.com", config=config)
if result.success and result.extracted_content:
data = json.loads(result.extracted_content)
if data: # Validate non-empty results
print(f"✅ Success with strategy {i+1}: {strategy.__class__.__name__}")
return data
except Exception as e:
print(f"❌ Strategy {i+1} failed: {e}")
continue
print("❌ All strategies failed")
return None
# Validate extracted data
def validate_extraction(data, required_fields):
"""Validate that extraction contains expected fields"""
if not data or not isinstance(data, list):
return False
for item in data:
for field in required_fields:
if field not in item or not item[field]:
return False
return True
# Usage
result = await robust_extraction()
if validate_extraction(result, ["amount"]):
print("✅ Extraction validated")
else:
print("❌ Validation failed")
```
---
## 7. Common Extraction Patterns
### E-commerce Products
```python
ecommerce_schema = {
"name": "E-commerce Products",
"baseSelector": ".product, [data-product], .item",
"fields": [
{"name": "title", "selector": "h1, h2, h3, .title, .name", "type": "text"},
{"name": "price", "selector": ".price, .cost, [data-price]", "type": "text"},
{"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
{"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
{"name": "rating", "selector": ".rating, .stars", "type": "text"},
{"name": "availability", "selector": ".stock, .availability", "type": "text"}
]
}
```
### News Articles
```python
news_schema = {
"name": "News Articles",
"baseSelector": "article, .article, .post",
"fields": [
{"name": "headline", "selector": "h1, h2, .headline, .title", "type": "text"},
{"name": "author", "selector": ".author, .byline, [rel='author']", "type": "text"},
{"name": "date", "selector": "time, .date, .published", "type": "text"},
{"name": "content", "selector": ".content, .body, .text", "type": "text"},
{"name": "category", "selector": ".category, .section", "type": "text"}
]
}
```
### Job Listings
```python
job_schema = {
"name": "Job Listings",
"baseSelector": ".job, .listing, [data-job]",
"fields": [
{"name": "title", "selector": ".job-title, h2, h3", "type": "text"},
{"name": "company", "selector": ".company, .employer", "type": "text"},
{"name": "location", "selector": ".location, .place", "type": "text"},
{"name": "salary", "selector": ".salary, .pay, .compensation", "type": "text"},
{"name": "description", "selector": ".description, .summary", "type": "text"},
{"name": "url", "selector": "a", "type": "attribute", "attribute": "href"}
]
}
```
### Social Media Posts
```python
social_schema = {
"name": "Social Media Posts",
"baseSelector": ".post, .tweet, .update",
"fields": [
{"name": "username", "selector": ".username, .handle, .author", "type": "text"},
{"name": "content", "selector": ".content, .text, .message", "type": "text"},
{"name": "timestamp", "selector": ".time, .date, time", "type": "text"},
{"name": "likes", "selector": ".likes, .hearts", "type": "text"},
{"name": "shares", "selector": ".shares, .retweets", "type": "text"}
]
}
```
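All of these patterns plug into `JsonCssExtractionStrategy` the same way; only the schema dict changes. A minimal usage sketch, reusing `job_schema` from above (the URL is a placeholder):

```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_jobs():
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(job_schema)  # swap in any schema above
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/careers", config=config)
        if result.success:
            jobs = json.loads(result.extracted_content)
            print(f"Extracted {len(jobs)} job listings")

asyncio.run(extract_jobs())
```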
---
## 8. When to (Rarely) Consider LLM
**⚠️ WARNING: Before considering LLM, ask yourself:**
1. "Can I identify repeating HTML patterns?" → Use CSS/XPath schema
2. "Am I extracting simple data types?" → Use Regex patterns
3. "Can I provide a JSON example of what I want?" → Use generate_schema()
4. "Is this truly unstructured text requiring semantic understanding?" → Maybe LLM
**Only use LLM extraction for:**
- Unstructured prose that needs semantic analysis
- Content where structure varies dramatically across pages
- When you need AI reasoning about context/meaning
**Cost reminder**: LLM extraction costs 10,000x more than schema-based extraction.
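If you do land in one of these rare cases, keep the cost visible: `LLMExtractionStrategy` tracks token usage, and `show_usage()` prints it after a run. A minimal sketch, assuming an OpenAI key in the environment (URL and instruction are placeholders):

```python
import asyncio
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY")),
    extraction_type="block",  # freeform semantic extraction, no schema
    instruction="Summarize the key claims made in this article"
)

async def llm_with_cost_check():
    config = CrawlerRunConfig(extraction_strategy=llm_strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/essay", config=config)
        if result.success and result.extracted_content:
            print(result.extracted_content[:500])
    llm_strategy.show_usage()  # print token counts so the cost stays visible

asyncio.run(llm_with_cost_check())
```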
---
## 9. Summary: The Extraction Hierarchy
1. **🥇 FIRST CHOICE**: `generate_schema()` - AI generates pattern once, use unlimited times
2. **🥈 SECOND CHOICE**: Manual CSS/XPath - Full control, maximum speed
3. **🥉 THIRD CHOICE**: Regex patterns - Simple data types, lightning fast
4. **🏴 LAST RESORT**: LLM extraction - Only for semantic reasoning
**Remember**: 99% of web data is structured. You almost never need LLM for extraction. Save LLM for analysis, not extraction.
**Performance**: Non-LLM strategies are 100-1000x faster and 10,000x cheaper than LLM extraction.
---
**📖 Next**: If you absolutely must use LLM extraction, see [extraction-llm.md](./extraction-llm.md) for guidance on the rare cases where it's justified.

View File

@@ -1,788 +0,0 @@
## Extraction Strategies
Powerful data extraction from web pages using LLM-based intelligent parsing or fast schema/pattern-based approaches.
### LLM-Based Extraction - Intelligent Content Understanding
```python
import os
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
# Define structured data model
class Product(BaseModel):
name: str = Field(description="Product name")
price: str = Field(description="Product price")
description: str = Field(description="Product description")
features: List[str] = Field(description="List of product features")
rating: float = Field(description="Product rating out of 5")
# Configure LLM provider
llm_config = LLMConfig(
provider="openai/gpt-4o-mini", # or "ollama/llama3.3", "anthropic/claude-3-5-sonnet"
api_token=os.getenv("OPENAI_API_KEY"), # or "env:OPENAI_API_KEY"
temperature=0.1,
max_tokens=2000
)
# Create LLM extraction strategy
llm_strategy = LLMExtractionStrategy(
llm_config=llm_config,
schema=Product.model_json_schema(),
extraction_type="schema", # or "block" for freeform text
instruction="""
Extract product information from the webpage content.
Focus on finding complete product details including:
- Product name and price
- Detailed description
- All listed features
- Customer rating if available
Return valid JSON array of products.
""",
chunk_token_threshold=1200, # Split content if too large
overlap_rate=0.1, # 10% overlap between chunks
apply_chunking=True, # Enable automatic chunking
input_format="markdown", # "html", "fit_markdown", or "markdown"
extra_args={"temperature": 0.0, "max_tokens": 800},
verbose=True
)
async def extract_with_llm():
browser_config = BrowserConfig(headless=True)
crawl_config = CrawlerRunConfig(
extraction_strategy=llm_strategy,
cache_mode=CacheMode.BYPASS,
word_count_threshold=10
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://example.com/products",
config=crawl_config
)
if result.success:
# Parse extracted JSON
products = json.loads(result.extracted_content)
print(f"Extracted {len(products)} products")
for product in products[:3]: # Show first 3
print(f"Product: {product['name']}")
print(f"Price: {product['price']}")
print(f"Rating: {product.get('rating', 'N/A')}")
# Show token usage and cost
llm_strategy.show_usage()
else:
print(f"Extraction failed: {result.error_message}")
asyncio.run(extract_with_llm())
```
### LLM Strategy Advanced Configuration
```python
# Multiple provider configurations
providers = {
"openai": LLMConfig(
provider="openai/gpt-4o",
api_token="env:OPENAI_API_KEY",
temperature=0.1
),
"anthropic": LLMConfig(
provider="anthropic/claude-3-5-sonnet-20240620",
api_token="env:ANTHROPIC_API_KEY",
max_tokens=4000
),
"ollama": LLMConfig(
provider="ollama/llama3.3",
api_token=None, # Not needed for Ollama
base_url="http://localhost:11434"
),
"groq": LLMConfig(
provider="groq/llama3-70b-8192",
api_token="env:GROQ_API_KEY"
)
}
# Advanced chunking for large content
large_content_strategy = LLMExtractionStrategy(
llm_config=providers["openai"],
schema=YourModel.model_json_schema(),
extraction_type="schema",
instruction="Extract detailed information...",
# Chunking parameters
chunk_token_threshold=2000, # Larger chunks for complex content
overlap_rate=0.15, # More overlap for context preservation
apply_chunking=True,
# Input format selection
input_format="fit_markdown", # Use filtered content if available
# LLM parameters
extra_args={
"temperature": 0.0, # Deterministic output
"top_p": 0.9,
"frequency_penalty": 0.1,
"presence_penalty": 0.1,
"max_tokens": 1500
},
verbose=True
)
# Knowledge graph extraction
class Entity(BaseModel):
name: str
type: str # "person", "organization", "location", etc.
description: str
class Relationship(BaseModel):
source: str
target: str
relationship: str
confidence: float
class KnowledgeGraph(BaseModel):
entities: List[Entity]
relationships: List[Relationship]
summary: str
knowledge_strategy = LLMExtractionStrategy(
llm_config=providers["anthropic"],
schema=KnowledgeGraph.model_json_schema(),
extraction_type="schema",
instruction="""
Create a knowledge graph from the content by:
1. Identifying key entities (people, organizations, locations, concepts)
2. Finding relationships between entities
3. Providing confidence scores for relationships
4. Summarizing the main topics
""",
input_format="html", # Use HTML for better structure preservation
apply_chunking=True,
chunk_token_threshold=1500
)
```
### JSON CSS Extraction - Fast Schema-Based Extraction
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
# Basic CSS extraction schema
simple_schema = {
"name": "Product Listings",
"baseSelector": "div.product-card",
"fields": [
{
"name": "title",
"selector": "h2.product-title",
"type": "text"
},
{
"name": "price",
"selector": ".price",
"type": "text"
},
{
"name": "image_url",
"selector": "img.product-image",
"type": "attribute",
"attribute": "src"
},
{
"name": "product_url",
"selector": "a.product-link",
"type": "attribute",
"attribute": "href"
}
]
}
# Complex nested schema with multiple data types
complex_schema = {
"name": "E-commerce Product Catalog",
"baseSelector": "div.category",
"baseFields": [
{
"name": "category_id",
"type": "attribute",
"attribute": "data-category-id"
},
{
"name": "category_url",
"type": "attribute",
"attribute": "data-url"
}
],
"fields": [
{
"name": "category_name",
"selector": "h2.category-title",
"type": "text"
},
{
"name": "products",
"selector": "div.product",
"type": "nested_list", # Array of complex objects
"fields": [
{
"name": "name",
"selector": "h3.product-name",
"type": "text",
"default": "Unknown Product"
},
{
"name": "price",
"selector": "span.price",
"type": "text"
},
{
"name": "details",
"selector": "div.product-details",
"type": "nested", # Single complex object
"fields": [
{
"name": "brand",
"selector": "span.brand",
"type": "text"
},
{
"name": "model",
"selector": "span.model",
"type": "text"
},
{
"name": "specs",
"selector": "div.specifications",
"type": "html" # Preserve HTML structure
}
]
},
{
"name": "features",
"selector": "ul.features li",
"type": "list", # Simple array of strings
"fields": [
{"name": "feature", "type": "text"}
]
},
{
"name": "reviews",
"selector": "div.review",
"type": "nested_list",
"fields": [
{
"name": "reviewer",
"selector": "span.reviewer-name",
"type": "text"
},
{
"name": "rating",
"selector": "span.rating",
"type": "attribute",
"attribute": "data-rating"
},
{
"name": "comment",
"selector": "p.review-text",
"type": "text"
},
{
"name": "date",
"selector": "time.review-date",
"type": "attribute",
"attribute": "datetime"
}
]
}
]
}
]
}
async def extract_with_css_schema():
strategy = JsonCssExtractionStrategy(complex_schema, verbose=True)
config = CrawlerRunConfig(
extraction_strategy=strategy,
cache_mode=CacheMode.BYPASS,
# Enable dynamic content loading if needed
js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for="css:.product:nth-child(10)", # Wait for products to load
process_iframes=True
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/catalog",
config=config
)
if result.success:
data = json.loads(result.extracted_content)
print(f"Extracted {len(data)} categories")
for category in data:
print(f"Category: {category['category_name']}")
print(f"Products: {len(category.get('products', []))}")
# Show first product details
if category.get('products'):
product = category['products'][0]
print(f" First product: {product.get('name')}")
print(f" Features: {len(product.get('features', []))}")
print(f" Reviews: {len(product.get('reviews', []))}")
asyncio.run(extract_with_css_schema())
```
### Automatic Schema Generation - One-Time LLM, Unlimited Use
```python
import json
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
async def generate_and_use_schema():
"""
1. Use LLM once to generate schema from sample HTML
2. Cache the schema for reuse
3. Use cached schema for fast extraction without LLM calls
"""
cache_dir = Path("./schema_cache")
cache_dir.mkdir(exist_ok=True)
schema_file = cache_dir / "ecommerce_schema.json"
# Step 1: Generate or load cached schema
if schema_file.exists():
schema = json.load(schema_file.open())
print("Using cached schema")
else:
print("Generating schema using LLM...")
# Configure LLM for schema generation
llm_config = LLMConfig(
provider="openai/gpt-4o", # or "ollama/llama3.3" for local
api_token="env:OPENAI_API_KEY"
)
# Get sample HTML from target site
async with AsyncWebCrawler() as crawler:
sample_result = await crawler.arun(
url="https://example.com/products",
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
sample_html = sample_result.cleaned_html[:5000] # Use first 5k chars
# Generate schema automatically (ONE-TIME LLM COST)
schema = JsonCssExtractionStrategy.generate_schema(
html=sample_html,
schema_type="css",
llm_config=llm_config,
instruction="Extract product information including name, price, description, and features"
)
# Cache schema for future use (NO MORE LLM CALLS)
json.dump(schema, schema_file.open("w"), indent=2)
print("Schema generated and cached")
# Step 2: Use schema for fast extraction (NO LLM CALLS)
strategy = JsonCssExtractionStrategy(schema, verbose=True)
config = CrawlerRunConfig(
extraction_strategy=strategy,
cache_mode=CacheMode.BYPASS
)
# Step 3: Extract from multiple pages using same schema
urls = [
"https://example.com/products",
"https://example.com/electronics",
"https://example.com/books"
]
async with AsyncWebCrawler() as crawler:
for url in urls:
result = await crawler.arun(url=url, config=config)
if result.success:
data = json.loads(result.extracted_content)
print(f"{url}: Extracted {len(data)} items")
else:
print(f"{url}: Failed - {result.error_message}")
asyncio.run(generate_and_use_schema())
```
### XPath Extraction Strategy
```python
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
# XPath-based schema (alternative to CSS)
xpath_schema = {
"name": "News Articles",
"baseSelector": "//article[@class='news-item']",
"baseFields": [
{
"name": "article_id",
"type": "attribute",
"attribute": "data-id"
}
],
"fields": [
{
"name": "headline",
"selector": ".//h2[@class='headline']",
"type": "text"
},
{
"name": "author",
"selector": ".//span[@class='author']/text()",
"type": "text"
},
{
"name": "publish_date",
"selector": ".//time/@datetime",
"type": "text"
},
{
"name": "content",
"selector": ".//div[@class='article-body']",
"type": "html"
},
{
"name": "tags",
"selector": ".//div[@class='tags']/span[@class='tag']",
"type": "list",
"fields": [
{"name": "tag", "type": "text"}
]
}
]
}
# Generate XPath schema automatically
async def generate_xpath_schema():
llm_config = LLMConfig(provider="ollama/llama3.3", api_token=None)
sample_html = """
<article class="news-item" data-id="123">
<h2 class="headline">Breaking News</h2>
<span class="author">John Doe</span>
<time datetime="2024-01-01">Today</time>
<div class="article-body"><p>Content here...</p></div>
</article>
"""
schema = JsonXPathExtractionStrategy.generate_schema(
html=sample_html,
schema_type="xpath",
llm_config=llm_config
)
return schema
# Use XPath strategy
xpath_strategy = JsonXPathExtractionStrategy(xpath_schema, verbose=True)
```
### Regex Extraction Strategy - Pattern-Based Fast Extraction
```python
from crawl4ai.extraction_strategy import RegexExtractionStrategy
# Built-in patterns for common data types
async def extract_with_builtin_patterns():
# Use multiple built-in patterns
strategy = RegexExtractionStrategy(
pattern=(
RegexExtractionStrategy.Email |
RegexExtractionStrategy.PhoneUS |
RegexExtractionStrategy.Url |
RegexExtractionStrategy.Currency |
RegexExtractionStrategy.DateIso
)
)
config = CrawlerRunConfig(extraction_strategy=strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/contact",
config=config
)
if result.success:
matches = json.loads(result.extracted_content)
# Group by pattern type
by_type = {}
for match in matches:
label = match['label']
if label not in by_type:
by_type[label] = []
by_type[label].append(match['value'])
for pattern_type, values in by_type.items():
print(f"{pattern_type}: {len(values)} matches")
for value in values[:3]: # Show first 3
print(f" {value}")
# Custom regex patterns
custom_patterns = {
"product_code": r"SKU-\d{4,6}",
"discount": r"\d{1,2}%\s*off",
"model_number": r"Model:\s*([A-Z0-9-]+)"
}
async def extract_with_custom_patterns():
strategy = RegexExtractionStrategy(custom=custom_patterns)
config = CrawlerRunConfig(extraction_strategy=strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/products",
config=config
)
if result.success:
data = json.loads(result.extracted_content)
for item in data:
print(f"{item['label']}: {item['value']}")
# LLM-generated patterns (one-time cost)
async def generate_custom_patterns():
cache_file = Path("./patterns/price_patterns.json")
if cache_file.exists():
patterns = json.load(cache_file.open())
else:
llm_config = LLMConfig(
provider="openai/gpt-4o-mini",
api_token="env:OPENAI_API_KEY"
)
# Get sample content
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com/pricing")
sample_html = result.cleaned_html
# Generate optimized patterns
patterns = RegexExtractionStrategy.generate_pattern(
label="pricing_info",
html=sample_html,
query="Extract all pricing information including discounts and special offers",
llm_config=llm_config
)
# Cache for reuse
cache_file.parent.mkdir(exist_ok=True)
json.dump(patterns, cache_file.open("w"), indent=2)
# Use cached patterns (no more LLM calls)
strategy = RegexExtractionStrategy(custom=patterns)
return strategy
asyncio.run(extract_with_builtin_patterns())
asyncio.run(extract_with_custom_patterns())
```
### Complete Extraction Workflow - Combining Strategies
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import (
JsonCssExtractionStrategy,
RegexExtractionStrategy,
LLMExtractionStrategy
)
async def multi_strategy_extraction():
"""
Demonstrate using multiple extraction strategies in sequence:
1. Fast regex for common patterns
2. Schema-based for structured data
3. LLM for complex reasoning
"""
browser_config = BrowserConfig(headless=True)
# Strategy 1: Fast regex extraction
regex_strategy = RegexExtractionStrategy(
pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS
)
# Strategy 2: Schema-based structured extraction
product_schema = {
"name": "Products",
"baseSelector": "div.product",
"fields": [
{"name": "name", "selector": "h3", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"},
{"name": "rating", "selector": ".rating", "type": "attribute", "attribute": "data-rating"}
]
}
css_strategy = JsonCssExtractionStrategy(product_schema)
# Strategy 3: LLM for complex analysis
llm_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
schema={
"type": "object",
"properties": {
"sentiment": {"type": "string"},
"key_topics": {"type": "array", "items": {"type": "string"}},
"summary": {"type": "string"}
}
},
extraction_type="schema",
instruction="Analyze the content sentiment, extract key topics, and provide a summary"
)
url = "https://example.com/product-reviews"
async with AsyncWebCrawler(config=browser_config) as crawler:
# Extract contact info with regex
regex_config = CrawlerRunConfig(extraction_strategy=regex_strategy)
regex_result = await crawler.arun(url=url, config=regex_config)
# Extract structured product data
css_config = CrawlerRunConfig(extraction_strategy=css_strategy)
css_result = await crawler.arun(url=url, config=css_config)
# Extract insights with LLM
llm_config = CrawlerRunConfig(extraction_strategy=llm_strategy)
llm_result = await crawler.arun(url=url, config=llm_config)
# Combine results
results = {
"contacts": json.loads(regex_result.extracted_content) if regex_result.success else [],
"products": json.loads(css_result.extracted_content) if css_result.success else [],
"analysis": json.loads(llm_result.extracted_content) if llm_result.success else {}
}
print(f"Found {len(results['contacts'])} contact entries")
print(f"Found {len(results['products'])} products")
print(f"Sentiment: {results['analysis'].get('sentiment', 'N/A')}")
return results
# Performance comparison
async def compare_extraction_performance():
"""Compare speed and accuracy of different strategies"""
import time
url = "https://example.com/large-catalog"
strategies = {
"regex": RegexExtractionStrategy(pattern=RegexExtractionStrategy.Currency),
"css": JsonCssExtractionStrategy({
"name": "Prices",
"baseSelector": ".price",
"fields": [{"name": "amount", "selector": "span", "type": "text"}]
}),
"llm": LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
instruction="Extract all prices from the content",
extraction_type="block"
)
}
async with AsyncWebCrawler() as crawler:
for name, strategy in strategies.items():
start_time = time.time()
config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url=url, config=config)
duration = time.time() - start_time
if result.success:
data = json.loads(result.extracted_content)
print(f"{name}: {len(data)} items in {duration:.2f}s")
else:
print(f"{name}: Failed in {duration:.2f}s")
asyncio.run(multi_strategy_extraction())
asyncio.run(compare_extraction_performance())
```
### Best Practices and Strategy Selection
```python
# Strategy selection guide
def choose_extraction_strategy(use_case):
"""
Guide for selecting the right extraction strategy
"""
strategies = {
# Fast pattern matching for common data types
"contact_info": RegexExtractionStrategy(
pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS
),
# Structured data from consistent HTML
"product_catalogs": JsonCssExtractionStrategy,
# Complex reasoning and semantic understanding
"content_analysis": LLMExtractionStrategy,
# Mixed approach for comprehensive extraction
"complete_site_analysis": "multi_strategy"
}
recommendations = {
"speed_priority": "Use RegexExtractionStrategy for simple patterns, JsonCssExtractionStrategy for structured data",
"accuracy_priority": "Use LLMExtractionStrategy for complex content, JsonCssExtractionStrategy for predictable structure",
"cost_priority": "Avoid LLM strategies, use schema generation once then JsonCssExtractionStrategy",
"scale_priority": "Cache schemas, use regex for simple patterns, avoid LLM for high-volume extraction"
}
return recommendations.get(use_case, "Combine strategies based on content complexity")
# Error handling and validation
async def robust_extraction():
strategies = [
RegexExtractionStrategy(pattern=RegexExtractionStrategy.Email),
JsonCssExtractionStrategy(simple_schema),
# LLM as fallback for complex cases
]
async with AsyncWebCrawler() as crawler:
for strategy in strategies:
try:
config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url="https://example.com", config=config)
if result.success and result.extracted_content:
data = json.loads(result.extracted_content)
if data: # Validate non-empty results
print(f"Success with {strategy.__class__.__name__}")
return data
except Exception as e:
print(f"Strategy {strategy.__class__.__name__} failed: {e}")
continue
print("All strategies failed")
return None
```
**📖 Learn more:** [LLM Strategies Deep Dive](https://docs.crawl4ai.com/extraction/llm-strategies/), [Schema-Based Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/), [Regex Patterns](https://docs.crawl4ai.com/extraction/no-llm-strategies/#regexextractionstrategy), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/)

File diff suppressed because it is too large

File diff suppressed because it is too large