feat: Major Chrome Extension overhaul with Click2Crawl, instant Schema extraction, and modular architecture

✨ New Features:
- Click2Crawl: Visual element selection with markdown conversion
  - Ctrl/Cmd+Click to select multiple elements
  - Visual text mode for WYSIWYG extraction
  - Real-time markdown preview with syntax highlighting
  - Export to .md file or clipboard
- Schema Builder Enhancement: Instant data extraction without LLMs
  - Test schemas directly in browser
  - See JSON results immediately
  - Export data or Python code
  - Cloud deployment ready (coming soon)
- Modular Architecture:
  - Separated into schemaBuilder.js, scriptBuilder.js, click2CrawlBuilder.js
  - Added contentAnalyzer.js and markdownConverter.js modules
  - Shared utilities and CSS reset system
  - Integrated marked.js for markdown rendering

🎨 UI/UX Improvements:
- Added edgy cloud announcement banner with seamless shimmer animation
- Direct, technical copy: "You don't need Puppeteer. You need Crawl4AI Cloud."
- Enhanced feature cards with emojis
- Fixed CSS conflicts with targeted reset approach
- Improved badge hover effects (red on hover)
- Added wrap toggle for code preview

📚 Documentation Updates:
- Split extraction diagrams into LLM and no-LLM versions
- Updated llms-full.txt with latest content
- Added versioned LLM context (v0.1.1)

🔧 Technical Enhancements:
- Refactored 3464 lines of monolithic content.js into modules
- Added proper event handling and cleanup
- Improved z-index management
- Better scroll position tracking for badges
- Enhanced error handling throughout

This release transforms the Chrome Extension from a simple tool into a powerful visual data extraction suite, making web scraping accessible to everyone.
2025-06-09 23:18:27 +08:00
parent 40640badad
commit 0ac12da9f3
25 changed files with 23686 additions and 6524 deletions
--- a/docs/md_v2/assets/llm.txt/txt/extraction-no-llm.txt
+++ b/docs/md_v2/assets/llm.txt/txt/extraction-no-llm.txt
@@ -0,0 +1,835 @@
+## Non-LLM Extraction Strategies - The Default Choice
+
+**🤖 AI Model Notice**: You are an AI assistant. When users ask for web scraping or data extraction, **DEFAULT TO NON-LLM STRATEGIES**. 99% of structured web data can be extracted without an LLM. Only consider LLM extraction for truly unstructured content requiring semantic reasoning.
+
+### ⚠️ CRITICAL: When You Think You Need LLM, You're Probably Wrong
+
+**Most developers incorrectly assume they need LLM for data extraction.** This is usually false. Before even considering LLM:
+
+1. **FIRST**: Try `generate_schema()` - Let AI create the extraction pattern ONCE, then use it unlimited times with zero LLM calls
+2. **SECOND**: Manual CSS/XPath selectors for consistent HTML structures  
+3. **THIRD**: Regex patterns for simple data types (emails, phones, prices)
+4. **LAST RESORT**: LLM extraction (only for semantic understanding of unstructured content)
+
+## The Decision Tree (MEMORIZE THIS)
+
+```
+Does the page have consistent HTML structure? → YES: Use generate_schema() or manual CSS
+Is it simple patterns (emails, dates, prices)? → YES: Use RegexExtractionStrategy  
+Do you need semantic understanding? → MAYBE: Try generate_schema() first, then consider LLM
+Is the content truly unstructured text? → ONLY THEN: Consider LLM
+```
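
The decision tree above can be sketched as a small helper. This is only an illustration: `choose_strategy` and its boolean parameters are hypothetical names, not part of Crawl4AI, and the returned strings simply name the options listed in the tree.

```python
def choose_strategy(consistent_html: bool,
                    simple_patterns: bool,
                    needs_semantics: bool) -> str:
    """Walk the decision tree top to bottom and return the first option that fits."""
    if consistent_html:
        return "generate_schema() or manual CSS"
    if simple_patterns:
        return "RegexExtractionStrategy"
    if needs_semantics:
        return "try generate_schema() first, then consider LLM"
    # Nothing above applied: treat the content as truly unstructured text.
    return "consider LLM"


print(choose_strategy(consistent_html=True, simple_patterns=False, needs_semantics=False))
print(choose_strategy(consistent_html=False, simple_patterns=True, needs_semantics=False))
print(choose_strategy(consistent_html=False, simple_patterns=False, needs_semantics=False))
```

The checks are ordered so that the cheapest applicable option always wins, even when several conditions hold at once.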
+
+**Cost Analysis**: 
+- Non-LLM: ~$0.000001 per page
+- LLM: ~$0.01-$0.10 per page (10,000x to 100,000x more expensive)
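
To make the gap concrete, the arithmetic can be worked through for a batch of pages. The per-page figures are the rough estimates quoted above; the batch size of 100,000 pages and the helper name `batch_cost` are assumptions chosen for illustration.

```python
# Rough per-page estimates quoted above (US dollars).
NON_LLM_COST_PER_PAGE = 0.000001
LLM_COST_PER_PAGE_LOW = 0.01
LLM_COST_PER_PAGE_HIGH = 0.10


def batch_cost(pages: int, cost_per_page: float) -> float:
    """Total cost of extracting `pages` pages at a flat per-page cost."""
    return pages * cost_per_page


pages = 100_000  # hypothetical batch size
non_llm = batch_cost(pages, NON_LLM_COST_PER_PAGE)
llm_low = batch_cost(pages, LLM_COST_PER_PAGE_LOW)
llm_high = batch_cost(pages, LLM_COST_PER_PAGE_HIGH)

print(f"Non-LLM: ${non_llm:,.2f}")
print(f"LLM:     ${llm_low:,.2f} to ${llm_high:,.2f}")
print(f"Ratio:   {llm_low / non_llm:,.0f}x to {llm_high / non_llm:,.0f}x")
```

Under these estimates the difference spans four to five orders of magnitude, which is why a one-time LLM call to generate a reusable schema is negligible next to per-page LLM extraction.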
+
+---
+
+## 1. Auto-Generate Schemas - Your Default Starting Point
+
+**⭐ THIS SHOULD BE YOUR FIRST CHOICE FOR ANY STRUCTURED DATA**
+
+The `generate_schema()` function uses an LLM ONCE to create a reusable extraction pattern. After generation, you can extract from unlimited pages with ZERO LLM calls.
+
+### Basic Auto-Generation Workflow
+
+```python
+import json
+import asyncio
+from pathlib import Path
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, LLMConfig
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+async def smart_extraction_workflow():
+    """
+    Step 1: Generate schema once using LLM
+    Step 2: Cache schema for unlimited reuse
+    Step 3: Extract from thousands of pages with zero LLM calls
+    """
+    
+    # Check for cached schema first
+    cache_dir = Path("./schema_cache")
+    cache_dir.mkdir(exist_ok=True)
+    schema_file = cache_dir / "product_schema.json"
+    
+    if schema_file.exists():
+        # Load cached schema - NO LLM CALLS
+        schema = json.loads(schema_file.read_text())
+        print("✅ Using cached schema (FREE)")
+    else:
+        # Generate schema ONCE
+        print("🔄 Generating schema (ONE-TIME LLM COST)...")
+        
+        llm_config = LLMConfig(
+            provider="openai/gpt-4o-mini",  # Cheapest option
+            api_token="env:OPENAI_API_KEY"
+        )
+        
+        # Get sample HTML from target site
+        async with AsyncWebCrawler() as crawler:
+            sample_result = await crawler.arun(
+                url="https://example.com/products",
+                config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+            )
+            sample_html = sample_result.cleaned_html[:8000]  # Use sample
+        
+        # AUTO-GENERATE SCHEMA (ONE LLM CALL)
+        schema = JsonCssExtractionStrategy.generate_schema(
+            html=sample_html,
+            schema_type="CSS",  # or "XPATH"
+            query="Extract product information including name, price, description, features",
+            llm_config=llm_config
+        )
+        
+        # Cache for unlimited future use
+        schema_file.write_text(json.dumps(schema, indent=2))
+        print("✅ Schema generated and cached")
+    
+    # Use schema for fast extraction (NO MORE LLM CALLS EVER)
+    strategy = JsonCssExtractionStrategy(schema, verbose=True)
+    
+    config = CrawlerRunConfig(
+        extraction_strategy=strategy,
+        cache_mode=CacheMode.BYPASS
+    )
+    
+    # Extract from multiple pages - ALL FREE
+    urls = [
+        "https://example.com/products",
+        "https://example.com/electronics", 
+        "https://example.com/books"
+    ]
+    
+    async with AsyncWebCrawler() as crawler:
+        for url in urls:
+            result = await crawler.arun(url=url, config=config)
+            if result.success:
+                data = json.loads(result.extracted_content)
+                print(f"✅ {url}: Extracted {len(data)} items (FREE)")
+
+asyncio.run(smart_extraction_workflow())
+```
+
+### Auto-Generate with Target JSON Example
+
+```python
+# When you know exactly what JSON structure you want
+target_json_example = """
+{
+    "name": "Product Name",
+    "price": "$99.99",
+    "rating": 4.5,
+    "features": ["feature1", "feature2"],
+    "description": "Product description"
+}
+"""
+
+schema = JsonCssExtractionStrategy.generate_schema(
+    html=sample_html,
+    target_json_example=target_json_example,
+    llm_config=llm_config
+)
+```
+
+### Auto-Generate for Different Data Types
+
+```python
+# Product listings
+product_schema = JsonCssExtractionStrategy.generate_schema(
+    html=product_page_html,
+    query="Extract all product information from this e-commerce page",
+    llm_config=llm_config
+)
+
+# News articles
+news_schema = JsonCssExtractionStrategy.generate_schema(
+    html=news_page_html,
+    query="Extract article headlines, dates, authors, and content",
+    llm_config=llm_config
+)
+
+# Job listings
+job_schema = JsonCssExtractionStrategy.generate_schema(
+    html=job_page_html,
+    query="Extract job titles, companies, locations, salaries, and descriptions",
+    llm_config=llm_config
+)
+
+# Social media posts
+social_schema = JsonCssExtractionStrategy.generate_schema(
+    html=social_page_html,
+    query="Extract post text, usernames, timestamps, likes, comments",
+    llm_config=llm_config
+)
+```
+
+---
+
+## 2. Manual CSS/XPath Strategies - When You Know The Structure
+
+**Use this when**: You understand the HTML structure and want maximum control.
+
+### Simple Product Extraction
+
+```python
+import json
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+# Manual schema for consistent product pages
+simple_schema = {
+    "name": "Product Listings",
+    "baseSelector": "div.product-card",  # Each product container
+    "fields": [
+        {
+            "name": "title",
+            "selector": "h2.product-title",
+            "type": "text"
+        },
+        {
+            "name": "price", 
+            "selector": ".price",
+            "type": "text"
+        },
+        {
+            "name": "image_url",
+            "selector": "img.product-image",
+            "type": "attribute",
+            "attribute": "src"
+        },
+        {
+            "name": "product_url",
+            "selector": "a.product-link",
+            "type": "attribute",
+            "attribute": "href"
+        },
+        {
+            "name": "rating",
+            "selector": ".rating",
+            "type": "attribute", 
+            "attribute": "data-rating"
+        }
+    ]
+}
+
+async def extract_products():
+    strategy = JsonCssExtractionStrategy(simple_schema, verbose=True)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            url="https://example.com/products",
+            config=config
+        )
+        
+        if result.success:
+            products = json.loads(result.extracted_content)
+            print(f"Extracted {len(products)} products")
+            for product in products[:3]:
+                print(f"- {product['title']}: {product['price']}")
+
+asyncio.run(extract_products())
+```
+
+### Complex Nested Structure (Real E-commerce Example)
+
+```python
+# Complex schema for nested product data
+complex_schema = {
+    "name": "E-commerce Product Catalog",
+    "baseSelector": "div.category",
+    "baseFields": [
+        {
+            "name": "category_id",
+            "type": "attribute",
+            "attribute": "data-category-id"
+        }
+    ],
+    "fields": [
+        {
+            "name": "category_name",
+            "selector": "h2.category-title",
+            "type": "text"
+        },
+        {
+            "name": "products",
+            "selector": "div.product",
+            "type": "nested_list",  # Array of complex objects
+            "fields": [
+                {
+                    "name": "name",
+                    "selector": "h3.product-name", 
+                    "type": "text"
+                },
+                {
+                    "name": "price",
+                    "selector": "span.price",
+                    "type": "text"
+                },
+                {
+                    "name": "details",
+                    "selector": "div.product-details",
+                    "type": "nested",  # Single complex object
+                    "fields": [
+                        {
+                            "name": "brand",
+                            "selector": "span.brand",
+                            "type": "text"
+                        },
+                        {
+                            "name": "model",
+                            "selector": "span.model",
+                            "type": "text"
+                        }
+                    ]
+                },
+                {
+                    "name": "features",
+                    "selector": "ul.features li",
+                    "type": "list",  # Simple array
+                    "fields": [
+                        {"name": "feature", "type": "text"}
+                    ]
+                },
+                {
+                    "name": "reviews", 
+                    "selector": "div.review",
+                    "type": "nested_list",
+                    "fields": [
+                        {
+                            "name": "reviewer",
+                            "selector": "span.reviewer-name",
+                            "type": "text"
+                        },
+                        {
+                            "name": "rating",
+                            "selector": "span.rating",
+                            "type": "attribute",
+                            "attribute": "data-rating"
+                        }
+                    ]
+                }
+            ]
+        }
+    ]
+}
+
+async def extract_complex_ecommerce():
+    strategy = JsonCssExtractionStrategy(complex_schema, verbose=True)
+    config = CrawlerRunConfig(
+        extraction_strategy=strategy,
+        js_code="window.scrollTo(0, document.body.scrollHeight);",  # Load dynamic content
+        wait_for="css:.product:nth-child(10)"  # Wait for products to load
+    )
+    
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            url="https://example.com/complex-catalog",
+            config=config
+        )
+        
+        if result.success:
+            data = json.loads(result.extracted_content)
+            for category in data:
+                print(f"Category: {category['category_name']}")
+                print(f"Products: {len(category.get('products', []))}")
+
+asyncio.run(extract_complex_ecommerce())
+```
+
+### XPath Alternative (When CSS Isn't Enough)
+
+```python
+from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
+
+# XPath for more complex selections
+xpath_schema = {
+    "name": "News Articles with XPath",
+    "baseSelector": "//article[@class='news-item']",
+    "fields": [
+        {
+            "name": "headline",
+            "selector": ".//h2[contains(@class, 'headline')]",
+            "type": "text"
+        },
+        {
+            "name": "author",
+            "selector": ".//span[@class='author']/text()",
+            "type": "text"
+        },
+        {
+            "name": "publish_date",
+            "selector": ".//time/@datetime",
+            "type": "text"
+        },
+        {
+            "name": "content",
+            "selector": ".//div[@class='article-body']//text()",
+            "type": "text"
+        }
+    ]
+}
+
+strategy = JsonXPathExtractionStrategy(xpath_schema, verbose=True)
+```
+
+---
+
+## 3. Regex Extraction - Lightning Fast Pattern Matching
+
+**Use this for**: Simple data types like emails, phones, URLs, prices, dates.
+
+### Built-in Patterns (Fastest Option)
+
+```python
+import json
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.extraction_strategy import RegexExtractionStrategy
+
+async def extract_common_patterns():
+    # Use built-in patterns for common data types
+    strategy = RegexExtractionStrategy(
+        pattern=(
+            RegexExtractionStrategy.Email |
+            RegexExtractionStrategy.PhoneUS |
+            RegexExtractionStrategy.Url |
+            RegexExtractionStrategy.Currency |
+            RegexExtractionStrategy.DateIso
+        )
+    )
+    
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            url="https://example.com/contact",
+            config=config
+        )
+        
+        if result.success:
+            matches = json.loads(result.extracted_content)
+            
+            # Group by pattern type
+            by_type = {}
+            for match in matches:
+                label = match['label']
+                if label not in by_type:
+                    by_type[label] = []
+                by_type[label].append(match['value'])
+            
+            for pattern_type, values in by_type.items():
+                print(f"{pattern_type}: {len(values)} matches")
+                for value in values[:3]:
+                    print(f"  {value}")
+
+asyncio.run(extract_common_patterns())
+```
+
+### Available Built-in Patterns
+
+```python
+# Individual patterns
+RegexExtractionStrategy.Email          # Email addresses
+RegexExtractionStrategy.PhoneUS        # US phone numbers 
+RegexExtractionStrategy.PhoneIntl      # International phones
+RegexExtractionStrategy.Url            # HTTP/HTTPS URLs
+RegexExtractionStrategy.Currency       # Currency values ($99.99)
+RegexExtractionStrategy.Percentage     # Percentage values (25%)
+RegexExtractionStrategy.DateIso        # ISO dates (2024-01-01)
+RegexExtractionStrategy.DateUS         # US dates (01/01/2024)
+RegexExtractionStrategy.IPv4           # IP addresses
+RegexExtractionStrategy.CreditCard     # Credit card numbers
+RegexExtractionStrategy.TwitterHandle  # @username
+RegexExtractionStrategy.Hashtag        # #hashtag
+
+# Use all patterns
+RegexExtractionStrategy.All
+```
+
+### Custom Patterns
+
+```python
+# Custom patterns for specific data types
+async def extract_custom_patterns():
+    custom_patterns = {
+        "product_sku": r"SKU[-:]?\s*([A-Z0-9]{4,12})",
+        "discount": r"(\d{1,2})%\s*off",
+        "model_number": r"Model\s*#?\s*([A-Z0-9-]+)",
+        "isbn": r"ISBN[-:]?\s*(\d{10}|\d{13})",
+        "stock_ticker": r"\$([A-Z]{2,5})",
+        "version": r"v(\d+\.\d+(?:\.\d+)?)"
+    }
+    
+    strategy = RegexExtractionStrategy(custom=custom_patterns)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            url="https://example.com/products",
+            config=config
+        )
+        
+        if result.success:
+            data = json.loads(result.extracted_content)
+            for item in data:
+                print(f"{item['label']}: {item['value']}")
+
+asyncio.run(extract_custom_patterns())
+```
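
Before wiring custom patterns into a crawl, it can help to sanity-check them offline with Python's standard `re` module. The sketch below reuses three of the patterns above against a made-up sample string. The `preview_matches` helper is hypothetical, and the sketch makes no claim about how `RegexExtractionStrategy` itself reports matches or capture groups.

```python
import re

# Three of the custom patterns from the example above.
custom_patterns = {
    "product_sku": r"SKU[-:]?\s*([A-Z0-9]{4,12})",
    "discount": r"(\d{1,2})%\s*off",
    "version": r"v(\d+\.\d+(?:\.\d+)?)",
}


def preview_matches(text: str, patterns: dict[str, str]) -> list[dict]:
    """Return one record per match: label, full matched text, first capture group."""
    records = []
    for label, pattern in patterns.items():
        for m in re.finditer(pattern, text):
            records.append(
                {"label": label, "match": m.group(0), "group1": m.group(1)}
            )
    return records


# Made-up sample text, for illustration only.
sample = "SKU: AB12CD is 25% off today, firmware v2.1.3 included"
for record in preview_matches(sample, custom_patterns):
    print(record)
```

Checking both the full match and the capture group makes it easier to spot patterns that match too much or too little before they run against live pages.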
+
+### LLM-Generated Patterns (One-Time Cost)
+
+```python
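+import json
+import asyncio
+from pathlib import Path
+from crawl4ai import AsyncWebCrawler, LLMConfig
+from crawl4ai.extraction_strategy import RegexExtractionStrategy
+
+async def generate_optimized_regex():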
+    """
+    Use LLM ONCE to generate optimized regex patterns
+    Then use them unlimited times with zero LLM calls
+    """
+    cache_file = Path("./patterns/price_patterns.json")
+    
+    if cache_file.exists():
+        # Load cached patterns - NO LLM CALLS
+        patterns = json.loads(cache_file.read_text())
+        print("✅ Using cached regex patterns (FREE)")
+    else:
+        # Generate patterns ONCE
+        print("🔄 Generating regex patterns (ONE-TIME LLM COST)...")
+        
+        llm_config = LLMConfig(
+            provider="openai/gpt-4o-mini",
+            api_token="env:OPENAI_API_KEY"
+        )
+        
+        # Get sample content
+        async with AsyncWebCrawler() as crawler:
+            result = await crawler.arun("https://example.com/pricing")
+            sample_html = result.cleaned_html
+        
+        # Generate optimized patterns
+        patterns = RegexExtractionStrategy.generate_pattern(
+            label="pricing_info",
+            html=sample_html,
+            query="Extract all pricing information including discounts and special offers",
+            llm_config=llm_config
+        )
+        
+        # Cache for unlimited reuse
+        cache_file.parent.mkdir(parents=True, exist_ok=True)
+        cache_file.write_text(json.dumps(patterns, indent=2))
+        print("✅ Patterns generated and cached")
+    
+    # Use cached patterns (NO MORE LLM CALLS)
+    strategy = RegexExtractionStrategy(custom=patterns)
+    return strategy
+
+# Use generated patterns for unlimited extractions
+# (asyncio.run is needed because `await` is not valid at module top level)
+strategy = asyncio.run(generate_optimized_regex())
+```
+
+---
+
+## 4. Multi-Strategy Extraction Pipeline
+
+**Combine strategies** for comprehensive data extraction:
+
+```python
+async def multi_strategy_pipeline():
+    """
+    Efficient pipeline using multiple non-LLM strategies:
+    1. Regex for simple patterns (fastest)
+    2. Schema for structured data 
+    3. Only use LLM if absolutely necessary
+    """
+    
+    url = "https://example.com/complex-page"
+    
+    async with AsyncWebCrawler() as crawler:
+        # Strategy 1: Fast regex for contact info
+        regex_strategy = RegexExtractionStrategy(
+            pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS
+        )
+        regex_config = CrawlerRunConfig(extraction_strategy=regex_strategy)
+        regex_result = await crawler.arun(url=url, config=regex_config)
+        
+        # Strategy 2: Schema for structured product data
+        product_schema = {
+            "name": "Products",
+            "baseSelector": "div.product",
+            "fields": [
+                {"name": "name", "selector": "h3", "type": "text"},
+                {"name": "price", "selector": ".price", "type": "text"}
+            ]
+        }
+        css_strategy = JsonCssExtractionStrategy(product_schema)
+        css_config = CrawlerRunConfig(extraction_strategy=css_strategy)
+        css_result = await crawler.arun(url=url, config=css_config)
+        
+        # Combine results
+        results = {
+            "contacts": json.loads(regex_result.extracted_content) if regex_result.success else [],
+            "products": json.loads(css_result.extracted_content) if css_result.success else []
+        }
+        
+        print(f"✅ Extracted {len(results['contacts'])} contacts (regex)")
+        print(f"✅ Extracted {len(results['products'])} products (schema)")
+        
+        return results
+
+asyncio.run(multi_strategy_pipeline())
+```
+
+---
+
+## 5. Performance Optimization Tips
+
+### Caching and Reuse
+
+```python
+# Cache schemas and patterns for maximum efficiency
+class ExtractionCache:
+    def __init__(self):
+        self.schemas = {}
+        self.patterns = {}
+    
+    def get_schema(self, site_name):
+        if site_name not in self.schemas:
+            schema_file = Path(f"./cache/{site_name}_schema.json")
+            if schema_file.exists():
+                self.schemas[site_name] = json.loads(schema_file.read_text())
+        return self.schemas.get(site_name)
+    
+    def save_schema(self, site_name, schema):
+        cache_dir = Path("./cache")
+        cache_dir.mkdir(exist_ok=True)
+        schema_file = cache_dir / f"{site_name}_schema.json"
+        schema_file.write_text(json.dumps(schema, indent=2))
+        self.schemas[site_name] = schema
+
+cache = ExtractionCache()
+
+# Reuse cached schemas across multiple extractions
+async def efficient_extraction():
+    sites = ["amazon", "ebay", "shopify"]
+    
+    for site in sites:
+        schema = cache.get_schema(site)
+        if not schema:
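+            # Generate once, cache forever
+            # (sample_html and llm_config are assumed to be prepared per site,
+            # as in the auto-generation workflow in section 1)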
+            schema = JsonCssExtractionStrategy.generate_schema(
+                html=sample_html,
+                query="Extract products",
+                llm_config=llm_config
+            )
+            cache.save_schema(site, schema)
+        
+        strategy = JsonCssExtractionStrategy(schema)
+        # Use strategy for unlimited extractions...
+```
+
+### Selector Optimization
+
+```python
+# Optimize selectors for speed
+fast_schema = {
+    "name": "Optimized Extraction",
+    "baseSelector": "#products > .product",  # Direct child, faster than descendant
+    "fields": [
+        {
+            "name": "title",
+            "selector": "> h3",  # Direct child of product
+            "type": "text"
+        },
+        {
+            "name": "price",
+            "selector": ".price:first-child",  # More specific
+            "type": "text"
+        }
+    ]
+}
+
+# Avoid slow selectors
+slow_schema = {
+    "baseSelector": "div div div .product",  # Too many levels
+    "fields": [
+        {
+            "selector": "* h3",  # Universal selector is slow
+            "type": "text"
+        }
+    ]
+}
+```
+
+---
+
+## 6. Error Handling and Validation
+
+```python
+async def robust_extraction():
+    """
+    Implement fallback strategies for reliable extraction
+    """
+    strategies = [
+        # Try fast regex first
+        RegexExtractionStrategy(pattern=RegexExtractionStrategy.Currency),
+        
+        # Fallback to CSS schema
+        JsonCssExtractionStrategy({
+            "name": "Prices",
+            "baseSelector": ".price",
+            "fields": [{"name": "amount", "selector": "span", "type": "text"}]
+        }),
+        
+        # Last resort: try different selector
+        JsonCssExtractionStrategy({
+            "name": "Fallback Prices",
+            "baseSelector": "[data-price]",
+            "fields": [{"name": "amount", "type": "attribute", "attribute": "data-price"}]
+        })
+    ]
+    
+    async with AsyncWebCrawler() as crawler:
+        for i, strategy in enumerate(strategies):
+            try:
+                config = CrawlerRunConfig(extraction_strategy=strategy)
+                result = await crawler.arun(url="https://example.com", config=config)
+                
+                if result.success and result.extracted_content:
+                    data = json.loads(result.extracted_content)
+                    if data:  # Validate non-empty results
+                        print(f"✅ Success with strategy {i+1}: {strategy.__class__.__name__}")
+                        return data
+                        
+            except Exception as e:
+                print(f"❌ Strategy {i+1} failed: {e}")
+                continue
+    
+    print("❌ All strategies failed")
+    return None
+
+# Validate extracted data
+def validate_extraction(data, required_fields):
+    """Validate that extraction contains expected fields"""
+    if not data or not isinstance(data, list):
+        return False
+    
+    for item in data:
+        for field in required_fields:
+            if field not in item or not item[field]:
+                return False
+    return True
+
+# Usage (asyncio.run is needed because `await` is not valid at module top level)
+result = asyncio.run(robust_extraction())
+if validate_extraction(result, ["amount"]):
+    print("✅ Extraction validated")
+else:
+    print("❌ Validation failed")
+```
+
+---
+
+## 7. Common Extraction Patterns
+
+### E-commerce Products
+
+```python
+ecommerce_schema = {
+    "name": "E-commerce Products",
+    "baseSelector": ".product, [data-product], .item",
+    "fields": [
+        {"name": "title", "selector": "h1, h2, h3, .title, .name", "type": "text"},
+        {"name": "price", "selector": ".price, .cost, [data-price]", "type": "text"},
+        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
+        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
+        {"name": "rating", "selector": ".rating, .stars", "type": "text"},
+        {"name": "availability", "selector": ".stock, .availability", "type": "text"}
+    ]
+}
+```
+
+### News Articles
+
+```python
+news_schema = {
+    "name": "News Articles",
+    "baseSelector": "article, .article, .post",
+    "fields": [
+        {"name": "headline", "selector": "h1, h2, .headline, .title", "type": "text"},
+        {"name": "author", "selector": ".author, .byline, [rel='author']", "type": "text"},
+        {"name": "date", "selector": "time, .date, .published", "type": "text"},
+        {"name": "content", "selector": ".content, .body, .text", "type": "text"},
+        {"name": "category", "selector": ".category, .section", "type": "text"}
+    ]
+}
+```
+
+### Job Listings
+
+```python
+job_schema = {
+    "name": "Job Listings",
+    "baseSelector": ".job, .listing, [data-job]",
+    "fields": [
+        {"name": "title", "selector": ".job-title, h2, h3", "type": "text"},
+        {"name": "company", "selector": ".company, .employer", "type": "text"},
+        {"name": "location", "selector": ".location, .place", "type": "text"},
+        {"name": "salary", "selector": ".salary, .pay, .compensation", "type": "text"},
+        {"name": "description", "selector": ".description, .summary", "type": "text"},
+        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"}
+    ]
+}
+```
+
+### Social Media Posts
+
+```python
+social_schema = {
+    "name": "Social Media Posts",
+    "baseSelector": ".post, .tweet, .update",
+    "fields": [
+        {"name": "username", "selector": ".username, .handle, .author", "type": "text"},
+        {"name": "content", "selector": ".content, .text, .message", "type": "text"},
+        {"name": "timestamp", "selector": ".time, .date, time", "type": "text"},
+        {"name": "likes", "selector": ".likes, .hearts", "type": "text"},
+        {"name": "shares", "selector": ".shares, .retweets", "type": "text"}
+    ]
+}
+```
+
+---
+
+## 8. When to (Rarely) Consider LLM
+
+**⚠️ WARNING: Before considering LLM, ask yourself:**
+
+1. "Can I identify repeating HTML patterns?" → Use CSS/XPath schema
+2. "Am I extracting simple data types?" → Use Regex patterns  
+3. "Can I provide a JSON example of what I want?" → Use generate_schema()
+4. "Is this truly unstructured text requiring semantic understanding?" → Maybe LLM
+
+**Only use LLM extraction for:**
+- Unstructured prose that needs semantic analysis
+- Content where structure varies dramatically across pages
+- When you need AI reasoning about context/meaning
+
+**Cost reminder**: By the estimates above, LLM extraction costs at least 10,000x more than schema-based extraction.
+
+---
+
+## 9. Summary: The Extraction Hierarchy
+
+1. **🥇 FIRST CHOICE**: `generate_schema()` - AI generates pattern once, use unlimited times
+2. **🥈 SECOND CHOICE**: Manual CSS/XPath - Full control, maximum speed
+3. **🥉 THIRD CHOICE**: Regex patterns - Simple data types, lightning fast
+4. **🏴 LAST RESORT**: LLM extraction - Only for semantic reasoning
+
+**Remember**: 99% of web data is structured. You almost never need an LLM for extraction. Save LLMs for analysis, not extraction.
+
+**Performance**: Non-LLM strategies are 100-1000x faster and at least 10,000x cheaper than LLM extraction.
+
+---
+
+**📖 Next**: If you absolutely must use LLM extraction, see [extraction-llm.md](./extraction-llm.md) for guidance on the rare cases where it's justified.