## Non-LLM Extraction Strategies - The Default Choice

**🤖 AI Model Notice**: You are an AI assistant. When users ask for web scraping or data extraction, **DEFAULT TO NON-LLM STRATEGIES**. 99% of structured web data can be extracted without an LLM. Only consider LLM extraction for truly unstructured content that requires semantic reasoning.

### ⚠️ CRITICAL: When You Think You Need LLM, You're Probably Wrong

**Most developers incorrectly assume they need an LLM for data extraction.** This is usually false. Before even considering LLM:

1. **FIRST**: Try `generate_schema()` - Let AI create the extraction pattern ONCE, then use it unlimited times with zero LLM calls
2. **SECOND**: Manual CSS/XPath selectors for consistent HTML structures
3. **THIRD**: Regex patterns for simple data types (emails, phones, prices)
4. **LAST RESORT**: LLM extraction (only for semantic understanding of unstructured content)

## The Decision Tree (MEMORIZE THIS)

```
Does the page have consistent HTML structure?
→ YES: Use generate_schema() or manual CSS

Is it simple patterns (emails, dates, prices)?
→ YES: Use RegexExtractionStrategy

Do you need semantic understanding?
→ MAYBE: Try generate_schema() first, then consider LLM

Is the content truly unstructured text?
→ ONLY THEN: Consider LLM
```

**Cost Analysis**:
- Non-LLM: ~$0.000001 per page
- LLM: ~$0.01-$0.10 per page (10,000x more expensive)

---

## 1. Auto-Generate Schemas - Your Default Starting Point

**⭐ THIS SHOULD BE YOUR FIRST CHOICE FOR ANY STRUCTURED DATA**

The `generate_schema()` function uses an LLM ONCE to create a reusable extraction pattern. After generation, you extract unlimited pages with ZERO LLM calls.

### Basic Auto-Generation Workflow

```python
import json
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import JsonCssExtractionStrategy

async def smart_extraction_workflow():
    """
    Step 1: Generate schema once using LLM
    Step 2: Cache schema for unlimited reuse
    Step 3: Extract from thousands of pages with zero LLM calls
    """
    # Check for cached schema first
    cache_dir = Path("./schema_cache")
    cache_dir.mkdir(exist_ok=True)
    schema_file = cache_dir / "product_schema.json"

    if schema_file.exists():
        # Load cached schema - NO LLM CALLS
        schema = json.loads(schema_file.read_text())
        print("✅ Using cached schema (FREE)")
    else:
        # Generate schema ONCE
        print("🔄 Generating schema (ONE-TIME LLM COST)...")

        llm_config = LLMConfig(
            provider="openai/gpt-4o-mini",  # Cheapest option
            api_token="env:OPENAI_API_KEY"
        )

        # Get sample HTML from target site
        async with AsyncWebCrawler() as crawler:
            sample_result = await crawler.arun(
                url="https://example.com/products",
                config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
            )
            sample_html = sample_result.cleaned_html[:8000]  # Use a sample

        # AUTO-GENERATE SCHEMA (ONE LLM CALL)
        schema = JsonCssExtractionStrategy.generate_schema(
            html=sample_html,
            schema_type="CSS",  # or "XPATH"
            query="Extract product information including name, price, description, features",
            llm_config=llm_config
        )

        # Cache for unlimited future use
        schema_file.write_text(json.dumps(schema, indent=2))
        print("✅ Schema generated and cached")

    # Use schema for fast extraction (NO MORE LLM CALLS EVER)
    strategy = JsonCssExtractionStrategy(schema, verbose=True)
    config = CrawlerRunConfig(
        extraction_strategy=strategy,
        cache_mode=CacheMode.BYPASS
    )

    # Extract from multiple pages - ALL FREE
    urls = [
        "https://example.com/products",
        "https://example.com/electronics",
        "https://example.com/books"
    ]

    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url, config=config)
            if result.success:
                data = json.loads(result.extracted_content)
                print(f"✅ {url}: Extracted {len(data)} items (FREE)")

asyncio.run(smart_extraction_workflow())
```
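Once the schema is cached, the per-URL loop above can become a concurrent batch crawl. A minimal sketch, assuming a cached `schema` dict like the one produced above and that your crawl4ai version provides `arun_many()` (present in recent releases), which crawls a list of URLs concurrently and returns their results:

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def batch_extract(schema: dict, urls: list[str]) -> list:
    """Reuse one cached schema across a concurrent batch crawl - zero LLM calls."""
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema),
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler() as crawler:
        # arun_many() handles concurrency internally and returns all results
        results = await crawler.arun_many(urls=urls, config=config)
    return [
        json.loads(r.extracted_content)
        for r in results
        if r.success and r.extracted_content
    ]

# e.g. asyncio.run(batch_extract(schema, ["https://example.com/p1", "https://example.com/p2"]))
```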
### Auto-Generate with Target JSON Example

```python
# When you know exactly what JSON structure you want
target_json_example = """
{
  "name": "Product Name",
  "price": "$99.99",
  "rating": 4.5,
  "features": ["feature1", "feature2"],
  "description": "Product description"
}
"""

schema = JsonCssExtractionStrategy.generate_schema(
    html=sample_html,
    target_json_example=target_json_example,
    llm_config=llm_config
)
```

### Auto-Generate for Different Data Types

```python
# Product listings
product_schema = JsonCssExtractionStrategy.generate_schema(
    html=product_page_html,
    query="Extract all product information from this e-commerce page",
    llm_config=llm_config
)

# News articles
news_schema = JsonCssExtractionStrategy.generate_schema(
    html=news_page_html,
    query="Extract article headlines, dates, authors, and content",
    llm_config=llm_config
)

# Job listings
job_schema = JsonCssExtractionStrategy.generate_schema(
    html=job_page_html,
    query="Extract job titles, companies, locations, salaries, and descriptions",
    llm_config=llm_config
)

# Social media posts
social_schema = JsonCssExtractionStrategy.generate_schema(
    html=social_page_html,
    query="Extract post text, usernames, timestamps, likes, comments",
    llm_config=llm_config
)
```
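A generated schema is only worth caching if it actually matches the page. A hedged validation sketch, reusing only the crawler APIs shown above; it dry-runs the schema against the target page and caches it only on a non-empty result:

```python
import json
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def cache_schema_if_valid(schema: dict, test_url: str, schema_file: Path) -> bool:
    """Dry-run a freshly generated schema against the live page before caching it."""
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema),
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=test_url, config=config)

    items = []
    if result.success and result.extracted_content:
        items = json.loads(result.extracted_content)
    if items:
        schema_file.write_text(json.dumps(schema, indent=2))
        print(f"✅ Schema matched {len(items)} items - cached")
        return True
    print("❌ Schema matched nothing - refine the query and regenerate")
    return False
```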
"type": "nested", # Single complex object "fields": [ { "name": "brand", "selector": "span.brand", "type": "text" }, { "name": "model", "selector": "span.model", "type": "text" } ] }, { "name": "features", "selector": "ul.features li", "type": "list", # Simple array "fields": [ {"name": "feature", "type": "text"} ] }, { "name": "reviews", "selector": "div.review", "type": "nested_list", "fields": [ { "name": "reviewer", "selector": "span.reviewer-name", "type": "text" }, { "name": "rating", "selector": "span.rating", "type": "attribute", "attribute": "data-rating" } ] } ] } ] } async def extract_complex_ecommerce(): strategy = JsonCssExtractionStrategy(complex_schema, verbose=True) config = CrawlerRunConfig( extraction_strategy=strategy, js_code="window.scrollTo(0, document.body.scrollHeight);", # Load dynamic content wait_for="css:.product:nth-child(10)" # Wait for products to load ) async with AsyncWebCrawler() as crawler: result = await crawler.arun( url="https://example.com/complex-catalog", config=config ) if result.success: data = json.loads(result.extracted_content) for category in data: print(f"Category: {category['category_name']}") print(f"Products: {len(category.get('products', []))}") asyncio.run(extract_complex_ecommerce()) ``` ### XPath Alternative (When CSS Isn't Enough) ```python from crawl4ai import JsonXPathExtractionStrategy # XPath for more complex selections xpath_schema = { "name": "News Articles with XPath", "baseSelector": "//article[@class='news-item']", "fields": [ { "name": "headline", "selector": ".//h2[contains(@class, 'headline')]", "type": "text" }, { "name": "author", "selector": ".//span[@class='author']/text()", "type": "text" }, { "name": "publish_date", "selector": ".//time/@datetime", "type": "text" }, { "name": "content", "selector": ".//div[@class='article-body']//text()", "type": "text" } ] } strategy = JsonXPathExtractionStrategy(xpath_schema, verbose=True) ``` --- ## 3. Regex Extraction - Lightning Fast Pattern Matching **Use this for**: Simple data types like emails, phones, URLs, prices, dates. 
---

## 3. Regex Extraction - Lightning Fast Pattern Matching

**Use this for**: Simple data types like emails, phones, URLs, prices, dates.

### Built-in Patterns (Fastest Option)

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai import RegexExtractionStrategy

async def extract_common_patterns():
    # Use built-in patterns for common data types
    strategy = RegexExtractionStrategy(
        pattern=(
            RegexExtractionStrategy.Email |
            RegexExtractionStrategy.PhoneUS |
            RegexExtractionStrategy.Url |
            RegexExtractionStrategy.Currency |
            RegexExtractionStrategy.DateIso
        )
    )

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/contact",
            config=config
        )

        if result.success:
            matches = json.loads(result.extracted_content)

            # Group by pattern type
            by_type = {}
            for match in matches:
                label = match['label']
                if label not in by_type:
                    by_type[label] = []
                by_type[label].append(match['value'])

            for pattern_type, values in by_type.items():
                print(f"{pattern_type}: {len(values)} matches")
                for value in values[:3]:
                    print(f"  {value}")

asyncio.run(extract_common_patterns())
```

### Available Built-in Patterns

```python
# Individual patterns
RegexExtractionStrategy.Email          # Email addresses
RegexExtractionStrategy.PhoneUS        # US phone numbers
RegexExtractionStrategy.PhoneIntl      # International phones
RegexExtractionStrategy.Url            # HTTP/HTTPS URLs
RegexExtractionStrategy.Currency       # Currency values ($99.99)
RegexExtractionStrategy.Percentage     # Percentage values (25%)
RegexExtractionStrategy.DateIso        # ISO dates (2024-01-01)
RegexExtractionStrategy.DateUS         # US dates (01/01/2024)
RegexExtractionStrategy.IPv4           # IP addresses
RegexExtractionStrategy.CreditCard     # Credit card numbers
RegexExtractionStrategy.TwitterHandle  # @username
RegexExtractionStrategy.Hashtag        # #hashtag

# Use all patterns
RegexExtractionStrategy.All
```

### Custom Patterns

```python
# Custom patterns for specific data types
async def extract_custom_patterns():
    custom_patterns = {
        "product_sku": r"SKU[-:]?\s*([A-Z0-9]{4,12})",
        "discount": r"(\d{1,2})%\s*off",
        "model_number": r"Model\s*#?\s*([A-Z0-9-]+)",
        "isbn": r"ISBN[-:]?\s*(\d{10}|\d{13})",
        "stock_ticker": r"\$([A-Z]{2,5})",
        "version": r"v(\d+\.\d+(?:\.\d+)?)"
    }

    strategy = RegexExtractionStrategy(custom=custom_patterns)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data:
                print(f"{item['label']}: {item['value']}")

asyncio.run(extract_custom_patterns())
```
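Custom patterns are ordinary Python regexes, so it pays to unit-test them against sample text before spending a crawl on them. A quick stdlib-only sketch (the sample string is made up for illustration):

```python
import re

custom_patterns = {
    "product_sku": r"SKU[-:]?\s*([A-Z0-9]{4,12})",
    "discount": r"(\d{1,2})%\s*off",
    "version": r"v(\d+\.\d+(?:\.\d+)?)",
}

sample_text = "Get 25% off SKU: AB12CD34 - now shipping firmware v2.1.3"

for label, pattern in custom_patterns.items():
    matches = re.findall(pattern, sample_text)
    status = "✅" if matches else "❌"
    print(f"{status} {label}: {matches}")
```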
### LLM-Generated Patterns (One-Time Cost)

```python
import json
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, LLMConfig
from crawl4ai import RegexExtractionStrategy

async def generate_optimized_regex():
    """
    Use LLM ONCE to generate optimized regex patterns,
    then use them unlimited times with zero LLM calls
    """
    cache_file = Path("./patterns/price_patterns.json")

    if cache_file.exists():
        # Load cached patterns - NO LLM CALLS
        patterns = json.loads(cache_file.read_text())
        print("✅ Using cached regex patterns (FREE)")
    else:
        # Generate patterns ONCE
        print("🔄 Generating regex patterns (ONE-TIME LLM COST)...")

        llm_config = LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token="env:OPENAI_API_KEY"
        )

        # Get sample content
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun("https://example.com/pricing")
            sample_html = result.cleaned_html

        # Generate optimized patterns
        patterns = RegexExtractionStrategy.generate_pattern(
            label="pricing_info",
            html=sample_html,
            query="Extract all pricing information including discounts and special offers",
            llm_config=llm_config
        )

        # Cache for unlimited reuse
        cache_file.parent.mkdir(exist_ok=True)
        cache_file.write_text(json.dumps(patterns, indent=2))
        print("✅ Patterns generated and cached")

    # Use cached patterns (NO MORE LLM CALLS)
    return RegexExtractionStrategy(custom=patterns)

# Build the strategy once, then use it for unlimited extractions
strategy = asyncio.run(generate_optimized_regex())
```

---

## 4. Multi-Strategy Extraction Pipeline

**Combine strategies** for comprehensive data extraction:

```python
async def multi_strategy_pipeline():
    """
    Efficient pipeline using multiple non-LLM strategies:
    1. Regex for simple patterns (fastest)
    2. Schema for structured data
    3. Only use LLM if absolutely necessary
    """
    url = "https://example.com/complex-page"

    async with AsyncWebCrawler() as crawler:
        # Strategy 1: Fast regex for contact info
        regex_strategy = RegexExtractionStrategy(
            pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS
        )
        regex_config = CrawlerRunConfig(extraction_strategy=regex_strategy)
        regex_result = await crawler.arun(url=url, config=regex_config)

        # Strategy 2: Schema for structured product data
        product_schema = {
            "name": "Products",
            "baseSelector": "div.product",
            "fields": [
                {"name": "name", "selector": "h3", "type": "text"},
                {"name": "price", "selector": ".price", "type": "text"}
            ]
        }
        css_strategy = JsonCssExtractionStrategy(product_schema)
        css_config = CrawlerRunConfig(extraction_strategy=css_strategy)
        css_result = await crawler.arun(url=url, config=css_config)

        # Combine results
        results = {
            "contacts": json.loads(regex_result.extracted_content) if regex_result.success else [],
            "products": json.loads(css_result.extracted_content) if css_result.success else []
        }

        print(f"✅ Extracted {len(results['contacts'])} contacts (regex)")
        print(f"✅ Extracted {len(results['products'])} products (schema)")

        return results

asyncio.run(multi_strategy_pipeline())
```
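The pipeline above crawls the page twice, sequentially. Since the two passes are independent, they can be launched together with `asyncio.gather()`. A sketch of the same idea, assuming the `regex_config` and `css_config` objects from the pipeline above and that your crawl4ai version supports concurrent `arun()` calls on one crawler (if not, `arun_many()` is the safer route):

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler

async def concurrent_pipeline(url: str, regex_config, css_config) -> dict:
    async with AsyncWebCrawler() as crawler:
        # Fire both extraction passes at once; gather() awaits them together
        regex_result, css_result = await asyncio.gather(
            crawler.arun(url=url, config=regex_config),
            crawler.arun(url=url, config=css_config),
        )
    return {
        "contacts": json.loads(regex_result.extracted_content) if regex_result.success else [],
        "products": json.loads(css_result.extracted_content) if css_result.success else [],
    }
```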
---

## 5. Performance Optimization Tips

### Caching and Reuse

```python
# Cache schemas and patterns for maximum efficiency
class ExtractionCache:
    def __init__(self):
        self.schemas = {}
        self.patterns = {}

    def get_schema(self, site_name):
        if site_name not in self.schemas:
            schema_file = Path(f"./cache/{site_name}_schema.json")
            if schema_file.exists():
                self.schemas[site_name] = json.loads(schema_file.read_text())
        return self.schemas.get(site_name)

    def save_schema(self, site_name, schema):
        cache_dir = Path("./cache")
        cache_dir.mkdir(exist_ok=True)
        schema_file = cache_dir / f"{site_name}_schema.json"
        schema_file.write_text(json.dumps(schema, indent=2))
        self.schemas[site_name] = schema

cache = ExtractionCache()

# Reuse cached schemas across multiple extractions
async def efficient_extraction():
    sites = ["amazon", "ebay", "shopify"]

    for site in sites:
        schema = cache.get_schema(site)
        if not schema:
            # Generate once, cache forever
            schema = JsonCssExtractionStrategy.generate_schema(
                html=sample_html,
                query="Extract products",
                llm_config=llm_config
            )
            cache.save_schema(site, schema)

        strategy = JsonCssExtractionStrategy(schema)
        # Use strategy for unlimited extractions...
```

### Selector Optimization

```python
# Optimize selectors for speed
fast_schema = {
    "name": "Optimized Extraction",
    "baseSelector": "#products > .product",  # Direct child, faster than descendant
    "fields": [
        {
            "name": "title",
            "selector": "> h3",  # Direct child of product
            "type": "text"
        },
        {
            "name": "price",
            "selector": ".price:first-child",  # More specific
            "type": "text"
        }
    ]
}

# Avoid slow selectors
slow_schema = {
    "baseSelector": "div div div .product",  # Too many levels
    "fields": [
        {
            "selector": "* h3",  # Universal selector is slow
            "type": "text"
        }
    ]
}
```
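To verify speed claims on your own target pages, it is easy to time an extraction pass. A small sketch using the stdlib's `time.perf_counter`, assuming a `config` with any non-LLM strategy attached (note it times crawl plus extraction together; the extraction step itself is a small fraction of that):

```python
import time
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def timed_extraction(url: str, config: CrawlerRunConfig) -> float:
    """Time one crawl + extraction pass for a rough per-page cost estimate."""
    async with AsyncWebCrawler() as crawler:
        start = time.perf_counter()
        result = await crawler.arun(url=url, config=config)
        elapsed = time.perf_counter() - start
    print(f"{url}: {'ok' if result.success else 'failed'} in {elapsed:.2f}s")
    return elapsed

# e.g. asyncio.run(timed_extraction("https://example.com", config))
```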
---

## 6. Error Handling and Validation

```python
async def robust_extraction():
    """
    Implement fallback strategies for reliable extraction
    """
    strategies = [
        # Try fast regex first
        RegexExtractionStrategy(pattern=RegexExtractionStrategy.Currency),

        # Fall back to a CSS schema
        JsonCssExtractionStrategy({
            "name": "Prices",
            "baseSelector": ".price",
            "fields": [{"name": "amount", "selector": "span", "type": "text"}]
        }),

        # Last resort: try a different selector
        JsonCssExtractionStrategy({
            "name": "Fallback Prices",
            "baseSelector": "[data-price]",
            "fields": [{"name": "amount", "type": "attribute", "attribute": "data-price"}]
        })
    ]

    async with AsyncWebCrawler() as crawler:
        for i, strategy in enumerate(strategies):
            try:
                config = CrawlerRunConfig(extraction_strategy=strategy)
                result = await crawler.arun(url="https://example.com", config=config)

                if result.success and result.extracted_content:
                    data = json.loads(result.extracted_content)
                    if data:  # Validate non-empty results
                        print(f"✅ Success with strategy {i+1}: {strategy.__class__.__name__}")
                        return data
            except Exception as e:
                print(f"❌ Strategy {i+1} failed: {e}")
                continue

    print("❌ All strategies failed")
    return None

# Validate extracted data
def validate_extraction(data, required_fields):
    """Validate that the extraction contains the expected fields"""
    if not data or not isinstance(data, list):
        return False

    for item in data:
        for field in required_fields:
            if field not in item or not item[field]:
                return False

    return True

# Usage
result = asyncio.run(robust_extraction())
if validate_extraction(result, ["amount"]):
    print("✅ Extraction validated")
else:
    print("❌ Validation failed")
```

---

## 7. Common Extraction Patterns

### E-commerce Products

```python
ecommerce_schema = {
    "name": "E-commerce Products",
    "baseSelector": ".product, [data-product], .item",
    "fields": [
        {"name": "title", "selector": "h1, h2, h3, .title, .name", "type": "text"},
        {"name": "price", "selector": ".price, .cost, [data-price]", "type": "text"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "rating", "selector": ".rating, .stars", "type": "text"},
        {"name": "availability", "selector": ".stock, .availability", "type": "text"}
    ]
}
```

### News Articles

```python
news_schema = {
    "name": "News Articles",
    "baseSelector": "article, .article, .post",
    "fields": [
        {"name": "headline", "selector": "h1, h2, .headline, .title", "type": "text"},
        {"name": "author", "selector": ".author, .byline, [rel='author']", "type": "text"},
        {"name": "date", "selector": "time, .date, .published", "type": "text"},
        {"name": "content", "selector": ".content, .body, .text", "type": "text"},
        {"name": "category", "selector": ".category, .section", "type": "text"}
    ]
}
```

### Job Listings

```python
job_schema = {
    "name": "Job Listings",
    "baseSelector": ".job, .listing, [data-job]",
    "fields": [
        {"name": "title", "selector": ".job-title, h2, h3", "type": "text"},
        {"name": "company", "selector": ".company, .employer", "type": "text"},
        {"name": "location", "selector": ".location, .place", "type": "text"},
        {"name": "salary", "selector": ".salary, .pay, .compensation", "type": "text"},
        {"name": "description", "selector": ".description, .summary", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"}
    ]
}
```

### Social Media Posts

```python
social_schema = {
    "name": "Social Media Posts",
    "baseSelector": ".post, .tweet, .update",
    "fields": [
        {"name": "username", "selector": ".username, .handle, .author", "type": "text"},
        {"name": "content", "selector": ".content, .text, .message", "type": "text"},
        {"name": "timestamp", "selector": ".time, .date, time", "type": "text"},
        {"name": "likes", "selector": ".likes, .hearts", "type": "text"},
        {"name": "shares", "selector": ".shares, .retweets", "type": "text"}
    ]
}
```
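These generic schemas cast a wide net, so expect some fields to come back empty on any given site. A hedged usage sketch that applies one of them and filters out mostly-empty rows (the `min_fields` threshold is an arbitrary illustration):

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai import JsonCssExtractionStrategy

async def extract_with_generic_schema(url: str, schema: dict, min_fields: int = 2) -> list:
    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
    if not result.success or not result.extracted_content:
        return []
    items = json.loads(result.extracted_content)
    # Keep only rows where at least `min_fields` fields actually matched
    return [item for item in items if sum(1 for v in item.values() if v) >= min_fields]

# e.g. asyncio.run(extract_with_generic_schema("https://example.com/shop", ecommerce_schema))
```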
---

## 8. When to (Rarely) Consider LLM

**⚠️ WARNING: Before considering LLM, ask yourself:**

1. "Can I identify repeating HTML patterns?" → Use a CSS/XPath schema
2. "Am I extracting simple data types?" → Use regex patterns
3. "Can I provide a JSON example of what I want?" → Use `generate_schema()`
4. "Is this truly unstructured text requiring semantic understanding?" → Maybe LLM

**Only use LLM extraction for:**
- Unstructured prose that needs semantic analysis
- Content where structure varies dramatically across pages
- When you need AI reasoning about context/meaning

**Cost reminder**: LLM extraction costs 10,000x more than schema-based extraction.

---

## 9. Summary: The Extraction Hierarchy

1. **🥇 FIRST CHOICE**: `generate_schema()` - AI generates the pattern once, use it unlimited times
2. **🥈 SECOND CHOICE**: Manual CSS/XPath - Full control, maximum speed
3. **🥉 THIRD CHOICE**: Regex patterns - Simple data types, lightning fast
4. **🏴 LAST RESORT**: LLM extraction - Only for semantic reasoning

**Remember**: 99% of web data is structured. You almost never need an LLM for extraction. Save the LLM for analysis, not extraction.

**Performance**: Non-LLM strategies are 100-1000x faster and 10,000x cheaper than LLM extraction.

---

**📖 Next**: If you absolutely must use LLM extraction, see [extraction-llm.md](./extraction-llm.md) for guidance on the rare cases where it's justified.