Amazon R2D2 Product Search Example

A real-world demonstration of Crawl4AI's multi-step crawling with LLM-generated automation scripts.

🎯 What This Example Shows

This example demonstrates advanced Crawl4AI features:

  • LLM-Generated Scripts: Automatically create C4A-Script from HTML snippets
  • Multi-Step Crawling: Navigate through multiple pages using session persistence
  • Structured Data Extraction: Extract product data using JSON CSS schemas
  • Visual Automation: Watch the browser perform the search (headless=False)

🚀 How It Works

1. Script Generation Phase

The example uses C4ACompiler.generate_script() to analyze Amazon's HTML and create:

  • Search Script: Automates filling the search box and clicking search
  • Extraction Schema: Defines how to extract product information

2. Crawling Workflow

Homepage → Execute Search Script → Extract Products → Save Results

All steps use the same session_id to maintain browser state.
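
The workflow above might be wired together roughly as follows. This is a minimal sketch, not the code in amazon_r2d2_search.py: the function names `search_and_extract` and `run`, the `search_js` argument (standing in for the generated search script), and the `wait_for` selector are assumptions. The Crawl4AI parameters shown (`session_id`, `js_code`, `js_only`, `extraction_strategy`) should be checked against the version you have installed.

```python
import asyncio
import json


async def search_and_extract(search_js: str, schema: dict,
                             session_id: str = "amazon_r2d2_session") -> list:
    """Load the homepage, run the search script, then extract products in one session."""
    # Imported lazily so this module can be imported without Crawl4AI installed.
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

    browser_config = BrowserConfig(headless=False)  # visible browser, as in the example

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Step 1: open the homepage; this creates the session.
        home = await crawler.arun(
            url="https://www.amazon.com",
            config=CrawlerRunConfig(session_id=session_id, cache_mode=CacheMode.BYPASS),
        )
        if not home.success:
            return []

        # Step 2: run the search script in the same tab and extract the results.
        results = await crawler.arun(
            url="https://www.amazon.com",
            config=CrawlerRunConfig(
                session_id=session_id,  # same session keeps the page state
                js_code=search_js,      # assumed to be plain JavaScript here
                js_only=True,           # reuse the open page instead of reloading it
                wait_for="css:[data-component-type='s-search-result']",
                extraction_strategy=JsonCssExtractionStrategy(schema),
                cache_mode=CacheMode.BYPASS,
            ),
        )
        if not results.success or not results.extracted_content:
            return []
        return json.loads(results.extracted_content)


def run(search_js: str, schema: dict) -> list:
    """Synchronous convenience wrapper around the coroutine above."""
    return asyncio.run(search_and_extract(search_js, schema))
```

Calling `run(search_js, schema)` opens a browser window and needs network access, so it is left out of the snippet itself.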

3. Data Extraction

Products are extracted with:

  • Title, price, rating, reviews
  • Delivery information
  • Sponsored/Small Business badges
  • Direct product URLs

📁 Files

  • amazon_r2d2_search.py - Main example script
  • header.html - Amazon search bar HTML (provided)
  • product.html - Product card HTML (provided)
  • Generated files:
    • generated_search_script.c4a - Auto-generated search automation
    • generated_product_schema.json - Auto-generated extraction rules
    • extracted_products.json - Final scraped data
    • search_results_screenshot.png - Visual proof of results

🏃 Running the Example

  1. Prerequisites

    # Ensure Crawl4AI is installed
    pip install crawl4ai
    
    # Set up LLM API key (for script generation)
    export OPENAI_API_KEY="your-key-here"
    
  2. Run the scraper

    python amazon_r2d2_search.py
    
  3. Watch the magic!

    • Browser window opens (not headless)
    • Navigates to Amazon.com
    • Searches for "r2d2"
    • Extracts all products
    • Saves results to JSON

📊 Sample Output

[
  {
    "title": "Death Star BB8 R2D2 Golf Balls with 20 Printed tees",
    "price": "29.95",
    "rating": "4.7",
    "reviews_count": "184",
    "delivery": "FREE delivery Thu, Jun 19",
    "url": "https://www.amazon.com/Death-Star-R2D2-Balls-Printed/dp/B081XSYZMS",
    "is_sponsored": true,
    "small_business": true
  },
  ...
]

🔍 Key Features Demonstrated

Session Persistence

from crawl4ai import CrawlerRunConfig

# Same session_id across multiple arun() calls
config = CrawlerRunConfig(
    session_id="amazon_r2d2_session",
    # ... other settings
)

LLM Script Generation

# Generate automation from natural language + HTML
script = C4ACompiler.generate_script(
    html=header_html,
    query="Find search box, type 'r2d2', click search",
    mode="c4a"
)

JSON CSS Extraction

# Structured data extraction with CSS selectors
schema = {
    "baseSelector": "[data-component-type='s-search-result']",
    "fields": [
        {"name": "title", "selector": "h2 a span", "type": "text"},
        {"name": "price", "selector": ".a-price-whole", "type": "text"}
    ]
}

🛠️ Customization

Search Different Products

Change the search term in the script generation:

search_goal = """
...
3. Type "star wars lego" into the search box
...
"""

Extract More Data

Add fields to the extraction schema:

"fields": [
    # ... existing fields
    {"name": "prime", "selector": ".s-prime", "type": "exists"},
    {"name": "image_url", "selector": "img.s-image", "type": "attribute", "attribute": "src"}
]
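
When adding fields, it is easy to introduce a duplicate name or to forget the `attribute` key on an `attribute` field. The following is a small sketch of a helper that merges new fields into a schema and checks for those two mistakes. `extend_schema` is a hypothetical name, and the checks reflect only the schema shape shown in this README, not Crawl4AI's own validation.

```python
import copy


def extend_schema(schema: dict, extra_fields: list) -> dict:
    """Return a copy of `schema` with `extra_fields` appended after basic checks."""
    extended = copy.deepcopy(schema)  # leave the caller's schema untouched
    names = {field["name"] for field in extended.get("fields", [])}

    for field in extra_fields:
        missing = {"name", "selector", "type"} - set(field)
        if missing:
            raise ValueError(f"field {field!r} is missing keys: {sorted(missing)}")
        if field["name"] in names:
            raise ValueError(f"duplicate field name: {field['name']}")
        if field["type"] == "attribute" and "attribute" not in field:
            raise ValueError(f"attribute field {field['name']!r} needs an 'attribute' key")
        names.add(field["name"])
        extended.setdefault("fields", []).append(dict(field))
    return extended


base_schema = {
    "baseSelector": "[data-component-type='s-search-result']",
    "fields": [
        {"name": "title", "selector": "h2 a span", "type": "text"},
        {"name": "price", "selector": ".a-price-whole", "type": "text"},
    ],
}

schema = extend_schema(base_schema, [
    {"name": "image_url", "selector": "img.s-image", "type": "attribute", "attribute": "src"},
])
```

Returning a copy keeps the original schema available for comparison if the extended version stops matching anything.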

Use Different Sites

Adapt the approach for other e-commerce sites by:

  1. Providing their HTML snippets
  2. Adjusting the search goals
  3. Updating the extraction schema

🎓 Learning Points

  1. No Manual Scripting: LLM generates all automation code
  2. Session Management: Maintain state across page navigations
  3. Robust Extraction: Handle dynamic content and multiple products
  4. Error Handling: Graceful fallbacks if generation fails
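
The fallback behaviour in point 4 could look something like the following. It is a sketch under assumptions: `generate` stands for any callable that returns a script string (for example, a wrapper around the generation call), and the file caching and placeholder script text are illustrative rather than taken from amazon_r2d2_search.py.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_or_generate(path: Path, generate: Callable[[], str], fallback: str) -> str:
    """Reuse a saved script if present, otherwise generate one, otherwise fall back."""
    if path.exists():
        return path.read_text(encoding="utf-8")
    try:
        script = generate()
    except Exception as exc:  # e.g. a missing API key or a network error
        print(f"Script generation failed ({exc}); using fallback script")
        return fallback
    if not script or not script.strip():
        return fallback  # treat an empty result as a failed generation
    path.write_text(script, encoding="utf-8")  # cache for the next run
    return script


def failing_generator() -> str:
    """Stand-in for a generation call that fails."""
    raise RuntimeError("no API key configured")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "generated_search_script.c4a"
    # Generation fails and nothing is cached yet, so the fallback is returned.
    first = load_or_generate(target, failing_generator, fallback="FALLBACK SCRIPT")
    # Generation succeeds, so the result is written to disk.
    second = load_or_generate(target, lambda: "GENERATED SCRIPT", fallback="FALLBACK SCRIPT")
    # The cached file is reused, so the failing generator is never called.
    third = load_or_generate(target, failing_generator, fallback="FALLBACK SCRIPT")
```

Caching the generated file also avoids a repeat LLM call on every run; delete the file to force regeneration after the page structure changes.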

🐛 Troubleshooting

  • "No products found": Check if Amazon's HTML structure changed
  • "Script generation failed": Ensure LLM API key is configured
  • "Page timeout": Increase wait times in the config
  • "Session lost": Ensure same session_id is used consistently

📚 Next Steps

  • Try searching for different products
  • Add pagination to get more results
  • Extract product detail pages
  • Compare prices across different sellers
  • Build a price monitoring system

This example shows the power of combining LLM intelligence with web automation. Because the scripts are generated from HTML snippets, they can be regenerated when the page structure changes, and natural language instructions make automation accessible to everyone!