crawl4ai/docs/examples/c4a_script/amazon_example/README.md

# Amazon R2D2 Product Search Example

A real-world demonstration of Crawl4AI's multi-step crawling with LLM-generated automation scripts.

## 🎯 What This Example Shows

This example demonstrates advanced Crawl4AI features:
- **LLM-Generated Scripts**: Automatically create C4A-Script from HTML snippets
- **Multi-Step Crawling**: Navigate through multiple pages using session persistence
- **Structured Data Extraction**: Extract product data using JSON CSS schemas
- **Visual Automation**: Watch the browser perform the search (headless=False)

## 🚀 How It Works

### 1. **Script Generation Phase**
The example uses `C4ACompiler.generate_script()` to analyze Amazon's HTML and create:
- **Search Script**: Automates filling the search box and clicking search
- **Extraction Schema**: Defines how to extract product information

### 2. **Crawling Workflow**
```
Homepage → Execute Search Script → Extract Products → Save Results
```

All steps use the same `session_id` to maintain browser state.

### 3. **Data Extraction**
Products are extracted with:
- Title, price, rating, reviews
- Delivery information
- Sponsored/Small Business badges
- Direct product URLs

## 📁 Files

- `amazon_r2d2_search.py` - Main example script
- `header.html` - Amazon search bar HTML (provided)
- `product.html` - Product card HTML (provided)
- **Generated files:**
  - `generated_search_script.c4a` - Auto-generated search automation
  - `generated_product_schema.json` - Auto-generated extraction rules
  - `extracted_products.json` - Final scraped data
  - `search_results_screenshot.png` - Visual proof of results

## 🏃 Running the Example

1. **Prerequisites**
   ```bash
   # Ensure Crawl4AI is installed
   pip install crawl4ai

   # Set up LLM API key (for script generation)
   export OPENAI_API_KEY="your-key-here"
   ```

2. **Run the scraper**
   ```bash
   python amazon_r2d2_search.py
   ```

3. **Watch the magic!**
   - Browser window opens (not headless)
   - Navigates to Amazon.com
   - Searches for "r2d2"
   - Extracts all products
   - Saves results to JSON

## 📊 Sample Output

```json
[
  {
    "title": "Death Star BB8 R2D2 Golf Balls with 20 Printed tees",
    "price": "29.95",
    "rating": "4.7",
    "reviews_count": "184",
    "delivery": "FREE delivery Thu, Jun 19",
    "url": "https://www.amazon.com/Death-Star-R2D2-Balls-Printed/dp/B081XSYZMS",
    "is_sponsored": true,
    "small_business": true
  },
  ...
]
```

## 🔍 Key Features Demonstrated

### Session Persistence
```python
# Same session_id across multiple arun() calls
config = CrawlerRunConfig(
    session_id="amazon_r2d2_session",
    # ... other settings
)
```

### LLM Script Generation
```python
# Generate automation from natural language + HTML
script = C4ACompiler.generate_script(
    html=header_html,
    query="Find search box, type 'r2d2', click search",
    mode="c4a"
)
```

### JSON CSS Extraction
```python
# Structured data extraction with CSS selectors
schema = {
    "baseSelector": "[data-component-type='s-search-result']",
    "fields": [
        {"name": "title", "selector": "h2 a span", "type": "text"},
        {"name": "price", "selector": ".a-price-whole", "type": "text"}
    ]
}
```

## 🛠️ Customization

### Search Different Products
Change the search term in the script generation:
```python
search_goal = """
...
3. Type "star wars lego" into the search box
...
"""
```

### Extract More Data
Add fields to the extraction schema:
```python
"fields": [
    # ... existing fields
    {"name": "prime", "selector": ".s-prime", "type": "exists"},
    {"name": "image_url", "selector": "img.s-image", "type": "attribute", "attribute": "src"}
]
```

### Use Different Sites
Adapt the approach for other e-commerce sites by:
1. Providing their HTML snippets
2. Adjusting the search goals
3. Updating the extraction schema

## 🎓 Learning Points

1. **No Manual Scripting**: LLM generates all automation code
2. **Session Management**: Maintain state across page navigations
3. **Robust Extraction**: Handle dynamic content and multiple products
4. **Error Handling**: Graceful fallbacks if generation fails

## 🐛 Troubleshooting

- **"No products found"**: Check if Amazon's HTML structure changed
- **"Script generation failed"**: Ensure LLM API key is configured
- **"Page timeout"**: Increase wait times in the config
- **"Session lost"**: Ensure same session_id is used consistently

## 📚 Next Steps

- Try searching for different products
- Add pagination to get more results
- Extract product details pages
- Compare prices across different sellers
- Build a price monitoring system

---

This example shows the power of combining LLM intelligence with web automation. The scripts adapt to HTML changes and natural language instructions make automation accessible to everyone!