# Web Scraper API with Custom Model Support

A powerful web scraping API that converts any website into structured data using AI, with a minimalist frontend interface and support for custom LLM models.

## Features

- **AI-Powered Scraping**: Provide a URL and a plain-English query to extract structured data
- **Minimalist Frontend**: Modern black-and-white interface with smooth UX
- **Custom Model Support**: Use any LLM provider (OpenAI, Gemini, Anthropic, etc.) with your own API keys
- **Model Management**: Save, list, and manage multiple model configurations via the web interface
- **Dual Scraping Approaches**: Choose between schema-based (faster) and LLM-based (more flexible) extraction
- **API Request History**: Automatic saving and display of all API requests, including cURL commands
- **Schema Caching**: Caches generated schemas for faster subsequent requests
- **Duplicate Prevention**: Avoids saving duplicate requests (same URL + query)
- **RESTful API**: Easy-to-use HTTP endpoints for all operations

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Start the API Server

```bash
python api_server.py
```

The server starts on `http://localhost:8000` and serves the web interface.

### 3. Use the Web Interface

Once the server is running, open `http://localhost:8000` in your browser.

#### Pages

- **Scrape Data**: Enter URLs and queries to extract structured data
- **Models**: Manage your AI model configurations (add, list, delete)
- **API Requests**: View the history of all scraping requests, with cURL commands

#### Interface Features

- **Minimalist Design**: Clean black-and-white theme inspired by modern web apps
- **Real-time Results**: See extracted data as formatted JSON
- **Copy to Clipboard**: Easy copying of results
- **Toast Notifications**: User-friendly feedback
- **Dual Scraping Modes**: Choose between schema-based and LLM-based approaches

## Model Management

### Adding Models via the Web Interface

1. Go to the **Models** page
2. Enter your model details:
   - **Provider**: LLM provider (e.g., `gemini/gemini-2.5-flash`, `openai/gpt-4o`)
   - **API Token**: Your API key for the provider
3. Click "Add Model"

### Model Management via the API

#### Save a Model Configuration

```bash
curl -X POST "http://localhost:8000/models" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "gemini/gemini-2.5-flash",
    "api_token": "your-api-key-here"
  }'
```

#### List Saved Models

```bash
curl -X GET "http://localhost:8000/models"
```

#### Delete a Model Configuration

```bash
curl -X DELETE "http://localhost:8000/models/my-gemini"
```
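If you prefer to script these calls rather than use cURL, a minimal Python sketch against the same endpoints could look like the following. The `requests` dependency and the `BASE_URL` constant are assumptions for illustration, not part of this project:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed local server from the Quick Start

# Save a model configuration (same payload as the curl example above)
resp = requests.post(
    f"{BASE_URL}/models",
    json={"provider": "gemini/gemini-2.5-flash", "api_token": "your-api-key-here"},
)
resp.raise_for_status()

# List saved model configurations
print(requests.get(f"{BASE_URL}/models").json())

# Delete a configuration by name ("my-gemini", as in the curl example)
requests.delete(f"{BASE_URL}/models/my-gemini").raise_for_status()
```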
## Scraping Approaches

### 1. Schema-based Scraping (Faster)

- Generates CSS selectors for targeted extraction
- Caches schemas for repeated requests
- Faster execution on structured websites

### 2. LLM-based Scraping (More Flexible)

- Direct LLM extraction without schema generation
- More flexible for complex or dynamic content
- Better for unstructured data extraction

## Supported LLM Providers

The API supports any LLM provider that crawl4ai supports, including:

- **Google Gemini**: `gemini/gemini-2.5-flash`, `gemini/gemini-pro`
- **OpenAI**: `openai/gpt-4`, `openai/gpt-3.5-turbo`
- **Anthropic**: `anthropic/claude-3-opus`, `anthropic/claude-3-sonnet`
- **And more...**

## API Endpoints

### Core Endpoints

- `POST /scrape` - Schema-based scraping
- `POST /scrape-with-llm` - LLM-based scraping
- `GET /schemas` - List cached schemas
- `POST /clear-cache` - Clear the schema cache
- `GET /health` - Health check

### Model Management Endpoints

- `GET /models` - List saved model configurations
- `POST /models` - Save a new model configuration
- `DELETE /models/{model_name}` - Delete a model configuration

### API Request History

- `GET /saved-requests` - List all saved API requests
- `DELETE /saved-requests/{request_id}` - Delete a saved request

## Request/Response Examples

### Scrape Request

```json
{
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "model_name": "my-custom-model"
}
```

### Scrape Response

```json
{
  "success": true,
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "extracted_data": {
    "product_name": "Example Product",
    "price": "$99.99",
    "description": "This is an example product description"
  },
  "schema_used": { ... },
  "timestamp": "2024-01-01T12:00:00Z"
}
```

### Model Configuration Request

```json
{
  "provider": "gemini/gemini-2.5-flash",
  "api_token": "your-api-key-here"
}
```

## Testing

Run the test script to verify the model management functionality:

```bash
python test_models.py
```

## File Structure

```
parse_example/
├── api_server.py        # FastAPI server with all endpoints
├── web_scraper_lib.py   # Core scraping library
├── test_models.py       # Test script for model management
├── requirements.txt     # Dependencies
├── static/              # Frontend files
│   ├── index.html       # Main HTML interface
│   ├── styles.css       # CSS styles (minimalist theme)
│   └── script.js        # JavaScript functionality
├── schemas/             # Cached schemas
├── models/              # Saved model configurations
├── saved_requests/      # API request history
└── README.md            # This file
```

## Advanced Usage

### Using the Library Directly

The scraping methods are coroutines, so they must be awaited inside an async function:

```python
import asyncio

from web_scraper_lib import WebScraperAgent

# Initialize the agent
agent = WebScraperAgent()

# Save a model configuration
agent.save_model_config(
    model_name="my-model",
    provider="openai/gpt-4",
    api_token="your-api-key"
)

async def main():
    # Schema-based scraping
    result = await agent.scrape_data(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model"
    )

    # LLM-based scraping
    result = await agent.scrape_data_with_llm(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model"
    )

asyncio.run(main())
```

### Schema Caching

The system automatically caches generated schemas, keyed by URL and query combination:

- **First request**: Generates a schema using AI
- **Subsequent requests**: Reuses the cached schema for faster extraction
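The exact cache layout is internal to the library. Purely as an illustration of the idea, a cache keyed on the URL + query combination could be implemented roughly like this; the directory, file naming, and helper names below are assumptions, not the library's actual code:

```python
import hashlib
import json
from pathlib import Path

SCHEMA_DIR = Path("schemas")  # assumed cache directory (see File Structure)

def schema_cache_path(url: str, query: str) -> Path:
    # Hash the URL + query combination so equivalent requests map to one file
    key = hashlib.sha256(f"{url}\n{query}".encode("utf-8")).hexdigest()
    return SCHEMA_DIR / f"{key}.json"

def load_cached_schema(url: str, query: str) -> dict | None:
    # Cache hit: return the previously generated schema; miss: return None
    path = schema_cache_path(url, query)
    return json.loads(path.read_text()) if path.exists() else None

def save_schema(url: str, query: str, schema: dict) -> None:
    SCHEMA_DIR.mkdir(exist_ok=True)
    schema_cache_path(url, query).write_text(json.dumps(schema, indent=2))
```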
### API Request History

All API requests are automatically saved with:

- Request details (URL, query, model used)
- Response data
- Timestamp
- A cURL command for re-execution

### Duplicate Prevention

The system prevents saving duplicate requests:

- The same URL + query combination is not saved multiple times
- The existing request ID is returned for duplicates
- This keeps the API request history clean

## Error Handling

The API provides detailed error messages for common issues:

- Invalid URLs
- Missing model configurations
- API key errors
- Network timeouts
- Parsing errors
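As a sketch of how a client might surface these errors, assuming a `requests`-based caller and the response shape shown in the Scrape Response example above (the `success` flag comes from that example; the rest is illustrative):

```python
import requests

payload = {
    "url": "https://example.com",
    "query": "Extract the product name, price, and description",
    "model_name": "my-custom-model",
}

try:
    resp = requests.post("http://localhost:8000/scrape", json=payload, timeout=120)
    resp.raise_for_status()       # HTTP-level errors (e.g., missing model config)
    body = resp.json()
    if not body.get("success"):   # application-level errors reported by the API
        print("Scrape failed:", body)
    else:
        print(body["extracted_data"])
except requests.exceptions.Timeout:
    print("Network timeout while scraping")
except requests.exceptions.RequestException as exc:
    print("Request error:", exc)
```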