Web Scraper API with Custom Model Support

A powerful web scraping API that converts any website into structured data using AI. Features a beautiful minimalist frontend interface and support for custom LLM models!

Features

  • AI-Powered Scraping: Provide a URL and plain English query to extract structured data
  • Beautiful Frontend: Modern minimalist black-and-white interface with smooth UX
  • Custom Model Support: Use any LLM provider (OpenAI, Gemini, Anthropic, etc.) with your own API keys
  • Model Management: Save, list, and manage multiple model configurations via web interface
  • Dual Scraping Approaches: Choose between Schema-based (faster) or LLM-based (more flexible) extraction
  • API Request History: Automatic saving and display of all API requests with cURL commands
  • Schema Caching: Intelligent caching of generated schemas for faster subsequent requests
  • Duplicate Prevention: Avoids saving duplicate requests (same URL + query)
  • RESTful API: Easy-to-use HTTP endpoints for all operations

Quick Start

1. Install Dependencies

pip install -r requirements.txt
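
The exact dependency list lives in requirements.txt; given the stack this README describes (a FastAPI backend driving the crawl4ai library), expect it to include at least something along these lines (package names only, no versions implied):

fastapi
uvicorn
crawl4ai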

2. Start the API Server

python api_server.py

The server will start on http://localhost:8000 with a beautiful web interface!

3. Using the Web Interface

Once the server is running, open your browser and go to http://localhost:8000 to access the modern web interface!

Pages:

  • Scrape Data: Enter URLs and queries to extract structured data
  • Models: Manage your AI model configurations (add, list, delete)
  • API Requests: View history of all scraping requests with cURL commands

Features:

  • Minimalist Design: Clean black-and-white theme inspired by modern web apps
  • Real-time Results: See extracted data in formatted JSON
  • Copy to Clipboard: Easy copying of results
  • Toast Notifications: User-friendly feedback
  • Dual Scraping Modes: Choose between Schema-based and LLM-based approaches

Model Management

Adding Models via Web Interface

  1. Go to the Models page
  2. Enter your model details:
    • Provider: LLM provider (e.g., gemini/gemini-2.5-flash, openai/gpt-4o)
    • API Token: Your API key for the provider
  3. Click "Add Model"

API Usage for Model Management

Save a Model Configuration

curl -X POST "http://localhost:8000/models" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "gemini/gemini-2.5-flash",
    "api_token": "your-api-key-here"
  }'

List Saved Models

curl -X GET "http://localhost:8000/models"

Delete a Model Configuration

curl -X DELETE "http://localhost:8000/models/my-gemini"
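
Model configurations are persisted as local JSON files under models/ (see the File Structure below). A minimal sketch of what that file-based storage could look like — the helper names here are illustrative, not the library's actual API:

# Hypothetical sketch of file-based model storage; save_model_config /
# load_model_config are illustrative names, not the library's real functions.
import json
from pathlib import Path

MODELS_DIR = Path("models")
MODELS_DIR.mkdir(exist_ok=True)

def save_model_config(model_name: str, provider: str, api_token: str) -> None:
    # One JSON file per configuration, keyed by model name.
    config = {"provider": provider, "api_token": api_token}
    (MODELS_DIR / f"{model_name}.json").write_text(json.dumps(config, indent=2))

def load_model_config(model_name: str) -> dict:
    return json.loads((MODELS_DIR / f"{model_name}.json").read_text())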

Scraping Approaches

1. Schema-based Scraping (Faster)

  • Generates CSS selectors for targeted extraction
  • Caches schemas for repeated requests
  • Faster execution for structured websites

2. LLM-based Scraping (More Flexible)

  • Direct LLM extraction without schema generation
  • More flexible for complex or dynamic content
  • Better for unstructured data extraction
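
To make the difference concrete, here is a minimal sketch of how the two approaches map onto crawl4ai's extraction strategies. It assumes a recent crawl4ai release (where LLMConfig is importable from the top-level package); the schema contents and the wiring around the strategies are illustrative, not this example's exact code.

# Minimal sketch: schema-based vs. LLM-based extraction in crawl4ai.
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy

# 1. Schema-based: CSS selectors drive extraction, no LLM call at scrape time.
schema = {
    "name": "products",
    "baseSelector": "div.product",   # selectors here are placeholders
    "fields": [
        {"name": "product_name", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
    ],
}
css_strategy = JsonCssExtractionStrategy(schema)

# 2. LLM-based: the model reads the page and extracts per the instruction.
llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="gemini/gemini-2.5-flash", api_token="your-api-key-here"),
    instruction="Extract the product name, price, and description",
)

async def run(strategy):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(extraction_strategy=strategy),
        )
        return result.extracted_content  # JSON string

print(asyncio.run(run(css_strategy)))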

Supported LLM Providers

The API supports any LLM provider that crawl4ai supports, including:

  • Google Gemini: gemini/gemini-2.5-flash, gemini/gemini-pro
  • OpenAI: openai/gpt-4, openai/gpt-3.5-turbo
  • Anthropic: anthropic/claude-3-opus, anthropic/claude-3-sonnet
  • And more...

API Endpoints

Core Endpoints

  • POST /scrape - Schema-based scraping
  • POST /scrape-with-llm - LLM-based scraping
  • GET /schemas - List cached schemas
  • POST /clear-cache - Clear schema cache
  • GET /health - Health check

Model Management Endpoints

  • GET /models - List saved model configurations
  • POST /models - Save a new model configuration
  • DELETE /models/{model_name} - Delete a model configuration

API Request History

  • GET /saved-requests - List all saved API requests
  • DELETE /saved-requests/{request_id} - Delete a saved request
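
For orientation, here is an illustrative sketch of the /scrape endpoint's shape with Pydantic request validation. The class and variable names are assumptions, not necessarily what api_server.py uses; the response shape mirrors the "Scrape Response" example below.

# Illustrative endpoint shape only; the real api_server.py may differ.
from fastapi import FastAPI
from pydantic import BaseModel

from web_scraper_lib import WebScraperAgent

app = FastAPI()
agent = WebScraperAgent()

class ScrapeRequest(BaseModel):  # hypothetical name for the request body
    url: str
    query: str
    model_name: str

@app.post("/scrape")
async def scrape(req: ScrapeRequest):
    # Pydantic validates the body first; a malformed request is
    # rejected with HTTP 422 before this handler runs.
    data = await agent.scrape_data(
        url=req.url, query=req.query, model_name=req.model_name
    )
    return {"success": True, "url": req.url, "query": req.query,
            "extracted_data": data}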

Request/Response Examples

Scrape Request

{
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "model_name": "my-custom-model"
}

Scrape Response

{
  "success": true,
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "extracted_data": {
    "product_name": "Example Product",
    "price": "$99.99",
    "description": "This is an example product description"
  },
  "schema_used": { ... },
  "timestamp": "2024-01-01T12:00:00Z"
}

Model Configuration Request

{
  "provider": "gemini/gemini-2.5-flash",
  "api_token": "your-api-key-here"
}

Testing

Run the test script to verify the model management functionality:

python test_models.py

File Structure

website-to-api/
├── api_server.py          # FastAPI server with all endpoints
├── web_scraper_lib.py     # Core scraping library
├── test_models.py         # Test script for model management
├── requirements.txt       # Dependencies
├── static/               # Frontend files
│   ├── index.html        # Main HTML interface
│   ├── styles.css        # CSS styles (minimalist theme)
│   └── script.js         # JavaScript functionality
├── schemas/              # Cached schemas
├── models/               # Saved model configurations
├── saved_requests/       # API request history
└── README.md            # This file

Advanced Usage

Using the Library Directly

import asyncio

from web_scraper_lib import WebScraperAgent

async def main():
    # Initialize agent
    agent = WebScraperAgent()

    # Save a model configuration
    agent.save_model_config(
        model_name="my-model",
        provider="openai/gpt-4",
        api_token="your-api-key"
    )

    # Schema-based scraping
    result = await agent.scrape_data(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model"
    )

    # LLM-based scraping
    result = await agent.scrape_data_with_llm(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model"
    )

# The scrape calls are coroutines, so they need an event loop:
asyncio.run(main())

Schema Caching

The system automatically caches generated schemas based on URL and query combinations:

  • First request: Generates schema using AI
  • Subsequent requests: Uses cached schema for faster extraction
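
One plausible way to key the cache (the example's real scheme may differ) is to hash the URL and query together and use the digest as a filename under schemas/:

# Sketch of a URL+query cache key; the actual implementation may differ.
import hashlib
from pathlib import Path

SCHEMA_DIR = Path("schemas")

def schema_cache_path(url: str, query: str) -> Path:
    key = hashlib.sha256(f"{url}::{query}".encode()).hexdigest()[:16]
    return SCHEMA_DIR / f"{key}.json"

# First request: generate a schema with the LLM and write it to this path.
# Later requests: if the file exists, load it and skip the LLM call entirely.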

API Request History

All API requests are automatically saved with:

  • Request details (URL, query, model used)
  • Response data
  • Timestamp
  • cURL command for re-execution
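
Put together, a saved request record might look something like this (field names are illustrative; only the items listed above are guaranteed by the example):

{
  "id": "a1b2c3d4",
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "model_name": "my-custom-model",
  "response": { ... },
  "timestamp": "2024-01-01T12:00:00Z",
  "curl_command": "curl -X POST http://localhost:8000/scrape -H 'Content-Type: application/json' -d '...'"
}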

Duplicate Prevention

The system prevents saving duplicate requests:

  • Same URL + query combinations are not saved multiple times
  • Returns existing request ID for duplicates
  • Keeps the API request history clean
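
A sketch of how that check could work against the saved_requests/ directory — the function and field names are assumptions, not the example's exact code:

# One plausible implementation of the duplicate check (illustrative).
import json
from pathlib import Path

SAVED_DIR = Path("saved_requests")

def find_existing_request(url: str, query: str) -> str | None:
    """Return the existing request ID for this URL + query, or None."""
    for path in SAVED_DIR.glob("*.json"):
        saved = json.loads(path.read_text())
        if saved.get("url") == url and saved.get("query") == query:
            return path.stem  # the request ID doubles as the filename
    return None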

Error Handling

The API provides detailed error messages for common issues:

  • Invalid URLs
  • Missing model configurations
  • API key errors
  • Network timeouts
  • Parsing errors
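
A hedged sketch of how such failures might be mapped onto HTTP error responses — the exception types and helper name are assumptions, not the example's exact code:

# Illustrative error-to-HTTP mapping around the scraping call.
from fastapi import HTTPException

async def scrape_or_http_error(agent, url: str, query: str, model_name: str):
    try:
        return await agent.scrape_data(url=url, query=query, model_name=model_name)
    except FileNotFoundError:
        # e.g. the named model configuration is missing from models/
        raise HTTPException(status_code=404, detail=f"Unknown model: {model_name}")
    except Exception as exc:
        # invalid URLs, network timeouts, parsing errors, API-key failures, ...
        raise HTTPException(status_code=500, detail=str(exc))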