This commit adds a complete web scraping API example that demonstrates how to extract structured data from any website and use it like an API, built on the crawl4ai library with a minimalist frontend interface.

Core Functionality
- AI-powered web scraping with plain English queries
- Dual scraping approaches: schema-based (faster) and LLM-based (flexible)
- Intelligent schema caching for improved performance
- Custom LLM model support with API key management
- Automatic duplicate request prevention

Modern Frontend Interface
- Minimalist black-and-white design inspired by modern web apps
- Responsive layout with smooth animations and transitions
- Three main pages: Scrape Data, Models Management, API Request History
- Real-time results display with JSON formatting
- Copy-to-clipboard functionality for extracted data
- Toast notifications for user feedback
- Auto-scroll to results when scraping starts

Model Management System
- Web-based model configuration interface
- Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.)
- Simplified configuration requiring only a provider and an API token
- Add, list, and delete model configurations
- Storage of API keys in local JSON files

API Request History
- Automatic saving of all API requests and responses
- Display of request history with URL, query, and cURL commands
- Duplicate prevention (same URL + query combinations)
- Request deletion functionality
- Clean, simplified display focusing on essential information

Technical Implementation

Backend (FastAPI)
- RESTful API with comprehensive endpoints
- Pydantic models for request/response validation
- Async web scraping with the crawl4ai library
- Error handling with detailed error messages
- File-based storage for models and request history

Frontend (Vanilla JS/CSS/HTML)
- No framework dependencies: pure HTML, CSS, and JavaScript
- Modern CSS Grid and Flexbox layouts
- Custom dropdown styling with SVG arrows
- Responsive design for mobile and desktop
- Smooth scrolling and animations

Core Library Integration
- WebScraperAgent class for orchestration
- ModelConfig class for LLM configuration management
- Schema generation and caching system
- LLM extraction strategy support
- Browser configuration with headless mode
Web Scraper API with Custom Model Support
A powerful web scraping API that converts any website into structured data using AI. Features a beautiful minimalist frontend interface and support for custom LLM models!
Features
- AI-Powered Scraping: Provide a URL and plain English query to extract structured data
- Beautiful Frontend: Modern minimalist black-and-white interface with smooth UX
- Custom Model Support: Use any LLM provider (OpenAI, Gemini, Anthropic, etc.) with your own API keys
- Model Management: Save, list, and manage multiple model configurations via web interface
- Dual Scraping Approaches: Choose between Schema-based (faster) or LLM-based (more flexible) extraction
- API Request History: Automatic saving and display of all API requests with cURL commands
- Schema Caching: Intelligent caching of generated schemas for faster subsequent requests
- Duplicate Prevention: Avoids saving duplicate requests (same URL + query)
- RESTful API: Easy-to-use HTTP endpoints for all operations
Quick Start
1. Install Dependencies
pip install -r requirements.txt
2. Start the API Server
python api_server.py
The server will start on http://localhost:8000 with a beautiful web interface!
3. Using the Web Interface
Once the server is running, open your browser and go to http://localhost:8000 to access the modern web interface!
Pages:
- Scrape Data: Enter URLs and queries to extract structured data
- Models: Manage your AI model configurations (add, list, delete)
- API Requests: View history of all scraping requests with cURL commands
Features:
- Minimalist Design: Clean black-and-white theme inspired by modern web apps
- Real-time Results: See extracted data in formatted JSON
- Copy to Clipboard: Easy copying of results
- Toast Notifications: User-friendly feedback
- Dual Scraping Modes: Choose between Schema-based and LLM-based approaches
Model Management
Adding Models via Web Interface
- Go to the Models page
- Enter your model details:
  - Provider: LLM provider (e.g., gemini/gemini-2.5-flash, openai/gpt-4o)
  - API Token: Your API key for the provider
- Click "Add Model"
API Usage for Model Management
Save a Model Configuration
curl -X POST "http://localhost:8000/models" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "gemini/gemini-2.5-flash",
    "api_token": "your-api-key-here"
  }'
List Saved Models
curl -X GET "http://localhost:8000/models"
Delete a Model Configuration
curl -X DELETE "http://localhost:8000/models/my-gemini"
Scraping Approaches
1. Schema-based Scraping (Faster)
- Generates CSS selectors for targeted extraction
- Caches schemas for repeated requests
- Faster execution for structured websites
2. LLM-based Scraping (More Flexible)
- Direct LLM extraction without schema generation
- More flexible for complex or dynamic content
- Better for unstructured data extraction
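Assuming the server is running locally on its default port, the two approaches differ only in which endpoint receives the same JSON payload. A hypothetical client helper (the function names here are illustrative, not part of the project) might look like:

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # default address from the Quick Start


def build_scrape_request(url, query, use_llm=False, model_name=None):
    """Pick the endpoint for the chosen approach and build its JSON payload."""
    endpoint = "/scrape-with-llm" if use_llm else "/scrape"
    payload = {"url": url, "query": query}
    if model_name:
        payload["model_name"] = model_name
    return BASE_URL + endpoint, payload


def scrape(url, query, use_llm=False, model_name=None):
    """POST the request and return the decoded JSON response."""
    full_url, payload = build_scrape_request(url, query, use_llm, model_name)
    req = request.Request(
        full_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `scrape("https://example.com", "Extract prices", use_llm=True)` would hit /scrape-with-llm; leaving `use_llm` off uses the faster schema-based /scrape route.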
Supported LLM Providers
The API supports any LLM provider that crawl4ai supports, including:
- Google Gemini: gemini/gemini-2.5-flash, gemini/gemini-pro
- OpenAI: openai/gpt-4, openai/gpt-3.5-turbo
- Anthropic: anthropic/claude-3-opus, anthropic/claude-3-sonnet
- And more...
API Endpoints
Core Endpoints
- POST /scrape - Schema-based scraping
- POST /scrape-with-llm - LLM-based scraping
- GET /schemas - List cached schemas
- POST /clear-cache - Clear schema cache
- GET /health - Health check
Model Management Endpoints
- GET /models - List saved model configurations
- POST /models - Save a new model configuration
- DELETE /models/{model_name} - Delete a model configuration
API Request History
- GET /saved-requests - List all saved API requests
- DELETE /saved-requests/{request_id} - Delete a saved request
Request/Response Examples
Scrape Request
{
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "model_name": "my-custom-model"
}
Scrape Response
{
  "success": true,
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "extracted_data": {
    "product_name": "Example Product",
    "price": "$99.99",
    "description": "This is an example product description"
  },
  "schema_used": { ... },
  "timestamp": "2024-01-01T12:00:00Z"
}
Model Configuration Request
{
  "provider": "gemini/gemini-2.5-flash",
  "api_token": "your-api-key-here"
}
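The backend validates this body with a Pydantic model. As a rough sketch of the same two-field shape, here is a stdlib dataclass equivalent (the class name and validation rules are illustrative, not the project's actual code):

```python
from dataclasses import dataclass


# The real backend uses a Pydantic model; this dataclass mirrors the
# same provider + api_token shape for illustration only.
@dataclass
class ModelConfig:
    provider: str   # e.g. "gemini/gemini-2.5-flash"
    api_token: str  # the provider API key

    def __post_init__(self):
        # Providers follow a "vendor/model-name" convention.
        if "/" not in self.provider:
            raise ValueError("provider should look like 'vendor/model-name'")
        if not self.api_token:
            raise ValueError("api_token is required")
```

With Pydantic, malformed bodies are rejected automatically and FastAPI returns a 422 response before the endpoint handler runs.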
Testing
Run the test script to verify the model management functionality:
python test_models.py
File Structure
parse_example/
├── api_server.py # FastAPI server with all endpoints
├── web_scraper_lib.py # Core scraping library
├── test_models.py # Test script for model management
├── requirements.txt # Dependencies
├── static/ # Frontend files
│ ├── index.html # Main HTML interface
│ ├── styles.css # CSS styles (minimalist theme)
│ └── script.js # JavaScript functionality
├── schemas/ # Cached schemas
├── models/ # Saved model configurations
├── saved_requests/ # API request history
└── README.md # This file
Advanced Usage
Using the Library Directly
import asyncio

from web_scraper_lib import WebScraperAgent

# Initialize agent
agent = WebScraperAgent()

# Save a model configuration
agent.save_model_config(
    model_name="my-model",
    provider="openai/gpt-4",
    api_token="your-api-key"
)

# The scrape methods are coroutines, so await them inside an async function
async def main():
    # Schema-based scraping
    result = await agent.scrape_data(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model"
    )

    # LLM-based scraping
    result = await agent.scrape_data_with_llm(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model"
    )

asyncio.run(main())
Schema Caching
The system automatically caches generated schemas based on URL and query combinations:
- First request: Generates schema using AI
- Subsequent requests: Uses cached schema for faster extraction
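One plausible way to key such a cache (the actual derivation inside web_scraper_lib may differ) is to hash the URL + query pair into a filename under the schemas/ directory:

```python
import hashlib
import json
from pathlib import Path

SCHEMA_DIR = Path("schemas")  # matches the schemas/ folder in the file structure


def schema_cache_path(url, query):
    """Derive a stable filename from the URL + query combination."""
    key = hashlib.sha256(f"{url}\n{query}".encode()).hexdigest()[:16]
    return SCHEMA_DIR / f"{key}.json"


def load_cached_schema(url, query):
    """Return the cached schema if present, else None (caller generates one)."""
    path = schema_cache_path(url, query)
    if path.exists():
        return json.loads(path.read_text())
    return None
```

Because the key is a hash of both inputs, changing either the URL or the query produces a fresh cache entry, while repeating an identical request skips schema generation entirely.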
API Request History
All API requests are automatically saved with:
- Request details (URL, query, model used)
- Response data
- Timestamp
- cURL command for re-execution
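The stored cURL command can be reconstructed from the saved fields alone. A sketch of that reconstruction (the helper name is hypothetical; the endpoint and flags match the examples in this README):

```python
import json


def curl_for_request(url, query, model_name=None):
    """Rebuild a copy-pasteable cURL command for a saved scrape request."""
    body = {"url": url, "query": query}
    if model_name:
        body["model_name"] = model_name
    payload = json.dumps(body)
    return (
        'curl -X POST "http://localhost:8000/scrape" '
        '-H "Content-Type: application/json" '
        f"-d '{payload}'"
    )
```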
Duplicate Prevention
The system prevents saving duplicate requests:
- Same URL + query combinations are not saved multiple times
- Returns existing request ID for duplicates
- Keeps the API request history clean
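The check described above amounts to a lookup on the (URL, query) pair before writing a new record. A minimal sketch, assuming requests are stored as dicts with an "id" field (the field names are illustrative):

```python
def find_duplicate(saved_requests, url, query):
    """Return the ID of an existing request with the same URL + query, if any."""
    for req in saved_requests:
        if req["url"] == url and req["query"] == query:
            return req["id"]
    return None


def save_request(saved_requests, url, query):
    """Save a new request, or return the existing ID for a duplicate."""
    existing = find_duplicate(saved_requests, url, query)
    if existing is not None:
        return existing  # duplicate: keep history clean, reuse the old entry
    new_id = f"req-{len(saved_requests) + 1}"
    saved_requests.append({"id": new_id, "url": url, "query": query})
    return new_id
```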
Error Handling
The API provides detailed error messages for common issues:
- Invalid URLs
- Missing model configurations
- API key errors
- Network timeouts
- Parsing errors
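In FastAPI these would typically surface as HTTPException responses. As a sketch of how the failure modes above might map to status codes (the exception types and codes here are assumptions, not the project's actual mapping):

```python
# Illustrative mapping from common failure modes to (status_code, message)
# pairs; a FastAPI handler would raise HTTPException with these values.
def classify_error(exc):
    if isinstance(exc, ValueError):     # e.g. invalid URL or bad input
        return 400, f"Invalid request: {exc}"
    if isinstance(exc, KeyError):       # e.g. missing model configuration
        return 404, f"Model configuration not found: {exc}"
    if isinstance(exc, TimeoutError):   # network timeout while crawling
        return 504, "The target site timed out"
    return 500, f"Scraping failed: {exc}"  # parsing errors, provider errors
```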