feat: Add comprehensive website to API example with frontend

This commit adds a complete web scraping API example that demonstrates how to get structured data from any website and use it like an API, using the crawl4ai library with a minimalist frontend interface.

Core Functionality
- AI-powered web scraping with plain English queries
- Dual scraping approaches: Schema-based (faster) and LLM-based (flexible)
- Intelligent schema caching for improved performance
- Custom LLM model support with API key management
- Automatic duplicate request prevention

Modern Frontend Interface
- Minimalist black-and-white design inspired by modern web apps
- Responsive layout with smooth animations and transitions
- Three main pages: Scrape Data, Models Management, API Request History
- Real-time results display with JSON formatting
- Copy-to-clipboard functionality for extracted data
- Toast notifications for user feedback
- Auto-scroll to results when scraping starts

Model Management System
- Web-based model configuration interface
- Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.)
- Simplified configuration requiring only provider and API token
- Add, list, and delete model configurations
- Secure storage of API keys in local JSON files

API Request History
- Automatic saving of all API requests and responses
- Display of request history with URL, query, and cURL commands
- Duplicate prevention (same URL + query combinations)
- Request deletion functionality
- Clean, simplified display focusing on essential information

Technical Implementation

Backend (FastAPI)
- RESTful API with comprehensive endpoints
- Pydantic models for request/response validation
- Async web scraping with crawl4ai library
- Error handling with detailed error messages
- File-based storage for models and request history

Frontend (Vanilla JS/CSS/HTML)
- No framework dependencies - pure HTML, CSS, JavaScript
- Modern CSS Grid and Flexbox layouts
- Custom dropdown styling with SVG arrows
- Responsive design for mobile and desktop
- Smooth scrolling and animations

Core Library Integration
- WebScraperAgent class for orchestration
- ModelConfig class for LLM configuration management
- Schema generation and caching system
- LLM extraction strategy support
- Browser configuration with headless mode
2025-08-24 18:20:15 +05:30
parent 40ab287c90
commit b1dff5a4d3
12 changed files with 2749 additions and 0 deletions
--- a/docs/examples/website-to-api/README.md
+++ b/docs/examples/website-to-api/README.md
@@ -0,0 +1,252 @@
+# Web Scraper API with Custom Model Support
+
+A web scraping API that converts any website into structured data using AI. It includes a minimalist frontend interface and supports custom LLM models.
+
+## Features
+
+- **AI-Powered Scraping**: Provide a URL and plain English query to extract structured data
+- **Beautiful Frontend**: Modern minimalist black-and-white interface with smooth UX
+- **Custom Model Support**: Use any LLM provider (OpenAI, Gemini, Anthropic, etc.) with your own API keys
+- **Model Management**: Save, list, and manage multiple model configurations via web interface
+- **Dual Scraping Approaches**: Choose between Schema-based (faster) or LLM-based (more flexible) extraction
+- **API Request History**: Automatic saving and display of all API requests with cURL commands
+- **Schema Caching**: Intelligent caching of generated schemas for faster subsequent requests
+- **Duplicate Prevention**: Avoids saving duplicate requests (same URL + query)
+- **RESTful API**: Easy-to-use HTTP endpoints for all operations
+
+## Quick Start
+
+### 1. Install Dependencies
+
+```bash
+pip install -r requirements.txt
+```
+
+### 2. Start the API Server
+
+```bash
+python app.py
+```
+
+The server will start on `http://localhost:8000` and serve both the API and the web interface.
+
+### 3. Using the Web Interface
+
+Once the server is running, open `http://localhost:8000` in your browser to access the web interface.
+
+#### Pages:
+- **Scrape Data**: Enter URLs and queries to extract structured data
+- **Models**: Manage your AI model configurations (add, list, delete)
+- **API Requests**: View history of all scraping requests with cURL commands
+
+#### Features:
+- **Minimalist Design**: Clean black-and-white theme inspired by modern web apps
+- **Real-time Results**: See extracted data in formatted JSON
+- **Copy to Clipboard**: Easy copying of results
+- **Toast Notifications**: User-friendly feedback
+- **Dual Scraping Modes**: Choose between Schema-based and LLM-based approaches
+
+## Model Management
+
+### Adding Models via Web Interface
+
+1. Go to the **Models** page
+2. Enter your model details:
+   - **Provider**: LLM provider (e.g., `gemini/gemini-2.5-flash`, `openai/gpt-4o`)
+   - **API Token**: Your API key for the provider
+3. Click "Add Model"
+
+### API Usage for Model Management
+
+#### Save a Model Configuration
+
+```bash
+curl -X POST "http://localhost:8000/models" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "provider": "gemini/gemini-2.5-flash",
+    "api_token": "your-api-key-here"
+  }'
+```
+
+#### List Saved Models
+
+```bash
+curl -X GET "http://localhost:8000/models"
+```
+
+#### Delete a Model Configuration
+
+Replace `my-gemini` with the name of a saved configuration, as listed by `GET /models`.
+
+```bash
+curl -X DELETE "http://localhost:8000/models/my-gemini"
+```
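+
+The same three calls can be made from Python. The following is a minimal sketch using only the standard library; `build_request` and `send` are hypothetical helper names, and the shape of the JSON the server returns is not documented here, so `send` simply returns whatever it decodes.
+
+```python
+import json
+import urllib.request
+
+BASE_URL = "http://localhost:8000"  # default address from the Quick Start
+
+
+def build_request(method, path, payload=None):
+    """Build (but do not send) a request for one of the model endpoints."""
+    data = json.dumps(payload).encode("utf-8") if payload is not None else None
+    headers = {"Content-Type": "application/json"} if data else {}
+    return urllib.request.Request(
+        BASE_URL + path, data=data, headers=headers, method=method
+    )
+
+
+def send(request):
+    """Send a request and decode the JSON body; needs the server running."""
+    with urllib.request.urlopen(request) as response:
+        return json.loads(response.read().decode("utf-8"))
+
+
+# Mirrors of the three cURL commands above.
+save_req = build_request(
+    "POST",
+    "/models",
+    {"provider": "gemini/gemini-2.5-flash", "api_token": "your-api-key-here"},
+)
+list_req = build_request("GET", "/models")
+delete_req = build_request("DELETE", "/models/my-gemini")
+```
+
+With the server running, `send(list_req)` would perform the equivalent of the `GET /models` call.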
+
+## Scraping Approaches
+
+### 1. Schema-based Scraping (Faster)
+- Generates CSS selectors for targeted extraction
+- Caches schemas for repeated requests
+- Faster execution for structured websites
+
+### 2. LLM-based Scraping (More Flexible)
+- Direct LLM extraction without schema generation
+- More flexible for complex or dynamic content
+- Better for unstructured data extraction
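+
+Choosing between the two approaches comes down to which endpoint a client calls. The sketch below maps a mode to the `/scrape` or `/scrape-with-llm` endpoint and builds the request body. The mode names `"schema"` and `"llm"` and the helper `build_scrape_call` are hypothetical, and it assumes both endpoints accept the same `url`, `query`, and optional `model_name` fields.
+
+```python
+ENDPOINTS = {
+    "schema": "/scrape",        # schema-based: generated CSS selectors, cached
+    "llm": "/scrape-with-llm",  # LLM-based: direct extraction, no schema
+}
+
+
+def build_scrape_call(url, query, mode="schema", model_name=None):
+    """Return (path, JSON-ready payload) for the chosen scraping approach."""
+    if mode not in ENDPOINTS:
+        raise ValueError(f"mode must be one of {sorted(ENDPOINTS)}, got {mode!r}")
+    payload = {"url": url, "query": query}
+    if model_name is not None:
+        # Assumption: model_name may be omitted to use a server default.
+        payload["model_name"] = model_name
+    return ENDPOINTS[mode], payload
+```
+
+A reasonable default is to try the schema-based mode first for pages with a regular structure and fall back to the LLM-based mode when the extracted data is incomplete.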
+
+## Supported LLM Providers
+
+The API supports any LLM provider that crawl4ai supports, including:
+
+- **Google Gemini**: `gemini/gemini-2.5-flash`, `gemini/gemini-pro`
+- **OpenAI**: `openai/gpt-4`, `openai/gpt-3.5-turbo`
+- **Anthropic**: `anthropic/claude-3-opus`, `anthropic/claude-3-sonnet`
+- **And more...**
+
+## API Endpoints
+
+### Core Endpoints
+
+- `POST /scrape` - Schema-based scraping
+- `POST /scrape-with-llm` - LLM-based scraping
+- `GET /schemas` - List cached schemas
+- `POST /clear-cache` - Clear schema cache
+- `GET /health` - Health check
+
+### Model Management Endpoints
+
+- `GET /models` - List saved model configurations
+- `POST /models` - Save a new model configuration
+- `DELETE /models/{model_name}` - Delete a model configuration
+
+### API Request History
+
+- `GET /saved-requests` - List all saved API requests
+- `DELETE /saved-requests/{request_id}` - Delete a saved request
+
+## Request/Response Examples
+
+### Scrape Request
+
+```json
+{
+  "url": "https://example.com",
+  "query": "Extract the product name, price, and description",
+  "model_name": "my-custom-model"
+}
+```
+
+### Scrape Response
+
+```json
+{
+  "success": true,
+  "url": "https://example.com",
+  "query": "Extract the product name, price, and description",
+  "extracted_data": {
+    "product_name": "Example Product",
+    "price": "$99.99",
+    "description": "This is an example product description"
+  },
+  "schema_used": { ... },
+  "timestamp": "2024-01-01T12:00:00Z"
+}
+```
+
+### Model Configuration Request
+
+```json
+{
+  "provider": "gemini/gemini-2.5-flash",
+  "api_token": "your-api-key-here"
+}
+```
+
+## Testing
+
+Run the test script to verify the model management functionality:
+
+```bash
+python test_models.py
+```
+
+## File Structure
+
+```
+website-to-api/
+├── app.py                 # Entry point started with `python app.py`
+├── api_server.py          # FastAPI server with all endpoints
+├── web_scraper_lib.py     # Core scraping library
+├── test_models.py         # Test script for model management
+├── requirements.txt       # Dependencies
+├── static/                # Frontend files
+│   ├── index.html         # Main HTML interface
+│   ├── styles.css         # CSS styles (minimalist theme)
+│   └── script.js          # JavaScript functionality
+├── schemas/               # Cached schemas
+├── models/                # Saved model configurations
+├── saved_requests/        # API request history
+└── README.md              # This file
+```
+
+## Advanced Usage
+
+### Using the Library Directly
+
+```python
+import asyncio
+
+from web_scraper_lib import WebScraperAgent
+
+
+async def main():
+    # Initialize agent
+    agent = WebScraperAgent()
+
+    # Save a model configuration
+    agent.save_model_config(
+        model_name="my-model",
+        provider="openai/gpt-4",
+        api_token="your-api-key"
+    )
+
+    # Schema-based scraping (the scraping methods are coroutines)
+    result = await agent.scrape_data(
+        url="https://example.com",
+        query="Extract product information",
+        model_name="my-model"
+    )
+    print(result)
+
+    # LLM-based scraping
+    llm_result = await agent.scrape_data_with_llm(
+        url="https://example.com",
+        query="Extract product information",
+        model_name="my-model"
+    )
+    print(llm_result)
+
+
+# `await` is only valid inside a coroutine, so run everything via asyncio
+asyncio.run(main())
+```
+
+### Schema Caching
+
+The system automatically caches generated schemas based on URL and query combinations:
+
+- **First request**: Generates schema using AI
+- **Subsequent requests**: Uses cached schema for faster extraction
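+
+The caching behaviour can be illustrated with a small file-based cache keyed on the URL and query. This is a sketch of the general idea, not the code in `web_scraper_lib.py`: the helper names, the SHA-256 key, and the one-JSON-file-per-schema layout are all assumptions.
+
+```python
+import hashlib
+import json
+from pathlib import Path
+
+
+def cache_key(url, query):
+    """Derive a stable, file-name-safe key from the URL and query pair."""
+    digest = hashlib.sha256(f"{url}\n{query}".encode("utf-8")).hexdigest()
+    return digest[:16]
+
+
+def get_or_generate_schema(url, query, generate, cache_dir="schemas"):
+    """Return the cached schema if present, otherwise generate and store it."""
+    path = Path(cache_dir) / f"{cache_key(url, query)}.json"
+    if path.exists():
+        # Subsequent requests: skip the AI step entirely.
+        return json.loads(path.read_text(encoding="utf-8"))
+    # First request: run the expensive, AI-backed generation step.
+    schema = generate(url, query)
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
+    return schema
+```
+
+Here `generate` stands in for whatever function produces a schema. Because the key includes the query, the same URL with a different query gets its own schema.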
+
+### API Request History
+
+All API requests are automatically saved with:
+- Request details (URL, query, model used)
+- Response data
+- Timestamp
+- cURL command for re-execution
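+
+One way to produce the stored cURL command is to render the request with shell-safe quoting. The sketch below is illustrative only; `to_curl` is a hypothetical helper and the exact command format the server saves may differ.
+
+```python
+import json
+import shlex
+
+
+def to_curl(base_url, path, payload):
+    """Render a JSON POST request as a copy-pasteable cURL command."""
+    body = json.dumps(payload)
+    return " ".join([
+        "curl", "-X", "POST", shlex.quote(base_url + path),
+        "-H", shlex.quote("Content-Type: application/json"),
+        "-d", shlex.quote(body),  # quoting keeps spaces and quotes in the JSON intact
+    ])
+
+
+command = to_curl(
+    "http://localhost:8000",
+    "/scrape",
+    {"url": "https://example.com", "query": "Extract the page title"},
+)
+```
+
+Using `shlex.quote` means queries containing quotes or other shell metacharacters still yield a command that can be pasted into a POSIX shell.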
+
+### Duplicate Prevention
+
+The system prevents saving duplicate requests:
+- Same URL + query combinations are not saved multiple times
+- Returns existing request ID for duplicates
+- Keeps the API request history clean
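+
+The duplicate check can be sketched as a lookup on the `(url, query)` pair before saving. The `RequestHistory` class below is a hypothetical in-memory stand-in for the file-based store, written under the assumption that request IDs are opaque strings.
+
+```python
+import uuid
+
+
+class RequestHistory:
+    """In-memory stand-in for the saved-requests store."""
+
+    def __init__(self):
+        self._by_key = {}   # (url, query) -> request ID
+        self._records = {}  # request ID -> saved record
+
+    def save(self, url, query, response):
+        """Save a request, or return the existing ID for a duplicate."""
+        key = (url, query)
+        if key in self._by_key:
+            return self._by_key[key]
+        request_id = uuid.uuid4().hex
+        self._by_key[key] = request_id
+        self._records[request_id] = {"url": url, "query": query, "response": response}
+        return request_id
+
+    def delete(self, request_id):
+        """Remove a saved request; return True if it existed."""
+        record = self._records.pop(request_id, None)
+        if record is None:
+            return False
+        del self._by_key[(record["url"], record["query"])]
+        return True
+```
+
+Deleting a request also clears its key, so the same URL and query can be saved again afterwards.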
+
+## Error Handling
+
+The API provides detailed error messages for common issues:
+
+- Invalid URLs
+- Missing model configurations
+- API key errors
+- Network timeouts
+- Parsing errors
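+
+A client can turn these failures into readable messages. The sketch below assumes the server reports errors in FastAPI's default `{"detail": ...}` JSON shape, which this README does not confirm; `describe_error` is a hypothetical helper, and the sample error is constructed locally rather than captured from the server.
+
+```python
+import io
+import json
+import urllib.error
+
+
+def describe_error(exc):
+    """Turn an exception raised by urllib into a short, readable message."""
+    # HTTPError is a subclass of URLError, so it must be checked first.
+    if isinstance(exc, urllib.error.HTTPError):
+        body = exc.read().decode("utf-8", errors="replace")
+        try:
+            detail = json.loads(body).get("detail", body)
+        except (json.JSONDecodeError, AttributeError):
+            detail = body  # not JSON, or not a JSON object
+        return f"HTTP {exc.code}: {detail}"
+    if isinstance(exc, urllib.error.URLError):
+        return f"Network error: {exc.reason}"
+    return f"Unexpected error: {exc}"
+
+
+# Locally constructed example, so no running server is needed.
+sample = urllib.error.HTTPError(
+    "http://localhost:8000/scrape", 404, "Not Found", {},
+    io.BytesIO(b'{"detail": "Model not found"}'),
+)
+message = describe_error(sample)
+```
+
+In real use, the function would be called from an `except urllib.error.URLError as exc:` block around the request.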