feat: Add comprehensive website to API example with frontend

This commit adds a complete web scraping API example that demonstrates how to get structured data from any website and use it like an API, using the crawl4ai library with a minimalist frontend interface.

Core Functionality
- AI-powered web scraping with plain English queries
- Dual scraping approaches: Schema-based (faster) and LLM-based (flexible)
- Intelligent schema caching for improved performance
- Custom LLM model support with API key management
- Automatic duplicate request prevention

Modern Frontend Interface
- Minimalist black-and-white design inspired by modern web apps
- Responsive layout with smooth animations and transitions
- Three main pages: Scrape Data, Models Management, API Request History
- Real-time results display with JSON formatting
- Copy-to-clipboard functionality for extracted data
- Toast notifications for user feedback
- Auto-scroll to results when scraping starts

Model Management System
- Web-based model configuration interface
- Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.)
- Simplified configuration requiring only provider and API token
- Add, list, and delete model configurations
- Secure storage of API keys in local JSON files

API Request History
- Automatic saving of all API requests and responses
- Display of request history with URL, query, and cURL commands
- Duplicate prevention (same URL + query combinations)
- Request deletion functionality
- Clean, simplified display focusing on essential information

Technical Implementation

Backend (FastAPI)
- RESTful API with comprehensive endpoints
- Pydantic models for request/response validation
- Async web scraping with crawl4ai library
- Error handling with detailed error messages
- File-based storage for models and request history

Frontend (Vanilla JS/CSS/HTML)
- No framework dependencies: pure HTML, CSS, JavaScript
- Modern CSS Grid and Flexbox layouts
- Custom dropdown styling with SVG arrows
- Responsive design for mobile and desktop
- Smooth scrolling and animations

Core Library Integration
- WebScraperAgent class for orchestration
- ModelConfig class for LLM configuration management
- Schema generation and caching system
- LLM extraction strategy support
- Browser configuration with headless mode

252
docs/examples/website-to-api/README.md
Normal file
@@ -0,0 +1,252 @@
# Web Scraper API with Custom Model Support

A web scraping API that uses AI to convert a website into structured data. It includes a minimalist frontend interface and supports custom LLM models.

## Features

- **AI-Powered Scraping**: Provide a URL and a plain English query to extract structured data
- **Frontend**: Minimalist black-and-white web interface
- **Custom Model Support**: Use any LLM provider (OpenAI, Gemini, Anthropic, etc.) with your own API keys
- **Model Management**: Save, list, and delete model configurations via the web interface
- **Dual Scraping Approaches**: Choose between Schema-based (faster) and LLM-based (more flexible) extraction
- **API Request History**: Automatic saving and display of all API requests with cURL commands
- **Schema Caching**: Generated schemas are cached for faster subsequent requests
- **Duplicate Prevention**: Avoids saving duplicate requests (same URL + query)
- **RESTful API**: HTTP endpoints for all operations

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Start the API Server

```bash
python app.py
```

The server starts on `http://localhost:8000` and serves the web interface at the same address.

### 3. Using the Web Interface

Once the server is running, open `http://localhost:8000` in your browser to access the web interface.

#### Pages:

- **Scrape Data**: Enter URLs and queries to extract structured data
- **Models**: Manage your AI model configurations (add, list, delete)
- **API Requests**: View history of all scraping requests with cURL commands

#### Features:

- **Minimalist Design**: Clean black-and-white theme inspired by modern web apps
- **Real-time Results**: See extracted data in formatted JSON
- **Copy to Clipboard**: Copy extracted results with one click
- **Toast Notifications**: Feedback on the outcome of each action
- **Dual Scraping Modes**: Choose between Schema-based and LLM-based approaches

## Model Management

### Adding Models via Web Interface

1. Go to the **Models** page
2. Enter your model details:
   - **Provider**: LLM provider (e.g., `gemini/gemini-2.5-flash`, `openai/gpt-4o`)
   - **API Token**: Your API key for the provider
3. Click "Add Model"

### API Usage for Model Management

#### Save a Model Configuration

```bash
curl -X POST "http://localhost:8000/models" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "gemini/gemini-2.5-flash",
    "api_token": "your-api-key-here"
  }'
```

#### List Saved Models

```bash
curl -X GET "http://localhost:8000/models"
```

#### Delete a Model Configuration

Here `my-gemini` stands for the `{model_name}` of a saved configuration.

```bash
curl -X DELETE "http://localhost:8000/models/my-gemini"
```

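
The save request above can also be prepared from Python. The following is a minimal sketch using only the standard library; `build_save_model_request` is a hypothetical helper name, and `BASE_URL` assumes the server from Quick Start is running locally. It builds the request without sending it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local server from Quick Start


def build_save_model_request(provider: str, api_token: str) -> urllib.request.Request:
    """Build (but do not send) a POST /models request. Hypothetical helper."""
    payload = json.dumps({"provider": provider, "api_token": api_token}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/models",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_save_model_request("gemini/gemini-2.5-flash", "your-api-key-here")
# To send it against a running server: urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes the payload easy to inspect before any API key leaves the machine.
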
## Scraping Approaches

### 1. Schema-based Scraping (Faster)

- Generates CSS selectors for targeted extraction
- Caches schemas for repeated requests
- Faster execution for structured websites

### 2. LLM-based Scraping (More Flexible)

- Direct LLM extraction without schema generation
- More flexible for complex or dynamic content
- Better for unstructured data extraction

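
Each approach corresponds to one of the scraping endpoints documented in this README (`POST /scrape` and `POST /scrape-with-llm`). As a rough sketch of how a client might choose between them, the helper below maps an approach name to its endpoint path; the names `ENDPOINT_BY_APPROACH` and `scrape_endpoint` and the keys `"schema"` and `"llm"` are illustrative assumptions, not part of the example's code.

```python
# Map each scraping approach to the endpoint path documented in this README.
# The keys "schema" and "llm" are hypothetical labels chosen for this sketch.
ENDPOINT_BY_APPROACH = {
    "schema": "/scrape",          # schema-based: generates and caches CSS selectors
    "llm": "/scrape-with-llm",    # LLM-based: direct extraction, no schema step
}


def scrape_endpoint(approach: str) -> str:
    """Return the endpoint path for the given approach, or raise ValueError."""
    try:
        return ENDPOINT_BY_APPROACH[approach]
    except KeyError:
        raise ValueError(
            f"unknown approach {approach!r}; expected 'schema' or 'llm'"
        ) from None
```
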
## Supported LLM Providers

The API supports any LLM provider that crawl4ai supports, including:

- **Google Gemini**: `gemini/gemini-2.5-flash`, `gemini/gemini-pro`
- **OpenAI**: `openai/gpt-4`, `openai/gpt-3.5-turbo`
- **Anthropic**: `anthropic/claude-3-opus`, `anthropic/claude-3-sonnet`
- **And more...**

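
The provider strings listed above all follow a `<vendor>/<model>` pattern. A small sketch of validating that shape before saving a configuration is shown below; `split_provider` is a hypothetical helper for illustration and is not a crawl4ai function.

```python
from typing import Tuple


def split_provider(provider: str) -> Tuple[str, str]:
    """Split 'vendor/model' into its two parts. Hypothetical helper.

    Only the first slash separates vendor from model, so any further
    slashes stay part of the model name.
    """
    vendor, sep, model = provider.partition("/")
    if not sep or not vendor or not model:
        raise ValueError(f"expected '<vendor>/<model>', got {provider!r}")
    return vendor, model


vendor, model = split_provider("gemini/gemini-2.5-flash")
```
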
## API Endpoints

### Core Endpoints

- `POST /scrape` - Schema-based scraping
- `POST /scrape-with-llm` - LLM-based scraping
- `GET /schemas` - List cached schemas
- `POST /clear-cache` - Clear schema cache
- `GET /health` - Health check

### Model Management Endpoints

- `GET /models` - List saved model configurations
- `POST /models` - Save a new model configuration
- `DELETE /models/{model_name}` - Delete a model configuration

### API Request History

- `GET /saved-requests` - List all saved API requests
- `DELETE /saved-requests/{request_id}` - Delete a saved request

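
Two of the endpoints above take a path parameter (`{model_name}` and `{request_id}`). The sketch below shows one way a client could fill those placeholders safely; `endpoint_url` is a hypothetical helper, and `BASE_URL` assumes the local server from Quick Start.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:8000"  # assumption: local server from Quick Start


def endpoint_url(template: str, **params: str) -> str:
    """Fill {placeholders} in an endpoint template with URL-escaped values."""
    # safe="" also escapes "/" so a value cannot introduce extra path segments
    escaped = {name: quote(value, safe="") for name, value in params.items()}
    return BASE_URL + template.format(**escaped)


delete_model_url = endpoint_url("/models/{model_name}", model_name="my-gemini")
health_url = endpoint_url("/health")
```
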
## Request/Response Examples

### Scrape Request

```json
{
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "model_name": "my-custom-model"
}
```

### Scrape Response

```json
{
  "success": true,
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "extracted_data": {
    "product_name": "Example Product",
    "price": "$99.99",
    "description": "This is an example product description"
  },
  "schema_used": { ... },
  "timestamp": "2024-01-01T12:00:00Z"
}
```

### Model Configuration Request

```json
{
  "provider": "gemini/gemini-2.5-flash",
  "api_token": "your-api-key-here"
}
```

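
To illustrate how a client might consume the scrape response shown above, here is a hedged sketch that checks the `success` flag and returns `extracted_data`. The sample body mirrors the README example, with `schema_used` replaced by an empty placeholder object because its real contents are elided; how a failed scrape is reported is an assumption of this sketch.

```python
import json

# Sample body modeled on the Scrape Response example; "schema_used" is an
# empty placeholder here because the README elides its contents.
SAMPLE_RESPONSE = """
{
  "success": true,
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "extracted_data": {
    "product_name": "Example Product",
    "price": "$99.99",
    "description": "This is an example product description"
  },
  "schema_used": {},
  "timestamp": "2024-01-01T12:00:00Z"
}
"""


def extracted_data(raw_body: str) -> dict:
    """Return the extracted_data object, or raise if success is not true."""
    body = json.loads(raw_body)
    if not body.get("success"):
        # Assumption: a failed scrape is signalled by a falsy "success" field.
        raise RuntimeError(f"scrape failed for {body.get('url')!r}")
    return body["extracted_data"]


data = extracted_data(SAMPLE_RESPONSE)
```
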
## Testing

Run the test script to verify the model management functionality:

```bash
python test_models.py
```

## File Structure

```
website-to-api/
├── api_server.py        # FastAPI server with all endpoints
├── web_scraper_lib.py   # Core scraping library
├── test_models.py       # Test script for model management
├── requirements.txt     # Dependencies
├── static/              # Frontend files
│   ├── index.html       # Main HTML interface
│   ├── styles.css       # CSS styles (minimalist theme)
│   └── script.js        # JavaScript functionality
├── schemas/             # Cached schemas
├── models/              # Saved model configurations
├── saved_requests/      # API request history
└── README.md            # This file
```

## Advanced Usage

### Using the Library Directly

```python
import asyncio

from web_scraper_lib import WebScraperAgent


async def main():
    # Initialize agent
    agent = WebScraperAgent()

    # Save a model configuration
    agent.save_model_config(
        model_name="my-model",
        provider="openai/gpt-4",
        api_token="your-api-key"
    )

    # Schema-based scraping
    result = await agent.scrape_data(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model"
    )

    # LLM-based scraping
    result = await agent.scrape_data_with_llm(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model"
    )


# The scraping methods are awaited, so they must run inside an event loop
asyncio.run(main())
```

### Schema Caching

The system automatically caches generated schemas based on URL and query combinations:

- **First request**: Generates schema using AI
- **Subsequent requests**: Uses cached schema for faster extraction

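
The caching behavior described above can be sketched as a file cache keyed by the URL and query. This is an illustration only: the key scheme (a SHA-256 hash), the file naming, and the names `cache_key` and `get_schema` are assumptions, and the real logic in `web_scraper_lib.py` may differ. The `generate` callback stands in for the AI schema generation step.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cache_key(url: str, query: str) -> str:
    """Derive a stable key from the URL + query combination (assumed scheme)."""
    return hashlib.sha256(f"{url}\n{query}".encode("utf-8")).hexdigest()


def get_schema(url: str, query: str, cache_dir: Path,
               generate: Callable[[str, str], dict]) -> dict:
    """Return the cached schema if present; otherwise generate and store it."""
    path = cache_dir / f"{cache_key(url, query)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    schema = generate(url, query)  # first request: expensive generation step
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(schema), encoding="utf-8")
    return schema


calls = []


def fake_generate(url: str, query: str) -> dict:
    """Stand-in for AI schema generation; records each invocation."""
    calls.append((url, query))
    return {"fields": ["title"]}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "schemas"
    first = get_schema("https://example.com", "Extract titles", cache, fake_generate)
    second = get_schema("https://example.com", "Extract titles", cache, fake_generate)
```

In this sketch the second call reads the stored file, so the generator runs only once for a given URL and query.
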
### API Request History

All API requests are automatically saved with:

- Request details (URL, query, model used)
- Response data
- Timestamp
- cURL command for re-execution

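
As an illustration of the last item, a cURL command for re-execution could be assembled from the saved request details roughly as follows. This is a sketch under assumptions: `curl_command` is a hypothetical helper, and the exact command format stored by the server is not shown in this README.

```python
import json
import shlex


def curl_command(base_url: str, endpoint: str, url: str, query: str,
                 model_name: str) -> str:
    """Build a shell-safe cURL command that replays a scrape request."""
    body = json.dumps({"url": url, "query": query, "model_name": model_name})
    parts = [
        "curl", "-X", "POST", f"{base_url}{endpoint}",
        "-H", "Content-Type: application/json",
        "-d", body,
    ]
    # shlex.quote escapes each argument so quotes inside the query stay intact
    return " ".join(shlex.quote(part) for part in parts)


command = curl_command(
    "http://localhost:8000", "/scrape",
    url="https://example.com",
    query="Extract the product's name",
    model_name="my-custom-model",
)
```

Quoting every argument matters here because plain English queries often contain apostrophes or quotation marks.
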
### Duplicate Prevention

The system prevents saving duplicate requests:

- Same URL + query combinations are not saved multiple times
- Returns existing request ID for duplicates
- Keeps the API request history clean

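
The rules above amount to a lookup keyed by the (URL, query) pair. Below is a minimal in-memory sketch of that idea; the class name `RequestHistory`, the use of UUIDs as request IDs, and the in-memory storage are assumptions for illustration, whereas the example itself stores history in files.

```python
import uuid
from typing import Dict, Tuple


class RequestHistory:
    """In-memory sketch of duplicate-free request storage (hypothetical)."""

    def __init__(self) -> None:
        self._ids: Dict[Tuple[str, str], str] = {}

    def save(self, url: str, query: str) -> Tuple[str, bool]:
        """Return (request_id, created); duplicates return the existing ID."""
        key = (url, query)
        if key in self._ids:
            return self._ids[key], False
        request_id = uuid.uuid4().hex  # assumed ID format
        self._ids[key] = request_id
        return request_id, True


history = RequestHistory()
first_id, first_created = history.save("https://example.com", "Extract titles")
second_id, second_created = history.save("https://example.com", "Extract titles")
```
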
## Error Handling

The API provides detailed error messages for common issues:

- Invalid URLs
- Missing model configurations
- API key errors
- Network timeouts
- Parsing errors

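
On the client side, these messages can be turned into a single readable line. The sketch below assumes the server reports errors in FastAPI's default shape, a JSON body with a `detail` field that is either a string or a list of validation entries with a `msg` key; this README does not specify the error format, so treat that as an assumption. `error_message` is a hypothetical helper.

```python
import json


def error_message(status: int, body: str) -> str:
    """Summarize an error response, falling back to the raw body text."""
    try:
        detail = json.loads(body).get("detail")
    except (ValueError, AttributeError):
        # Body is not JSON, or is JSON but not an object
        detail = None
    if isinstance(detail, list):
        # Validation-style errors: join the individual messages
        detail = "; ".join(
            str(item.get("msg", item)) if isinstance(item, dict) else str(item)
            for item in detail
        )
    return f"HTTP {status}: {detail or body.strip() or 'no details'}"


message = error_message(404, '{"detail": "Model not found"}')
```
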