This commit adds a complete web scraping API example that demonstrates how to extract structured data from any website and consume it like an API, using the crawl4ai library with a minimalist frontend interface.

**Core Functionality**
- AI-powered web scraping with plain-English queries
- Dual scraping approaches: schema-based (faster) and LLM-based (flexible)
- Intelligent schema caching for improved performance
- Custom LLM model support with API key management
- Automatic duplicate request prevention

**Modern Frontend Interface**
- Minimalist black-and-white design inspired by modern web apps
- Responsive layout with smooth animations and transitions
- Three main pages: Scrape Data, Models Management, API Request History
- Real-time results display with JSON formatting
- Copy-to-clipboard functionality for extracted data
- Toast notifications for user feedback
- Auto-scroll to results when scraping starts

**Model Management System**
- Web-based model configuration interface
- Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.)
- Simplified configuration requiring only provider and API token
- Add, list, and delete model configurations
- Storage of API keys in local JSON files

**API Request History**
- Automatic saving of all API requests and responses
- Display of request history with URL, query, and cURL commands
- Duplicate prevention (same URL + query combinations)
- Request deletion functionality
- Clean, simplified display focusing on essential information

**Technical Implementation**

Backend (FastAPI)
- RESTful API with comprehensive endpoints
- Pydantic models for request/response validation
- Async web scraping with the crawl4ai library
- Error handling with detailed error messages
- File-based storage for models and request history

Frontend (Vanilla JS/CSS/HTML)
- No framework dependencies: pure HTML, CSS, and JavaScript
- Modern CSS Grid and Flexbox layouts
- Custom dropdown styling with SVG arrows
- Responsive design for mobile and desktop
- Smooth scrolling and animations

Core Library Integration
- WebScraperAgent class for orchestration
- ModelConfig class for LLM configuration management
- Schema generation and caching system
- LLM extraction strategy support
- Browser configuration with headless mode
# Web Scraper API with Custom Model Support

A powerful web scraping API that converts any website into structured data using AI, with a minimalist frontend interface and support for custom LLM models.
## Features

- **AI-Powered Scraping**: Provide a URL and a plain-English query to extract structured data
- **Beautiful Frontend**: Modern minimalist black-and-white interface with smooth UX
- **Custom Model Support**: Use any LLM provider (OpenAI, Gemini, Anthropic, etc.) with your own API keys
- **Model Management**: Save, list, and manage multiple model configurations via the web interface
- **Dual Scraping Approaches**: Choose between schema-based (faster) and LLM-based (more flexible) extraction
- **API Request History**: Automatic saving and display of all API requests, with cURL commands
- **Schema Caching**: Intelligent caching of generated schemas for faster subsequent requests
- **Duplicate Prevention**: Requests with the same URL + query are saved only once
- **RESTful API**: Easy-to-use HTTP endpoints for all operations
## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Start the API Server

```bash
python app.py
```

The server starts on `http://localhost:8000` and serves the web interface.
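Once it's running, you can sanity-check the server from another terminal via the health endpoint (listed under API Endpoints below):

```bash
curl http://localhost:8000/health
```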
### 3. Using the Web Interface

Open `http://localhost:8000` in your browser to access the web interface.
#### Pages:

- **Scrape Data**: Enter URLs and queries to extract structured data
- **Models**: Manage your AI model configurations (add, list, delete)
- **API Requests**: View the history of all scraping requests, with cURL commands
#### Features:

- **Minimalist Design**: Clean black-and-white theme inspired by modern web apps
- **Real-time Results**: See extracted data as formatted JSON
- **Copy to Clipboard**: Easy copying of results
- **Toast Notifications**: User-friendly feedback
- **Dual Scraping Modes**: Choose between schema-based and LLM-based approaches
## Model Management

### Adding Models via Web Interface

1. Go to the **Models** page
2. Enter your model details:
   - **Provider**: LLM provider (e.g., `gemini/gemini-2.5-flash`, `openai/gpt-4o`)
   - **API Token**: Your API key for the provider
3. Click "Add Model"
### API Usage for Model Management

#### Save a Model Configuration

```bash
curl -X POST "http://localhost:8000/models" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "gemini/gemini-2.5-flash",
    "api_token": "your-api-key-here"
  }'
```
#### List Saved Models

```bash
curl -X GET "http://localhost:8000/models"
```

#### Delete a Model Configuration

```bash
curl -X DELETE "http://localhost:8000/models/my-gemini"
```
## Scraping Approaches

### 1. Schema-based Scraping (Faster)

- Generates CSS selectors for targeted extraction
- Caches schemas for repeated requests
- Faster execution on structured websites

### 2. LLM-based Scraping (More Flexible)

- Direct LLM extraction, without schema generation
- More flexible for complex or dynamic content
- Better for unstructured data extraction
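Both modes are exposed as HTTP endpoints (`POST /scrape` and `POST /scrape-with-llm`, listed under API Endpoints below). A minimal cURL sketch, using the request body documented under Request/Response Examples:

```bash
# Schema-based extraction (generates and caches a CSS-selector schema)
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "query": "Extract the product name and price", "model_name": "my-custom-model"}'

# LLM-based extraction (no schema generation)
curl -X POST "http://localhost:8000/scrape-with-llm" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "query": "Extract the product name and price", "model_name": "my-custom-model"}'
```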
## Supported LLM Providers

The API supports any LLM provider that crawl4ai supports, including:

- **Google Gemini**: `gemini/gemini-2.5-flash`, `gemini/gemini-pro`
- **OpenAI**: `openai/gpt-4`, `openai/gpt-3.5-turbo`
- **Anthropic**: `anthropic/claude-3-opus`, `anthropic/claude-3-sonnet`
- And more...
## API Endpoints

### Core Endpoints

- `POST /scrape` - Schema-based scraping
- `POST /scrape-with-llm` - LLM-based scraping
- `GET /schemas` - List cached schemas
- `POST /clear-cache` - Clear schema cache
- `GET /health` - Health check
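For example, to inspect or reset the schema cache from the command line:

```bash
# List cached schemas
curl -X GET "http://localhost:8000/schemas"

# Clear the schema cache
curl -X POST "http://localhost:8000/clear-cache"
```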
### Model Management Endpoints

- `GET /models` - List saved model configurations
- `POST /models` - Save a new model configuration
- `DELETE /models/{model_name}` - Delete a model configuration

### API Request History

- `GET /saved-requests` - List all saved API requests
- `DELETE /saved-requests/{request_id}` - Delete a saved request
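For example (the `<request_id>` placeholder is illustrative; substitute an ID from the listing):

```bash
# List the request history
curl -X GET "http://localhost:8000/saved-requests"

# Delete one entry
curl -X DELETE "http://localhost:8000/saved-requests/<request_id>"
```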
## Request/Response Examples

### Scrape Request

```json
{
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "model_name": "my-custom-model"
}
```
### Scrape Response

```json
{
  "success": true,
  "url": "https://example.com",
  "query": "Extract the product name, price, and description",
  "extracted_data": {
    "product_name": "Example Product",
    "price": "$99.99",
    "description": "This is an example product description"
  },
  "schema_used": { ... },
  "timestamp": "2024-01-01T12:00:00Z"
}
```
### Model Configuration Request

```json
{
  "provider": "gemini/gemini-2.5-flash",
  "api_token": "your-api-key-here"
}
```
## Testing

Run the test script to verify the model management functionality:

```bash
python test_models.py
```
## File Structure

```
parse_example/
├── api_server.py        # FastAPI server with all endpoints
├── web_scraper_lib.py   # Core scraping library
├── test_models.py       # Test script for model management
├── requirements.txt     # Dependencies
├── static/              # Frontend files
│   ├── index.html       # Main HTML interface
│   ├── styles.css       # CSS styles (minimalist theme)
│   └── script.js        # JavaScript functionality
├── schemas/             # Cached schemas
├── models/              # Saved model configurations
├── saved_requests/      # API request history
└── README.md            # This file
```
## Advanced Usage

### Using the Library Directly

```python
import asyncio

from web_scraper_lib import WebScraperAgent


async def main():
    # Initialize the agent
    agent = WebScraperAgent()

    # Save a model configuration
    agent.save_model_config(
        model_name="my-model",
        provider="openai/gpt-4",
        api_token="your-api-key",
    )

    # Schema-based scraping
    result = await agent.scrape_data(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model",
    )

    # LLM-based scraping
    result = await agent.scrape_data_with_llm(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model",
    )


if __name__ == "__main__":
    asyncio.run(main())
```
### Schema Caching

The system automatically caches generated schemas, keyed on the URL + query combination (see the sketch below):

- **First request**: Generates the schema using AI
- **Subsequent requests**: Reuse the cached schema for faster extraction
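A minimal sketch of how such a cache key can be derived. This is illustrative only, not the actual `web_scraper_lib` implementation; `schema_cache_path` and `load_cached_schema` are hypothetical helpers:

```python
import hashlib
import json
from pathlib import Path

SCHEMA_DIR = Path("schemas")  # matches the schemas/ directory in the file structure


def schema_cache_path(url: str, query: str) -> Path:
    """Derive a deterministic cache file name from a URL + query combination."""
    key = hashlib.sha256(f"{url}\n{query}".encode("utf-8")).hexdigest()
    return SCHEMA_DIR / f"{key}.json"


def load_cached_schema(url: str, query: str):
    """Return a previously generated schema, or None on the first request."""
    path = schema_cache_path(url, query)
    if path.exists():
        return json.loads(path.read_text())  # subsequent requests: cache hit
    return None  # first request: caller generates the schema with AI and saves it
```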
### API Request History

All API requests are automatically saved with:

- Request details (URL, query, model used)
- Response data
- Timestamp
- A cURL command for re-execution
### Duplicate Prevention

The system avoids saving duplicate requests (see the sketch below):

- The same URL + query combination is not saved more than once
- Duplicates return the existing request ID
- This keeps the API request history clean
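Request history can follow the same hash-keyed pattern as the schema cache sketch above. Again purely illustrative; `save_request_once`, the ID format, and the file layout are assumptions, not the library's actual code:

```python
import hashlib
import json
from pathlib import Path

REQUESTS_DIR = Path("saved_requests")  # matches saved_requests/ in the file structure


def save_request_once(url: str, query: str, record: dict) -> str:
    """Save a request record once per URL + query; return its request ID."""
    # The same URL + query always maps to the same ID, so duplicates collide here.
    request_id = hashlib.sha256(f"{url}\n{query}".encode("utf-8")).hexdigest()[:16]
    path = REQUESTS_DIR / f"{request_id}.json"
    if not path.exists():
        REQUESTS_DIR.mkdir(exist_ok=True)
        path.write_text(json.dumps(record, indent=2))
    return request_id  # the existing ID is returned for duplicates
```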
## Error Handling

The API provides detailed error messages for common issues (a sample response follows the list):

- Invalid URLs
- Missing model configurations
- API key errors
- Network timeouts
- Parsing errors
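The exact error envelope depends on the server code; if errors are raised as FastAPI `HTTPException`s, responses would look like this (the message text is illustrative):

```json
{
  "detail": "Model configuration 'my-custom-model' not found"
}
```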