This commit adds a complete web scraping API example that demonstrates how to extract structured data from any website and use it like an API, built on the crawl4ai library with a minimalist frontend interface.

Core Functionality
- AI-powered web scraping with plain-English queries
- Dual scraping approaches: schema-based (faster) and LLM-based (more flexible)
- Intelligent schema caching for improved performance
- Custom LLM model support with API key management
- Automatic duplicate request prevention

Modern Frontend Interface
- Minimalist black-and-white design inspired by modern web apps
- Responsive layout with smooth animations and transitions
- Three main pages: Scrape Data, Models Management, and API Request History
- Real-time results display with JSON formatting
- Copy-to-clipboard functionality for extracted data
- Toast notifications for user feedback
- Auto-scroll to results when scraping starts

Model Management System
- Web-based model configuration interface
- Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.)
- Simplified configuration requiring only a provider and an API token
- Add, list, and delete model configurations
- Storage of API keys in local JSON files

API Request History
- Automatic saving of all API requests and responses
- Display of request history with URL, query, and cURL commands
- Duplicate prevention (same URL + query combinations)
- Request deletion functionality
- Clean, simplified display focusing on essential information

Technical Implementation

Backend (FastAPI)
- RESTful API with comprehensive endpoints
- Pydantic models for request/response validation
- Async web scraping with the crawl4ai library
- Error handling with detailed error messages
- File-based storage for models and request history

Frontend (Vanilla JS/CSS/HTML)
- No framework dependencies: pure HTML, CSS, and JavaScript
- Modern CSS Grid and Flexbox layouts
- Custom dropdown styling with SVG arrows
- Responsive design for mobile and desktop
- Smooth scrolling and animations

Core Library Integration
- WebScraperAgent class for orchestration
- ModelConfig class for LLM configuration management
- Schema generation and caching system
- LLM extraction strategy support
- Browser configuration with headless mode
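The duplicate-prevention behavior described above (skipping a scrape when the same URL + query combination was already served) can be sketched with a stable key over the request pair. This is an illustrative stand-in, not the repo's actual implementation; the `request_key` helper and `RequestHistory` class are hypothetical names, and the real history is file-based rather than in-memory:

```python
import hashlib
import json


def request_key(url: str, query: str) -> str:
    """Derive a stable key for a (URL, query) pair so a repeated
    request can be detected before scraping again."""
    payload = json.dumps({"url": url, "query": query}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RequestHistory:
    """Minimal in-memory stand-in for the file-based request history."""

    def __init__(self):
        self._entries = {}

    def seen(self, url: str, query: str) -> bool:
        return request_key(url, query) in self._entries

    def record(self, url: str, query: str, response: dict) -> None:
        self._entries[request_key(url, query)] = {
            "url": url,
            "query": query,
            "response": response,
        }


history = RequestHistory()
history.record("https://example.com", "Extract title", {"title": "Example"})
print(history.seen("https://example.com", "Extract title"))   # True
print(history.seen("https://example.com", "Extract price"))   # False
```

Keying on the exact (URL, query) pair means even a small wording change in the query triggers a fresh scrape, which matches the "same URL + query combinations" rule stated above.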
28 lines | 1.0 KiB | Python
import asyncio
import os

from web_scraper_lib import scrape_website


async def test_library():
    """Test the mini library directly."""
    print("=== Testing Mini Library ===")

    # Test 1: Scrape with a custom model
    url = "https://marketplace.mainstreet.co.in/collections/adidas-yeezy/products/adidas-yeezy-boost-350-v2-yecheil-non-reflective"
    query = "Extract the following data: Product name, Product price, Product description, Product size. DO NOT EXTRACT ANYTHING ELSE."

    # Pick the first configured model; configs are stored as files under models/
    if os.path.exists("models"):
        model_name = os.listdir("models")[0].split(".")[0]
    else:
        raise Exception("No models found in models directory")

    print(f"Scraping: {url}")
    print(f"Query: {query}")

    try:
        result = await scrape_website(url, query, model_name)
        print("✅ Library test successful!")
        print(f"Extracted data: {result['extracted_data']}")
    except Exception as e:
        print(f"❌ Library test failed: {e}")


if __name__ == "__main__":
    asyncio.run(test_library())
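The model-discovery step in the script above assumes each model configuration lives as a JSON file under models/, named after the model. A minimal sketch of that storage scheme is below; the field names (`provider`, `api_token`) are assumptions based on the commit description, and the repo's actual ModelConfig class may structure its files differently:

```python
import json
from pathlib import Path

# Directory name taken from the test script above.
MODELS_DIR = Path("models")


def save_model_config(name: str, provider: str, api_token: str) -> Path:
    """Persist a model configuration as models/<name>.json."""
    MODELS_DIR.mkdir(exist_ok=True)
    path = MODELS_DIR / f"{name}.json"
    path.write_text(
        json.dumps({"provider": provider, "api_token": api_token}, indent=2)
    )
    return path


def load_model_config(name: str) -> dict:
    """Read a stored configuration back as a dict."""
    return json.loads((MODELS_DIR / f"{name}.json").read_text())


def list_model_names() -> list:
    """Mirror the discovery logic in test_library(): the model name
    is the filename with its extension stripped."""
    return [p.stem for p in MODELS_DIR.glob("*.json")]
```

With this layout, `os.listdir("models")[0].split(".")[0]` in the test script recovers a usable model name, though `Path.stem` as used here is more robust for names that contain dots.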