refactor(docker): improve server architecture and configuration

Complete overhaul of Docker deployment setup with improved architecture:
- Add Redis integration for task management
- Implement rate limiting and security middleware
- Add Prometheus metrics and health checks
- Improve error handling and logging
- Add support for streaming responses
- Implement proper configuration management
- Add platform-specific optimizations for ARM64/AMD64

BREAKING CHANGE: Docker deployment now requires Redis and new config.yml structure
This commit is contained in:
UncleCode
2025-02-02 20:19:51 +08:00
parent 7b1ef07c41
commit 33a21d6a7a
16 changed files with 1918 additions and 344 deletions

View File

@@ -1,113 +1,764 @@
# Crawl4AI Docker Setup
# Crawl4AI Docker Guide 🐳
## Quick Start
1. Build the Docker image:
```bash
docker build -t crawl4ai-server:prod .
```
## Table of Contents
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Local Build](#local-build)
- [Docker Hub](#docker-hub)
- [Dockerfile Parameters](#dockerfile-parameters)
- [Using the API](#using-the-api)
- [Understanding Request Schema](#understanding-request-schema)
- [REST API Examples](#rest-api-examples)
- [Python SDK](#python-sdk)
- [Metrics & Monitoring](#metrics--monitoring)
- [Deployment Scenarios](#deployment-scenarios)
- [Complete Examples](#complete-examples)
- [Getting Help](#getting-help)
2. Run the container:
```bash
docker run -d -p 8000:8000 \
--env-file .llm.env \
--name crawl4ai \
crawl4ai-server:prod
```
## Prerequisites
---
Before we dive in, make sure you have:
- Docker installed and running (version 20.10.0 or higher)
- At least 4GB of RAM available for the container
- Python 3.10+ (if using the Python SDK)
- Node.js 16+ (if using the Node.js examples)
## Configuration Options
> 💡 **Pro tip**: Run `docker info` to check your Docker installation and available resources.
## Installation
### Local Build
Let's get your local environment set up step by step!
#### 1. Building the Image
First, clone the repository and build the Docker image:
### 1. **Using .llm.env File**
Create a `.llm.env` file with your API keys:
```bash
OPENAI_API_KEY=sk-your-key
DEEPSEEK_API_KEY=your-deepseek-key
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
# Build the Docker image
docker build -t crawl4ai-server:prod \
--build-arg PYTHON_VERSION=3.10 \
--build-arg INSTALL_TYPE=all \
--build-arg ENABLE_GPU=false \
deploy/docker/
```
Run with:
#### 2. Environment Setup
If you plan to use LLMs (Language Models), you'll need to set up your API keys. Create a `.llm.env` file:
```env
# OpenAI
OPENAI_API_KEY=sk-your-key
# Anthropic
ANTHROPIC_API_KEY=your-anthropic-key
# DeepSeek
DEEPSEEK_API_KEY=your-deepseek-key
# Check out https://docs.litellm.ai/docs/providers for more providers!
```
> 🔑 **Note**: Keep your API keys secure! Never commit them to version control.
#### 3. Running the Container
You have several options for running the container:
Basic run (no LLM support):
```bash
docker run -d -p 8000:8000 --name crawl4ai crawl4ai-server:prod
```
With LLM support:
```bash
docker run -d -p 8000:8000 \
--env-file .llm.env \
--name crawl4ai \
crawl4ai-server:prod
```
### 2. **Direct Environment Variables**
Pass keys directly:
Using host environment variables (Not a good practice, but works for local testing):
```bash
docker run -d -p 8000:8000 \
-e OPENAI_API_KEY="sk-your-key" \
-e DEEPSEEK_API_KEY="your-deepseek-key" \
--env-file .llm.env \
--env-from "$(env)" \
--name crawl4ai \
crawl4ai-server:prod
```
### 3. **Copy Host Environment Variables**
Use the `--copy-env` flag to copy `.llm.env` from the host:
### More on Building
You have several options for building the Docker image based on your needs:
#### Basic Build
```bash
docker run -d -p 8000:8000 \
--copy-env \
crawl4ai-server:prod
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
# Simple build with defaults
docker build -t crawl4ai-server:prod deploy/docker/
```
### 4. **Advanced: Docker Compose**
Create a `docker-compose.yml`:
```yaml
version: '3.8'
services:
crawl4ai:
image: crawl4ai-server:prod
ports:
- "8000:8000"
env_file:
- .llm.env
restart: unless-stopped
```
Run with:
#### Advanced Build Options
```bash
docker-compose up -d
# Build with custom parameters
docker build -t crawl4ai-server:prod \
--build-arg PYTHON_VERSION=3.10 \
--build-arg INSTALL_TYPE=all \
--build-arg ENABLE_GPU=false \
deploy/docker/
```
---
#### Platform-Specific Builds
The Dockerfile includes optimizations for different architectures (ARM64 and AMD64). Docker automatically detects your platform, but you can specify it explicitly:
## Supported Environment Variables
| Variable | Description |
|------------------------|--------------------------------------|
| `OPENAI_API_KEY` | OpenAI API key |
| `DEEPSEEK_API_KEY` | DeepSeek API key |
| `ANTHROPIC_API_KEY` | Anthropic API key |
| `GROQ_API_KEY` | Groq API key |
| `TOGETHER_API_KEY` | Together API key |
| `LLAMA_CLOUD_API_KEY` | Llama Cloud API key |
| `COHERE_API_KEY` | Cohere API key |
| `MISTRAL_API_KEY` | Mistral API key |
| `PERPLEXITY_API_KEY` | Perplexity API key |
| `VERTEXAI_PROJECT_ID` | Google Vertex AI project ID |
| `VERTEXAI_LOCATION` | Google Vertex AI location |
```bash
# Build for ARM64
docker build --platform linux/arm64 -t crawl4ai-server:arm64 deploy/docker/
---
# Build for AMD64
docker build --platform linux/amd64 -t crawl4ai-server:amd64 deploy/docker/
```
## Healthcheck
The container includes a healthcheck:
#### Multi-Platform Build
For distributing your image across different architectures, use `buildx`:
```bash
# Set up buildx builder
docker buildx create --use
# Build for multiple platforms
docker buildx build \
--platform linux/amd64,linux/arm64 \
-t yourusername/crawl4ai-server:multi \
--push \
deploy/docker/
```
> 💡 **Note**: Multi-platform builds require Docker Buildx and need to be pushed to a registry.
#### Development Build
For development, you might want to enable all features:
```bash
docker build -t crawl4ai-server:dev \
--build-arg INSTALL_TYPE=all \
--build-arg PYTHON_VERSION=3.10 \
--build-arg ENABLE_GPU=true \
deploy/docker/
```
#### GPU-Enabled Build
If you plan to use GPU acceleration:
```bash
docker build -t crawl4ai-server:gpu \
--build-arg ENABLE_GPU=true \
deploy/docker/
```
### Build Arguments Explained
| Argument | Description | Default | Options |
|----------|-------------|---------|----------|
| PYTHON_VERSION | Python version | 3.10 | 3.8, 3.9, 3.10 |
| INSTALL_TYPE | Feature set | default | default, all, torch, transformer |
| ENABLE_GPU | GPU support | false | true, false |
| APP_HOME | Install path | /app | any valid path |
### Build Best Practices
1. **Choose the Right Install Type**
- `default`: Basic installation, smallest image, to be honest, I use this most of the time.
- `all`: Full features, larger image (include transformer, and nltk, make sure you really need them)
2. **Platform Considerations**
- Let Docker auto-detect platform unless you need cross-compilation
- Use --platform for specific architecture requirements
- Consider buildx for multi-architecture distribution
3. **Development vs Production**
- Use `INSTALL_TYPE=all` for development
- Stick to `default` for production if you don't need extra features
- Enable GPU only if you have compatible hardware
4. **Performance Optimization**
- The image automatically includes platform-specific optimizations
- AMD64 gets OpenMP optimizations
- ARM64 gets OpenBLAS optimizations
### Docker Hub
> 🚧 Coming soon! The image will be available at `crawl4ai/server`. Stay tuned!
## Dockerfile Parameters
Configure your build with these parameters:
| Parameter | Description | Default | Options |
|-----------|-------------|---------|----------|
| PYTHON_VERSION | Python version to use | 3.10 | 3.8, 3.9, 3.10 |
| INSTALL_TYPE | Installation profile | default | default, all, torch, transformer |
| ENABLE_GPU | Enable GPU support | false | true, false |
| APP_HOME | Application directory | /app | any valid path |
| TARGETARCH | Target architecture | auto-detected | amd64, arm64 |
## Using the API
### Understanding Request Schema
This is super important! The API expects a specific structure that matches our Python classes. Let me show you how it works.
#### The Magic of Type Matching
When you send a request, each configuration object needs a "type" field that matches the exact class name from the library. Here's an example:
```python
# First, let's create objects the normal way
from crawl4ai import BrowserConfig, CrawlerRunConfig, PruningContentFilter
# Create some config objects
browser_config = BrowserConfig(headless=True, viewport={"width": 1200, "height": 800})
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed")
# Use dump() to see the serialized format
print(browser_config.dump())
```
This will output something like:
```json
{
"type": "BrowserConfig",
"params": {
"headless": true,
"viewport": {
"width": 1200,
"height": 800
}
}
}
```
#### Making API Requests
So when making a request, your JSON should look like this:
```json
{
"urls": ["https://example.com"],
"browser_config": {
"type": "BrowserConfig",
"params": {
"headless": true,
"viewport": {"width": 1200, "height": 800}
}
},
"crawler_config": {
"type": "CrawlerRunConfig",
"params": {
"cache_mode": "bypass",
"markdown_generator": {
"type": "DefaultMarkdownGenerator",
"params": {
"content_filter": {
"type": "PruningContentFilter",
"params": {
"threshold": 0.48,
"threshold_type": "fixed",
"min_word_threshold": 0
}
}
}
}
}
}
}
```
> 💡 **Pro tip**: Look at the class names in the library documentation - they map directly to the "type" fields in your requests!
### REST API Examples
Let's look at some practical examples:
#### Simple Crawl
```python
import requests
response = requests.post(
"http://localhost:8000/crawl",
json={
"urls": ["https://example.com"],
"browser_config": {
"type": "BrowserConfig",
"params": {"headless": True}
}
}
)
print(response.json())
```
#### Streaming Results
```python
import requests
response = requests.post(
"http://localhost:8000/crawl",
json={
"urls": ["https://example.com"],
"crawler_config": {
"type": "CrawlerRunConfig",
"params": {"stream": True}
}
},
stream=True
)
for line in response.iter_lines():
if line:
print(line.decode())
```
### Python SDK
The SDK makes things even easier! Here's how to use it:
```python
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig
async with Crawl4aiDockerClient() as client:
# The SDK handles serialization for you!
result = await client.crawl(
urls=["https://example.com"],
browser_config=BrowserConfig(headless=True),
crawler_config=CrawlerRunConfig(stream=False)
)
print(result.markdown)
```
## Metrics & Monitoring
Keep an eye on your crawler with these endpoints:
- `/health` - Quick health check
- `/metrics` - Detailed Prometheus metrics
- `/schema` - Full API schema
Example health check:
```bash
curl http://localhost:8000/health
```
---
## Deployment Scenarios
## Troubleshooting
1. **Missing Keys**: Ensure all required keys are set in `.llm.env`.
2. **Permissions**: Run `chmod +x docker-entrypoint.sh` if permissions are denied.
3. **Logs**: Check logs with:
```bash
docker logs crawl4ai
> 🚧 Coming soon! We'll cover:
> - Kubernetes deployment
> - Cloud provider setups (AWS, GCP, Azure)
> - High-availability configurations
> - Load balancing strategies
## Complete Examples
Check out the `examples` folder in our repository for full working examples! Here's one to get you started:
```python
import requests
import time
import httpx
import asyncio
from typing import Dict, Any
from crawl4ai import (
BrowserConfig, CrawlerRunConfig, DefaultMarkdownGenerator,
PruningContentFilter, JsonCssExtractionStrategy, LLMContentFilter, CacheMode
)
from crawl4ai.docker_client import Crawl4aiDockerClient
class Crawl4AiTester:
def __init__(self, base_url: str = "http://localhost:11235"):
self.base_url = base_url
def submit_and_wait(
self, request_data: Dict[str, Any], timeout: int = 300
) -> Dict[str, Any]:
# Submit crawl job
response = requests.post(f"{self.base_url}/crawl", json=request_data)
task_id = response.json()["task_id"]
print(f"Task ID: {task_id}")
# Poll for result
start_time = time.time()
while True:
if time.time() - start_time > timeout:
raise TimeoutError(
f"Task {task_id} did not complete within {timeout} seconds"
)
result = requests.get(f"{self.base_url}/task/{task_id}")
status = result.json()
if status["status"] == "failed":
print("Task failed:", status.get("error"))
raise Exception(f"Task failed: {status.get('error')}")
if status["status"] == "completed":
return status
time.sleep(2)
async def test_direct_api():
"""Test direct API endpoints without using the client SDK"""
print("\n=== Testing Direct API Calls ===")
# Test 1: Basic crawl with content filtering
browser_config = BrowserConfig(
headless=True,
viewport_width=1200,
viewport_height=800
)
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48,
threshold_type="fixed",
min_word_threshold=0
),
options={"ignore_links": True}
)
)
request_data = {
"urls": ["https://example.com"],
"browser_config": browser_config.dump(),
"crawler_config": crawler_config.dump()
}
# Make direct API call
async with httpx.AsyncClient() as client:
response = await client.post(
"http://localhost:8000/crawl",
json=request_data,
timeout=300
)
assert response.status_code == 200
result = response.json()
print("Basic crawl result:", result["success"])
# Test 2: Structured extraction with JSON CSS
schema = {
"baseSelector": "article.post",
"fields": [
{"name": "title", "selector": "h1", "type": "text"},
{"name": "content", "selector": ".content", "type": "html"}
]
}
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
extraction_strategy=JsonCssExtractionStrategy(schema=schema)
)
request_data["crawler_config"] = crawler_config.dump()
async with httpx.AsyncClient() as client:
response = await client.post(
"http://localhost:8000/crawl",
json=request_data
)
assert response.status_code == 200
result = response.json()
print("Structured extraction result:", result["success"])
# Test 3: Get schema
# async with httpx.AsyncClient() as client:
# response = await client.get("http://localhost:8000/schema")
# assert response.status_code == 200
# schemas = response.json()
# print("Retrieved schemas for:", list(schemas.keys()))
async def test_with_client():
"""Test using the Crawl4AI Docker client SDK"""
print("\n=== Testing Client SDK ===")
async with Crawl4aiDockerClient(verbose=True) as client:
# Test 1: Basic crawl
browser_config = BrowserConfig(headless=True)
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48,
threshold_type="fixed"
)
)
)
result = await client.crawl(
urls=["https://example.com"],
browser_config=browser_config,
crawler_config=crawler_config
)
print("Client SDK basic crawl:", result.success)
# Test 2: LLM extraction with streaming
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(
content_filter=LLMContentFilter(
provider="openai/gpt-40",
instruction="Extract key technical concepts"
)
),
stream=True
)
async for result in await client.crawl(
urls=["https://example.com"],
browser_config=browser_config,
crawler_config=crawler_config
):
print(f"Streaming result for: {result.url}")
# # Test 3: Get schema
# schemas = await client.get_schema()
# print("Retrieved client schemas for:", list(schemas.keys()))
async def main():
"""Run all tests"""
# Test direct API
print("Testing direct API calls...")
await test_direct_api()
# Test client SDK
print("\nTesting client SDK...")
await test_with_client()
if __name__ == "__main__":
asyncio.run(main())
```
## Server Configuration
The server's behavior can be customized through the `config.yml` file. Let's explore how to configure your Crawl4AI server for optimal performance and security.
### Understanding config.yml
The configuration file is located at `deploy/docker/config.yml`. You can either modify this file before building the image or mount a custom configuration when running the container.
Here's a detailed breakdown of the configuration options:
```yaml
# Application Configuration
app:
title: "Crawl4AI API" # Server title in OpenAPI docs
version: "1.0.0" # API version
host: "0.0.0.0" # Listen on all interfaces
port: 8000 # Server port
reload: True # Enable hot reloading (development only)
timeout_keep_alive: 300 # Keep-alive timeout in seconds
# Rate Limiting Configuration
rate_limiting:
enabled: True # Enable/disable rate limiting
default_limit: "100/minute" # Rate limit format: "number/timeunit"
trusted_proxies: [] # List of trusted proxy IPs
storage_uri: "memory://" # Use "redis://localhost:6379" for production
# Security Configuration
security:
enabled: false # Master toggle for security features
https_redirect: True # Force HTTPS
trusted_hosts: ["*"] # Allowed hosts (use specific domains in production)
headers: # Security headers
x_content_type_options: "nosniff"
x_frame_options: "DENY"
content_security_policy: "default-src 'self'"
strict_transport_security: "max-age=63072000; includeSubDomains"
# Crawler Configuration
crawler:
memory_threshold_percent: 95.0 # Memory usage threshold
rate_limiter:
base_delay: [1.0, 2.0] # Min and max delay between requests
timeouts:
stream_init: 30.0 # Stream initialization timeout
batch_process: 300.0 # Batch processing timeout
# Logging Configuration
logging:
level: "INFO" # Log level (DEBUG, INFO, WARNING, ERROR)
format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
# Observability Configuration
observability:
prometheus:
enabled: True # Enable Prometheus metrics
endpoint: "/metrics" # Metrics endpoint
health_check:
endpoint: "/health" # Health check endpoint
```
### Configuration Tips and Best Practices
1. **Production Settings** 🏭
```yaml
app:
reload: False # Disable reload in production
timeout_keep_alive: 120 # Lower timeout for better resource management
rate_limiting:
storage_uri: "redis://redis:6379" # Use Redis for distributed rate limiting
default_limit: "50/minute" # More conservative rate limit
security:
enabled: true # Enable all security features
trusted_hosts: ["your-domain.com"] # Restrict to your domain
```
---
2. **Development Settings** 🛠️
```yaml
app:
reload: True # Enable hot reloading
timeout_keep_alive: 300 # Longer timeout for debugging
logging:
level: "DEBUG" # More verbose logging
```
## Security Best Practices
- Never commit `.llm.env` to version control.
- Use Docker secrets in production (Swarm/K8s).
- Rotate keys regularly.
3. **High-Traffic Settings** 🚦
```yaml
crawler:
memory_threshold_percent: 85.0 # More conservative memory limit
rate_limiter:
base_delay: [2.0, 4.0] # More aggressive rate limiting
```
### Customizing Your Configuration
#### Method 1: Pre-build Configuration
```bash
# Copy and modify config before building
cp deploy/docker/config.yml custom-config.yml
vim custom-config.yml
# Build with custom config
docker build -t crawl4ai-server:prod \
--build-arg CONFIG_PATH=custom-config.yml .
```
#### Method 2: Runtime Configuration
```bash
# Mount custom config at runtime
docker run -d -p 8000:8000 \
-v $(pwd)/custom-config.yml:/app/config.yml \
crawl4ai-server:prod
```
### Configuration Recommendations
1. **Security First** 🔒
- Always enable security in production
- Use specific trusted_hosts instead of wildcards
- Set up proper rate limiting to protect your server
- Consider your environment before enabling HTTPS redirect
2. **Resource Management** 💻
- Adjust memory_threshold_percent based on available RAM
- Set timeouts according to your content size and network conditions
- Use Redis for rate limiting in multi-container setups
3. **Monitoring** 📊
- Enable Prometheus if you need metrics
- Set DEBUG logging in development, INFO in production
- Regular health check monitoring is crucial
4. **Performance Tuning** ⚡
- Start with conservative rate limiter delays
- Increase batch_process timeout for large content
- Adjust stream_init timeout based on initial response times
### Configuration Migration
When upgrading Crawl4AI, follow these steps:
1. Back up your current config:
```bash
cp /app/config.yml /app/config.yml.backup
```
2. Use version control:
```bash
git add config.yml
git commit -m "Save current server configuration"
```
3. Test in staging first:
```bash
docker run -d -p 8001:8000 \ # Use different port
-v $(pwd)/new-config.yml:/app/config.yml \
crawl4ai-server:prod
```
### Common Configuration Scenarios
1. **Basic Development Setup**
```yaml
security:
enabled: false
logging:
level: "DEBUG"
```
2. **Production API Server**
```yaml
security:
enabled: true
trusted_hosts: ["api.yourdomain.com"]
rate_limiting:
enabled: true
default_limit: "50/minute"
```
3. **High-Performance Crawler**
```yaml
crawler:
memory_threshold_percent: 90.0
timeouts:
batch_process: 600.0
```
## Getting Help
We're here to help you succeed with Crawl4AI! Here's how to get support:
- 📖 Check our [full documentation](https://docs.crawl4ai.com)
- 🐛 Found a bug? [Open an issue](https://github.com/unclecode/crawl4ai/issues)
- 💬 Join our [Discord community](https://discord.gg/crawl4ai)
- ⭐ Star us on GitHub to show support!
## Summary
In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:
- Building and running the Docker container
- Configuring the environment
- Making API requests with proper typing
- Using the Python SDK
- Monitoring your deployment
Remember, the examples in the `examples` folder are your friends - they show real-world usage patterns that you can adapt for your needs.
Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀
Happy crawling! 🕷️

305
deploy/docker/api.py Normal file
View File

@@ -0,0 +1,305 @@
import os
import json
import logging
from typing import Optional, AsyncGenerator
from urllib.parse import unquote
from fastapi import HTTPException, Request, status
from fastapi.background import BackgroundTasks
from fastapi.responses import JSONResponse
from redis import asyncio as aioredis
from crawl4ai import (
AsyncWebCrawler,
CrawlerRunConfig,
LLMExtractionStrategy,
CacheMode
)
from crawl4ai.content_filter_strategy import (
PruningContentFilter,
BM25ContentFilter,
LLMContentFilter
)
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from utils import (
TaskStatus,
FilterType,
get_base_url,
is_task_id,
should_cleanup_task,
decode_redis_hash
)
logger = logging.getLogger(__name__)
async def process_llm_extraction(
redis: aioredis.Redis,
config: dict,
task_id: str,
url: str,
instruction: str,
schema: Optional[str] = None,
cache: str = "0"
) -> None:
"""Process LLM extraction in background."""
try:
llm_strategy = LLMExtractionStrategy(
provider=config["llm"]["provider"],
api_token=os.environ.get(config["llm"].get("api_key_env", None), ""),
instruction=instruction,
schema=json.loads(schema) if schema else None,
)
cache_mode = CacheMode.ENABLED if cache == "1" else CacheMode.BYPASS
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=url,
config=CrawlerRunConfig(
extraction_strategy=llm_strategy,
scraping_strategy=LXMLWebScrapingStrategy(),
cache_mode=cache_mode
)
)
if not result.success:
await redis.hset(f"task:{task_id}", mapping={
"status": TaskStatus.FAILED,
"error": result.error_message
})
return
content = json.loads(result.extracted_content)
await redis.hset(f"task:{task_id}", mapping={
"status": TaskStatus.COMPLETED,
"result": json.dumps(content)
})
except Exception as e:
logger.error(f"LLM extraction error: {str(e)}", exc_info=True)
await redis.hset(f"task:{task_id}", mapping={
"status": TaskStatus.FAILED,
"error": str(e)
})
async def handle_markdown_request(
url: str,
filter_type: FilterType,
query: Optional[str] = None,
cache: str = "0",
config: Optional[dict] = None
) -> str:
"""Handle markdown generation requests."""
try:
decoded_url = unquote(url)
if not decoded_url.startswith(('http://', 'https://')):
decoded_url = 'https://' + decoded_url
if filter_type == FilterType.RAW:
md_generator = DefaultMarkdownGenerator()
else:
content_filter = {
FilterType.FIT: PruningContentFilter(),
FilterType.BM25: BM25ContentFilter(user_query=query or ""),
FilterType.LLM: LLMContentFilter(
provider=config["llm"]["provider"],
api_token=os.environ.get(config["llm"].get("api_key_env", None), ""),
instruction=query or "Extract main content"
)
}[filter_type]
md_generator = DefaultMarkdownGenerator(content_filter=content_filter)
cache_mode = CacheMode.ENABLED if cache == "1" else CacheMode.BYPASS
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=decoded_url,
config=CrawlerRunConfig(
markdown_generator=md_generator,
scraping_strategy=LXMLWebScrapingStrategy(),
cache_mode=cache_mode
)
)
if not result.success:
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=result.error_message
)
return (result.markdown_v2.raw_markdown
if filter_type == FilterType.RAW
else result.markdown_v2.fit_markdown)
except Exception as e:
logger.error(f"Markdown error: {str(e)}", exc_info=True)
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=str(e)
)
async def handle_llm_request(
redis: aioredis.Redis,
background_tasks: BackgroundTasks,
request: Request,
input_path: str,
query: Optional[str] = None,
schema: Optional[str] = None,
cache: str = "0",
config: Optional[dict] = None
) -> JSONResponse:
"""Handle LLM extraction requests."""
base_url = get_base_url(request)
try:
if is_task_id(input_path):
return await handle_task_status(
redis, input_path, base_url
)
if not query:
return JSONResponse({
"message": "Please provide an instruction",
"_links": {
"example": {
"href": f"{base_url}/llm/{input_path}?q=Extract+main+content",
"title": "Try this example"
}
}
})
return await create_new_task(
redis,
background_tasks,
input_path,
query,
schema,
cache,
base_url,
config
)
except Exception as e:
logger.error(f"LLM endpoint error: {str(e)}", exc_info=True)
return JSONResponse({
"error": str(e),
"_links": {
"retry": {"href": str(request.url)}
}
}, status_code=status.HTTP_500_INTERNAL_SERVER_ERROR)
async def handle_task_status(
redis: aioredis.Redis,
task_id: str,
base_url: str
) -> JSONResponse:
"""Handle task status check requests."""
task = await redis.hgetall(f"task:{task_id}")
if not task:
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND,
detail="Task not found"
)
task = decode_redis_hash(task)
response = create_task_response(task, task_id, base_url)
if task["status"] in [TaskStatus.COMPLETED, TaskStatus.FAILED]:
if should_cleanup_task(task["created_at"]):
await redis.delete(f"task:{task_id}")
return JSONResponse(response)
async def create_new_task(
redis: aioredis.Redis,
background_tasks: BackgroundTasks,
input_path: str,
query: str,
schema: Optional[str],
cache: str,
base_url: str,
config: dict
) -> JSONResponse:
"""Create and initialize a new task."""
decoded_url = unquote(input_path)
if not decoded_url.startswith(('http://', 'https://')):
decoded_url = 'https://' + decoded_url
from datetime import datetime
task_id = f"llm_{int(datetime.now().timestamp())}_{id(background_tasks)}"
await redis.hset(f"task:{task_id}", mapping={
"status": TaskStatus.PROCESSING,
"created_at": datetime.now().isoformat(),
"url": decoded_url
})
background_tasks.add_task(
process_llm_extraction,
redis,
config,
task_id,
decoded_url,
query,
schema,
cache
)
return JSONResponse({
"task_id": task_id,
"status": TaskStatus.PROCESSING,
"url": decoded_url,
"_links": {
"self": {"href": f"{base_url}/llm/{task_id}"},
"status": {"href": f"{base_url}/llm/{task_id}"}
}
})
def create_task_response(task: dict, task_id: str, base_url: str) -> dict:
"""Create response for task status check."""
response = {
"task_id": task_id,
"status": task["status"],
"created_at": task["created_at"],
"url": task["url"],
"_links": {
"self": {"href": f"{base_url}/llm/{task_id}"},
"refresh": {"href": f"{base_url}/llm/{task_id}"}
}
}
if task["status"] == TaskStatus.COMPLETED:
response["result"] = json.loads(task["result"])
elif task["status"] == TaskStatus.FAILED:
response["error"] = task["error"]
return response
async def stream_results(crawler: AsyncWebCrawler, results_gen: AsyncGenerator) -> AsyncGenerator[bytes, None]:
"""Stream results with heartbeats and completion markers."""
import asyncio
import json
from utils import datetime_handler
try:
async for result in results_gen:
try:
result_dict = result.model_dump()
logger.info(f"Streaming result for {result_dict.get('url', 'unknown')}")
data = json.dumps(result_dict, default=datetime_handler) + "\n"
yield data.encode('utf-8')
except Exception as e:
logger.error(f"Serialization error: {e}")
error_response = {"error": str(e), "url": getattr(result, 'url', 'unknown')}
yield (json.dumps(error_response) + "\n").encode('utf-8')
yield json.dumps({"status": "completed"}).encode('utf-8')
except asyncio.CancelledError:
logger.warning("Client disconnected during streaming")
finally:
try:
await crawler.close()
except Exception as e:
logger.error(f"Crawler cleanup error: {e}")

69
deploy/docker/config.yml Normal file
View File

@@ -0,0 +1,69 @@
# Application Configuration
app:
title: "Crawl4AI API"
version: "1.0.0"
host: "0.0.0.0"
port: 8000
reload: True
timeout_keep_alive: 300
# Default LLM Configuration
llm:
provider: "openai/gpt-4o-mini"
api_key_env: "OPENAI_API_KEY"
# Redis Configuration
redis:
host: "localhost"
port: 6379
db: 0
password: ""
ssl: False
ssl_cert_reqs: None
ssl_ca_certs: None
ssl_certfile: None
ssl_keyfile: None
ssl_cert_reqs: None
ssl_ca_certs: None
ssl_certfile: None
ssl_keyfile: None
# Rate Limiting Configuration
rate_limiting:
enabled: True
default_limit: "1000/minute"
trusted_proxies: []
storage_uri: "memory://" # Use "redis://localhost:6379" for production
# Security Configuration
security:
enabled: false
https_redirect: True
trusted_hosts: ["*"]
headers:
x_content_type_options: "nosniff"
x_frame_options: "DENY"
content_security_policy: "default-src 'self'"
strict_transport_security: "max-age=63072000; includeSubDomains"
# Crawler Configuration
crawler:
memory_threshold_percent: 95.0
rate_limiter:
base_delay: [1.0, 2.0]
timeouts:
stream_init: 30.0 # Timeout for stream initialization
batch_process: 300.0 # Timeout for batch processing
# Logging Configuration
logging:
level: "INFO"
format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
# Observability Configuration
observability:
prometheus:
enabled: True
endpoint: "/metrics"
health_check:
endpoint: "/health"

View File

@@ -1,4 +1,7 @@
crawl4ai
fastapi
uvicorn
gunicorn>=23.0.0
gunicorn>=23.0.0
slowapi>=0.1.9
prometheus-fastapi-instrumentator>=7.0.2
redis>=5.2.1

View File

@@ -1,120 +1,237 @@
import os
import sys
import time
from typing import List, Optional
sys.path.append(os.path.dirname(os.path.realpath(__file__)))
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
import json
import asyncio
from typing import AsyncGenerator
from crawl4ai import (
BrowserConfig,
CrawlerRunConfig,
AsyncWebCrawler,
MemoryAdaptiveDispatcher,
RateLimiter,
from redis import asyncio as aioredis
from fastapi import FastAPI, HTTPException, Request, status
from fastapi.responses import StreamingResponse, RedirectResponse
from fastapi.middleware.httpsredirect import HTTPSRedirectMiddleware
from fastapi.middleware.trustedhost import TrustedHostMiddleware
from pydantic import BaseModel, Field
from slowapi import Limiter
from slowapi.util import get_remote_address
from prometheus_fastapi_instrumentator import Instrumentator
from fastapi.responses import PlainTextResponse
from fastapi.responses import JSONResponse
from fastapi.background import BackgroundTasks
from typing import Dict
import os
from utils import (
FilterType,
load_config,
setup_logging
)
from api import (
handle_markdown_request,
handle_llm_request
)
from typing import List, Optional
from pydantic import BaseModel
# Load configuration and setup
config = load_config()
setup_logging(config)
# Initialize Redis
redis = aioredis.from_url(config["redis"].get("uri", "redis://localhost"))
# Initialize rate limiter
limiter = Limiter(
key_func=get_remote_address,
default_limits=[config["rate_limiting"]["default_limit"]],
storage_uri=config["rate_limiting"]["storage_uri"]
)
app = FastAPI(
title=config["app"]["title"],
version=config["app"]["version"]
)
# Configure middleware
if config["security"]["enabled"]:
if config["security"]["https_redirect"]:
app.add_middleware(HTTPSRedirectMiddleware)
if config["security"]["trusted_hosts"] and config["security"]["trusted_hosts"] != ["*"]:
app.add_middleware(
TrustedHostMiddleware,
allowed_hosts=config["security"]["trusted_hosts"]
)
# Prometheus instrumentation
if config["observability"]["prometheus"]["enabled"]:
Instrumentator().instrument(app).expose(app)
class CrawlRequest(BaseModel):
urls: List[str]
browser_config: Optional[dict] = None
crawler_config: Optional[dict] = None
class CrawlResponse(BaseModel):
success: bool
results: List[dict]
class Config:
arbitrary_types_allowed = True
app = FastAPI(title="Crawl4AI API")
async def stream_results(crawler: AsyncWebCrawler, results_gen: AsyncGenerator) -> AsyncGenerator[bytes, None]:
"""Stream results and manage crawler lifecycle"""
def datetime_handler(obj):
"""Custom handler for datetime objects during JSON serialization"""
if hasattr(obj, 'isoformat'):
return obj.isoformat()
raise TypeError(f"Object of type {type(obj)} is not JSON serializable")
try:
async for result in results_gen:
try:
# Use dump method for serialization
result_dict = result.model_dump()
print(f"Streaming result for URL: {result_dict['url']}, Success: {result_dict['success']}")
# Use custom JSON encoder with datetime handler
yield (json.dumps(result_dict, default=datetime_handler) + "\n").encode('utf-8')
except Exception as e:
print(f"Error serializing result: {e}")
error_response = {
"error": str(e),
"url": getattr(result, 'url', 'unknown')
}
yield (json.dumps(error_response, default=datetime_handler) + "\n").encode('utf-8')
except asyncio.CancelledError:
print("Client disconnected, cleaning up...")
finally:
try:
await crawler.close()
except Exception as e:
print(f"Error closing crawler: {e}")
@app.post("/crawl")
async def crawl(request: CrawlRequest):
# Load configs using our new utilities
browser_config = BrowserConfig.load(request.browser_config)
crawler_config = CrawlerRunConfig.load(request.crawler_config)
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=95.0,
rate_limiter=RateLimiter(base_delay=(1.0, 2.0)),
urls: List[str] = Field(
min_length=1,
max_length=100,
json_schema_extra={
"items": {"type": "string", "maxLength": 2000, "pattern": "\\S"}
}
)
browser_config: Optional[Dict] = Field(
default_factory=dict,
example={"headless": True, "viewport": {"width": 1200}}
)
crawler_config: Optional[Dict] = Field(
default_factory=dict,
example={"stream": True, "cache_mode": "aggressive"}
)
try:
if crawler_config.stream:
crawler = AsyncWebCrawler(config=browser_config)
await crawler.start()
@app.middleware("http")
async def add_security_headers(request: Request, call_next):
response = await call_next(request)
if config["security"]["enabled"]:
response.headers.update(config["security"]["headers"])
return response
results_gen = await crawler.arun_many(
urls=request.urls,
config=crawler_config,
dispatcher=dispatcher
)
@app.get("/md/{url:path}")
@limiter.limit(config["rate_limiting"]["default_limit"])
async def get_markdown(
request: Request,
url: str,
f: FilterType = FilterType.FIT,
q: Optional[str] = None,
c: Optional[str] = "0"
):
"""Get markdown from URL with optional filtering."""
result = await handle_markdown_request(url, f, q, c, config)
return PlainTextResponse(result)
return StreamingResponse(
stream_results(crawler, results_gen),
media_type='application/x-ndjson'
)
else:
async with AsyncWebCrawler(config=browser_config) as crawler:
results = await crawler.arun_many(
urls=request.urls,
config=crawler_config,
dispatcher=dispatcher
)
# Use dump method for each result
results_dict = [result.model_dump() for result in results]
return CrawlResponse(success=True, results=results_dict)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@app.get("/llm/{input:path}")
@limiter.limit(config["rate_limiting"]["default_limit"])
async def llm_endpoint(
request: Request,
background_tasks: BackgroundTasks,
input: str,
q: Optional[str] = None,
s: Optional[str] = None,
c: Optional[str] = "0"
):
"""Handle LLM extraction requests."""
return await handle_llm_request(
redis, background_tasks, request, input, q, s, c, config
)
@app.get("/schema")
async def get_schema():
"""Return config schemas for client validation"""
"""Endpoint for client-side validation schema."""
from crawl4ai import BrowserConfig, CrawlerRunConfig
return {
"browser": BrowserConfig.model_json_schema(),
"crawler": CrawlerRunConfig.model_json_schema()
}
@app.get("/health")
@app.get(config["observability"]["health_check"]["endpoint"])
async def health():
return {"status": "ok"}
"""Health check endpoint."""
return {"status": "ok", "timestamp": time.time()}
@app.get(config["observability"]["prometheus"]["endpoint"])
async def metrics():
"""Prometheus metrics endpoint."""
return RedirectResponse(url=config["observability"]["prometheus"]["endpoint"])
@app.post("/crawl")
@limiter.limit(config["rate_limiting"]["default_limit"])
async def crawl(request: Request, crawl_request: CrawlRequest):
"""Handle crawl requests."""
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
MemoryAdaptiveDispatcher,
RateLimiter
)
import asyncio
import logging
logger = logging.getLogger(__name__)
crawler = None
try:
if not crawl_request.urls:
logger.error("Empty URL list received")
raise HTTPException(
status_code=status.HTTP_400_BAD_REQUEST,
detail="At least one URL required"
)
browser_config = BrowserConfig.load(crawl_request.browser_config)
crawler_config = CrawlerRunConfig.load(crawl_request.crawler_config)
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
rate_limiter=RateLimiter(
base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"])
)
)
if crawler_config.stream:
crawler = AsyncWebCrawler(config=browser_config)
await crawler.start()
results_gen = await asyncio.wait_for(
crawler.arun_many(
urls=crawl_request.urls,
config=crawler_config,
dispatcher=dispatcher
),
timeout=config["crawler"]["timeouts"]["stream_init"]
)
from api import stream_results
return StreamingResponse(
stream_results(crawler, results_gen),
media_type='application/x-ndjson',
headers={
'Cache-Control': 'no-cache',
'Connection': 'keep-alive',
'X-Stream-Status': 'active'
}
)
else:
async with AsyncWebCrawler(config=browser_config) as crawler:
results = await asyncio.wait_for(
crawler.arun_many(
urls=crawl_request.urls,
config=crawler_config,
dispatcher=dispatcher
),
timeout=config["crawler"]["timeouts"]["batch_process"]
)
return JSONResponse({
"success": True,
"results": [result.model_dump() for result in results]
})
except asyncio.TimeoutError as e:
logger.error(f"Operation timed out: {str(e)}")
raise HTTPException(
status_code=status.HTTP_504_GATEWAY_TIMEOUT,
detail="Processing timeout"
)
except Exception as e:
logger.error(f"Server error: {str(e)}", exc_info=True)
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail="Internal server error"
)
finally:
if crawler:
try:
await crawler.close()
except Exception as e:
logger.error(f"Final crawler cleanup error: {e}")
if __name__ == "__main__":
import uvicorn
uvicorn.run("server:app", host="0.0.0.0", port=8000, reload=True)
uvicorn.run(
"server:app",
host=config["app"]["host"],
port=config["app"]["port"],
reload=config["app"]["reload"],
timeout_keep_alive=config["app"]["timeout_keep_alive"]
)

54
deploy/docker/utils.py Normal file
View File

@@ -0,0 +1,54 @@
import logging
import yaml
from datetime import datetime
from enum import Enum
from pathlib import Path
from fastapi import Request
from typing import Dict, Optional
class TaskStatus(str, Enum):
PROCESSING = "processing"
FAILED = "failed"
COMPLETED = "completed"
class FilterType(str, Enum):
RAW = "raw"
FIT = "fit"
BM25 = "bm25"
LLM = "llm"
def load_config() -> Dict:
"""Load and return application configuration."""
config_path = Path(__file__).parent / "config.yml"
with open(config_path, "r") as config_file:
return yaml.safe_load(config_file)
def setup_logging(config: Dict) -> None:
"""Configure application logging."""
logging.basicConfig(
level=config["logging"]["level"],
format=config["logging"]["format"]
)
def get_base_url(request: Request) -> str:
"""Get base URL including scheme and host."""
return f"{request.url.scheme}://{request.url.netloc}"
def is_task_id(value: str) -> bool:
"""Check if the value matches task ID pattern."""
return value.startswith("llm_") and "_" in value
def datetime_handler(obj: any) -> Optional[str]:
"""Handle datetime serialization for JSON."""
if hasattr(obj, 'isoformat'):
return obj.isoformat()
raise TypeError(f"Object of type {type(obj)} is not JSON serializable")
def should_cleanup_task(created_at: str) -> bool:
"""Check if task should be cleaned up based on creation time."""
created = datetime.fromisoformat(created_at)
return (datetime.now() - created).total_seconds() > 3600
def decode_redis_hash(hash_data: Dict[bytes, bytes]) -> Dict[str, str]:
"""Decode Redis hash data from bytes to strings."""
return {k.decode('utf-8'): v.decode('utf-8') for k, v in hash_data.items()}