crawl4ai/deploy/docker/README.md

# Crawl4AI Docker Guide 🐳

## Table of Contents
- [Prerequisites](#prerequisites)
- [Installation](#installation)
  - [Local Build](#local-build)
  - [Docker Hub](#docker-hub)
- [Dockerfile Parameters](#dockerfile-parameters)
- [Using the API](#using-the-api)
  - [Understanding Request Schema](#understanding-request-schema)
  - [REST API Examples](#rest-api-examples)
  - [Python SDK](#python-sdk)
- [Metrics & Monitoring](#metrics--monitoring)
- [Deployment Scenarios](#deployment-scenarios)
- [Complete Examples](#complete-examples)
- [Getting Help](#getting-help)

## Prerequisites

Before we dive in, make sure you have:
- Docker installed and running (version 20.10.0 or higher)
- At least 4GB of RAM available for the container
- Python 3.10+ (if using the Python SDK)
- Node.js 16+ (if using the Node.js examples)

> 💡 **Pro tip**: Run `docker info` to check your Docker installation and available resources.

## Installation

### Local Build

Let's get your local environment set up step by step!

#### 1. Building the Image

First, clone the repository and build the Docker image:

```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai

# Build the Docker image
docker build -t crawl4ai-server:prod \
  --build-arg PYTHON_VERSION=3.10 \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=false \
  deploy/docker/
```

#### 2. Environment Setup

If you plan to use LLMs (Language Models), you'll need to set up your API keys. Create a `.llm.env` file:

```env
# OpenAI
OPENAI_API_KEY=sk-your-key

# Anthropic
ANTHROPIC_API_KEY=your-anthropic-key

# DeepSeek
DEEPSEEK_API_KEY=your-deepseek-key

# Check out https://docs.litellm.ai/docs/providers for more providers!
```

> 🔑 **Note**: Keep your API keys secure! Never commit them to version control.

#### 3. Running the Container

You have several options for running the container:

Basic run (no LLM support):
```bash
docker run -d -p 8000:8000 --name crawl4ai crawl4ai-server:prod
```

With LLM support:
```bash
docker run -d -p 8000:8000 \
  --env-file .llm.env \
  --name crawl4ai \
  crawl4ai-server:prod
```

Using host environment variables (Not a good practice, but works for local testing):
```bash
docker run -d -p 8000:8000 \
  --env-file .llm.env \
  --env-from "$(env)" \
  --name crawl4ai \
  crawl4ai-server:prod
```

### More on Building

You have several options for building the Docker image based on your needs:

#### Basic Build
```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai

# Simple build with defaults
docker build -t crawl4ai-server:prod deploy/docker/
```

#### Advanced Build Options
```bash
# Build with custom parameters
docker build -t crawl4ai-server:prod \
  --build-arg PYTHON_VERSION=3.10 \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=false \
  deploy/docker/
```

#### Platform-Specific Builds
The Dockerfile includes optimizations for different architectures (ARM64 and AMD64). Docker automatically detects your platform, but you can specify it explicitly:

```bash
# Build for ARM64
docker build --platform linux/arm64 -t crawl4ai-server:arm64 deploy/docker/

# Build for AMD64
docker build --platform linux/amd64 -t crawl4ai-server:amd64 deploy/docker/
```

#### Multi-Platform Build
For distributing your image across different architectures, use `buildx`:

```bash
# Set up buildx builder
docker buildx create --use

# Build for multiple platforms
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t yourusername/crawl4ai-server:multi \
  --push \
  deploy/docker/
```

> 💡 **Note**: Multi-platform builds require Docker Buildx and need to be pushed to a registry.

#### Development Build
For development, you might want to enable all features:

```bash
docker build -t crawl4ai-server:dev \
  --build-arg INSTALL_TYPE=all \
  --build-arg PYTHON_VERSION=3.10 \
  --build-arg ENABLE_GPU=true \
  deploy/docker/
```

#### GPU-Enabled Build
If you plan to use GPU acceleration:

```bash
docker build -t crawl4ai-server:gpu \
  --build-arg ENABLE_GPU=true \
  deploy/docker/
```

### Build Arguments Explained

| Argument | Description | Default | Options |
|----------|-------------|---------|----------|
| PYTHON_VERSION | Python version | 3.10 | 3.8, 3.9, 3.10 |
| INSTALL_TYPE | Feature set | default | default, all, torch, transformer |
| ENABLE_GPU | GPU support | false | true, false |
| APP_HOME | Install path | /app | any valid path |

### Build Best Practices

1. **Choose the Right Install Type**
   - `default`: Basic installation, smallest image, to be honest, I use this most of the time.
   - `all`: Full features, larger image (include transformer, and nltk, make sure you really need them)

2. **Platform Considerations**
   - Let Docker auto-detect platform unless you need cross-compilation
   - Use --platform for specific architecture requirements
   - Consider buildx for multi-architecture distribution

3. **Development vs Production**
   - Use `INSTALL_TYPE=all` for development
   - Stick to `default` for production if you don't need extra features
   - Enable GPU only if you have compatible hardware

4. **Performance Optimization**
   - The image automatically includes platform-specific optimizations
   - AMD64 gets OpenMP optimizations
   - ARM64 gets OpenBLAS optimizations

### Docker Hub

> 🚧 Coming soon! The image will be available at `crawl4ai/server`. Stay tuned!

## Dockerfile Parameters

Configure your build with these parameters:

| Parameter | Description | Default | Options |
|-----------|-------------|---------|----------|
| PYTHON_VERSION | Python version to use | 3.10 | 3.8, 3.9, 3.10 |
| INSTALL_TYPE | Installation profile | default | default, all, torch, transformer |
| ENABLE_GPU | Enable GPU support | false | true, false |
| APP_HOME | Application directory | /app | any valid path |
| TARGETARCH | Target architecture | auto-detected | amd64, arm64 |

## Using the API

### Understanding Request Schema

This is super important! The API expects a specific structure that matches our Python classes. Let me show you how it works.

#### The Magic of Type Matching

When you send a request, each configuration object needs a "type" field that matches the exact class name from the library. Here's an example:

```python
# First, let's create objects the normal way
from crawl4ai import BrowserConfig, CrawlerRunConfig, PruningContentFilter

# Create some config objects
browser_config = BrowserConfig(headless=True, viewport={"width": 1200, "height": 800})
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed")

# Use dump() to see the serialized format
print(browser_config.dump())
```

This will output something like:
```json
{
    "type": "BrowserConfig",
    "params": {
        "headless": true,
        "viewport": {
            "width": 1200,
            "height": 800
        }
    }
}
```

#### Making API Requests

So when making a request, your JSON should look like this:

```json
{
    "urls": ["https://example.com"],
    "browser_config": {
        "type": "BrowserConfig",
        "params": {
            "headless": true,
            "viewport": {"width": 1200, "height": 800}
        }
    },
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "cache_mode": "bypass",
            "markdown_generator": {
                "type": "DefaultMarkdownGenerator",
                "params": {
                    "content_filter": {
                        "type": "PruningContentFilter",
                        "params": {
                            "threshold": 0.48,
                            "threshold_type": "fixed",
                            "min_word_threshold": 0
                        }
                    }
                }
            }
        }
    }
}
```

> 💡 **Pro tip**: Look at the class names in the library documentation - they map directly to the "type" fields in your requests!

### REST API Examples

Let's look at some practical examples:

#### Simple Crawl

```python
import requests

response = requests.post(
    "http://localhost:8000/crawl",
    json={
        "urls": ["https://example.com"],
        "browser_config": {
            "type": "BrowserConfig",
            "params": {"headless": True}
        }
    }
)
print(response.json())
```

#### Streaming Results

```python
import requests

response = requests.post(
    "http://localhost:8000/crawl",
    json={
        "urls": ["https://example.com"],
        "crawler_config": {
            "type": "CrawlerRunConfig",
            "params": {"stream": True}
        }
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        print(line.decode())
```

### Python SDK

The SDK makes things even easier! Here's how to use it:

```python
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig

async with Crawl4aiDockerClient() as client:
    # The SDK handles serialization for you!
    result = await client.crawl(
        urls=["https://example.com"],
        browser_config=BrowserConfig(headless=True),
        crawler_config=CrawlerRunConfig(stream=False)
    )
    print(result.markdown)
```

## Metrics & Monitoring

Keep an eye on your crawler with these endpoints:

- `/health` - Quick health check
- `/metrics` - Detailed Prometheus metrics
- `/schema` - Full API schema

Example health check:
```bash
curl http://localhost:8000/health
```

## Deployment Scenarios

> 🚧 Coming soon! We'll cover:
> - Kubernetes deployment
> - Cloud provider setups (AWS, GCP, Azure)
> - High-availability configurations
> - Load balancing strategies

## Complete Examples

Check out the `examples` folder in our repository for full working examples! Here's one to get you started:

```python
import requests
import time
import httpx
import asyncio
from typing import Dict, Any
from crawl4ai import (
    BrowserConfig, CrawlerRunConfig, DefaultMarkdownGenerator,
    PruningContentFilter, JsonCssExtractionStrategy, LLMContentFilter, CacheMode
)
from crawl4ai.docker_client import Crawl4aiDockerClient

class Crawl4AiTester:
    def __init__(self, base_url: str = "http://localhost:11235"):
        self.base_url = base_url

    def submit_and_wait(
        self, request_data: Dict[str, Any], timeout: int = 300
    ) -> Dict[str, Any]:
        # Submit crawl job
        response = requests.post(f"{self.base_url}/crawl", json=request_data)
        task_id = response.json()["task_id"]
        print(f"Task ID: {task_id}")

        # Poll for result
        start_time = time.time()
        while True:
            if time.time() - start_time > timeout:
                raise TimeoutError(
                    f"Task {task_id} did not complete within {timeout} seconds"
                )

            result = requests.get(f"{self.base_url}/task/{task_id}")
            status = result.json()

            if status["status"] == "failed":
                print("Task failed:", status.get("error"))
                raise Exception(f"Task failed: {status.get('error')}")

            if status["status"] == "completed":
                return status

            time.sleep(2)

async def test_direct_api():
    """Test direct API endpoints without using the client SDK"""
    print("\n=== Testing Direct API Calls ===")

    # Test 1: Basic crawl with content filtering
    browser_config = BrowserConfig(
        headless=True,
        viewport_width=1200,
        viewport_height=800
    )

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(
                threshold=0.48,
                threshold_type="fixed",
                min_word_threshold=0
            ),
            options={"ignore_links": True}
        )
    )

    request_data = {
        "urls": ["https://example.com"],
        "browser_config": browser_config.dump(),
        "crawler_config": crawler_config.dump()
    }

    # Make direct API call
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/crawl",
            json=request_data,
            timeout=300
        )
        assert response.status_code == 200
        result = response.json()
        print("Basic crawl result:", result["success"])

    # Test 2: Structured extraction with JSON CSS
    schema = {
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "content", "selector": ".content", "type": "html"}
        ]
    }

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema=schema)
    )

    request_data["crawler_config"] = crawler_config.dump()

    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/crawl",
            json=request_data
        )
        assert response.status_code == 200
        result = response.json()
        print("Structured extraction result:", result["success"])

    # Test 3: Get schema
    # async with httpx.AsyncClient() as client:
    #     response = await client.get("http://localhost:8000/schema")
    #     assert response.status_code == 200
    #     schemas = response.json()
    #     print("Retrieved schemas for:", list(schemas.keys()))

async def test_with_client():
    """Test using the Crawl4AI Docker client SDK"""
    print("\n=== Testing Client SDK ===")

    async with Crawl4aiDockerClient(verbose=True) as client:
        # Test 1: Basic crawl
        browser_config = BrowserConfig(headless=True)
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter(
                    threshold=0.48,
                    threshold_type="fixed"
                )
            )
        )

        result = await client.crawl(
            urls=["https://example.com"],
            browser_config=browser_config,
            crawler_config=crawler_config
        )
        print("Client SDK basic crawl:", result.success)

        # Test 2: LLM extraction with streaming
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=LLMContentFilter(
                    provider="openai/gpt-40",
                    instruction="Extract key technical concepts"
                )
            ),
            stream=True
        )

        async for result in await client.crawl(
            urls=["https://example.com"],
            browser_config=browser_config,
            crawler_config=crawler_config
        ):
            print(f"Streaming result for: {result.url}")

        # # Test 3: Get schema
        # schemas = await client.get_schema()
        # print("Retrieved client schemas for:", list(schemas.keys()))

async def main():
    """Run all tests"""
    # Test direct API
    print("Testing direct API calls...")
    await test_direct_api()

    # Test client SDK
    print("\nTesting client SDK...")
    await test_with_client()

if __name__ == "__main__":
    asyncio.run(main())
```

## Server Configuration

The server's behavior can be customized through the `config.yml` file. Let's explore how to configure your Crawl4AI server for optimal performance and security.

### Understanding config.yml

The configuration file is located at `deploy/docker/config.yml`. You can either modify this file before building the image or mount a custom configuration when running the container.

Here's a detailed breakdown of the configuration options:

```yaml
# Application Configuration
app:
  title: "Crawl4AI API"           # Server title in OpenAPI docs
  version: "1.0.0"               # API version
  host: "0.0.0.0"               # Listen on all interfaces
  port: 8000                    # Server port
  reload: True                  # Enable hot reloading (development only)
  timeout_keep_alive: 300       # Keep-alive timeout in seconds

# Rate Limiting Configuration
rate_limiting:
  enabled: True                 # Enable/disable rate limiting
  default_limit: "100/minute"   # Rate limit format: "number/timeunit"
  trusted_proxies: []          # List of trusted proxy IPs
  storage_uri: "memory://"     # Use "redis://localhost:6379" for production

# Security Configuration
security:
  enabled: false               # Master toggle for security features
  https_redirect: True         # Force HTTPS
  trusted_hosts: ["*"]        # Allowed hosts (use specific domains in production)
  headers:                     # Security headers
    x_content_type_options: "nosniff"
    x_frame_options: "DENY"
    content_security_policy: "default-src 'self'"
    strict_transport_security: "max-age=63072000; includeSubDomains"

# Crawler Configuration
crawler:
  memory_threshold_percent: 95.0  # Memory usage threshold
  rate_limiter:
    base_delay: [1.0, 2.0]      # Min and max delay between requests
  timeouts:
    stream_init: 30.0           # Stream initialization timeout
    batch_process: 300.0        # Batch processing timeout

# Logging Configuration
logging:
  level: "INFO"                 # Log level (DEBUG, INFO, WARNING, ERROR)
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

# Observability Configuration
observability:
  prometheus:
    enabled: True              # Enable Prometheus metrics
    endpoint: "/metrics"       # Metrics endpoint
  health_check:
    endpoint: "/health"        # Health check endpoint
```

### Configuration Tips and Best Practices

1. **Production Settings** 🏭
   ```yaml
   app:
     reload: False              # Disable reload in production
     timeout_keep_alive: 120    # Lower timeout for better resource management

   rate_limiting:
     storage_uri: "redis://redis:6379"  # Use Redis for distributed rate limiting
     default_limit: "50/minute"         # More conservative rate limit

   security:
     enabled: true                      # Enable all security features
     trusted_hosts: ["your-domain.com"] # Restrict to your domain
   ```

2. **Development Settings** 🛠️
   ```yaml
   app:
     reload: True               # Enable hot reloading
     timeout_keep_alive: 300    # Longer timeout for debugging

   logging:
     level: "DEBUG"            # More verbose logging
   ```

3. **High-Traffic Settings** 🚦
   ```yaml
   crawler:
     memory_threshold_percent: 85.0  # More conservative memory limit
     rate_limiter:
       base_delay: [2.0, 4.0]       # More aggressive rate limiting
   ```

### Customizing Your Configuration

#### Method 1: Pre-build Configuration
```bash
# Copy and modify config before building
cp deploy/docker/config.yml custom-config.yml
vim custom-config.yml

# Build with custom config
docker build -t crawl4ai-server:prod \
  --build-arg CONFIG_PATH=custom-config.yml .
```

#### Method 2: Runtime Configuration
```bash
# Mount custom config at runtime
docker run -d -p 8000:8000 \
  -v $(pwd)/custom-config.yml:/app/config.yml \
  crawl4ai-server:prod
```

### Configuration Recommendations

1. **Security First** 🔒
   - Always enable security in production
   - Use specific trusted_hosts instead of wildcards
   - Set up proper rate limiting to protect your server
   - Consider your environment before enabling HTTPS redirect

2. **Resource Management** 💻
   - Adjust memory_threshold_percent based on available RAM
   - Set timeouts according to your content size and network conditions
   - Use Redis for rate limiting in multi-container setups

3. **Monitoring** 📊
   - Enable Prometheus if you need metrics
   - Set DEBUG logging in development, INFO in production
   - Regular health check monitoring is crucial

4. **Performance Tuning** ⚡
   - Start with conservative rate limiter delays
   - Increase batch_process timeout for large content
   - Adjust stream_init timeout based on initial response times

### Configuration Migration

When upgrading Crawl4AI, follow these steps:

1. Back up your current config:
   ```bash
   cp /app/config.yml /app/config.yml.backup
   ```

2. Use version control:
   ```bash
   git add config.yml
   git commit -m "Save current server configuration"
   ```

3. Test in staging first:
   ```bash
   docker run -d -p 8001:8000 \  # Use different port
     -v $(pwd)/new-config.yml:/app/config.yml \
     crawl4ai-server:prod
   ```

### Common Configuration Scenarios

1. **Basic Development Setup**
   ```yaml
   security:
     enabled: false
   logging:
     level: "DEBUG"
   ```

2. **Production API Server**
   ```yaml
   security:
     enabled: true
     trusted_hosts: ["api.yourdomain.com"]
   rate_limiting:
     enabled: true
     default_limit: "50/minute"
   ```

3. **High-Performance Crawler**
   ```yaml
   crawler:
     memory_threshold_percent: 90.0
     timeouts:
       batch_process: 600.0
   ```

## Getting Help

We're here to help you succeed with Crawl4AI! Here's how to get support:

- 📖 Check our [full documentation](https://docs.crawl4ai.com)
- 🐛 Found a bug? [Open an issue](https://github.com/unclecode/crawl4ai/issues)
- 💬 Join our [Discord community](https://discord.gg/crawl4ai)
- ⭐ Star us on GitHub to show support!

## Summary

In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:
- Building and running the Docker container
- Configuring the environment
- Making API requests with proper typing
- Using the Python SDK
- Monitoring your deployment

Remember, the examples in the `examples` folder are your friends - they show real-world usage patterns that you can adapt for your needs.

Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀

Happy crawling! 🕷️