crawl4ai/docs/md_v2/basic/docker-deploymeny.md

# Docker Deployment 🐳

Crawl4AI provides official Docker images for easy deployment and scalability. This guide covers installation, configuration, and usage of Crawl4AI in Docker environments.

## Docker Compose Setup 🐳

### Basic Usage

Create a `docker-compose.yml`:
```yaml
version: '3.8'

services:
  crawl4ai:
    image: unclecode/crawl4ai:all
    ports:
      - "11235:11235"
    volumes:
      - /dev/shm:/dev/shm
    deploy:
      resources:
        limits:
          memory: 4G
    restart: unless-stopped
```

Run with:
```bash
docker-compose up -d
```

### Secure Mode with API Token

To enable API authentication, simply set the `CRAWL4AI_API_TOKEN`:
```bash
CRAWL4AI_API_TOKEN=your-secret-token docker-compose up -d
```

### Using Environment Variables

Create a `.env` file for your API tokens:
```env
# Crawl4AI API Security (optional)
CRAWL4AI_API_TOKEN=your-secret-token

# LLM Provider API Keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
GEMINI_API_KEY=...
OLLAMA_API_KEY=...

# Additional Configuration
MAX_CONCURRENT_TASKS=5
```

Docker Compose will automatically load variables from the `.env` file. No additional configuration needed!

### Testing with API Token

```python
import requests

# Initialize headers with token if using secure mode
headers = {}
if api_token := os.getenv('CRAWL4AI_API_TOKEN'):
    headers['Authorization'] = f'Bearer {api_token}'

# Test crawl with authentication
response = requests.post(
    "http://localhost:11235/crawl",
    headers=headers,
    json={
        "urls": "https://www.nbcnews.com/business",
        "priority": 10
    }
)
task_id = response.json()["task_id"]
```

### Security Best Practices 🔒

- Add `.env` to your `.gitignore`
- Use different API tokens for development and production
- Rotate API tokens periodically
- Use secure methods to pass tokens in production environments
```

This addition to your documentation:
1. Shows how to use Docker Compose
2. Explains both secure and non-secure modes
3. Demonstrates environment variable configuration
4. Provides example code for authenticated requests
5. Includes security best practices


## Usage Examples 📝

### Basic Crawling

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "priority": 10
}

response = requests.post("http://localhost:11235/crawl", json=request)
task_id = response.json()["task_id"]

# Get results
result = requests.get(f"http://localhost:11235/task/{task_id}")
```

### Structured Data Extraction

```python
schema = {
    "name": "Crypto Prices",
    "baseSelector": ".cds-tableRow-t45thuk",
    "fields": [
        {
            "name": "crypto",
            "selector": "td:nth-child(1) h2",
            "type": "text",
        },
        {
            "name": "price",
            "selector": "td:nth-child(2)",
            "type": "text",
        }
    ],
}

request = {
    "urls": "https://www.coinbase.com/explore",
    "extraction_config": {
        "type": "json_css",
        "params": {"schema": schema}
    }
}
```

### Dynamic Content Handling

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "js_code": [
        "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
    ],
    "wait_for": "article.tease-card:nth-child(10)"
}
```

### AI-Powered Extraction (Full Version)

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "extraction_config": {
        "type": "cosine",
        "params": {
            "semantic_filter": "business finance economy",
            "word_count_threshold": 10,
            "max_dist": 0.2,
            "top_k": 3
        }
    }
}
```

## Platform-Specific Instructions 💻

### macOS
```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```

### Ubuntu
```bash
# Basic version
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic

# With GPU support
docker pull unclecode/crawl4ai:gpu
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu
```

### Windows (PowerShell)
```powershell
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```

## Testing 🧪

Save this as `test_docker.py`:

```python
import requests
import json
import time
import sys

class Crawl4AiTester:
    def __init__(self, base_url: str = "http://localhost:11235"):
        self.base_url = base_url

    def submit_and_wait(self, request_data: dict, timeout: int = 300) -> dict:
        # Submit crawl job
        response = requests.post(f"{self.base_url}/crawl", json=request_data)
        task_id = response.json()["task_id"]
        print(f"Task ID: {task_id}")

        # Poll for result
        start_time = time.time()
        while True:
            if time.time() - start_time > timeout:
                raise TimeoutError(f"Task {task_id} timeout")

            result = requests.get(f"{self.base_url}/task/{task_id}")
            status = result.json()

            if status["status"] == "completed":
                return status

            time.sleep(2)

def test_deployment():
    tester = Crawl4AiTester()

    # Test basic crawl
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 10
    }

    result = tester.submit_and_wait(request)
    print("Basic crawl successful!")
    print(f"Content length: {len(result['result']['markdown'])}")

if __name__ == "__main__":
    test_deployment()
```

## Advanced Configuration ⚙️

### Crawler Parameters

The `crawler_params` field allows you to configure the browser instance and crawling behavior. Here are key parameters you can use:

```python
request = {
    "urls": "https://example.com",
    "crawler_params": {
        # Browser Configuration
        "headless": True,                    # Run in headless mode
        "browser_type": "chromium",          # chromium/firefox/webkit
        "user_agent": "custom-agent",        # Custom user agent
        "proxy": "http://proxy:8080",        # Proxy configuration

        # Performance & Behavior
        "page_timeout": 30000,               # Page load timeout (ms)
        "verbose": True,                     # Enable detailed logging
        "semaphore_count": 5,               # Concurrent request limit

        # Anti-Detection Features
        "simulate_user": True,               # Simulate human behavior
        "magic": True,                       # Advanced anti-detection
        "override_navigator": True,          # Override navigator properties

        # Session Management
        "user_data_dir": "./browser-data",   # Browser profile location
        "use_managed_browser": True,         # Use persistent browser
    }
}
```

### Extra Parameters

The `extra` field allows passing additional parameters directly to the crawler's `arun` function:

```python
request = {
    "urls": "https://example.com",
    "extra": {
        "word_count_threshold": 10,          # Min words per block
        "only_text": True,                   # Extract only text
        "bypass_cache": True,                # Force fresh crawl
        "process_iframes": True,             # Include iframe content
    }
}
```

### Complete Examples

1. **Advanced News Crawling**
```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "crawler_params": {
        "headless": True,
        "page_timeout": 30000,
        "remove_overlay_elements": True      # Remove popups
    },
    "extra": {
        "word_count_threshold": 50,          # Longer content blocks
        "bypass_cache": True                 # Fresh content
    },
    "css_selector": ".article-body"
}
```

2. **Anti-Detection Configuration**
```python
request = {
    "urls": "https://example.com",
    "crawler_params": {
        "simulate_user": True,
        "magic": True,
        "override_navigator": True,
        "user_agent": "Mozilla/5.0 ...",
        "headers": {
            "Accept-Language": "en-US,en;q=0.9"
        }
    }
}
```

3. **LLM Extraction with Custom Parameters**
```python
request = {
    "urls": "https://openai.com/pricing",
    "extraction_config": {
        "type": "llm",
        "params": {
            "provider": "openai/gpt-4",
            "schema": pricing_schema
        }
    },
    "crawler_params": {
        "verbose": True,
        "page_timeout": 60000
    },
    "extra": {
        "word_count_threshold": 1,
        "only_text": True
    }
}
```

4. **Session-Based Dynamic Content**
```python
request = {
    "urls": "https://example.com",
    "crawler_params": {
        "session_id": "dynamic_session",
        "headless": False,
        "page_timeout": 60000
    },
    "js_code": ["window.scrollTo(0, document.body.scrollHeight);"],
    "wait_for": "js:() => document.querySelectorAll('.item').length > 10",
    "extra": {
        "delay_before_return_html": 2.0
    }
}
```

5. **Screenshot with Custom Timing**
```python
request = {
    "urls": "https://example.com",
    "screenshot": True,
    "crawler_params": {
        "headless": True,
        "screenshot_wait_for": ".main-content"
    },
    "extra": {
        "delay_before_return_html": 3.0
    }
}
```

### Parameter Reference Table

| Category | Parameter | Type | Description |
|----------|-----------|------|-------------|
| Browser | headless | bool | Run browser in headless mode |
| Browser | browser_type | str | Browser engine selection |
| Browser | user_agent | str | Custom user agent string |
| Network | proxy | str | Proxy server URL |
| Network | headers | dict | Custom HTTP headers |
| Timing | page_timeout | int | Page load timeout (ms) |
| Timing | delay_before_return_html | float | Wait before capture |
| Anti-Detection | simulate_user | bool | Human behavior simulation |
| Anti-Detection | magic | bool | Advanced protection |
| Session | session_id | str | Browser session ID |
| Session | user_data_dir | str | Profile directory |
| Content | word_count_threshold | int | Minimum words per block |
| Content | only_text | bool | Text-only extraction |
| Content | process_iframes | bool | Include iframe content |
| Debug | verbose | bool | Detailed logging |
| Debug | log_console | bool | Browser console logs |

## Troubleshooting 🔍

### Common Issues

1. **Connection Refused**
   ```
   Error: Connection refused at localhost:11235
   ```
   Solution: Ensure the container is running and ports are properly mapped.

2. **Resource Limits**
   ```
   Error: No available slots
   ```
   Solution: Increase MAX_CONCURRENT_TASKS or container resources.

3. **GPU Access**
   ```
   Error: GPU not found
   ```
   Solution: Ensure proper NVIDIA drivers and use `--gpus all` flag.

### Debug Mode

Access container for debugging:
```bash
docker run -it --entrypoint /bin/bash unclecode/crawl4ai:all
```

View container logs:
```bash
docker logs [container_id]
```

## Best Practices 🌟

1. **Resource Management**
   - Set appropriate memory and CPU limits
   - Monitor resource usage via health endpoint
   - Use basic version for simple crawling tasks

2. **Scaling**
   - Use multiple containers for high load
   - Implement proper load balancing
   - Monitor performance metrics

3. **Security**
   - Use environment variables for sensitive data
   - Implement proper network isolation
   - Regular security updates

## API Reference 📚

### Health Check
```http
GET /health
```

### Submit Crawl Task
```http
POST /crawl
Content-Type: application/json

{
    "urls": "string or array",
    "extraction_config": {
        "type": "basic|llm|cosine|json_css",
        "params": {}
    },
    "priority": 1-10,
    "ttl": 3600
}
```

### Get Task Status
```http
GET /task/{task_id}
```

For more details, visit the [official documentation](https://crawl4ai.com/mkdocs/).