Files
crawl4ai/docs/md_v2/basic/docker-deploymeny.md

12 KiB

Docker Deployment 🐳

Crawl4AI provides official Docker images for easy deployment and scalability. This guide covers installation, configuration, and usage of Crawl4AI in Docker environments.

Docker Compose Setup 🐳

Basic Usage

Create a docker-compose.yml:

version: '3.8'

services:
  crawl4ai:
    image: unclecode/crawl4ai:all
    ports:
      - "11235:11235"
    volumes:
      - /dev/shm:/dev/shm
    deploy:
      resources:
        limits:
          memory: 4G
    restart: unless-stopped

Run with:

docker-compose up -d

Secure Mode with API Token

To enable API authentication, simply set the CRAWL4AI_API_TOKEN:

CRAWL4AI_API_TOKEN=your-secret-token docker-compose up -d

Using Environment Variables

Create a .env file for your API tokens:

# Crawl4AI API Security (optional)
CRAWL4AI_API_TOKEN=your-secret-token

# LLM Provider API Keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
GEMINI_API_KEY=...
OLLAMA_API_KEY=...

# Additional Configuration
MAX_CONCURRENT_TASKS=5

Docker Compose will automatically load variables from the .env file. No additional configuration needed!

Testing with API Token

import requests

# Initialize headers with token if using secure mode
headers = {}
if api_token := os.getenv('CRAWL4AI_API_TOKEN'):
    headers['Authorization'] = f'Bearer {api_token}'

# Test crawl with authentication
response = requests.post(
    "http://localhost:11235/crawl",
    headers=headers,
    json={
        "urls": "https://www.nbcnews.com/business",
        "priority": 10
    }
)
task_id = response.json()["task_id"]

Security Best Practices 🔒

  • Add .env to your .gitignore
  • Use different API tokens for development and production
  • Rotate API tokens periodically
  • Use secure methods to pass tokens in production environments

This addition to your documentation:
1. Shows how to use Docker Compose
2. Explains both secure and non-secure modes
3. Demonstrates environment variable configuration
4. Provides example code for authenticated requests
5. Includes security best practices
















## Usage Examples 📝

### Basic Crawling

```python
request = {
    "urls": "https://www.nbcnews.com/business",
    "priority": 10
}

response = requests.post("http://localhost:11235/crawl", json=request)
task_id = response.json()["task_id"]

# Get results
result = requests.get(f"http://localhost:11235/task/{task_id}")

Structured Data Extraction

schema = {
    "name": "Crypto Prices",
    "baseSelector": ".cds-tableRow-t45thuk",
    "fields": [
        {
            "name": "crypto",
            "selector": "td:nth-child(1) h2",
            "type": "text",
        },
        {
            "name": "price",
            "selector": "td:nth-child(2)",
            "type": "text",
        }
    ],
}

request = {
    "urls": "https://www.coinbase.com/explore",
    "extraction_config": {
        "type": "json_css",
        "params": {"schema": schema}
    }
}

Dynamic Content Handling

request = {
    "urls": "https://www.nbcnews.com/business",
    "js_code": [
        "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
    ],
    "wait_for": "article.tease-card:nth-child(10)"
}

AI-Powered Extraction (Full Version)

request = {
    "urls": "https://www.nbcnews.com/business",
    "extraction_config": {
        "type": "cosine",
        "params": {
            "semantic_filter": "business finance economy",
            "word_count_threshold": 10,
            "max_dist": 0.2,
            "top_k": 3
        }
    }
}

Platform-Specific Instructions 💻

macOS

docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic

Ubuntu

# Basic version
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic

# With GPU support
docker pull unclecode/crawl4ai:gpu
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu

Windows (PowerShell)

docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic

Testing 🧪

Save this as test_docker.py:

import requests
import json
import time
import sys

class Crawl4AiTester:
    def __init__(self, base_url: str = "http://localhost:11235"):
        self.base_url = base_url
        
    def submit_and_wait(self, request_data: dict, timeout: int = 300) -> dict:
        # Submit crawl job
        response = requests.post(f"{self.base_url}/crawl", json=request_data)
        task_id = response.json()["task_id"]
        print(f"Task ID: {task_id}")
        
        # Poll for result
        start_time = time.time()
        while True:
            if time.time() - start_time > timeout:
                raise TimeoutError(f"Task {task_id} timeout")
                
            result = requests.get(f"{self.base_url}/task/{task_id}")
            status = result.json()
            
            if status["status"] == "completed":
                return status
                
            time.sleep(2)

def test_deployment():
    tester = Crawl4AiTester()
    
    # Test basic crawl
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 10
    }
    
    result = tester.submit_and_wait(request)
    print("Basic crawl successful!")
    print(f"Content length: {len(result['result']['markdown'])}")

if __name__ == "__main__":
    test_deployment()

Advanced Configuration ⚙️

Crawler Parameters

The crawler_params field allows you to configure the browser instance and crawling behavior. Here are key parameters you can use:

request = {
    "urls": "https://example.com",
    "crawler_params": {
        # Browser Configuration
        "headless": True,                    # Run in headless mode
        "browser_type": "chromium",          # chromium/firefox/webkit
        "user_agent": "custom-agent",        # Custom user agent
        "proxy": "http://proxy:8080",        # Proxy configuration
        
        # Performance & Behavior
        "page_timeout": 30000,               # Page load timeout (ms)
        "verbose": True,                     # Enable detailed logging
        "semaphore_count": 5,               # Concurrent request limit
        
        # Anti-Detection Features
        "simulate_user": True,               # Simulate human behavior
        "magic": True,                       # Advanced anti-detection
        "override_navigator": True,          # Override navigator properties
        
        # Session Management
        "user_data_dir": "./browser-data",   # Browser profile location
        "use_managed_browser": True,         # Use persistent browser
    }
}

Extra Parameters

The extra field allows passing additional parameters directly to the crawler's arun function:

request = {
    "urls": "https://example.com",
    "extra": {
        "word_count_threshold": 10,          # Min words per block
        "only_text": True,                   # Extract only text
        "bypass_cache": True,                # Force fresh crawl
        "process_iframes": True,             # Include iframe content
    }
}

Complete Examples

  1. Advanced News Crawling
request = {
    "urls": "https://www.nbcnews.com/business",
    "crawler_params": {
        "headless": True,
        "page_timeout": 30000,
        "remove_overlay_elements": True      # Remove popups
    },
    "extra": {
        "word_count_threshold": 50,          # Longer content blocks
        "bypass_cache": True                 # Fresh content
    },
    "css_selector": ".article-body"
}
  1. Anti-Detection Configuration
request = {
    "urls": "https://example.com",
    "crawler_params": {
        "simulate_user": True,
        "magic": True,
        "override_navigator": True,
        "user_agent": "Mozilla/5.0 ...",
        "headers": {
            "Accept-Language": "en-US,en;q=0.9"
        }
    }
}
  1. LLM Extraction with Custom Parameters
request = {
    "urls": "https://openai.com/pricing",
    "extraction_config": {
        "type": "llm",
        "params": {
            "provider": "openai/gpt-4",
            "schema": pricing_schema
        }
    },
    "crawler_params": {
        "verbose": True,
        "page_timeout": 60000
    },
    "extra": {
        "word_count_threshold": 1,
        "only_text": True
    }
}
  1. Session-Based Dynamic Content
request = {
    "urls": "https://example.com",
    "crawler_params": {
        "session_id": "dynamic_session",
        "headless": False,
        "page_timeout": 60000
    },
    "js_code": ["window.scrollTo(0, document.body.scrollHeight);"],
    "wait_for": "js:() => document.querySelectorAll('.item').length > 10",
    "extra": {
        "delay_before_return_html": 2.0
    }
}
  1. Screenshot with Custom Timing
request = {
    "urls": "https://example.com",
    "screenshot": True,
    "crawler_params": {
        "headless": True,
        "screenshot_wait_for": ".main-content"
    },
    "extra": {
        "delay_before_return_html": 3.0
    }
}

Parameter Reference Table

Category Parameter Type Description
Browser headless bool Run browser in headless mode
Browser browser_type str Browser engine selection
Browser user_agent str Custom user agent string
Network proxy str Proxy server URL
Network headers dict Custom HTTP headers
Timing page_timeout int Page load timeout (ms)
Timing delay_before_return_html float Wait before capture
Anti-Detection simulate_user bool Human behavior simulation
Anti-Detection magic bool Advanced protection
Session session_id str Browser session ID
Session user_data_dir str Profile directory
Content word_count_threshold int Minimum words per block
Content only_text bool Text-only extraction
Content process_iframes bool Include iframe content
Debug verbose bool Detailed logging
Debug log_console bool Browser console logs

Troubleshooting 🔍

Common Issues

  1. Connection Refused

    Error: Connection refused at localhost:11235
    

    Solution: Ensure the container is running and ports are properly mapped.

  2. Resource Limits

    Error: No available slots
    

    Solution: Increase MAX_CONCURRENT_TASKS or container resources.

  3. GPU Access

    Error: GPU not found
    

    Solution: Ensure proper NVIDIA drivers and use --gpus all flag.

Debug Mode

Access container for debugging:

docker run -it --entrypoint /bin/bash unclecode/crawl4ai:all

View container logs:

docker logs [container_id]

Best Practices 🌟

  1. Resource Management

    • Set appropriate memory and CPU limits
    • Monitor resource usage via health endpoint
    • Use basic version for simple crawling tasks
  2. Scaling

    • Use multiple containers for high load
    • Implement proper load balancing
    • Monitor performance metrics
  3. Security

    • Use environment variables for sensitive data
    • Implement proper network isolation
    • Regular security updates

API Reference 📚

Health Check

GET /health

Submit Crawl Task

POST /crawl
Content-Type: application/json

{
    "urls": "string or array",
    "extraction_config": {
        "type": "basic|llm|cosine|json_css",
        "params": {}
    },
    "priority": 1-10,
    "ttl": 3600
}

Get Task Status

GET /task/{task_id}

For more details, visit the official documentation.