
Crawl4AI Docker Guide 🐳

Prerequisites

Before we dive in, make sure you have:

  • Docker installed and running (version 20.10.0 or higher)
  • At least 4GB of RAM available for the container
  • Python 3.10+ (if using the Python SDK)
  • Node.js 16+ (if using the Node.js examples)

💡 Pro tip: Run docker info to check your Docker installation and available resources.

Installation

Local Build

Let's get your local environment set up step by step!

1. Building the Image

First, clone the repository and build the Docker image:

# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai

# Build the Docker image
docker build -t crawl4ai-server:prod \
  --build-arg PYTHON_VERSION=3.10 \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=false \
  deploy/docker/

2. Environment Setup

If you plan to use LLMs (Large Language Models), you'll need to set up your API keys. Create a .llm.env file:

# OpenAI
OPENAI_API_KEY=sk-your-key

# Anthropic
ANTHROPIC_API_KEY=your-anthropic-key

# DeepSeek
DEEPSEEK_API_KEY=your-deepseek-key

# Check out https://docs.litellm.ai/docs/providers for more providers!

🔑 Note: Keep your API keys secure! Never commit them to version control.
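
If you later need to reference a key in code (for example, the api_token field used by the LLM extraction strategy further below), read it from the environment rather than hard-coding it. A tiny sketch:

import os

# The variable name matches the .llm.env file above; never paste
# the literal key into source code or request payloads.
api_token = os.environ.get("OPENAI_API_KEY")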

3. Running the Container

You have several options for running the container:

Basic run (no LLM support):

docker run -d -p 8000:8000 --name crawl4ai crawl4ai-server:prod

With LLM support:

docker run -d -p 8000:8000 \
  --env-file .llm.env \
  --name crawl4ai \
  crawl4ai-server:prod

Using host environment variables (not a best practice, but handy for local testing). Passing -e VAR with no value forwards that variable from your host environment into the container:

docker run -d -p 8000:8000 \
  -e OPENAI_API_KEY \
  -e ANTHROPIC_API_KEY \
  --name crawl4ai \
  crawl4ai-server:prod

More on Building

You have several options for building the Docker image based on your needs:

Basic Build

# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai

# Simple build with defaults
docker build -t crawl4ai-server:prod deploy/docker/

Advanced Build Options

# Build with custom parameters
docker build -t crawl4ai-server:prod \
  --build-arg PYTHON_VERSION=3.10 \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=false \
  deploy/docker/

Platform-Specific Builds

The Dockerfile includes optimizations for different architectures (ARM64 and AMD64). Docker automatically detects your platform, but you can specify it explicitly:

# Build for ARM64
docker build --platform linux/arm64 -t crawl4ai-server:arm64 deploy/docker/

# Build for AMD64
docker build --platform linux/amd64 -t crawl4ai-server:amd64 deploy/docker/

Multi-Platform Build

For distributing your image across different architectures, use buildx:

# Set up buildx builder
docker buildx create --use

# Build for multiple platforms
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t yourusername/crawl4ai-server:multi \
  --push \
  deploy/docker/

💡 Note: Multi-platform builds require Docker Buildx and need to be pushed to a registry.

Development Build

For development, you might want to enable all features:

docker build -t crawl4ai-server:dev \
  --build-arg INSTALL_TYPE=all \
  --build-arg PYTHON_VERSION=3.10 \
  --build-arg ENABLE_GPU=true \
  deploy/docker/

GPU-Enabled Build

If you plan to use GPU acceleration:

docker build -t crawl4ai-server:gpu \
  --build-arg ENABLE_GPU=true \
  deploy/docker/

Build Arguments Explained

Argument         Description      Default   Options
PYTHON_VERSION   Python version   3.10      3.8, 3.9, 3.10
INSTALL_TYPE     Feature set      default   default, all, torch, transformer
ENABLE_GPU       GPU support      false     true, false
APP_HOME         Install path     /app      any valid path

Build Best Practices

  1. Choose the Right Install Type

    • default: Basic installation, smallest image; honestly, this is what I use most of the time.
    • all: Full features, larger image (includes transformers and NLTK; make sure you really need them)
  2. Platform Considerations

    • Let Docker auto-detect platform unless you need cross-compilation
    • Use --platform for specific architecture requirements
    • Consider buildx for multi-architecture distribution
  3. Development vs Production

    • Use INSTALL_TYPE=all for development
    • Stick to default for production if you don't need extra features
    • Enable GPU only if you have compatible hardware
  4. Performance Optimization

    • The image automatically includes platform-specific optimizations
    • AMD64 gets OpenMP optimizations
    • ARM64 gets OpenBLAS optimizations

Docker Hub

🚧 Coming soon! The image will be available at crawl4ai/server. Stay tuned!

Dockerfile Parameters

Configure your build with these parameters:

Parameter        Description             Default         Options
PYTHON_VERSION   Python version to use   3.10            3.8, 3.9, 3.10
INSTALL_TYPE     Installation profile    default         default, all, torch, transformer
ENABLE_GPU       Enable GPU support      false           true, false
APP_HOME         Application directory   /app            any valid path
TARGETARCH       Target architecture     auto-detected   amd64, arm64

Using the API

In the following sections, we cover two ways to communicate with the Docker server. The first is the Python client SDK that I developed (a Node.js SDK is coming soon); I highly recommend this approach because it avoids serialization mistakes. The second, more hands-on route is to build the JSON request structure yourself and send it to the API endpoints, which I explain in detail below.

Python SDK

The SDK makes things easier! Here's how to use it:

import asyncio

from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig

async def main():
    async with Crawl4aiDockerClient() as client:
        # The SDK handles serialization for you!
        result = await client.crawl(
            urls=["https://example.com"],
            browser_config=BrowserConfig(headless=True),
            crawler_config=CrawlerRunConfig(stream=False)
        )
        print(result.markdown)

asyncio.run(main())

Crawl4aiDockerClient is an async context manager that handles the connection for you. You can pass in optional parameters for more control:

  • base_url (str): Base URL of the Crawl4AI Docker server
  • timeout (float): Default timeout for requests in seconds
  • verify_ssl (bool): Whether to verify SSL certificates
  • verbose (bool): Whether to show logging output
  • log_file (str, optional): Path to log file if file logging is desired

This client SDK generates a properly structured JSON request for the server's HTTP API.
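
For example, a client pointed at a non-default server with file logging enabled might be constructed like this (a minimal sketch; the values are placeholders for the optional parameters listed above):

from crawl4ai.docker_client import Crawl4aiDockerClient

client = Crawl4aiDockerClient(
    base_url="http://localhost:8000",   # wherever your container is listening
    timeout=120.0,                      # default per-request timeout in seconds
    verify_ssl=False,                   # e.g. for a self-signed local certificate
    verbose=True,                       # log to the console
    log_file="./crawl4ai_client.log"    # also write logs to this file
)

You would then enter it with async with, exactly as in the previous example.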

Second Approach: Direct API Calls

This is super important! The API expects a specific structure that matches our Python classes. Let me show you how it works.

The Magic of Type Matching

When you send a request, each configuration object needs a "type" field that matches the exact class name from the library. Here's an example:

# First, let's create objects the normal way
from crawl4ai import BrowserConfig, CrawlerRunConfig, PruningContentFilter

# Create some config objects
browser_config = BrowserConfig(headless=True, viewport={"width": 1200, "height": 800})
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed")

# Use dump() to see the serialized format
print(browser_config.dump())

This will output something like:

{
    "type": "BrowserConfig",
    "params": {
        "headless": true,
        "viewport": {
            "width": 1200,
            "height": 800
        }
    }
}
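
Once you have these serialized dicts, you can drop them straight into a request body; this is essentially what the SDK does for you. A minimal sketch using the requests library (assuming the server is running locally on port 8000, as in the rest of this guide):

import requests
from crawl4ai import BrowserConfig, CrawlerRunConfig

# dump() produces the {"type": ..., "params": ...} structure shown above
payload = {
    "urls": ["https://example.com"],
    "browser_config": BrowserConfig(headless=True).dump(),
    "crawler_config": CrawlerRunConfig(stream=False).dump()
}

response = requests.post("http://localhost:8000/crawl", json=payload)
print(response.status_code)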

Structuring Your Requests

  1. Basic Request Structure. Every request must include URLs and may include configuration objects:
{
    "urls": ["https://example.com"],
    "browser_config": {...},
    "crawler_config": {...}
}
  2. Understanding the Type-Params Pattern. All complex objects follow this pattern:
{
    "type": "ClassName",
    "params": {
        "param1": value1,
        "param2": value2
    }
}

💡 Note: Simple types (strings, numbers, booleans) are passed directly without the type-params wrapper.

  3. Browser Configuration
{
    "urls": ["https://example.com"],
    "browser_config": {
        "type": "BrowserConfig",
        "params": {
            "headless": true,
            "viewport": {
                "type": "dict",
                "value": {
                    "width": 1200,
                    "height": 800
                }
            }
        }
    }
}
  4. Simple Crawler Configuration
{
    "urls": ["https://example.com"],
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "word_count_threshold": 200,
            "stream": true,
            "verbose": true
        }
    }
}
  5. Advanced Crawler Configuration
{
    "urls": ["https://example.com"],
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "cache_mode": "bypass",
            "markdown_generator": {
                "type": "DefaultMarkdownGenerator",
                "params": {
                    "content_filter": {
                        "type": "PruningContentFilter",
                        "params": {
                            "threshold": 0.48,
                            "threshold_type": "fixed",
                            "min_word_threshold": 0
                        }
                    }
                }
            }
        }
    }
}
  6. Adding Strategies

Chunking Strategy:

{
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "chunking_strategy": {
                "type": "RegexChunking",
                "params": {
                    "patterns": ["\n\n", "\\.\\s+"]
                }
            }
        }
    }
}

Extraction Strategy:

{
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "extraction_strategy": {
                "type": "JsonCssExtractionStrategy",
                "params": {
                    "schema": {
                        "baseSelector": "article.post",
                        "fields": [
                            {"name": "title", "selector": "h1", "type": "text"},
                            {"name": "content", "selector": ".content", "type": "html"}
                        ]
                    }
                }
            }
        }
    }
}

LLM Extraction Strategy:

{
  "crawler_config": {
    "type": "CrawlerRunConfig",
    "params": {
      "extraction_strategy": {
        "type": "LLMExtractionStrategy",
        "params": {
          "instruction": "Extract article title, author, publication date and main content",
          "provider": "openai/gpt-4",
          "api_token": "your-api-token",
          "schema": {
            "type": "dict",
            "value": {
              "title": "Article Schema",
              "type": "object",
              "properties": {
                "title": {
                  "type": "string",
                  "description": "The article's headline"
                },
                "author": {
                  "type": "string",
                  "description": "The author's name"
                },
                "published_date": {
                  "type": "string",
                  "format": "date-time",
                  "description": "Publication date and time"
                },
                "content": {
                  "type": "string",
                  "description": "The main article content"
                }
              },
              "required": ["title", "content"]
            }
          }
        }
      }
    }
  }
}

Deep Crawl Strategy:

{
  "crawler_config": {
    "type": "CrawlerRunConfig",
    "params": {
      "deep_crawl_strategy": {
        "type": "BFSDeepCrawlStrategy",
        "params": {
          "max_depth": 3,
          "max_pages": 100,
          "filter_chain": {
            "type": "FastFilterChain",
            "params": {
              "filters": [
                {
                  "type": "FastContentTypeFilter",
                  "params": {
                    "allowed_types": ["text/html", "application/xhtml+xml"]
                  }
                },
                {
                  "type": "FastDomainFilter",
                  "params": {
                    "allowed_domains": ["blog.*", "docs.*"],
                    "blocked_domains": ["ads.*", "analytics.*"]
                  }
                },
                {
                  "type": "FastURLPatternFilter",
                  "params": {
                    "allowed_patterns": ["^/blog/", "^/docs/"],
                    "blocked_patterns": [".*/ads/", ".*/sponsored/"]
                  }
                }
              ]
            }
          },
          "url_scorer": {
            "type": "FastCompositeScorer",
            "params": {
              "scorers": [
                {
                  "type": "FastKeywordRelevanceScorer",
                  "params": {
                    "keywords": ["tutorial", "guide", "documentation"],
                    "weight": 1.0
                  }
                },
                {
                  "type": "FastPathDepthScorer",
                  "params": {
                    "weight": 0.5,
                    "preferred_depth": 2
                  }
                },
                {
                  "type": "FastFreshnessScorer",
                  "params": {
                    "weight": 0.8,
                    "max_age_days": 365
                  }
                }
              ]
            }
          }
        }
      }
    }
  }
}

Important Rules:

  • Always use the type-params pattern for class instances
  • Use direct values for primitives (numbers, strings, booleans)
  • Wrap dictionaries with {"type": "dict", "value": {...}}
  • Arrays/lists are passed directly without type-params
  • All parameters are optional unless specifically required
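
To make these rules concrete, here is a small, purely illustrative Python helper that applies them to plain values. It is not the library's serializer; in practice, just call dump() on the config objects as shown earlier:

from typing import Any

def to_typed(value: Any) -> Any:
    # Primitives (numbers, strings, booleans, None) pass through directly
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    # Dictionaries get wrapped as {"type": "dict", "value": {...}}
    if isinstance(value, dict):
        return {"type": "dict", "value": {k: to_typed(v) for k, v in value.items()}}
    # Lists are passed directly, each element serialized recursively
    if isinstance(value, (list, tuple)):
        return [to_typed(v) for v in value]
    # Class instances become {"type": "ClassName", "params": {...}}
    params = {k: to_typed(v) for k, v in vars(value).items() if not k.startswith("_")}
    return {"type": type(value).__name__, "params": params}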

REST API Examples

Let's look at some practical examples:

Simple Crawl

import requests

response = requests.post(
    "http://localhost:8000/crawl",
    json={
        "urls": ["https://example.com"],
        "browser_config": {
            "type": "BrowserConfig",
            "params": {"headless": True}
        }
    }
)
print(response.json())

Streaming Results

import requests

response = requests.post(
    "http://localhost:8000/crawl",
    json={
        "urls": ["https://example.com"],
        "crawler_config": {
            "type": "CrawlerRunConfig",
            "params": {"stream": True}
        }
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        print(line.decode())
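
When streaming is enabled, results arrive incrementally. Assuming each non-empty line of the response body is a standalone JSON object (an assumption about the wire format; adjust if your server version differs), you can parse results as they come in:

import json
import requests

response = requests.post(
    "http://localhost:8000/crawl",
    json={
        "urls": ["https://example.com"],
        "crawler_config": {"type": "CrawlerRunConfig", "params": {"stream": True}}
    },
    stream=True
)

for line in response.iter_lines():
    if not line:
        continue
    data = json.loads(line)  # assumption: one JSON object per line
    print(data.get("url"), data.get("success"))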

Metrics & Monitoring

Keep an eye on your crawler with these endpoints:

  • /health - Quick health check
  • /metrics - Detailed Prometheus metrics
  • /schema - Full API schema

Example health check:

curl http://localhost:8000/health
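
The same checks from Python (a quick sketch against the endpoints listed above; the exact response shapes depend on your server configuration):

import requests

base = "http://localhost:8000"

print(requests.get(f"{base}/health").text)         # quick liveness check
print(requests.get(f"{base}/metrics").text[:300])  # Prometheus metrics are plain text
print(requests.get(f"{base}/schema").status_code)  # full API schema endpoint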

Deployment Scenarios

🚧 Coming soon! We'll cover:

  • Kubernetes deployment
  • Cloud provider setups (AWS, GCP, Azure)
  • High-availability configurations
  • Load balancing strategies

Complete Examples

Check out the examples folder in our repository for full working examples! Here's one to get you started:

import requests
import time
import httpx
import asyncio
from typing import Dict, Any
from crawl4ai import (
    BrowserConfig, CrawlerRunConfig, DefaultMarkdownGenerator,
    PruningContentFilter, JsonCssExtractionStrategy, LLMContentFilter, CacheMode
)
from crawl4ai.docker_client import Crawl4aiDockerClient

class Crawl4AiTester:
    def __init__(self, base_url: str = "http://localhost:11235"):
        self.base_url = base_url

    def submit_and_wait(
        self, request_data: Dict[str, Any], timeout: int = 300
    ) -> Dict[str, Any]:
        # Submit crawl job
        response = requests.post(f"{self.base_url}/crawl", json=request_data)
        task_id = response.json()["task_id"]
        print(f"Task ID: {task_id}")

        # Poll for result
        start_time = time.time()
        while True:
            if time.time() - start_time > timeout:
                raise TimeoutError(
                    f"Task {task_id} did not complete within {timeout} seconds"
                )

            result = requests.get(f"{self.base_url}/task/{task_id}")
            status = result.json()

            if status["status"] == "failed":
                print("Task failed:", status.get("error"))
                raise Exception(f"Task failed: {status.get('error')}")

            if status["status"] == "completed":
                return status

            time.sleep(2)

async def test_direct_api():
    """Test direct API endpoints without using the client SDK"""
    print("\n=== Testing Direct API Calls ===")
    
    # Test 1: Basic crawl with content filtering
    browser_config = BrowserConfig(
        headless=True,
        viewport_width=1200,
        viewport_height=800
    )
    
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(
                threshold=0.48,
                threshold_type="fixed",
                min_word_threshold=0
            ),
            options={"ignore_links": True}
        )
    )

    request_data = {
        "urls": ["https://example.com"],
        "browser_config": browser_config.dump(),
        "crawler_config": crawler_config.dump()
    }

    # Make direct API call
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/crawl",
            json=request_data,
            timeout=300
        )
        assert response.status_code == 200
        result = response.json()
        print("Basic crawl result:", result["success"])

    # Test 2: Structured extraction with JSON CSS
    schema = {
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "content", "selector": ".content", "type": "html"}
        ]
    }

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema=schema)
    )

    request_data["crawler_config"] = crawler_config.dump()

    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/crawl",
            json=request_data
        )
        assert response.status_code == 200
        result = response.json()
        print("Structured extraction result:", result["success"])

    # Test 3: Get schema
    # async with httpx.AsyncClient() as client:
    #     response = await client.get("http://localhost:8000/schema")
    #     assert response.status_code == 200
    #     schemas = response.json()
    #     print("Retrieved schemas for:", list(schemas.keys()))

async def test_with_client():
    """Test using the Crawl4AI Docker client SDK"""
    print("\n=== Testing Client SDK ===")
    
    async with Crawl4aiDockerClient(verbose=True) as client:
        # Test 1: Basic crawl
        browser_config = BrowserConfig(headless=True)
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter(
                    threshold=0.48,
                    threshold_type="fixed"
                )
            )
        )

        result = await client.crawl(
            urls=["https://example.com"],
            browser_config=browser_config,
            crawler_config=crawler_config
        )
        print("Client SDK basic crawl:", result.success)

        # Test 2: LLM extraction with streaming
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=LLMContentFilter(
                    provider="openai/gpt-40",
                    instruction="Extract key technical concepts"
                )
            ),
            stream=True
        )

        async for result in await client.crawl(
            urls=["https://example.com"],
            browser_config=browser_config,
            crawler_config=crawler_config
        ):
            print(f"Streaming result for: {result.url}")

        # # Test 3: Get schema
        # schemas = await client.get_schema()
        # print("Retrieved client schemas for:", list(schemas.keys()))

async def main():
    """Run all tests"""
    # Test direct API
    print("Testing direct API calls...")
    await test_direct_api()

    # Test client SDK
    print("\nTesting client SDK...")
    await test_with_client()

if __name__ == "__main__":
    asyncio.run(main())

Server Configuration

The server's behavior can be customized through the config.yml file. Let's explore how to configure your Crawl4AI server for optimal performance and security.

Understanding config.yml

The configuration file is located at deploy/docker/config.yml. You can either modify this file before building the image or mount a custom configuration when running the container.

Here's a detailed breakdown of the configuration options:

# Application Configuration
app:
  title: "Crawl4AI API"           # Server title in OpenAPI docs
  version: "1.0.0"               # API version
  host: "0.0.0.0"               # Listen on all interfaces
  port: 8000                    # Server port
  reload: True                  # Enable hot reloading (development only)
  timeout_keep_alive: 300       # Keep-alive timeout in seconds

# Rate Limiting Configuration
rate_limiting:
  enabled: True                 # Enable/disable rate limiting
  default_limit: "100/minute"   # Rate limit format: "number/timeunit"
  trusted_proxies: []          # List of trusted proxy IPs
  storage_uri: "memory://"     # Use "redis://localhost:6379" for production

# Security Configuration
security:
  enabled: false               # Master toggle for security features
  https_redirect: True         # Force HTTPS
  trusted_hosts: ["*"]        # Allowed hosts (use specific domains in production)
  headers:                     # Security headers
    x_content_type_options: "nosniff"
    x_frame_options: "DENY"
    content_security_policy: "default-src 'self'"
    strict_transport_security: "max-age=63072000; includeSubDomains"

# Crawler Configuration
crawler:
  memory_threshold_percent: 95.0  # Memory usage threshold
  rate_limiter:
    base_delay: [1.0, 2.0]      # Min and max delay between requests
  timeouts:
    stream_init: 30.0           # Stream initialization timeout
    batch_process: 300.0        # Batch processing timeout

# Logging Configuration
logging:
  level: "INFO"                 # Log level (DEBUG, INFO, WARNING, ERROR)
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

# Observability Configuration
observability:
  prometheus:
    enabled: True              # Enable Prometheus metrics
    endpoint: "/metrics"       # Metrics endpoint
  health_check:
    endpoint: "/health"        # Health check endpoint

Configuration Tips and Best Practices

  1. Production Settings 🏭

    app:
      reload: False              # Disable reload in production
      timeout_keep_alive: 120    # Lower timeout for better resource management
    
    rate_limiting:
      storage_uri: "redis://redis:6379"  # Use Redis for distributed rate limiting
      default_limit: "50/minute"         # More conservative rate limit
    
    security:
      enabled: true                      # Enable all security features
      trusted_hosts: ["your-domain.com"] # Restrict to your domain
    
  2. Development Settings 🛠️

    app:
      reload: True               # Enable hot reloading
      timeout_keep_alive: 300    # Longer timeout for debugging
    
    logging:
      level: "DEBUG"            # More verbose logging
    
  3. High-Traffic Settings 🚦

    crawler:
      memory_threshold_percent: 85.0  # More conservative memory limit
      rate_limiter:
        base_delay: [2.0, 4.0]       # More aggressive rate limiting
    

Customizing Your Configuration

Method 1: Pre-build Configuration

# Copy and modify config before building
cp deploy/docker/config.yml custom-config.yml
vim custom-config.yml

# Build with custom config
docker build -t crawl4ai-server:prod \
  --build-arg CONFIG_PATH=custom-config.yml .

Method 2: Runtime Configuration

# Mount custom config at runtime
docker run -d -p 8000:8000 \
  -v $(pwd)/custom-config.yml:/app/config.yml \
  crawl4ai-server:prod
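
Before mounting a modified file, it is worth confirming that it still parses as valid YAML and that the fields you care about survived the edit. A small sketch (assumes PyYAML is installed and the file is named custom-config.yml, as above):

import yaml  # pip install pyyaml

with open("custom-config.yml") as f:
    cfg = yaml.safe_load(f)

# Spot-check the settings you most often tune
print("port:", cfg["app"]["port"])
print("rate limit:", cfg["rate_limiting"]["default_limit"])
print("security enabled:", cfg["security"]["enabled"])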

Configuration Recommendations

  1. Security First 🔒

    • Always enable security in production
    • Use specific trusted_hosts instead of wildcards
    • Set up proper rate limiting to protect your server
    • Consider your environment before enabling HTTPS redirect
  2. Resource Management 💻

    • Adjust memory_threshold_percent based on available RAM
    • Set timeouts according to your content size and network conditions
    • Use Redis for rate limiting in multi-container setups
  3. Monitoring 📊

    • Enable Prometheus if you need metrics
    • Set DEBUG logging in development, INFO in production
    • Regular health check monitoring is crucial
  4. Performance Tuning

    • Start with conservative rate limiter delays
    • Increase batch_process timeout for large content
    • Adjust stream_init timeout based on initial response times

Configuration Migration

When upgrading Crawl4AI, follow these steps:

  1. Back up your current config:

    cp /app/config.yml /app/config.yml.backup
    
  2. Use version control:

    git add config.yml
    git commit -m "Save current server configuration"
    
  3. Test in staging first:

    # Use a different port for the staging container
    docker run -d -p 8001:8000 \
      -v $(pwd)/new-config.yml:/app/config.yml \
      crawl4ai-server:prod
    

Common Configuration Scenarios

  1. Basic Development Setup

    security:
      enabled: false
    logging:
      level: "DEBUG"
    
  2. Production API Server

    security:
      enabled: true
      trusted_hosts: ["api.yourdomain.com"]
    rate_limiting:
      enabled: true
      default_limit: "50/minute"
    
  3. High-Performance Crawler

    crawler:
      memory_threshold_percent: 90.0
      timeouts:
        batch_process: 600.0
    

Getting Help

We're here to help you succeed with Crawl4AI! Here's how to get support:

Summary

In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:

  • Building and running the Docker container
  • Configuring the environment
  • Making API requests with proper typing
  • Using the Python SDK
  • Monitoring your deployment

Remember, the examples in the examples folder are your friends - they show real-world usage patterns that you can adapt for your needs.

Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀

Happy crawling! 🕷️