Crawl4AI Docker Architecture - AI Context Map
Purpose: Dense technical reference for AI agents to understand complete system architecture. Format: Symbolic, compressed, high-information-density documentation.
System Overview
┌─────────────────────────────────────────────────────────────┐
│ CRAWL4AI DOCKER ORCHESTRATION SYSTEM │
├─────────────────────────────────────────────────────────────┤
│ Modes: Single (N=1) | Swarm (N>1) | Compose+Nginx (N>1) │
│ Entry: cnode CLI → deploy/docker/cnode_cli.py │
│ Core: deploy/docker/server_manager.py │
│ Server: deploy/docker/server.py (FastAPI) │
│ API: deploy/docker/api.py (crawl endpoints) │
│ Monitor: deploy/docker/monitor.py + monitor_routes.py │
└─────────────────────────────────────────────────────────────┘
Directory Structure & File Map
deploy/
├── docker/ # Server runtime & orchestration
│ ├── server.py # FastAPI app entry [CRITICAL]
│ ├── api.py # /crawl, /screenshot, /pdf endpoints
│ ├── server_manager.py # Docker orchestration logic [CORE]
│ ├── cnode_cli.py # CLI interface (Click-based)
│ ├── monitor.py # Real-time metrics collector
│ ├── monitor_routes.py # /monitor dashboard routes
│ ├── crawler_pool.py # Browser pool management
│ ├── hook_manager.py # Pre/post crawl hooks
│ ├── job.py # Job queue schema
│ ├── utils.py # Helpers (port check, health)
│ ├── auth.py # API key authentication
│ ├── schemas.py # Pydantic models
│ ├── mcp_bridge.py # MCP protocol bridge
│ ├── supervisord.conf # Process manager config
│ ├── config.yml # Server config template
│ ├── requirements.txt # Python deps
│ ├── static/ # Web assets
│ │ ├── monitor/ # Dashboard UI
│ │ └── playground/ # API playground
│ └── tests/ # Test suite
│
└── installer/ # User-facing installation
├── cnode_pkg/ # Standalone package
│ ├── cli.py # Copy of cnode_cli.py
│ ├── server_manager.py # Copy of server_manager.py
│ └── requirements.txt # click, rich, anyio, pyyaml
├── install-cnode.sh # Remote installer (git sparse-checkout)
├── sync-cnode.sh # Dev tool (source→pkg sync)
├── USER_GUIDE.md # Human-readable guide
├── README.md # Developer documentation
└── QUICKSTART.md # Cheat sheet
Core Components Deep Dive
1. server_manager.py - Orchestration Engine
Role: Manages Docker container lifecycle, auto-detects deployment mode.
Key Classes:
ServerManager - Main orchestrator
- start(replicas, mode, port, env_file, image) → Deploy server
- stop(remove_volumes) → Teardown
- status() → Health check
- scale(replicas) → Live scaling
- logs(follow, tail) → Stream logs
- cleanup(force) → Emergency cleanup
State Management:
- File: ~/.crawl4ai/server_state.yml
- Schema: {mode, replicas, port, image, started_at, containers[]}
- Atomic writes with lock file
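The atomic-write pattern above can be sketched as follows. This is a minimal sketch, not the real server_manager.py code: only the file path and schema come from this document, and `save_state`/`load_state` are illustrative names.

```python
# Atomic state write: dump to a temp file in the same directory,
# then os.replace() it over the target (atomic on POSIX).
import os
import tempfile

import yaml  # pyyaml, already a cnode dependency

STATE_PATH = os.path.expanduser("~/.crawl4ai/server_state.yml")

def save_state(state, path=STATE_PATH):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            yaml.safe_dump(state, f)
        os.replace(tmp, path)  # readers never see a half-written file
    except BaseException:
        os.unlink(tmp)
        raise

def load_state(path=STATE_PATH):
    with open(path) as f:
        return yaml.safe_load(f)
```

Because the temp file lives in the same directory, `os.replace` is a same-filesystem rename, so a crash mid-write leaves the previous state file intact.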
Deployment Modes:
if replicas == 1:
    mode = "single"   # docker run
elif swarm_available():
    mode = "swarm"    # docker stack deploy
else:
    mode = "compose"  # docker-compose + nginx
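A hedged sketch of that selection logic. The real check lives in server_manager.py and may differ in detail; here `swarm_available()` shells out to `docker info` and treats any failure as "no swarm".

```python
# Pick the deployment mode from replica count + local Swarm state.
import json
import shutil
import subprocess

def swarm_available():
    """True if the local Docker daemon is an active Swarm node."""
    if shutil.which("docker") is None:
        return False
    try:
        out = subprocess.run(
            ["docker", "info", "--format", "{{json .Swarm.LocalNodeState}}"],
            capture_output=True, text=True, timeout=10, check=True,
        ).stdout
        return json.loads(out) == "active"
    except (subprocess.SubprocessError, json.JSONDecodeError):
        return False

def pick_mode(replicas):
    if replicas == 1:
        return "single"  # docker run
    return "swarm" if swarm_available() else "compose"  # stack deploy vs compose+nginx
```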
Container Naming:
- Single: crawl4ai-server
- Swarm: crawl4ai-stack_crawl4ai
- Compose: crawl4ai-server-{1..N}, crawl4ai-nginx
Networks:
crawl4ai-network (bridge mode for all)
Volumes:
- crawl4ai-redis-data - Persistent queue
- crawl4ai-profiles - Browser profiles
Health Checks:
- Endpoint: http://localhost:{port}/health
- Timeout: 30s startup
- Retry: 3 attempts
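The startup health gate above might look like this. Illustrative only: the endpoint path, 30 s startup timeout, and 3 attempts come from the list above; `wait_healthy` is an assumed helper name.

```python
# Poll /health until it answers 200, bounded by attempts and a deadline.
import time
import urllib.error
import urllib.request

def wait_healthy(port, attempts=3, timeout=30.0):
    url = f"http://localhost:{port}/health"
    deadline = time.monotonic() + timeout
    for _ in range(attempts):
        if time.monotonic() > deadline:
            break
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(1)  # brief backoff before retrying
    return False
```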
2. server.py - FastAPI Application
Role: HTTP server exposing crawl API + monitoring.
Startup Flow:
app = FastAPI()

@app.on_event("startup")
async def startup():
    init_crawler_pool()        # Pre-warm browsers
    init_redis_connection()    # Job queue
    start_monitor_collector()  # Metrics
Key Endpoints:
POST /crawl → api.py:crawl_endpoint()
POST /crawl/stream → api.py:crawl_stream_endpoint()
POST /screenshot → api.py:screenshot_endpoint()
POST /pdf → api.py:pdf_endpoint()
GET /health → server.py:health_check()
GET /monitor → monitor_routes.py:dashboard()
WS /monitor/ws → monitor_routes.py:websocket_endpoint()
GET /playground → static/playground/index.html
Process Manager:
- Uses supervisord to manage:
  - FastAPI server (port 11235)
  - Redis (port 6379)
  - Background workers
Environment:
CRAWL4AI_PORT=11235
REDIS_URL=redis://localhost:6379
MAX_CONCURRENT_CRAWLS=5
BROWSER_POOL_SIZE=3
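One way those variables could be resolved with the documented defaults. A sketch under assumptions: the real server also reads config.yml, and `load_env_config` is an illustrative name, not the actual API.

```python
# Resolve runtime settings from the environment, falling back to the
# documented defaults.
import os

def load_env_config(env=None):
    env = os.environ if env is None else env
    return {
        "port": int(env.get("CRAWL4AI_PORT", 11235)),
        "redis_url": env.get("REDIS_URL", "redis://localhost:6379"),
        "max_concurrent_crawls": int(env.get("MAX_CONCURRENT_CRAWLS", 5)),
        "browser_pool_size": int(env.get("BROWSER_POOL_SIZE", 3)),
    }
```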
3. api.py - Crawl Endpoints
Main Endpoint: POST /crawl
Request Schema:
{
  "urls": ["https://example.com"],
  "priority": 10,
  "browser_config": {
    "type": "BrowserConfig",
    "params": {"headless": true, "viewport_width": 1920}
  },
  "crawler_config": {
    "type": "CrawlerRunConfig",
    "params": {"cache_mode": "bypass", "extraction_strategy": {...}}
  }
}
Processing Flow:
1. Validate request (Pydantic)
2. Queue job → Redis
3. Get browser from pool → crawler_pool.py
4. Execute crawl → AsyncWebCrawler
5. Apply hooks → hook_manager.py
6. Return result (JSON)
7. Release browser to pool
Memory Management:
- Browser pool: Max 3 instances
- LRU eviction when pool full
- Explicit cleanup: browser.close() in finally block
- Redis TTL: 1 hour for completed jobs
Error Handling:
try:
    result = await crawler.arun(url, config)
except PlaywrightError as e:
    # Browser crash - release & recreate
    await pool.invalidate(browser_id)
except TimeoutError as e:
    # Timeout - kill & retry
    await crawler.kill()
except Exception as e:
    # Unknown - log & fail gracefully
    logger.error(f"Crawl failed: {e}")
4. crawler_pool.py - Browser Pool Manager
Role: Manage persistent browser instances to avoid startup overhead.
Class: CrawlerPool
- get_crawler() → Lease browser (async context manager)
- release_crawler(id) → Return to pool
- warm_up(count) → Pre-launch browsers
- cleanup() → Close all browsers
Pool Strategy:
pool = {
    "browser_1": {"crawler": AsyncWebCrawler(), "in_use": False},
    "browser_2": {"crawler": AsyncWebCrawler(), "in_use": False},
    "browser_3": {"crawler": AsyncWebCrawler(), "in_use": False},
}

async with pool.get_crawler() as crawler:
    result = await crawler.arun(url)
# Auto-released on context exit
Anti-Leak Mechanisms:
- Context managers enforce cleanup
- Watchdog thread kills stale browsers (>10min idle)
- Max lifetime: 1 hour per browser
- Force GC after browser close
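The watchdog rule above, sketched against a pool dict like the one shown earlier. Stand-in types and names: only the >10 min idle and 1 h lifetime limits come from this document. Separating the sweep from the loop keeps the logic testable.

```python
# Periodically close browsers that are idle too long or too old.
import asyncio
import time

IDLE_LIMIT = 10 * 60      # >10 min idle → kill
LIFETIME_LIMIT = 60 * 60  # 1 h max lifetime per browser

async def sweep(pool, now=None):
    """Close and evict idle/expired browsers; returns evicted ids."""
    now = time.monotonic() if now is None else now
    evicted = []
    for browser_id, entry in list(pool.items()):
        idle = now - entry["last_used"]
        age = now - entry["created_at"]
        if not entry["in_use"] and (idle > IDLE_LIMIT or age > LIFETIME_LIMIT):
            await entry["crawler"].close()  # explicit cleanup
            del pool[browser_id]
            evicted.append(browser_id)
    return evicted

async def watchdog(pool, interval=30.0):
    while True:
        await sweep(pool)
        await asyncio.sleep(interval)
```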
5. monitor.py + monitor_routes.py - Real-time Dashboard
Architecture:
[Browser] <--WebSocket--> [monitor_routes.py] <--Events--> [monitor.py]
↓
[Redis Pub/Sub]
↓
[Metrics Collector]
Metrics Collected:
- Requests/sec (sliding window)
- Active crawls (real-time count)
- Response times (p50, p95, p99)
- Error rate (5min rolling)
- Memory usage (RSS, heap)
- Browser pool utilization
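A minimal sketch of how the sliding-window numbers above could be computed. The class and its method names are illustrative; monitor.py is the real collector.

```python
# Track request timestamps and latencies over a fixed window (seconds),
# then derive requests/sec and nearest-rank latency percentiles.
import time
from collections import deque

class SlidingWindow:
    def __init__(self, window=60.0):
        self.window = window
        self.samples = deque()  # (timestamp, latency_ms)

    def record(self, latency_ms, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, latency_ms))
        self._trim(now)

    def _trim(self, now):
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()

    def rps(self, now=None):
        now = time.monotonic() if now is None else now
        self._trim(now)
        return len(self.samples) / self.window

    def percentile(self, p):
        """Nearest-rank percentile over latencies in the window."""
        if not self.samples:
            return 0.0
        lats = sorted(lat for _, lat in self.samples)
        idx = min(len(lats) - 1, round(p / 100 * (len(lats) - 1)))
        return lats[idx]
```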
WebSocket Protocol:
// Server → Client
{
  "type": "metrics",
  "data": {
    "rps": 45.3,
    "active_crawls": 12,
    "p95_latency": 1234,
    "error_rate": 0.02
  }
}

// Client → Server
{
  "type": "subscribe",
  "channels": ["metrics", "logs"]
}
Dashboard Route: /monitor
- Real-time graphs (Chart.js)
- Request log stream
- Container health status
- Resource utilization
6. cnode_cli.py - CLI Interface
Framework: Click (Python CLI framework)
Command Structure:
cnode
├── start [--replicas N] [--port P] [--mode M] [--image I]
├── stop [--remove-volumes]
├── status
├── scale N
├── logs [--follow] [--tail N]
├── restart [--replicas N]
└── cleanup [--force]
Execution Flow:
@cli.command("start")
def start_cmd(replicas, mode, port, env_file, image):
    manager = ServerManager()
    # Async bridge: anyio.run takes a callable, not a coroutine object
    result = anyio.run(lambda: manager.start(...))
    if result["success"]:
        console.print(success_panel)
User Feedback:
- Rich library for colors/tables
- Progress spinners during operations
- Error messages with hints
- Status tables with health indicators
State Persistence:
- Saves deployment config to ~/.crawl4ai/server_state.yml
- Enables stateless commands (status, scale, restart)
7. Docker Orchestration Details
Single Container Mode (N=1):
docker run -d \
--name crawl4ai-server \
--network crawl4ai-network \
-p 11235:11235 \
-v crawl4ai-redis-data:/data \
unclecode/crawl4ai:latest
Docker Swarm Mode (N>1, Swarm available):
# docker-compose.swarm.yml
version: '3.8'
services:
  crawl4ai:
    image: unclecode/crawl4ai:latest
    deploy:
      replicas: 5
      update_config:
        parallelism: 2
        delay: 10s
      restart_policy:
        condition: on-failure
    ports:
      - "11235:11235"
    networks:
      - crawl4ai-network
Deploy: docker stack deploy -c docker-compose.swarm.yml crawl4ai-stack
Docker Compose + Nginx Mode (N>1, fallback):
# docker-compose.yml
services:
  crawl4ai-1:
    image: unclecode/crawl4ai:latest
    networks: [crawl4ai-network]
  crawl4ai-2:
    image: unclecode/crawl4ai:latest
    networks: [crawl4ai-network]
  nginx:
    image: nginx:alpine
    ports: ["11235:80"]
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    networks: [crawl4ai-network]
Nginx config (round-robin load balancing):
upstream crawl4ai_backend {
    server crawl4ai-1:11235;
    server crawl4ai-2:11235;
    server crawl4ai-3:11235;
}

server {
    listen 80;
    location / {
        proxy_pass http://crawl4ai_backend;
        proxy_set_header Host $host;
    }
}
Memory Leak Prevention Strategy
Problem Areas & Solutions
1. Browser Instances
# ❌ BAD - Leak risk
crawler = AsyncWebCrawler()
result = await crawler.arun(url)
# Browser never closed!

# ✅ GOOD - Guaranteed cleanup
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url)
# Auto-closed on exit
2. WebSocket Connections
# monitor_routes.py
active_connections = set()

@app.websocket("/monitor/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    active_connections.add(websocket)
    try:
        while True:
            await websocket.send_json(get_metrics())
            await asyncio.sleep(1)  # throttle the metrics stream
    finally:
        active_connections.discard(websocket)  # Critical! (discard never raises)
3. Redis Connections
# Use connection pooling (redis-py asyncio; aioredis was merged into redis-py)
import redis.asyncio as redis

redis_pool = redis.ConnectionPool(
    host="localhost",
    port=6379,
    max_connections=10,
    decode_responses=True,
)

# Reuse connections via a client bound to the pool
async def get_job(job_id):
    # The connection is returned to the pool after each command
    client = redis.Redis(connection_pool=redis_pool)
    return await client.get(f"job:{job_id}")
4. Async Task Cleanup
# Track background tasks so shutdown can cancel them;
# add_done_callback removes each task from the set when it finishes
background_tasks = set()

def spawn_crawl(url):
    task = asyncio.create_task(crawl_task(url))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return task

async def crawl_task(url):
    return await crawl(url)

# On shutdown
async def shutdown():
    tasks = list(background_tasks)
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
5. File Descriptor Leaks
# Use context managers for files
async def save_screenshot(job_id, screenshot_bytes):
    async with aiofiles.open(f"{job_id}.png", "wb") as f:
        await f.write(screenshot_bytes)
    # File auto-closed on context exit
Installation & Distribution
User Installation Flow
Script: deploy/installer/install-cnode.sh
Steps:
1. Check Python 3.8+ exists
2. Check pip available
3. Check Docker installed (warn if missing)
4. Create temp dir: mktemp -d
5. Git sparse-checkout:
   git init
   git remote add origin https://github.com/unclecode/crawl4ai.git
   git config core.sparseCheckout true
   echo "deploy/installer/cnode_pkg/*" > .git/info/sparse-checkout
   git pull --depth=1 origin main
6. Install deps: pip install click rich anyio pyyaml
7. Copy package: cnode_pkg/ → /usr/local/lib/cnode/
8. Create wrapper /usr/local/bin/cnode:
   #!/usr/bin/env bash
   export PYTHONPATH="/usr/local/lib/cnode:$PYTHONPATH"
   exec python3 -m cnode_pkg.cli "$@"
9. Cleanup temp dir
Result:
- Binary-like experience (fast startup: ~0.1s)
- No need for PyInstaller (49x faster)
- Platform-independent (any OS with Python)
Development Workflow
Source Code Sync (Auto)
Git Hook: .githooks/pre-commit
Trigger: When committing deploy/docker/cnode_cli.py or server_manager.py
Action:
1. Diff source vs package
2. If different:
- Run sync-cnode.sh
- Copy cnode_cli.py → cnode_pkg/cli.py
- Fix imports: s/deploy.docker/cnode_pkg/g
- Copy server_manager.py → cnode_pkg/
- Stage synced files
3. Continue commit
Setup: ./setup-hooks.sh (configures git config core.hooksPath .githooks)
Smart Behavior:
- Silent when no sync needed
- Only syncs if content differs
- Minimal output:
✓ cnode synced
API Request/Response Flow
Example: POST /crawl
Request:
curl -X POST http://localhost:11235/crawl \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"browser_config": {
"type": "BrowserConfig",
"params": {"headless": true}
},
"crawler_config": {
"type": "CrawlerRunConfig",
"params": {"cache_mode": "bypass"}
}
}'
Processing:
1. FastAPI receives request → api.py:crawl_endpoint()
2. Validate schema → Pydantic models in schemas.py
3. Create job → job.py:Job(id=uuid4(), urls=[...])
4. Queue to Redis → LPUSH crawl_queue {job_json}
5. Get browser from pool → crawler_pool.py:get_crawler()
6. Execute crawl:
a. Launch page → browser.new_page()
b. Navigate → page.goto(url)
c. Extract → extraction_strategy.extract()
d. Generate markdown → markdown_generator.generate()
7. Store result → Redis SETEX result:{job_id} 3600 {result_json}
8. Release browser → pool.release(browser_id)
9. Return response:
{
  "success": true,
  "result": {
    "url": "https://example.com",
    "markdown": "# Example Domain...",
    "metadata": {"title": "Example Domain"},
    "extracted_content": {...}
  }
}
Error Cases:
- 400: Invalid request schema
- 429: Rate limit exceeded
- 500: Internal error (browser crash, timeout)
- 503: Service unavailable (all browsers busy)
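How those cases might map onto responses. The exception classes here are hypothetical stand-ins; the real handlers live in api.py.

```python
# Map crawl failures to the documented HTTP status codes.
class RateLimitExceeded(Exception):
    """Hypothetical: raised when a client exceeds its quota."""

class PoolExhausted(Exception):
    """Hypothetical: raised when every pooled browser is in use."""

def status_for(exc):
    if isinstance(exc, ValueError):         # invalid request schema
        return 400
    if isinstance(exc, RateLimitExceeded):  # rate limit exceeded
        return 429
    if isinstance(exc, PoolExhausted):      # all browsers busy
        return 503
    return 500                              # browser crash, timeout, unknown
```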
Scaling Behavior
Scale-Up (1 → 10 replicas)
Command: cnode scale 10
Swarm Mode:
docker service scale crawl4ai-stack_crawl4ai=10
# Docker handles:
# - Container creation
# - Network attachment
# - Load balancer update
# - Rolling deployment
Compose Mode:
# Update docker-compose.yml
# Change replica count in all service definitions
docker-compose up -d --scale crawl4ai=10
# Regenerate nginx.conf with 10 upstreams
docker exec nginx nginx -s reload
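The "regenerate nginx.conf with N upstreams" step amounts to templating the config shown earlier. A sketch: `render_nginx_conf` is an illustrative helper, not the actual code.

```python
# Emit an nginx.conf with one upstream entry per replica, mirroring
# the round-robin config shown in the orchestration section.
def render_nginx_conf(replicas, port=11235):
    upstreams = "\n".join(
        f"    server crawl4ai-{i}:{port};" for i in range(1, replicas + 1)
    )
    return (
        "upstream crawl4ai_backend {\n"
        + upstreams + "\n"
        "}\n"
        "server {\n"
        "    listen 80;\n"
        "    location / {\n"
        "        proxy_pass http://crawl4ai_backend;\n"
        "        proxy_set_header Host $host;\n"
        "    }\n"
        "}\n"
    )
```

Write the result over the mounted ./nginx.conf, then `nginx -s reload` picks it up gracefully.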
Load Distribution:
- Swarm: Built-in ingress network (VIP-based round-robin)
- Compose: Nginx upstream (round-robin, can configure least_conn)
Zero-Downtime:
- Swarm: Yes (rolling update, parallelism=2)
- Compose: Partial (nginx reload is graceful, but brief spike)
Configuration Files
config.yml - Server Configuration
server:
  port: 11235
  host: "0.0.0.0"
  workers: 4

crawler:
  max_concurrent: 5
  timeout: 30
  retries: 3

browser:
  pool_size: 3
  headless: true
  args:
    - "--no-sandbox"
    - "--disable-dev-shm-usage"

redis:
  host: "localhost"
  port: 6379
  db: 0

monitoring:
  enabled: true
  metrics_interval: 5  # seconds
supervisord.conf - Process Management
[supervisord]
nodaemon=true
[program:redis]
command=redis-server --port 6379
autorestart=true
[program:fastapi]
command=uvicorn server:app --host 0.0.0.0 --port 11235
autorestart=true
stdout_logfile=/var/log/crawl4ai/api.log
[program:monitor]
command=python monitor.py
autorestart=true
Testing & Quality
Test Structure
deploy/docker/tests/
├── cli/ # CLI command tests
│ └── test_commands.py # start, stop, scale, status
├── monitor/ # Dashboard tests
│ └── test_websocket.py # WS connection, metrics
└── codebase_test/ # Integration tests
└── test_api.py # End-to-end crawl tests
Key Test Cases
CLI Tests:
- test_start_single() - Starts 1 replica
- test_start_cluster() - Starts N replicas
- test_scale_up() - Scales 1→5
- test_scale_down() - Scales 5→2
- test_status() - Reports correct state
- test_logs() - Streams logs
API Tests:
- test_crawl_success() - Basic crawl works
- test_crawl_timeout() - Handles slow sites
- test_concurrent_crawls() - Parallel requests
- test_browser_pool() - Reuses browsers
- test_memory_cleanup() - No leaks after 100 crawls
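What test_browser_pool() might assert, in miniature: two successive leases hand back the same instance instead of launching a new browser. SimplePool is a stand-in, not the real crawler_pool.py API.

```python
# One-slot pool: acquire() blocks until the single browser is free,
# so repeated lease/release cycles always reuse the same object.
import asyncio

class SimplePool:
    def __init__(self, crawler):
        self._free = asyncio.Queue()
        self._free.put_nowait(crawler)

    async def acquire(self):
        return await self._free.get()

    async def release(self, crawler):
        await self._free.put(crawler)

async def reuses_same_instance(pool):
    first = await pool.acquire()
    await pool.release(first)
    second = await pool.acquire()
    await pool.release(second)
    return first is second  # pooled browsers are reused, not recreated
```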
Monitor Tests:
- test_websocket_connect() - WS handshake
- test_metrics_stream() - Receives updates
- test_multiple_clients() - Handles N connections
Critical File Cross-Reference
| Component | Primary File | Dependencies |
|---|---|---|
| CLI Entry | cnode_cli.py:482 | server_manager.py, click, rich |
| Orchestrator | server_manager.py:45 | docker, yaml, anyio |
| API Server | server.py:120 | api.py, monitor_routes.py |
| Crawl Logic | api.py:78 | crawler_pool.py, AsyncWebCrawler |
| Browser Pool | crawler_pool.py:23 | AsyncWebCrawler, asyncio |
| Monitoring | monitor.py:156 | redis, psutil |
| Dashboard | monitor_routes.py:89 | monitor.py, websockets |
| Hooks | hook_manager.py:12 | api.py, custom user hooks |
Startup Chain:
cnode start
└→ cnode_cli.py:start_cmd()
   └→ server_manager.py:start()
      └→ docker run/stack/compose
         └→ supervisord
            ├→ redis-server
            ├→ server.py
            │  └→ api.py (routes)
            │     └→ crawler_pool.py (init)
            └→ monitor.py (collector)
Symbolic Notation Summary
⊕ Addition/Creation ⊖ Removal/Cleanup
⊗ Multiplication/Scale ⊘ Division/Split
→ Flow/Dependency ← Reverse flow
⇄ Bidirectional ⇵ Up/Down scale
✓ Success/Complete ✗ Failure/Error
⚠ Warning ⚡ Performance critical
🔒 Lock/Exclusive 🔓 Unlock/Shared
📊 Metrics 📝 Logs
🌐 Network 💾 Storage
🧠 Memory 🐳 Docker
State Machine:
[STOPPED] →start→ [STARTING] →ready→ [RUNNING]
[RUNNING] →scale→ [SCALING] →done→ [RUNNING]
[RUNNING] →stop→ [STOPPING] →done→ [STOPPED]
[ANY] →error→ [ERROR] →cleanup→ [STOPPED]
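The transition table above as a guard function. A sketch only: the real manager enforces these states implicitly through Docker, and the table/function names are illustrative.

```python
# Legal transitions from the state machine; "error" is allowed from any state.
TRANSITIONS = {
    ("STOPPED", "start"): "STARTING",
    ("STARTING", "ready"): "RUNNING",
    ("RUNNING", "scale"): "SCALING",
    ("SCALING", "done"): "RUNNING",
    ("RUNNING", "stop"): "STOPPING",
    ("STOPPING", "done"): "STOPPED",
    ("ERROR", "cleanup"): "STOPPED",
}

def next_state(state, event):
    if event == "error":  # [ANY] → ERROR
        return "ERROR"
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} --{event}-->")
```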
Quick Reference Commands
# Install
curl -sSL https://raw.githubusercontent.com/unclecode/crawl4ai/main/deploy/installer/install-cnode.sh | bash
# Deploy
cnode start --replicas 5 --port 11235
# Monitor
open http://localhost:11235/monitor
# Scale
cnode scale 10
# Test
curl -X POST http://localhost:11235/crawl -H "Content-Type: application/json" -d '{"urls":["https://example.com"]}'
# Logs
cnode logs -f
# Teardown
cnode stop --remove-volumes
Future AI Agent Instructions
When debugging: Read this file first, then check specific source files mentioned.
When extending: Follow existing patterns (context managers, async/await, type hints).
When optimizing: Profile first (monitor.py metrics), then optimize hot paths (crawler_pool.py, api.py).
When troubleshooting memory: Check browser pool cleanup, WebSocket connection tracking, Redis connection pooling.
When scaling issues: Verify Docker mode (swarm vs compose), check nginx config if compose, review load balancer logs.
END OF ARCHITECTURE MAP Version: 1.0.0 | Last Updated: 2025-10-21 | Token-Optimized for AI Consumption