Add comprehensive Docker cluster orchestration with horizontal scaling support.

CLI Commands:
- `crwl server start/stop/restart/status/scale/logs`
- Auto-detection: Single (N=1) → Swarm (N>1) → Compose (N>1 fallback)
- Support for 1-100 container replicas with zero-downtime scaling

Infrastructure:
- Nginx load balancing (round-robin API, sticky sessions for monitoring)
- Redis-based container discovery via heartbeats (30s interval)
- Real-time monitoring dashboard with cluster-wide visibility
- WebSocket aggregation from all containers

Security & Stability Fixes (12 critical issues):
- Add timeout protection to browser pool locks (prevent deadlocks)
- Implement Redis retry logic with exponential backoff
- Add container ID validation (prevent Redis key injection)
- Add CLI input sanitization (prevent shell injection)
- Add file locking for state management (prevent corruption)
- Fix WebSocket resource leaks and connection cleanup
- Add graceful degradation and circuit breakers

Configuration:
- RedisTTLConfig dataclass with environment variable support
- Template-based docker-compose.yml and nginx.conf generation
- Comprehensive error handling with actionable messages

Documentation:
- AGENT.md: Complete DevOps context for AI assistants
- MULTI_CONTAINER_ARCHITECTURE.md: Technical architecture guide
- Reorganized docs into deploy/docker/docs/
# Crawl4AI DevOps Agent Context

## Service Overview
Crawl4AI: Browser-based web crawling service with AI extraction. Docker deployment with horizontal scaling (1-N containers), Redis coordination, Nginx load balancing.
## Architecture Quick Reference
```
Client → Nginx:11235 → [crawl4ai-1, crawl4ai-2, ...crawl4ai-N] ← Redis
              ↓
      Monitor Dashboard
```
**Components:**
- Nginx: Load balancer (round-robin API, sticky monitoring)
- Crawl4AI containers: FastAPI + Playwright browsers
- Redis: Container discovery (heartbeats 30s), monitoring data aggregation
- Monitor: Real-time dashboard at `/dashboard`
## CLI Commands

### Start/Stop

```bash
crwl server start [-r N] [--port P] [--mode auto|single|swarm|compose] [--env-file F] [--image I]
crwl server stop [--remove-volumes]
crwl server restart [-r N]
```

### Management

```bash
crwl server status       # Show mode, replicas, port, uptime
crwl server scale N      # Live scaling (Swarm/Compose only)
crwl server logs [-f] [--tail N]
```
**Defaults:** `replicas=1`, `port=11235`, `mode=auto`, `image=unclecode/crawl4ai:latest`
## Deployment Modes
| Replicas | Mode | Load Balancer | Use Case |
|---|---|---|---|
| N=1 | single | None | Dev/testing |
| N>1 | swarm | Built-in | Production (if docker swarm init done) |
| N>1 | compose | Nginx | Production (fallback) |
**Mode Detection (when `mode=auto`):**
- If N=1 → single
- If N>1 & Swarm active → swarm
- If N>1 & Swarm inactive → compose
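The detection rules above can be sketched in Python. This is illustrative, not the actual `ServerManager` code; `swarm_is_active` shells out to `docker info`, whose `--format '{{.Swarm.LocalNodeState}}'` template reports `active` on a Swarm manager/worker:

```python
import subprocess

def swarm_is_active() -> bool:
    """Ask the local Docker daemon whether Swarm mode is initialized."""
    out = subprocess.run(
        ["docker", "info", "--format", "{{.Swarm.LocalNodeState}}"],
        capture_output=True, text=True,
    )
    return out.stdout.strip() == "active"

def choose_mode(replicas: int, swarm_active: bool) -> str:
    """Mirror the documented auto-detection table."""
    if replicas == 1:
        return "single"
    return "swarm" if swarm_active else "compose"
```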
## File Locations

```
~/.crawl4ai/server/
├── state.json           # Current deployment state
├── docker-compose.yml   # Generated compose file
└── nginx.conf           # Generated nginx config

/app/                    # Inside container
├── deploy/docker/server.py
├── deploy/docker/monitor.py
├── deploy/docker/static/monitor/index.html
└── crawler_pool.py      # Browser pool (PERMANENT, HOT_POOL, COLD_POOL)
```
## Monitoring & Troubleshooting

### Health Checks

```bash
curl http://localhost:11235/health              # Service health
curl http://localhost:11235/monitor/containers  # Container discovery
curl http://localhost:11235/monitor/requests    # Aggregated requests
```
### Dashboard

- URL: `http://localhost:11235/dashboard/`
- Features: Container filtering (All/C-1/C-2/C-3), real-time WebSocket, timeline charts
- WebSocket: `/monitor/ws` (sticky sessions)
### Common Issues

**No containers showing in dashboard:**

```bash
docker exec <redis-container> redis-cli SMEMBERS monitor:active_containers
docker exec <redis-container> redis-cli KEYS "monitor:heartbeat:*"
```

Wait 30s for heartbeat registration.

**Load balancing not working:**

```bash
docker exec <nginx-container> cat /etc/nginx/nginx.conf | grep upstream
docker logs <nginx-container> | grep error
```

Check that the Nginx upstream has no `ip_hash` for API endpoints.

**Redis connection errors:**

```bash
docker logs <crawl4ai-container> | grep -i redis
docker exec <crawl4ai-container> ping redis
```

Verify `REDIS_HOST=redis`, `REDIS_PORT=6379`.

**Containers not scaling:**

```bash
# Swarm
docker service ls
docker service ps crawl4ai

# Compose
docker compose -f ~/.crawl4ai/server/docker-compose.yml ps
docker compose -f ~/.crawl4ai/server/docker-compose.yml up -d --scale crawl4ai=N
```
## Redis Data Structure

```
monitor:active_containers        # SET: {container_ids}
monitor:heartbeat:{cid}          # STRING: {id, hostname, last_seen}, TTL=60s
monitor:{cid}:active_requests    # STRING: JSON list, TTL=5min
monitor:{cid}:completed          # STRING: JSON list, TTL=1h
monitor:{cid}:janitor            # STRING: JSON list, TTL=1h
monitor:{cid}:errors             # STRING: JSON list, TTL=1h
monitor:endpoint_stats           # STRING: JSON aggregate, TTL=24h
```
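A minimal sketch of how a container could register itself under this schema (illustrative, not the exact `MonitorStats` code; `register_heartbeat` assumes a redis-py-style client):

```python
import json
import socket
import time

HEARTBEAT_TTL = 60       # seconds, matches TTL=60s above
HEARTBEAT_INTERVAL = 30  # seconds between heartbeat ticks

def heartbeat_key(cid: str) -> str:
    return f"monitor:heartbeat:{cid}"

def heartbeat_payload(cid: str) -> str:
    """JSON body matching the documented {id, hostname, last_seen} shape."""
    return json.dumps(
        {"id": cid, "hostname": socket.gethostname(), "last_seen": time.time()}
    )

def register_heartbeat(redis_client, cid: str) -> None:
    """One heartbeat tick: join the active set, refresh the TTL'd key."""
    redis_client.sadd("monitor:active_containers", cid)
    redis_client.set(heartbeat_key(cid), heartbeat_payload(cid), ex=HEARTBEAT_TTL)
```

Because the key carries a 60s TTL while ticks arrive every 30s, a container that dies simply expires off the dashboard within about a minute.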
## Environment Variables

### Required for Multi-LLM

```bash
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
DEEPSEEK_API_KEY=...
GROQ_API_KEY=...
TOGETHER_API_KEY=...
MISTRAL_API_KEY=...
GEMINI_API_TOKEN=...
```

### Redis Configuration (Optional)

```bash
REDIS_HOST=redis                    # Default: redis
REDIS_PORT=6379                     # Default: 6379
REDIS_TTL_ACTIVE_REQUESTS=300       # Default: 5min
REDIS_TTL_COMPLETED_REQUESTS=3600   # Default: 1h
REDIS_TTL_JANITOR_EVENTS=3600       # Default: 1h
REDIS_TTL_ERRORS=3600               # Default: 1h
REDIS_TTL_ENDPOINT_STATS=86400      # Default: 24h
REDIS_TTL_HEARTBEAT=60              # Default: 1min
```
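The `RedisTTLConfig` dataclass mentioned in the changelog presumably maps these variables onto defaults roughly like this (a sketch; the field names are assumptions, not the verified implementation):

```python
import os
from dataclasses import dataclass, field

def _env_int(name: str, default: int) -> int:
    """Read an integer TTL from the environment, falling back to the default."""
    return int(os.environ.get(name, default))

@dataclass
class RedisTTLConfig:
    active_requests: int = field(default_factory=lambda: _env_int("REDIS_TTL_ACTIVE_REQUESTS", 300))
    completed_requests: int = field(default_factory=lambda: _env_int("REDIS_TTL_COMPLETED_REQUESTS", 3600))
    janitor_events: int = field(default_factory=lambda: _env_int("REDIS_TTL_JANITOR_EVENTS", 3600))
    errors: int = field(default_factory=lambda: _env_int("REDIS_TTL_ERRORS", 3600))
    endpoint_stats: int = field(default_factory=lambda: _env_int("REDIS_TTL_ENDPOINT_STATS", 86400))
    heartbeat: int = field(default_factory=lambda: _env_int("REDIS_TTL_HEARTBEAT", 60))
```

Using `default_factory` (rather than plain defaults) means the environment is re-read each time the config is constructed, which keeps tests and restarts honest.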
## API Endpoints

### Core API

- `POST /crawl` - Crawl URL (load-balanced)
- `POST /batch` - Batch crawl (load-balanced)
- `GET /health` - Health check (load-balanced)

### Monitor API (Aggregated from all containers)

- `GET /monitor/health` - Local container health
- `GET /monitor/containers` - All active containers
- `GET /monitor/requests` - All requests (active + completed)
- `GET /monitor/browsers` - Browser pool status (local only)
- `GET /monitor/logs/janitor` - Janitor cleanup events
- `GET /monitor/logs/errors` - Error logs
- `GET /monitor/endpoints/stats` - Endpoint analytics
- `WS /monitor/ws` - Real-time updates (aggregated)

### Control Actions

- `POST /monitor/actions/cleanup` - Force browser cleanup
- `POST /monitor/actions/kill_browser` - Kill specific browser
- `POST /monitor/actions/restart_browser` - Restart browser
- `POST /monitor/stats/reset` - Reset endpoint counters
## Docker Commands Reference

### Inspection

```bash
# List containers
docker ps --filter "name=crawl4ai"

# Container logs
docker logs <container-id> -f --tail 100

# Redis CLI
docker exec -it <redis-container> redis-cli
KEYS monitor:*
SMEMBERS monitor:active_containers
GET monitor:<cid>:completed
TTL monitor:heartbeat:<cid>

# Nginx config
docker exec <nginx-container> cat /etc/nginx/nginx.conf

# Container stats
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
```

### Compose Operations

```bash
# Scale
docker compose -f ~/.crawl4ai/server/docker-compose.yml up -d --scale crawl4ai=5

# Restart service
docker compose -f ~/.crawl4ai/server/docker-compose.yml restart crawl4ai

# View services
docker compose -f ~/.crawl4ai/server/docker-compose.yml ps
```

### Swarm Operations

```bash
# Initialize Swarm
docker swarm init

# Scale service
docker service scale crawl4ai=5

# Service info
docker service ls
docker service ps crawl4ai --no-trunc

# Service logs
docker service logs crawl4ai --tail 100 -f
```
## Performance & Scaling

### Resource Recommendations
| Containers | Memory/Container | Total Memory | Use Case |
|---|---|---|---|
| 1 | 4GB | 4GB | Development |
| 3 | 4GB | 12GB | Small prod |
| 5 | 4GB | 20GB | Medium prod |
| 10 | 4GB | 40GB | Large prod |
**Expected Throughput:** ~10 req/min per container (depends on crawl complexity)

### Scaling Guidelines

- Horizontal: add replicas (`crwl server scale N`)
- Vertical: adjust `--memory 8G --cpus 4` in kwargs
- Browser Pool: Permanent (1) + Hot pool (adaptive) + Cold pool (cleaned up by janitor)
### Redis Memory Usage
- Per container: ~110KB (requests + events + errors + heartbeat)
- 10 containers: ~1.1MB
- Recommendation: 256MB Redis is sufficient for <100 containers
## Security Notes

### Input Validation

All CLI inputs are validated:

- Image name: alphanumeric + `.-/:_@` only, max 256 chars
- Port: 1-65535
- Replicas: 1-100
- Env file: must exist and be readable
- Container IDs: alphanumeric + `-_` only (prevents Redis injection)
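These rules translate to simple allow-list checks. A hedged sketch of the validation logic (function names are illustrative, not the actual CLI code):

```python
import re

_IMAGE_RE = re.compile(r"^[A-Za-z0-9.\-/:_@]+$")  # alphanumeric + .-/:_@
_CID_RE = re.compile(r"^[A-Za-z0-9_-]+$")         # alphanumeric + -_

def valid_image(name: str) -> bool:
    """Allow-list image names; rejects shell metacharacters outright."""
    return 0 < len(name) <= 256 and bool(_IMAGE_RE.match(name))

def valid_port(port: int) -> bool:
    return 1 <= port <= 65535

def valid_replicas(n: int) -> bool:
    return 1 <= n <= 100

def valid_container_id(cid: str) -> bool:
    """Reject anything that could smuggle separators into a Redis key."""
    return bool(_CID_RE.match(cid))
```

Allow-listing (match what is permitted) is deliberately stricter than deny-listing dangerous characters, which is how these checks prevent both shell and Redis key injection.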
### Network Security

- Nginx forwards to the internal `crawl4ai` service (Docker network)
- Monitor endpoints have NO authentication (add a `MONITOR_TOKEN` env var for security)
- Redis is internal-only (no external port)

### Recommended Production Setup

```bash
# Add authentication
export MONITOR_TOKEN="your-secret-token"
```

```yaml
# Use a Redis password (docker-compose.yml)
redis:
  command: redis-server --requirepass ${REDIS_PASSWORD}
```

```nginx
# Enable rate limiting in Nginx
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
```
## Common User Scenarios

### Scenario 1: Fresh Deployment

```bash
crwl server start --replicas 3 --env-file .env
# Wait for the health check, then access http://localhost:11235/health
```

### Scenario 2: Scaling Under Load

```bash
crwl server scale 10
# Live scaling, no downtime
```

### Scenario 3: Debugging Slow Requests

```bash
# Check dashboard
open http://localhost:11235/dashboard/

# Check container logs
docker logs <slowest-container-id> --tail 100

# Check browser pool
curl http://localhost:11235/monitor/browsers | jq
```

### Scenario 4: Redis Connection Issues

```bash
# Check Redis connectivity
docker exec <crawl4ai-container> nc -zv redis 6379

# Check Redis logs
docker logs <redis-container>

# Restart containers (triggers reconnect with retry logic)
crwl server restart
```

### Scenario 5: Container Not Appearing in Dashboard

```bash
# Wait 30s for heartbeat
sleep 30

# Check Redis
docker exec <redis-container> redis-cli SMEMBERS monitor:active_containers

# Check container logs for heartbeat errors
docker logs <missing-container> | grep -i heartbeat
```
## Code Context for Advanced Debugging

### Key Classes

- `MonitorStats` (monitor.py): Tracks stats, Redis persistence, heartbeat worker
- `ServerManager` (server_manager.py): CLI orchestration, mode detection
- Browser pool globals: `PERMANENT`, `HOT_POOL`, `COLD_POOL`, `LOCK` (crawler_pool.py)
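The pool lock is the classic deadlock risk in this layout. A sketch of the timeout-protected acquisition pattern (illustrative only; `LOCK` here is a stand-in for `crawler_pool.LOCK`, and the real code may differ):

```python
import asyncio

LOCK = asyncio.Lock()  # stand-in for crawler_pool.LOCK

async def with_pool_lock(coro_fn, timeout: float = 2.0):
    """Run coro_fn() under LOCK, failing fast instead of deadlocking."""
    try:
        await asyncio.wait_for(LOCK.acquire(), timeout)
    except asyncio.TimeoutError:
        raise RuntimeError(f"browser pool lock not acquired within {timeout}s")
    try:
        return await coro_fn()
    finally:
        LOCK.release()  # always released, even if coro_fn raises
```

The 2s bound matches the "Critical Timeouts" table below: a stuck holder surfaces as a loud error rather than a silent hang.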
### Critical Timeouts
- Browser pool lock: 2s timeout (prevents deadlock)
- WebSocket connection: 5s timeout
- Health check: 30-60s timeout
- Heartbeat interval: 30s, TTL: 60s
- Redis retry: 3 attempts, backoff: 0.5s/1s/2s
- Circuit breaker: 5 failures → 5min backoff
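The retry schedule above (3 attempts, 0.5s/1s/2s) is plain exponential backoff. A minimal sketch — the real code would catch `redis.exceptions.ConnectionError` rather than the builtin, and hand repeated exhaustion to the circuit breaker:

```python
import time

def with_redis_retry(op, retries: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    """Call op(); on connection failure retry up to 3 times, backing off 0.5s, 1s, 2s."""
    for attempt in range(retries + 1):
        try:
            return op()
        except ConnectionError:
            if attempt == retries:
                raise  # exhausted: surface the error (circuit-breaker territory)
            sleep(base_delay * (2 ** attempt))
```

Injecting `sleep` as a parameter keeps the backoff schedule unit-testable without real delays.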
### State Transitions

```
NOT_RUNNING → STARTING → HEALTHY → RUNNING
                 ↓                    ↓
               FAILED          UNHEALTHY → STOPPED
```

State file: `~/.crawl4ai/server/state.json` (atomic writes, fcntl locking)
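The atomic-write-plus-lock pattern looks roughly like this (a sketch, POSIX-only via `fcntl`; the actual ServerManager implementation may differ in detail):

```python
import fcntl
import json
import os
import tempfile

def write_state(state: dict, path: str) -> None:
    """Serialize state under an exclusive lock, then rename into place atomically."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path + ".lock", "w") as lock_f:
        fcntl.flock(lock_f, fcntl.LOCK_EX)   # block concurrent writers
        try:
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
            with os.fdopen(fd, "w") as f:
                json.dump(state, f)
                f.flush()
                os.fsync(f.fileno())         # ensure bytes hit disk first
            os.replace(tmp, path)            # atomic rename on POSIX
        finally:
            fcntl.flock(lock_f, fcntl.LOCK_UN)
```

Readers never see a half-written file: `os.replace` swaps the complete temp file in one step, and the separate `.lock` file serializes concurrent CLI invocations.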
## Quick Diagnostic Commands

```bash
# Full system check
crwl server status
docker ps
curl http://localhost:11235/health
curl http://localhost:11235/monitor/containers | jq

# Redis check
docker exec <redis-container> redis-cli PING
docker exec <redis-container> redis-cli INFO stats

# Network check
docker network ls
docker network inspect <network-name>

# Logs check
docker logs <nginx-container> --tail 50
docker logs <redis-container> --tail 50
docker compose -f ~/.crawl4ai/server/docker-compose.yml logs --tail 100
```
## Agent Decision Tree

**User reports slow crawling:**
- Check the dashboard for stuck active requests → kill the browser if >5min
- Check browser pool status → clean up if hot/cold pool >10
- Check container CPU/memory → scale up if >80%
- Check Redis latency → restart Redis if >100ms

**User reports missing containers:**
- Wait 30s for heartbeat
- Compare `docker ps` vs dashboard count
- Check Redis: `SMEMBERS monitor:active_containers`
- Check container logs for Redis connection errors
- Verify REDIS_HOST/PORT env vars

**User reports 502/503 errors:**
- Check Nginx logs for upstream errors
- Check container health: `curl http://localhost:11235/health`
- Check that all containers are healthy: `docker ps`
- Restart Nginx: `docker restart <nginx-container>`

**User wants to update the image:**
1. `crwl server stop`
2. `docker pull unclecode/crawl4ai:latest`
3. `crwl server start --replicas <previous-count>`
**Version:** Crawl4AI v0.7.4+
**Last Updated:** 2025-01-20

**AI Agent Note:** All commands, file paths, and Redis keys verified against the codebase. Use exact syntax shown. For user-facing responses, translate technical details to plain language.