Add comprehensive Docker cluster orchestration with horizontal scaling support.

CLI Commands:
- crwl server start/stop/restart/status/scale/logs
- Auto-detection: Single (N=1) → Swarm (N>1) → Compose (N>1 fallback)
- Support for 1-100 container replicas with zero-downtime scaling

Infrastructure:
- Nginx load balancing (round-robin API, sticky sessions monitoring)
- Redis-based container discovery via heartbeats (30s interval)
- Real-time monitoring dashboard with cluster-wide visibility
- WebSocket aggregation from all containers

Security & Stability Fixes (12 critical issues):
- Add timeout protection to browser pool locks (prevent deadlocks)
- Implement Redis retry logic with exponential backoff
- Add container ID validation (prevent Redis key injection)
- Add CLI input sanitization (prevent shell injection)
- Add file locking for state management (prevent corruption)
- Fix WebSocket resource leaks and connection cleanup
- Add graceful degradation and circuit breakers

Configuration:
- RedisTTLConfig dataclass with environment variable support
- Template-based docker-compose.yml and nginx.conf generation
- Comprehensive error handling with actionable messages

Documentation:
- AGENT.md: Complete DevOps context for AI assistants
- MULTI_CONTAINER_ARCHITECTURE.md: Technical architecture guide
- Reorganized docs into deploy/docker/docs/

# Crawl4AI DevOps Agent Context

## Service Overview
**Crawl4AI**: Browser-based web crawling service with AI extraction. Docker deployment with horizontal scaling (1-N containers), Redis coordination, Nginx load balancing.

## Architecture Quick Reference

```
Client → Nginx:11235 → [crawl4ai-1, crawl4ai-2, ...crawl4ai-N] ← Redis
                                    ↓
                            Monitor Dashboard
```

**Components:**
- **Nginx**: Load balancer (round-robin API, sticky monitoring)
- **Crawl4AI containers**: FastAPI + Playwright browsers
- **Redis**: Container discovery (heartbeats 30s), monitoring data aggregation
- **Monitor**: Real-time dashboard at `/dashboard`

## CLI Commands

### Start/Stop
```bash
crwl server start [-r N] [--port P] [--mode auto|single|swarm|compose] [--env-file F] [--image I]
crwl server stop [--remove-volumes]
crwl server restart [-r N]
```

### Management
```bash
crwl server status                # Show mode, replicas, port, uptime
crwl server scale N               # Live scaling (Swarm/Compose only)
crwl server logs [-f] [--tail N]
```

**Defaults**: replicas=1, port=11235, mode=auto, image=unclecode/crawl4ai:latest

## Deployment Modes

| Replicas | Mode | Load Balancer | Use Case |
|----------|------|---------------|----------|
| N=1 | single | None | Dev/testing |
| N>1 | swarm | Built-in | Production (if `docker swarm init` done) |
| N>1 | compose | Nginx | Production (fallback) |

**Mode Detection** (when mode=auto):
1. If N=1 → single
2. If N>1 & Swarm active → swarm
3. If N>1 & Swarm inactive → compose
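
The three detection rules can be sketched as a small pure function. This is an illustration only: `detect_mode` and its parameters are hypothetical names, and the assumption that an explicit `--mode` bypasses detection is inferred from the "when mode=auto" wording rather than confirmed against `server_manager.py`.

```python
def detect_mode(replicas: int, swarm_active: bool, mode: str = "auto") -> str:
    """Resolve the deployment mode from the replica count and Swarm state."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if mode != "auto":
        # Assumption: an explicitly requested mode is used as-is.
        return mode
    if replicas == 1:
        return "single"
    # N > 1: prefer Swarm when it is initialised, otherwise fall back to Compose.
    return "swarm" if swarm_active else "compose"
```

For example, `detect_mode(3, swarm_active=False)` resolves to `"compose"`, which matches the fallback row in the table.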

## File Locations

```
~/.crawl4ai/server/
├── state.json              # Current deployment state
├── docker-compose.yml      # Generated compose file
└── nginx.conf              # Generated nginx config

/app/                       # Inside container
├── deploy/docker/server.py
├── deploy/docker/monitor.py
├── deploy/docker/static/monitor/index.html
└── crawler_pool.py         # Browser pool (PERMANENT, HOT_POOL, COLD_POOL)
```

## Monitoring & Troubleshooting

### Health Checks
```bash
curl http://localhost:11235/health              # Service health
curl http://localhost:11235/monitor/containers  # Container discovery
curl http://localhost:11235/monitor/requests    # Aggregated requests
```

### Dashboard
- URL: `http://localhost:11235/dashboard/`
- Features: Container filtering (All/C-1/C-2/C-3), real-time WebSocket, timeline charts
- WebSocket: `/monitor/ws` (sticky sessions)

### Common Issues

**No containers showing in dashboard:**
```bash
docker exec <redis-container> redis-cli SMEMBERS monitor:active_containers
docker exec <redis-container> redis-cli KEYS "monitor:heartbeat:*"
```
Wait 30s for heartbeat registration.

**Load balancing not working:**
```bash
docker exec <nginx-container> cat /etc/nginx/nginx.conf | grep upstream
# docker logs replays container stderr on stderr, so merge it before grepping
docker logs <nginx-container> 2>&1 | grep error
```
Check that the Nginx upstream used for API endpoints does not set `ip_hash`.

**Redis connection errors:**
```bash
docker logs <crawl4ai-container> 2>&1 | grep -i redis
# -c 3 stops after three probes instead of pinging forever
docker exec <crawl4ai-container> ping -c 3 redis
```
Verify REDIS_HOST=redis, REDIS_PORT=6379.

**Containers not scaling:**
```bash
# Swarm
docker service ls
docker service ps crawl4ai

# Compose
docker compose -f ~/.crawl4ai/server/docker-compose.yml ps
docker compose -f ~/.crawl4ai/server/docker-compose.yml up -d --scale crawl4ai=N
```

### Redis Data Structure
```
monitor:active_containers          # SET: {container_ids}
monitor:heartbeat:{cid}            # STRING: {id, hostname, last_seen} TTL=60s
monitor:{cid}:active_requests      # STRING: JSON list, TTL=5min
monitor:{cid}:completed            # STRING: JSON list, TTL=1h
monitor:{cid}:janitor              # STRING: JSON list, TTL=1h
monitor:{cid}:errors               # STRING: JSON list, TTL=1h
monitor:endpoint_stats             # STRING: JSON aggregate, TTL=24h
```
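
When scripting against this layout it can help to build key names in one place. The sketch below mirrors the key table above; the helper names (`heartbeat_key`, `container_keys`, `KEY_TTL_SECONDS`) are hypothetical and not taken from `monitor.py`.

```python
# TTLs in seconds, copied from the key table above.
KEY_TTL_SECONDS = {
    "heartbeat": 60,
    "active_requests": 5 * 60,
    "completed": 60 * 60,
    "janitor": 60 * 60,
    "errors": 60 * 60,
    "endpoint_stats": 24 * 60 * 60,
}

ACTIVE_CONTAINERS_KEY = "monitor:active_containers"
ENDPOINT_STATS_KEY = "monitor:endpoint_stats"


def heartbeat_key(cid: str) -> str:
    """Key holding the heartbeat payload for one container."""
    return f"monitor:heartbeat:{cid}"


def container_keys(cid: str) -> dict:
    """Per-container monitoring keys, indexed by their short name."""
    names = ("active_requests", "completed", "janitor", "errors")
    return {name: f"monitor:{cid}:{name}" for name in names}
```

The resulting strings can be passed to `redis-cli GET` or `TTL` when checking why a container is missing data.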

## Environment Variables

### Required for Multi-LLM
```bash
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
DEEPSEEK_API_KEY=...
GROQ_API_KEY=...
TOGETHER_API_KEY=...
MISTRAL_API_KEY=...
GEMINI_API_TOKEN=...
```

### Redis Configuration (Optional)
```bash
REDIS_HOST=redis                    # Default: redis
REDIS_PORT=6379                     # Default: 6379
REDIS_TTL_ACTIVE_REQUESTS=300       # Default: 5min
REDIS_TTL_COMPLETED_REQUESTS=3600   # Default: 1h
REDIS_TTL_JANITOR_EVENTS=3600       # Default: 1h
REDIS_TTL_ERRORS=3600               # Default: 1h
REDIS_TTL_ENDPOINT_STATS=86400      # Default: 24h
REDIS_TTL_HEARTBEAT=60              # Default: 1min
```
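
A TTL configuration object driven by these variables might look like the following. This is a sketch under assumptions: the environment variable names and defaults come from the listing above, but the field names and the `from_env` method are hypothetical and may differ from the actual `RedisTTLConfig` in the codebase.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RedisTTLConfig:
    # Defaults (seconds) match the documented values.
    active_requests: int = 300
    completed_requests: int = 3600
    janitor_events: int = 3600
    errors: int = 3600
    endpoint_stats: int = 86400
    heartbeat: int = 60

    @classmethod
    def from_env(cls, env=None):
        """Build a config, overriding defaults with REDIS_TTL_* variables."""
        env = os.environ if env is None else env
        overrides = {}
        for field in fields(cls):
            raw = env.get(f"REDIS_TTL_{field.name.upper()}")
            if raw is not None:
                overrides[field.name] = int(raw)  # raises ValueError on bad input
        return cls(**overrides)
```

Passing a plain dict as `env` keeps the loader easy to exercise without touching the real process environment.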

## API Endpoints

### Core API
- `POST /crawl` - Crawl URL (load-balanced)
- `POST /batch` - Batch crawl (load-balanced)
- `GET /health` - Health check (load-balanced)

### Monitor API (Aggregated from all containers)
- `GET /monitor/health` - Local container health
- `GET /monitor/containers` - All active containers
- `GET /monitor/requests` - All requests (active + completed)
- `GET /monitor/browsers` - Browser pool status (local only)
- `GET /monitor/logs/janitor` - Janitor cleanup events
- `GET /monitor/logs/errors` - Error logs
- `GET /monitor/endpoints/stats` - Endpoint analytics
- `WS /monitor/ws` - Real-time updates (aggregated)

### Control Actions
- `POST /monitor/actions/cleanup` - Force browser cleanup
- `POST /monitor/actions/kill_browser` - Kill specific browser
- `POST /monitor/actions/restart_browser` - Restart browser
- `POST /monitor/stats/reset` - Reset endpoint counters

## Docker Commands Reference

### Inspection
```bash
# List containers
docker ps --filter "name=crawl4ai"

# Container logs
docker logs <container-id> -f --tail 100

# Redis CLI (the four commands after the first line are typed at the redis-cli prompt)
docker exec -it <redis-container> redis-cli
KEYS monitor:*
SMEMBERS monitor:active_containers
GET monitor:<cid>:completed
TTL monitor:heartbeat:<cid>

# Nginx config
docker exec <nginx-container> cat /etc/nginx/nginx.conf

# Container stats
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
```

### Compose Operations
```bash
# Scale
docker compose -f ~/.crawl4ai/server/docker-compose.yml up -d --scale crawl4ai=5

# Restart service
docker compose -f ~/.crawl4ai/server/docker-compose.yml restart crawl4ai

# View services
docker compose -f ~/.crawl4ai/server/docker-compose.yml ps
```

### Swarm Operations
```bash
# Initialize Swarm
docker swarm init

# Scale service
docker service scale crawl4ai=5

# Service info
docker service ls
docker service ps crawl4ai --no-trunc

# Service logs
docker service logs crawl4ai --tail 100 -f
```

## Performance & Scaling

### Resource Recommendations
| Containers | Memory/Container | Total Memory | Use Case |
|------------|-----------------|--------------|----------|
| 1 | 4GB | 4GB | Development |
| 3 | 4GB | 12GB | Small prod |
| 5 | 4GB | 20GB | Medium prod |
| 10 | 4GB | 40GB | Large prod |

**Expected Throughput**: ~10 req/min per container (depends on crawl complexity)
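
The table and the throughput estimate combine into a rough sizing rule. The function below is a back-of-the-envelope sketch, not a tool shipped with Crawl4AI; `plan_capacity` is a hypothetical name, and the 10 req/min and 4GB figures are the approximate values quoted above.

```python
import math

REQ_PER_MIN_PER_CONTAINER = 10   # approximate, depends on crawl complexity
MEMORY_GB_PER_CONTAINER = 4
MAX_REPLICAS = 100               # upper bound accepted by the CLI


def plan_capacity(target_req_per_min: float):
    """Return (replicas, total_memory_gb) for a target request rate."""
    replicas = max(1, math.ceil(target_req_per_min / REQ_PER_MIN_PER_CONTAINER))
    if replicas > MAX_REPLICAS:
        raise ValueError("target exceeds what 100 replicas can serve")
    return replicas, replicas * MEMORY_GB_PER_CONTAINER
```

A target of 45 req/min rounds up to 5 replicas and 20GB, which lines up with the "Medium prod" row.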

### Scaling Guidelines
- **Horizontal**: Add replicas (`crwl server scale N`)
- **Vertical**: Adjust `--memory 8G --cpus 4` in kwargs
- **Browser Pool**: Permanent (1) + Hot pool (adaptive) + Cold pool (cleanup by janitor)

### Redis Memory Usage
- **Per container**: ~110KB (requests + events + errors + heartbeat)
- **10 containers**: ~1.1MB
- **Recommendation**: 256MB Redis is sufficient for <100 containers
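
The arithmetic behind these figures is linear in the container count. The snippet below only restates the ~110KB estimate; `redis_monitor_memory_mb` is a hypothetical helper, and real usage will vary with request volume.

```python
PER_CONTAINER_KB = 110  # approximate figure quoted above


def redis_monitor_memory_mb(containers: int) -> float:
    """Estimated monitoring footprint in MB (1 MB = 1000 KB)."""
    return containers * PER_CONTAINER_KB / 1000
```

At the 100-container limit this comes to roughly 11MB, which is why a 256MB Redis instance leaves ample headroom.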

## Security Notes

### Input Validation
All CLI inputs validated:
- Image name: alphanumeric + `.-/:_@` only, max 256 chars
- Port: 1-65535
- Replicas: 1-100
- Env file: must exist and be readable
- Container IDs: alphanumeric + `-_` only (prevents Redis injection)
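
The rules above translate directly into allow-list checks. The following is a minimal sketch of such validators, assuming ASCII-only character classes; the function names are hypothetical and the real checks in the CLI may be stricter.

```python
import os
import re

# Allow-lists taken from the rules above.
IMAGE_RE = re.compile(r"[A-Za-z0-9.\-/:_@]{1,256}")
CONTAINER_ID_RE = re.compile(r"[A-Za-z0-9_-]+")


def valid_image(name: str) -> bool:
    return IMAGE_RE.fullmatch(name) is not None


def valid_port(port: int) -> bool:
    return 1 <= port <= 65535


def valid_replicas(count: int) -> bool:
    return 1 <= count <= 100


def valid_env_file(path: str) -> bool:
    return os.path.isfile(path) and os.access(path, os.R_OK)


def valid_container_id(cid: str) -> bool:
    # Rejecting ':' and '*' keeps the ID from altering a Redis key pattern.
    return CONTAINER_ID_RE.fullmatch(cid) is not None
```

Using `fullmatch` rather than `match` matters here: it rejects a valid prefix followed by shell metacharacters or a trailing newline.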

### Network Security
- Nginx forwards to internal `crawl4ai` service (Docker network)
- Monitor endpoints have NO authentication (add MONITOR_TOKEN env for security)
- Redis is internal-only (no external port)

### Recommended Production Setup
```bash
# Add authentication
export MONITOR_TOKEN="your-secret-token"

# Use Redis password (docker-compose.yml fragment, under `services:`)
redis:
  command: redis-server --requirepass ${REDIS_PASSWORD}

# Enable rate limiting in Nginx (nginx.conf fragment, `http` block);
# the zone only takes effect once a `location` references it with `limit_req zone=api;`
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
```

## Common User Scenarios

### Scenario 1: Fresh Deployment
```bash
crwl server start --replicas 3 --env-file .env
# Wait for health check, then access http://localhost:11235/health
```
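
The "wait for health check" step can be automated with a small polling loop. This is a sketch, not part of the `crwl` CLI: `http_ok` and `wait_until_healthy` are hypothetical helpers, and the 60s default follows the upper end of the documented 30-60s health check timeout.

```python
import http.client
import time
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True if a GET on the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        return False


def wait_until_healthy(probe, timeout=60.0, interval=2.0,
                       sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll probe() until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A typical call would be `wait_until_healthy(lambda: http_ok("http://localhost:11235/health"))`; the injectable `sleep` and `clock` exist only so the loop can be exercised without real waiting.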

### Scenario 2: Scaling Under Load
```bash
crwl server scale 10
# Live scaling, no downtime
```

### Scenario 3: Debugging Slow Requests
```bash
# Check dashboard (`open` is macOS; use `xdg-open` on Linux)
open http://localhost:11235/dashboard/

# Check container logs
docker logs <slowest-container-id> --tail 100

# Check browser pool
curl http://localhost:11235/monitor/browsers | jq
```

### Scenario 4: Redis Connection Issues
```bash
# Check Redis connectivity
docker exec <crawl4ai-container> nc -zv redis 6379

# Check Redis logs
docker logs <redis-container>

# Restart containers (triggers reconnect with retry logic)
crwl server restart
```

### Scenario 5: Container Not Appearing in Dashboard
```bash
# Wait 30s for heartbeat
sleep 30

# Check Redis
docker exec <redis-container> redis-cli SMEMBERS monitor:active_containers

# Check container logs for heartbeat errors (merge stderr so grep sees it)
docker logs <missing-container> 2>&1 | grep -i heartbeat
```

## Code Context for Advanced Debugging

### Key Classes
- `MonitorStats` (monitor.py): Tracks stats, Redis persistence, heartbeat worker
- `ServerManager` (server_manager.py): CLI orchestration, mode detection
- Browser pool globals: `PERMANENT`, `HOT_POOL`, `COLD_POOL`, `LOCK` (crawler_pool.py)

### Critical Timeouts
- Browser pool lock: 2s timeout (prevents deadlock)
- WebSocket connection: 5s timeout
- Health check: 30-60s timeout
- Heartbeat interval: 30s, TTL: 60s
- Redis retry: 3 attempts, backoff: 0.5s/1s/2s
- Circuit breaker: 5 failures → 5min backoff
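
The last two entries describe generic resilience patterns, sketched below for orientation. None of this is the code in `monitor.py`: `retry_with_backoff` and `CircuitBreaker` are hypothetical names, and the sketch assumes delays double from 0.5s between attempts, so with three attempts only the 0.5s and 1s waits occur (the 2s step would apply before a fourth attempt).

```python
import time


def retry_with_backoff(op, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call op(); after each failure wait base_delay * 2**i, then retry."""
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** i)


class CircuitBreaker:
    """Stop calling a failing dependency for `cooldown` seconds."""

    def __init__(self, threshold=5, cooldown=300.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: close the breaker and start counting afresh.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None
```

When debugging, this is why a container can stay silent toward Redis for several minutes after a burst of failures even though Redis itself has already recovered.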

### State Transitions
```
NOT_RUNNING → STARTING → HEALTHY → RUNNING
                 ↓          ↓
               FAILED   UNHEALTHY → STOPPED
```

State file: `~/.crawl4ai/server/state.json` (atomic writes, fcntl locking)
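
"Atomic writes, fcntl locking" usually means writing to a temporary file under an exclusive lock and renaming it into place. The sketch below shows that general pattern on POSIX systems; `write_state`, `read_state`, the `.lock` sidecar file, and the example state keys are assumptions for illustration, not the actual `ServerManager` code.

```python
import fcntl
import json
import os
import tempfile
from pathlib import Path


def write_state(path: Path, state: dict) -> None:
    """Replace the state file atomically while holding an exclusive lock."""
    path.parent.mkdir(parents=True, exist_ok=True)
    lock_path = path.with_suffix(".lock")  # hypothetical sidecar lock file
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until other writers finish
        try:
            fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
            with os.fdopen(fd, "w") as f:
                json.dump(state, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)  # atomic rename within one filesystem
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)


def read_state(path: Path) -> dict:
    with open(path) as f:
        return json.load(f)
```

Because readers only ever see the old file or the fully written new one, a crash mid-write cannot leave a truncated `state.json` behind.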

## Quick Diagnostic Commands

```bash
# Full system check
crwl server status
docker ps
curl http://localhost:11235/health
curl http://localhost:11235/monitor/containers | jq

# Redis check
docker exec <redis-container> redis-cli PING
docker exec <redis-container> redis-cli INFO stats

# Network check
docker network ls
docker network inspect <network-name>

# Logs check
docker logs <nginx-container> --tail 50
docker logs <redis-container> --tail 50
docker compose -f ~/.crawl4ai/server/docker-compose.yml logs --tail 100
```

## Agent Decision Tree

**User reports slow crawling:**
1. Check the dashboard for stuck active requests → kill the browser if a request has run >5min
2. Check browser pool status → cleanup if hot/cold pool >10
3. Check container CPU/memory → scale up if >80%
4. Check Redis latency → restart Redis if >100ms
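
The four checks above amount to threshold rules, which can be expressed as a function from observed metrics to suggested actions. This is purely illustrative: `triage_slow_crawl`, its parameters, and the action labels are hypothetical, and collecting the metrics themselves is left out.

```python
def triage_slow_crawl(stuck_minutes: float, pool_size: int,
                      cpu_or_mem_pct: float, redis_latency_ms: float) -> list:
    """Map observed metrics to the actions from the checklist, in order."""
    actions = []
    if stuck_minutes > 5:
        actions.append("kill_browser")    # POST /monitor/actions/kill_browser
    if pool_size > 10:
        actions.append("cleanup")         # POST /monitor/actions/cleanup
    if cpu_or_mem_pct > 80:
        actions.append("scale_up")        # crwl server scale N
    if redis_latency_ms > 100:
        actions.append("restart_redis")
    return actions
```

All thresholds are strict, so a request that has run exactly 5 minutes or a pool of exactly 10 browsers triggers nothing.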

**User reports missing containers:**
1. Wait 30s for heartbeat
2. Check `docker ps` vs dashboard count
3. Check Redis SMEMBERS monitor:active_containers
4. Check container logs for Redis connection errors
5. Verify REDIS_HOST/PORT env vars

**User reports 502/503 errors:**
1. Check Nginx logs for upstream errors
2. Check container health: `curl http://localhost:11235/health`
3. Check if all containers are healthy: `docker ps`
4. Restart Nginx: `docker restart <nginx-container>`

**User wants to update image:**
1. `crwl server stop`
2. `docker pull unclecode/crawl4ai:latest`
3. `crwl server start --replicas <previous-count>`

---

**Version**: Crawl4AI v0.7.4+
**Last Updated**: 2025-01-20
**AI Agent Note**: All commands, file paths, and Redis keys verified against codebase. Use exact syntax shown. For user-facing responses, translate technical details to plain language.