# Crawl4AI Docker Architecture - AI Context Map

**Purpose:** Dense technical reference for AI agents to understand the complete system architecture.
**Format:** Symbolic, compressed, high-information-density documentation.

---

## System Overview

```
┌─────────────────────────────────────────────────────────────┐
│           CRAWL4AI DOCKER ORCHESTRATION SYSTEM               │
├─────────────────────────────────────────────────────────────┤
│ Modes:   Single (N=1) | Swarm (N>1) | Compose+Nginx (N>1)    │
│ Entry:   cnode CLI → deploy/docker/cnode_cli.py              │
│ Core:    deploy/docker/server_manager.py                     │
│ Server:  deploy/docker/server.py (FastAPI)                   │
│ API:     deploy/docker/api.py (crawl endpoints)              │
│ Monitor: deploy/docker/monitor.py + monitor_routes.py        │
└─────────────────────────────────────────────────────────────┘
```

---

## Directory Structure & File Map

```
deploy/
├── docker/                   # Server runtime & orchestration
│   ├── server.py             # FastAPI app entry [CRITICAL]
│   ├── api.py                # /crawl, /screenshot, /pdf endpoints
│   ├── server_manager.py     # Docker orchestration logic [CORE]
│   ├── cnode_cli.py          # CLI interface (Click-based)
│   ├── monitor.py            # Real-time metrics collector
│   ├── monitor_routes.py     # /monitor dashboard routes
│   ├── crawler_pool.py       # Browser pool management
│   ├── hook_manager.py       # Pre/post crawl hooks
│   ├── job.py                # Job queue schema
│   ├── utils.py              # Helpers (port check, health)
│   ├── auth.py               # API key authentication
│   ├── schemas.py            # Pydantic models
│   ├── mcp_bridge.py         # MCP protocol bridge
│   ├── supervisord.conf      # Process manager config
│   ├── config.yml            # Server config template
│   ├── requirements.txt      # Python deps
│   ├── static/               # Web assets
│   │   ├── monitor/          # Dashboard UI
│   │   └── playground/       # API playground
│   └── tests/                # Test suite
└── installer/                # User-facing installation
    ├── cnode_pkg/            # Standalone package
    │   ├── cli.py            # Copy of cnode_cli.py
    │   ├── server_manager.py # Copy of server_manager.py
    │   └── requirements.txt  # click, rich, anyio, pyyaml
    ├── install-cnode.sh      # Remote installer (git sparse-checkout)
    ├── sync-cnode.sh         # Dev tool (source→pkg sync)
    ├── USER_GUIDE.md         # Human-readable guide
    ├── README.md             # Developer documentation
    └── QUICKSTART.md         # Cheat sheet
```

---

## Core Components Deep Dive

### 1. `server_manager.py` - Orchestration Engine

**Role:** Manages Docker container lifecycle, auto-detects deployment mode.

**Key Classes:**
- `ServerManager` - Main orchestrator
  - `start(replicas, mode, port, env_file, image)` → Deploy server
  - `stop(remove_volumes)` → Teardown
  - `status()` → Health check
  - `scale(replicas)` → Live scaling
  - `logs(follow, tail)` → Stream logs
  - `cleanup(force)` → Emergency cleanup

**State Management:**
- File: `~/.crawl4ai/server_state.yml`
- Schema: `{mode, replicas, port, image, started_at, containers[]}`
- Atomic writes with lock file

**Deployment Modes:**
```python
if replicas == 1:
    mode = "single"      # docker run
elif swarm_available():
    mode = "swarm"       # docker stack deploy
else:
    mode = "compose"     # docker-compose + nginx
```

**Container Naming:**
- Single: `crawl4ai-server`
- Swarm: `crawl4ai-stack_crawl4ai`
- Compose: `crawl4ai-server-{1..N}`, `crawl4ai-nginx`

**Networks:**
- `crawl4ai-network` (bridge mode for all)

**Volumes:**
- `crawl4ai-redis-data` - Persistent queue
- `crawl4ai-profiles` - Browser profiles

**Health Checks:**
- Endpoint: `http://localhost:{port}/health`
- Timeout: 30s startup
- Retry: 3 attempts
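The mode selection and health-check policy above can be sketched concretely as follows. This is an illustrative sketch, not the actual `server_manager.py` code: `swarm_available()` mirrors the pseudocode above, while `pick_mode()` and `wait_healthy()` are hypothetical helper names, and the real retry logic may be shaped differently.

```python
import subprocess
import time
import urllib.error
import urllib.request


def swarm_available() -> bool:
    """True if the local Docker daemon is an active Swarm node."""
    out = subprocess.run(
        ["docker", "info", "--format", "{{.Swarm.LocalNodeState}}"],
        capture_output=True, text=True,
    )
    return out.returncode == 0 and out.stdout.strip() == "active"


def pick_mode(replicas: int) -> str:
    """Mirror the mode-selection rules above."""
    if replicas == 1:
        return "single"
    return "swarm" if swarm_available() else "compose"


def wait_healthy(port: int, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll GET /health until it returns 200 or the startup timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(
                f"http://localhost:{port}/health", timeout=5
            ) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet
        time.sleep(interval)
    return False
```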
---

### 2. `server.py` - FastAPI Application

**Role:** HTTP server exposing crawl API + monitoring.

**Startup Flow:**
```python
app = FastAPI()

@app.on_event("startup")
async def startup():
    init_crawler_pool()        # Pre-warm browsers
    init_redis_connection()    # Job queue
    start_monitor_collector()  # Metrics
```

**Key Endpoints:**
```
POST /crawl          → api.py:crawl_endpoint()
POST /crawl/stream   → api.py:crawl_stream_endpoint()
POST /screenshot     → api.py:screenshot_endpoint()
POST /pdf            → api.py:pdf_endpoint()
GET  /health         → server.py:health_check()
GET  /monitor        → monitor_routes.py:dashboard()
WS   /monitor/ws     → monitor_routes.py:websocket_endpoint()
GET  /playground     → static/playground/index.html
```

**Process Manager:**
- Uses `supervisord` to manage:
  - FastAPI server (port 11235)
  - Redis (port 6379)
  - Background workers

**Environment:**
```bash
CRAWL4AI_PORT=11235
REDIS_URL=redis://localhost:6379
MAX_CONCURRENT_CRAWLS=5
BROWSER_POOL_SIZE=3
```

---

### 3. `api.py` - Crawl Endpoints

**Main Endpoint:** `POST /crawl`

**Request Schema:**
```json
{
  "urls": ["https://example.com"],
  "priority": 10,
  "browser_config": {
    "type": "BrowserConfig",
    "params": {"headless": true, "viewport_width": 1920}
  },
  "crawler_config": {
    "type": "CrawlerRunConfig",
    "params": {"cache_mode": "bypass", "extraction_strategy": {...}}
  }
}
```

**Processing Flow:**
```
1. Validate request (Pydantic)
2. Queue job → Redis
3. Get browser from pool → crawler_pool.py
4. Execute crawl → AsyncWebCrawler
5. Apply hooks → hook_manager.py
6. Return result (JSON)
7. Release browser to pool
```

**Memory Management:**
- Browser pool: Max 3 instances
- LRU eviction when pool full
- Explicit cleanup: `browser.close()` in finally block
- Redis TTL: 1 hour for completed jobs

**Error Handling:**
```python
try:
    result = await crawler.arun(url, config)
except PlaywrightError as e:
    # Browser crash - release & recreate
    await pool.invalidate(browser_id)
except TimeoutError as e:
    # Timeout - kill & retry
    await crawler.kill()
except Exception as e:
    # Unknown - log & fail gracefully
    logger.error(f"Crawl failed: {e}")
```

---

### 4. `crawler_pool.py` - Browser Pool Manager

**Role:** Manage persistent browser instances to avoid startup overhead.

**Class:** `CrawlerPool`
- `get_crawler()` → Lease browser (async with context manager)
- `release_crawler(id)` → Return to pool
- `warm_up(count)` → Pre-launch browsers
- `cleanup()` → Close all browsers

**Pool Strategy:**
```python
pool = {
    "browser_1": {"crawler": AsyncWebCrawler(), "in_use": False},
    "browser_2": {"crawler": AsyncWebCrawler(), "in_use": False},
    "browser_3": {"crawler": AsyncWebCrawler(), "in_use": False},
}

async with pool.get_crawler() as crawler:
    result = await crawler.arun(url)
    # Auto-released on context exit
```

**Anti-Leak Mechanisms:**
1. Context managers enforce cleanup
2. Watchdog thread kills stale browsers (>10min idle)
3. Max lifetime: 1 hour per browser
4. Force GC after browser close
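A minimal sketch of the lease/release discipline behind `get_crawler()`, assuming a `factory` callable that produces crawler-like objects (e.g. `AsyncWebCrawler`). `SimplePool` is an illustrative name, not the actual class; the real `crawler_pool.py` adds warm-up, the idle watchdog, and max-lifetime recycling on top of this pattern.

```python
import asyncio
from contextlib import asynccontextmanager


class SimplePool:
    """Illustrative lease/release pool for crawler-like objects."""

    def __init__(self, factory, size: int = 3):
        self._factory = factory                  # callable returning a new crawler-like object
        self._free: asyncio.Queue = asyncio.Queue(maxsize=size)
        self._size = size
        self._created = 0

    @asynccontextmanager
    async def get_crawler(self):
        # Lazily create up to `size` instances; afterwards, wait for a release.
        if self._created < self._size and self._free.empty():
            self._free.put_nowait(self._factory())
            self._created += 1
        crawler = await self._free.get()
        try:
            yield crawler                        # caller runs arun() etc. here
        finally:
            self._free.put_nowait(crawler)       # always returned, even if the crawl raised
```

Usage matches the pool strategy above: `async with pool.get_crawler() as crawler: result = await crawler.arun(url)` — the `finally` guarantees the instance goes back to the pool even when the crawl raises.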
---

### 5. `monitor.py` + `monitor_routes.py` - Real-time Dashboard

**Architecture:**
```
[Browser] <--WebSocket--> [monitor_routes.py] <--Events--> [monitor.py]
                                                                ↓
                                                         [Redis Pub/Sub]
                                                                ↓
                                                       [Metrics Collector]
```

**Metrics Collected:**
- Requests/sec (sliding window)
- Active crawls (real-time count)
- Response times (p50, p95, p99)
- Error rate (5min rolling)
- Memory usage (RSS, heap)
- Browser pool utilization

**WebSocket Protocol:**
```json
// Server → Client
{
  "type": "metrics",
  "data": {
    "rps": 45.3,
    "active_crawls": 12,
    "p95_latency": 1234,
    "error_rate": 0.02
  }
}

// Client → Server
{
  "type": "subscribe",
  "channels": ["metrics", "logs"]
}
```

**Dashboard Route:** `/monitor`
- Real-time graphs (Chart.js)
- Request log stream
- Container health status
- Resource utilization

---

### 6. `cnode_cli.py` - CLI Interface

**Framework:** Click (Python CLI framework)

**Command Structure:**
```
cnode
├── start    [--replicas N] [--port P] [--mode M] [--image I]
├── stop     [--remove-volumes]
├── status
├── scale N
├── logs     [--follow] [--tail N]
├── restart  [--replicas N]
└── cleanup  [--force]
```

**Execution Flow:**
```python
@cli.command("start")
def start_cmd(replicas, mode, port, env_file, image):
    manager = ServerManager()
    result = anyio.run(manager.start, ...)  # Async bridge (args elided)
    if result["success"]:
        console.print(success_panel)
```

**User Feedback:**
- Rich library for colors/tables
- Progress spinners during operations
- Error messages with hints
- Status tables with health indicators

**State Persistence:**
- Saves deployment config to `~/.crawl4ai/server_state.yml`
- Enables stateless commands (status, scale, restart)

---

### 7. Docker Orchestration Details

**Single Container Mode (N=1):**
```bash
docker run -d \
  --name crawl4ai-server \
  --network crawl4ai-network \
  -p 11235:11235 \
  -v crawl4ai-redis-data:/data \
  unclecode/crawl4ai:latest
```

**Docker Swarm Mode (N>1, Swarm available):**
```yaml
# docker-compose.swarm.yml
version: '3.8'
services:
  crawl4ai:
    image: unclecode/crawl4ai:latest
    deploy:
      replicas: 5
      update_config:
        parallelism: 2
        delay: 10s
      restart_policy:
        condition: on-failure
    ports:
      - "11235:11235"
    networks:
      - crawl4ai-network
```

Deploy: `docker stack deploy -c docker-compose.swarm.yml crawl4ai-stack`

**Docker Compose + Nginx Mode (N>1, fallback):**
```yaml
# docker-compose.yml
services:
  crawl4ai-1:
    image: unclecode/crawl4ai:latest
    networks: [crawl4ai-network]
  crawl4ai-2:
    image: unclecode/crawl4ai:latest
    networks: [crawl4ai-network]
  nginx:
    image: nginx:alpine
    ports: ["11235:80"]
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    networks: [crawl4ai-network]
```

Nginx config (round-robin load balancing):
```nginx
upstream crawl4ai_backend {
    server crawl4ai-1:11235;
    server crawl4ai-2:11235;
    server crawl4ai-3:11235;
}

server {
    listen 80;
    location / {
        proxy_pass http://crawl4ai_backend;
        proxy_set_header Host $host;
    }
}
```

---

## Memory Leak Prevention Strategy

### Problem Areas & Solutions

**1. Browser Instances**
```python
# ❌ BAD - Leak risk
crawler = AsyncWebCrawler()
result = await crawler.arun(url)
# Browser never closed!

# ✅ GOOD - Guaranteed cleanup
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url)
# Auto-closed on exit
```

**2. WebSocket Connections**
```python
# monitor_routes.py
active_connections = set()

@app.websocket("/monitor/ws")
async def websocket_endpoint(websocket):
    await websocket.accept()
    active_connections.add(websocket)
    try:
        while True:
            await websocket.send_json(get_metrics())
    finally:
        active_connections.remove(websocket)  # Critical!
```
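A companion sketch to the snippet above: when broadcasting a metrics frame to all tracked clients, failed sends are treated as disconnects and pruned from `active_connections`. This is illustrative only; `broadcast_metrics()` is a hypothetical helper, not necessarily how `monitor_routes.py` structures it.

```python
async def broadcast_metrics(metrics: dict) -> None:
    """Send one metrics frame to every dashboard client, pruning dead sockets."""
    dead = []
    for ws in list(active_connections):   # copy: the set can change while we await
        try:
            await ws.send_json({"type": "metrics", "data": metrics})
        except Exception:
            dead.append(ws)                # send failed → treat as disconnected
    for ws in dead:
        active_connections.discard(ws)
```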
**3. Redis Connections**
```python
import redis.asyncio as aioredis

# Use connection pooling (one pool per process, shared by all handlers)
redis_pool = aioredis.ConnectionPool(
    host="localhost", port=6379,
    max_connections=10, decode_responses=True,
)
redis_client = aioredis.Redis(connection_pool=redis_pool)

# Reuse pooled connections instead of opening new ones
async def get_job(job_id):
    data = await redis_client.get(f"job:{job_id}")
    return data  # connection is returned to the pool automatically
```

**4. Async Task Cleanup**
```python
# Track background tasks
background_tasks = set()

async def crawl_task(url):
    try:
        result = await crawl(url)
    finally:
        background_tasks.discard(asyncio.current_task())

# Register each task when it is spawned
task = asyncio.create_task(crawl_task(url))
background_tasks.add(task)

# On shutdown
async def shutdown():
    tasks = list(background_tasks)
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
```

**5. File Descriptor Leaks**
```python
# Use context managers for files
async def save_screenshot(job_id, screenshot_bytes):
    async with aiofiles.open(f"{job_id}.png", "wb") as f:
        await f.write(screenshot_bytes)
    # File auto-closed
```

---

## Installation & Distribution

### User Installation Flow

**Script:** `deploy/installer/install-cnode.sh`

**Steps:**
1. Check Python 3.8+ exists
2. Check pip available
3. Check Docker installed (warn if missing)
4. Create temp dir: `mktemp -d`
5. Git sparse-checkout:
   ```bash
   git init
   git remote add origin https://github.com/unclecode/crawl4ai.git
   git config core.sparseCheckout true
   echo "deploy/installer/cnode_pkg/*" > .git/info/sparse-checkout
   git pull --depth=1 origin main
   ```
6. Install deps: `pip install click rich anyio pyyaml`
7. Copy package: `cnode_pkg/ → /usr/local/lib/cnode/`
8. Create wrapper: `/usr/local/bin/cnode`
   ```bash
   #!/usr/bin/env bash
   export PYTHONPATH="/usr/local/lib/cnode:$PYTHONPATH"
   exec python3 -m cnode_pkg.cli "$@"
   ```
9. Cleanup temp dir

**Result:**
- Binary-like experience (fast startup: ~0.1s)
- No need for PyInstaller (49x faster)
- Platform-independent (any OS with Python)

---

## Development Workflow

### Source Code Sync (Auto)

**Git Hook:** `.githooks/pre-commit`

**Trigger:** When committing `deploy/docker/cnode_cli.py` or `server_manager.py`

**Action:**
```bash
1. Diff source vs package
2. If different:
   - Run sync-cnode.sh
   - Copy cnode_cli.py → cnode_pkg/cli.py
   - Fix imports: s/deploy.docker/cnode_pkg/g
   - Copy server_manager.py → cnode_pkg/
   - Stage synced files
3. Continue commit
```

**Setup:** `./setup-hooks.sh` (configures `git config core.hooksPath .githooks`)

**Smart Behavior:**
- Silent when no sync needed
- Only syncs if content differs
- Minimal output: `✓ cnode synced`

---

## API Request/Response Flow

### Example: POST /crawl

**Request:**
```bash
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "browser_config": {
      "type": "BrowserConfig",
      "params": {"headless": true}
    },
    "crawler_config": {
      "type": "CrawlerRunConfig",
      "params": {"cache_mode": "bypass"}
    }
  }'
```
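The same request issued from Python rather than curl. This is a usage sketch: `httpx` is an arbitrary choice of HTTP client, and the response fields follow the shape documented in the flow below.

```python
import httpx

payload = {
    "urls": ["https://example.com"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}},
}

resp = httpx.post("http://localhost:11235/crawl", json=payload, timeout=60.0)
resp.raise_for_status()

data = resp.json()
if data.get("success"):
    # "result" / "markdown" follow the response shape documented below
    print(data["result"]["markdown"][:200])
```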
**Processing:**
```
1. FastAPI receives request → api.py:crawl_endpoint()
2. Validate schema          → Pydantic models in schemas.py
3. Create job               → job.py:Job(id=uuid4(), urls=[...])
4. Queue to Redis           → LPUSH crawl_queue {job_json}
5. Get browser from pool    → crawler_pool.py:get_crawler()
6. Execute crawl:
   a. Launch page        → browser.new_page()
   b. Navigate           → page.goto(url)
   c. Extract            → extraction_strategy.extract()
   d. Generate markdown  → markdown_generator.generate()
7. Store result             → Redis SETEX result:{job_id} 3600 {result_json}
8. Release browser          → pool.release(browser_id)
9. Return response:
   {
     "success": true,
     "result": {
       "url": "https://example.com",
       "markdown": "# Example Domain...",
       "metadata": {"title": "Example Domain"},
       "extracted_content": {...}
     }
   }
```

**Error Cases:**
- 400: Invalid request schema
- 429: Rate limit exceeded
- 500: Internal error (browser crash, timeout)
- 503: Service unavailable (all browsers busy)

---

## Scaling Behavior

### Scale-Up (1 → 10 replicas)

**Command:** `cnode scale 10`

**Swarm Mode:**
```bash
docker service scale crawl4ai-stack_crawl4ai=10
# Docker handles:
# - Container creation
# - Network attachment
# - Load balancer update
# - Rolling deployment
```

**Compose Mode:**
```bash
# Update docker-compose.yml
# Change replica count in all service definitions
docker-compose up -d --scale crawl4ai=10
# Regenerate nginx.conf with 10 upstreams
docker exec nginx nginx -s reload
```

**Load Distribution:**
- Swarm: Built-in ingress network (VIP-based round-robin)
- Compose: Nginx upstream (round-robin, can configure least_conn)

**Zero-Downtime:**
- Swarm: Yes (rolling update, parallelism=2)
- Compose: Partial (nginx reload is graceful, but brief spike)

---

## Configuration Files

### `config.yml` - Server Configuration
```yaml
server:
  port: 11235
  host: "0.0.0.0"
  workers: 4

crawler:
  max_concurrent: 5
  timeout: 30
  retries: 3

browser:
  pool_size: 3
  headless: true
  args:
    - "--no-sandbox"
    - "--disable-dev-shm-usage"

redis:
  host: "localhost"
  port: 6379
  db: 0

monitoring:
  enabled: true
  metrics_interval: 5  # seconds
```

### `supervisord.conf` - Process Management
```ini
[supervisord]
nodaemon=true

[program:redis]
command=redis-server --port 6379
autorestart=true

[program:fastapi]
command=uvicorn server:app --host 0.0.0.0 --port 11235
autorestart=true
stdout_logfile=/var/log/crawl4ai/api.log

[program:monitor]
command=python monitor.py
autorestart=true
```

---

## Testing & Quality

### Test Structure
```
deploy/docker/tests/
├── cli/                   # CLI command tests
│   └── test_commands.py   # start, stop, scale, status
├── monitor/               # Dashboard tests
│   └── test_websocket.py  # WS connection, metrics
└── codebase_test/         # Integration tests
    └── test_api.py        # End-to-end crawl tests
```

### Key Test Cases

**CLI Tests:**
- `test_start_single()` - Starts 1 replica
- `test_start_cluster()` - Starts N replicas
- `test_scale_up()` - Scales 1→5
- `test_scale_down()` - Scales 5→2
- `test_status()` - Reports correct state
- `test_logs()` - Streams logs

**API Tests:**
- `test_crawl_success()` - Basic crawl works
- `test_crawl_timeout()` - Handles slow sites
- `test_concurrent_crawls()` - Parallel requests
- `test_browser_pool()` - Reuses browsers
- `test_memory_cleanup()` - No leaks after 100 crawls

**Monitor Tests:**
- `test_websocket_connect()` - WS handshake
- `test_metrics_stream()` - Receives updates
- `test_multiple_clients()` - Handles N connections
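As a concrete illustration of the cases above, a minimal sketch of what a test like `test_crawl_success()` might look like, assuming a server already listening on `localhost:11235` and `httpx` as the client; the assertions and helper names are illustrative, not the actual test suite code.

```python
import httpx

BASE_URL = "http://localhost:11235"


def test_health():
    # The server must report healthy before any crawl test is meaningful.
    assert httpx.get(f"{BASE_URL}/health", timeout=10.0).status_code == 200


def test_crawl_success():
    payload = {
        "urls": ["https://example.com"],
        "crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}},
    }
    resp = httpx.post(f"{BASE_URL}/crawl", json=payload, timeout=120.0)
    assert resp.status_code == 200
    assert resp.json().get("success") is True   # shape as documented in the API flow above
```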
---

## Critical File Cross-Reference

| Component | Primary File | Dependencies |
|-----------|--------------|--------------|
| **CLI Entry** | `cnode_cli.py:482` | `server_manager.py`, `click`, `rich` |
| **Orchestrator** | `server_manager.py:45` | `docker`, `yaml`, `anyio` |
| **API Server** | `server.py:120` | `api.py`, `monitor_routes.py` |
| **Crawl Logic** | `api.py:78` | `crawler_pool.py`, `AsyncWebCrawler` |
| **Browser Pool** | `crawler_pool.py:23` | `AsyncWebCrawler`, `asyncio` |
| **Monitoring** | `monitor.py:156` | `redis`, `psutil` |
| **Dashboard** | `monitor_routes.py:89` | `monitor.py`, `websockets` |
| **Hooks** | `hook_manager.py:12` | `api.py`, custom user hooks |

**Startup Chain:**
```
cnode start
 └→ cnode_cli.py:start_cmd()
     └→ server_manager.py:start()
         └→ docker run/stack/compose
             └→ supervisord
                 ├→ redis-server
                 ├→ server.py
                 │   └→ api.py (routes)
                 │       └→ crawler_pool.py (init)
                 └→ monitor.py (collector)
```

---

## Symbolic Notation Summary

```
⊕ Addition/Creation       ⊖ Removal/Cleanup
⊗ Multiplication/Scale    ⊘ Division/Split
→ Flow/Dependency         ← Reverse flow
⇄ Bidirectional           ⇵ Up/Down scale
✓ Success/Complete        ✗ Failure/Error
⚠ Warning                 ⚡ Performance critical
🔒 Lock/Exclusive         🔓 Unlock/Shared
📊 Metrics                📝 Logs
🌐 Network                💾 Storage
🧠 Memory                 🐳 Docker
```

**State Machine:**
```
[STOPPED]  →start→  [STARTING] →ready→   [RUNNING]
[RUNNING]  →scale→  [SCALING]  →done→    [RUNNING]
[RUNNING]  →stop→   [STOPPING] →done→    [STOPPED]
[ANY]      →error→  [ERROR]    →cleanup→ [STOPPED]
```

---

## Quick Reference Commands

```bash
# Install
curl -sSL https://raw.githubusercontent.com/unclecode/crawl4ai/main/deploy/installer/install-cnode.sh | bash

# Deploy
cnode start --replicas 5 --port 11235

# Monitor
open http://localhost:11235/monitor

# Scale
cnode scale 10

# Test
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{"urls":["https://example.com"]}'

# Logs
cnode logs -f

# Teardown
cnode stop --remove-volumes
```

---

## Future AI Agent Instructions

**When debugging:** Read this file first, then check the specific source files mentioned.

**When extending:** Follow existing patterns (context managers, async/await, type hints).

**When optimizing:** Profile first (monitor.py metrics), then optimize hot paths (crawler_pool.py, api.py).

**When troubleshooting memory:** Check browser pool cleanup, WebSocket connection tracking, Redis connection pooling.

**When troubleshooting scaling:** Verify the Docker mode (swarm vs compose), check the nginx config if using compose, review load balancer logs.

---

**END OF ARCHITECTURE MAP**

*Version: 1.0.0 | Last Updated: 2025-10-21 | Token-Optimized for AI Consumption*