From 05921811b8fdf43772c73aca9779c42b86be100f Mon Sep 17 00:00:00 2001 From: unclecode Date: Sat, 18 Oct 2025 12:05:49 +0800 Subject: [PATCH] docs: add comprehensive technical architecture documentation Created ARCHITECTURE.md as a complete technical reference for the Crawl4AI Docker server, replacing the stress test pipeline document with production-grade documentation. Contents: - System overview with architecture diagrams - Core components deep-dive (server, API, utils) - Smart browser pool implementation details - Real-time monitoring system architecture - WebSocket implementation and fallback strategy - Memory management and container detection - Production optimizations and code review fixes - Deployment guides (local, Docker, production) - Comprehensive troubleshooting section - Debug tools and performance tuning - Test suite documentation - Architecture decision log (ADRs) Target audience: Developers maintaining or extending the system Goal: Enable rapid onboarding and confident modifications --- deploy/docker/ARCHITECTURE.md | 1149 +++++++++++++++++++++++++++++++++ 1 file changed, 1149 insertions(+) create mode 100644 deploy/docker/ARCHITECTURE.md diff --git a/deploy/docker/ARCHITECTURE.md b/deploy/docker/ARCHITECTURE.md new file mode 100644 index 00000000..eb49cdae --- /dev/null +++ b/deploy/docker/ARCHITECTURE.md @@ -0,0 +1,1149 @@ +# Crawl4AI Docker Server - Technical Architecture + +**Version**: 0.7.4 +**Last Updated**: October 2025 +**Status**: Production-ready with real-time monitoring + +This document provides a comprehensive technical overview of the Crawl4AI Docker server architecture, including the smart browser pool, real-time monitoring system, and all production optimizations. + +--- + +## Table of Contents + +1. [System Overview](#system-overview) +2. [Core Components](#core-components) +3. [Smart Browser Pool](#smart-browser-pool) +4. [Real-time Monitoring System](#real-time-monitoring-system) +5. [API Layer](#api-layer) +6. 
[Memory Management](#memory-management) +7. [Production Optimizations](#production-optimizations) +8. [Deployment & Operations](#deployment--operations) +9. [Troubleshooting & Debugging](#troubleshooting--debugging) + +--- + +## System Overview + +### Architecture Diagram + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Client Requests │ +└────────────┬────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ FastAPI Server (server.py) │ +│ ├─ REST API Endpoints (/crawl, /html, /md, /llm, etc.) │ +│ ├─ WebSocket Endpoint (/monitor/ws) │ +│ └─ Background Tasks (janitor, timeline_updater) │ +└────┬────────────────────┬────────────────────┬──────────────┘ + │ │ │ + ▼ ▼ ▼ +┌─────────────┐ ┌──────────────────┐ ┌─────────────────┐ +│ Browser │ │ Monitor System │ │ Redis │ +│ Pool │ │ (monitor.py) │ │ (Persistence) │ +│ │ │ │ │ │ +│ PERMANENT ●─┤ │ ├─ Stats │ │ ├─ Endpoint │ +│ HOT_POOL ♨─┤ │ ├─ Requests │ │ │ Stats │ +│ COLD_POOL ❄─┤ │ ├─ Browsers │ │ ├─ Task │ +│ │ │ ├─ Timeline │ │ │ Results │ +│ Janitor 🧹─┤ │ └─ Events/Errors │ │ └─ Cache │ +└─────────────┘ └──────────────────┘ └─────────────────┘ +``` + +### Key Features + +- **10x Memory Efficiency**: Smart 3-tier browser pooling reduces memory from 500-700MB to 50-70MB per concurrent user +- **Real-time Monitoring**: WebSocket-based live dashboard with 2-second update intervals +- **Production-Ready**: Comprehensive error handling, timeouts, cleanup, and graceful shutdown +- **Container-Aware**: Accurate memory detection using cgroup v2/v1 +- **Auto-Recovery**: Graceful WebSocket fallback, lock protection, background workers + +--- + +## Core Components + +### 1. 
Server Core (`server.py`) + +**Responsibilities:** +- FastAPI application lifecycle management +- Route registration and middleware +- Background task orchestration +- Graceful shutdown handling + +**Key Functions:** + +```python +@asynccontextmanager +async def lifespan(app: FastAPI): + """Application lifecycle manager""" + # Startup + - Initialize Redis connection + - Create monitor stats instance + - Start persistence worker + - Initialize permanent browser + - Start janitor (browser cleanup) + - Start timeline updater (5s interval) + + yield + + # Shutdown + - Cancel background tasks + - Persist final monitor stats + - Stop persistence worker + - Close all browsers +``` + +**Configuration:** +- Loaded from `config.yml` +- Browser settings, memory thresholds, rate limiting +- LLM provider credentials +- Server host/port + +### 2. API Layer (`api.py`) + +**Endpoints:** + +| Endpoint | Method | Purpose | Pool Usage | +|----------|--------|---------|------------| +| `/health` | GET | Health check | None | +| `/crawl` | POST | Full crawl with all features | ✓ Pool | +| `/crawl_stream` | POST | Streaming crawl results | ✓ Pool | +| `/html` | POST | HTML extraction | ✓ Pool | +| `/md` | POST | Markdown generation | ✓ Pool | +| `/screenshot` | POST | Page screenshots | ✓ Pool | +| `/pdf` | POST | PDF generation | ✓ Pool | +| `/llm/{path}` | GET/POST | LLM extraction | ✓ Pool | +| `/crawl/job` | POST | Background job creation | ✓ Pool | + +**Request Flow:** + +```python +@app.post("/crawl") +async def crawl(body: CrawlRequest): + # 1. Track request start + request_id = f"req_{uuid4().hex[:8]}" + await get_monitor().track_request_start(request_id, "/crawl", url, config) + + # 2. Get browser from pool + from crawler_pool import get_crawler + crawler = await get_crawler(browser_config) + + # 3. Execute crawl + result = await crawler.arun(url, config=crawler_config) + + # 4. Track request completion + await get_monitor().track_request_end(request_id, success=True) + + # 5. 
Return result (browser stays in pool) + return result +``` + +### 3. Utility Layer (`utils.py`) + +**Container Memory Detection:** + +```python +def get_container_memory_percent() -> float: + """Accurate container memory detection""" + try: + # Try cgroup v2 first + current = int(Path("/sys/fs/cgroup/memory.current").read_text().strip()) + max_mem = int(Path("/sys/fs/cgroup/memory.max").read_text().strip()) + return (current / max_mem) * 100 + except Exception: + pass + try: + # Fallback to cgroup v1 + usage = int(Path("/sys/fs/cgroup/memory/memory.usage_in_bytes").read_text()) + limit = int(Path("/sys/fs/cgroup/memory/memory.limit_in_bytes").read_text()) + return (usage / limit) * 100 + except Exception: + # Final fallback to psutil (may be inaccurate in containers) + return psutil.virtual_memory().percent +``` + +**Helper Functions:** +- `get_base_url()`: Request base URL extraction +- `is_task_id()`: Task ID validation +- `should_cleanup_task()`: TTL-based cleanup logic +- `validate_llm_provider()`: LLM configuration validation + +--- + +## Smart Browser Pool + +### Architecture + +The browser pool implements a 3-tier strategy optimized for real-world usage patterns: + +``` +┌──────────────────────────────────────────────────────────┐ +│ PERMANENT Browser (Default Config) │ +│ ● Always alive, never cleaned │ +│ ● Serves 90% of requests │ +│ ● ~270MB memory │ +└──────────────────────────────────────────────────────────┘ + ▲ + │ 90% of requests + │ +┌──────────────────────────────────────────────────────────┐ +│ HOT_POOL (Frequently Used Configs) │ +│ ♨ Configs used 3+ times │ +│ ♨ Longer TTL (2-5 min depending on memory) │ +│ ♨ ~180MB per browser │ +└──────────────────────────────────────────────────────────┘ + ▲ + │ Promotion at 3 uses + │ +┌──────────────────────────────────────────────────────────┐ +│ COLD_POOL (Rarely Used Configs) │ +│ ❄ New/rare browser configs │ +│ ❄ Short TTL (30s-5min depending on memory) │ +│ ❄ ~180MB per browser │ 
+└──────────────────────────────────────────────────────────┘ +``` + +### Implementation (`crawler_pool.py`) + +**Core Data Structures:** + +```python +PERMANENT: Optional[AsyncWebCrawler] = None # Default browser +HOT_POOL: Dict[str, AsyncWebCrawler] = {} # Frequent configs +COLD_POOL: Dict[str, AsyncWebCrawler] = {} # Rare configs +LAST_USED: Dict[str, float] = {} # Timestamp tracking +USAGE_COUNT: Dict[str, int] = {} # Usage counter +LOCK = asyncio.Lock() # Thread-safe access +``` + +**Browser Acquisition Flow:** + +```python +async def get_crawler(cfg: BrowserConfig) -> AsyncWebCrawler: + sig = _sig(cfg) # SHA1 hash of config + + async with LOCK: # Prevent race conditions + # 1. Check permanent browser + if _is_default_config(sig): + return PERMANENT + + # 2. Check hot pool + if sig in HOT_POOL: + USAGE_COUNT[sig] += 1 + return HOT_POOL[sig] + + # 3. Check cold pool (with promotion logic) + if sig in COLD_POOL: + USAGE_COUNT[sig] += 1 + if USAGE_COUNT[sig] >= 3: + # Promote to hot pool + HOT_POOL[sig] = COLD_POOL.pop(sig) + await get_monitor().track_janitor_event("promote", sig, {...}) + return HOT_POOL[sig] + return COLD_POOL[sig] + + # 4. Memory check before creating new + if get_container_memory_percent() >= MEM_LIMIT: + raise MemoryError(f"Memory at {mem}%, refusing new browser") + + # 5. 
Create new browser in cold pool + crawler = AsyncWebCrawler(config=cfg) + await crawler.start() + COLD_POOL[sig] = crawler + return crawler +``` + +**Janitor (Adaptive Cleanup):** + +```python +async def janitor(): + """Memory-adaptive browser cleanup""" + while True: + mem_pct = get_container_memory_percent() + + # Adaptive intervals based on memory pressure + if mem_pct > 80: + interval, cold_ttl, hot_ttl = 10, 30, 120 # Aggressive + elif mem_pct > 60: + interval, cold_ttl, hot_ttl = 30, 60, 300 # Moderate + else: + interval, cold_ttl, hot_ttl = 60, 300, 600 # Relaxed + + await asyncio.sleep(interval) + now = time.time() + + async with LOCK: + # Clean cold pool first (less valuable) + for sig in list(COLD_POOL.keys()): + if now - LAST_USED[sig] > cold_ttl: + await COLD_POOL[sig].close() + del COLD_POOL[sig], LAST_USED[sig], USAGE_COUNT[sig] + await get_monitor().track_janitor_event("close_cold", sig, {...}) + + # Clean hot pool (more conservative) + for sig in list(HOT_POOL.keys()): + if now - LAST_USED[sig] > hot_ttl: + await HOT_POOL[sig].close() + del HOT_POOL[sig], LAST_USED[sig], USAGE_COUNT[sig] + await get_monitor().track_janitor_event("close_hot", sig, {...}) +``` + +**Config Signature Generation:** + +```python +def _sig(cfg: BrowserConfig) -> str: + """Generate unique signature for browser config""" + payload = json.dumps(cfg.to_dict(), sort_keys=True, separators=(",",":")) + return hashlib.sha1(payload.encode()).hexdigest() +``` + +--- + +## Real-time Monitoring System + +### Architecture + +The monitoring system provides real-time insights via WebSocket with automatic fallback to HTTP polling. 
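On the client side, that fallback reduces to a small reconnect schedule: a handful of WebSocket attempts with linearly growing backoff, then a permanent switch to HTTP polling. A minimal sketch of the decision logic — illustrative only; the attempt limit and 2-second base delay mirror the dashboard JavaScript, while `next_action` and `POLL_INTERVAL` are hypothetical names:

```python
# Sketch of the dashboard's WebSocket-vs-polling decision (illustrative).
# After MAX_WS_RECONNECT failed attempts the client gives up on WebSocket
# and switches to HTTP polling for good.

MAX_WS_RECONNECT = 5    # attempts before falling back (mirrors the frontend)
WS_BACKOFF_BASE = 2.0   # seconds; delay grows linearly per failed attempt
POLL_INTERVAL = 5.0     # hypothetical HTTP polling interval in seconds

def next_action(failed_attempts: int) -> tuple[str, float]:
    """Return what the client should do next and the delay before doing it."""
    if failed_attempts >= MAX_WS_RECONNECT:
        return ("poll", POLL_INTERVAL)                       # permanent fallback
    return ("reconnect", WS_BACKOFF_BASE * failed_attempts)  # retry WebSocket
```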
+ +**Components:** + +``` +┌─────────────────────────────────────────────────────────┐ +│ MonitorStats Class (monitor.py) │ +│ ├─ In-memory queues (deques with maxlen) │ +│ ├─ Background persistence worker │ +│ ├─ Timeline tracking (5-min window, 5s resolution) │ +│ └─ Time-based expiry (5min for old entries) │ +└───────────┬─────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────┐ +│ WebSocket Endpoint (/monitor/ws) │ +│ ├─ 2-second update intervals │ +│ ├─ Auto-reconnect with exponential backoff │ +│ ├─ Comprehensive data payload │ +│ └─ Graceful fallback to polling │ +└───────────┬─────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────┐ +│ Dashboard UI (static/monitor/index.html) │ +│ ├─ Connection status indicator │ +│ ├─ Live updates (health, requests, browsers) │ +│ ├─ Timeline charts (memory, requests, browsers) │ +│ └─ Janitor events & error logs │ +└─────────────────────────────────────────────────────────┘ +``` + +### Monitor Stats (`monitor.py`) + +**Data Structures:** + +```python +class MonitorStats: + # In-memory queues + active_requests: Dict[str, Dict] # Currently processing + completed_requests: deque(maxlen=100) # Last 100 completed + janitor_events: deque(maxlen=100) # Cleanup events + errors: deque(maxlen=100) # Error log + + # Endpoint stats (persisted to Redis) + endpoint_stats: Dict[str, Dict] # Aggregated stats + + # Timeline data (5min window, 5s resolution = 60 points) + memory_timeline: deque(maxlen=60) + requests_timeline: deque(maxlen=60) + browser_timeline: deque(maxlen=60) + + # Background persistence + _persist_queue: asyncio.Queue(maxsize=10) + _persist_worker_task: Optional[asyncio.Task] +``` + +**Request Tracking:** + +```python +async def track_request_start(request_id, endpoint, url, config): + """Track new request""" + self.active_requests[request_id] = { + "id": request_id, + "endpoint": endpoint, + "url": url, 
+ "start_time": time.time(), + "mem_start": psutil.Process().memory_info().rss / (1024 * 1024) + } + + # Update endpoint stats + if endpoint not in self.endpoint_stats: + self.endpoint_stats[endpoint] = { + "count": 0, "total_time": 0, "errors": 0, + "pool_hits": 0, "success": 0 + } + self.endpoint_stats[endpoint]["count"] += 1 + + # Queue background persistence + self._persist_queue.put_nowait(True) + +async def track_request_end(request_id, success, error=None, ...): + """Track request completion""" + req_info = self.active_requests.pop(request_id) + elapsed = time.time() - req_info["start_time"] + mem_delta = current_mem - req_info["mem_start"] + + # Add to completed queue + self.completed_requests.append({ + "id": request_id, + "endpoint": req_info["endpoint"], + "url": req_info["url"], + "success": success, + "elapsed": elapsed, + "mem_delta": mem_delta, + "end_time": time.time() + }) + + # Update stats + self.endpoint_stats[endpoint]["success" if success else "errors"] += 1 + await self._persist_endpoint_stats() +``` + +**Background Persistence Worker:** + +```python +async def _persistence_worker(self): + """Background worker for Redis persistence""" + while True: + try: + await self._persist_queue.get() + await self._persist_endpoint_stats() + self._persist_queue.task_done() + except asyncio.CancelledError: + break + except Exception as e: + logger.error(f"Persistence worker error: {e}") + +async def _persist_endpoint_stats(self): + """Persist stats to Redis with error handling""" + try: + await self.redis.set( + "monitor:endpoint_stats", + json.dumps(self.endpoint_stats), + ex=86400 # 24h TTL + ) + except Exception as e: + logger.warning(f"Failed to persist endpoint stats: {e}") +``` + +**Time-based Cleanup:** + +```python +def _cleanup_old_entries(self, max_age_seconds=300): + """Remove entries older than 5 minutes""" + now = time.time() + cutoff = now - max_age_seconds + + # Clean completed requests + while self.completed_requests and \ + 
self.completed_requests[0].get("end_time", 0) < cutoff: + self.completed_requests.popleft() + + # Clean janitor events + while self.janitor_events and \ + self.janitor_events[0].get("timestamp", 0) < cutoff: + self.janitor_events.popleft() + + # Clean errors + while self.errors and \ + self.errors[0].get("timestamp", 0) < cutoff: + self.errors.popleft() +``` + +### WebSocket Implementation (`monitor_routes.py`) + +**Endpoint:** + +```python +@router.websocket("/ws") +async def websocket_endpoint(websocket: WebSocket): + """Real-time monitoring updates""" + await websocket.accept() + logger.info("WebSocket client connected") + + try: + while True: + try: + monitor = get_monitor() + + # Gather comprehensive monitoring data + data = { + "timestamp": time.time(), + "health": await monitor.get_health_summary(), + "requests": { + "active": monitor.get_active_requests(), + "completed": monitor.get_completed_requests(limit=10) + }, + "browsers": await monitor.get_browser_list(), + "timeline": { + "memory": monitor.get_timeline_data("memory", "5m"), + "requests": monitor.get_timeline_data("requests", "5m"), + "browsers": monitor.get_timeline_data("browsers", "5m") + }, + "janitor": monitor.get_janitor_log(limit=10), + "errors": monitor.get_errors_log(limit=10) + } + + await websocket.send_json(data) + await asyncio.sleep(2) # 2-second update interval + + except WebSocketDisconnect: + logger.info("WebSocket client disconnected") + break + except Exception as e: + logger.error(f"WebSocket error: {e}", exc_info=True) + await asyncio.sleep(2) + except Exception as e: + logger.error(f"WebSocket connection error: {e}", exc_info=True) + finally: + logger.info("WebSocket connection closed") +``` + +**Input Validation:** + +```python +@router.get("/requests") +async def get_requests(status: str = "all", limit: int = 50): + # Input validation + if status not in ["all", "active", "completed", "success", "error"]: + raise HTTPException(400, f"Invalid status: {status}") + if limit < 1 
or limit > 1000: + raise HTTPException(400, f"Invalid limit: {limit}") + + monitor = get_monitor() + # ... return data +``` + +### Frontend Dashboard + +**Connection Management:** + +```javascript +// WebSocket with auto-reconnect +function connectWebSocket() { + if (wsReconnectAttempts >= MAX_WS_RECONNECT) { + // Fallback to polling after 5 failed attempts + useWebSocket = false; + updateConnectionStatus('polling'); + startAutoRefresh(); + return; + } + + updateConnectionStatus('connecting'); + const wsUrl = `${protocol}//${window.location.host}/monitor/ws`; + websocket = new WebSocket(wsUrl); + + websocket.onopen = () => { + wsReconnectAttempts = 0; + updateConnectionStatus('connected'); + stopAutoRefresh(); // Stop polling + }; + + websocket.onmessage = (event) => { + const data = JSON.parse(event.data); + updateDashboard(data); // Update all sections + }; + + websocket.onclose = () => { + updateConnectionStatus('disconnected', 'Reconnecting...'); + if (useWebSocket) { + setTimeout(connectWebSocket, 2000 * wsReconnectAttempts); + } else { + startAutoRefresh(); // Fallback to polling + } + }; +} +``` + +**Connection Status Indicator:** + +| Status | Color | Animation | Meaning | +|--------|-------|-----------|---------| +| Live | Green | Pulsing fast | WebSocket connected | +| Connecting... 
| Yellow | Pulsing slow | Attempting connection | +| Polling | Blue | Pulsing slow | HTTP polling fallback | +| Disconnected | Red | None | Connection failed | + +--- + +## API Layer + +### Request/Response Flow + +``` +Client Request + │ + ▼ +FastAPI Route Handler + │ + ├─→ Monitor: track_request_start() + │ + ├─→ Browser Pool: get_crawler(config) + │ │ + │ ├─→ Check PERMANENT + │ ├─→ Check HOT_POOL + │ ├─→ Check COLD_POOL + │ └─→ Create New (if needed) + │ + ├─→ Execute Crawl + │ │ + │ ├─→ Fetch page + │ ├─→ Extract content + │ ├─→ Apply filters/strategies + │ └─→ Return result + │ + ├─→ Monitor: track_request_end() + │ + └─→ Return Response (browser stays in pool) +``` + +### Error Handling Strategy + +**Levels:** + +1. **Route Level**: HTTP exceptions with proper status codes +2. **Monitor Level**: Try-except with logging, non-critical failures +3. **Pool Level**: Memory checks, lock protection, graceful degradation +4. **WebSocket Level**: Auto-reconnect, fallback to polling + +**Example:** + +```python +@app.post("/crawl") +async def crawl(body: CrawlRequest): + request_id = f"req_{uuid4().hex[:8]}" + + try: + # Monitor tracking (non-blocking on failure) + try: + await get_monitor().track_request_start(...) 
+ except: + pass # Monitor not critical + + # Browser acquisition (with memory protection) + crawler = await get_crawler(browser_config) + + # Crawl execution + result = await crawler.arun(url, config=cfg) + + # Success tracking + try: + await get_monitor().track_request_end(request_id, success=True) + except: + pass + + return result + + except MemoryError as e: + # Memory pressure - return 503 + await get_monitor().track_request_end(request_id, success=False, error=str(e)) + raise HTTPException(503, "Server at capacity") + except Exception as e: + # General errors - return 500 + await get_monitor().track_request_end(request_id, success=False, error=str(e)) + raise HTTPException(500, str(e)) +``` + +--- + +## Memory Management + +### Container Memory Detection + +**Priority Order:** +1. cgroup v2 (`/sys/fs/cgroup/memory.{current,max}`) +2. cgroup v1 (`/sys/fs/cgroup/memory/memory.{usage,limit}_in_bytes`) +3. psutil fallback (may be inaccurate in containers) + +**Usage:** + +```python +mem_pct = get_container_memory_percent() + +if mem_pct >= 95: # Critical + raise MemoryError("Refusing new browser") +elif mem_pct > 80: # High pressure + # Janitor: aggressive cleanup (10s interval, 30s TTL) +elif mem_pct > 60: # Moderate pressure + # Janitor: moderate cleanup (30s interval, 60s TTL) +else: # Normal + # Janitor: relaxed cleanup (60s interval, 300s TTL) +``` + +### Memory Budgets + +| Component | Memory | Notes | +|-----------|--------|-------| +| Base Container | 270 MB | Python + FastAPI + libraries | +| Permanent Browser | 270 MB | Always-on default browser | +| Hot Pool Browser | 180 MB | Per frequently-used config | +| Cold Pool Browser | 180 MB | Per rarely-used config | +| Active Crawl Overhead | 50-200 MB | Temporary, released after request | + +**Example Calculation:** + +``` +Container: 270 MB +Permanent: 270 MB +2x Hot: 360 MB +1x Cold: 180 MB +Total: 1080 MB baseline + +Under load (10 concurrent): ++ Active crawls: ~500-1000 MB += Peak: 1.5-2 GB +``` + 
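The memory-pressure tiers above can be expressed as one pure function — the same mapping the janitor uses to pick its sweep interval and TTLs. A minimal sketch under the documented thresholds; the names `janitor_params` and `can_create_browser` are hypothetical, not the actual implementation:

```python
# Illustrative mapping from container memory pressure to janitor behavior.
# Thresholds and values mirror the tiers documented above.

CRITICAL_MEM_PCT = 95.0  # refuse new browsers at or above this level

def janitor_params(mem_pct: float) -> tuple[int, int, int]:
    """Return (sweep_interval_s, cold_ttl_s, hot_ttl_s) for a memory level."""
    if mem_pct > 80:
        return (10, 30, 120)     # aggressive: sweep often, short TTLs
    if mem_pct > 60:
        return (30, 60, 300)     # moderate
    return (60, 300, 600)        # relaxed

def can_create_browser(mem_pct: float) -> bool:
    """Gate for spawning a new cold-pool browser."""
    return mem_pct < CRITICAL_MEM_PCT
```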
+--- + +## Production Optimizations + +### Code Review Fixes Applied + +**Critical (3):** +1. ✅ Lock protection for browser pool access +2. ✅ Async track_janitor_event implementation +3. ✅ Error handling in request tracking + +**Important (8):** +4. ✅ Background persistence worker (replaces fire-and-forget) +5. ✅ Time-based expiry (5min cleanup for old entries) +6. ✅ Input validation (status, limit, metric, window) +7. ✅ Timeline updater timeout (4s max) +8. ✅ Warn when killing browsers with active requests +9. ✅ Monitor cleanup on shutdown +10. ✅ Document memory estimates +11. ✅ Structured error responses (HTTPException) + +### Performance Characteristics + +**Latency:** + +| Scenario | Time | Notes | +|----------|------|-------| +| Pool Hit (Permanent) | <100ms | Browser ready | +| Pool Hit (Hot/Cold) | <100ms | Browser ready | +| New Browser Creation | 3-5s | Chromium startup | +| Simple Page Fetch | 1-3s | Network + render | +| Complex Extraction | 5-10s | LLM processing | + +**Throughput:** + +| Load | Concurrent | Response Time | Success Rate | +|------|-----------|---------------|--------------| +| Light | 1-10 | <3s | 100% | +| Medium | 10-50 | 3-8s | 100% | +| Heavy | 50-100 | 8-15s | 95-100% | +| Extreme | 100+ | 15-30s | 80-95% | + +### Reliability Features + +**Race Condition Protection:** +- `asyncio.Lock` on all pool operations +- Lock on browser pool stats access +- Async janitor event tracking + +**Graceful Degradation:** +- WebSocket → HTTP polling fallback +- Redis persistence failures (logged, non-blocking) +- Monitor tracking failures (logged, non-blocking) + +**Resource Cleanup:** +- Janitor cleanup (adaptive intervals) +- Time-based expiry (5min for old data) +- Shutdown cleanup (persist final stats, close browsers) +- Background worker cancellation + +--- + +## Deployment & Operations + +### Running Locally + +```bash +# Install dependencies +pip install -r requirements.txt + +# Configure +cp .llm.env.example .llm.env +# Edit .llm.env with 
your API keys + +# Run server +python -m uvicorn server:app --host 0.0.0.0 --port 11235 --reload +``` + +### Docker Deployment + +```bash +# Build image +docker build -t crawl4ai:latest -f Dockerfile . + +# Run container +docker run -d \ + --name crawl4ai \ + -p 11235:11235 \ + --shm-size=1g \ + --env-file .llm.env \ + crawl4ai:latest +``` + +### Production Configuration + +**`config.yml` Key Settings:** + +```yaml +crawler: + browser: + extra_args: + - "--disable-gpu" + - "--disable-dev-shm-usage" + - "--no-sandbox" + kwargs: + headless: true + text_mode: true # Reduces memory by 30-40% + + memory_threshold_percent: 95 # Refuse new browsers above this + + pool: + idle_ttl_sec: 300 # Base TTL for cold pool (5 min) + + rate_limiter: + enabled: true + base_delay: [1.0, 3.0] # Random delay between requests +``` + +### Monitoring + +**Access Dashboard:** +``` +http://localhost:11235/static/monitor/ +``` + +**Check Logs:** +```bash +# All activity +docker logs crawl4ai -f + +# Pool activity only +docker logs crawl4ai | grep -E "(🔥|♨️|❄️|🆕|⬆️)" + +# Errors only +docker logs crawl4ai | grep ERROR +``` + +**Metrics:** +```bash +# Container stats +docker stats crawl4ai + +# Memory percentage +curl http://localhost:11235/monitor/health | jq '.container.memory_percent' + +# Pool status +curl http://localhost:11235/monitor/browsers | jq '.summary' +``` + +--- + +## Troubleshooting & Debugging + +### Common Issues + +**1. WebSocket Not Connecting** + +Symptoms: Yellow "Connecting..." indicator, falls back to blue "Polling" + +Debug: +```bash +# Check server logs +docker logs crawl4ai | grep WebSocket + +# Test WebSocket manually +python test-websocket.py +``` + +Fix: Check firewall/proxy settings, ensure port 11235 accessible + +**2. 
High Memory Usage** + +Symptoms: Container OOM kills, 503 errors, slow responses + +Debug: +```bash +# Check current memory +curl http://localhost:11235/monitor/health | jq '.container.memory_percent' + +# Check browser pool +curl http://localhost:11235/monitor/browsers + +# Check janitor activity +docker logs crawl4ai | grep "🧹" +``` + +Fix: +- Lower `memory_threshold_percent` in config.yml +- Increase container memory limit +- Enable `text_mode: true` in browser config +- Reduce idle_ttl_sec for more aggressive cleanup + +**3. Browser Pool Not Reusing** + +Symptoms: High "New Created" count, poor reuse rate + +Debug: +```python +# Check config signature matching +from crawl4ai import BrowserConfig +import json, hashlib + +cfg = BrowserConfig(...) # Your config +sig = hashlib.sha1(json.dumps(cfg.to_dict(), sort_keys=True).encode()).hexdigest() +print(f"Config signature: {sig[:8]}") +``` + +Check logs for permanent browser signature: +```bash +docker logs crawl4ai | grep "permanent" +``` + +Fix: Ensure endpoint configs match permanent browser config exactly + +**4. 
Janitor Not Cleaning Up** + +Symptoms: Memory stays high after idle period + +Debug: +```bash +# Check janitor events +curl http://localhost:11235/monitor/logs/janitor + +# Check pool stats over time +watch -n 5 'curl -s http://localhost:11235/monitor/browsers | jq ".summary"' +``` + +Fix: +- Janitor runs every 10-60s depending on memory +- Hot pool browsers have longer TTL (by design) +- Permanent browser never cleaned (by design) + +### Debug Tools + +**Config Signature Checker:** + +```python +from crawl4ai import BrowserConfig +import json, hashlib + +def check_sig(cfg: BrowserConfig) -> str: + payload = json.dumps(cfg.to_dict(), sort_keys=True, separators=(",",":")) + sig = hashlib.sha1(payload.encode()).hexdigest() + return sig[:8] + +# Example +cfg1 = BrowserConfig() +cfg2 = BrowserConfig(headless=True) +print(f"Default: {check_sig(cfg1)}") +print(f"Custom: {check_sig(cfg2)}") +``` + +**Monitor Stats Dumper:** + +```bash +#!/bin/bash +# Dump all monitor stats to JSON + +curl -s http://localhost:11235/monitor/health > health.json +curl -s http://localhost:11235/monitor/requests?limit=100 > requests.json +curl -s http://localhost:11235/monitor/browsers > browsers.json +curl -s http://localhost:11235/monitor/logs/janitor > janitor.json +curl -s http://localhost:11235/monitor/logs/errors > errors.json + +echo "Monitor stats dumped to *.json files" +``` + +**WebSocket Test Script:** + +```python +# test-websocket.py (included in repo) +import asyncio +import websockets +import json + +async def test_websocket(): + uri = "ws://localhost:11235/monitor/ws" + async with websockets.connect(uri) as websocket: + for i in range(5): + message = await websocket.recv() + data = json.loads(message) + print(f"\nUpdate #{i+1}:") + print(f" Health: CPU {data['health']['container']['cpu_percent']}%") + print(f" Active Requests: {len(data['requests']['active'])}") + print(f" Browsers: {len(data['browsers'])}") + +asyncio.run(test_websocket()) +``` + +### Performance Tuning + 
+**For High Throughput:** + +```yaml +# config.yml +crawler: + memory_threshold_percent: 90 # Allow more browsers + pool: + idle_ttl_sec: 600 # Keep browsers longer + rate_limiter: + enabled: false # Disable for max speed +``` + +**For Low Memory:** + +```yaml +# config.yml +crawler: + browser: + kwargs: + text_mode: true # 30-40% memory reduction + memory_threshold_percent: 80 # More conservative + pool: + idle_ttl_sec: 60 # Aggressive cleanup +``` + +**For Stability:** + +```yaml +# config.yml +crawler: + memory_threshold_percent: 85 # Balanced + pool: + idle_ttl_sec: 300 # Moderate cleanup + rate_limiter: + enabled: true + base_delay: [2.0, 5.0] # Prevent rate limiting +``` + +--- + +## Test Suite + +**Location:** `deploy/docker/tests/` + +**Tests:** + +1. `test_1_basic.py` - Health check, container lifecycle +2. `test_2_memory.py` - Memory tracking, leak detection +3. `test_3_pool.py` - Pool reuse validation +4. `test_4_concurrent.py` - Concurrent load testing +5. `test_5_pool_stress.py` - Multi-config pool behavior +6. `test_6_multi_endpoint.py` - All endpoint validation +7. `test_7_cleanup.py` - Janitor cleanup verification + +**Run All Tests:** + +```bash +cd deploy/docker/tests +pip install -r requirements.txt + +# Build image first +cd /path/to/repo +docker build -t crawl4ai-local:latest . + +# Run tests +cd deploy/docker/tests +for test in test_*.py; do + echo "Running $test..." + python $test || break +done +``` + +--- + +## Architecture Decision Log + +### Why 3-Tier Pool? 
+ +**Decision:** PERMANENT + HOT_POOL + COLD_POOL + +**Rationale:** +- 90% of requests use default config → permanent browser serves most traffic +- Frequent variants (hot) deserve longer TTL for better reuse +- Rare configs (cold) should be cleaned aggressively to save memory + +**Alternatives Considered:** +- Single pool: Too simple, no optimization for common case +- LRU cache: Doesn't capture "hot" vs "rare" distinction +- Per-endpoint pools: Too complex, over-engineering + +### Why WebSocket + Polling Fallback? + +**Decision:** WebSocket primary, HTTP polling backup + +**Rationale:** +- WebSocket provides real-time updates (2s interval) +- Polling fallback ensures reliability in restricted networks +- Auto-reconnect handles temporary disconnections + +**Alternatives Considered:** +- Polling only: Works but higher latency, more server load +- WebSocket only: Fails in restricted networks +- Server-Sent Events: One-way, no client messages + +### Why Background Persistence Worker? + +**Decision:** Queue-based worker for Redis operations + +**Rationale:** +- Fire-and-forget loses data on failures +- Queue provides buffering and retry capability +- Non-blocking keeps request path fast + +**Alternatives Considered:** +- Synchronous writes: Blocks request handling +- Fire-and-forget: Silent failures +- Batch writes: Complex state management + +--- + +## Contributing + +When modifying the architecture: + +1. **Maintain backward compatibility** in API contracts +2. **Add tests** for new functionality +3. **Update this document** with architectural changes +4. **Profile memory impact** before production +5. 
**Test under load** using the test suite + +**Code Review Checklist:** +- [ ] Race conditions protected with locks +- [ ] Error handling with proper logging +- [ ] Graceful degradation on failures +- [ ] Memory impact measured +- [ ] Tests added/updated +- [ ] Documentation updated + +--- + +## License & Credits + +**Crawl4AI** - Created by Unclecode +**GitHub**: https://github.com/unclecode/crawl4ai +**License**: See LICENSE file in repository + +**Architecture & Optimizations**: October 2025 +**WebSocket Monitoring**: October 2025 +**Production Hardening**: October 2025 + +--- + +**End of Technical Architecture Document** + +For questions or issues, please open a GitHub issue at: +https://github.com/unclecode/crawl4ai/issues