# Crawl4AI Docker Server - Technical Architecture **Version**: 0.7.4 **Last Updated**: October 2025 **Status**: Production-ready with real-time monitoring This document provides a comprehensive technical overview of the Crawl4AI Docker server architecture, including the smart browser pool, real-time monitoring system, and all production optimizations. --- ## Table of Contents 1. [System Overview](#system-overview) 2. [Core Components](#core-components) 3. [Smart Browser Pool](#smart-browser-pool) 4. [Real-time Monitoring System](#real-time-monitoring-system) 5. [API Layer](#api-layer) 6. [Memory Management](#memory-management) 7. [Production Optimizations](#production-optimizations) 8. [Deployment & Operations](#deployment--operations) 9. [Troubleshooting & Debugging](#troubleshooting--debugging) --- ## System Overview ### Architecture Diagram ``` ┌─────────────────────────────────────────────────────────────┐ │ Client Requests │ └────────────┬────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ FastAPI Server (server.py) │ │ ├─ REST API Endpoints (/crawl, /html, /md, /llm, etc.) │ │ ├─ WebSocket Endpoint (/monitor/ws) │ │ └─ Background Tasks (janitor, timeline_updater) │ └────┬────────────────────┬────────────────────┬──────────────┘ │ │ │ ▼ ▼ ▼ ┌─────────────┐ ┌──────────────────┐ ┌─────────────────┐ │ Browser │ │ Monitor System │ │ Redis │ │ Pool │ │ (monitor.py) │ │ (Persistence) │ │ │ │ │ │ │ │ PERMANENT ●─┤ │ ├─ Stats │ │ ├─ Endpoint │ │ HOT_POOL ♨─┤ │ ├─ Requests │ │ │ Stats │ │ COLD_POOL ❄─┤ │ ├─ Browsers │ │ ├─ Task │ │ │ │ ├─ Timeline │ │ │ Results │ │ Janitor 🧹─┤ │ └─ Events/Errors │ │ └─ Cache │ └─────────────┘ └──────────────────┘ └─────────────────┘ ``` ### Key Features - **10x Memory Efficiency**: Smart 3-tier browser pooling reduces memory from 500-700MB to 50-70MB per concurrent user - **Real-time Monitoring**: WebSocket-based live dashboard with 2-second update intervals - **Production-Ready**: Comprehensive error handling, timeouts, cleanup, and graceful shutdown - **Container-Aware**: Accurate memory detection using cgroup v2/v1 - **Auto-Recovery**: Graceful WebSocket fallback, lock protection, background workers --- ## Core Components ### 1. Server Core (`server.py`) **Responsibilities:** - FastAPI application lifecycle management - Route registration and middleware - Background task orchestration - Graceful shutdown handling **Key Functions:** ```python @asynccontextmanager async def lifespan(app: FastAPI): """Application lifecycle manager""" # Startup - Initialize Redis connection - Create monitor stats instance - Start persistence worker - Initialize permanent browser - Start janitor (browser cleanup) - Start timeline updater (5s interval) yield # Shutdown - Cancel background tasks - Persist final monitor stats - Stop persistence worker - Close all browsers ``` **Configuration:** - Loaded from `config.yml` - Browser settings, memory thresholds, rate limiting - LLM provider credentials - Server host/port ### 2. API Layer (`api.py`) **Endpoints:** | Endpoint | Method | Purpose | Pool Usage | |----------|--------|---------|------------| | `/health` | GET | Health check | None | | `/crawl` | POST | Full crawl with all features | ✓ Pool | | `/crawl_stream` | POST | Streaming crawl results | ✓ Pool | | `/html` | POST | HTML extraction | ✓ Pool | | `/md` | POST | Markdown generation | ✓ Pool | | `/screenshot` | POST | Page screenshots | ✓ Pool | | `/pdf` | POST | PDF generation | ✓ Pool | | `/llm/{path}` | GET/POST | LLM extraction | ✓ Pool | | `/crawl/job` | POST | Background job creation | ✓ Pool | **Request Flow:** ```python @app.post("/crawl") async def crawl(body: CrawlRequest): # 1. Track request start request_id = f"req_{uuid4().hex[:8]}" await get_monitor().track_request_start(request_id, "/crawl", url, config) # 2. Get browser from pool from crawler_pool import get_crawler crawler = await get_crawler(browser_config) # 3. Execute crawl result = await crawler.arun(url, config=crawler_config) # 4. Track request completion await get_monitor().track_request_end(request_id, success=True) # 5. Return result (browser stays in pool) return result ``` ### 3. Utility Layer (`utils.py`) **Container Memory Detection:** ```python def get_container_memory_percent() -> float: """Accurate container memory detection""" try: # Try cgroup v2 first current = int(Path("/sys/fs/cgroup/memory.current").read_text().strip()) max_mem = int(Path("/sys/fs/cgroup/memory.max").read_text().strip()) return (current / max_mem) * 100 except: # Fallback to cgroup v1 usage = int(Path("/sys/fs/cgroup/memory/memory.usage_in_bytes").read_text()) limit = int(Path("/sys/fs/cgroup/memory/memory.limit_in_bytes").read_text()) return (usage / limit) * 100 except: # Final fallback to psutil (may be inaccurate in containers) return psutil.virtual_memory().percent ``` **Helper Functions:** - `get_base_url()`: Request base URL extraction - `is_task_id()`: Task ID validation - `should_cleanup_task()`: TTL-based cleanup logic - `validate_llm_provider()`: LLM configuration validation --- ## Smart Browser Pool ### Architecture The browser pool implements a 3-tier strategy optimized for real-world usage patterns: ``` ┌──────────────────────────────────────────────────────────┐ │ PERMANENT Browser (Default Config) │ │ ● Always alive, never cleaned │ │ ● Serves 90% of requests │ │ ● ~270MB memory │ └──────────────────────────────────────────────────────────┘ ▲ │ 90% of requests │ ┌──────────────────────────────────────────────────────────┐ │ HOT_POOL (Frequently Used Configs) │ │ ♨ Configs used 3+ times │ │ ♨ Longer TTL (2-5 min depending on memory) │ │ ♨ ~180MB per browser │ └──────────────────────────────────────────────────────────┘ ▲ │ Promotion at 3 uses │ ┌──────────────────────────────────────────────────────────┐ │ COLD_POOL (Rarely Used Configs) │ │ ❄ New/rare browser configs │ │ ❄ Short TTL (30s-5min depending on memory) │ │ ❄ ~180MB per browser │ └──────────────────────────────────────────────────────────┘ ``` ### Implementation (`crawler_pool.py`) **Core Data Structures:** ```python PERMANENT: Optional[AsyncWebCrawler] = None # Default browser HOT_POOL: Dict[str, AsyncWebCrawler] = {} # Frequent configs COLD_POOL: Dict[str, AsyncWebCrawler] = {} # Rare configs LAST_USED: Dict[str, float] = {} # Timestamp tracking USAGE_COUNT: Dict[str, int] = {} # Usage counter LOCK = asyncio.Lock() # Thread-safe access ``` **Browser Acquisition Flow:** ```python async def get_crawler(cfg: BrowserConfig) -> AsyncWebCrawler: sig = _sig(cfg) # SHA1 hash of config async with LOCK: # Prevent race conditions # 1. Check permanent browser if _is_default_config(sig): return PERMANENT # 2. Check hot pool if sig in HOT_POOL: USAGE_COUNT[sig] += 1 return HOT_POOL[sig] # 3. Check cold pool (with promotion logic) if sig in COLD_POOL: USAGE_COUNT[sig] += 1 if USAGE_COUNT[sig] >= 3: # Promote to hot pool HOT_POOL[sig] = COLD_POOL.pop(sig) await get_monitor().track_janitor_event("promote", sig, {...}) return HOT_POOL[sig] return COLD_POOL[sig] # 4. Memory check before creating new if get_container_memory_percent() >= MEM_LIMIT: raise MemoryError(f"Memory at {mem}%, refusing new browser") # 5. Create new browser in cold pool crawler = AsyncWebCrawler(config=cfg) await crawler.start() COLD_POOL[sig] = crawler return crawler ``` **Janitor (Adaptive Cleanup):** ```python async def janitor(): """Memory-adaptive browser cleanup""" while True: mem_pct = get_container_memory_percent() # Adaptive intervals based on memory pressure if mem_pct > 80: interval, cold_ttl, hot_ttl = 10, 30, 120 # Aggressive elif mem_pct > 60: interval, cold_ttl, hot_ttl = 30, 60, 300 # Moderate else: interval, cold_ttl, hot_ttl = 60, 300, 600 # Relaxed await asyncio.sleep(interval) async with LOCK: # Clean cold pool first (less valuable) for sig in list(COLD_POOL.keys()): if now - LAST_USED[sig] > cold_ttl: await COLD_POOL[sig].close() del COLD_POOL[sig], LAST_USED[sig], USAGE_COUNT[sig] await track_janitor_event("close_cold", sig, {...}) # Clean hot pool (more conservative) for sig in list(HOT_POOL.keys()): if now - LAST_USED[sig] > hot_ttl: await HOT_POOL[sig].close() del HOT_POOL[sig], LAST_USED[sig], USAGE_COUNT[sig] await track_janitor_event("close_hot", sig, {...}) ``` **Config Signature Generation:** ```python def _sig(cfg: BrowserConfig) -> str: """Generate unique signature for browser config""" payload = json.dumps(cfg.to_dict(), sort_keys=True, separators=(",",":")) return hashlib.sha1(payload.encode()).hexdigest() ``` --- ## Real-time Monitoring System ### Architecture The monitoring system provides real-time insights via WebSocket with automatic fallback to HTTP polling. **Components:** ``` ┌─────────────────────────────────────────────────────────┐ │ MonitorStats Class (monitor.py) │ │ ├─ In-memory queues (deques with maxlen) │ │ ├─ Background persistence worker │ │ ├─ Timeline tracking (5-min window, 5s resolution) │ │ └─ Time-based expiry (5min for old entries) │ └───────────┬─────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────┐ │ WebSocket Endpoint (/monitor/ws) │ │ ├─ 2-second update intervals │ │ ├─ Auto-reconnect with exponential backoff │ │ ├─ Comprehensive data payload │ │ └─ Graceful fallback to polling │ └───────────┬─────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────┐ │ Dashboard UI (static/monitor/index.html) │ │ ├─ Connection status indicator │ │ ├─ Live updates (health, requests, browsers) │ │ ├─ Timeline charts (memory, requests, browsers) │ │ └─ Janitor events & error logs │ └─────────────────────────────────────────────────────────┘ ``` ### Monitor Stats (`monitor.py`) **Data Structures:** ```python class MonitorStats: # In-memory queues active_requests: Dict[str, Dict] # Currently processing completed_requests: deque(maxlen=100) # Last 100 completed janitor_events: deque(maxlen=100) # Cleanup events errors: deque(maxlen=100) # Error log # Endpoint stats (persisted to Redis) endpoint_stats: Dict[str, Dict] # Aggregated stats # Timeline data (5min window, 5s resolution = 60 points) memory_timeline: deque(maxlen=60) requests_timeline: deque(maxlen=60) browser_timeline: deque(maxlen=60) # Background persistence _persist_queue: asyncio.Queue(maxsize=10) _persist_worker_task: Optional[asyncio.Task] ``` **Request Tracking:** ```python async def track_request_start(request_id, endpoint, url, config): """Track new request""" self.active_requests[request_id] = { "id": request_id, "endpoint": endpoint, "url": url, "start_time": time.time(), "mem_start": psutil.Process().memory_info().rss / (1024 * 1024) } # Update endpoint stats if endpoint not in self.endpoint_stats: self.endpoint_stats[endpoint] = { "count": 0, "total_time": 0, "errors": 0, "pool_hits": 0, "success": 0 } self.endpoint_stats[endpoint]["count"] += 1 # Queue background persistence self._persist_queue.put_nowait(True) async def track_request_end(request_id, success, error=None, ...): """Track request completion""" req_info = self.active_requests.pop(request_id) elapsed = time.time() - req_info["start_time"] mem_delta = current_mem - req_info["mem_start"] # Add to completed queue self.completed_requests.append({ "id": request_id, "endpoint": req_info["endpoint"], "url": req_info["url"], "success": success, "elapsed": elapsed, "mem_delta": mem_delta, "end_time": time.time() }) # Update stats self.endpoint_stats[endpoint]["success" if success else "errors"] += 1 await self._persist_endpoint_stats() ``` **Background Persistence Worker:** ```python async def _persistence_worker(self): """Background worker for Redis persistence""" while True: try: await self._persist_queue.get() await self._persist_endpoint_stats() self._persist_queue.task_done() except asyncio.CancelledError: break except Exception as e: logger.error(f"Persistence worker error: {e}") async def _persist_endpoint_stats(self): """Persist stats to Redis with error handling""" try: await self.redis.set( "monitor:endpoint_stats", json.dumps(self.endpoint_stats), ex=86400 # 24h TTL ) except Exception as e: logger.warning(f"Failed to persist endpoint stats: {e}") ``` **Time-based Cleanup:** ```python def _cleanup_old_entries(self, max_age_seconds=300): """Remove entries older than 5 minutes""" now = time.time() cutoff = now - max_age_seconds # Clean completed requests while self.completed_requests and \ self.completed_requests[0].get("end_time", 0) < cutoff: self.completed_requests.popleft() # Clean janitor events while self.janitor_events and \ self.janitor_events[0].get("timestamp", 0) < cutoff: self.janitor_events.popleft() # Clean errors while self.errors and \ self.errors[0].get("timestamp", 0) < cutoff: self.errors.popleft() ``` ### WebSocket Implementation (`monitor_routes.py`) **Endpoint:** ```python @router.websocket("/ws") async def websocket_endpoint(websocket: WebSocket): """Real-time monitoring updates""" await websocket.accept() logger.info("WebSocket client connected") try: while True: try: monitor = get_monitor() # Gather comprehensive monitoring data data = { "timestamp": time.time(), "health": await monitor.get_health_summary(), "requests": { "active": monitor.get_active_requests(), "completed": monitor.get_completed_requests(limit=10) }, "browsers": await monitor.get_browser_list(), "timeline": { "memory": monitor.get_timeline_data("memory", "5m"), "requests": monitor.get_timeline_data("requests", "5m"), "browsers": monitor.get_timeline_data("browsers", "5m") }, "janitor": monitor.get_janitor_log(limit=10), "errors": monitor.get_errors_log(limit=10) } await websocket.send_json(data) await asyncio.sleep(2) # 2-second update interval except WebSocketDisconnect: logger.info("WebSocket client disconnected") break except Exception as e: logger.error(f"WebSocket error: {e}", exc_info=True) await asyncio.sleep(2) except Exception as e: logger.error(f"WebSocket connection error: {e}", exc_info=True) finally: logger.info("WebSocket connection closed") ``` **Input Validation:** ```python @router.get("/requests") async def get_requests(status: str = "all", limit: int = 50): # Input validation if status not in ["all", "active", "completed", "success", "error"]: raise HTTPException(400, f"Invalid status: {status}") if limit < 1 or limit > 1000: raise HTTPException(400, f"Invalid limit: {limit}") monitor = get_monitor() # ... return data ``` ### Frontend Dashboard **Connection Management:** ```javascript // WebSocket with auto-reconnect function connectWebSocket() { if (wsReconnectAttempts >= MAX_WS_RECONNECT) { // Fallback to polling after 5 failed attempts useWebSocket = false; updateConnectionStatus('polling'); startAutoRefresh(); return; } updateConnectionStatus('connecting'); const wsUrl = `${protocol}//${window.location.host}/monitor/ws`; websocket = new WebSocket(wsUrl); websocket.onopen = () => { wsReconnectAttempts = 0; updateConnectionStatus('connected'); stopAutoRefresh(); // Stop polling }; websocket.onmessage = (event) => { const data = JSON.parse(event.data); updateDashboard(data); // Update all sections }; websocket.onclose = () => { updateConnectionStatus('disconnected', 'Reconnecting...'); if (useWebSocket) { setTimeout(connectWebSocket, 2000 * wsReconnectAttempts); } else { startAutoRefresh(); // Fallback to polling } }; } ``` **Connection Status Indicator:** | Status | Color | Animation | Meaning | |--------|-------|-----------|---------| | Live | Green | Pulsing fast | WebSocket connected | | Connecting... | Yellow | Pulsing slow | Attempting connection | | Polling | Blue | Pulsing slow | HTTP polling fallback | | Disconnected | Red | None | Connection failed | --- ## API Layer ### Request/Response Flow ``` Client Request │ ▼ FastAPI Route Handler │ ├─→ Monitor: track_request_start() │ ├─→ Browser Pool: get_crawler(config) │ │ │ ├─→ Check PERMANENT │ ├─→ Check HOT_POOL │ ├─→ Check COLD_POOL │ └─→ Create New (if needed) │ ├─→ Execute Crawl │ │ │ ├─→ Fetch page │ ├─→ Extract content │ ├─→ Apply filters/strategies │ └─→ Return result │ ├─→ Monitor: track_request_end() │ └─→ Return Response (browser stays in pool) ``` ### Error Handling Strategy **Levels:** 1. **Route Level**: HTTP exceptions with proper status codes 2. **Monitor Level**: Try-except with logging, non-critical failures 3. **Pool Level**: Memory checks, lock protection, graceful degradation 4. **WebSocket Level**: Auto-reconnect, fallback to polling **Example:** ```python @app.post("/crawl") async def crawl(body: CrawlRequest): request_id = f"req_{uuid4().hex[:8]}" try: # Monitor tracking (non-blocking on failure) try: await get_monitor().track_request_start(...) except: pass # Monitor not critical # Browser acquisition (with memory protection) crawler = await get_crawler(browser_config) # Crawl execution result = await crawler.arun(url, config=cfg) # Success tracking try: await get_monitor().track_request_end(request_id, success=True) except: pass return result except MemoryError as e: # Memory pressure - return 503 await get_monitor().track_request_end(request_id, success=False, error=str(e)) raise HTTPException(503, "Server at capacity") except Exception as e: # General errors - return 500 await get_monitor().track_request_end(request_id, success=False, error=str(e)) raise HTTPException(500, str(e)) ``` --- ## Memory Management ### Container Memory Detection **Priority Order:** 1. cgroup v2 (`/sys/fs/cgroup/memory.{current,max}`) 2. cgroup v1 (`/sys/fs/cgroup/memory/memory.{usage,limit}_in_bytes`) 3. psutil fallback (may be inaccurate in containers) **Usage:** ```python mem_pct = get_container_memory_percent() if mem_pct >= 95: # Critical raise MemoryError("Refusing new browser") elif mem_pct > 80: # High pressure # Janitor: aggressive cleanup (10s interval, 30s TTL) elif mem_pct > 60: # Moderate pressure # Janitor: moderate cleanup (30s interval, 60s TTL) else: # Normal # Janitor: relaxed cleanup (60s interval, 300s TTL) ``` ### Memory Budgets | Component | Memory | Notes | |-----------|--------|-------| | Base Container | 270 MB | Python + FastAPI + libraries | | Permanent Browser | 270 MB | Always-on default browser | | Hot Pool Browser | 180 MB | Per frequently-used config | | Cold Pool Browser | 180 MB | Per rarely-used config | | Active Crawl Overhead | 50-200 MB | Temporary, released after request | **Example Calculation:** ``` Container: 270 MB Permanent: 270 MB 2x Hot: 360 MB 1x Cold: 180 MB Total: 1080 MB baseline Under load (10 concurrent): + Active crawls: ~500-1000 MB = Peak: 1.5-2 GB ``` --- ## Production Optimizations ### Code Review Fixes Applied **Critical (3):** 1. ✅ Lock protection for browser pool access 2. ✅ Async track_janitor_event implementation 3. ✅ Error handling in request tracking **Important (8):** 4. ✅ Background persistence worker (replaces fire-and-forget) 5. ✅ Time-based expiry (5min cleanup for old entries) 6. ✅ Input validation (status, limit, metric, window) 7. ✅ Timeline updater timeout (4s max) 8. ✅ Warn when killing browsers with active requests 9. ✅ Monitor cleanup on shutdown 10. ✅ Document memory estimates 11. ✅ Structured error responses (HTTPException) ### Performance Characteristics **Latency:** | Scenario | Time | Notes | |----------|------|-------| | Pool Hit (Permanent) | <100ms | Browser ready | | Pool Hit (Hot/Cold) | <100ms | Browser ready | | New Browser Creation | 3-5s | Chromium startup | | Simple Page Fetch | 1-3s | Network + render | | Complex Extraction | 5-10s | LLM processing | **Throughput:** | Load | Concurrent | Response Time | Success Rate | |------|-----------|---------------|--------------| | Light | 1-10 | <3s | 100% | | Medium | 10-50 | 3-8s | 100% | | Heavy | 50-100 | 8-15s | 95-100% | | Extreme | 100+ | 15-30s | 80-95% | ### Reliability Features **Race Condition Protection:** - `asyncio.Lock` on all pool operations - Lock on browser pool stats access - Async janitor event tracking **Graceful Degradation:** - WebSocket → HTTP polling fallback - Redis persistence failures (logged, non-blocking) - Monitor tracking failures (logged, non-blocking) **Resource Cleanup:** - Janitor cleanup (adaptive intervals) - Time-based expiry (5min for old data) - Shutdown cleanup (persist final stats, close browsers) - Background worker cancellation --- ## Deployment & Operations ### Running Locally ```bash # Install dependencies pip install -r requirements.txt # Configure cp .llm.env.example .llm.env # Edit .llm.env with your API keys # Run server python -m uvicorn server:app --host 0.0.0.0 --port 11235 --reload ``` ### Docker Deployment ```bash # Build image docker build -t crawl4ai:latest -f Dockerfile . # Run container docker run -d \ --name crawl4ai \ -p 11235:11235 \ --shm-size=1g \ --env-file .llm.env \ crawl4ai:latest ``` ### Production Configuration **`config.yml` Key Settings:** ```yaml crawler: browser: extra_args: - "--disable-gpu" - "--disable-dev-shm-usage" - "--no-sandbox" kwargs: headless: true text_mode: true # Reduces memory by 30-40% memory_threshold_percent: 95 # Refuse new browsers above this pool: idle_ttl_sec: 300 # Base TTL for cold pool (5 min) rate_limiter: enabled: true base_delay: [1.0, 3.0] # Random delay between requests ``` ### Monitoring **Access Dashboard:** ``` http://localhost:11235/static/monitor/ ``` **Check Logs:** ```bash # All activity docker logs crawl4ai -f # Pool activity only docker logs crawl4ai | grep -E "(🔥|♨️|❄️|🆕|⬆️)" # Errors only docker logs crawl4ai | grep ERROR ``` **Metrics:** ```bash # Container stats docker stats crawl4ai # Memory percentage curl http://localhost:11235/monitor/health | jq '.container.memory_percent' # Pool status curl http://localhost:11235/monitor/browsers | jq '.summary' ``` --- ## Troubleshooting & Debugging ### Common Issues **1. WebSocket Not Connecting** Symptoms: Yellow "Connecting..." indicator, falls back to blue "Polling" Debug: ```bash # Check server logs docker logs crawl4ai | grep WebSocket # Test WebSocket manually python test-websocket.py ``` Fix: Check firewall/proxy settings, ensure port 11235 accessible **2. High Memory Usage** Symptoms: Container OOM kills, 503 errors, slow responses Debug: ```bash # Check current memory curl http://localhost:11235/monitor/health | jq '.container.memory_percent' # Check browser pool curl http://localhost:11235/monitor/browsers # Check janitor activity docker logs crawl4ai | grep "🧹" ``` Fix: - Lower `memory_threshold_percent` in config.yml - Increase container memory limit - Enable `text_mode: true` in browser config - Reduce idle_ttl_sec for more aggressive cleanup **3. Browser Pool Not Reusing** Symptoms: High "New Created" count, poor reuse rate Debug: ```python # Check config signature matching from crawl4ai import BrowserConfig import json, hashlib cfg = BrowserConfig(...) # Your config sig = hashlib.sha1(json.dumps(cfg.to_dict(), sort_keys=True).encode()).hexdigest() print(f"Config signature: {sig[:8]}") ``` Check logs for permanent browser signature: ```bash docker logs crawl4ai | grep "permanent" ``` Fix: Ensure endpoint configs match permanent browser config exactly **4. Janitor Not Cleaning Up** Symptoms: Memory stays high after idle period Debug: ```bash # Check janitor events curl http://localhost:11235/monitor/logs/janitor # Check pool stats over time watch -n 5 'curl -s http://localhost:11235/monitor/browsers | jq ".summary"' ``` Fix: - Janitor runs every 10-60s depending on memory - Hot pool browsers have longer TTL (by design) - Permanent browser never cleaned (by design) ### Debug Tools **Config Signature Checker:** ```python from crawl4ai import BrowserConfig import json, hashlib def check_sig(cfg: BrowserConfig) -> str: payload = json.dumps(cfg.to_dict(), sort_keys=True, separators=(",",":")) sig = hashlib.sha1(payload.encode()).hexdigest() return sig[:8] # Example cfg1 = BrowserConfig() cfg2 = BrowserConfig(headless=True) print(f"Default: {check_sig(cfg1)}") print(f"Custom: {check_sig(cfg2)}") ``` **Monitor Stats Dumper:** ```bash #!/bin/bash # Dump all monitor stats to JSON curl -s http://localhost:11235/monitor/health > health.json curl -s http://localhost:11235/monitor/requests?limit=100 > requests.json curl -s http://localhost:11235/monitor/browsers > browsers.json curl -s http://localhost:11235/monitor/logs/janitor > janitor.json curl -s http://localhost:11235/monitor/logs/errors > errors.json echo "Monitor stats dumped to *.json files" ``` **WebSocket Test Script:** ```python # test-websocket.py (included in repo) import asyncio import websockets import json async def test_websocket(): uri = "ws://localhost:11235/monitor/ws" async with websockets.connect(uri) as websocket: for i in range(5): message = await websocket.recv() data = json.loads(message) print(f"\nUpdate #{i+1}:") print(f" Health: CPU {data['health']['container']['cpu_percent']}%") print(f" Active Requests: {len(data['requests']['active'])}") print(f" Browsers: {len(data['browsers'])}") asyncio.run(test_websocket()) ``` ### Performance Tuning **For High Throughput:** ```yaml # config.yml crawler: memory_threshold_percent: 90 # Allow more browsers pool: idle_ttl_sec: 600 # Keep browsers longer rate_limiter: enabled: false # Disable for max speed ``` **For Low Memory:** ```yaml # config.yml crawler: browser: kwargs: text_mode: true # 30-40% memory reduction memory_threshold_percent: 80 # More conservative pool: idle_ttl_sec: 60 # Aggressive cleanup ``` **For Stability:** ```yaml # config.yml crawler: memory_threshold_percent: 85 # Balanced pool: idle_ttl_sec: 300 # Moderate cleanup rate_limiter: enabled: true base_delay: [2.0, 5.0] # Prevent rate limiting ``` --- ## Test Suite **Location:** `deploy/docker/tests/` **Tests:** 1. `test_1_basic.py` - Health check, container lifecycle 2. `test_2_memory.py` - Memory tracking, leak detection 3. `test_3_pool.py` - Pool reuse validation 4. `test_4_concurrent.py` - Concurrent load testing 5. `test_5_pool_stress.py` - Multi-config pool behavior 6. `test_6_multi_endpoint.py` - All endpoint validation 7. `test_7_cleanup.py` - Janitor cleanup verification **Run All Tests:** ```bash cd deploy/docker/tests pip install -r requirements.txt # Build image first cd /path/to/repo docker build -t crawl4ai-local:latest . # Run tests cd deploy/docker/tests for test in test_*.py; do echo "Running $test..." python $test || break done ``` --- ## Architecture Decision Log ### Why 3-Tier Pool? **Decision:** PERMANENT + HOT_POOL + COLD_POOL **Rationale:** - 90% of requests use default config → permanent browser serves most traffic - Frequent variants (hot) deserve longer TTL for better reuse - Rare configs (cold) should be cleaned aggressively to save memory **Alternatives Considered:** - Single pool: Too simple, no optimization for common case - LRU cache: Doesn't capture "hot" vs "rare" distinction - Per-endpoint pools: Too complex, over-engineering ### Why WebSocket + Polling Fallback? **Decision:** WebSocket primary, HTTP polling backup **Rationale:** - WebSocket provides real-time updates (2s interval) - Polling fallback ensures reliability in restricted networks - Auto-reconnect handles temporary disconnections **Alternatives Considered:** - Polling only: Works but higher latency, more server load - WebSocket only: Fails in restricted networks - Server-Sent Events: One-way, no client messages ### Why Background Persistence Worker? **Decision:** Queue-based worker for Redis operations **Rationale:** - Fire-and-forget loses data on failures - Queue provides buffering and retry capability - Non-blocking keeps request path fast **Alternatives Considered:** - Synchronous writes: Blocks request handling - Fire-and-forget: Silent failures - Batch writes: Complex state management --- ## Contributing When modifying the architecture: 1. **Maintain backward compatibility** in API contracts 2. **Add tests** for new functionality 3. **Update this document** with architectural changes 4. **Profile memory impact** before production 5. **Test under load** using the test suite **Code Review Checklist:** - [ ] Race conditions protected with locks - [ ] Error handling with proper logging - [ ] Graceful degradation on failures - [ ] Memory impact measured - [ ] Tests added/updated - [ ] Documentation updated --- ## License & Credits **Crawl4AI** - Created by Unclecode **GitHub**: https://github.com/unclecode/crawl4ai **License**: See LICENSE file in repository **Architecture & Optimizations**: October 2025 **WebSocket Monitoring**: October 2025 **Production Hardening**: October 2025 --- **End of Technical Architecture Document** For questions or issues, please open a GitHub issue at: https://github.com/unclecode/crawl4ai/issues