From 05921811b8fdf43772c73aca9779c42b86be100f Mon Sep 17 00:00:00 2001 From: unclecode Date: Sat, 18 Oct 2025 12:05:49 +0800 Subject: [PATCH] docs: add comprehensive technical architecture documentation Created ARCHITECTURE.md as a complete technical reference for the Crawl4AI Docker server, replacing the stress test pipeline document with production-grade documentation. Contents: - System overview with architecture diagrams - Core components deep-dive (server, API, utils) - Smart browser pool implementation details - Real-time monitoring system architecture - WebSocket implementation and fallback strategy - Memory management and container detection - Production optimizations and code review fixes - Deployment guides (local, Docker, production) - Comprehensive troubleshooting section - Debug tools and performance tuning - Test suite documentation - Architecture decision log (ADRs) Target audience: Developers maintaining or extending the system Goal: Enable rapid onboarding and confident modifications --- deploy/docker/ARCHITECTURE.md | 1149 +++++++++++++++++++++++++++++++++ 1 file changed, 1149 insertions(+) create mode 100644 deploy/docker/ARCHITECTURE.md diff --git a/deploy/docker/ARCHITECTURE.md b/deploy/docker/ARCHITECTURE.md new file mode 100644 index 00000000..eb49cdae --- /dev/null +++ b/deploy/docker/ARCHITECTURE.md @@ -0,0 +1,1149 @@ +# Crawl4AI Docker Server - Technical Architecture + +**Version**: 0.7.4 +**Last Updated**: October 2025 +**Status**: Production-ready with real-time monitoring + +This document provides a comprehensive technical overview of the Crawl4AI Docker server architecture, including the smart browser pool, real-time monitoring system, and all production optimizations. + +--- + +## Table of Contents + +1. [System Overview](#system-overview) +2. [Core Components](#core-components) +3. [Smart Browser Pool](#smart-browser-pool) +4. [Real-time Monitoring System](#real-time-monitoring-system) +5. [API Layer](#api-layer) +6. 
[Memory Management](#memory-management) +7. [Production Optimizations](#production-optimizations) +8. [Deployment & Operations](#deployment--operations) +9. [Troubleshooting & Debugging](#troubleshooting--debugging) + +--- + +## System Overview + +### Architecture Diagram + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Client Requests │ +└────────────┬────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ FastAPI Server (server.py) │ +│ ├─ REST API Endpoints (/crawl, /html, /md, /llm, etc.) │ +│ ├─ WebSocket Endpoint (/monitor/ws) │ +│ └─ Background Tasks (janitor, timeline_updater) │ +└────┬────────────────────┬────────────────────┬──────────────┘ + │ │ │ + ▼ ▼ ▼ +┌─────────────┐ ┌──────────────────┐ ┌─────────────────┐ +│ Browser │ │ Monitor System │ │ Redis │ +│ Pool │ │ (monitor.py) │ │ (Persistence) │ +│ │ │ │ │ │ +│ PERMANENT ●─┤ │ ├─ Stats │ │ ├─ Endpoint │ +│ HOT_POOL ♨─┤ │ ├─ Requests │ │ │ Stats │ +│ COLD_POOL ❄─┤ │ ├─ Browsers │ │ ├─ Task │ +│ │ │ ├─ Timeline │ │ │ Results │ +│ Janitor 🧹─┤ │ └─ Events/Errors │ │ └─ Cache │ +└─────────────┘ └──────────────────┘ └─────────────────┘ +``` + +### Key Features + +- **10x Memory Efficiency**: Smart 3-tier browser pooling reduces memory from 500-700MB to 50-70MB per concurrent user +- **Real-time Monitoring**: WebSocket-based live dashboard with 2-second update intervals +- **Production-Ready**: Comprehensive error handling, timeouts, cleanup, and graceful shutdown +- **Container-Aware**: Accurate memory detection using cgroup v2/v1 +- **Auto-Recovery**: Graceful WebSocket fallback, lock protection, background workers + +--- + +## Core Components + +### 1. 
Server Core (`server.py`) + +**Responsibilities:** +- FastAPI application lifecycle management +- Route registration and middleware +- Background task orchestration +- Graceful shutdown handling + +**Key Functions:** + +```python +@asynccontextmanager +async def lifespan(app: FastAPI): + """Application lifecycle manager""" + # Startup + - Initialize Redis connection + - Create monitor stats instance + - Start persistence worker + - Initialize permanent browser + - Start janitor (browser cleanup) + - Start timeline updater (5s interval) + + yield + + # Shutdown + - Cancel background tasks + - Persist final monitor stats + - Stop persistence worker + - Close all browsers +``` + +**Configuration:** +- Loaded from `config.yml` +- Browser settings, memory thresholds, rate limiting +- LLM provider credentials +- Server host/port + +### 2. API Layer (`api.py`) + +**Endpoints:** + +| Endpoint | Method | Purpose | Pool Usage | +|----------|--------|---------|------------| +| `/health` | GET | Health check | None | +| `/crawl` | POST | Full crawl with all features | ✓ Pool | +| `/crawl_stream` | POST | Streaming crawl results | ✓ Pool | +| `/html` | POST | HTML extraction | ✓ Pool | +| `/md` | POST | Markdown generation | ✓ Pool | +| `/screenshot` | POST | Page screenshots | ✓ Pool | +| `/pdf` | POST | PDF generation | ✓ Pool | +| `/llm/{path}` | GET/POST | LLM extraction | ✓ Pool | +| `/crawl/job` | POST | Background job creation | ✓ Pool | + +**Request Flow:** + +```python +@app.post("/crawl") +async def crawl(body: CrawlRequest): + # 1. Track request start + request_id = f"req_{uuid4().hex[:8]}" + await get_monitor().track_request_start(request_id, "/crawl", url, config) + + # 2. Get browser from pool + from crawler_pool import get_crawler + crawler = await get_crawler(browser_config) + + # 3. Execute crawl + result = await crawler.arun(url, config=crawler_config) + + # 4. Track request completion + await get_monitor().track_request_end(request_id, success=True) + + # 5. 
Return result (browser stays in pool) + return result +``` + +### 3. Utility Layer (`utils.py`) + +**Container Memory Detection:** + +```python +def get_container_memory_percent() -> float: + """Accurate container memory detection""" + try: + # Try cgroup v2 first + current = int(Path("/sys/fs/cgroup/memory.current").read_text().strip()) + max_mem = int(Path("/sys/fs/cgroup/memory.max").read_text().strip()) + return (current / max_mem) * 100 + except Exception: + pass + try: + # Fallback to cgroup v1 + usage = int(Path("/sys/fs/cgroup/memory/memory.usage_in_bytes").read_text()) + limit = int(Path("/sys/fs/cgroup/memory/memory.limit_in_bytes").read_text()) + return (usage / limit) * 100 + except Exception: + # Final fallback to psutil (may be inaccurate in containers) + return psutil.virtual_memory().percent +``` + +**Helper Functions:** +- `get_base_url()`: Request base URL extraction +- `is_task_id()`: Task ID validation +- `should_cleanup_task()`: TTL-based cleanup logic +- `validate_llm_provider()`: LLM configuration validation + +--- + +## Smart Browser Pool + +### Architecture + +The browser pool implements a 3-tier strategy optimized for real-world usage patterns: + +``` +┌──────────────────────────────────────────────────────────┐ +│ PERMANENT Browser (Default Config) │ +│ ● Always alive, never cleaned │ +│ ● Serves 90% of requests │ +│ ● ~270MB memory │ +└──────────────────────────────────────────────────────────┘ + ▲ + │ 90% of requests + │ +┌──────────────────────────────────────────────────────────┐ +│ HOT_POOL (Frequently Used Configs) │ +│ ♨ Configs used 3+ times │ +│ ♨ Longer TTL (2-5 min depending on memory) │ +│ ♨ ~180MB per browser │ +└──────────────────────────────────────────────────────────┘ + ▲ + │ Promotion at 3 uses + │ +┌──────────────────────────────────────────────────────────┐ +│ COLD_POOL (Rarely Used Configs) │ +│ ❄ New/rare browser configs │ +│ ❄ Short TTL (30s-5min depending on memory) │ +│ ❄ ~180MB per browser │ 
+└──────────────────────────────────────────────────────────┘ +``` + +### Implementation (`crawler_pool.py`) + +**Core Data Structures:** + +```python +PERMANENT: Optional[AsyncWebCrawler] = None # Default browser +HOT_POOL: Dict[str, AsyncWebCrawler] = {} # Frequent configs +COLD_POOL: Dict[str, AsyncWebCrawler] = {} # Rare configs +LAST_USED: Dict[str, float] = {} # Timestamp tracking +USAGE_COUNT: Dict[str, int] = {} # Usage counter +LOCK = asyncio.Lock() # Thread-safe access +``` + +**Browser Acquisition Flow:** + +```python +async def get_crawler(cfg: BrowserConfig) -> AsyncWebCrawler: + sig = _sig(cfg) # SHA1 hash of config + + async with LOCK: # Prevent race conditions + # 1. Check permanent browser + if _is_default_config(sig): + return PERMANENT + + # 2. Check hot pool + if sig in HOT_POOL: + USAGE_COUNT[sig] += 1 + return HOT_POOL[sig] + + # 3. Check cold pool (with promotion logic) + if sig in COLD_POOL: + USAGE_COUNT[sig] += 1 + if USAGE_COUNT[sig] >= 3: + # Promote to hot pool + HOT_POOL[sig] = COLD_POOL.pop(sig) + await get_monitor().track_janitor_event("promote", sig, {...}) + return HOT_POOL[sig] + return COLD_POOL[sig] + + # 4. Memory check before creating new + if get_container_memory_percent() >= MEM_LIMIT: + raise MemoryError(f"Memory at {mem}%, refusing new browser") + + # 5. 
Create new browser in cold pool + crawler = AsyncWebCrawler(config=cfg) + await crawler.start() + COLD_POOL[sig] = crawler + return crawler +``` + +**Janitor (Adaptive Cleanup):** + +```python +async def janitor(): + """Memory-adaptive browser cleanup""" + while True: + mem_pct = get_container_memory_percent() + + # Adaptive intervals based on memory pressure + if mem_pct > 80: + interval, cold_ttl, hot_ttl = 10, 30, 120 # Aggressive + elif mem_pct > 60: + interval, cold_ttl, hot_ttl = 30, 60, 300 # Moderate + else: + interval, cold_ttl, hot_ttl = 60, 300, 600 # Relaxed + + await asyncio.sleep(interval) + now = time.time() + + async with LOCK: + # Clean cold pool first (less valuable) + for sig in list(COLD_POOL.keys()): + if now - LAST_USED[sig] > cold_ttl: + await COLD_POOL[sig].close() + del COLD_POOL[sig], LAST_USED[sig], USAGE_COUNT[sig] + await get_monitor().track_janitor_event("close_cold", sig, {...}) + + # Clean hot pool (more conservative) + for sig in list(HOT_POOL.keys()): + if now - LAST_USED[sig] > hot_ttl: + await HOT_POOL[sig].close() + del HOT_POOL[sig], LAST_USED[sig], USAGE_COUNT[sig] + await get_monitor().track_janitor_event("close_hot", sig, {...}) +``` + +**Config Signature Generation:** + +```python +def _sig(cfg: BrowserConfig) -> str: + """Generate unique signature for browser config""" + payload = json.dumps(cfg.to_dict(), sort_keys=True, separators=(",",":")) + return hashlib.sha1(payload.encode()).hexdigest() +``` + +--- + +## Real-time Monitoring System + +### Architecture + +The monitoring system provides real-time insights via WebSocket with automatic fallback to HTTP polling. 
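On the client side, that fallback reduces to a small reconnect schedule: a handful of WebSocket attempts with linearly growing backoff, then a permanent switch to HTTP polling. A minimal sketch of the decision logic — illustrative only; the attempt limit and 2-second base delay mirror the dashboard JavaScript, while `next_action` and `POLL_INTERVAL` are hypothetical names:

```python
# Sketch of the dashboard's WebSocket-vs-polling decision (illustrative).
# After MAX_WS_RECONNECT failed attempts the client gives up on WebSocket
# and switches to HTTP polling for good.

MAX_WS_RECONNECT = 5    # attempts before falling back (mirrors the frontend)
WS_BACKOFF_BASE = 2.0   # seconds; delay grows linearly per failed attempt
POLL_INTERVAL = 5.0     # hypothetical HTTP polling interval in seconds

def next_action(failed_attempts: int) -> tuple[str, float]:
    """Return what the client should do next and the delay before doing it."""
    if failed_attempts >= MAX_WS_RECONNECT:
        return ("poll", POLL_INTERVAL)                       # permanent fallback
    return ("reconnect", WS_BACKOFF_BASE * failed_attempts)  # retry WebSocket
```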
+ +**Components:** + +``` +┌─────────────────────────────────────────────────────────┐ +│ MonitorStats Class (monitor.py) │ +│ ├─ In-memory queues (deques with maxlen) │ +│ ├─ Background persistence worker │ +│ ├─ Timeline tracking (5-min window, 5s resolution) │ +│ └─ Time-based expiry (5min for old entries) │ +└───────────┬─────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────┐ +│ WebSocket Endpoint (/monitor/ws) │ +│ ├─ 2-second update intervals │ +│ ├─ Auto-reconnect with exponential backoff │ +│ ├─ Comprehensive data payload │ +│ └─ Graceful fallback to polling │ +└───────────┬─────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────┐ +│ Dashboard UI (static/monitor/index.html) │ +│ ├─ Connection status indicator │ +│ ├─ Live updates (health, requests, browsers) │ +│ ├─ Timeline charts (memory, requests, browsers) │ +│ └─ Janitor events & error logs │ +└─────────────────────────────────────────────────────────┘ +``` + +### Monitor Stats (`monitor.py`) + +**Data Structures:** + +```python +class MonitorStats: + # In-memory queues + active_requests: Dict[str, Dict] # Currently processing + completed_requests: deque(maxlen=100) # Last 100 completed + janitor_events: deque(maxlen=100) # Cleanup events + errors: deque(maxlen=100) # Error log + + # Endpoint stats (persisted to Redis) + endpoint_stats: Dict[str, Dict] # Aggregated stats + + # Timeline data (5min window, 5s resolution = 60 points) + memory_timeline: deque(maxlen=60) + requests_timeline: deque(maxlen=60) + browser_timeline: deque(maxlen=60) + + # Background persistence + _persist_queue: asyncio.Queue(maxsize=10) + _persist_worker_task: Optional[asyncio.Task] +``` + +**Request Tracking:** + +```python +async def track_request_start(request_id, endpoint, url, config): + """Track new request""" + self.active_requests[request_id] = { + "id": request_id, + "endpoint": endpoint, + "url": url, 
+ "start_time": time.time(), + "mem_start": psutil.Process().memory_info().rss / (1024 * 1024) + } + + # Update endpoint stats + if endpoint not in self.endpoint_stats: + self.endpoint_stats[endpoint] = { + "count": 0, "total_time": 0, "errors": 0, + "pool_hits": 0, "success": 0 + } + self.endpoint_stats[endpoint]["count"] += 1 + + # Queue background persistence + self._persist_queue.put_nowait(True) + +async def track_request_end(request_id, success, error=None, ...): + """Track request completion""" + req_info = self.active_requests.pop(request_id) + elapsed = time.time() - req_info["start_time"] + mem_delta = current_mem - req_info["mem_start"] + + # Add to completed queue + self.completed_requests.append({ + "id": request_id, + "endpoint": req_info["endpoint"], + "url": req_info["url"], + "success": success, + "elapsed": elapsed, + "mem_delta": mem_delta, + "end_time": time.time() + }) + + # Update stats + self.endpoint_stats[endpoint]["success" if success else "errors"] += 1 + await self._persist_endpoint_stats() +``` + +**Background Persistence Worker:** + +```python +async def _persistence_worker(self): + """Background worker for Redis persistence""" + while True: + try: + await self._persist_queue.get() + await self._persist_endpoint_stats() + self._persist_queue.task_done() + except asyncio.CancelledError: + break + except Exception as e: + logger.error(f"Persistence worker error: {e}") + +async def _persist_endpoint_stats(self): + """Persist stats to Redis with error handling""" + try: + await self.redis.set( + "monitor:endpoint_stats", + json.dumps(self.endpoint_stats), + ex=86400 # 24h TTL + ) + except Exception as e: + logger.warning(f"Failed to persist endpoint stats: {e}") +``` + +**Time-based Cleanup:** + +```python +def _cleanup_old_entries(self, max_age_seconds=300): + """Remove entries older than 5 minutes""" + now = time.time() + cutoff = now - max_age_seconds + + # Clean completed requests + while self.completed_requests and \ + 
self.completed_requests[0].get("end_time", 0) < cutoff: + self.completed_requests.popleft() + + # Clean janitor events + while self.janitor_events and \ + self.janitor_events[0].get("timestamp", 0) < cutoff: + self.janitor_events.popleft() + + # Clean errors + while self.errors and \ + self.errors[0].get("timestamp", 0) < cutoff: + self.errors.popleft() +``` + +### WebSocket Implementation (`monitor_routes.py`) + +**Endpoint:** + +```python +@router.websocket("/ws") +async def websocket_endpoint(websocket: WebSocket): + """Real-time monitoring updates""" + await websocket.accept() + logger.info("WebSocket client connected") + + try: + while True: + try: + monitor = get_monitor() + + # Gather comprehensive monitoring data + data = { + "timestamp": time.time(), + "health": await monitor.get_health_summary(), + "requests": { + "active": monitor.get_active_requests(), + "completed": monitor.get_completed_requests(limit=10) + }, + "browsers": await monitor.get_browser_list(), + "timeline": { + "memory": monitor.get_timeline_data("memory", "5m"), + "requests": monitor.get_timeline_data("requests", "5m"), + "browsers": monitor.get_timeline_data("browsers", "5m") + }, + "janitor": monitor.get_janitor_log(limit=10), + "errors": monitor.get_errors_log(limit=10) + } + + await websocket.send_json(data) + await asyncio.sleep(2) # 2-second update interval + + except WebSocketDisconnect: + logger.info("WebSocket client disconnected") + break + except Exception as e: + logger.error(f"WebSocket error: {e}", exc_info=True) + await asyncio.sleep(2) + except Exception as e: + logger.error(f"WebSocket connection error: {e}", exc_info=True) + finally: + logger.info("WebSocket connection closed") +``` + +**Input Validation:** + +```python +@router.get("/requests") +async def get_requests(status: str = "all", limit: int = 50): + # Input validation + if status not in ["all", "active", "completed", "success", "error"]: + raise HTTPException(400, f"Invalid status: {status}") + if limit < 1 
or limit > 1000: + raise HTTPException(400, f"Invalid limit: {limit}") + + monitor = get_monitor() + # ... return data +``` + +### Frontend Dashboard + +**Connection Management:** + +```javascript +// WebSocket with auto-reconnect +function connectWebSocket() { + if (wsReconnectAttempts >= MAX_WS_RECONNECT) { + // Fallback to polling after 5 failed attempts + useWebSocket = false; + updateConnectionStatus('polling'); + startAutoRefresh(); + return; + } + + updateConnectionStatus('connecting'); + const wsUrl = `${protocol}//${window.location.host}/monitor/ws`; + websocket = new WebSocket(wsUrl); + + websocket.onopen = () => { + wsReconnectAttempts = 0; + updateConnectionStatus('connected'); + stopAutoRefresh(); // Stop polling + }; + + websocket.onmessage = (event) => { + const data = JSON.parse(event.data); + updateDashboard(data); // Update all sections + }; + + websocket.onclose = () => { + updateConnectionStatus('disconnected', 'Reconnecting...'); + if (useWebSocket) { + setTimeout(connectWebSocket, 2000 * wsReconnectAttempts); + } else { + startAutoRefresh(); // Fallback to polling + } + }; +} +``` + +**Connection Status Indicator:** + +| Status | Color | Animation | Meaning | +|--------|-------|-----------|---------| +| Live | Green | Pulsing fast | WebSocket connected | +| Connecting... 
| Yellow | Pulsing slow | Attempting connection | +| Polling | Blue | Pulsing slow | HTTP polling fallback | +| Disconnected | Red | None | Connection failed | + +--- + +## API Layer + +### Request/Response Flow + +``` +Client Request + │ + ▼ +FastAPI Route Handler + │ + ├─→ Monitor: track_request_start() + │ + ├─→ Browser Pool: get_crawler(config) + │ │ + │ ├─→ Check PERMANENT + │ ├─→ Check HOT_POOL + │ ├─→ Check COLD_POOL + │ └─→ Create New (if needed) + │ + ├─→ Execute Crawl + │ │ + │ ├─→ Fetch page + │ ├─→ Extract content + │ ├─→ Apply filters/strategies + │ └─→ Return result + │ + ├─→ Monitor: track_request_end() + │ + └─→ Return Response (browser stays in pool) +``` + +### Error Handling Strategy + +**Levels:** + +1. **Route Level**: HTTP exceptions with proper status codes +2. **Monitor Level**: Try-except with logging, non-critical failures +3. **Pool Level**: Memory checks, lock protection, graceful degradation +4. **WebSocket Level**: Auto-reconnect, fallback to polling + +**Example:** + +```python +@app.post("/crawl") +async def crawl(body: CrawlRequest): + request_id = f"req_{uuid4().hex[:8]}" + + try: + # Monitor tracking (non-blocking on failure) + try: + await get_monitor().track_request_start(...) 
+ except: + pass # Monitor not critical + + # Browser acquisition (with memory protection) + crawler = await get_crawler(browser_config) + + # Crawl execution + result = await crawler.arun(url, config=cfg) + + # Success tracking + try: + await get_monitor().track_request_end(request_id, success=True) + except: + pass + + return result + + except MemoryError as e: + # Memory pressure - return 503 + await get_monitor().track_request_end(request_id, success=False, error=str(e)) + raise HTTPException(503, "Server at capacity") + except Exception as e: + # General errors - return 500 + await get_monitor().track_request_end(request_id, success=False, error=str(e)) + raise HTTPException(500, str(e)) +``` + +--- + +## Memory Management + +### Container Memory Detection + +**Priority Order:** +1. cgroup v2 (`/sys/fs/cgroup/memory.{current,max}`) +2. cgroup v1 (`/sys/fs/cgroup/memory/memory.{usage,limit}_in_bytes`) +3. psutil fallback (may be inaccurate in containers) + +**Usage:** + +```python +mem_pct = get_container_memory_percent() + +if mem_pct >= 95: # Critical + raise MemoryError("Refusing new browser") +elif mem_pct > 80: # High pressure + # Janitor: aggressive cleanup (10s interval, 30s TTL) +elif mem_pct > 60: # Moderate pressure + # Janitor: moderate cleanup (30s interval, 60s TTL) +else: # Normal + # Janitor: relaxed cleanup (60s interval, 300s TTL) +``` + +### Memory Budgets + +| Component | Memory | Notes | +|-----------|--------|-------| +| Base Container | 270 MB | Python + FastAPI + libraries | +| Permanent Browser | 270 MB | Always-on default browser | +| Hot Pool Browser | 180 MB | Per frequently-used config | +| Cold Pool Browser | 180 MB | Per rarely-used config | +| Active Crawl Overhead | 50-200 MB | Temporary, released after request | + +**Example Calculation:** + +``` +Container: 270 MB +Permanent: 270 MB +2x Hot: 360 MB +1x Cold: 180 MB +Total: 1080 MB baseline + +Under load (10 concurrent): ++ Active crawls: ~500-1000 MB += Peak: 1.5-2 GB +``` + 
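The memory-pressure tiers above can be expressed as one pure function — the same mapping the janitor uses to pick its sweep interval and TTLs. A minimal sketch under the documented thresholds; the names `janitor_params` and `can_create_browser` are hypothetical, not the actual implementation:

```python
# Illustrative mapping from container memory pressure to janitor behavior.
# Thresholds and values mirror the tiers documented above.

CRITICAL_MEM_PCT = 95.0  # refuse new browsers at or above this level

def janitor_params(mem_pct: float) -> tuple[int, int, int]:
    """Return (sweep_interval_s, cold_ttl_s, hot_ttl_s) for a memory level."""
    if mem_pct > 80:
        return (10, 30, 120)     # aggressive: sweep often, short TTLs
    if mem_pct > 60:
        return (30, 60, 300)     # moderate
    return (60, 300, 600)        # relaxed

def can_create_browser(mem_pct: float) -> bool:
    """Gate for spawning a new cold-pool browser."""
    return mem_pct < CRITICAL_MEM_PCT
```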
+--- + +## Production Optimizations + +### Code Review Fixes Applied + +**Critical (3):** +1. ✅ Lock protection for browser pool access +2. ✅ Async track_janitor_event implementation +3. ✅ Error handling in request tracking + +**Important (8):** +4. ✅ Background persistence worker (replaces fire-and-forget) +5. ✅ Time-based expiry (5min cleanup for old entries) +6. ✅ Input validation (status, limit, metric, window) +7. ✅ Timeline updater timeout (4s max) +8. ✅ Warn when killing browsers with active requests +9. ✅ Monitor cleanup on shutdown +10. ✅ Document memory estimates +11. ✅ Structured error responses (HTTPException) + +### Performance Characteristics + +**Latency:** + +| Scenario | Time | Notes | +|----------|------|-------| +| Pool Hit (Permanent) | <100ms | Browser ready | +| Pool Hit (Hot/Cold) | <100ms | Browser ready | +| New Browser Creation | 3-5s | Chromium startup | +| Simple Page Fetch | 1-3s | Network + render | +| Complex Extraction | 5-10s | LLM processing | + +**Throughput:** + +| Load | Concurrent | Response Time | Success Rate | +|------|-----------|---------------|--------------| +| Light | 1-10 | <3s | 100% | +| Medium | 10-50 | 3-8s | 100% | +| Heavy | 50-100 | 8-15s | 95-100% | +| Extreme | 100+ | 15-30s | 80-95% | + +### Reliability Features + +**Race Condition Protection:** +- `asyncio.Lock` on all pool operations +- Lock on browser pool stats access +- Async janitor event tracking + +**Graceful Degradation:** +- WebSocket → HTTP polling fallback +- Redis persistence failures (logged, non-blocking) +- Monitor tracking failures (logged, non-blocking) + +**Resource Cleanup:** +- Janitor cleanup (adaptive intervals) +- Time-based expiry (5min for old data) +- Shutdown cleanup (persist final stats, close browsers) +- Background worker cancellation + +--- + +## Deployment & Operations + +### Running Locally + +```bash +# Install dependencies +pip install -r requirements.txt + +# Configure +cp .llm.env.example .llm.env +# Edit .llm.env with 
your API keys + +# Run server +python -m uvicorn server:app --host 0.0.0.0 --port 11235 --reload +``` + +### Docker Deployment + +```bash +# Build image +docker build -t crawl4ai:latest -f Dockerfile . + +# Run container +docker run -d \ + --name crawl4ai \ + -p 11235:11235 \ + --shm-size=1g \ + --env-file .llm.env \ + crawl4ai:latest +``` + +### Production Configuration + +**`config.yml` Key Settings:** + +```yaml +crawler: + browser: + extra_args: + - "--disable-gpu" + - "--disable-dev-shm-usage" + - "--no-sandbox" + kwargs: + headless: true + text_mode: true # Reduces memory by 30-40% + + memory_threshold_percent: 95 # Refuse new browsers above this + + pool: + idle_ttl_sec: 300 # Base TTL for cold pool (5 min) + + rate_limiter: + enabled: true + base_delay: [1.0, 3.0] # Random delay between requests +``` + +### Monitoring + +**Access Dashboard:** +``` +http://localhost:11235/static/monitor/ +``` + +**Check Logs:** +```bash +# All activity +docker logs crawl4ai -f + +# Pool activity only +docker logs crawl4ai | grep -E "(🔥|♨️|❄️|🆕|⬆️)" + +# Errors only +docker logs crawl4ai | grep ERROR +``` + +**Metrics:** +```bash +# Container stats +docker stats crawl4ai + +# Memory percentage +curl http://localhost:11235/monitor/health | jq '.container.memory_percent' + +# Pool status +curl http://localhost:11235/monitor/browsers | jq '.summary' +``` + +--- + +## Troubleshooting & Debugging + +### Common Issues + +**1. WebSocket Not Connecting** + +Symptoms: Yellow "Connecting..." indicator, falls back to blue "Polling" + +Debug: +```bash +# Check server logs +docker logs crawl4ai | grep WebSocket + +# Test WebSocket manually +python test-websocket.py +``` + +Fix: Check firewall/proxy settings, ensure port 11235 accessible + +**2. 
High Memory Usage** + +Symptoms: Container OOM kills, 503 errors, slow responses + +Debug: +```bash +# Check current memory +curl http://localhost:11235/monitor/health | jq '.container.memory_percent' + +# Check browser pool +curl http://localhost:11235/monitor/browsers + +# Check janitor activity +docker logs crawl4ai | grep "🧹" +``` + +Fix: +- Lower `memory_threshold_percent` in config.yml +- Increase container memory limit +- Enable `text_mode: true` in browser config +- Reduce idle_ttl_sec for more aggressive cleanup + +**3. Browser Pool Not Reusing** + +Symptoms: High "New Created" count, poor reuse rate + +Debug: +```python +# Check config signature matching +from crawl4ai import BrowserConfig +import json, hashlib + +cfg = BrowserConfig(...) # Your config +sig = hashlib.sha1(json.dumps(cfg.to_dict(), sort_keys=True).encode()).hexdigest() +print(f"Config signature: {sig[:8]}") +``` + +Check logs for permanent browser signature: +```bash +docker logs crawl4ai | grep "permanent" +``` + +Fix: Ensure endpoint configs match permanent browser config exactly + +**4. 
Janitor Not Cleaning Up** + +Symptoms: Memory stays high after idle period + +Debug: +```bash +# Check janitor events +curl http://localhost:11235/monitor/logs/janitor + +# Check pool stats over time +watch -n 5 'curl -s http://localhost:11235/monitor/browsers | jq ".summary"' +``` + +Fix: +- Janitor runs every 10-60s depending on memory +- Hot pool browsers have longer TTL (by design) +- Permanent browser never cleaned (by design) + +### Debug Tools + +**Config Signature Checker:** + +```python +from crawl4ai import BrowserConfig +import json, hashlib + +def check_sig(cfg: BrowserConfig) -> str: + payload = json.dumps(cfg.to_dict(), sort_keys=True, separators=(",",":")) + sig = hashlib.sha1(payload.encode()).hexdigest() + return sig[:8] + +# Example +cfg1 = BrowserConfig() +cfg2 = BrowserConfig(headless=True) +print(f"Default: {check_sig(cfg1)}") +print(f"Custom: {check_sig(cfg2)}") +``` + +**Monitor Stats Dumper:** + +```bash +#!/bin/bash +# Dump all monitor stats to JSON + +curl -s http://localhost:11235/monitor/health > health.json +curl -s http://localhost:11235/monitor/requests?limit=100 > requests.json +curl -s http://localhost:11235/monitor/browsers > browsers.json +curl -s http://localhost:11235/monitor/logs/janitor > janitor.json +curl -s http://localhost:11235/monitor/logs/errors > errors.json + +echo "Monitor stats dumped to *.json files" +``` + +**WebSocket Test Script:** + +```python +# test-websocket.py (included in repo) +import asyncio +import websockets +import json + +async def test_websocket(): + uri = "ws://localhost:11235/monitor/ws" + async with websockets.connect(uri) as websocket: + for i in range(5): + message = await websocket.recv() + data = json.loads(message) + print(f"\nUpdate #{i+1}:") + print(f" Health: CPU {data['health']['container']['cpu_percent']}%") + print(f" Active Requests: {len(data['requests']['active'])}") + print(f" Browsers: {len(data['browsers'])}") + +asyncio.run(test_websocket()) +``` + +### Performance Tuning + 
+**For High Throughput:** + +```yaml +# config.yml +crawler: + memory_threshold_percent: 90 # Allow more browsers + pool: + idle_ttl_sec: 600 # Keep browsers longer + rate_limiter: + enabled: false # Disable for max speed +``` + +**For Low Memory:** + +```yaml +# config.yml +crawler: + browser: + kwargs: + text_mode: true # 30-40% memory reduction + memory_threshold_percent: 80 # More conservative + pool: + idle_ttl_sec: 60 # Aggressive cleanup +``` + +**For Stability:** + +```yaml +# config.yml +crawler: + memory_threshold_percent: 85 # Balanced + pool: + idle_ttl_sec: 300 # Moderate cleanup + rate_limiter: + enabled: true + base_delay: [2.0, 5.0] # Prevent rate limiting +``` + +--- + +## Test Suite + +**Location:** `deploy/docker/tests/` + +**Tests:** + +1. `test_1_basic.py` - Health check, container lifecycle +2. `test_2_memory.py` - Memory tracking, leak detection +3. `test_3_pool.py` - Pool reuse validation +4. `test_4_concurrent.py` - Concurrent load testing +5. `test_5_pool_stress.py` - Multi-config pool behavior +6. `test_6_multi_endpoint.py` - All endpoint validation +7. `test_7_cleanup.py` - Janitor cleanup verification + +**Run All Tests:** + +```bash +cd deploy/docker/tests +pip install -r requirements.txt + +# Build image first +cd /path/to/repo +docker build -t crawl4ai-local:latest . + +# Run tests +cd deploy/docker/tests +for test in test_*.py; do + echo "Running $test..." + python $test || break +done +``` + +--- + +## Architecture Decision Log + +### Why 3-Tier Pool? 
+ +**Decision:** PERMANENT + HOT_POOL + COLD_POOL + +**Rationale:** +- 90% of requests use default config → permanent browser serves most traffic +- Frequent variants (hot) deserve longer TTL for better reuse +- Rare configs (cold) should be cleaned aggressively to save memory + +**Alternatives Considered:** +- Single pool: Too simple, no optimization for common case +- LRU cache: Doesn't capture "hot" vs "rare" distinction +- Per-endpoint pools: Too complex, over-engineering + +### Why WebSocket + Polling Fallback? + +**Decision:** WebSocket primary, HTTP polling backup + +**Rationale:** +- WebSocket provides real-time updates (2s interval) +- Polling fallback ensures reliability in restricted networks +- Auto-reconnect handles temporary disconnections + +**Alternatives Considered:** +- Polling only: Works but higher latency, more server load +- WebSocket only: Fails in restricted networks +- Server-Sent Events: One-way, no client messages + +### Why Background Persistence Worker? + +**Decision:** Queue-based worker for Redis operations + +**Rationale:** +- Fire-and-forget loses data on failures +- Queue provides buffering and retry capability +- Non-blocking keeps request path fast + +**Alternatives Considered:** +- Synchronous writes: Blocks request handling +- Fire-and-forget: Silent failures +- Batch writes: Complex state management + +--- + +## Contributing + +When modifying the architecture: + +1. **Maintain backward compatibility** in API contracts +2. **Add tests** for new functionality +3. **Update this document** with architectural changes +4. **Profile memory impact** before production +5. 
**Test under load** using the test suite + +**Code Review Checklist:** +- [ ] Race conditions protected with locks +- [ ] Error handling with proper logging +- [ ] Graceful degradation on failures +- [ ] Memory impact measured +- [ ] Tests added/updated +- [ ] Documentation updated + +--- + +## License & Credits + +**Crawl4AI** - Created by Unclecode +**GitHub**: https://github.com/unclecode/crawl4ai +**License**: See LICENSE file in repository + +**Architecture & Optimizations**: October 2025 +**WebSocket Monitoring**: October 2025 +**Production Hardening**: October 2025 + +--- + +**End of Technical Architecture Document** + +For questions or issues, please open a GitHub issue at: +https://github.com/unclecode/crawl4ai/issues