unclecode 05921811b8 docs: add comprehensive technical architecture documentation

Created ARCHITECTURE.md as a complete technical reference for the
Crawl4AI Docker server, replacing the stress test pipeline document
with production-grade documentation.

Contents:
- System overview with architecture diagrams
- Core components deep-dive (server, API, utils)
- Smart browser pool implementation details
- Real-time monitoring system architecture
- WebSocket implementation and fallback strategy
- Memory management and container detection
- Production optimizations and code review fixes
- Deployment guides (local, Docker, production)
- Comprehensive troubleshooting section
- Debug tools and performance tuning
- Test suite documentation
- Architecture decision log (ADRs)

Target audience: Developers maintaining or extending the system
Goal: Enable rapid onboarding and confident modifications

2025-10-18 12:05:49 +08:00


Crawl4AI Docker Server - Technical Architecture

Version: 0.7.4
Last Updated: October 2025
Status: Production-ready with real-time monitoring

This document provides a comprehensive technical overview of the Crawl4AI Docker server architecture, including the smart browser pool, real-time monitoring system, and all production optimizations.

System Overview
Core Components
Smart Browser Pool
Real-time Monitoring System
API Layer
Memory Management
Production Optimizations
Deployment & Operations
Troubleshooting & Debugging

System Overview

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                     Client Requests                          │
└────────────┬────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────┐
│  FastAPI Server (server.py)                                  │
│  ├─ REST API Endpoints (/crawl, /html, /md, /llm, etc.)    │
│  ├─ WebSocket Endpoint (/monitor/ws)                        │
│  └─ Background Tasks (janitor, timeline_updater)            │
└────┬────────────────────┬────────────────────┬──────────────┘
     │                    │                    │
     ▼                    ▼                    ▼
┌─────────────┐  ┌──────────────────┐  ┌─────────────────┐
│ Browser     │  │ Monitor System   │  │ Redis           │
│ Pool        │  │ (monitor.py)     │  │ (Persistence)   │
│             │  │                  │  │                 │
│ PERMANENT ●─┤  │ ├─ Stats         │  │ ├─ Endpoint     │
│ HOT_POOL  ♨─┤  │ ├─ Requests      │  │ │   Stats       │
│ COLD_POOL ❄─┤  │ ├─ Browsers      │  │ ├─ Task         │
│             │  │ ├─ Timeline      │  │ │   Results     │
│ Janitor  🧹─┤  │ └─ Events/Errors │  │ └─ Cache        │
└─────────────┘  └──────────────────┘  └─────────────────┘

Key Features

10x Memory Efficiency: Smart 3-tier browser pooling reduces memory from 500-700MB to 50-70MB per concurrent user
Real-time Monitoring: WebSocket-based live dashboard with 2-second update intervals
Production-Ready: Comprehensive error handling, timeouts, cleanup, and graceful shutdown
Container-Aware: Accurate memory detection using cgroup v2/v1
Auto-Recovery: Graceful WebSocket fallback, lock protection, background workers

Core Components

1. Server Core (`server.py`)

Responsibilities:

FastAPI application lifecycle management
Route registration and middleware
Background task orchestration
Graceful shutdown handling

Key Functions:

from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    """Application lifecycle manager (outline: each step is listed as a comment)"""
    # Startup
    # - Initialize Redis connection
    # - Create monitor stats instance
    # - Start persistence worker
    # - Initialize permanent browser
    # - Start janitor (browser cleanup)
    # - Start timeline updater (5s interval)

    yield

    # Shutdown
    # - Cancel background tasks
    # - Persist final monitor stats
    # - Stop persistence worker
    # - Close all browsers

Configuration:

Loaded from config.yml
Browser settings, memory thresholds, rate limiting
LLM provider credentials
Server host/port

2. API Layer (`api.py`)

Endpoints:

Endpoint	Method	Purpose	Pool Usage
`/health`	GET	Health check	None
`/crawl`	POST	Full crawl with all features	✓ Pool
`/crawl_stream`	POST	Streaming crawl results	✓ Pool
`/html`	POST	HTML extraction	✓ Pool
`/md`	POST	Markdown generation	✓ Pool
`/screenshot`	POST	Page screenshots	✓ Pool
`/pdf`	POST	PDF generation	✓ Pool
`/llm/{path}`	GET/POST	LLM extraction	✓ Pool
`/crawl/job`	POST	Background job creation	✓ Pool

Request Flow:

@app.post("/crawl")
async def crawl(body: CrawlRequest):
    # 1. Track request start
    request_id = f"req_{uuid4().hex[:8]}"
    await get_monitor().track_request_start(request_id, "/crawl", url, config)

    # 2. Get browser from pool
    from crawler_pool import get_crawler
    crawler = await get_crawler(browser_config)

    # 3. Execute crawl
    result = await crawler.arun(url, config=crawler_config)

    # 4. Track request completion
    await get_monitor().track_request_end(request_id, success=True)

    # 5. Return result (browser stays in pool)
    return result

3. Utility Layer (`utils.py`)

Container Memory Detection:

from pathlib import Path
import psutil

def get_container_memory_percent() -> float:
    """Accurate container memory detection"""
    try:
        # Try cgroup v2 first
        current = int(Path("/sys/fs/cgroup/memory.current").read_text().strip())
        max_mem = int(Path("/sys/fs/cgroup/memory.max").read_text().strip())
        return (current / max_mem) * 100
    except (OSError, ValueError, ZeroDivisionError):
        pass  # Files missing, or memory.max holds "max" (no limit set)

    try:
        # Fallback to cgroup v1
        usage = int(Path("/sys/fs/cgroup/memory/memory.usage_in_bytes").read_text())
        limit = int(Path("/sys/fs/cgroup/memory/memory.limit_in_bytes").read_text())
        return (usage / limit) * 100
    except (OSError, ValueError, ZeroDivisionError):
        pass

    # Final fallback to psutil (may be inaccurate in containers)
    return psutil.virtual_memory().percent

Helper Functions:

get_base_url(): Request base URL extraction
is_task_id(): Task ID validation
should_cleanup_task(): TTL-based cleanup logic
validate_llm_provider(): LLM configuration validation

Smart Browser Pool

Architecture

The browser pool implements a 3-tier strategy optimized for real-world usage patterns:

┌──────────────────────────────────────────────────────────┐
│  PERMANENT Browser (Default Config)                      │
│  ● Always alive, never cleaned                           │
│  ● Serves 90% of requests                                │
│  ● ~270MB memory                                         │
└──────────────────────────────────────────────────────────┘
                        ▲
                        │ 90% of requests
                        │
┌──────────────────────────────────────────────────────────┐
│  HOT_POOL (Frequently Used Configs)                      │
│  ♨ Configs used 3+ times                                 │
│  ♨ Longer TTL (2-10 min depending on memory)            │
│  ♨ ~180MB per browser                                   │
└──────────────────────────────────────────────────────────┘
                        ▲
                        │ Promotion at 3 uses
                        │
┌──────────────────────────────────────────────────────────┐
│  COLD_POOL (Rarely Used Configs)                         │
│  ❄ New/rare browser configs                             │
│  ❄ Short TTL (30s-5min depending on memory)             │
│  ❄ ~180MB per browser                                   │
└──────────────────────────────────────────────────────────┘
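
The tier rules in the diagram can be sketched as a pure function. This is a minimal illustration rather than the code in crawler_pool.py: `classify_tier` and `PROMOTE_AT` are hypothetical names, and the threshold of 3 is taken from the "Promotion at 3 uses" label above.

```python
PROMOTE_AT = 3  # uses before a config moves from cold to hot (from the diagram)


def classify_tier(sig: str, default_sig: str, usage_count: int) -> str:
    """Return the tier that a browser config signature belongs to."""
    if sig == default_sig:
        return "PERMANENT"      # default config is always served by the permanent browser
    if usage_count >= PROMOTE_AT:
        return "HOT_POOL"       # frequently used variant, kept alive longer
    return "COLD_POOL"          # new or rarely used variant, cleaned up first


print(classify_tier("abc", "abc", 0))   # PERMANENT
print(classify_tier("def", "abc", 3))   # HOT_POOL
print(classify_tier("def", "abc", 1))   # COLD_POOL
```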

Implementation (`crawler_pool.py`)

Core Data Structures:

PERMANENT: Optional[AsyncWebCrawler] = None  # Default browser
HOT_POOL: Dict[str, AsyncWebCrawler] = {}    # Frequent configs
COLD_POOL: Dict[str, AsyncWebCrawler] = {}   # Rare configs
LAST_USED: Dict[str, float] = {}             # Timestamp tracking
USAGE_COUNT: Dict[str, int] = {}             # Usage counter
LOCK = asyncio.Lock()                        # Serializes coroutine access (not thread-safe)

Browser Acquisition Flow:

async def get_crawler(cfg: BrowserConfig) -> AsyncWebCrawler:
    sig = _sig(cfg)  # SHA1 hash of config

    async with LOCK:  # Prevent race conditions
        # 1. Check permanent browser
        if _is_default_config(sig):
            return PERMANENT

        # 2. Check hot pool
        if sig in HOT_POOL:
            USAGE_COUNT[sig] += 1
            LAST_USED[sig] = time.time()  # Refresh idle timer for the janitor
            return HOT_POOL[sig]

        # 3. Check cold pool (with promotion logic)
        if sig in COLD_POOL:
            USAGE_COUNT[sig] += 1
            LAST_USED[sig] = time.time()
            if USAGE_COUNT[sig] >= 3:
                # Promote to hot pool
                HOT_POOL[sig] = COLD_POOL.pop(sig)
                await get_monitor().track_janitor_event("promote", sig, {...})
                return HOT_POOL[sig]
            return COLD_POOL[sig]

        # 4. Memory check before creating new
        mem_pct = get_container_memory_percent()
        if mem_pct >= MEM_LIMIT:
            raise MemoryError(f"Memory at {mem_pct:.1f}%, refusing new browser")

        # 5. Create new browser in cold pool
        crawler = AsyncWebCrawler(config=cfg)
        await crawler.start()
        COLD_POOL[sig] = crawler
        USAGE_COUNT[sig] = 1          # First use; promotion happens at 3
        LAST_USED[sig] = time.time()
        return crawler

Janitor (Adaptive Cleanup):

async def janitor():
    """Memory-adaptive browser cleanup"""
    while True:
        mem_pct = get_container_memory_percent()

        # Adaptive intervals based on memory pressure
        if mem_pct > 80:
            interval, cold_ttl, hot_ttl = 10, 30, 120      # Aggressive
        elif mem_pct > 60:
            interval, cold_ttl, hot_ttl = 30, 60, 300      # Moderate
        else:
            interval, cold_ttl, hot_ttl = 60, 300, 600     # Relaxed

        await asyncio.sleep(interval)

        async with LOCK:
            now = time.time()  # Reference point for idle-time checks

            # Clean cold pool first (less valuable)
            for sig in list(COLD_POOL.keys()):
                if now - LAST_USED[sig] > cold_ttl:
                    await COLD_POOL[sig].close()
                    del COLD_POOL[sig], LAST_USED[sig], USAGE_COUNT[sig]
                    await track_janitor_event("close_cold", sig, {...})

            # Clean hot pool (more conservative)
            for sig in list(HOT_POOL.keys()):
                if now - LAST_USED[sig] > hot_ttl:
                    await HOT_POOL[sig].close()
                    del HOT_POOL[sig], LAST_USED[sig], USAGE_COUNT[sig]
                    await track_janitor_event("close_hot", sig, {...})
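
The memory-to-TTL mapping in the janitor loop can be isolated into a small helper, which makes the thresholds easy to unit-test. A minimal sketch: `JanitorSettings` and `janitor_settings` are hypothetical names, and the numbers are copied from the loop above.

```python
from typing import NamedTuple


class JanitorSettings(NamedTuple):
    interval: int   # seconds between janitor passes
    cold_ttl: int   # idle seconds before a cold browser is closed
    hot_ttl: int    # idle seconds before a hot browser is closed


def janitor_settings(mem_pct: float) -> JanitorSettings:
    """Pick cleanup aggressiveness from container memory pressure."""
    if mem_pct > 80:
        return JanitorSettings(10, 30, 120)    # aggressive
    if mem_pct > 60:
        return JanitorSettings(30, 60, 300)    # moderate
    return JanitorSettings(60, 300, 600)       # relaxed


print(janitor_settings(85.0))
print(janitor_settings(45.0))
```

Note that the comparisons are strict, so exactly 80% still maps to the moderate profile.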

Config Signature Generation:

def _sig(cfg: BrowserConfig) -> str:
    """Generate unique signature for browser config"""
    payload = json.dumps(cfg.to_dict(), sort_keys=True, separators=(",",":"))
    return hashlib.sha1(payload.encode()).hexdigest()
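
Sorting keys and fixing the separators is what makes the signature independent of dict ordering and whitespace. The sketch below shows this with plain dicts standing in for `cfg.to_dict()`; `config_signature` and the example keys are illustrative assumptions, not the real BrowserConfig fields.

```python
import hashlib
import json


def config_signature(cfg: dict) -> str:
    """SHA-1 of the canonical (sorted, compact) JSON form of a config dict."""
    payload = json.dumps(cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(payload.encode()).hexdigest()


a = {"headless": True, "viewport_width": 1080}
b = {"viewport_width": 1080, "headless": True}    # same settings, different key order
c = {"headless": False, "viewport_width": 1080}   # one value changed

print(config_signature(a) == config_signature(b))  # True
print(config_signature(a) == config_signature(c))  # False
```

Any value that differs between two configs, however minor, yields a different signature and therefore a separate pooled browser.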

Real-time Monitoring System

Architecture

The monitoring system provides real-time insights via WebSocket with automatic fallback to HTTP polling.

Components:

┌─────────────────────────────────────────────────────────┐
│  MonitorStats Class (monitor.py)                        │
│  ├─ In-memory queues (deques with maxlen)              │
│  ├─ Background persistence worker                       │
│  ├─ Timeline tracking (5-min window, 5s resolution)    │
│  └─ Time-based expiry (5min for old entries)           │
└───────────┬─────────────────────────────────────────────┘
            │
            ▼
┌─────────────────────────────────────────────────────────┐
│  WebSocket Endpoint (/monitor/ws)                       │
│  ├─ 2-second update intervals                          │
│  ├─ Auto-reconnect with linear backoff                 │
│  ├─ Comprehensive data payload                         │
│  └─ Graceful fallback to polling                       │
└───────────┬─────────────────────────────────────────────┘
            │
            ▼
┌─────────────────────────────────────────────────────────┐
│  Dashboard UI (static/monitor/index.html)               │
│  ├─ Connection status indicator                        │
│  ├─ Live updates (health, requests, browsers)          │
│  ├─ Timeline charts (memory, requests, browsers)       │
│  └─ Janitor events & error logs                        │
└─────────────────────────────────────────────────────────┘

Monitor Stats (`monitor.py`)

Data Structures:

class MonitorStats:
    def __init__(self):
        # In-memory queues
        self.active_requests: Dict[str, Dict] = {}      # Currently processing
        self.completed_requests = deque(maxlen=100)     # Last 100 completed
        self.janitor_events = deque(maxlen=100)         # Cleanup events
        self.errors = deque(maxlen=100)                 # Error log

        # Endpoint stats (persisted to Redis)
        self.endpoint_stats: Dict[str, Dict] = {}       # Aggregated stats

        # Timeline data (5min window, 5s resolution = 60 points)
        self.memory_timeline = deque(maxlen=60)
        self.requests_timeline = deque(maxlen=60)
        self.browser_timeline = deque(maxlen=60)

        # Background persistence
        self._persist_queue: asyncio.Queue = asyncio.Queue(maxsize=10)
        self._persist_worker_task: Optional[asyncio.Task] = None
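
The bounded deques give a fixed memory footprint: once `maxlen` is reached, each append silently drops the oldest entry. A small sketch of the 5-minute timeline follows; the constant names and the sample payload are assumptions for illustration.

```python
from collections import deque

WINDOW_SECONDS = 300       # 5-minute window
RESOLUTION_SECONDS = 5     # one sample every 5 seconds
POINTS = WINDOW_SECONDS // RESOLUTION_SECONDS   # 60 samples

timeline = deque(maxlen=POINTS)

# Simulate 100 samples; only the newest 60 survive.
for tick in range(100):
    timeline.append({"tick": tick, "value": tick * 1.5})

print(len(timeline), timeline[0]["tick"], timeline[-1]["tick"])  # 60 40 99
```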

Request Tracking:

async def track_request_start(self, request_id, endpoint, url, config):
    """Track new request"""
    self.active_requests[request_id] = {
        "id": request_id,
        "endpoint": endpoint,
        "url": url,
        "start_time": time.time(),
        "mem_start": psutil.Process().memory_info().rss / (1024 * 1024)
    }

    # Update endpoint stats
    if endpoint not in self.endpoint_stats:
        self.endpoint_stats[endpoint] = {
            "count": 0, "total_time": 0, "errors": 0,
            "pool_hits": 0, "success": 0
        }
    self.endpoint_stats[endpoint]["count"] += 1

    # Queue background persistence; a full queue means a flush is already pending
    try:
        self._persist_queue.put_nowait(True)
    except asyncio.QueueFull:
        pass

async def track_request_end(self, request_id, success, error=None):  # further parameters elided in the source
    """Track request completion"""
    req_info = self.active_requests.pop(request_id, None)
    if req_info is None:
        return  # Start was never recorded; nothing to close out
    endpoint = req_info["endpoint"]
    elapsed = time.time() - req_info["start_time"]
    current_mem = psutil.Process().memory_info().rss / (1024 * 1024)
    mem_delta = current_mem - req_info["mem_start"]

    # Add to completed queue
    self.completed_requests.append({
        "id": request_id,
        "endpoint": endpoint,
        "url": req_info["url"],
        "success": success,
        "elapsed": elapsed,
        "mem_delta": mem_delta,
        "end_time": time.time()
    })

    # Update stats
    self.endpoint_stats[endpoint]["success" if success else "errors"] += 1
    await self._persist_endpoint_stats()

Background Persistence Worker:

async def _persistence_worker(self):
    """Background worker for Redis persistence"""
    while True:
        try:
            await self._persist_queue.get()
            await self._persist_endpoint_stats()
            self._persist_queue.task_done()
        except asyncio.CancelledError:
            break
        except Exception as e:
            logger.error(f"Persistence worker error: {e}")

async def _persist_endpoint_stats(self):
    """Persist stats to Redis with error handling"""
    try:
        await self.redis.set(
            "monitor:endpoint_stats",
            json.dumps(self.endpoint_stats),
            ex=86400  # 24h TTL
        )
    except Exception as e:
        logger.warning(f"Failed to persist endpoint stats: {e}")
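
The worker pattern above can be exercised without Redis. The sketch below is an assumption-laden stand-in, not the server code: `FakeStore` replaces the Redis client, and `request_persist` and `demo` are hypothetical helpers. It shows that a bounded queue coalesces bursts, since a full queue already guarantees a pending flush of the latest stats.

```python
import asyncio
import json


class FakeStore:
    """Stand-in for the Redis client; keeps the last value written per key."""

    def __init__(self):
        self.data = {}
        self.writes = 0

    async def set(self, key, value, ex=None):
        self.data[key] = value
        self.writes += 1


async def persistence_worker(queue, store, stats):
    """Flush the current stats snapshot each time a signal arrives."""
    while True:
        await queue.get()
        try:
            await store.set("monitor:endpoint_stats", json.dumps(stats), ex=86400)
        finally:
            queue.task_done()


def request_persist(queue):
    """Signal the worker without blocking the request path."""
    try:
        queue.put_nowait(True)
    except asyncio.QueueFull:
        pass  # a flush is already pending; it will pick up the latest stats


async def demo():
    stats = {"/crawl": {"count": 0}}
    store = FakeStore()
    queue = asyncio.Queue(maxsize=10)
    worker = asyncio.create_task(persistence_worker(queue, store, stats))

    for _ in range(25):                 # burst of 25 requests
        stats["/crawl"]["count"] += 1
        request_persist(queue)

    await queue.join()                  # wait until every queued signal is handled
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return json.loads(store.data["monitor:endpoint_stats"]), store.writes


final_stats, writes = asyncio.run(demo())
print(final_stats, writes)
```

Although 25 signals were raised, at most 10 writes happen, and the stored snapshot still reflects all 25 requests.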

Time-based Cleanup:

def _cleanup_old_entries(self, max_age_seconds=300):
    """Remove entries older than 5 minutes"""
    now = time.time()
    cutoff = now - max_age_seconds

    # Clean completed requests
    while self.completed_requests and \
          self.completed_requests[0].get("end_time", 0) < cutoff:
        self.completed_requests.popleft()

    # Clean janitor events
    while self.janitor_events and \
          self.janitor_events[0].get("timestamp", 0) < cutoff:
        self.janitor_events.popleft()

    # Clean errors
    while self.errors and \
          self.errors[0].get("timestamp", 0) < cutoff:
        self.errors.popleft()
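
The three loops share one pattern, which works because entries are appended in chronological order, so the oldest entry is always at the left end. A minimal sketch of that pattern as a reusable helper; `expire_old` is a hypothetical name and not part of monitor.py.

```python
from collections import deque


def expire_old(entries: deque, time_key: str, cutoff: float) -> int:
    """Pop entries from the left while they are older than cutoff; return how many were removed."""
    removed = 0
    while entries and entries[0].get(time_key, 0) < cutoff:
        entries.popleft()
        removed += 1
    return removed


events = deque(
    [{"timestamp": 100.0}, {"timestamp": 250.0}, {"timestamp": 390.0}],
    maxlen=100,
)
now = 400.0
removed = expire_old(events, "timestamp", cutoff=now - 300)  # cutoff = 100.0
print(removed, len(events))  # 0 3

removed = expire_old(events, "timestamp", cutoff=now - 100)  # cutoff = 300.0
print(removed, len(events))  # 2 1
```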

WebSocket Implementation (`monitor_routes.py`)

Endpoint:

@router.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    """Real-time monitoring updates"""
    await websocket.accept()
    logger.info("WebSocket client connected")

    try:
        while True:
            try:
                monitor = get_monitor()

                # Gather comprehensive monitoring data
                data = {
                    "timestamp": time.time(),
                    "health": await monitor.get_health_summary(),
                    "requests": {
                        "active": monitor.get_active_requests(),
                        "completed": monitor.get_completed_requests(limit=10)
                    },
                    "browsers": await monitor.get_browser_list(),
                    "timeline": {
                        "memory": monitor.get_timeline_data("memory", "5m"),
                        "requests": monitor.get_timeline_data("requests", "5m"),
                        "browsers": monitor.get_timeline_data("browsers", "5m")
                    },
                    "janitor": monitor.get_janitor_log(limit=10),
                    "errors": monitor.get_errors_log(limit=10)
                }

                await websocket.send_json(data)
                await asyncio.sleep(2)  # 2-second update interval

            except WebSocketDisconnect:
                logger.info("WebSocket client disconnected")
                break
            except Exception as e:
                logger.error(f"WebSocket error: {e}", exc_info=True)
                await asyncio.sleep(2)
    except Exception as e:
        logger.error(f"WebSocket connection error: {e}", exc_info=True)
    finally:
        logger.info("WebSocket connection closed")

Input Validation:

@router.get("/requests")
async def get_requests(status: str = "all", limit: int = 50):
    # Input validation
    if status not in ["all", "active", "completed", "success", "error"]:
        raise HTTPException(400, f"Invalid status: {status}")
    if limit < 1 or limit > 1000:
        raise HTTPException(400, f"Invalid limit: {limit}")

    monitor = get_monitor()
    # ... return data

Frontend Dashboard

Connection Management:

// WebSocket with auto-reconnect
function connectWebSocket() {
    if (wsReconnectAttempts >= MAX_WS_RECONNECT) {
        // Fallback to polling after 5 failed attempts
        useWebSocket = false;
        updateConnectionStatus('polling');
        startAutoRefresh();
        return;
    }

    updateConnectionStatus('connecting');
    // Match the page scheme: wss: on HTTPS pages, ws: otherwise
    const protocol = window.location.protocol === 'https:' ? 'wss:' : 'ws:';
    const wsUrl = `${protocol}//${window.location.host}/monitor/ws`;
    websocket = new WebSocket(wsUrl);

    websocket.onopen = () => {
        wsReconnectAttempts = 0;
        updateConnectionStatus('connected');
        stopAutoRefresh();  // Stop polling
    };

    websocket.onmessage = (event) => {
        const data = JSON.parse(event.data);
        updateDashboard(data);  // Update all sections
    };

    websocket.onclose = () => {
        updateConnectionStatus('disconnected', 'Reconnecting...');
        if (useWebSocket) {
            wsReconnectAttempts++;  // Count the failure so the delay grows and the fallback can trigger
            setTimeout(connectWebSocket, 2000 * wsReconnectAttempts);
        } else {
            startAutoRefresh();  // Fallback to polling
        }
    };
}

Connection Status Indicator:

Status	Color	Animation	Meaning
Live	Green	Pulsing fast	WebSocket connected
Connecting...	Yellow	Pulsing slow	Attempting connection
Polling	Blue	Pulsing slow	HTTP polling fallback
Disconnected	Red	None	Connection failed
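
The reconnect timing can be reasoned about with a short calculation. This sketch assumes the attempt counter is incremented on every close (which the snippet above may not show) and that `MAX_WS_RECONNECT` is 5, as the "5 failed attempts" comment suggests; `reconnect_delays_ms` is a hypothetical helper written in Python only to illustrate the schedule.

```python
MAX_WS_RECONNECT = 5    # assumed from the "5 failed attempts" comment
BASE_DELAY_MS = 2000    # multiplier used in setTimeout


def reconnect_delays_ms(max_attempts: int = MAX_WS_RECONNECT,
                        base_ms: int = BASE_DELAY_MS) -> list:
    """Delay before each scheduled reconnect: base_ms times the attempt number."""
    return [base_ms * attempt for attempt in range(1, max_attempts + 1)]


delays = reconnect_delays_ms()
print(delays)               # [2000, 4000, 6000, 8000, 10000]
print(sum(delays) / 1000)   # 30.0 seconds of waiting before polling takes over
```

Under these assumptions the delay grows linearly, and the dashboard switches to HTTP polling roughly 30 seconds after the first disconnect.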

API Layer

Request/Response Flow

Client Request
    │
    ▼
FastAPI Route Handler
    │
    ├─→ Monitor: track_request_start()
    │
    ├─→ Browser Pool: get_crawler(config)
    │       │
    │       ├─→ Check PERMANENT
    │       ├─→ Check HOT_POOL
    │       ├─→ Check COLD_POOL
    │       └─→ Create New (if needed)
    │
    ├─→ Execute Crawl
    │       │
    │       ├─→ Fetch page
    │       ├─→ Extract content
    │       ├─→ Apply filters/strategies
    │       └─→ Return result
    │
    ├─→ Monitor: track_request_end()
    │
    └─→ Return Response (browser stays in pool)

Error Handling Strategy

Levels:

Route Level: HTTP exceptions with proper status codes
Monitor Level: Try-except with logging, non-critical failures
Pool Level: Memory checks, lock protection, graceful degradation
WebSocket Level: Auto-reconnect, fallback to polling

Example:

@app.post("/crawl")
async def crawl(body: CrawlRequest):
    request_id = f"req_{uuid4().hex[:8]}"

    try:
        # Monitor tracking (non-blocking on failure)
        try:
            await get_monitor().track_request_start(...)
        except Exception:
            pass  # Monitor not critical

        # Browser acquisition (with memory protection)
        crawler = await get_crawler(browser_config)

        # Crawl execution
        result = await crawler.arun(url, config=cfg)

        # Success tracking
        try:
            await get_monitor().track_request_end(request_id, success=True)
        except Exception:
            pass

        return result

    except MemoryError as e:
        # Memory pressure - return 503
        await get_monitor().track_request_end(request_id, success=False, error=str(e))
        raise HTTPException(503, "Server at capacity")
    except Exception as e:
        # General errors - return 500
        await get_monitor().track_request_end(request_id, success=False, error=str(e))
        raise HTTPException(500, str(e))

Memory Management

Container Memory Detection

Priority Order:

cgroup v2 (/sys/fs/cgroup/memory.{current,max})
cgroup v1 (/sys/fs/cgroup/memory/memory.{usage,limit}_in_bytes)
psutil fallback (may be inaccurate in containers)
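
One detail worth handling explicitly: under cgroup v2, `memory.max` contains the literal string `max` when no limit is set, so it cannot always be parsed as an integer. The sketch below separates the parsing from the file reads so it can be tested anywhere; `memory_percent` is a hypothetical helper, not a function from utils.py.

```python
from typing import Optional


def memory_percent(current_text: str, limit_text: str) -> Optional[float]:
    """Return usage as a percentage, or None when there is no finite limit."""
    limit_raw = limit_text.strip()
    if limit_raw == "max":        # cgroup v2: container has no memory limit
        return None
    limit = int(limit_raw)
    if limit <= 0:
        return None
    return int(current_text.strip()) / limit * 100


print(memory_percent("536870912\n", "1073741824\n"))  # 50.0
print(memory_percent("536870912\n", "max\n"))         # None
```

A caller would move on to the next source in the priority order whenever this returns None.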

Usage:

mem_pct = get_container_memory_percent()

if mem_pct >= 95:  # Critical
    raise MemoryError("Refusing new browser")
elif mem_pct > 80:  # High pressure
    pass  # Janitor: aggressive cleanup (10s interval, 30s cold TTL)
elif mem_pct > 60:  # Moderate pressure
    pass  # Janitor: moderate cleanup (30s interval, 60s cold TTL)
else:  # Normal
    pass  # Janitor: relaxed cleanup (60s interval, 300s cold TTL)

Memory Budgets

Component	Memory	Notes
Base Container	270 MB	Python + FastAPI + libraries
Permanent Browser	270 MB	Always-on default browser
Hot Pool Browser	180 MB	Per frequently-used config
Cold Pool Browser	180 MB	Per rarely-used config
Active Crawl Overhead	50-200 MB	Temporary, released after request

Example Calculation:

Container: 270 MB
Permanent: 270 MB
2x Hot:    360 MB
1x Cold:   180 MB
Total:     1080 MB baseline

Under load (10 concurrent):
+ Active crawls: ~500-1000 MB
= Peak: ~1.6-2.1 GB
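
The same arithmetic as a small calculator, useful for sizing a container for a different pool mix. The per-component figures come from the Memory Budgets table; `baseline_mb` and `peak_range_mb` are hypothetical helper names.

```python
BUDGET_MB = {"container": 270, "permanent": 270, "hot": 180, "cold": 180}


def baseline_mb(hot: int, cold: int) -> int:
    """Idle footprint: base container + permanent browser + pooled browsers."""
    return (BUDGET_MB["container"] + BUDGET_MB["permanent"]
            + hot * BUDGET_MB["hot"] + cold * BUDGET_MB["cold"])


def peak_range_mb(hot: int, cold: int, crawl_overhead=(500, 1000)) -> tuple:
    """Baseline plus the low and high estimates of active crawl overhead."""
    base = baseline_mb(hot, cold)
    return base + crawl_overhead[0], base + crawl_overhead[1]


print(baseline_mb(2, 1))     # 1080
print(peak_range_mb(2, 1))   # (1580, 2080)
```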

Production Optimizations

Code Review Fixes Applied

Critical (3):

✅ Lock protection for browser pool access
✅ Async track_janitor_event implementation
✅ Error handling in request tracking

Important (8):

✅ Background persistence worker (replaces fire-and-forget)
✅ Time-based expiry (5min cleanup for old entries)
✅ Input validation (status, limit, metric, window)
✅ Timeline updater timeout (4s max)
✅ Warn when killing browsers with active requests
✅ Monitor cleanup on shutdown
✅ Document memory estimates
✅ Structured error responses (HTTPException)

Performance Characteristics

Latency:

Scenario	Time	Notes
Pool Hit (Permanent)	<100ms	Browser ready
Pool Hit (Hot/Cold)	<100ms	Browser ready
New Browser Creation	3-5s	Chromium startup
Simple Page Fetch	1-3s	Network + render
Complex Extraction	5-10s	LLM processing

Throughput:

Load	Concurrent	Response Time	Success Rate
Light	1-10	<3s	100%
Medium	10-50	3-8s	100%
Heavy	50-100	8-15s	95-100%
Extreme	100+	15-30s	80-95%

Reliability Features

Race Condition Protection:

asyncio.Lock on all pool operations
Lock on browser pool stats access
Async janitor event tracking

Graceful Degradation:

WebSocket → HTTP polling fallback
Redis persistence failures (logged, non-blocking)
Monitor tracking failures (logged, non-blocking)

Resource Cleanup:

Janitor cleanup (adaptive intervals)
Time-based expiry (5min for old data)
Shutdown cleanup (persist final stats, close browsers)
Background worker cancellation

Deployment & Operations

Running Locally

# Install dependencies
pip install -r requirements.txt

# Configure
cp .llm.env.example .llm.env
# Edit .llm.env with your API keys

# Run server
python -m uvicorn server:app --host 0.0.0.0 --port 11235 --reload

Docker Deployment

# Build image
docker build -t crawl4ai:latest -f Dockerfile .

# Run container
docker run -d \
  --name crawl4ai \
  -p 11235:11235 \
  --shm-size=1g \
  --env-file .llm.env \
  crawl4ai:latest

Production Configuration

config.yml Key Settings:

crawler:
  browser:
    extra_args:
      - "--disable-gpu"
      - "--disable-dev-shm-usage"
      - "--no-sandbox"
    kwargs:
      headless: true
      text_mode: true  # Reduces memory by 30-40%

  memory_threshold_percent: 95  # Refuse new browsers above this

  pool:
    idle_ttl_sec: 300  # Base TTL for cold pool (5 min)

  rate_limiter:
    enabled: true
    base_delay: [1.0, 3.0]  # Random delay between requests

Monitoring

Access Dashboard:

http://localhost:11235/static/monitor/

Check Logs:

# All activity
docker logs crawl4ai -f

# Pool activity only
docker logs crawl4ai | grep -E "(🔥|♨️|❄️|🆕|⬆️)"

# Errors only
docker logs crawl4ai | grep ERROR

Metrics:

# Container stats
docker stats crawl4ai

# Memory percentage
curl http://localhost:11235/monitor/health | jq '.container.memory_percent'

# Pool status
curl http://localhost:11235/monitor/browsers | jq '.summary'

Troubleshooting & Debugging

Common Issues

1. WebSocket Not Connecting

Symptoms: Yellow "Connecting..." indicator, falls back to blue "Polling"

Debug:

# Check server logs
docker logs crawl4ai | grep WebSocket

# Test WebSocket manually
python test-websocket.py

Fix: Check firewall/proxy settings and ensure port 11235 is accessible

2. High Memory Usage

Symptoms: Container OOM kills, 503 errors, slow responses

Debug:

# Check current memory
curl http://localhost:11235/monitor/health | jq '.container.memory_percent'

# Check browser pool
curl http://localhost:11235/monitor/browsers

# Check janitor activity
docker logs crawl4ai | grep "🧹"

Fix:

Lower memory_threshold_percent in config.yml
Increase container memory limit
Enable text_mode: true in browser config
Reduce idle_ttl_sec for more aggressive cleanup

3. Browser Pool Not Reusing

Symptoms: High "New Created" count, poor reuse rate

Debug:

# Check config signature matching
from crawl4ai import BrowserConfig
import json, hashlib

cfg = BrowserConfig(...)  # Your config
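# Use the same separators as _sig(); otherwise the hash will not match the pool's signature
sig = hashlib.sha1(json.dumps(cfg.to_dict(), sort_keys=True, separators=(",", ":")).encode()).hexdigest()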
print(f"Config signature: {sig[:8]}")

Check logs for permanent browser signature:

docker logs crawl4ai | grep "permanent"

Fix: Ensure endpoint configs match permanent browser config exactly

4. Janitor Not Cleaning Up

Symptoms: Memory stays high after idle period

Debug:

# Check janitor events
curl http://localhost:11235/monitor/logs/janitor

# Check pool stats over time
watch -n 5 'curl -s http://localhost:11235/monitor/browsers | jq ".summary"'

Fix: Usually none is needed; first confirm the behavior is expected:

Janitor runs every 10-60s depending on memory pressure
Hot pool browsers have a longer TTL (by design)
The permanent browser is never cleaned (by design)

Debug Tools

Config Signature Checker:

from crawl4ai import BrowserConfig
import json, hashlib

def check_sig(cfg: BrowserConfig) -> str:
    payload = json.dumps(cfg.to_dict(), sort_keys=True, separators=(",",":"))
    sig = hashlib.sha1(payload.encode()).hexdigest()
    return sig[:8]

# Example
cfg1 = BrowserConfig()
cfg2 = BrowserConfig(headless=False)  # Any non-default value yields a different signature
print(f"Default: {check_sig(cfg1)}")
print(f"Custom:  {check_sig(cfg2)}")

Monitor Stats Dumper:

#!/bin/bash
# Dump all monitor stats to JSON

curl -s http://localhost:11235/monitor/health > health.json
curl -s "http://localhost:11235/monitor/requests?limit=100" > requests.json
curl -s http://localhost:11235/monitor/browsers > browsers.json
curl -s http://localhost:11235/monitor/logs/janitor > janitor.json
curl -s http://localhost:11235/monitor/logs/errors > errors.json

echo "Monitor stats dumped to *.json files"

WebSocket Test Script:

# test-websocket.py (included in repo)
import asyncio
import websockets
import json

async def test_websocket():
    uri = "ws://localhost:11235/monitor/ws"
    async with websockets.connect(uri) as websocket:
        for i in range(5):
            message = await websocket.recv()
            data = json.loads(message)
            print(f"\nUpdate #{i+1}:")
            print(f"  Health: CPU {data['health']['container']['cpu_percent']}%")
            print(f"  Active Requests: {len(data['requests']['active'])}")
            print(f"  Browsers: {len(data['browsers'])}")

asyncio.run(test_websocket())

Performance Tuning

For High Throughput:

# config.yml
crawler:
  memory_threshold_percent: 90  # Allow more browsers
  pool:
    idle_ttl_sec: 600  # Keep browsers longer
  rate_limiter:
    enabled: false  # Disable for max speed

For Low Memory:

# config.yml
crawler:
  browser:
    kwargs:
      text_mode: true  # 30-40% memory reduction
  memory_threshold_percent: 80  # More conservative
  pool:
    idle_ttl_sec: 60  # Aggressive cleanup

For Stability:

# config.yml
crawler:
  memory_threshold_percent: 85  # Balanced
  pool:
    idle_ttl_sec: 300  # Moderate cleanup
  rate_limiter:
    enabled: true
    base_delay: [2.0, 5.0]  # Prevent rate limiting

Test Suite

Location: deploy/docker/tests/

Tests:

test_1_basic.py - Health check, container lifecycle
test_2_memory.py - Memory tracking, leak detection
test_3_pool.py - Pool reuse validation
test_4_concurrent.py - Concurrent load testing
test_5_pool_stress.py - Multi-config pool behavior
test_6_multi_endpoint.py - All endpoint validation
test_7_cleanup.py - Janitor cleanup verification

Run All Tests:

cd deploy/docker/tests
pip install -r requirements.txt

# Build image first
cd /path/to/repo
docker build -t crawl4ai-local:latest .

# Run tests
cd deploy/docker/tests
for test in test_*.py; do
    echo "Running $test..."
    python "$test" || break
done

Architecture Decision Log

Why 3-Tier Pool?

Decision: PERMANENT + HOT_POOL + COLD_POOL

Rationale:

90% of requests use default config → permanent browser serves most traffic
Frequent variants (hot) deserve longer TTL for better reuse
Rare configs (cold) should be cleaned aggressively to save memory

Alternatives Considered:

Single pool: Too simple, no optimization for common case
LRU cache: Doesn't capture "hot" vs "rare" distinction
Per-endpoint pools: Too complex, over-engineering

Why WebSocket + Polling Fallback?

Decision: WebSocket primary, HTTP polling backup

Rationale:

WebSocket provides real-time updates (2s interval)
Polling fallback ensures reliability in restricted networks
Auto-reconnect handles temporary disconnections

Alternatives Considered:

Polling only: Works but higher latency, more server load
WebSocket only: Fails in restricted networks
Server-Sent Events: One-way, no client messages

Why Background Persistence Worker?

Decision: Queue-based worker for Redis operations

Rationale:

Fire-and-forget loses data on failures
Queue provides buffering and retry capability
Non-blocking keeps request path fast

Alternatives Considered:

Synchronous writes: Blocks request handling
Fire-and-forget: Silent failures
Batch writes: Complex state management

Contributing

When modifying the architecture:

Maintain backward compatibility in API contracts
Add tests for new functionality
Update this document with architectural changes
Profile memory impact before production
Test under load using the test suite

Code Review Checklist:

Race conditions protected with locks
Error handling with proper logging
Graceful degradation on failures
Memory impact measured
Tests added/updated
Documentation updated

License & Credits

Crawl4AI - Created by Unclecode
GitHub: https://github.com/unclecode/crawl4ai
License: See LICENSE file in repository

Architecture & Optimizations: October 2025
WebSocket Monitoring: October 2025
Production Hardening: October 2025

End of Technical Architecture Document

For questions or issues, please open a GitHub issue at: https://github.com/unclecode/crawl4ai/issues
