Created ARCHITECTURE.md as a complete technical reference for the Crawl4AI Docker server, replacing the stress test pipeline document with production-grade documentation. Contents: - System overview with architecture diagrams - Core components deep-dive (server, API, utils) - Smart browser pool implementation details - Real-time monitoring system architecture - WebSocket implementation and fallback strategy - Memory management and container detection - Production optimizations and code review fixes - Deployment guides (local, Docker, production) - Comprehensive troubleshooting section - Debug tools and performance tuning - Test suite documentation - Architecture decision log (ADRs) Target audience: Developers maintaining or extending the system Goal: Enable rapid onboarding and confident modifications
35 KiB
Crawl4AI Docker Server - Technical Architecture
Version: 0.7.4 Last Updated: October 2025 Status: Production-ready with real-time monitoring
This document provides a comprehensive technical overview of the Crawl4AI Docker server architecture, including the smart browser pool, real-time monitoring system, and all production optimizations.
Table of Contents
- System Overview
- Core Components
- Smart Browser Pool
- Real-time Monitoring System
- API Layer
- Memory Management
- Production Optimizations
- Deployment & Operations
- Troubleshooting & Debugging
System Overview
Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│ Client Requests │
└────────────┬────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ FastAPI Server (server.py) │
│ ├─ REST API Endpoints (/crawl, /html, /md, /llm, etc.) │
│ ├─ WebSocket Endpoint (/monitor/ws) │
│ └─ Background Tasks (janitor, timeline_updater) │
└────┬────────────────────┬────────────────────┬──────────────┘
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Browser │ │ Monitor System │ │ Redis │
│ Pool │ │ (monitor.py) │ │ (Persistence) │
│ │ │ │ │ │
│ PERMANENT ●─┤ │ ├─ Stats │ │ ├─ Endpoint │
│ HOT_POOL ♨─┤ │ ├─ Requests │ │ │ Stats │
│ COLD_POOL ❄─┤ │ ├─ Browsers │ │ ├─ Task │
│ │ │ ├─ Timeline │ │ │ Results │
│ Janitor 🧹─┤ │ └─ Events/Errors │ │ └─ Cache │
└─────────────┘ └──────────────────┘ └─────────────────┘
Key Features
- 10x Memory Efficiency: Smart 3-tier browser pooling reduces memory from 500-700MB to 50-70MB per concurrent user
- Real-time Monitoring: WebSocket-based live dashboard with 2-second update intervals
- Production-Ready: Comprehensive error handling, timeouts, cleanup, and graceful shutdown
- Container-Aware: Accurate memory detection using cgroup v2/v1
- Auto-Recovery: Graceful WebSocket fallback, lock protection, background workers
Core Components
1. Server Core (server.py)
Responsibilities:
- FastAPI application lifecycle management
- Route registration and middleware
- Background task orchestration
- Graceful shutdown handling
Key Functions:
@asynccontextmanager
async def lifespan(app: FastAPI):
"""Application lifecycle manager"""
# Startup
- Initialize Redis connection
- Create monitor stats instance
- Start persistence worker
- Initialize permanent browser
- Start janitor (browser cleanup)
- Start timeline updater (5s interval)
yield
# Shutdown
- Cancel background tasks
- Persist final monitor stats
- Stop persistence worker
- Close all browsers
Configuration:
- Loaded from
config.yml - Browser settings, memory thresholds, rate limiting
- LLM provider credentials
- Server host/port
2. API Layer (api.py)
Endpoints:
| Endpoint | Method | Purpose | Pool Usage |
|---|---|---|---|
/health |
GET | Health check | None |
/crawl |
POST | Full crawl with all features | ✓ Pool |
/crawl_stream |
POST | Streaming crawl results | ✓ Pool |
/html |
POST | HTML extraction | ✓ Pool |
/md |
POST | Markdown generation | ✓ Pool |
/screenshot |
POST | Page screenshots | ✓ Pool |
/pdf |
POST | PDF generation | ✓ Pool |
/llm/{path} |
GET/POST | LLM extraction | ✓ Pool |
/crawl/job |
POST | Background job creation | ✓ Pool |
Request Flow:
@app.post("/crawl")
async def crawl(body: CrawlRequest):
# 1. Track request start
request_id = f"req_{uuid4().hex[:8]}"
await get_monitor().track_request_start(request_id, "/crawl", url, config)
# 2. Get browser from pool
from crawler_pool import get_crawler
crawler = await get_crawler(browser_config)
# 3. Execute crawl
result = await crawler.arun(url, config=crawler_config)
# 4. Track request completion
await get_monitor().track_request_end(request_id, success=True)
# 5. Return result (browser stays in pool)
return result
3. Utility Layer (utils.py)
Container Memory Detection:
def get_container_memory_percent() -> float:
"""Accurate container memory detection"""
try:
# Try cgroup v2 first
current = int(Path("/sys/fs/cgroup/memory.current").read_text().strip())
max_mem = int(Path("/sys/fs/cgroup/memory.max").read_text().strip())
return (current / max_mem) * 100
except:
# Fallback to cgroup v1
usage = int(Path("/sys/fs/cgroup/memory/memory.usage_in_bytes").read_text())
limit = int(Path("/sys/fs/cgroup/memory/memory.limit_in_bytes").read_text())
return (usage / limit) * 100
except:
# Final fallback to psutil (may be inaccurate in containers)
return psutil.virtual_memory().percent
Helper Functions:
get_base_url(): Request base URL extractionis_task_id(): Task ID validationshould_cleanup_task(): TTL-based cleanup logicvalidate_llm_provider(): LLM configuration validation
Smart Browser Pool
Architecture
The browser pool implements a 3-tier strategy optimized for real-world usage patterns:
┌──────────────────────────────────────────────────────────┐
│ PERMANENT Browser (Default Config) │
│ ● Always alive, never cleaned │
│ ● Serves 90% of requests │
│ ● ~270MB memory │
└──────────────────────────────────────────────────────────┘
▲
│ 90% of requests
│
┌──────────────────────────────────────────────────────────┐
│ HOT_POOL (Frequently Used Configs) │
│ ♨ Configs used 3+ times │
│ ♨ Longer TTL (2-5 min depending on memory) │
│ ♨ ~180MB per browser │
└──────────────────────────────────────────────────────────┘
▲
│ Promotion at 3 uses
│
┌──────────────────────────────────────────────────────────┐
│ COLD_POOL (Rarely Used Configs) │
│ ❄ New/rare browser configs │
│ ❄ Short TTL (30s-5min depending on memory) │
│ ❄ ~180MB per browser │
└──────────────────────────────────────────────────────────┘
Implementation (crawler_pool.py)
Core Data Structures:
PERMANENT: Optional[AsyncWebCrawler] = None # Default browser
HOT_POOL: Dict[str, AsyncWebCrawler] = {} # Frequent configs
COLD_POOL: Dict[str, AsyncWebCrawler] = {} # Rare configs
LAST_USED: Dict[str, float] = {} # Timestamp tracking
USAGE_COUNT: Dict[str, int] = {} # Usage counter
LOCK = asyncio.Lock() # Thread-safe access
Browser Acquisition Flow:
async def get_crawler(cfg: BrowserConfig) -> AsyncWebCrawler:
sig = _sig(cfg) # SHA1 hash of config
async with LOCK: # Prevent race conditions
# 1. Check permanent browser
if _is_default_config(sig):
return PERMANENT
# 2. Check hot pool
if sig in HOT_POOL:
USAGE_COUNT[sig] += 1
return HOT_POOL[sig]
# 3. Check cold pool (with promotion logic)
if sig in COLD_POOL:
USAGE_COUNT[sig] += 1
if USAGE_COUNT[sig] >= 3:
# Promote to hot pool
HOT_POOL[sig] = COLD_POOL.pop(sig)
await get_monitor().track_janitor_event("promote", sig, {...})
return HOT_POOL[sig]
return COLD_POOL[sig]
# 4. Memory check before creating new
if get_container_memory_percent() >= MEM_LIMIT:
raise MemoryError(f"Memory at {mem}%, refusing new browser")
# 5. Create new browser in cold pool
crawler = AsyncWebCrawler(config=cfg)
await crawler.start()
COLD_POOL[sig] = crawler
return crawler
Janitor (Adaptive Cleanup):
async def janitor():
"""Memory-adaptive browser cleanup"""
while True:
mem_pct = get_container_memory_percent()
# Adaptive intervals based on memory pressure
if mem_pct > 80:
interval, cold_ttl, hot_ttl = 10, 30, 120 # Aggressive
elif mem_pct > 60:
interval, cold_ttl, hot_ttl = 30, 60, 300 # Moderate
else:
interval, cold_ttl, hot_ttl = 60, 300, 600 # Relaxed
await asyncio.sleep(interval)
async with LOCK:
# Clean cold pool first (less valuable)
for sig in list(COLD_POOL.keys()):
if now - LAST_USED[sig] > cold_ttl:
await COLD_POOL[sig].close()
del COLD_POOL[sig], LAST_USED[sig], USAGE_COUNT[sig]
await track_janitor_event("close_cold", sig, {...})
# Clean hot pool (more conservative)
for sig in list(HOT_POOL.keys()):
if now - LAST_USED[sig] > hot_ttl:
await HOT_POOL[sig].close()
del HOT_POOL[sig], LAST_USED[sig], USAGE_COUNT[sig]
await track_janitor_event("close_hot", sig, {...})
Config Signature Generation:
def _sig(cfg: BrowserConfig) -> str:
"""Generate unique signature for browser config"""
payload = json.dumps(cfg.to_dict(), sort_keys=True, separators=(",",":"))
return hashlib.sha1(payload.encode()).hexdigest()
Real-time Monitoring System
Architecture
The monitoring system provides real-time insights via WebSocket with automatic fallback to HTTP polling.
Components:
┌─────────────────────────────────────────────────────────┐
│ MonitorStats Class (monitor.py) │
│ ├─ In-memory queues (deques with maxlen) │
│ ├─ Background persistence worker │
│ ├─ Timeline tracking (5-min window, 5s resolution) │
│ └─ Time-based expiry (5min for old entries) │
└───────────┬─────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ WebSocket Endpoint (/monitor/ws) │
│ ├─ 2-second update intervals │
│ ├─ Auto-reconnect with exponential backoff │
│ ├─ Comprehensive data payload │
│ └─ Graceful fallback to polling │
└───────────┬─────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Dashboard UI (static/monitor/index.html) │
│ ├─ Connection status indicator │
│ ├─ Live updates (health, requests, browsers) │
│ ├─ Timeline charts (memory, requests, browsers) │
│ └─ Janitor events & error logs │
└─────────────────────────────────────────────────────────┘
Monitor Stats (monitor.py)
Data Structures:
class MonitorStats:
# In-memory queues
active_requests: Dict[str, Dict] # Currently processing
completed_requests: deque(maxlen=100) # Last 100 completed
janitor_events: deque(maxlen=100) # Cleanup events
errors: deque(maxlen=100) # Error log
# Endpoint stats (persisted to Redis)
endpoint_stats: Dict[str, Dict] # Aggregated stats
# Timeline data (5min window, 5s resolution = 60 points)
memory_timeline: deque(maxlen=60)
requests_timeline: deque(maxlen=60)
browser_timeline: deque(maxlen=60)
# Background persistence
_persist_queue: asyncio.Queue(maxsize=10)
_persist_worker_task: Optional[asyncio.Task]
Request Tracking:
async def track_request_start(request_id, endpoint, url, config):
"""Track new request"""
self.active_requests[request_id] = {
"id": request_id,
"endpoint": endpoint,
"url": url,
"start_time": time.time(),
"mem_start": psutil.Process().memory_info().rss / (1024 * 1024)
}
# Update endpoint stats
if endpoint not in self.endpoint_stats:
self.endpoint_stats[endpoint] = {
"count": 0, "total_time": 0, "errors": 0,
"pool_hits": 0, "success": 0
}
self.endpoint_stats[endpoint]["count"] += 1
# Queue background persistence
self._persist_queue.put_nowait(True)
async def track_request_end(request_id, success, error=None, ...):
"""Track request completion"""
req_info = self.active_requests.pop(request_id)
elapsed = time.time() - req_info["start_time"]
mem_delta = current_mem - req_info["mem_start"]
# Add to completed queue
self.completed_requests.append({
"id": request_id,
"endpoint": req_info["endpoint"],
"url": req_info["url"],
"success": success,
"elapsed": elapsed,
"mem_delta": mem_delta,
"end_time": time.time()
})
# Update stats
self.endpoint_stats[endpoint]["success" if success else "errors"] += 1
await self._persist_endpoint_stats()
Background Persistence Worker:
async def _persistence_worker(self):
"""Background worker for Redis persistence"""
while True:
try:
await self._persist_queue.get()
await self._persist_endpoint_stats()
self._persist_queue.task_done()
except asyncio.CancelledError:
break
except Exception as e:
logger.error(f"Persistence worker error: {e}")
async def _persist_endpoint_stats(self):
"""Persist stats to Redis with error handling"""
try:
await self.redis.set(
"monitor:endpoint_stats",
json.dumps(self.endpoint_stats),
ex=86400 # 24h TTL
)
except Exception as e:
logger.warning(f"Failed to persist endpoint stats: {e}")
Time-based Cleanup:
def _cleanup_old_entries(self, max_age_seconds=300):
"""Remove entries older than 5 minutes"""
now = time.time()
cutoff = now - max_age_seconds
# Clean completed requests
while self.completed_requests and \
self.completed_requests[0].get("end_time", 0) < cutoff:
self.completed_requests.popleft()
# Clean janitor events
while self.janitor_events and \
self.janitor_events[0].get("timestamp", 0) < cutoff:
self.janitor_events.popleft()
# Clean errors
while self.errors and \
self.errors[0].get("timestamp", 0) < cutoff:
self.errors.popleft()
WebSocket Implementation (monitor_routes.py)
Endpoint:
@router.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
"""Real-time monitoring updates"""
await websocket.accept()
logger.info("WebSocket client connected")
try:
while True:
try:
monitor = get_monitor()
# Gather comprehensive monitoring data
data = {
"timestamp": time.time(),
"health": await monitor.get_health_summary(),
"requests": {
"active": monitor.get_active_requests(),
"completed": monitor.get_completed_requests(limit=10)
},
"browsers": await monitor.get_browser_list(),
"timeline": {
"memory": monitor.get_timeline_data("memory", "5m"),
"requests": monitor.get_timeline_data("requests", "5m"),
"browsers": monitor.get_timeline_data("browsers", "5m")
},
"janitor": monitor.get_janitor_log(limit=10),
"errors": monitor.get_errors_log(limit=10)
}
await websocket.send_json(data)
await asyncio.sleep(2) # 2-second update interval
except WebSocketDisconnect:
logger.info("WebSocket client disconnected")
break
except Exception as e:
logger.error(f"WebSocket error: {e}", exc_info=True)
await asyncio.sleep(2)
except Exception as e:
logger.error(f"WebSocket connection error: {e}", exc_info=True)
finally:
logger.info("WebSocket connection closed")
Input Validation:
@router.get("/requests")
async def get_requests(status: str = "all", limit: int = 50):
# Input validation
if status not in ["all", "active", "completed", "success", "error"]:
raise HTTPException(400, f"Invalid status: {status}")
if limit < 1 or limit > 1000:
raise HTTPException(400, f"Invalid limit: {limit}")
monitor = get_monitor()
# ... return data
Frontend Dashboard
Connection Management:
// WebSocket with auto-reconnect
function connectWebSocket() {
if (wsReconnectAttempts >= MAX_WS_RECONNECT) {
// Fallback to polling after 5 failed attempts
useWebSocket = false;
updateConnectionStatus('polling');
startAutoRefresh();
return;
}
updateConnectionStatus('connecting');
const wsUrl = `${protocol}//${window.location.host}/monitor/ws`;
websocket = new WebSocket(wsUrl);
websocket.onopen = () => {
wsReconnectAttempts = 0;
updateConnectionStatus('connected');
stopAutoRefresh(); // Stop polling
};
websocket.onmessage = (event) => {
const data = JSON.parse(event.data);
updateDashboard(data); // Update all sections
};
websocket.onclose = () => {
updateConnectionStatus('disconnected', 'Reconnecting...');
if (useWebSocket) {
setTimeout(connectWebSocket, 2000 * wsReconnectAttempts);
} else {
startAutoRefresh(); // Fallback to polling
}
};
}
Connection Status Indicator:
| Status | Color | Animation | Meaning |
|---|---|---|---|
| Live | Green | Pulsing fast | WebSocket connected |
| Connecting... | Yellow | Pulsing slow | Attempting connection |
| Polling | Blue | Pulsing slow | HTTP polling fallback |
| Disconnected | Red | None | Connection failed |
API Layer
Request/Response Flow
Client Request
│
▼
FastAPI Route Handler
│
├─→ Monitor: track_request_start()
│
├─→ Browser Pool: get_crawler(config)
│ │
│ ├─→ Check PERMANENT
│ ├─→ Check HOT_POOL
│ ├─→ Check COLD_POOL
│ └─→ Create New (if needed)
│
├─→ Execute Crawl
│ │
│ ├─→ Fetch page
│ ├─→ Extract content
│ ├─→ Apply filters/strategies
│ └─→ Return result
│
├─→ Monitor: track_request_end()
│
└─→ Return Response (browser stays in pool)
Error Handling Strategy
Levels:
- Route Level: HTTP exceptions with proper status codes
- Monitor Level: Try-except with logging, non-critical failures
- Pool Level: Memory checks, lock protection, graceful degradation
- WebSocket Level: Auto-reconnect, fallback to polling
Example:
@app.post("/crawl")
async def crawl(body: CrawlRequest):
request_id = f"req_{uuid4().hex[:8]}"
try:
# Monitor tracking (non-blocking on failure)
try:
await get_monitor().track_request_start(...)
except:
pass # Monitor not critical
# Browser acquisition (with memory protection)
crawler = await get_crawler(browser_config)
# Crawl execution
result = await crawler.arun(url, config=cfg)
# Success tracking
try:
await get_monitor().track_request_end(request_id, success=True)
except:
pass
return result
except MemoryError as e:
# Memory pressure - return 503
await get_monitor().track_request_end(request_id, success=False, error=str(e))
raise HTTPException(503, "Server at capacity")
except Exception as e:
# General errors - return 500
await get_monitor().track_request_end(request_id, success=False, error=str(e))
raise HTTPException(500, str(e))
Memory Management
Container Memory Detection
Priority Order:
- cgroup v2 (
/sys/fs/cgroup/memory.{current,max}) - cgroup v1 (
/sys/fs/cgroup/memory/memory.{usage,limit}_in_bytes) - psutil fallback (may be inaccurate in containers)
Usage:
mem_pct = get_container_memory_percent()
if mem_pct >= 95: # Critical
raise MemoryError("Refusing new browser")
elif mem_pct > 80: # High pressure
# Janitor: aggressive cleanup (10s interval, 30s TTL)
elif mem_pct > 60: # Moderate pressure
# Janitor: moderate cleanup (30s interval, 60s TTL)
else: # Normal
# Janitor: relaxed cleanup (60s interval, 300s TTL)
Memory Budgets
| Component | Memory | Notes |
|---|---|---|
| Base Container | 270 MB | Python + FastAPI + libraries |
| Permanent Browser | 270 MB | Always-on default browser |
| Hot Pool Browser | 180 MB | Per frequently-used config |
| Cold Pool Browser | 180 MB | Per rarely-used config |
| Active Crawl Overhead | 50-200 MB | Temporary, released after request |
Example Calculation:
Container: 270 MB
Permanent: 270 MB
2x Hot: 360 MB
1x Cold: 180 MB
Total: 1080 MB baseline
Under load (10 concurrent):
+ Active crawls: ~500-1000 MB
= Peak: 1.5-2 GB
Production Optimizations
Code Review Fixes Applied
Critical (3):
- ✅ Lock protection for browser pool access
- ✅ Async track_janitor_event implementation
- ✅ Error handling in request tracking
Important (8): 4. ✅ Background persistence worker (replaces fire-and-forget) 5. ✅ Time-based expiry (5min cleanup for old entries) 6. ✅ Input validation (status, limit, metric, window) 7. ✅ Timeline updater timeout (4s max) 8. ✅ Warn when killing browsers with active requests 9. ✅ Monitor cleanup on shutdown 10. ✅ Document memory estimates 11. ✅ Structured error responses (HTTPException)
Performance Characteristics
Latency:
| Scenario | Time | Notes |
|---|---|---|
| Pool Hit (Permanent) | <100ms | Browser ready |
| Pool Hit (Hot/Cold) | <100ms | Browser ready |
| New Browser Creation | 3-5s | Chromium startup |
| Simple Page Fetch | 1-3s | Network + render |
| Complex Extraction | 5-10s | LLM processing |
Throughput:
| Load | Concurrent | Response Time | Success Rate |
|---|---|---|---|
| Light | 1-10 | <3s | 100% |
| Medium | 10-50 | 3-8s | 100% |
| Heavy | 50-100 | 8-15s | 95-100% |
| Extreme | 100+ | 15-30s | 80-95% |
Reliability Features
Race Condition Protection:
asyncio.Lockon all pool operations- Lock on browser pool stats access
- Async janitor event tracking
Graceful Degradation:
- WebSocket → HTTP polling fallback
- Redis persistence failures (logged, non-blocking)
- Monitor tracking failures (logged, non-blocking)
Resource Cleanup:
- Janitor cleanup (adaptive intervals)
- Time-based expiry (5min for old data)
- Shutdown cleanup (persist final stats, close browsers)
- Background worker cancellation
Deployment & Operations
Running Locally
# Install dependencies
pip install -r requirements.txt
# Configure
cp .llm.env.example .llm.env
# Edit .llm.env with your API keys
# Run server
python -m uvicorn server:app --host 0.0.0.0 --port 11235 --reload
Docker Deployment
# Build image
docker build -t crawl4ai:latest -f Dockerfile .
# Run container
docker run -d \
--name crawl4ai \
-p 11235:11235 \
--shm-size=1g \
--env-file .llm.env \
crawl4ai:latest
Production Configuration
config.yml Key Settings:
crawler:
browser:
extra_args:
- "--disable-gpu"
- "--disable-dev-shm-usage"
- "--no-sandbox"
kwargs:
headless: true
text_mode: true # Reduces memory by 30-40%
memory_threshold_percent: 95 # Refuse new browsers above this
pool:
idle_ttl_sec: 300 # Base TTL for cold pool (5 min)
rate_limiter:
enabled: true
base_delay: [1.0, 3.0] # Random delay between requests
Monitoring
Access Dashboard:
http://localhost:11235/static/monitor/
Check Logs:
# All activity
docker logs crawl4ai -f
# Pool activity only
docker logs crawl4ai | grep -E "(🔥|♨️|❄️|🆕|⬆️)"
# Errors only
docker logs crawl4ai | grep ERROR
Metrics:
# Container stats
docker stats crawl4ai
# Memory percentage
curl http://localhost:11235/monitor/health | jq '.container.memory_percent'
# Pool status
curl http://localhost:11235/monitor/browsers | jq '.summary'
Troubleshooting & Debugging
Common Issues
1. WebSocket Not Connecting
Symptoms: Yellow "Connecting..." indicator, falls back to blue "Polling"
Debug:
# Check server logs
docker logs crawl4ai | grep WebSocket
# Test WebSocket manually
python test-websocket.py
Fix: Check firewall/proxy settings, ensure port 11235 accessible
2. High Memory Usage
Symptoms: Container OOM kills, 503 errors, slow responses
Debug:
# Check current memory
curl http://localhost:11235/monitor/health | jq '.container.memory_percent'
# Check browser pool
curl http://localhost:11235/monitor/browsers
# Check janitor activity
docker logs crawl4ai | grep "🧹"
Fix:
- Lower
memory_threshold_percentin config.yml - Increase container memory limit
- Enable
text_mode: truein browser config - Reduce idle_ttl_sec for more aggressive cleanup
3. Browser Pool Not Reusing
Symptoms: High "New Created" count, poor reuse rate
Debug:
# Check config signature matching
from crawl4ai import BrowserConfig
import json, hashlib
cfg = BrowserConfig(...) # Your config
sig = hashlib.sha1(json.dumps(cfg.to_dict(), sort_keys=True).encode()).hexdigest()
print(f"Config signature: {sig[:8]}")
Check logs for permanent browser signature:
docker logs crawl4ai | grep "permanent"
Fix: Ensure endpoint configs match permanent browser config exactly
4. Janitor Not Cleaning Up
Symptoms: Memory stays high after idle period
Debug:
# Check janitor events
curl http://localhost:11235/monitor/logs/janitor
# Check pool stats over time
watch -n 5 'curl -s http://localhost:11235/monitor/browsers | jq ".summary"'
Fix:
- Janitor runs every 10-60s depending on memory
- Hot pool browsers have longer TTL (by design)
- Permanent browser never cleaned (by design)
Debug Tools
Config Signature Checker:
from crawl4ai import BrowserConfig
import json, hashlib
def check_sig(cfg: BrowserConfig) -> str:
payload = json.dumps(cfg.to_dict(), sort_keys=True, separators=(",",":"))
sig = hashlib.sha1(payload.encode()).hexdigest()
return sig[:8]
# Example
cfg1 = BrowserConfig()
cfg2 = BrowserConfig(headless=True)
print(f"Default: {check_sig(cfg1)}")
print(f"Custom: {check_sig(cfg2)}")
Monitor Stats Dumper:
#!/bin/bash
# Dump all monitor stats to JSON
curl -s http://localhost:11235/monitor/health > health.json
curl -s http://localhost:11235/monitor/requests?limit=100 > requests.json
curl -s http://localhost:11235/monitor/browsers > browsers.json
curl -s http://localhost:11235/monitor/logs/janitor > janitor.json
curl -s http://localhost:11235/monitor/logs/errors > errors.json
echo "Monitor stats dumped to *.json files"
WebSocket Test Script:
# test-websocket.py (included in repo)
import asyncio
import websockets
import json
async def test_websocket():
uri = "ws://localhost:11235/monitor/ws"
async with websockets.connect(uri) as websocket:
for i in range(5):
message = await websocket.recv()
data = json.loads(message)
print(f"\nUpdate #{i+1}:")
print(f" Health: CPU {data['health']['container']['cpu_percent']}%")
print(f" Active Requests: {len(data['requests']['active'])}")
print(f" Browsers: {len(data['browsers'])}")
asyncio.run(test_websocket())
Performance Tuning
For High Throughput:
# config.yml
crawler:
memory_threshold_percent: 90 # Allow more browsers
pool:
idle_ttl_sec: 600 # Keep browsers longer
rate_limiter:
enabled: false # Disable for max speed
For Low Memory:
# config.yml
crawler:
browser:
kwargs:
text_mode: true # 30-40% memory reduction
memory_threshold_percent: 80 # More conservative
pool:
idle_ttl_sec: 60 # Aggressive cleanup
For Stability:
# config.yml
crawler:
memory_threshold_percent: 85 # Balanced
pool:
idle_ttl_sec: 300 # Moderate cleanup
rate_limiter:
enabled: true
base_delay: [2.0, 5.0] # Prevent rate limiting
Test Suite
Location: deploy/docker/tests/
Tests:
test_1_basic.py- Health check, container lifecycletest_2_memory.py- Memory tracking, leak detectiontest_3_pool.py- Pool reuse validationtest_4_concurrent.py- Concurrent load testingtest_5_pool_stress.py- Multi-config pool behaviortest_6_multi_endpoint.py- All endpoint validationtest_7_cleanup.py- Janitor cleanup verification
Run All Tests:
cd deploy/docker/tests
pip install -r requirements.txt
# Build image first
cd /path/to/repo
docker build -t crawl4ai-local:latest .
# Run tests
cd deploy/docker/tests
for test in test_*.py; do
echo "Running $test..."
python $test || break
done
Architecture Decision Log
Why 3-Tier Pool?
Decision: PERMANENT + HOT_POOL + COLD_POOL
Rationale:
- 90% of requests use default config → permanent browser serves most traffic
- Frequent variants (hot) deserve longer TTL for better reuse
- Rare configs (cold) should be cleaned aggressively to save memory
Alternatives Considered:
- Single pool: Too simple, no optimization for common case
- LRU cache: Doesn't capture "hot" vs "rare" distinction
- Per-endpoint pools: Too complex, over-engineering
Why WebSocket + Polling Fallback?
Decision: WebSocket primary, HTTP polling backup
Rationale:
- WebSocket provides real-time updates (2s interval)
- Polling fallback ensures reliability in restricted networks
- Auto-reconnect handles temporary disconnections
Alternatives Considered:
- Polling only: Works but higher latency, more server load
- WebSocket only: Fails in restricted networks
- Server-Sent Events: One-way, no client messages
Why Background Persistence Worker?
Decision: Queue-based worker for Redis operations
Rationale:
- Fire-and-forget loses data on failures
- Queue provides buffering and retry capability
- Non-blocking keeps request path fast
Alternatives Considered:
- Synchronous writes: Blocks request handling
- Fire-and-forget: Silent failures
- Batch writes: Complex state management
Contributing
When modifying the architecture:
- Maintain backward compatibility in API contracts
- Add tests for new functionality
- Update this document with architectural changes
- Profile memory impact before production
- Test under load using the test suite
Code Review Checklist:
- Race conditions protected with locks
- Error handling with proper logging
- Graceful degradation on failures
- Memory impact measured
- Tests added/updated
- Documentation updated
License & Credits
Crawl4AI - Created by Unclecode GitHub: https://github.com/unclecode/crawl4ai License: See LICENSE file in repository
Architecture & Optimizations: October 2025 WebSocket Monitoring: October 2025 Production Hardening: October 2025
End of Technical Architecture Document
For questions or issues, please open a GitHub issue at: https://github.com/unclecode/crawl4ai/issues