# Crawl4AI Docker Architecture - AI Context Map

**Purpose:** Dense technical reference for AI agents to understand the complete system architecture.

**Format:** Symbolic, compressed, high-information-density documentation.

---

## System Overview

```
┌─────────────────────────────────────────────────────────────┐
│            CRAWL4AI DOCKER ORCHESTRATION SYSTEM             │
├─────────────────────────────────────────────────────────────┤
│  Modes: Single (N=1) | Swarm (N>1) | Compose+Nginx (N>1)    │
│  Entry: cnode CLI → deploy/docker/cnode_cli.py              │
│  Core: deploy/docker/server_manager.py                      │
│  Server: deploy/docker/server.py (FastAPI)                  │
│  API: deploy/docker/api.py (crawl endpoints)                │
│  Monitor: deploy/docker/monitor.py + monitor_routes.py      │
└─────────────────────────────────────────────────────────────┘
```

---

## Directory Structure & File Map

```
deploy/
├── docker/                  # Server runtime & orchestration
│   ├── server.py            # FastAPI app entry [CRITICAL]
│   ├── api.py               # /crawl, /screenshot, /pdf endpoints
│   ├── server_manager.py    # Docker orchestration logic [CORE]
│   ├── cnode_cli.py         # CLI interface (Click-based)
│   ├── monitor.py           # Real-time metrics collector
│   ├── monitor_routes.py    # /monitor dashboard routes
│   ├── crawler_pool.py      # Browser pool management
│   ├── hook_manager.py      # Pre/post crawl hooks
│   ├── job.py               # Job queue schema
│   ├── utils.py             # Helpers (port check, health)
│   ├── auth.py              # API key authentication
│   ├── schemas.py           # Pydantic models
│   ├── mcp_bridge.py        # MCP protocol bridge
│   ├── supervisord.conf     # Process manager config
│   ├── config.yml           # Server config template
│   ├── requirements.txt     # Python deps
│   ├── static/              # Web assets
│   │   ├── monitor/         # Dashboard UI
│   │   └── playground/      # API playground
│   └── tests/               # Test suite
│
└── installer/               # User-facing installation
    ├── cnode_pkg/           # Standalone package
    │   ├── cli.py           # Copy of cnode_cli.py
    │   ├── server_manager.py  # Copy of server_manager.py
    │   └── requirements.txt   # click, rich, anyio, pyyaml
    ├── install-cnode.sh     # Remote installer (git sparse-checkout)
    ├── sync-cnode.sh        # Dev tool (source→pkg sync)
    ├── USER_GUIDE.md        # Human-readable guide
    ├── README.md            # Developer documentation
    └── QUICKSTART.md        # Cheat sheet
```

---
## Core Components Deep Dive

### 1. `server_manager.py` - Orchestration Engine

**Role:** Manages Docker container lifecycle, auto-detects deployment mode.

**Key Classes:**
- `ServerManager` - Main orchestrator
  - `start(replicas, mode, port, env_file, image)` → Deploy server
  - `stop(remove_volumes)` → Teardown
  - `status()` → Health check
  - `scale(replicas)` → Live scaling
  - `logs(follow, tail)` → Stream logs
  - `cleanup(force)` → Emergency cleanup

**State Management:**
- File: `~/.crawl4ai/server_state.yml`
- Schema: `{mode, replicas, port, image, started_at, containers[]}`
- Atomic writes with lock file
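Since several `cnode` commands can touch the state file concurrently, the atomic-write half of that pattern is what prevents a half-written YAML file. A minimal sketch of it (illustrative names; not the actual `server_manager.py` code):

```python
import os
import tempfile
import yaml  # PyYAML, already a cnode dependency

def save_state(state: dict, path: str) -> None:
    """Atomic-write pattern: dump to a temp file in the same dir, then rename.

    os.replace() is atomic on POSIX, so a crash mid-write never leaves a
    partially written server_state.yml behind.
    """
    dirname = os.path.dirname(path) or "."
    os.makedirs(dirname, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            yaml.safe_dump(state, f)
        os.replace(tmp, path)  # atomic swap over the real file
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)  # don't leave temp files behind on failure
        raise
```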
**Deployment Modes:**
```python
if replicas == 1:
    mode = "single"      # docker run
elif swarm_available():
    mode = "swarm"       # docker stack deploy
else:
    mode = "compose"     # docker-compose + nginx
```
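`swarm_available()` itself isn't shown here; one plausible implementation (an assumption, not the actual source) queries the daemon's Swarm state via `docker info`:

```python
import subprocess

def swarm_available() -> bool:
    """Return True if the local Docker daemon is an active Swarm node.

    `docker info` reports the node's Swarm state as 'active' or 'inactive'.
    """
    try:
        out = subprocess.run(
            ["docker", "info", "--format", "{{.Swarm.LocalNodeState}}"],
            capture_output=True, text=True, timeout=10, check=True,
        ).stdout.strip()
    except (OSError, subprocess.SubprocessError):
        return False  # docker missing or daemon unreachable → fall back
    return out == "active"
```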
**Container Naming:**
- Single: `crawl4ai-server`
- Swarm: `crawl4ai-stack_crawl4ai`
- Compose: `crawl4ai-server-{1..N}`, `crawl4ai-nginx`

**Networks:**
- `crawl4ai-network` (bridge mode for all)

**Volumes:**
- `crawl4ai-redis-data` - Persistent queue
- `crawl4ai-profiles` - Browser profiles

**Health Checks:**
- Endpoint: `http://localhost:{port}/health`
- Timeout: 30s startup
- Retry: 3 attempts
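Those health-check parameters could be wired up roughly like this (a sketch; function and constant names are illustrative, not the manager's actual code):

```python
import time
import urllib.error
import urllib.request

STARTUP_TIMEOUT = 30.0  # seconds
RETRIES = 3

def wait_healthy(port, attempts=RETRIES, timeout=STARTUP_TIMEOUT):
    """Poll http://localhost:{port}/health until it answers 200, or give up."""
    deadline = time.monotonic() + timeout
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(
                f"http://localhost:{port}/health", timeout=5
            ) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet
        if attempt < attempts - 1 and time.monotonic() < deadline:
            time.sleep(2)  # back off between attempts
    return False
```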
---

### 2. `server.py` - FastAPI Application

**Role:** HTTP server exposing crawl API + monitoring.

**Startup Flow:**
```python
app = FastAPI()

@app.on_event("startup")
async def startup():
    init_crawler_pool()        # Pre-warm browsers
    init_redis_connection()    # Job queue
    start_monitor_collector()  # Metrics
```

**Key Endpoints:**
```
POST /crawl          → api.py:crawl_endpoint()
POST /crawl/stream   → api.py:crawl_stream_endpoint()
POST /screenshot     → api.py:screenshot_endpoint()
POST /pdf            → api.py:pdf_endpoint()
GET  /health         → server.py:health_check()
GET  /monitor        → monitor_routes.py:dashboard()
WS   /monitor/ws     → monitor_routes.py:websocket_endpoint()
GET  /playground     → static/playground/index.html
```

**Process Manager:**
- Uses `supervisord` to manage:
  - FastAPI server (port 11235)
  - Redis (port 6379)
  - Background workers

**Environment:**
```bash
CRAWL4AI_PORT=11235
REDIS_URL=redis://localhost:6379
MAX_CONCURRENT_CRAWLS=5
BROWSER_POOL_SIZE=3
```
---

### 3. `api.py` - Crawl Endpoints

**Main Endpoint:** `POST /crawl`

**Request Schema:**
```json
{
  "urls": ["https://example.com"],
  "priority": 10,
  "browser_config": {
    "type": "BrowserConfig",
    "params": {"headless": true, "viewport_width": 1920}
  },
  "crawler_config": {
    "type": "CrawlerRunConfig",
    "params": {"cache_mode": "bypass", "extraction_strategy": {...}}
  }
}
```

**Processing Flow:**
```
1. Validate request (Pydantic)
2. Queue job → Redis
3. Get browser from pool → crawler_pool.py
4. Execute crawl → AsyncWebCrawler
5. Apply hooks → hook_manager.py
6. Return result (JSON)
7. Release browser to pool
```
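The flow above condenses into a handler skeleton like the following (illustrative only — the real `api.py:crawl_endpoint()` differs in detail; the in-memory `queue` and `results` stand in for Redis):

```python
import json
import uuid

async def crawl_endpoint(request, pool, queue, results):
    """Sketch of /crawl: validate → queue → lease browser → crawl → store."""
    urls = request.get("urls")
    if not urls:                                            # 1. validate
        return {"success": False, "error": "urls required"}
    job_id = str(uuid.uuid4())
    queue.append(json.dumps({"id": job_id, "urls": urls}))  # 2. queue (LPUSH stand-in)
    async with pool.get_crawler() as crawler:               # 3. lease from pool
        crawled = [await crawler.arun(u) for u in urls]     # 4. execute crawl
    # (5. pre/post hooks would run around the crawl via hook_manager)
    results[job_id] = crawled                               # store (SETEX stand-in)
    # 7. browser auto-released on context exit
    return {"success": True, "job_id": job_id, "result": crawled}  # 6. return
```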
**Memory Management:**
- Browser pool: Max 3 instances
- LRU eviction when pool full
- Explicit cleanup: `browser.close()` in finally block
- Redis TTL: 1 hour for completed jobs

**Error Handling:**
```python
try:
    result = await crawler.arun(url, config)
except PlaywrightError as e:
    # Browser crash - release & recreate
    await pool.invalidate(browser_id)
except TimeoutError as e:
    # Timeout - kill & retry
    await crawler.kill()
except Exception as e:
    # Unknown - log & fail gracefully
    logger.error(f"Crawl failed: {e}")
```
---

### 4. `crawler_pool.py` - Browser Pool Manager

**Role:** Manage persistent browser instances to avoid startup overhead.

**Class:** `CrawlerPool`
- `get_crawler()` → Lease browser (async context manager)
- `release_crawler(id)` → Return to pool
- `warm_up(count)` → Pre-launch browsers
- `cleanup()` → Close all browsers

**Pool Strategy:**
```python
# Internal pool state (conceptual)
pool = {
    "browser_1": {"crawler": AsyncWebCrawler(), "in_use": False},
    "browser_2": {"crawler": AsyncWebCrawler(), "in_use": False},
    "browser_3": {"crawler": AsyncWebCrawler(), "in_use": False},
}

# Callers lease via the context manager
async with pool.get_crawler() as crawler:
    result = await crawler.arun(url)
# Auto-released on context exit
```

**Anti-Leak Mechanisms:**
1. Context managers enforce cleanup
2. Watchdog thread kills stale browsers (>10min idle)
3. Max lifetime: 1 hour per browser
4. Force GC after browser close
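Mechanisms 2 and 3 amount to a watchdog loop; a sketch of what that could look like (names and structure are illustrative, not the `crawler_pool.py` source):

```python
import asyncio
import time

IDLE_LIMIT = 600       # >10 min idle → stale
LIFETIME_LIMIT = 3600  # 1 hour max lifetime per browser

def find_stale(pool, now=None):
    """Return ids of idle browsers past their idle or lifetime limit."""
    now = time.monotonic() if now is None else now
    stale = []
    for bid, entry in pool.items():
        if entry["in_use"]:
            continue  # never kill a leased browser
        idle_too_long = now - entry["last_used"] > IDLE_LIMIT
        too_old = now - entry["created"] > LIFETIME_LIMIT
        if idle_too_long or too_old:
            stale.append(bid)
    return stale

async def watchdog(pool, close, interval=60.0):
    """Periodically close stale browsers; `close` is an async callable."""
    while True:
        for bid in find_stale(pool):
            await close(bid)
        await asyncio.sleep(interval)
```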
---

### 5. `monitor.py` + `monitor_routes.py` - Real-time Dashboard

**Architecture:**
```
[Browser] <--WebSocket--> [monitor_routes.py] <--Events--> [monitor.py]
                                                               ↓
                                                        [Redis Pub/Sub]
                                                               ↓
                                                      [Metrics Collector]
```

**Metrics Collected:**
- Requests/sec (sliding window)
- Active crawls (real-time count)
- Response times (p50, p95, p99)
- Error rate (5min rolling)
- Memory usage (RSS, heap)
- Browser pool utilization
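The sliding-window percentiles can be computed from a timestamped deque; a minimal sketch (not the actual `monitor.py` implementation):

```python
import time
from collections import deque

class LatencyWindow:
    """Rolling window of response times for percentile metrics."""

    def __init__(self, max_age=300.0):
        self.max_age = max_age
        self.samples = deque()  # (timestamp, latency_ms)

    def record(self, latency_ms, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, latency_ms))

    def percentile(self, p, now=None):
        now = time.monotonic() if now is None else now
        # Drop samples that have aged out of the window
        while self.samples and now - self.samples[0][0] > self.max_age:
            self.samples.popleft()
        if not self.samples:
            return 0.0
        values = sorted(v for _, v in self.samples)
        idx = min(len(values) - 1, int(p / 100 * len(values)))
        return values[idx]
```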
**WebSocket Protocol:**
```json
// Server → Client
{
  "type": "metrics",
  "data": {
    "rps": 45.3,
    "active_crawls": 12,
    "p95_latency": 1234,
    "error_rate": 0.02
  }
}

// Client → Server
{
  "type": "subscribe",
  "channels": ["metrics", "logs"]
}
```

**Dashboard Route:** `/monitor`
- Real-time graphs (Chart.js)
- Request log stream
- Container health status
- Resource utilization

---

### 6. `cnode_cli.py` - CLI Interface

**Framework:** Click (Python CLI framework)

**Command Structure:**
```
cnode
├── start [--replicas N] [--port P] [--mode M] [--image I]
├── stop [--remove-volumes]
├── status
├── scale N
├── logs [--follow] [--tail N]
├── restart [--replicas N]
└── cleanup [--force]
```
**Execution Flow:**
```python
@cli.command("start")
def start_cmd(replicas, mode, port, env_file, image):
    manager = ServerManager()
    # Async bridge: anyio.run() takes the async function plus its args,
    # not an already-called coroutine
    result = anyio.run(manager.start, ...)
    if result["success"]:
        console.print(success_panel)
```
**User Feedback:**
- Rich library for colors/tables
- Progress spinners during operations
- Error messages with hints
- Status tables with health indicators

**State Persistence:**
- Saves deployment config to `~/.crawl4ai/server_state.yml`
- Enables stateless commands (status, scale, restart)

---

### 7. Docker Orchestration Details

**Single Container Mode (N=1):**
```bash
docker run -d \
  --name crawl4ai-server \
  --network crawl4ai-network \
  -p 11235:11235 \
  -v crawl4ai-redis-data:/data \
  unclecode/crawl4ai:latest
```

**Docker Swarm Mode (N>1, Swarm available):**
```yaml
# docker-compose.swarm.yml
version: '3.8'
services:
  crawl4ai:
    image: unclecode/crawl4ai:latest
    deploy:
      replicas: 5
      update_config:
        parallelism: 2
        delay: 10s
      restart_policy:
        condition: on-failure
    ports:
      - "11235:11235"
    networks:
      - crawl4ai-network
```

Deploy: `docker stack deploy -c docker-compose.swarm.yml crawl4ai-stack`

**Docker Compose + Nginx Mode (N>1, fallback):**
```yaml
# docker-compose.yml
services:
  crawl4ai-1:
    image: unclecode/crawl4ai:latest
    networks: [crawl4ai-network]

  crawl4ai-2:
    image: unclecode/crawl4ai:latest
    networks: [crawl4ai-network]

  nginx:
    image: nginx:alpine
    ports: ["11235:80"]
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    networks: [crawl4ai-network]
```

Nginx config (round-robin load balancing):
```nginx
upstream crawl4ai_backend {
    server crawl4ai-1:11235;
    server crawl4ai-2:11235;
    server crawl4ai-3:11235;
}

server {
    listen 80;
    location / {
        proxy_pass http://crawl4ai_backend;
        proxy_set_header Host $host;
    }
}
```
---

## Memory Leak Prevention Strategy

### Problem Areas & Solutions

**1. Browser Instances**
```python
# ❌ BAD - Leak risk
crawler = AsyncWebCrawler()
result = await crawler.arun(url)
# Browser never closed!

# ✅ GOOD - Guaranteed cleanup
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url)
# Auto-closed on exit
```
**2. WebSocket Connections**
```python
# monitor_routes.py
active_connections = set()

@app.websocket("/monitor/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    active_connections.add(websocket)
    try:
        while True:
            await websocket.send_json(get_metrics())
            await asyncio.sleep(1)
    except WebSocketDisconnect:
        pass  # client went away
    finally:
        active_connections.discard(websocket)  # Critical!
```
**3. Redis Connections**
```python
# Use connection pooling (redis.asyncio)
redis_pool = aioredis.ConnectionPool(
    host="localhost",
    port=6379,
    max_connections=10,
    decode_responses=True,
)
redis_client = aioredis.Redis(connection_pool=redis_pool)

# Reuse pooled connections
async def get_job(job_id):
    data = await redis_client.get(f"job:{job_id}")
    # Connection auto-returned to pool after each command
    return data
```
**4. Async Task Cleanup**
```python
# Track background tasks (keep a strong reference, or asyncio may GC them mid-flight)
background_tasks = set()

def spawn_crawl(url):
    task = asyncio.create_task(crawl_task(url))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)

async def crawl_task(url):
    result = await crawl(url)

# On shutdown
async def shutdown():
    tasks = list(background_tasks)
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
```
**5. File Descriptor Leaks**
```python
# Use context managers for files
async def save_screenshot(job_id, screenshot_bytes):
    async with aiofiles.open(f"{job_id}.png", "wb") as f:
        await f.write(screenshot_bytes)
    # File auto-closed
```
---

## Installation & Distribution

### User Installation Flow

**Script:** `deploy/installer/install-cnode.sh`

**Steps:**
1. Check Python 3.8+ exists
2. Check pip available
3. Check Docker installed (warn if missing)
4. Create temp dir: `mktemp -d`
5. Git sparse-checkout:
   ```bash
   git init
   git remote add origin https://github.com/unclecode/crawl4ai.git
   git config core.sparseCheckout true
   echo "deploy/installer/cnode_pkg/*" > .git/info/sparse-checkout
   git pull --depth=1 origin main
   ```
6. Install deps: `pip install click rich anyio pyyaml`
7. Copy package: `cnode_pkg/ → /usr/local/lib/cnode/`
8. Create wrapper: `/usr/local/bin/cnode`
   ```bash
   #!/usr/bin/env bash
   export PYTHONPATH="/usr/local/lib/cnode:$PYTHONPATH"
   exec python3 -m cnode_pkg.cli "$@"
   ```
9. Cleanup temp dir

**Result:**
- Binary-like experience (fast startup: ~0.1s)
- No need for PyInstaller (49x faster)
- Platform-independent (any OS with Python)

---

## Development Workflow

### Source Code Sync (Auto)

**Git Hook:** `.githooks/pre-commit`

**Trigger:** When committing `deploy/docker/cnode_cli.py` or `server_manager.py`

**Action:**
```
1. Diff source vs package
2. If different:
   - Run sync-cnode.sh
   - Copy cnode_cli.py → cnode_pkg/cli.py
   - Fix imports: s/deploy.docker/cnode_pkg/g
   - Copy server_manager.py → cnode_pkg/
   - Stage synced files
3. Continue commit
```
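That sync step can be sketched as a small shell function (hypothetical; the real `sync-cnode.sh` may differ, and `SRC_DIR`/`PKG_DIR` are illustrative):

```shell
set -euo pipefail

SRC_DIR="${SRC_DIR:-deploy/docker}"
PKG_DIR="${PKG_DIR:-deploy/installer/cnode_pkg}"

sync_file() {
  local src="$1" dst="$2"
  # Only copy when content actually differs (keeps the pre-commit hook quiet)
  if ! cmp -s "$src" "$dst" 2>/dev/null; then
    # Rewrite package-internal imports: deploy.docker -> cnode_pkg
    sed 's/deploy\.docker/cnode_pkg/g' "$src" > "$dst"
    echo "✓ synced $(basename "$dst")"
  fi
}

if [ -f "$SRC_DIR/cnode_cli.py" ]; then
  sync_file "$SRC_DIR/cnode_cli.py" "$PKG_DIR/cli.py"
fi
if [ -f "$SRC_DIR/server_manager.py" ]; then
  sync_file "$SRC_DIR/server_manager.py" "$PKG_DIR/server_manager.py"
fi
```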
**Setup:** `./setup-hooks.sh` (configures `git config core.hooksPath .githooks`)

**Smart Behavior:**
- Silent when no sync needed
- Only syncs if content differs
- Minimal output: `✓ cnode synced`

---
## API Request/Response Flow

### Example: POST /crawl

**Request:**
```bash
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "browser_config": {
      "type": "BrowserConfig",
      "params": {"headless": true}
    },
    "crawler_config": {
      "type": "CrawlerRunConfig",
      "params": {"cache_mode": "bypass"}
    }
  }'
```

**Processing:**
```
1. FastAPI receives request → api.py:crawl_endpoint()
2. Validate schema → Pydantic models in schemas.py
3. Create job → job.py:Job(id=uuid4(), urls=[...])
4. Queue to Redis → LPUSH crawl_queue {job_json}
5. Get browser from pool → crawler_pool.py:get_crawler()
6. Execute crawl:
   a. Launch page → browser.new_page()
   b. Navigate → page.goto(url)
   c. Extract → extraction_strategy.extract()
   d. Generate markdown → markdown_generator.generate()
7. Store result → Redis SETEX result:{job_id} 3600 {result_json}
8. Release browser → pool.release(browser_id)
9. Return response:
   {
     "success": true,
     "result": {
       "url": "https://example.com",
       "markdown": "# Example Domain...",
       "metadata": {"title": "Example Domain"},
       "extracted_content": {...}
     }
   }
```
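Step 7's TTL write works with any Redis client exposing `setex` (e.g. `redis.asyncio.Redis`); a sketch with illustrative helper names:

```python
import json

RESULT_TTL = 3600  # completed jobs expire after 1 hour (step 7)

async def store_result(redis_client, job_id, result):
    """Persist a crawl result with a TTL so Redis garbage-collects old jobs."""
    # SETEX writes the value and its expiry atomically
    await redis_client.setex(f"result:{job_id}", RESULT_TTL, json.dumps(result))

async def fetch_result(redis_client, job_id):
    raw = await redis_client.get(f"result:{job_id}")
    return json.loads(raw) if raw is not None else None
```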
**Error Cases:**
- 400: Invalid request schema
- 429: Rate limit exceeded
- 500: Internal error (browser crash, timeout)
- 503: Service unavailable (all browsers busy)

---

## Scaling Behavior

### Scale-Up (1 → 10 replicas)

**Command:** `cnode scale 10`

**Swarm Mode:**
```bash
docker service scale crawl4ai-stack_crawl4ai=10
# Docker handles:
# - Container creation
# - Network attachment
# - Load balancer update
# - Rolling deployment
```

**Compose Mode:**
```bash
# Update docker-compose.yml
# Change replica count in all service definitions
docker-compose up -d --scale crawl4ai=10
# Regenerate nginx.conf with 10 upstreams
docker exec nginx nginx -s reload
```
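"Regenerate nginx.conf with 10 upstreams" is a templating job; a sketch of how the manager might render it (assumed, not the actual implementation):

```python
def render_nginx_conf(replicas, port=11235):
    """Render a round-robin nginx config for crawl4ai-1..N (illustrative)."""
    servers = "\n".join(
        f"        server crawl4ai-{i}:{port};" for i in range(1, replicas + 1)
    )
    return f"""\
events {{}}
http {{
    upstream crawl4ai_backend {{
{servers}
    }}
    server {{
        listen 80;
        location / {{
            proxy_pass http://crawl4ai_backend;
            proxy_set_header Host $host;
        }}
    }}
}}
"""
```

After writing the rendered file over `./nginx.conf`, a graceful `nginx -s reload` (as in the Compose-mode snippet above) picks up the new upstream list without dropping in-flight requests.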
**Load Distribution:**
- Swarm: Built-in ingress network (VIP-based round-robin)
- Compose: Nginx upstream (round-robin; can be configured as least_conn)

**Zero-Downtime:**
- Swarm: Yes (rolling update, parallelism=2)
- Compose: Partial (nginx reload is graceful, but expect a brief latency spike)

---

## Configuration Files

### `config.yml` - Server Configuration

```yaml
server:
  port: 11235
  host: "0.0.0.0"
  workers: 4

crawler:
  max_concurrent: 5
  timeout: 30
  retries: 3

browser:
  pool_size: 3
  headless: true
  args:
    - "--no-sandbox"
    - "--disable-dev-shm-usage"

redis:
  host: "localhost"
  port: 6379
  db: 0

monitoring:
  enabled: true
  metrics_interval: 5  # seconds
```
### `supervisord.conf` - Process Management

```ini
[supervisord]
nodaemon=true

[program:redis]
command=redis-server --port 6379
autorestart=true

[program:fastapi]
command=uvicorn server:app --host 0.0.0.0 --port 11235
autorestart=true
stdout_logfile=/var/log/crawl4ai/api.log

[program:monitor]
command=python monitor.py
autorestart=true
```

---

## Testing & Quality

### Test Structure

```
deploy/docker/tests/
├── cli/                    # CLI command tests
│   └── test_commands.py    # start, stop, scale, status
├── monitor/                # Dashboard tests
│   └── test_websocket.py   # WS connection, metrics
└── codebase_test/          # Integration tests
    └── test_api.py         # End-to-end crawl tests
```

### Key Test Cases

**CLI Tests:**
- `test_start_single()` - Starts 1 replica
- `test_start_cluster()` - Starts N replicas
- `test_scale_up()` - Scales 1→5
- `test_scale_down()` - Scales 5→2
- `test_status()` - Reports correct state
- `test_logs()` - Streams logs

**API Tests:**
- `test_crawl_success()` - Basic crawl works
- `test_crawl_timeout()` - Handles slow sites
- `test_concurrent_crawls()` - Parallel requests
- `test_browser_pool()` - Reuses browsers
- `test_memory_cleanup()` - No leaks after 100 crawls

**Monitor Tests:**
- `test_websocket_connect()` - WS handshake
- `test_metrics_stream()` - Receives updates
- `test_multiple_clients()` - Handles N connections

---
## Critical File Cross-Reference

| Component | Primary File | Dependencies |
|-----------|--------------|--------------|
| **CLI Entry** | `cnode_cli.py:482` | `server_manager.py`, `click`, `rich` |
| **Orchestrator** | `server_manager.py:45` | `docker`, `yaml`, `anyio` |
| **API Server** | `server.py:120` | `api.py`, `monitor_routes.py` |
| **Crawl Logic** | `api.py:78` | `crawler_pool.py`, `AsyncWebCrawler` |
| **Browser Pool** | `crawler_pool.py:23` | `AsyncWebCrawler`, `asyncio` |
| **Monitoring** | `monitor.py:156` | `redis`, `psutil` |
| **Dashboard** | `monitor_routes.py:89` | `monitor.py`, `websockets` |
| **Hooks** | `hook_manager.py:12` | `api.py`, custom user hooks |

**Startup Chain:**
```
cnode start
└→ cnode_cli.py:start_cmd()
   └→ server_manager.py:start()
      └→ docker run/stack/compose
         └→ supervisord
            ├→ redis-server
            ├→ server.py
            │  └→ api.py (routes)
            │     └→ crawler_pool.py (init)
            └→ monitor.py (collector)
```

---
## Symbolic Notation Summary

```
⊕ Addition/Creation     ⊖ Removal/Cleanup
⊗ Multiplication/Scale  ⊘ Division/Split
→ Flow/Dependency       ← Reverse flow
⇄ Bidirectional         ⇵ Up/Down scale
✓ Success/Complete      ✗ Failure/Error
⚠ Warning               ⚡ Performance critical
🔒 Lock/Exclusive       🔓 Unlock/Shared
📊 Metrics              📝 Logs
🌐 Network              💾 Storage
🧠 Memory               🐳 Docker
```

**State Machine:**
```
[STOPPED]  →start→  [STARTING]  →ready→   [RUNNING]
[RUNNING]  →scale→  [SCALING]   →done→    [RUNNING]
[RUNNING]  →stop→   [STOPPING]  →done→    [STOPPED]
[ANY]      →error→  [ERROR]     →cleanup→ [STOPPED]
```
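The state machine above maps directly onto a transition table (a sketch, not code from `server_manager.py`):

```python
# Legal (state, event) → next-state pairs; "[ANY] →error→" is special-cased.
TRANSITIONS = {
    ("STOPPED", "start"): "STARTING",
    ("STARTING", "ready"): "RUNNING",
    ("RUNNING", "scale"): "SCALING",
    ("SCALING", "done"): "RUNNING",
    ("RUNNING", "stop"): "STOPPING",
    ("STOPPING", "done"): "STOPPED",
    ("ERROR", "cleanup"): "STOPPED",
}

def next_state(state, event):
    if event == "error":
        return "ERROR"  # any state may fail into ERROR
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} --{event}-->") from None
```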
---

## Quick Reference Commands

```bash
# Install
curl -sSL https://raw.githubusercontent.com/unclecode/crawl4ai/main/deploy/installer/install-cnode.sh | bash

# Deploy
cnode start --replicas 5 --port 11235

# Monitor
open http://localhost:11235/monitor

# Scale
cnode scale 10

# Test
curl -X POST http://localhost:11235/crawl -H "Content-Type: application/json" -d '{"urls":["https://example.com"]}'

# Logs
cnode logs -f

# Teardown
cnode stop --remove-volumes
```

---

## Future AI Agent Instructions

**When debugging:** Read this file first, then check the specific source files mentioned.

**When extending:** Follow existing patterns (context managers, async/await, type hints).

**When optimizing:** Profile first (monitor.py metrics), then optimize hot paths (crawler_pool.py, api.py).

**When troubleshooting memory:** Check browser pool cleanup, WebSocket connection tracking, and Redis connection pooling.

**When scaling misbehaves:** Verify the Docker mode (swarm vs compose), check the nginx config if using compose, and review load balancer logs.

---

**END OF ARCHITECTURE MAP**

*Version: 1.0.0 | Last Updated: 2025-10-21 | Token-Optimized for AI Consumption*