docs: add AI-optimized architecture map and quick start cheat sheet

ARCHITECTURE.md:
- Dense technical reference for AI agents
- Complete system flow diagrams
- Memory leak prevention strategies
- File cross-references with line numbers
- Symbolic notation for compression
- Docker orchestration deep dive

QUICKSTART.md:
- One-page cheat sheet for users
- Install → launch → scale → test workflow
- Simple example.com curl test
- Common commands reference
unclecode
2025-10-23 12:20:07 +08:00
parent 418dd60a80
commit 589339a336
2 changed files with 969 additions and 0 deletions

@@ -0,0 +1,822 @@
# Crawl4AI Docker Architecture - AI Context Map
**Purpose:** Dense technical reference for AI agents to understand complete system architecture.
**Format:** Symbolic, compressed, high-information-density documentation.
---
## System Overview
```
┌─────────────────────────────────────────────────────────────┐
│ CRAWL4AI DOCKER ORCHESTRATION SYSTEM │
├─────────────────────────────────────────────────────────────┤
│ Modes: Single (N=1) | Swarm (N>1) | Compose+Nginx (N>1) │
│ Entry: cnode CLI → deploy/docker/cnode_cli.py │
│ Core: deploy/docker/server_manager.py │
│ Server: deploy/docker/server.py (FastAPI) │
│ API: deploy/docker/api.py (crawl endpoints) │
│ Monitor: deploy/docker/monitor.py + monitor_routes.py │
└─────────────────────────────────────────────────────────────┘
```
---
## Directory Structure & File Map
```
deploy/
├── docker/ # Server runtime & orchestration
│ ├── server.py # FastAPI app entry [CRITICAL]
│ ├── api.py # /crawl, /screenshot, /pdf endpoints
│ ├── server_manager.py # Docker orchestration logic [CORE]
│ ├── cnode_cli.py # CLI interface (Click-based)
│ ├── monitor.py # Real-time metrics collector
│ ├── monitor_routes.py # /monitor dashboard routes
│ ├── crawler_pool.py # Browser pool management
│ ├── hook_manager.py # Pre/post crawl hooks
│ ├── job.py # Job queue schema
│ ├── utils.py # Helpers (port check, health)
│ ├── auth.py # API key authentication
│ ├── schemas.py # Pydantic models
│ ├── mcp_bridge.py # MCP protocol bridge
│ ├── supervisord.conf # Process manager config
│ ├── config.yml # Server config template
│ ├── requirements.txt # Python deps
│ ├── static/ # Web assets
│ │ ├── monitor/ # Dashboard UI
│ │ └── playground/ # API playground
│ └── tests/ # Test suite
└── installer/ # User-facing installation
    ├── cnode_pkg/ # Standalone package
    │   ├── cli.py # Copy of cnode_cli.py
    │   ├── server_manager.py # Copy of server_manager.py
    │   └── requirements.txt # click, rich, anyio, pyyaml
    ├── install-cnode.sh # Remote installer (git sparse-checkout)
    ├── sync-cnode.sh # Dev tool (source→pkg sync)
    ├── USER_GUIDE.md # Human-readable guide
    ├── README.md # Developer documentation
    └── QUICKSTART.md # Cheat sheet
```
---
## Core Components Deep Dive
### 1. `server_manager.py` - Orchestration Engine
**Role:** Manages Docker container lifecycle, auto-detects deployment mode.
**Key Classes:**
- `ServerManager` - Main orchestrator
  - `start(replicas, mode, port, env_file, image)` → Deploy server
  - `stop(remove_volumes)` → Teardown
  - `status()` → Health check
  - `scale(replicas)` → Live scaling
  - `logs(follow, tail)` → Stream logs
  - `cleanup(force)` → Emergency cleanup
**State Management:**
- File: `~/.crawl4ai/server_state.yml`
- Schema: `{mode, replicas, port, image, started_at, containers[]}`
- Atomic writes with lock file
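
The atomic-write-with-lock pattern named above can be sketched as follows. This is a minimal illustration, not the actual `server_manager.py` code: the helper names `file_lock` and `save_state` and the `.lock` suffix are assumptions, and the state is serialized as JSON (which is also valid YAML 1.2) only to keep the sketch free of third-party imports.

```python
import json
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def file_lock(lock_path: Path):
    """Hypothetical exclusive lock: creation fails if the lock file exists."""
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def save_state(state: dict, path: Path) -> None:
    """Write state to a temp file, then atomically swap it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with file_lock(path.with_suffix(".lock")):
        fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(state, f)
                f.flush()
                os.fsync(f.fileno())  # Make sure bytes hit disk before the swap
            os.replace(tmp_name, path)  # Atomic rename on the same filesystem
        except BaseException:
            if os.path.exists(tmp_name):
                os.unlink(tmp_name)
            raise
```

Readers of the state file then see either the old or the new content, never a half-written file.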
**Deployment Modes:**
```python
if replicas == 1:
    mode = "single"   # docker run
elif swarm_available():
    mode = "swarm"    # docker stack deploy
else:
    mode = "compose"  # docker-compose + nginx
```
**Container Naming:**
- Single: `crawl4ai-server`
- Swarm: `crawl4ai-stack_crawl4ai`
- Compose: `crawl4ai-server-{1..N}`, `crawl4ai-nginx`
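
As a rough sketch, mode selection and the naming scheme above can be combined into two pure functions. The function names `select_mode` and `container_names` are hypothetical, and Swarm availability is passed in as a flag rather than detected, so this does not claim to mirror the real `ServerManager` internals.

```python
from typing import List


def select_mode(replicas: int, swarm_available: bool) -> str:
    """Pick a deployment mode following the rules listed above."""
    if replicas == 1:
        return "single"
    return "swarm" if swarm_available else "compose"


def container_names(mode: str, replicas: int) -> List[str]:
    """Return the expected container/service names for a deployment."""
    if mode == "single":
        return ["crawl4ai-server"]
    if mode == "swarm":
        # Swarm exposes one service; replicas are managed by Docker
        return ["crawl4ai-stack_crawl4ai"]
    if mode == "compose":
        servers = [f"crawl4ai-server-{i}" for i in range(1, replicas + 1)]
        return servers + ["crawl4ai-nginx"]
    raise ValueError(f"unknown mode: {mode}")
```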
**Networks:**
- `crawl4ai-network` (bridge mode for all)
**Volumes:**
- `crawl4ai-redis-data` - Persistent queue
- `crawl4ai-profiles` - Browser profiles
**Health Checks:**
- Endpoint: `http://localhost:{port}/health`
- Timeout: 30s startup
- Retry: 3 attempts
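
One way to read the health-check settings above is a 30-second startup budget split evenly across 3 probes. That interpretation, and the names `http_probe` and `wait_healthy`, are assumptions for illustration only.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(port: int, timeout: float = 5.0) -> bool:
    """Return True if GET /health answers with HTTP 200."""
    url = f"http://localhost:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_healthy(
    probe: Callable[[], bool],
    attempts: int = 3,
    startup_timeout: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe up to `attempts` times, spreading waits over the startup budget."""
    delay = startup_timeout / attempts
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay)  # No point sleeping after the final failed probe
    return False
```

Usage would look like `wait_healthy(lambda: http_probe(11235))`; injecting `probe` and `sleep` keeps the loop testable without a running server.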
---
### 2. `server.py` - FastAPI Application
**Role:** HTTP server exposing crawl API + monitoring.
**Startup Flow:**
```python
app = FastAPI()

@app.on_event("startup")
async def startup():
    init_crawler_pool()        # Pre-warm browsers
    init_redis_connection()    # Job queue
    start_monitor_collector()  # Metrics
```
**Key Endpoints:**
```
POST /crawl → api.py:crawl_endpoint()
POST /crawl/stream → api.py:crawl_stream_endpoint()
POST /screenshot → api.py:screenshot_endpoint()
POST /pdf → api.py:pdf_endpoint()
GET /health → server.py:health_check()
GET /monitor → monitor_routes.py:dashboard()
WS /monitor/ws → monitor_routes.py:websocket_endpoint()
GET /playground → static/playground/index.html
```
**Process Manager:**
- Uses `supervisord` to manage:
  - FastAPI server (port 11235)
  - Redis (port 6379)
  - Background workers
**Environment:**
```bash
CRAWL4AI_PORT=11235
REDIS_URL=redis://localhost:6379
MAX_CONCURRENT_CRAWLS=5
BROWSER_POOL_SIZE=3
```
---
### 3. `api.py` - Crawl Endpoints
**Main Endpoint:** `POST /crawl`
**Request Schema:**
```json
{
  "urls": ["https://example.com"],
  "priority": 10,
  "browser_config": {
    "type": "BrowserConfig",
    "params": {"headless": true, "viewport_width": 1920}
  },
  "crawler_config": {
    "type": "CrawlerRunConfig",
    "params": {"cache_mode": "bypass", "extraction_strategy": {...}}
  }
}
```
**Processing Flow:**
```
1. Validate request (Pydantic)
2. Queue job → Redis
3. Get browser from pool → crawler_pool.py
4. Execute crawl → AsyncWebCrawler
5. Apply hooks → hook_manager.py
6. Return result (JSON)
7. Release browser to pool
```
**Memory Management:**
- Browser pool: Max 3 instances
- LRU eviction when pool full
- Explicit cleanup: `browser.close()` in finally block
- Redis TTL: 1 hour for completed jobs
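
The LRU eviction rule above can be illustrated with a small generic container. This is a sketch under assumptions: the class name `LRUPool` and its `on_evict` callback are hypothetical, and a real pool would close the evicted browser inside that callback.

```python
from collections import OrderedDict
from typing import Any, Callable, Optional


class LRUPool:
    """Keep at most `max_size` entries; evict the least recently used."""

    def __init__(
        self,
        max_size: int = 3,
        on_evict: Optional[Callable[[str, Any], None]] = None,
    ) -> None:
        self.max_size = max_size
        self._items: "OrderedDict[str, Any]" = OrderedDict()
        self._on_evict = on_evict or (lambda key, value: None)

    def get(self, key: str) -> Any:
        """Return the entry and mark it as most recently used."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key: str, value: Any) -> None:
        """Insert or refresh an entry, evicting the oldest when full."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        while len(self._items) > self.max_size:
            old_key, old_value = self._items.popitem(last=False)
            self._on_evict(old_key, old_value)  # e.g. close the browser here
```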
**Error Handling:**
```python
try:
    result = await crawler.arun(url, config)
except PlaywrightError as e:
    # Browser crash - release & recreate
    await pool.invalidate(browser_id)
except TimeoutError as e:
    # Timeout - kill & retry
    await crawler.kill()
except Exception as e:
    # Unknown - log & fail gracefully
    logger.error(f"Crawl failed: {e}")
```
---
### 4. `crawler_pool.py` - Browser Pool Manager
**Role:** Manage persistent browser instances to avoid startup overhead.
**Class:** `CrawlerPool`
- `get_crawler()` → Lease browser (async context manager, used via `async with`)
- `release_crawler(id)` → Return to pool
- `warm_up(count)` → Pre-launch browsers
- `cleanup()` → Close all browsers
**Pool Strategy:**
```python
# Internal slot table (conceptual)
slots = {
    "browser_1": {"crawler": AsyncWebCrawler(), "in_use": False},
    "browser_2": {"crawler": AsyncWebCrawler(), "in_use": False},
    "browser_3": {"crawler": AsyncWebCrawler(), "in_use": False},
}
# Usage: `pool` is the CrawlerPool instance that owns the slots
async with pool.get_crawler() as crawler:
    result = await crawler.arun(url)
# Auto-released on context exit
```
**Anti-Leak Mechanisms:**
1. Context managers enforce cleanup
2. Watchdog thread kills stale browsers (>10min idle)
3. Max lifetime: 1 hour per browser
4. Force GC after browser close
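
The idle and lifetime limits above reduce to a simple predicate that a watchdog could evaluate periodically. The sketch below is an assumption about how such a check might look; `BrowserSlot` and `should_recycle` are hypothetical names, and skipping browsers that are currently leased is a design choice of this sketch, not a documented behavior.

```python
from dataclasses import dataclass

IDLE_LIMIT_S = 10 * 60     # Stale after more than 10 minutes idle
MAX_LIFETIME_S = 60 * 60   # Hard cap: 1 hour per browser


@dataclass
class BrowserSlot:
    created_at: float    # Timestamp (seconds) when the browser was launched
    last_used_at: float  # Timestamp (seconds) of the last lease or release
    in_use: bool = False


def should_recycle(slot: BrowserSlot, now: float) -> bool:
    """Decide whether the watchdog should close and replace this browser."""
    if slot.in_use:
        return False  # Assumption: never kill a browser mid-crawl
    too_old = now - slot.created_at > MAX_LIFETIME_S
    too_idle = now - slot.last_used_at > IDLE_LIMIT_S
    return too_old or too_idle
```

Passing `now` explicitly (for example `time.monotonic()`) keeps the policy easy to unit-test.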
---
### 5. `monitor.py` + `monitor_routes.py` - Real-time Dashboard
**Architecture:**
```
[Browser] <--WebSocket--> [monitor_routes.py] <--Events--> [monitor.py]
[Redis Pub/Sub]
[Metrics Collector]
```
**Metrics Collected:**
- Requests/sec (sliding window)
- Active crawls (real-time count)
- Response times (p50, p95, p99)
- Error rate (5min rolling)
- Memory usage (RSS, heap)
- Browser pool utilization
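
Two of the metrics above, latency percentiles and a sliding-window request rate, can be computed as in the sketch below. It uses the nearest-rank percentile definition and a deque of timestamps; both choices, along with the names `percentile` and `SlidingRate`, are assumptions rather than a description of what `monitor.py` actually does.

```python
import math
from collections import deque
from typing import Deque, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile (pct in 1..100) of a non-empty sample."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


class SlidingRate:
    """Requests per second averaged over the last `window_s` seconds."""

    def __init__(self, window_s: float = 60.0) -> None:
        self.window_s = window_s
        self._stamps: Deque[float] = deque()

    def record(self, now: float) -> None:
        """Register one request at time `now` (seconds)."""
        self._stamps.append(now)

    def rate(self, now: float) -> float:
        """Drop timestamps that fell out of the window, then average."""
        while self._stamps and self._stamps[0] <= now - self.window_s:
            self._stamps.popleft()
        return len(self._stamps) / self.window_s
```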
**WebSocket Protocol:**
```json
// Server → Client
{
  "type": "metrics",
  "data": {
    "rps": 45.3,
    "active_crawls": 12,
    "p95_latency": 1234,
    "error_rate": 0.02
  }
}
// Client → Server
{
  "type": "subscribe",
  "channels": ["metrics", "logs"]
}
```
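
The two message shapes above could be handled with helpers along these lines. This is only a sketch: `parse_subscribe`, `encode_metrics`, and the decision to silently drop unknown channels are assumptions, not part of the documented protocol.

```python
import json
from typing import Set

KNOWN_CHANNELS = {"metrics", "logs"}  # Channels shown in the example above


def parse_subscribe(raw: str) -> Set[str]:
    """Validate a client 'subscribe' message and return accepted channels."""
    msg = json.loads(raw)
    if msg.get("type") != "subscribe":
        raise ValueError("expected a subscribe message")
    requested = set(msg.get("channels", []))
    return requested & KNOWN_CHANNELS  # Assumption: ignore unknown channels


def encode_metrics(data: dict) -> str:
    """Wrap a metrics payload in the server-to-client envelope."""
    return json.dumps({"type": "metrics", "data": data})
```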
**Dashboard Route:** `/monitor`
- Real-time graphs (Chart.js)
- Request log stream
- Container health status
- Resource utilization
---
### 6. `cnode_cli.py` - CLI Interface
**Framework:** Click (Python CLI framework)
**Command Structure:**
```
cnode
├── start [--replicas N] [--port P] [--mode M] [--image I]
├── stop [--remove-volumes]
├── status
├── scale N
├── logs [--follow] [--tail N]
├── restart [--replicas N]
└── cleanup [--force]
```
**Execution Flow:**
```python
@cli.command("start")
def start_cmd(replicas, mode, port, env_file, image):
    manager = ServerManager()
    # Async bridge: anyio.run takes the coroutine function plus its arguments
    result = anyio.run(manager.start, ...)
    if result["success"]:
        console.print(success_panel)
```
**User Feedback:**
- Rich library for colors/tables
- Progress spinners during operations
- Error messages with hints
- Status tables with health indicators
**State Persistence:**
- Saves deployment config to `~/.crawl4ai/server_state.yml`
- Enables stateless commands (status, scale, restart)
---
### 7. Docker Orchestration Details
**Single Container Mode (N=1):**
```bash
docker run -d \
  --name crawl4ai-server \
  --network crawl4ai-network \
  -p 11235:11235 \
  -v crawl4ai-redis-data:/data \
  unclecode/crawl4ai:latest
```
**Docker Swarm Mode (N>1, Swarm available):**
```yaml
# docker-compose.swarm.yml
version: '3.8'
services:
  crawl4ai:
    image: unclecode/crawl4ai:latest
    deploy:
      replicas: 5
      update_config:
        parallelism: 2
        delay: 10s
      restart_policy:
        condition: on-failure
    ports:
      - "11235:11235"
    networks:
      - crawl4ai-network
```
Deploy: `docker stack deploy -c docker-compose.swarm.yml crawl4ai-stack`
**Docker Compose + Nginx Mode (N>1, fallback):**
```yaml
# docker-compose.yml
services:
  crawl4ai-1:
    image: unclecode/crawl4ai:latest
    networks: [crawl4ai-network]
  crawl4ai-2:
    image: unclecode/crawl4ai:latest
    networks: [crawl4ai-network]
  nginx:
    image: nginx:alpine
    ports: ["11235:80"]
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    networks: [crawl4ai-network]
```
Nginx config (round-robin load balancing):
```nginx
upstream crawl4ai_backend {
    server crawl4ai-1:11235;
    server crawl4ai-2:11235;
    server crawl4ai-3:11235;
}
server {
    listen 80;
    location / {
        proxy_pass http://crawl4ai_backend;
        proxy_set_header Host $host;
    }
}
```
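
Because the upstream list has to match the replica count, the block is a natural candidate for generation. The following is a hedged sketch of such a generator; `render_upstream` and its `name_template` parameter are hypothetical, with the default host pattern taken from the snippet above.

```python
def render_upstream(
    replicas: int,
    port: int = 11235,
    name_template: str = "crawl4ai-{i}",
) -> str:
    """Render an nginx upstream block with one server line per replica."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    servers = "\n".join(
        f"    server {name_template.format(i=i)}:{port};"
        for i in range(1, replicas + 1)
    )
    # Doubled braces produce literal { and } in the f-string
    return f"upstream crawl4ai_backend {{\n{servers}\n}}\n"
```

Calling `render_upstream(3)` yields a block with the same three `server` lines as the hand-written config.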
---
## Memory Leak Prevention Strategy
### Problem Areas & Solutions
**1. Browser Instances**
```python
# ❌ BAD - Leak risk
crawler = AsyncWebCrawler()
result = await crawler.arun(url)
# Browser never closed!

# ✅ GOOD - Guaranteed cleanup
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url)
# Auto-closed on exit
```
**2. WebSocket Connections**
```python
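# monitor_routes.py
import asyncio
from fastapi import WebSocket, WebSocketDisconnect

active_connections = set()

@app.websocket("/monitor/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    active_connections.add(websocket)
    try:
        while True:
            await websocket.send_json(get_metrics())
            await asyncio.sleep(5)  # Pace updates (metrics_interval in config.yml)
    except WebSocketDisconnect:
        pass  # Client went away; fall through to cleanup
    finally:
        active_connections.discard(websocket)  # Critical!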
```
**3. Redis Connections**
```python
# Use connection pooling (redis-py asyncio API)
import redis.asyncio as aioredis

redis_pool = aioredis.ConnectionPool(
    host="localhost",
    port=6379,
    max_connections=10,
    decode_responses=True,
)
# One shared client; it borrows pooled connections per command
redis_client = aioredis.Redis(connection_pool=redis_pool)

async def get_job(job_id):
    # Connection is returned to the pool automatically after the call
    return await redis_client.get(f"job:{job_id}")
```
**4. Async Task Cleanup**
```python
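import asyncio

# Track background tasks (a strong reference also prevents premature GC)
background_tasks = set()

def spawn_crawl(url):
    task = asyncio.create_task(crawl(url))
    background_tasks.add(task)
    # Drop the reference once the task finishes, whatever the outcome
    task.add_done_callback(background_tasks.discard)
    return task

# On shutdown
async def shutdown():
    tasks = list(background_tasks)
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)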
```
**5. File Descriptor Leaks**
```python
import aiofiles

# Use context managers for files
async def save_screenshot(job_id, screenshot_bytes):
    async with aiofiles.open(f"{job_id}.png", "wb") as f:
        await f.write(screenshot_bytes)
    # File auto-closed
```
---
## Installation & Distribution
### User Installation Flow
**Script:** `deploy/installer/install-cnode.sh`
**Steps:**
1. Check Python 3.8+ exists
2. Check pip available
3. Check Docker installed (warn if missing)
4. Create temp dir: `mktemp -d`
5. Git sparse-checkout:
```bash
git init
git remote add origin https://github.com/unclecode/crawl4ai.git
git config core.sparseCheckout true
echo "deploy/installer/cnode_pkg/*" > .git/info/sparse-checkout
git pull --depth=1 origin main
```
6. Install deps: `pip install click rich anyio pyyaml`
7. Copy package: `cnode_pkg/ → /usr/local/lib/cnode/`
8. Create wrapper: `/usr/local/bin/cnode`
```bash
#!/usr/bin/env bash
export PYTHONPATH="/usr/local/lib/cnode:$PYTHONPATH"
exec python3 -m cnode_pkg.cli "$@"
```
9. Cleanup temp dir
**Result:**
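- Binary-like experience (fast startup: ~0.1s)
- No PyInstaller bundle needed (the plain-Python wrapper is cited as 49x faster than a PyInstaller build)
- Platform-independent package (any OS with Python 3.8+; the installer and wrapper scripts themselves need bash)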
---
## Development Workflow
### Source Code Sync (Auto)
**Git Hook:** `.githooks/pre-commit`
**Trigger:** When committing `deploy/docker/cnode_cli.py` or `server_manager.py`
**Action:**
```
1. Diff source vs package
2. If different:
   - Run sync-cnode.sh
   - Copy cnode_cli.py → cnode_pkg/cli.py
   - Fix imports: s/deploy.docker/cnode_pkg/g
   - Copy server_manager.py → cnode_pkg/
   - Stage synced files
3. Continue commit
```
**Setup:** `./setup-hooks.sh` (configures `git config core.hooksPath .githooks`)
**Smart Behavior:**
- Silent when no sync needed
- Only syncs if content differs
- Minimal output: `✓ cnode synced`
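
The "only sync if content differs" behavior can be sketched in a few lines. The real hook delegates to `sync-cnode.sh`; the `sync_file` function below is a hypothetical Python equivalent, and it uses a literal string replacement where the hook's `sed` pattern treats `.` as a wildcard.

```python
from pathlib import Path


def sync_file(src: Path, dst: Path, rewrite_imports: bool = False) -> bool:
    """Copy src to dst if the (optionally rewritten) content differs.

    Returns True when dst was written, False when it was already in sync.
    """
    text = src.read_text()
    if rewrite_imports:
        # Point package-internal imports at the standalone package
        text = text.replace("deploy.docker", "cnode_pkg")
    if dst.exists() and dst.read_text() == text:
        return False  # Silent no-op: nothing to stage
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(text)
    return True
```

A caller would stage the destination file only when the function returns `True`, which matches the hook's quiet behavior when nothing changed.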
---
## API Request/Response Flow
### Example: POST /crawl
**Request:**
```bash
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "browser_config": {
      "type": "BrowserConfig",
      "params": {"headless": true}
    },
    "crawler_config": {
      "type": "CrawlerRunConfig",
      "params": {"cache_mode": "bypass"}
    }
  }'
```
**Processing:**
```
1. FastAPI receives request → api.py:crawl_endpoint()
2. Validate schema → Pydantic models in schemas.py
3. Create job → job.py:Job(id=uuid4(), urls=[...])
4. Queue to Redis → LPUSH crawl_queue {job_json}
5. Get browser from pool → crawler_pool.py:get_crawler()
6. Execute crawl:
   a. Launch page → browser.new_page()
   b. Navigate → page.goto(url)
   c. Extract → extraction_strategy.extract()
   d. Generate markdown → markdown_generator.generate()
7. Store result → Redis SETEX result:{job_id} 3600 {result_json}
8. Release browser → pool.release(browser_id)
9. Return response:
   {
     "success": true,
     "result": {
       "url": "https://example.com",
       "markdown": "# Example Domain...",
       "metadata": {"title": "Example Domain"},
       "extracted_content": {...}
     }
   }
```
**Error Cases:**
- 400: Invalid request schema
- 429: Rate limit exceeded
- 500: Internal error (browser crash, timeout)
- 503: Service unavailable (all browsers busy)
---
## Scaling Behavior
### Scale-Up (1 → 10 replicas)
**Command:** `cnode scale 10`
**Swarm Mode:**
```bash
docker service scale crawl4ai-stack_crawl4ai=10
# Docker handles:
# - Container creation
# - Network attachment
# - Load balancer update
# - Rolling deployment
```
**Compose Mode:**
```bash
# Update docker-compose.yml
# Change replica count in all service definitions
docker-compose up -d --scale crawl4ai=10
# Regenerate nginx.conf with 10 upstreams
docker exec crawl4ai-nginx nginx -s reload  # Container name per the naming scheme above
```
**Load Distribution:**
- Swarm: Built-in ingress network (VIP-based round-robin)
- Compose: Nginx upstream (round-robin, can configure least_conn)
**Zero-Downtime:**
- Swarm: Yes (rolling update, parallelism=2)
- Compose: Partial (nginx reload is graceful, but brief spike)
---
## Configuration Files
### `config.yml` - Server Configuration
```yaml
server:
  port: 11235
  host: "0.0.0.0"
  workers: 4
crawler:
  max_concurrent: 5
  timeout: 30
  retries: 3
browser:
  pool_size: 3
  headless: true
  args:
    - "--no-sandbox"
    - "--disable-dev-shm-usage"
redis:
  host: "localhost"
  port: 6379
  db: 0
monitoring:
  enabled: true
  metrics_interval: 5 # seconds
```
### `supervisord.conf` - Process Management
```ini
[supervisord]
nodaemon=true
[program:redis]
command=redis-server --port 6379
autorestart=true
[program:fastapi]
command=uvicorn server:app --host 0.0.0.0 --port 11235
autorestart=true
stdout_logfile=/var/log/crawl4ai/api.log
[program:monitor]
command=python monitor.py
autorestart=true
```
---
## Testing & Quality
### Test Structure
```
deploy/docker/tests/
├── cli/ # CLI command tests
│ └── test_commands.py # start, stop, scale, status
├── monitor/ # Dashboard tests
│ └── test_websocket.py # WS connection, metrics
└── codebase_test/ # Integration tests
    └── test_api.py # End-to-end crawl tests
```
### Key Test Cases
**CLI Tests:**
- `test_start_single()` - Starts 1 replica
- `test_start_cluster()` - Starts N replicas
- `test_scale_up()` - Scales 1→5
- `test_scale_down()` - Scales 5→2
- `test_status()` - Reports correct state
- `test_logs()` - Streams logs
**API Tests:**
- `test_crawl_success()` - Basic crawl works
- `test_crawl_timeout()` - Handles slow sites
- `test_concurrent_crawls()` - Parallel requests
- `test_browser_pool()` - Reuses browsers
- `test_memory_cleanup()` - No leaks after 100 crawls
**Monitor Tests:**
- `test_websocket_connect()` - WS handshake
- `test_metrics_stream()` - Receives updates
- `test_multiple_clients()` - Handles N connections
---
## Critical File Cross-Reference
| Component | Primary File | Dependencies |
|-----------|--------------|--------------|
| **CLI Entry** | `cnode_cli.py:482` | `server_manager.py`, `click`, `rich` |
| **Orchestrator** | `server_manager.py:45` | `docker`, `yaml`, `anyio` |
| **API Server** | `server.py:120` | `api.py`, `monitor_routes.py` |
| **Crawl Logic** | `api.py:78` | `crawler_pool.py`, `AsyncWebCrawler` |
| **Browser Pool** | `crawler_pool.py:23` | `AsyncWebCrawler`, `asyncio` |
| **Monitoring** | `monitor.py:156` | `redis`, `psutil` |
| **Dashboard** | `monitor_routes.py:89` | `monitor.py`, `websockets` |
| **Hooks** | `hook_manager.py:12` | `api.py`, custom user hooks |
**Startup Chain:**
```
cnode start
  └→ cnode_cli.py:start_cmd()
      └→ server_manager.py:start()
          └→ docker run/stack/compose
              └→ supervisord
                  ├→ redis-server
                  ├→ server.py
                  │   └→ api.py (routes)
                  │       └→ crawler_pool.py (init)
                  └→ monitor.py (collector)
```
---
## Symbolic Notation Summary
```
⊕ Addition/Creation ⊖ Removal/Cleanup
⊗ Multiplication/Scale ⊘ Division/Split
→ Flow/Dependency ← Reverse flow
⇄ Bidirectional ⇵ Up/Down scale
✓ Success/Complete ✗ Failure/Error
⚠ Warning ⚡ Performance critical
🔒 Lock/Exclusive 🔓 Unlock/Shared
📊 Metrics 📝 Logs
🌐 Network 💾 Storage
🧠 Memory 🐳 Docker
```
**State Machine:**
```
[STOPPED] →start→ [STARTING] →ready→ [RUNNING]
[RUNNING] →scale→ [SCALING] →done→ [RUNNING]
[RUNNING] →stop→ [STOPPING] →done→ [STOPPED]
[ANY] →error→ [ERROR] →cleanup→ [STOPPED]
```
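
The state machine above maps directly onto a transition table. The sketch below is one possible encoding, with `TRANSITIONS` and `next_state` as hypothetical names; it treats `error` as valid from any state, as the `[ANY]` row suggests.

```python
TRANSITIONS = {
    ("STOPPED", "start"): "STARTING",
    ("STARTING", "ready"): "RUNNING",
    ("RUNNING", "scale"): "SCALING",
    ("SCALING", "done"): "RUNNING",
    ("RUNNING", "stop"): "STOPPING",
    ("STOPPING", "done"): "STOPPED",
    ("ERROR", "cleanup"): "STOPPED",
}


def next_state(state: str, event: str) -> str:
    """Return the state reached from `state` on `event`, or raise ValueError."""
    if event == "error":
        return "ERROR"  # [ANY] →error→ [ERROR]
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} on {event}") from None
```

Rejecting unknown pairs makes it harder to, for example, scale a deployment that is still stopped.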
---
## Quick Reference Commands
```bash
# Install
curl -sSL https://raw.githubusercontent.com/unclecode/crawl4ai/main/deploy/installer/install-cnode.sh | bash
# Deploy
cnode start --replicas 5 --port 11235
# Monitor
open http://localhost:11235/monitor
# Scale
cnode scale 10
# Test
curl -X POST http://localhost:11235/crawl -H "Content-Type: application/json" -d '{"urls":["https://example.com"]}'
# Logs
cnode logs -f
# Teardown
cnode stop --remove-volumes
```
---
## Future AI Agent Instructions
**When debugging:** Read this file first, then check specific source files mentioned.
**When extending:** Follow existing patterns (context managers, async/await, type hints).
**When optimizing:** Profile first (monitor.py metrics), then optimize hot paths (crawler_pool.py, api.py).
**When troubleshooting memory:** Check browser pool cleanup, WebSocket connection tracking, Redis connection pooling.
**When debugging scaling issues:** Verify Docker mode (swarm vs compose), check nginx config if compose, review load balancer logs.
---
**END OF ARCHITECTURE MAP**
*Version: 1.0.0 | Last Updated: 2025-10-21 | Token-Optimized for AI Consumption*

@@ -0,0 +1,147 @@
# Crawl4AI cnode - Quick Start Cheat Sheet
Fast reference for getting started with cnode.
---
## 📥 Install
```bash
# Install cnode
curl -sSL https://raw.githubusercontent.com/unclecode/crawl4ai/main/deploy/installer/install-cnode.sh | bash
```
---
## 🚀 Launch Cluster
```bash
# Single server (development)
cnode start
# Production cluster with 5 replicas
cnode start --replicas 5
# Custom port
cnode start --replicas 3 --port 8080
```
---
## 📊 Check Status
```bash
# View server status
cnode status
# View logs
cnode logs -f
```
---
## ⚙️ Scale Cluster
```bash
# Scale to 10 replicas (live; zero-downtime in Swarm mode, brief blip possible with Compose+Nginx)
cnode scale 10
# Scale down to 2
cnode scale 2
```
---
## 🔄 Restart/Stop
```bash
# Restart server
cnode restart
# Stop server
cnode stop
```
---
## 🌐 Test the API
```bash
# Simple test - crawl example.com
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "priority": 10
  }'
# Pretty print with jq
curl -s -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "priority": 10
  }' | jq -r '.result.markdown'
# Health check
curl http://localhost:11235/health
```
---
## 📱 Monitor Dashboard
```bash
# Open in browser
open http://localhost:11235/monitor
# Or playground
open http://localhost:11235/playground
```
---
## 🐍 Python Example
```python
import requests

response = requests.post(
    "http://localhost:11235/crawl",
    json={
        "urls": ["https://example.com"],
        "priority": 10,
    },
    timeout=60,
)
response.raise_for_status()  # Surface 4xx/5xx instead of a KeyError below
result = response.json()
print(result["result"]["markdown"])
```
---
## 🎯 Common Commands
| Command | Description |
|---------|-------------|
| `cnode start` | Start server |
| `cnode start -r 5` | Start with 5 replicas |
| `cnode status` | Check status |
| `cnode scale 10` | Scale to 10 replicas |
| `cnode logs -f` | Follow logs |
| `cnode restart` | Restart server |
| `cnode stop` | Stop server |
| `cnode --help` | Show all commands |
---
## 📚 Full Documentation
- **User Guide:** `deploy/installer/USER_GUIDE.md`
- **Developer Docs:** `deploy/installer/README.md`
- **Docker Guide:** `deploy/docker/README.md`
- **Agent Context:** `deploy/docker/AGENT.md`
---
**That's it!** You're ready to crawl at scale 🚀