From f0cfd884a9feafbc4244d666836f9c86b469c844 Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 22 Oct 2025 11:05:32 +0000 Subject: [PATCH] docs: add production platform deployment PRD Comprehensive PRD for split architecture deployment on Digital Ocean: Architecture: - Separate API servers (lightweight FastAPI) - Browser worker pool (Crawl4AI + Chromium) - Redis job queue for coordination - DO Load Balancer + auto-scaling Components: - api_server.py - Job queue only, no browser - worker.py - Job processor, pulls from Redis - Dockerfiles for both images - Cloud-init configs for auto-deployment Infrastructure: - DO CLI deployment scripts - Auto-scaler daemon (queue-based) - Monitoring and alerting setup - Cost optimization strategies Includes: - Complete code structure - Deployment scripts - Testing strategy - Security setup - Rollback plan - Success metrics Cost estimate: $87-135/mo base, scales to $300/mo Target: 100-500 req/min capacity Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude --- docs/PRD_PLATFORM_DEPLOYMENT.md | 1222 +++++++++++++++++++++++++++++++ 1 file changed, 1222 insertions(+) create mode 100644 docs/PRD_PLATFORM_DEPLOYMENT.md diff --git a/docs/PRD_PLATFORM_DEPLOYMENT.md b/docs/PRD_PLATFORM_DEPLOYMENT.md new file mode 100644 index 00000000..d28a0a63 --- /dev/null +++ b/docs/PRD_PLATFORM_DEPLOYMENT.md @@ -0,0 +1,1222 @@ +# Crawl4AI API Platform - Production Deployment PRD + +**Version:** 1.0 +**Target:** Digital Ocean Split Architecture +**Pattern:** API Gateway + Redis Queue + Browser Worker Pool + +--- + +## 1. Architecture Overview + +### 1.1 Component Diagram + +``` +┌─────────────────────────────────────────────────────────┐ +│ Internet Traffic │ +└───────────────────────┬─────────────────────────────────┘ + │ +┌───────────────────────▼─────────────────────────────────┐ +│ DO Load Balancer (HTTP/HTTPS) │ +│ Port 80/443 → 11235 │ +└───────────────────────┬─────────────────────────────────┘ + │ + ┌───────────────┼───────────────┐ + │ │ │ +┌───────▼──────┐ ┌──────▼──────┐ ┌─────▼────────┐ +│ API Server │ │ API Server │ │ API Server │ +│ Container │ │ Container │ │ Container │ +│ (1GB RAM) │ │ (1GB RAM) │ │ (1GB RAM) │ +│ │ │ │ │ │ +│ FastAPI │ │ FastAPI │ │ FastAPI │ +│ + Auth │ │ + Auth │ │ + Auth │ +│ + Rate Lim │ │ + Rate Lim │ │ + Rate Lim │ +│ NO Chromium │ │ NO Chromium │ │ NO Chromium │ +└───────┬──────┘ └──────┬──────┘ └─────┬────────┘ + │ │ │ + └───────────────┼───────────────┘ + │ +┌───────────────────────▼─────────────────────────────────┐ +│ Managed Redis (Persistent) │ +│ Queues: jobs, results, webhooks │ +│ Keys: sessions, rate_limits │ +└───────────────────────┬─────────────────────────────────┘ + │ + ┌───────────────┼───────────────────┬─────────────┐ + │ │ │ │ +┌───────▼──────┐ ┌──────▼──────┐ ┌─────────▼───┐ ┌───────▼──────┐ +│ Worker 1 │ │ Worker 2 │ │ Worker 3 │ │ Worker N │ +│ (4GB RAM) │ │ (4GB RAM) │ │ (4GB RAM) │ │ (4GB RAM) │ +│ │ │ │ │ │ │ │ +│ Crawl4AI │ │ Crawl4AI │ │ Crawl4AI │ │ Crawl4AI │ +│ + Chromium │ │ + Chromium │ │ + Chromium │ │ + Chromium │ +│ (Job Puller)│ │ (Job Puller)│ │(Job Puller) │ │ (Job Puller) │ +└──────────────┘ └─────────────┘ └─────────────┘ └──────────────┘ +``` + +### 1.2 Data Flow + +**Job Submission:** +``` +Client → LB → API Server → Validate → Push to Redis Queue → Return task_id +``` + +**Job Execution:** +``` +Worker → Pull from Queue → Execute Crawl → Store Result in Redis → Send Webhook +``` + +**Result Retrieval:** +``` +Client → LB → API Server → Fetch from Redis → Return 
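Result
+```
+
+The flow maps to a simple submit-then-poll client. A minimal sketch, assuming the `X-API-Key` header and the job endpoints specified in Section 2 (the base URL is a placeholder for the load balancer's domain):
+
+```python
+import time
+import requests  # any HTTP client works; requests is assumed here
+
+API = "https://api.example.com"   # placeholder: DO load balancer hostname
+HEADERS = {"X-API-Key": "your_api_key"}
+
+# Submit: the API server validates, pushes to Redis, and returns immediately
+task = requests.post(f"{API}/crawl/job", headers=HEADERS,
+                     json={"urls": ["https://example.com"]}).json()
+
+# Poll: the result appears once a worker stores it in Redis
+while True:
+    job = requests.get(f"{API}/crawl/job/{task['task_id']}", headers=HEADERS).json()
+    if job.get("status") in ("completed", "failed"):
+        break
+    time.sleep(2)
+print(job["status"])
+```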
+
+---
+
+## 2. Component Specifications
+
+### 2.1 API Server Container
+
+**Image:** `crawl4ai-api-server:v1`
+**Base:** `python:3.12-slim`
+**RAM:** 1GB
+**CPU:** 1 vCPU
+
+**Includes:**
+- FastAPI server
+- Redis client
+- Auth/API key validation
+- Rate limiting
+- Webhook trigger logic
+- NO browser, NO crawl4ai core
+
+**Endpoints Supported:**
+- `POST /crawl/job` - Queue job
+- `GET /crawl/job/{task_id}` - Get result
+- `POST /llm/job` - Queue LLM job
+- `GET /llm/job/{task_id}` - Get LLM result
+- `GET /health` - Health check
+- `GET /metrics` - Prometheus metrics
+- `POST /token` - JWT auth
+
+**Excluded Endpoints:**
+- `/crawl` (sync) - removed
+- `/crawl/stream` - removed (use job pattern only)
+
+**Environment Variables:**
+```bash
+REDIS_URL=redis://managed-redis:6379/0
+REDIS_POOL_SIZE=50
+API_KEY_HEADER=X-API-Key
+JWT_SECRET=<secret>
+RATE_LIMIT_DEFAULT=1000/minute
+WEBHOOK_TIMEOUT=30
+WORKER_COUNT=4
+```
+
+**Dockerfile:**
+```dockerfile
+FROM python:3.12-slim
+
+WORKDIR /app
+
+# Install dependencies (NO playwright, NO chromium)
+COPY requirements-api.txt .
+RUN pip install --no-cache-dir -r requirements-api.txt
+
+# Copy API server code only
+COPY deploy/docker/api_server.py .
+COPY deploy/docker/auth.py .
+COPY deploy/docker/schemas.py .
+COPY deploy/docker/utils.py .
+
+EXPOSE 11235
+
+CMD ["uvicorn", "api_server:app", "--host", "0.0.0.0", "--port", "11235", "--workers", "4"]
+```
+
+### 2.2 Browser Worker Container
+
+**Image:** `crawl4ai-worker:v1`
+**Base:** `unclecode/crawl4ai:latest` (ships Crawl4AI + Chromium)
+**RAM:** 4GB
+**CPU:** 2 vCPU
+
+**Includes:**
+- Crawl4AI library
+- Chromium browser
+- Redis client
+- Job processor
+- Webhook sender
+- NO FastAPI server
+
+**Worker Logic:**
+```python
+async def worker_loop():
+    while True:
+        # 1. Pull job from Redis queue (blocking pop, 5s timeout)
+        job = await redis.blpop('crawl_queue', timeout=5)
+
+        if job:
+            task_id, job_data = parse_job(job)
+
+            # 2. Execute crawl
+            result = await execute_crawl(job_data)
+
+            # 3. Store result (1h TTL)
+            await redis.setex(f"result:{task_id}", 3600, json.dumps(result))
+
+            # 4. Send webhook if configured
+            if job_data.get('webhook_url'):
+                await send_webhook(job_data['webhook_url'], task_id, result)
+
+            # 5. Update metrics
+            await redis.incr('metrics:jobs_completed')
+```
+
+**Environment Variables:**
+```bash
+REDIS_URL=redis://managed-redis:6379/0
+WORKER_ID=worker-{uuid}
+MAX_CONCURRENT_JOBS=5
+BROWSER_POOL_SIZE=3
+RESULT_TTL=3600
+WEBHOOK_RETRY_COUNT=5
+LOG_LEVEL=INFO
+```
+
+**Dockerfile:**
+```dockerfile
+FROM unclecode/crawl4ai:latest
+
+WORKDIR /app
+
+# Install worker dependencies
+COPY requirements-worker.txt .
+RUN pip install --no-cache-dir -r requirements-worker.txt
+
+# Copy worker code
+COPY deploy/docker/worker.py .
+COPY deploy/docker/webhook.py .
+
+# No EXPOSE needed (worker doesn't listen)
+
+CMD ["python", "worker.py"]
+```
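+
+The split keeps browser-sized dependencies out of the API image. A plausible minimal version of the two requirements files (package names here are assumptions to pin down in practice; the worker inherits crawl4ai and Playwright from its base image):
+
+```text
+# requirements-api.txt (queue-only API, no browser)
+fastapi
+uvicorn[standard]
+redis
+pydantic
+python-jose          # JWT for POST /token
+prometheus-client    # GET /metrics
+
+# requirements-worker.txt (job-processing extras only)
+redis
+aiohttp              # webhook delivery
+prometheus-client
+```
+
+---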
+
+## 3. Code Structure
+
+### 3.1 New Files to Create
+
+```
+deploy/docker/
+├── api_server.py            # NEW: Stripped-down API (job queue only)
+├── worker.py                # NEW: Job processor
+├── requirements-api.txt     # NEW: API dependencies
+├── requirements-worker.txt  # NEW: Worker dependencies
+├── docker-compose.yml       # MODIFIED: Multi-service
+├── Dockerfile.api           # NEW: API server image
+├── Dockerfile.worker        # NEW: Worker image
+└── deploy.sh                # NEW: DO deployment script
+```
+
+### 3.2 api_server.py Pseudocode
+
+```python
+import json
+import os
+import uuid
+from datetime import datetime
+
+from fastapi import FastAPI, Depends, HTTPException
+from redis import asyncio as aioredis
+
+from schemas import CrawlJobPayload
+from auth import validate_api_key, check_rate_limit  # shipped in auth.py
+
+REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")
+
+app = FastAPI()
+redis = aioredis.from_url(REDIS_URL, decode_responses=True)
+
+@app.post("/crawl/job")
+async def submit_job(payload: CrawlJobPayload, api_key: str = Depends(validate_api_key)):
+    # 1. Enforce the per-key rate limit
+    await check_rate_limit(api_key)
+
+    # 2. Create task
+    task_id = f"crawl_{uuid.uuid4().hex[:8]}"
+
+    # 3. Push to queue
+    job = {
+        "task_id": task_id,
+        "urls": payload.urls,
+        "browser_config": payload.browser_config,
+        "crawler_config": payload.crawler_config,
+        "webhook_config": payload.webhook_config.dict() if payload.webhook_config else None,
+        "created_at": datetime.utcnow().isoformat(),
+        "api_key": api_key
+    }
+
+    await redis.rpush("crawl_queue", json.dumps(job))
+    await redis.hset(f"task:{task_id}", mapping={
+        "status": "queued",
+        "created_at": job["created_at"],
+        "api_key": api_key
+    })
+
+    return {"task_id": task_id, "status": "queued"}
+
+@app.get("/crawl/job/{task_id}")
+async def get_result(task_id: str, api_key: str = Depends(validate_api_key)):
+    # 1. Check task ownership
+    task_info = await redis.hgetall(f"task:{task_id}")
+    if task_info.get("api_key") != api_key:
+        raise HTTPException(403, "Access denied")
+
+    # 2. Get result
+    result = await redis.get(f"result:{task_id}")
+
+    if not result:
+        status = task_info.get("status", "unknown")
+        return {"task_id": task_id, "status": status, "result": None}
+
+    return json.loads(result)
+```
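+
+`check_rate_limit` above lives in `auth.py`. A minimal fixed-window sketch against the `rate_limit:{api_key}:{minute}` keys described in Section 10.1, reusing the module-level `redis` connection (the 1000/minute figure mirrors `RATE_LIMIT_DEFAULT`; parsing that env string is left out):
+
+```python
+import time
+from fastapi import HTTPException
+
+RATE_LIMIT_PER_MINUTE = 1000  # derive from RATE_LIMIT_DEFAULT in practice
+
+async def check_rate_limit(api_key: str):
+    # Fixed-window counter: one Redis key per caller per minute
+    minute = int(time.time() // 60)
+    bucket = f"rate_limit:{api_key}:{minute}"
+    count = await redis.incr(bucket)
+    if count == 1:
+        await redis.expire(bucket, 120)  # stale windows clean themselves up
+    if count > RATE_LIMIT_PER_MINUTE:
+        raise HTTPException(429, "Rate limit exceeded")
+```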
+
+### 3.3 worker.py Pseudocode
+
+```python
+import asyncio
+import json
+import logging
+import os
+from datetime import datetime
+
+from redis import asyncio as aioredis
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
+from webhook import WebhookDeliveryService
+
+REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")
+WORKER_ID = os.environ.get("WORKER_ID", "worker-local")
+
+logger = logging.getLogger(WORKER_ID)
+redis = aioredis.from_url(REDIS_URL, decode_responses=True)
+
+# Delivery settings for webhook.py (retry/timeout values from the environment)
+config = {
+    "retry_count": int(os.environ.get("WEBHOOK_RETRY_COUNT", "5")),
+    "timeout": int(os.environ.get("WEBHOOK_TIMEOUT", "30")),
+}
+webhook_service = WebhookDeliveryService(config)
+
+async def process_job(job_data):
+    task_id = job_data['task_id']
+
+    try:
+        # Update status
+        await redis.hset(f"task:{task_id}", "status", "processing")
+
+        # Execute crawl
+        browser_config = BrowserConfig(**job_data.get('browser_config', {}))
+        crawler_config = CrawlerRunConfig(**job_data.get('crawler_config', {}))
+
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            results = await crawler.arun_many(
+                urls=job_data['urls'],
+                config=crawler_config
+            )
+
+        # Prepare result
+        result = {
+            "task_id": task_id,
+            "status": "completed",
+            "results": [r.model_dump() for r in results],
+            "completed_at": datetime.utcnow().isoformat()
+        }
+
+        # Store result (1 hour TTL)
+        await redis.setex(f"result:{task_id}", 3600, json.dumps(result))
+        await redis.hset(f"task:{task_id}", "status", "completed")
+
+        # Send webhook
+        if job_data.get('webhook_config'):
+            await webhook_service.notify_job_completion(
+                task_id=task_id,
+                task_type="crawl",
+                status="completed",
+                urls=job_data['urls'],
+                webhook_config=job_data['webhook_config'],
+                result=result
+            )
+
+        logger.info(f"Job {task_id} completed")
+
+    except Exception as e:
+        # Handle failure
+        await redis.hset(f"task:{task_id}", mapping={
+            "status": "failed",
+            "error": str(e)
+        })
+
+        if job_data.get('webhook_config'):
+            await webhook_service.notify_job_completion(
+                task_id=task_id,
+                task_type="crawl",
+                status="failed",
+                urls=job_data['urls'],
+                webhook_config=job_data['webhook_config'],
+                error=str(e)
+            )
+
+        logger.error(f"Job {task_id} failed: {e}")
+
+async def worker_loop():
+    logger.info(f"Worker {WORKER_ID} started")
+
+    while True:
+        try:
+            # Blocking pop from queue (5s timeout)
+            job = await redis.blpop("crawl_queue", timeout=5)
+
+            if job:
+                _, job_json = job
+                job_data = json.loads(job_json)
+                await process_job(job_data)
+
+        except Exception as e:
+            logger.error(f"Worker error: {e}")
+            await asyncio.sleep(1)
+
+if __name__ == "__main__":
+    asyncio.run(worker_loop())
+```
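+
+As written, `worker_loop` handles one job at a time; `MAX_CONCURRENT_JOBS=5` implies fan-out inside each worker. A hedged sketch of honoring it with a semaphore, dropping into `worker.py` above and reusing its imports and `process_job` (acquiring a slot before the queue pop gives strict backpressure; this is a sketch, not the shipped worker):
+
+```python
+import asyncio
+import os
+
+MAX_CONCURRENT_JOBS = int(os.environ.get("MAX_CONCURRENT_JOBS", "5"))
+semaphore = asyncio.Semaphore(MAX_CONCURRENT_JOBS)
+
+async def process_with_limit(job_json):
+    # Release the slot when the crawl finishes, success or failure
+    try:
+        await process_job(json.loads(job_json))
+    finally:
+        semaphore.release()
+
+async def worker_loop():
+    while True:
+        await semaphore.acquire()          # wait for a free slot first
+        job = await redis.blpop("crawl_queue", timeout=5)
+        if job is None:
+            semaphore.release()            # nothing queued; free the slot
+            continue
+        _, job_json = job
+        asyncio.create_task(process_with_limit(job_json))  # run concurrently
+```
+
+---
+
+## 4. 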
Digital Ocean Infrastructure + +### 4.1 Resource Requirements + +**Load Balancer:** +- Type: Application Load Balancer +- Algorithm: Round Robin +- Health Check: `/health` every 10s +- SSL: Let's Encrypt auto-cert +- Cost: $12/month + +**API Servers:** +- Droplet Size: Basic (1GB RAM, 1 vCPU) = $6/month +- Count: 2 minimum, 5 maximum +- OS: Ubuntu 22.04 LTS +- Auto-scale based on: CPU > 70% or Request count + +**Browser Workers:** +- Droplet Size: Basic (4GB RAM, 2 vCPU) = $24/month +- Count: 2 minimum, 20 maximum +- OS: Ubuntu 22.04 LTS +- Auto-scale based on: Redis queue depth > 50 + +**Managed Redis:** +- Plan: Basic (1GB RAM) +- Persistence: Yes +- Backups: Daily +- Cost: $15/month + +**Total Base Cost:** $12 + (2×$6) + (2×$24) + $15 = **$87/month** + +### 4.2 DO CLI Setup + +**Install CLI:** +```bash +# Install doctl +cd ~ +wget https://github.com/digitalocean/doctl/releases/download/v1.98.1/doctl-1.98.1-linux-amd64.tar.gz +tar xf doctl-*.tar.gz +sudo mv doctl /usr/local/bin +doctl auth init +``` + +**Create SSH Key:** +```bash +ssh-keygen -t rsa -b 4096 -f ~/.ssh/crawl4ai_deploy +doctl compute ssh-key import crawl4ai-key --public-key-file ~/.ssh/crawl4ai_deploy.pub +``` + +--- + +## 5. Deployment Scripts + +### 5.1 Build and Push Images + +**Script: `build_and_push.sh`** + +```bash +#!/bin/bash +set -e + +VERSION="v1.0.0" +REGISTRY="registry.digitalocean.com/crawl4ai" + +echo "Building API Server image..." +docker build -f Dockerfile.api -t $REGISTRY/api-server:$VERSION . +docker push $REGISTRY/api-server:$VERSION + +echo "Building Worker image..." +docker build -f Dockerfile.worker -t $REGISTRY/worker:$VERSION . +docker push $REGISTRY/worker:$VERSION + +echo "Tagging latest..." +docker tag $REGISTRY/api-server:$VERSION $REGISTRY/api-server:latest +docker tag $REGISTRY/worker:$VERSION $REGISTRY/worker:latest + +docker push $REGISTRY/api-server:latest +docker push $REGISTRY/worker:latest + +echo "✅ Images built and pushed" +``` + +### 5.2 Infrastructure Provisioning + +**Script: `deploy_infrastructure.sh`** + +```bash +#!/bin/bash +set -e + +PROJECT_NAME="crawl4ai-prod" +REGION="nyc3" + +# 1. Create VPC +echo "Creating VPC..." +VPC_ID=$(doctl vpcs create \ + --name $PROJECT_NAME-vpc \ + --region $REGION \ + --ip-range "10.100.0.0/16" \ + --format ID --no-header) + +echo "VPC ID: $VPC_ID" + +# 2. Create Managed Redis +echo "Creating Managed Redis..." +REDIS_ID=$(doctl databases create $PROJECT_NAME-redis \ + --engine redis \ + --region $REGION \ + --size db-s-1vcpu-1gb \ + --version 7 \ + --format ID --no-header) + +echo "Waiting for Redis to be ready..." +doctl databases wait $REDIS_ID + +REDIS_HOST=$(doctl databases get $REDIS_ID --format PrivateHost --no-header) +REDIS_PORT=$(doctl databases get $REDIS_ID --format Port --no-header) +REDIS_PASSWORD=$(doctl databases get $REDIS_ID --format Password --no-header) + +echo "Redis: $REDIS_HOST:$REDIS_PORT" + +# 3. Create API Server Droplets +echo "Creating API Server droplets..." +for i in {1..2}; do + doctl compute droplet create api-server-$i \ + --image docker-20-04 \ + --size s-1vcpu-1gb \ + --region $REGION \ + --vpc-uuid $VPC_ID \ + --tag-names api-server,production \ + --user-data-file cloud-init-api.yml \ + --wait +done + +# 4. Create Worker Droplets +echo "Creating Worker droplets..." 
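+# (Sketch) cloud-init-worker.yml below references REDIS_URL with literal
+# PASSWORD/HOST/PORT placeholders. Substitute the managed-Redis values
+# captured above before creating droplets (likewise for cloud-init-api.yml,
+# which in a real run should happen before step 3):
+sed -i "s|PASSWORD|$REDIS_PASSWORD|g; s|HOST|$REDIS_HOST|g; s|PORT|$REDIS_PORT|g" cloud-init-worker.yml
+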
+for i in {1..2}; do
+  doctl compute droplet create worker-$i \
+    --image docker-20-04 \
+    --size s-2vcpu-4gb \
+    --region $REGION \
+    --vpc-uuid $VPC_ID \
+    --tag-names worker,production \
+    --user-data-file cloud-init-worker.yml \
+    --wait
+done
+
+# 5. Create Load Balancer (targets API droplets by tag, so no IP list is needed)
+echo "Creating Load Balancer..."
+
+# CERT_ID: a certificate provisioned beforehand, e.g. with
+# 'doctl compute certificate create' (Let's Encrypt type)
+doctl compute load-balancer create \
+  --name $PROJECT_NAME-lb \
+  --region $REGION \
+  --forwarding-rules entry_protocol:https,entry_port:443,target_protocol:http,target_port:11235,certificate_id:$CERT_ID \
+  --health-check protocol:http,port:11235,path:/health,check_interval_seconds:10 \
+  --tag-name api-server
+
+echo "✅ Infrastructure deployed"
+echo ""
+echo "REDIS_URL=redis://:$REDIS_PASSWORD@$REDIS_HOST:$REDIS_PORT/0"
+```
+
+### 5.3 Cloud-Init Scripts
+
+**File: `cloud-init-api.yml`**
+
+```yaml
+#cloud-config
+packages:
+  - docker.io
+  - docker-compose
+
+write_files:
+  - path: /etc/systemd/system/crawl4ai-api.service
+    content: |
+      [Unit]
+      Description=Crawl4AI API Server
+      After=docker.service
+      Requires=docker.service
+
+      [Service]
+      Environment="REDIS_URL=redis://:PASSWORD@HOST:PORT/0"
+      ExecStartPre=/usr/bin/docker pull registry.digitalocean.com/crawl4ai/api-server:latest
+      ExecStart=/usr/bin/docker run --rm --name api-server \
+        -p 11235:11235 \
+        -e REDIS_URL=${REDIS_URL} \
+        registry.digitalocean.com/crawl4ai/api-server:latest
+      ExecStop=/usr/bin/docker stop api-server
+      Restart=always
+
+      [Install]
+      WantedBy=multi-user.target
+
+runcmd:
+  - systemctl daemon-reload
+  - systemctl enable crawl4ai-api
+  - systemctl start crawl4ai-api
+```
+
+**File: `cloud-init-worker.yml`**
+
+```yaml
+#cloud-config
+packages:
+  - docker.io
+
+write_files:
+  - path: /etc/systemd/system/crawl4ai-worker.service
+    content: |
+      [Unit]
+      Description=Crawl4AI Worker
+      After=docker.service
+      Requires=docker.service
+
+      [Service]
+      Environment="REDIS_URL=redis://:PASSWORD@HOST:PORT/0"
+      Environment="WORKER_ID=%H"
+      ExecStartPre=/usr/bin/docker pull registry.digitalocean.com/crawl4ai/worker:latest
+      ExecStart=/usr/bin/docker run --rm --name worker \
+        --shm-size=2g \
+        -e REDIS_URL=${REDIS_URL} \
+        -e WORKER_ID=${WORKER_ID} \
+        registry.digitalocean.com/crawl4ai/worker:latest
+      ExecStop=/usr/bin/docker stop worker
+      Restart=always
+
+      [Install]
+      WantedBy=multi-user.target
+
+runcmd:
+  - systemctl daemon-reload
+  - systemctl enable crawl4ai-worker
+  - systemctl start crawl4ai-worker
+```
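+
+For local parity with this topology before touching DO, a minimal `docker-compose.yml` sketch (this is the multi-service layout referenced in Section 3.1; ports and image tags match the registry names above):
+
+```yaml
+services:
+  redis:
+    image: redis:7
+
+  api:
+    image: registry.digitalocean.com/crawl4ai/api-server:latest
+    environment:
+      - REDIS_URL=redis://redis:6379/0
+    ports:
+      - "11235:11235"
+    depends_on: [redis]
+
+  worker:
+    image: registry.digitalocean.com/crawl4ai/worker:latest
+    environment:
+      - REDIS_URL=redis://redis:6379/0
+    shm_size: "2gb"
+    depends_on: [redis]
+```
+
+---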
+
+## 6. Auto-Scaling System
+
+### 6.1 Scaling Logic
+
+**Metrics to Monitor:**
+```python
+# Queue depth (Redis)
+queue_depth = redis.llen("crawl_queue")
+
+# Active workers
+active_workers = len(doctl_list_droplets(tag="worker"))
+
+# CPU usage (via DO API)
+avg_cpu = get_avg_cpu(droplets)
+```
+
+**Scaling Rules:**
+
+| Metric | Threshold | Action |
+|--------|-----------|--------|
+| Queue depth > 100 | Workers < 20 | Add 2 workers |
+| Queue depth > 500 | Workers < 20 | Add 5 workers |
+| Queue depth < 20 | Workers > 2 | Remove 1 worker |
+| API CPU > 80% | API servers < 5 | Add 1 API server |
+| API CPU < 30% | API servers > 2 | Remove 1 API server |
+
+**Cooldown:** 5 minutes between scaling actions
+
+### 6.2 Auto-Scaler Script
+
+**File: `autoscaler.py`**
+
+```python
+#!/usr/bin/env python3
+import os
+import time
+from datetime import datetime, timedelta
+
+import digitalocean
+import redis
+
+# Secrets come from the environment (see Section 10.3), never from source
+REDIS_URL = os.environ["REDIS_URL"]
+DO_TOKEN = os.environ["DO_TOKEN"]
+MIN_WORKERS = 2
+MAX_WORKERS = 20
+MIN_API = 2    # API-server scaling follows the same pattern (omitted below)
+MAX_API = 5
+COOLDOWN_MINUTES = 5
+
+redis_client = redis.from_url(REDIS_URL)
+manager = digitalocean.Manager(token=DO_TOKEN)
+
+last_scale_time = {}
+
+def get_queue_depth():
+    return redis_client.llen("crawl_queue")
+
+def get_droplets_by_tag(tag):
+    return [d for d in manager.get_all_droplets() if tag in d.tags]
+
+def can_scale(component):
+    last_time = last_scale_time.get(component)
+    if not last_time:
+        return True
+    return datetime.now() - last_time > timedelta(minutes=COOLDOWN_MINUTES)
+
+def scale_workers(count):
+    if not can_scale("workers"):
+        print("⏳ Cooldown active for workers")
+        return
+
+    if count > 0:
+        print(f"➕ Adding {count} worker(s)")
+        # Create droplets using snapshot or template
+        for i in range(count):
+            droplet = digitalocean.Droplet(
+                token=DO_TOKEN,
+                name=f"worker-{int(time.time())}-{i}",
+                region='nyc3',
+                image='docker-20-04',
+                size_slug='s-2vcpu-4gb',
+                tags=['worker', 'production', 'autoscaled'],
+                user_data=open('cloud-init-worker.yml').read()
+            )
+            droplet.create()
+    else:
+        print(f"➖ Removing {abs(count)} worker(s)")
+        workers = get_droplets_by_tag("autoscaled")
+        for droplet in workers[:abs(count)]:
+            droplet.destroy()
+
+    last_scale_time["workers"] = datetime.now()
+
+def autoscale_loop():
+    print("🤖 Autoscaler started")
+
+    while True:
+        try:
+            # Get metrics
+            queue_depth = get_queue_depth()
+            workers = get_droplets_by_tag("worker")
+            worker_count = len(workers)
+
+            print(f"📊 Queue: {queue_depth}, Workers: {worker_count}")
+
+            # Scale workers based on queue
+            if queue_depth > 500 and worker_count < MAX_WORKERS:
+                scale_workers(5)
+            elif queue_depth > 100 and worker_count < MAX_WORKERS:
+                scale_workers(2)
+            elif queue_depth < 20 and worker_count > MIN_WORKERS:
+                scale_workers(-1)
+
+            # Sleep 2 minutes
+            time.sleep(120)
+
+        except Exception as e:
+            print(f"❌ Error: {e}")
+            time.sleep(60)
+
+if __name__ == "__main__":
+    autoscale_loop()
+```
+
+**Deploy as systemd service on control droplet:**
+
+```bash
+# /etc/systemd/system/autoscaler.service
+[Unit]
+Description=Crawl4AI Autoscaler
+After=network.target
+
+[Service]
+Type=simple
+User=root
+WorkingDirectory=/opt/crawl4ai
+EnvironmentFile=/etc/crawl4ai/secrets.env
+ExecStart=/usr/bin/python3 /opt/crawl4ai/autoscaler.py
+Restart=always
+
+[Install]
+WantedBy=multi-user.target
+```
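+
+Scale-down destroys droplets, so workers should drain rather than die mid-crawl. A sketch of SIGTERM handling for `worker.py` (assumes systemd/docker stop delivers SIGTERM before the droplet disappears; pairing the autoscaler's `destroy()` with a prior shutdown action is safer but not shown):
+
+```python
+import asyncio
+import signal
+
+shutting_down = asyncio.Event()
+
+def request_shutdown():
+    # Stop pulling new jobs; in-flight jobs run to completion
+    shutting_down.set()
+
+async def worker_loop():
+    loop = asyncio.get_running_loop()
+    loop.add_signal_handler(signal.SIGTERM, request_shutdown)
+
+    while not shutting_down.is_set():
+        job = await redis.blpop("crawl_queue", timeout=5)
+        if job:
+            _, job_json = job
+            await process_job(json.loads(job_json))
+```
+
+---
+
+## 7. 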
Monitoring & Observability + +### 7.1 Metrics to Track + +**Redis Metrics:** +```python +# Queue metrics +crawl_queue_depth = LLEN crawl_queue +jobs_completed_total = GET metrics:jobs_completed +jobs_failed_total = GET metrics:jobs_failed + +# Performance metrics +avg_job_duration = GET metrics:avg_job_duration +webhook_success_rate = GET metrics:webhook_success_rate +``` + +**System Metrics (via DO API):** +- Droplet CPU usage +- Droplet memory usage +- Droplet network I/O +- Load balancer connections + +**Application Metrics (Prometheus):** +```python +# In API server +from prometheus_client import Counter, Histogram + +jobs_submitted = Counter('jobs_submitted_total', 'Total jobs submitted') +job_duration = Histogram('job_duration_seconds', 'Job execution time') +webhook_attempts = Counter('webhook_attempts_total', 'Webhook delivery attempts', ['status']) +``` + +### 7.2 Monitoring Stack + +**Option 1: Managed (Recommended for Year 1)** +- DataDog: $15/host/month +- New Relic: $25/month +- Total: ~$100/month + +**Option 2: Self-Hosted** +```yaml +# docker-compose-monitoring.yml +services: + prometheus: + image: prom/prometheus + volumes: + - ./prometheus.yml:/etc/prometheus/prometheus.yml + ports: + - "9090:9090" + + grafana: + image: grafana/grafana + ports: + - "3000:3000" + environment: + - GF_SECURITY_ADMIN_PASSWORD=admin +``` + +**Dashboards to create:** +1. Queue depth over time +2. Worker utilization +3. Job success/failure rate +4. Response time p50/p95/p99 +5. Webhook delivery rate +6. Cost per job + +### 7.3 Alerting Rules + +```yaml +# alerts.yml +groups: + - name: crawl4ai + interval: 1m + rules: + - alert: HighQueueDepth + expr: crawl_queue_depth > 1000 + for: 5m + annotations: + summary: "Queue backing up" + + - alert: AllWorkersDown + expr: count(up{job="worker"}) == 0 + for: 2m + annotations: + summary: "All workers are down" + + - alert: HighJobFailureRate + expr: rate(jobs_failed_total[5m]) > 0.1 + for: 10m + annotations: + summary: "Job failure rate > 10%" +``` + +--- + +## 8. Testing Strategy + +### 8.1 Local Testing + +**Test Setup:** +```bash +# Start local stack +docker-compose up -d + +# Submit test job +curl -X POST http://localhost:11235/crawl/job \ + -H "Content-Type: application/json" \ + -d '{ + "urls": ["https://example.com"], + "webhook_config": { + "webhook_url": "https://webhook.site/unique-id" + } + }' + +# Check result +curl http://localhost:11235/crawl/job/{task_id} +``` + +**Test Cases:** +1. Single URL crawl +2. Multiple URLs (5, 10, 50) +3. Webhook delivery (success) +4. Webhook delivery (failure + retry) +5. Queue backlog handling +6. Worker failure recovery +7. Rate limiting +8. 
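API key validation
+
+These cases translate directly into integration tests against the local stack. A sketch of the happy path with pytest (assumes the Section 8.1 stack is up and `test_key` has been seeded into Redis):
+
+```python
+import time
+import requests
+
+BASE = "http://localhost:11235"
+HEADERS = {"X-API-Key": "test_key"}  # assumed pre-provisioned for local runs
+
+def test_single_url_roundtrip():
+    # Submit (test case 1)
+    r = requests.post(f"{BASE}/crawl/job", headers=HEADERS,
+                      json={"urls": ["https://example.com"]})
+    assert r.status_code == 200
+    task_id = r.json()["task_id"]
+
+    # Poll until a worker completes the job
+    job = {}
+    for _ in range(30):
+        job = requests.get(f"{BASE}/crawl/job/{task_id}", headers=HEADERS).json()
+        if job.get("status") == "completed":
+            break
+        time.sleep(2)
+    assert job.get("status") == "completed"
+```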
+
+### 8.2 Load Testing
+
+**Script: `load_test.py`**
+
+```python
+import asyncio
+import time
+
+import aiohttp
+
+async def submit_job(session, i):
+    start = time.time()
+    async with session.post(
+        "https://api.crawl4ai.com/crawl/job",
+        json={"urls": [f"https://example.com/?test={i}"]},
+        headers={"X-API-Key": "test_key"}
+    ) as resp:
+        result = await resp.json()
+        duration = time.time() - start
+        return {"task_id": result["task_id"], "duration": duration}
+
+async def load_test(concurrency=100, total=1000):
+    async with aiohttp.ClientSession() as session:
+        tasks = []
+        for i in range(total):
+            tasks.append(submit_job(session, i))
+
+            # Flush a batch once the concurrency ceiling is reached
+            if len(tasks) >= concurrency:
+                results = await asyncio.gather(*tasks)
+                print(f"Submitted {len(results)} jobs")
+                tasks = []
+
+        if tasks:
+            results = await asyncio.gather(*tasks)
+            print(f"Submitted final {len(results)} jobs")
+
+# Run: python load_test.py
+asyncio.run(load_test(concurrency=50, total=500))
+```
+
+**Metrics to collect:**
+- Jobs/second throughput
+- P50/P95/P99 latency
+- Queue depth under load
+- Worker utilization
+- Error rate
+
+**Target Performance:**
+- Handle 1000 concurrent jobs
+- P95 latency < 30s
+- Error rate < 0.1%
+
+---
+
+## 9. Cost Optimization
+
+### 9.1 Strategies
+
+**Infrastructure:**
+1. Use preemptible/spot droplets for workers (50% cheaper)
+2. Aggressive auto-scaling down during low traffic
+3. Shared Redis instead of dedicated per-env
+4. Use CDN for static assets (CloudFlare free tier)
+
+**Application:**
+1. Cache common crawls (example.com, etc)
+2. Batch similar jobs together
+3. Smart browser pool reuse
+4. Compress results before storing (see the sketch after this section)
+
+**Pricing:**
+```python
+# Cost model
+COST_PER_API_SERVER = 6   # per month
+COST_PER_WORKER = 24      # per month
+COST_REDIS = 15
+COST_LB = 12
+
+def calculate_cost(api_count, worker_count):
+    return (
+        api_count * COST_PER_API_SERVER +
+        worker_count * COST_PER_WORKER +
+        COST_REDIS +
+        COST_LB
+    )
+
+# Base: 2 API + 2 Workers = $87/mo
+# Peak: 5 API + 10 Workers = $297/mo
+```
+
+**Revenue Model:**
+```python
+# Charge customers based on usage
+FREE_TIER = 100       # requests/month
+STARTER_TIER = 5000   # $20/mo
+PRO_TIER = 50000      # $100/mo
+
+# Cost per 1000 requests at scale
+avg_job_duration = 10                # seconds
+worker_capacity = 6                  # jobs/minute (one browser, ~10s/job)
+cost_per_worker_hour = 24 / 30 / 24  # $24/mo ≈ $0.033/hr
+
+cost_per_1000_requests = (
+    (1000 / worker_capacity / 60) * cost_per_worker_hour
+)  # ≈ 2.8 worker-hours × $0.033 ≈ $0.09 per 1000 requests
+
+# Charging $2 per 1000 leaves ~95% gross margin on worker compute,
+# which absorbs the fixed LB/Redis/API-server costs
+```
+
+### 9.2 Cost Monitoring
+
+**Track:**
+- Cost per request
+- Cost per customer
+- Infrastructure utilization %
+- Idle resource time
+
+**Alert if:**
+- Cost per request > $0.002
+- Idle time > 30%
+- Utilization < 50%
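+
+For the "compress results before storing" item in 9.1, a sketch of transparent zlib packing around the Redis result keys (worker writes, API server reads; note this requires a binary-safe Redis connection, i.e. `decode_responses=False`, for the `result:*` keys):
+
+```python
+import json
+import zlib
+
+def pack_result(result: dict) -> bytes:
+    # JSON crawl results are text-heavy and compress well
+    return zlib.compress(json.dumps(result).encode("utf-8"))
+
+def unpack_result(blob: bytes) -> dict:
+    return json.loads(zlib.decompress(blob).decode("utf-8"))
+
+# Worker side:
+#   await redis.setex(f"result:{task_id}", 3600, pack_result(result))
+# API server side:
+#   unpack_result(await redis.get(f"result:{task_id}"))
+```
+
+---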
+
+## 10. Security
+
+### 10.1 API Key Management
+
+**Storage:**
+```python
+# Redis schema
+api_key:{key_hash} -> {
+    "user_id": "uuid",
+    "tier": "pro",
+    "rate_limit": "1000/minute",
+    "created_at": "timestamp",
+    "active": true
+}
+
+# Rate limiting
+rate_limit:{api_key}:{minute} -> request_count
+```
+
+**Validation:**
+```python
+import hashlib
+from fastapi import HTTPException
+
+async def validate_api_key(api_key: str):
+    key_hash = hashlib.sha256(api_key.encode()).hexdigest()
+    key_data = await redis.hgetall(f"api_key:{key_hash}")
+
+    if not key_data or not key_data.get("active"):
+        raise HTTPException(401, "Invalid API key")
+
+    return key_data
+```
+
+### 10.2 Network Security
+
+**Firewall Rules:**
+```bash
+# API Servers
+- Allow: 443 from LB
+- Allow: 22 from bastion only
+- Allow: 6379 to Redis (private network)
+- Deny: all else
+
+# Workers
+- Allow: 6379 to Redis (private network)
+- Allow: 22 from bastion only
+- Deny: all else
+```
+
+**SSL/TLS:**
+- LB: Auto SSL via Let's Encrypt
+- Redis: TLS enabled
+- Internal: VPC isolation (encryption in transit)
+
+### 10.3 Secrets Management
+
+doctl does not expose a general-purpose secrets store, so secrets are rendered into cloud-init at provision time and land on each droplet as a root-only env file:
+
+```bash
+# Render placeholders before 'doctl compute droplet create'
+export REDIS_PASSWORD="xxx" JWT_SECRET="xxx"
+envsubst < cloud-init-worker.yml.tpl > cloud-init-worker.yml
+```
+
+**Resulting droplet file:**
+```yaml
+#cloud-config
+write_files:
+  - path: /etc/crawl4ai/secrets.env
+    content: |
+      REDIS_PASSWORD=${REDIS_PASSWORD}
+      JWT_SECRET=${JWT_SECRET}
+    permissions: '0600'
+```
+
+(The autoscaler host extends the same file with `REDIS_URL` and `DO_TOKEN`.)
+
+---
+
+## 11. Deployment Checklist
+
+### 11.1 Pre-Deployment
+
+- [ ] Test Docker images locally
+- [ ] Run integration tests
+- [ ] Load test (1000 concurrent jobs)
+- [ ] Verify webhook delivery
+- [ ] Test auto-scaling logic
+- [ ] Review security settings
+- [ ] Set up monitoring
+- [ ] Configure alerts
+- [ ] Document API endpoints
+- [ ] Create runbook
+
+### 11.2 Deployment Steps
+
+```bash
+# 1. Build images
+./build_and_push.sh
+
+# 2. Deploy infrastructure
+./deploy_infrastructure.sh
+
+# 3. Verify health
+doctl compute load-balancer list
+curl https://api.crawl4ai.com/health
+
+# 4. Submit test job
+curl -X POST https://api.crawl4ai.com/crawl/job \
+  -H "X-API-Key: test" \
+  -d '{"urls": ["https://example.com"]}'
+
+# 5. Monitor for 24 hours
+watch -n 60 'doctl compute droplet list'
+```
+
+### 11.3 Post-Deployment
+
+- [ ] Monitor queue depth for 24h
+- [ ] Check error logs
+- [ ] Verify webhook delivery rate
+- [ ] Test auto-scaling (manual trigger)
+- [ ] Validate cost metrics
+- [ ] Run smoke tests every hour
+- [ ] Customer beta testing
+
+---
+
+## 12. Rollback Plan
+
+**If deployment fails:**
+
+```bash
+# 1. Switch LB to old droplets
+doctl compute load-balancer update $LB_ID --droplet-ids $OLD_DROPLET_IDS
+
+# 2. Scale down new droplets
+doctl compute droplet delete $(doctl compute droplet list --tag-name new --format ID --no-header)
+
+# 3. Restore Redis snapshot
+doctl databases backups restore $REDIS_ID $BACKUP_ID
+
+# 4. Investigate
+tail -f /var/log/crawl4ai/*.log
+```
+
+---
+
+## 13. Success Metrics (First 90 Days)
+
+**Technical:**
+- 99.5% uptime
+- P95 latency < 30s
+- <0.1% error rate
+- Webhook delivery > 99%
+
+**Business:**
+- 100 API keys created
+- 50K requests/month processed
+- <$150/month infrastructure cost
+- Cost per request < $0.002
+
+**Scaling:**
+- Auto-scaler working (0 manual interventions)
+- Queue never exceeds 1000 depth
+- Worker utilization > 60%
+- API server utilization > 50%
+
+---
+
+## 14. Files Summary
+
+**To Create:**
+1. `deploy/docker/api_server.py` - Stripped API server
+2. `deploy/docker/worker.py` - Job processor
+3. 
`deploy/docker/Dockerfile.api` - API image +4. `deploy/docker/Dockerfile.worker` - Worker image +5. `deploy/docker/requirements-api.txt` - API deps +6. `deploy/docker/requirements-worker.txt` - Worker deps +7. `scripts/build_and_push.sh` - Build script +8. `scripts/deploy_infrastructure.sh` - Provision script +9. `scripts/autoscaler.py` - Auto-scaling daemon +10. `scripts/cloud-init-api.yml` - API droplet config +11. `scripts/cloud-init-worker.yml` - Worker droplet config +12. `tests/load_test.py` - Load testing +13. `docs/API.md` - API documentation +14. `docs/RUNBOOK.md` - Operations guide + +**To Modify:** +1. Current `server.py` - Extract job queue logic +2. Current `job.py` - Simplify to queue only +3. Current `webhook.py` - Use as-is + +--- + +## 15. Next Steps + +**Week 1:** +- [ ] Create API server code +- [ ] Create worker code +- [ ] Build Docker images +- [ ] Test locally with docker-compose + +**Week 2:** +- [ ] Deploy to DO staging +- [ ] Integration testing +- [ ] Load testing +- [ ] Fix bugs + +**Week 3:** +- [ ] Deploy to production +- [ ] Monitor for 1 week +- [ ] Optimize based on metrics +- [ ] Beta customers + +**Week 4:** +- [ ] Launch publicly +- [ ] Marketing +- [ ] Support setup +- [ ] Iterate + +--- + +**END OF PRD**