From f0cfd884a9feafbc4244d666836f9c86b469c844 Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 22 Oct 2025 11:05:32 +0000 Subject: [PATCH] docs: add production platform deployment PRD Comprehensive PRD for split architecture deployment on Digital Ocean: Architecture: - Separate API servers (lightweight FastAPI) - Browser worker pool (Crawl4AI + Chromium) - Redis job queue for coordination - DO Load Balancer + auto-scaling Components: - api_server.py - Job queue only, no browser - worker.py - Job processor, pulls from Redis - Dockerfiles for both images - Cloud-init configs for auto-deployment Infrastructure: - DO CLI deployment scripts - Auto-scaler daemon (queue-based) - Monitoring and alerting setup - Cost optimization strategies Includes: - Complete code structure - Deployment scripts - Testing strategy - Security setup - Rollback plan - Success metrics Cost estimate: $87-135/mo base, scales to $300/mo Target: 100-500 req/min capacity Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude --- docs/PRD_PLATFORM_DEPLOYMENT.md | 1222 +++++++++++++++++++++++++++++++ 1 file changed, 1222 insertions(+) create mode 100644 docs/PRD_PLATFORM_DEPLOYMENT.md diff --git a/docs/PRD_PLATFORM_DEPLOYMENT.md b/docs/PRD_PLATFORM_DEPLOYMENT.md new file mode 100644 index 00000000..d28a0a63 --- /dev/null +++ b/docs/PRD_PLATFORM_DEPLOYMENT.md @@ -0,0 +1,1222 @@ +# Crawl4AI API Platform - Production Deployment PRD + +**Version:** 1.0 +**Target:** Digital Ocean Split Architecture +**Pattern:** API Gateway + Redis Queue + Browser Worker Pool + +--- + +## 1. Architecture Overview + +### 1.1 Component Diagram + +``` +┌─────────────────────────────────────────────────────────┐ +│ Internet Traffic │ +└───────────────────────┬─────────────────────────────────┘ + │ +┌───────────────────────▼─────────────────────────────────┐ +│ DO Load Balancer (HTTP/HTTPS) │ +│ Port 80/443 → 11235 │ +└───────────────────────┬─────────────────────────────────┘ + │ + ┌───────────────┼───────────────┐ + │ │ │ +┌───────▼──────┐ ┌──────▼──────┐ ┌─────▼────────┐ +│ API Server │ │ API Server │ │ API Server │ +│ Container │ │ Container │ │ Container │ +│ (1GB RAM) │ │ (1GB RAM) │ │ (1GB RAM) │ +│ │ │ │ │ │ +│ FastAPI │ │ FastAPI │ │ FastAPI │ +│ + Auth │ │ + Auth │ │ + Auth │ +│ + Rate Lim │ │ + Rate Lim │ │ + Rate Lim │ +│ NO Chromium │ │ NO Chromium │ │ NO Chromium │ +└───────┬──────┘ └──────┬──────┘ └─────┬────────┘ + │ │ │ + └───────────────┼───────────────┘ + │ +┌───────────────────────▼─────────────────────────────────┐ +│ Managed Redis (Persistent) │ +│ Queues: jobs, results, webhooks │ +│ Keys: sessions, rate_limits │ +└───────────────────────┬─────────────────────────────────┘ + │ + ┌───────────────┼───────────────────┬─────────────┐ + │ │ │ │ +┌───────▼──────┐ ┌──────▼──────┐ ┌─────────▼───┐ ┌───────▼──────┐ +│ Worker 1 │ │ Worker 2 │ │ Worker 3 │ │ Worker N │ +│ (4GB RAM) │ │ (4GB RAM) │ │ (4GB RAM) │ │ (4GB RAM) │ +│ │ │ │ │ │ │ │ +│ Crawl4AI │ │ Crawl4AI │ │ Crawl4AI │ │ Crawl4AI │ +│ + Chromium │ │ + Chromium │ │ + Chromium │ │ + Chromium │ +│ (Job Puller)│ │ (Job Puller)│ │(Job Puller) │ │ (Job Puller) │ +└──────────────┘ └─────────────┘ └─────────────┘ └──────────────┘ +``` + +### 1.2 Data Flow + +**Job Submission:** +``` +Client → LB → API Server → Validate → Push to Redis Queue → Return task_id +``` + +**Job Execution:** +``` +Worker → Pull from Queue → Execute Crawl → Store Result in Redis → Send Webhook +``` + +**Result Retrieval:** +``` +Client → LB → API Server → Fetch from Redis → Return 
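Result
+```
+
+The flow maps to a simple submit-then-poll client. A minimal sketch, assuming the `X-API-Key` header and the job endpoints specified in Section 2 (the base URL is a placeholder for the load balancer's domain):
+
+```python
+import time
+import requests  # any HTTP client works; requests is assumed here
+
+API = "https://api.example.com"   # placeholder: DO load balancer hostname
+HEADERS = {"X-API-Key": "your_api_key"}
+
+# Submit: the API server validates, pushes to Redis, and returns immediately
+task = requests.post(f"{API}/crawl/job", headers=HEADERS,
+                     json={"urls": ["https://example.com"]}).json()
+
+# Poll: the result appears once a worker stores it in Redis
+while True:
+    job = requests.get(f"{API}/crawl/job/{task['task_id']}", headers=HEADERS).json()
+    if job.get("status") in ("completed", "failed"):
+        break
+    time.sleep(2)
+print(job["status"])
+```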
+
+---
+
+## 2. Component Specifications
+
+### 2.1 API Server Container
+
+**Image:** `crawl4ai-api-server:v1`
+**Base:** `python:3.12-slim`
+**RAM:** 1GB
+**CPU:** 1 vCPU
+
+**Includes:**
+- FastAPI server
+- Redis client
+- Auth/API key validation
+- Rate limiting
+- Webhook trigger logic
+- NO browser, NO crawl4ai core
+
+**Endpoints Supported:**
+- `POST /crawl/job` - Queue job
+- `GET /crawl/job/{task_id}` - Get result
+- `POST /llm/job` - Queue LLM job
+- `GET /llm/job/{task_id}` - Get LLM result
+- `GET /health` - Health check
+- `GET /metrics` - Prometheus metrics
+- `POST /token` - JWT auth
+
+**Excluded Endpoints:**
+- `/crawl` (sync) - removed
+- `/crawl/stream` - removed (use job pattern only)
+
+**Environment Variables:**
+```bash
+REDIS_URL=redis://managed-redis:6379/0
+REDIS_POOL_SIZE=50
+API_KEY_HEADER=X-API-Key
+JWT_SECRET=<secret>
+RATE_LIMIT_DEFAULT=1000/minute
+WEBHOOK_TIMEOUT=30
+WORKER_COUNT=4
+```
+
+**Dockerfile:**
+```dockerfile
+FROM python:3.12-slim
+
+WORKDIR /app
+
+# Install dependencies (NO playwright, NO chromium)
+COPY requirements-api.txt .
+RUN pip install --no-cache-dir -r requirements-api.txt
+
+# Copy API server code only
+COPY deploy/docker/api_server.py .
+COPY deploy/docker/auth.py .
+COPY deploy/docker/schemas.py .
+COPY deploy/docker/utils.py .
+
+EXPOSE 11235
+
+CMD ["uvicorn", "api_server:app", "--host", "0.0.0.0", "--port", "11235", "--workers", "4"]
+```
+
+### 2.2 Browser Worker Container
+
+**Image:** `crawl4ai-worker:v1`
+**Base:** `unclecode/crawl4ai:latest` (ships Crawl4AI + Chromium)
+**RAM:** 4GB
+**CPU:** 2 vCPU
+
+**Includes:**
+- Crawl4AI library
+- Chromium browser
+- Redis client
+- Job processor
+- Webhook sender
+- NO FastAPI server
+
+**Worker Logic:**
+```python
+async def worker_loop():
+    while True:
+        # 1. Pull job from Redis queue (blocking pop, 5s timeout)
+        job = await redis.blpop('crawl_queue', timeout=5)
+
+        if job:
+            task_id, job_data = parse_job(job)
+
+            # 2. Execute crawl
+            result = await execute_crawl(job_data)
+
+            # 3. Store result (1h TTL)
+            await redis.setex(f"result:{task_id}", 3600, json.dumps(result))
+
+            # 4. Send webhook if configured
+            if job_data.get('webhook_url'):
+                await send_webhook(job_data['webhook_url'], task_id, result)
+
+            # 5. Update metrics
+            await redis.incr('metrics:jobs_completed')
+```
+
+**Environment Variables:**
+```bash
+REDIS_URL=redis://managed-redis:6379/0
+WORKER_ID=worker-{uuid}
+MAX_CONCURRENT_JOBS=5
+BROWSER_POOL_SIZE=3
+RESULT_TTL=3600
+WEBHOOK_RETRY_COUNT=5
+LOG_LEVEL=INFO
+```
+
+**Dockerfile:**
+```dockerfile
+FROM unclecode/crawl4ai:latest
+
+WORKDIR /app
+
+# Install worker dependencies
+COPY requirements-worker.txt .
+RUN pip install --no-cache-dir -r requirements-worker.txt
+
+# Copy worker code
+COPY deploy/docker/worker.py .
+COPY deploy/docker/webhook.py .
+
+# No EXPOSE needed (worker doesn't listen)
+
+CMD ["python", "worker.py"]
+```
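+
+The split keeps browser-sized dependencies out of the API image. A plausible minimal version of the two requirements files (package names here are assumptions to pin down in practice; the worker inherits crawl4ai and Playwright from its base image):
+
+```text
+# requirements-api.txt (queue-only API, no browser)
+fastapi
+uvicorn[standard]
+redis
+pydantic
+python-jose          # JWT for POST /token
+prometheus-client    # GET /metrics
+
+# requirements-worker.txt (job-processing extras only)
+redis
+aiohttp              # webhook delivery
+prometheus-client
+```
+
+---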
+
+## 3. Code Structure
+
+### 3.1 New Files to Create
+
+```
+deploy/docker/
+├── api_server.py            # NEW: Stripped-down API (job queue only)
+├── worker.py                # NEW: Job processor
+├── requirements-api.txt     # NEW: API dependencies
+├── requirements-worker.txt  # NEW: Worker dependencies
+├── docker-compose.yml       # MODIFIED: Multi-service
+├── Dockerfile.api           # NEW: API server image
+├── Dockerfile.worker        # NEW: Worker image
+└── deploy.sh                # NEW: DO deployment script
+```
+
+### 3.2 api_server.py Pseudocode
+
+```python
+import json
+import os
+import uuid
+from datetime import datetime
+
+from fastapi import FastAPI, Depends, HTTPException
+from redis import asyncio as aioredis
+
+from schemas import CrawlJobPayload
+from auth import validate_api_key, check_rate_limit  # shipped in auth.py
+
+REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")
+
+app = FastAPI()
+redis = aioredis.from_url(REDIS_URL, decode_responses=True)
+
+@app.post("/crawl/job")
+async def submit_job(payload: CrawlJobPayload, api_key: str = Depends(validate_api_key)):
+    # 1. Enforce the per-key rate limit
+    await check_rate_limit(api_key)
+
+    # 2. Create task
+    task_id = f"crawl_{uuid.uuid4().hex[:8]}"
+
+    # 3. Push to queue
+    job = {
+        "task_id": task_id,
+        "urls": payload.urls,
+        "browser_config": payload.browser_config,
+        "crawler_config": payload.crawler_config,
+        "webhook_config": payload.webhook_config.dict() if payload.webhook_config else None,
+        "created_at": datetime.utcnow().isoformat(),
+        "api_key": api_key
+    }
+
+    await redis.rpush("crawl_queue", json.dumps(job))
+    await redis.hset(f"task:{task_id}", mapping={
+        "status": "queued",
+        "created_at": job["created_at"],
+        "api_key": api_key
+    })
+
+    return {"task_id": task_id, "status": "queued"}
+
+@app.get("/crawl/job/{task_id}")
+async def get_result(task_id: str, api_key: str = Depends(validate_api_key)):
+    # 1. Check task ownership
+    task_info = await redis.hgetall(f"task:{task_id}")
+    if task_info.get("api_key") != api_key:
+        raise HTTPException(403, "Access denied")
+
+    # 2. Get result
+    result = await redis.get(f"result:{task_id}")
+
+    if not result:
+        status = task_info.get("status", "unknown")
+        return {"task_id": task_id, "status": status, "result": None}
+
+    return json.loads(result)
+```
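+
+`check_rate_limit` above lives in `auth.py`. A minimal fixed-window sketch against the `rate_limit:{api_key}:{minute}` keys described in Section 10.1, reusing the module-level `redis` connection (the 1000/minute figure mirrors `RATE_LIMIT_DEFAULT`; parsing that env string is left out):
+
+```python
+import time
+from fastapi import HTTPException
+
+RATE_LIMIT_PER_MINUTE = 1000  # derive from RATE_LIMIT_DEFAULT in practice
+
+async def check_rate_limit(api_key: str):
+    # Fixed-window counter: one Redis key per caller per minute
+    minute = int(time.time() // 60)
+    bucket = f"rate_limit:{api_key}:{minute}"
+    count = await redis.incr(bucket)
+    if count == 1:
+        await redis.expire(bucket, 120)  # stale windows clean themselves up
+    if count > RATE_LIMIT_PER_MINUTE:
+        raise HTTPException(429, "Rate limit exceeded")
+```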
+
+### 3.3 worker.py Pseudocode
+
+```python
+import asyncio
+import json
+import logging
+import os
+from datetime import datetime
+
+from redis import asyncio as aioredis
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
+from webhook import WebhookDeliveryService
+
+REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")
+WORKER_ID = os.environ.get("WORKER_ID", "worker-local")
+
+logger = logging.getLogger(WORKER_ID)
+redis = aioredis.from_url(REDIS_URL, decode_responses=True)
+
+# Delivery settings for webhook.py (retry/timeout values from the environment)
+config = {
+    "retry_count": int(os.environ.get("WEBHOOK_RETRY_COUNT", "5")),
+    "timeout": int(os.environ.get("WEBHOOK_TIMEOUT", "30")),
+}
+webhook_service = WebhookDeliveryService(config)
+
+async def process_job(job_data):
+    task_id = job_data['task_id']
+
+    try:
+        # Update status
+        await redis.hset(f"task:{task_id}", "status", "processing")
+
+        # Execute crawl
+        browser_config = BrowserConfig(**job_data.get('browser_config', {}))
+        crawler_config = CrawlerRunConfig(**job_data.get('crawler_config', {}))
+
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            results = await crawler.arun_many(
+                urls=job_data['urls'],
+                config=crawler_config
+            )
+
+        # Prepare result
+        result = {
+            "task_id": task_id,
+            "status": "completed",
+            "results": [r.model_dump() for r in results],
+            "completed_at": datetime.utcnow().isoformat()
+        }
+
+        # Store result (1 hour TTL)
+        await redis.setex(f"result:{task_id}", 3600, json.dumps(result))
+        await redis.hset(f"task:{task_id}", "status", "completed")
+
+        # Send webhook
+        if job_data.get('webhook_config'):
+            await webhook_service.notify_job_completion(
+                task_id=task_id,
+                task_type="crawl",
+                status="completed",
+                urls=job_data['urls'],
+                webhook_config=job_data['webhook_config'],
+                result=result
+            )
+
+        logger.info(f"Job {task_id} completed")
+
+    except Exception as e:
+        # Handle failure
+        await redis.hset(f"task:{task_id}", mapping={
+            "status": "failed",
+            "error": str(e)
+        })
+
+        if job_data.get('webhook_config'):
+            await webhook_service.notify_job_completion(
+                task_id=task_id,
+                task_type="crawl",
+                status="failed",
+                urls=job_data['urls'],
+                webhook_config=job_data['webhook_config'],
+                error=str(e)
+            )
+
+        logger.error(f"Job {task_id} failed: {e}")
+
+async def worker_loop():
+    logger.info(f"Worker {WORKER_ID} started")
+
+    while True:
+        try:
+            # Blocking pop from queue (5s timeout)
+            job = await redis.blpop("crawl_queue", timeout=5)
+
+            if job:
+                _, job_json = job
+                job_data = json.loads(job_json)
+                await process_job(job_data)
+
+        except Exception as e:
+            logger.error(f"Worker error: {e}")
+            await asyncio.sleep(1)
+
+if __name__ == "__main__":
+    asyncio.run(worker_loop())
+```
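+
+As written, `worker_loop` handles one job at a time; `MAX_CONCURRENT_JOBS=5` implies fan-out inside each worker. A hedged sketch of honoring it with a semaphore, dropping into `worker.py` above and reusing its imports and `process_job` (acquiring a slot before the queue pop gives strict backpressure; this is a sketch, not the shipped worker):
+
+```python
+import asyncio
+import os
+
+MAX_CONCURRENT_JOBS = int(os.environ.get("MAX_CONCURRENT_JOBS", "5"))
+semaphore = asyncio.Semaphore(MAX_CONCURRENT_JOBS)
+
+async def process_with_limit(job_json):
+    # Release the slot when the crawl finishes, success or failure
+    try:
+        await process_job(json.loads(job_json))
+    finally:
+        semaphore.release()
+
+async def worker_loop():
+    while True:
+        await semaphore.acquire()          # wait for a free slot first
+        job = await redis.blpop("crawl_queue", timeout=5)
+        if job is None:
+            semaphore.release()            # nothing queued; free the slot
+            continue
+        _, job_json = job
+        asyncio.create_task(process_with_limit(job_json))  # run concurrently
+```
+
+---
+
+## 4. 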
Digital Ocean Infrastructure + +### 4.1 Resource Requirements + +**Load Balancer:** +- Type: Application Load Balancer +- Algorithm: Round Robin +- Health Check: `/health` every 10s +- SSL: Let's Encrypt auto-cert +- Cost: $12/month + +**API Servers:** +- Droplet Size: Basic (1GB RAM, 1 vCPU) = $6/month +- Count: 2 minimum, 5 maximum +- OS: Ubuntu 22.04 LTS +- Auto-scale based on: CPU > 70% or Request count + +**Browser Workers:** +- Droplet Size: Basic (4GB RAM, 2 vCPU) = $24/month +- Count: 2 minimum, 20 maximum +- OS: Ubuntu 22.04 LTS +- Auto-scale based on: Redis queue depth > 50 + +**Managed Redis:** +- Plan: Basic (1GB RAM) +- Persistence: Yes +- Backups: Daily +- Cost: $15/month + +**Total Base Cost:** $12 + (2×$6) + (2×$24) + $15 = **$87/month** + +### 4.2 DO CLI Setup + +**Install CLI:** +```bash +# Install doctl +cd ~ +wget https://github.com/digitalocean/doctl/releases/download/v1.98.1/doctl-1.98.1-linux-amd64.tar.gz +tar xf doctl-*.tar.gz +sudo mv doctl /usr/local/bin +doctl auth init +``` + +**Create SSH Key:** +```bash +ssh-keygen -t rsa -b 4096 -f ~/.ssh/crawl4ai_deploy +doctl compute ssh-key import crawl4ai-key --public-key-file ~/.ssh/crawl4ai_deploy.pub +``` + +--- + +## 5. Deployment Scripts + +### 5.1 Build and Push Images + +**Script: `build_and_push.sh`** + +```bash +#!/bin/bash +set -e + +VERSION="v1.0.0" +REGISTRY="registry.digitalocean.com/crawl4ai" + +echo "Building API Server image..." +docker build -f Dockerfile.api -t $REGISTRY/api-server:$VERSION . +docker push $REGISTRY/api-server:$VERSION + +echo "Building Worker image..." +docker build -f Dockerfile.worker -t $REGISTRY/worker:$VERSION . +docker push $REGISTRY/worker:$VERSION + +echo "Tagging latest..." +docker tag $REGISTRY/api-server:$VERSION $REGISTRY/api-server:latest +docker tag $REGISTRY/worker:$VERSION $REGISTRY/worker:latest + +docker push $REGISTRY/api-server:latest +docker push $REGISTRY/worker:latest + +echo "✅ Images built and pushed" +``` + +### 5.2 Infrastructure Provisioning + +**Script: `deploy_infrastructure.sh`** + +```bash +#!/bin/bash +set -e + +PROJECT_NAME="crawl4ai-prod" +REGION="nyc3" + +# 1. Create VPC +echo "Creating VPC..." +VPC_ID=$(doctl vpcs create \ + --name $PROJECT_NAME-vpc \ + --region $REGION \ + --ip-range "10.100.0.0/16" \ + --format ID --no-header) + +echo "VPC ID: $VPC_ID" + +# 2. Create Managed Redis +echo "Creating Managed Redis..." +REDIS_ID=$(doctl databases create $PROJECT_NAME-redis \ + --engine redis \ + --region $REGION \ + --size db-s-1vcpu-1gb \ + --version 7 \ + --format ID --no-header) + +echo "Waiting for Redis to be ready..." +doctl databases wait $REDIS_ID + +REDIS_HOST=$(doctl databases get $REDIS_ID --format PrivateHost --no-header) +REDIS_PORT=$(doctl databases get $REDIS_ID --format Port --no-header) +REDIS_PASSWORD=$(doctl databases get $REDIS_ID --format Password --no-header) + +echo "Redis: $REDIS_HOST:$REDIS_PORT" + +# 3. Create API Server Droplets +echo "Creating API Server droplets..." +for i in {1..2}; do + doctl compute droplet create api-server-$i \ + --image docker-20-04 \ + --size s-1vcpu-1gb \ + --region $REGION \ + --vpc-uuid $VPC_ID \ + --tag-names api-server,production \ + --user-data-file cloud-init-api.yml \ + --wait +done + +# 4. Create Worker Droplets +echo "Creating Worker droplets..." 
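+# (Sketch) cloud-init-worker.yml below references REDIS_URL with literal
+# PASSWORD/HOST/PORT placeholders. Substitute the managed-Redis values
+# captured above before creating droplets (likewise for cloud-init-api.yml,
+# which in a real run should happen before step 3):
+sed -i "s|PASSWORD|$REDIS_PASSWORD|g; s|HOST|$REDIS_HOST|g; s|PORT|$REDIS_PORT|g" cloud-init-worker.yml
+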
+for i in {1..2}; do
+  doctl compute droplet create worker-$i \
+    --image docker-20-04 \
+    --size s-2vcpu-4gb \
+    --region $REGION \
+    --vpc-uuid $VPC_ID \
+    --tag-names worker,production \
+    --user-data-file cloud-init-worker.yml \
+    --wait
+done
+
+# 5. Create Load Balancer (targets API droplets by tag, so no IP list is needed)
+echo "Creating Load Balancer..."
+
+# CERT_ID: a certificate provisioned beforehand, e.g. with
+# 'doctl compute certificate create' (Let's Encrypt type)
+doctl compute load-balancer create \
+  --name $PROJECT_NAME-lb \
+  --region $REGION \
+  --forwarding-rules entry_protocol:https,entry_port:443,target_protocol:http,target_port:11235,certificate_id:$CERT_ID \
+  --health-check protocol:http,port:11235,path:/health,check_interval_seconds:10 \
+  --tag-name api-server
+
+echo "✅ Infrastructure deployed"
+echo ""
+echo "REDIS_URL=redis://:$REDIS_PASSWORD@$REDIS_HOST:$REDIS_PORT/0"
+```
+
+### 5.3 Cloud-Init Scripts
+
+**File: `cloud-init-api.yml`**
+
+```yaml
+#cloud-config
+packages:
+  - docker.io
+  - docker-compose
+
+write_files:
+  - path: /etc/systemd/system/crawl4ai-api.service
+    content: |
+      [Unit]
+      Description=Crawl4AI API Server
+      After=docker.service
+      Requires=docker.service
+
+      [Service]
+      Environment="REDIS_URL=redis://:PASSWORD@HOST:PORT/0"
+      ExecStartPre=/usr/bin/docker pull registry.digitalocean.com/crawl4ai/api-server:latest
+      ExecStart=/usr/bin/docker run --rm --name api-server \
+        -p 11235:11235 \
+        -e REDIS_URL=${REDIS_URL} \
+        registry.digitalocean.com/crawl4ai/api-server:latest
+      ExecStop=/usr/bin/docker stop api-server
+      Restart=always
+
+      [Install]
+      WantedBy=multi-user.target
+
+runcmd:
+  - systemctl daemon-reload
+  - systemctl enable crawl4ai-api
+  - systemctl start crawl4ai-api
+```
+
+**File: `cloud-init-worker.yml`**
+
+```yaml
+#cloud-config
+packages:
+  - docker.io
+
+write_files:
+  - path: /etc/systemd/system/crawl4ai-worker.service
+    content: |
+      [Unit]
+      Description=Crawl4AI Worker
+      After=docker.service
+      Requires=docker.service
+
+      [Service]
+      Environment="REDIS_URL=redis://:PASSWORD@HOST:PORT/0"
+      Environment="WORKER_ID=%H"
+      ExecStartPre=/usr/bin/docker pull registry.digitalocean.com/crawl4ai/worker:latest
+      ExecStart=/usr/bin/docker run --rm --name worker \
+        --shm-size=2g \
+        -e REDIS_URL=${REDIS_URL} \
+        -e WORKER_ID=${WORKER_ID} \
+        registry.digitalocean.com/crawl4ai/worker:latest
+      ExecStop=/usr/bin/docker stop worker
+      Restart=always
+
+      [Install]
+      WantedBy=multi-user.target
+
+runcmd:
+  - systemctl daemon-reload
+  - systemctl enable crawl4ai-worker
+  - systemctl start crawl4ai-worker
+```
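+
+For local parity with this topology before touching DO, a minimal `docker-compose.yml` sketch (this is the multi-service layout referenced in Section 3.1; ports and image tags match the registry names above):
+
+```yaml
+services:
+  redis:
+    image: redis:7
+
+  api:
+    image: registry.digitalocean.com/crawl4ai/api-server:latest
+    environment:
+      - REDIS_URL=redis://redis:6379/0
+    ports:
+      - "11235:11235"
+    depends_on: [redis]
+
+  worker:
+    image: registry.digitalocean.com/crawl4ai/worker:latest
+    environment:
+      - REDIS_URL=redis://redis:6379/0
+    shm_size: "2gb"
+    depends_on: [redis]
+```
+
+---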
+
+## 6. Auto-Scaling System
+
+### 6.1 Scaling Logic
+
+**Metrics to Monitor:**
+```python
+# Queue depth (Redis)
+queue_depth = redis.llen("crawl_queue")
+
+# Active workers
+active_workers = len(doctl_list_droplets(tag="worker"))
+
+# CPU usage (via DO API)
+avg_cpu = get_avg_cpu(droplets)
+```
+
+**Scaling Rules:**
+
+| Metric | Threshold | Action |
+|--------|-----------|--------|
+| Queue depth > 100 | Workers < 20 | Add 2 workers |
+| Queue depth > 500 | Workers < 20 | Add 5 workers |
+| Queue depth < 20 | Workers > 2 | Remove 1 worker |
+| API CPU > 80% | API servers < 5 | Add 1 API server |
+| API CPU < 30% | API servers > 2 | Remove 1 API server |
+
+**Cooldown:** 5 minutes between scaling actions
+
+### 6.2 Auto-Scaler Script
+
+**File: `autoscaler.py`**
+
+```python
+#!/usr/bin/env python3
+import os
+import time
+from datetime import datetime, timedelta
+
+import digitalocean
+import redis
+
+# Secrets come from the environment (see Section 10.3), never from source
+REDIS_URL = os.environ["REDIS_URL"]
+DO_TOKEN = os.environ["DO_TOKEN"]
+MIN_WORKERS = 2
+MAX_WORKERS = 20
+MIN_API = 2    # API-server scaling follows the same pattern (omitted below)
+MAX_API = 5
+COOLDOWN_MINUTES = 5
+
+redis_client = redis.from_url(REDIS_URL)
+manager = digitalocean.Manager(token=DO_TOKEN)
+
+last_scale_time = {}
+
+def get_queue_depth():
+    return redis_client.llen("crawl_queue")
+
+def get_droplets_by_tag(tag):
+    return [d for d in manager.get_all_droplets() if tag in d.tags]
+
+def can_scale(component):
+    last_time = last_scale_time.get(component)
+    if not last_time:
+        return True
+    return datetime.now() - last_time > timedelta(minutes=COOLDOWN_MINUTES)
+
+def scale_workers(count):
+    if not can_scale("workers"):
+        print("⏳ Cooldown active for workers")
+        return
+
+    if count > 0:
+        print(f"➕ Adding {count} worker(s)")
+        # Create droplets using snapshot or template
+        for i in range(count):
+            droplet = digitalocean.Droplet(
+                token=DO_TOKEN,
+                name=f"worker-{int(time.time())}-{i}",
+                region='nyc3',
+                image='docker-20-04',
+                size_slug='s-2vcpu-4gb',
+                tags=['worker', 'production', 'autoscaled'],
+                user_data=open('cloud-init-worker.yml').read()
+            )
+            droplet.create()
+    else:
+        print(f"➖ Removing {abs(count)} worker(s)")
+        workers = get_droplets_by_tag("autoscaled")
+        for droplet in workers[:abs(count)]:
+            droplet.destroy()
+
+    last_scale_time["workers"] = datetime.now()
+
+def autoscale_loop():
+    print("🤖 Autoscaler started")
+
+    while True:
+        try:
+            # Get metrics
+            queue_depth = get_queue_depth()
+            workers = get_droplets_by_tag("worker")
+            worker_count = len(workers)
+
+            print(f"📊 Queue: {queue_depth}, Workers: {worker_count}")
+
+            # Scale workers based on queue
+            if queue_depth > 500 and worker_count < MAX_WORKERS:
+                scale_workers(5)
+            elif queue_depth > 100 and worker_count < MAX_WORKERS:
+                scale_workers(2)
+            elif queue_depth < 20 and worker_count > MIN_WORKERS:
+                scale_workers(-1)
+
+            # Sleep 2 minutes
+            time.sleep(120)
+
+        except Exception as e:
+            print(f"❌ Error: {e}")
+            time.sleep(60)
+
+if __name__ == "__main__":
+    autoscale_loop()
+```
+
+**Deploy as systemd service on control droplet:**
+
+```bash
+# /etc/systemd/system/autoscaler.service
+[Unit]
+Description=Crawl4AI Autoscaler
+After=network.target
+
+[Service]
+Type=simple
+User=root
+WorkingDirectory=/opt/crawl4ai
+EnvironmentFile=/etc/crawl4ai/secrets.env
+ExecStart=/usr/bin/python3 /opt/crawl4ai/autoscaler.py
+Restart=always
+
+[Install]
+WantedBy=multi-user.target
+```
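+
+Scale-down destroys droplets, so workers should drain rather than die mid-crawl. A sketch of SIGTERM handling for `worker.py` (assumes systemd/docker stop delivers SIGTERM before the droplet disappears; pairing the autoscaler's `destroy()` with a prior shutdown action is safer but not shown):
+
+```python
+import asyncio
+import signal
+
+shutting_down = asyncio.Event()
+
+def request_shutdown():
+    # Stop pulling new jobs; in-flight jobs run to completion
+    shutting_down.set()
+
+async def worker_loop():
+    loop = asyncio.get_running_loop()
+    loop.add_signal_handler(signal.SIGTERM, request_shutdown)
+
+    while not shutting_down.is_set():
+        job = await redis.blpop("crawl_queue", timeout=5)
+        if job:
+            _, job_json = job
+            await process_job(json.loads(job_json))
+```
+
+---
+
+## 7. 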
Monitoring & Observability + +### 7.1 Metrics to Track + +**Redis Metrics:** +```python +# Queue metrics +crawl_queue_depth = LLEN crawl_queue +jobs_completed_total = GET metrics:jobs_completed +jobs_failed_total = GET metrics:jobs_failed + +# Performance metrics +avg_job_duration = GET metrics:avg_job_duration +webhook_success_rate = GET metrics:webhook_success_rate +``` + +**System Metrics (via DO API):** +- Droplet CPU usage +- Droplet memory usage +- Droplet network I/O +- Load balancer connections + +**Application Metrics (Prometheus):** +```python +# In API server +from prometheus_client import Counter, Histogram + +jobs_submitted = Counter('jobs_submitted_total', 'Total jobs submitted') +job_duration = Histogram('job_duration_seconds', 'Job execution time') +webhook_attempts = Counter('webhook_attempts_total', 'Webhook delivery attempts', ['status']) +``` + +### 7.2 Monitoring Stack + +**Option 1: Managed (Recommended for Year 1)** +- DataDog: $15/host/month +- New Relic: $25/month +- Total: ~$100/month + +**Option 2: Self-Hosted** +```yaml +# docker-compose-monitoring.yml +services: + prometheus: + image: prom/prometheus + volumes: + - ./prometheus.yml:/etc/prometheus/prometheus.yml + ports: + - "9090:9090" + + grafana: + image: grafana/grafana + ports: + - "3000:3000" + environment: + - GF_SECURITY_ADMIN_PASSWORD=admin +``` + +**Dashboards to create:** +1. Queue depth over time +2. Worker utilization +3. Job success/failure rate +4. Response time p50/p95/p99 +5. Webhook delivery rate +6. Cost per job + +### 7.3 Alerting Rules + +```yaml +# alerts.yml +groups: + - name: crawl4ai + interval: 1m + rules: + - alert: HighQueueDepth + expr: crawl_queue_depth > 1000 + for: 5m + annotations: + summary: "Queue backing up" + + - alert: AllWorkersDown + expr: count(up{job="worker"}) == 0 + for: 2m + annotations: + summary: "All workers are down" + + - alert: HighJobFailureRate + expr: rate(jobs_failed_total[5m]) > 0.1 + for: 10m + annotations: + summary: "Job failure rate > 10%" +``` + +--- + +## 8. Testing Strategy + +### 8.1 Local Testing + +**Test Setup:** +```bash +# Start local stack +docker-compose up -d + +# Submit test job +curl -X POST http://localhost:11235/crawl/job \ + -H "Content-Type: application/json" \ + -d '{ + "urls": ["https://example.com"], + "webhook_config": { + "webhook_url": "https://webhook.site/unique-id" + } + }' + +# Check result +curl http://localhost:11235/crawl/job/{task_id} +``` + +**Test Cases:** +1. Single URL crawl +2. Multiple URLs (5, 10, 50) +3. Webhook delivery (success) +4. Webhook delivery (failure + retry) +5. Queue backlog handling +6. Worker failure recovery +7. Rate limiting +8. 
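API key validation
+
+These cases translate directly into integration tests against the local stack. A sketch of the happy path with pytest (assumes the Section 8.1 stack is up and `test_key` has been seeded into Redis):
+
+```python
+import time
+import requests
+
+BASE = "http://localhost:11235"
+HEADERS = {"X-API-Key": "test_key"}  # assumed pre-provisioned for local runs
+
+def test_single_url_roundtrip():
+    # Submit (test case 1)
+    r = requests.post(f"{BASE}/crawl/job", headers=HEADERS,
+                      json={"urls": ["https://example.com"]})
+    assert r.status_code == 200
+    task_id = r.json()["task_id"]
+
+    # Poll until a worker completes the job
+    job = {}
+    for _ in range(30):
+        job = requests.get(f"{BASE}/crawl/job/{task_id}", headers=HEADERS).json()
+        if job.get("status") == "completed":
+            break
+        time.sleep(2)
+    assert job.get("status") == "completed"
+```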
+
+### 8.2 Load Testing
+
+**Script: `load_test.py`**
+
+```python
+import asyncio
+import time
+
+import aiohttp
+
+async def submit_job(session, i):
+    start = time.time()
+    async with session.post(
+        "https://api.crawl4ai.com/crawl/job",
+        json={"urls": [f"https://example.com/?test={i}"]},
+        headers={"X-API-Key": "test_key"}
+    ) as resp:
+        result = await resp.json()
+        duration = time.time() - start
+        return {"task_id": result["task_id"], "duration": duration}
+
+async def load_test(concurrency=100, total=1000):
+    async with aiohttp.ClientSession() as session:
+        tasks = []
+        for i in range(total):
+            tasks.append(submit_job(session, i))
+
+            # Flush a batch once the concurrency ceiling is reached
+            if len(tasks) >= concurrency:
+                results = await asyncio.gather(*tasks)
+                print(f"Submitted {len(results)} jobs")
+                tasks = []
+
+        if tasks:
+            results = await asyncio.gather(*tasks)
+            print(f"Submitted final {len(results)} jobs")
+
+# Run: python load_test.py
+asyncio.run(load_test(concurrency=50, total=500))
+```
+
+**Metrics to collect:**
+- Jobs/second throughput
+- P50/P95/P99 latency
+- Queue depth under load
+- Worker utilization
+- Error rate
+
+**Target Performance:**
+- Handle 1000 concurrent jobs
+- P95 latency < 30s
+- Error rate < 0.1%
+
+---
+
+## 9. Cost Optimization
+
+### 9.1 Strategies
+
+**Infrastructure:**
+1. Use preemptible/spot droplets for workers (50% cheaper)
+2. Aggressive auto-scaling down during low traffic
+3. Shared Redis instead of dedicated per-env
+4. Use CDN for static assets (CloudFlare free tier)
+
+**Application:**
+1. Cache common crawls (example.com, etc)
+2. Batch similar jobs together
+3. Smart browser pool reuse
+4. Compress results before storing (see the sketch after this section)
+
+**Pricing:**
+```python
+# Cost model
+COST_PER_API_SERVER = 6   # per month
+COST_PER_WORKER = 24      # per month
+COST_REDIS = 15
+COST_LB = 12
+
+def calculate_cost(api_count, worker_count):
+    return (
+        api_count * COST_PER_API_SERVER +
+        worker_count * COST_PER_WORKER +
+        COST_REDIS +
+        COST_LB
+    )
+
+# Base: 2 API + 2 Workers = $87/mo
+# Peak: 5 API + 10 Workers = $297/mo
+```
+
+**Revenue Model:**
+```python
+# Charge customers based on usage
+FREE_TIER = 100       # requests/month
+STARTER_TIER = 5000   # $20/mo
+PRO_TIER = 50000      # $100/mo
+
+# Cost per 1000 requests at scale
+avg_job_duration = 10                # seconds
+worker_capacity = 6                  # jobs/minute (one browser, ~10s/job)
+cost_per_worker_hour = 24 / 30 / 24  # $24/mo ≈ $0.033/hr
+
+cost_per_1000_requests = (
+    (1000 / worker_capacity / 60) * cost_per_worker_hour
+)  # ≈ 2.8 worker-hours × $0.033 ≈ $0.09 per 1000 requests
+
+# Charging $2 per 1000 leaves ~95% gross margin on worker compute,
+# which absorbs the fixed LB/Redis/API-server costs
+```
+
+### 9.2 Cost Monitoring
+
+**Track:**
+- Cost per request
+- Cost per customer
+- Infrastructure utilization %
+- Idle resource time
+
+**Alert if:**
+- Cost per request > $0.002
+- Idle time > 30%
+- Utilization < 50%
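+
+For the "compress results before storing" item in 9.1, a sketch of transparent zlib packing around the Redis result keys (worker writes, API server reads; note this requires a binary-safe Redis connection, i.e. `decode_responses=False`, for the `result:*` keys):
+
+```python
+import json
+import zlib
+
+def pack_result(result: dict) -> bytes:
+    # JSON crawl results are text-heavy and compress well
+    return zlib.compress(json.dumps(result).encode("utf-8"))
+
+def unpack_result(blob: bytes) -> dict:
+    return json.loads(zlib.decompress(blob).decode("utf-8"))
+
+# Worker side:
+#   await redis.setex(f"result:{task_id}", 3600, pack_result(result))
+# API server side:
+#   unpack_result(await redis.get(f"result:{task_id}"))
+```
+
+---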
+
+## 10. Security
+
+### 10.1 API Key Management
+
+**Storage:**
+```python
+# Redis schema
+api_key:{key_hash} -> {
+    "user_id": "uuid",
+    "tier": "pro",
+    "rate_limit": "1000/minute",
+    "created_at": "timestamp",
+    "active": true
+}
+
+# Rate limiting
+rate_limit:{api_key}:{minute} -> request_count
+```
+
+**Validation:**
+```python
+import hashlib
+from fastapi import HTTPException
+
+async def validate_api_key(api_key: str):
+    key_hash = hashlib.sha256(api_key.encode()).hexdigest()
+    key_data = await redis.hgetall(f"api_key:{key_hash}")
+
+    if not key_data or not key_data.get("active"):
+        raise HTTPException(401, "Invalid API key")
+
+    return key_data
+```
+
+### 10.2 Network Security
+
+**Firewall Rules:**
+```bash
+# API Servers
+- Allow: 443 from LB
+- Allow: 22 from bastion only
+- Allow: 6379 to Redis (private network)
+- Deny: all else
+
+# Workers
+- Allow: 6379 to Redis (private network)
+- Allow: 22 from bastion only
+- Deny: all else
+```
+
+**SSL/TLS:**
+- LB: Auto SSL via Let's Encrypt
+- Redis: TLS enabled
+- Internal: VPC isolation (encryption in transit)
+
+### 10.3 Secrets Management
+
+doctl does not expose a general-purpose secrets store, so secrets are rendered into cloud-init at provision time and land on each droplet as a root-only env file:
+
+```bash
+# Render placeholders before 'doctl compute droplet create'
+export REDIS_PASSWORD="xxx" JWT_SECRET="xxx"
+envsubst < cloud-init-worker.yml.tpl > cloud-init-worker.yml
+```
+
+**Resulting droplet file:**
+```yaml
+#cloud-config
+write_files:
+  - path: /etc/crawl4ai/secrets.env
+    content: |
+      REDIS_PASSWORD=${REDIS_PASSWORD}
+      JWT_SECRET=${JWT_SECRET}
+    permissions: '0600'
+```
+
+(The autoscaler host extends the same file with `REDIS_URL` and `DO_TOKEN`.)
+
+---
+
+## 11. Deployment Checklist
+
+### 11.1 Pre-Deployment
+
+- [ ] Test Docker images locally
+- [ ] Run integration tests
+- [ ] Load test (1000 concurrent jobs)
+- [ ] Verify webhook delivery
+- [ ] Test auto-scaling logic
+- [ ] Review security settings
+- [ ] Set up monitoring
+- [ ] Configure alerts
+- [ ] Document API endpoints
+- [ ] Create runbook
+
+### 11.2 Deployment Steps
+
+```bash
+# 1. Build images
+./build_and_push.sh
+
+# 2. Deploy infrastructure
+./deploy_infrastructure.sh
+
+# 3. Verify health
+doctl compute load-balancer list
+curl https://api.crawl4ai.com/health
+
+# 4. Submit test job
+curl -X POST https://api.crawl4ai.com/crawl/job \
+  -H "X-API-Key: test" \
+  -d '{"urls": ["https://example.com"]}'
+
+# 5. Monitor for 24 hours
+watch -n 60 'doctl compute droplet list'
+```
+
+### 11.3 Post-Deployment
+
+- [ ] Monitor queue depth for 24h
+- [ ] Check error logs
+- [ ] Verify webhook delivery rate
+- [ ] Test auto-scaling (manual trigger)
+- [ ] Validate cost metrics
+- [ ] Run smoke tests every hour
+- [ ] Customer beta testing
+
+---
+
+## 12. Rollback Plan
+
+**If deployment fails:**
+
+```bash
+# 1. Switch LB to old droplets
+doctl compute load-balancer update $LB_ID --droplet-ids $OLD_DROPLET_IDS
+
+# 2. Scale down new droplets
+doctl compute droplet delete $(doctl compute droplet list --tag-name new --format ID --no-header)
+
+# 3. Restore Redis snapshot
+doctl databases backups restore $REDIS_ID $BACKUP_ID
+
+# 4. Investigate
+tail -f /var/log/crawl4ai/*.log
+```
+
+---
+
+## 13. Success Metrics (First 90 Days)
+
+**Technical:**
+- 99.5% uptime
+- P95 latency < 30s
+- <0.1% error rate
+- Webhook delivery > 99%
+
+**Business:**
+- 100 API keys created
+- 50K requests/month processed
+- <$150/month infrastructure cost
+- Cost per request < $0.002
+
+**Scaling:**
+- Auto-scaler working (0 manual interventions)
+- Queue never exceeds 1000 depth
+- Worker utilization > 60%
+- API server utilization > 50%
+
+---
+
+## 14. Files Summary
+
+**To Create:**
+1. `deploy/docker/api_server.py` - Stripped API server
+2. `deploy/docker/worker.py` - Job processor
+3. 
`deploy/docker/Dockerfile.api` - API image +4. `deploy/docker/Dockerfile.worker` - Worker image +5. `deploy/docker/requirements-api.txt` - API deps +6. `deploy/docker/requirements-worker.txt` - Worker deps +7. `scripts/build_and_push.sh` - Build script +8. `scripts/deploy_infrastructure.sh` - Provision script +9. `scripts/autoscaler.py` - Auto-scaling daemon +10. `scripts/cloud-init-api.yml` - API droplet config +11. `scripts/cloud-init-worker.yml` - Worker droplet config +12. `tests/load_test.py` - Load testing +13. `docs/API.md` - API documentation +14. `docs/RUNBOOK.md` - Operations guide + +**To Modify:** +1. Current `server.py` - Extract job queue logic +2. Current `job.py` - Simplify to queue only +3. Current `webhook.py` - Use as-is + +--- + +## 15. Next Steps + +**Week 1:** +- [ ] Create API server code +- [ ] Create worker code +- [ ] Build Docker images +- [ ] Test locally with docker-compose + +**Week 2:** +- [ ] Deploy to DO staging +- [ ] Integration testing +- [ ] Load testing +- [ ] Fix bugs + +**Week 3:** +- [ ] Deploy to production +- [ ] Monitor for 1 week +- [ ] Optimize based on metrics +- [ ] Beta customers + +**Week 4:** +- [ ] Launch publicly +- [ ] Marketing +- [ ] Support setup +- [ ] Iterate + +--- + +**END OF PRD**