docs: rename Docker deployment to self-hosting guide with comprehensive monitoring documentation

Major documentation restructuring to emphasize self-hosting capabilities and fully document the real-time monitoring system.

Changes:
- Renamed docker-deployment.md → self-hosting.md to better reflect the value proposition
- Updated mkdocs.yml navigation to "Self-Hosting Guide"
- Completely rewrote introduction emphasizing self-hosting benefits:
  * Data privacy and ownership
  * Cost control and transparency
  * Performance and security advantages
  * Full customization capabilities
- Expanded "Metrics & Monitoring" → "Real-time Monitoring & Operations" with:
  * Monitoring Dashboard section documenting the /monitor UI
  * Complete feature breakdown (system health, requests, browsers, janitor, errors)
  * Monitor API Endpoints with all REST endpoints and examples
  * WebSocket Streaming integration guide with Python examples
  * Control Actions for manual browser management
  * Production Integration patterns (Prometheus, custom dashboards, alerting)
  * Key production metrics to track
- Enhanced summary section:
  * What users learned checklist
  * Why self-hosting matters
  * Clear next steps
  * Key resources with monitoring dashboard URL

The monitoring dashboard built 2-3 weeks ago is now fully documented and discoverable. Users will understand they have complete operational visibility at http://localhost:11235/monitor with real-time updates, browser pool management, and programmatic control via REST/WebSocket APIs.

This positions Crawl4AI as an enterprise-grade self-hosting solution with DevOps-level monitoring capabilities, not just a Docker deployment.
2025-11-09 13:31:52 +08:00
parent 81b5312629
commit 1a22fb4d4f
2 changed files with 516 additions and 24 deletions
--- a/docs/md_v2/core/docker-deployment.md
+++ b/docs/md_v2/core/docker-deployment.md
@@ -1,4 +1,20 @@
-# Crawl4AI Docker Guide 🐳
+# Self-Hosting Crawl4AI 🚀
 **Take Control of Your Web Crawling Infrastructure**
 Self-hosting Crawl4AI gives you complete control over your web crawling and data extraction pipeline. Unlike cloud-based solutions, you own your data, infrastructure, and destiny.
 ## Why Self-Host?
 - **🔒 Data Privacy**: Your crawled data never leaves your infrastructure
 - **💰 Cost Control**: No per-request pricing - scale within your own resources
 - **🎯 Customization**: Full control over browser configurations, extraction strategies, and performance tuning
 - **📊 Transparency**: Real-time monitoring dashboard shows exactly what's happening
 - **⚡ Performance**: Direct access without API rate limits or geographic restrictions
 - **🛡️ Security**: Keep sensitive data extraction workflows behind your firewall
 - **🔧 Flexibility**: Customize, extend, and integrate with your existing infrastructure
 When you self-host, you can scale from a single container to a full browser infrastructure, all while maintaining complete control and visibility.
 ## Table of Contents
 - [Prerequisites](#prerequisites)
@@ -25,7 +41,12 @@
  - [Available MCP Tools](#available-mcp-tools)
  - [Testing MCP Connections](#testing-mcp-connections)
  - [MCP Schemas](#mcp-schemas)
- [Metrics & Monitoring](#metrics--monitoring)
+- [Real-time Monitoring & Operations](#real-time-monitoring--operations)
  - [Monitoring Dashboard](#monitoring-dashboard)
  - [Monitor API Endpoints](#monitor-api-endpoints)
  - [WebSocket Streaming](#websocket-streaming)
  - [Control Actions](#control-actions)
  - [Production Integration](#production-integration)
 - [Deployment Scenarios](#deployment-scenarios)
 - [Complete Examples](#complete-examples)
 - [Server Configuration](#server-configuration)
@@ -1175,22 +1196,469 @@ async def test_stream_crawl(token: str = None): # Made token optional
 ---
-## Metrics & Monitoring
+## Real-time Monitoring & Operations
-Keep an eye on your crawler with these endpoints:
+One of the key advantages of self-hosting is complete visibility into your infrastructure. Crawl4AI includes a comprehensive real-time monitoring system that gives you full transparency and control.
- `/health` - Quick health check
+### Monitoring Dashboard
 - `/metrics` - Detailed Prometheus metrics
 - `/schema` - Full API schema
-Example health check:
+Access the **built-in real-time monitoring dashboard** for complete operational visibility:
 ```
 http://localhost:11235/monitor
 ```
 ![Monitoring Dashboard](https://via.placeholder.com/800x400?text=Crawl4AI+Monitoring+Dashboard)
 **Dashboard Features:**
 #### 1. System Health Overview
 - **CPU & Memory**: Live usage with progress bars and percentage indicators
 - **Network I/O**: Total bytes sent/received since startup
 - **Server Uptime**: How long your server has been running
 - **Browser Pool Status**:
  - 🔥 Permanent browser (always-on default config, ~270MB)
  - ♨️ Hot pool (frequently used configs, ~180MB each)
  - ❄️ Cold pool (idle browsers awaiting cleanup, ~180MB each)
 - **Memory Pressure**: LOW/MEDIUM/HIGH indicator for janitor behavior
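The per-browser figures in the pool list above lend themselves to a quick capacity estimate. The following is a minimal sketch, assuming the approximate sizes quoted there (~270MB for the permanent browser, ~180MB for each hot or cold browser). The function name and constants are illustrative and not part of Crawl4AI.

```python
# Approximate footprints quoted in the pool status list (assumed constants).
PERMANENT_MB = 270
POOLED_MB = 180


def estimate_pool_memory_mb(hot: int, cold: int, permanent: bool = True) -> int:
    """Estimate total browser-pool memory in MB from pool counts."""
    if hot < 0 or cold < 0:
        raise ValueError("pool counts must be non-negative")
    base = PERMANENT_MB if permanent else 0
    # Hot and cold browsers are assumed to cost the same amount each.
    return base + POOLED_MB * (hot + cold)


if __name__ == "__main__":
    # One permanent browser, three hot, one cold.
    print(estimate_pool_memory_mb(hot=3, cold=1))  # 990
```

Actual usage depends on the pages being rendered, so treat the result as a planning number rather than a measurement.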
 #### 2. Live Request Tracking
 - **Active Requests**: Currently running crawls with:
  - Request ID for tracking
  - Target URL (truncated for display)
  - Endpoint being used
  - Elapsed time (updates in real-time)
  - Memory change since the request started
 - **Completed Requests**: Last 10 finished requests showing:
  - Success/failure status (color-coded)
  - Total execution time
  - Memory delta (how much memory changed)
  - Pool hit (was browser reused?)
  - HTTP status code
 - **Filtering**: View all, success only, or errors only
 #### 3. Browser Pool Management
 Interactive table showing all active browsers:
 | Type | Signature | Age | Last Used | Hits | Actions |
 |------|-----------|-----|-----------|------|---------|
 | permanent | abc12345 | 2h | 5s ago | 1,247 | Restart |
 | hot | def67890 | 45m | 2m ago | 89 | Kill / Restart |
 | cold | ghi11213 | 30m | 15m ago | 3 | Kill / Restart |
 - **Reuse Rate**: Percentage of requests that reused existing browsers
 - **Memory Estimates**: Total memory used by browser pool
 - **Manual Control**: Kill or restart individual browsers
 #### 4. Janitor Events Log
 Real-time log of browser pool cleanup events:
 - When cold browsers are closed due to memory pressure
 - When browsers are promoted from cold to hot pool
 - Forced cleanups triggered manually
 - Detailed cleanup reasons and browser signatures
 #### 5. Error Monitoring
 Recent errors with full context:
 - Timestamp
 - Endpoint where error occurred
 - Target URL
 - Error message
 - Request ID for correlation
 **Live Updates:**
 The dashboard connects via WebSocket and refreshes every **2 seconds** with the latest data. A connection status indicator shows whether you are currently connected.
 ---
 ### Monitor API Endpoints
 For programmatic monitoring, automation, and integration with your existing infrastructure:
 #### Health & Statistics
 **Get System Health**
 ```bash
-curl http://localhost:11235/health
+GET /monitor/health
 ```
 Returns current system snapshot:
 ```json
 {
  "container": {
    "memory_percent": 45.2,
    "cpu_percent": 23.1,
    "network_sent_mb": 1250.45,
    "network_recv_mb": 3421.12,
    "uptime_seconds": 7234
  },
  "pool": {
    "permanent": {"active": true, "memory_mb": 270},
    "hot": {"count": 3, "memory_mb": 540},
    "cold": {"count": 1, "memory_mb": 180},
    "total_memory_mb": 990
  },
  "janitor": {
    "next_cleanup_estimate": "adaptive",
    "memory_pressure": "MEDIUM"
  }
 }
 ```
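As a rough illustration of consuming this snapshot, the sketch below turns it into a one-line status string and checks that the pool total matches the sum of its parts. It assumes the response has exactly the shape shown above; the helper names are hypothetical.

```python
# Sample copied from the documented /monitor/health response.
SAMPLE_HEALTH = {
    "container": {"memory_percent": 45.2, "cpu_percent": 23.1, "uptime_seconds": 7234},
    "pool": {
        "permanent": {"active": True, "memory_mb": 270},
        "hot": {"count": 3, "memory_mb": 540},
        "cold": {"count": 1, "memory_mb": 180},
        "total_memory_mb": 990,
    },
    "janitor": {"next_cleanup_estimate": "adaptive", "memory_pressure": "MEDIUM"},
}


def format_uptime(seconds: int) -> str:
    """Render a duration in seconds as 'Xh Ym Zs'."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes}m {secs}s"


def summarize_health(snapshot: dict) -> str:
    """Build a compact status line from a health snapshot."""
    container = snapshot["container"]
    pool = snapshot["pool"]
    parts_mb = sum(pool[name]["memory_mb"] for name in ("permanent", "hot", "cold"))
    check = "consistent" if parts_mb == pool["total_memory_mb"] else "mismatch"
    return (
        f"mem {container['memory_percent']:.1f}% | "
        f"cpu {container['cpu_percent']:.1f}% | "
        f"up {format_uptime(container['uptime_seconds'])} | "
        f"pool {pool['total_memory_mb']} MB ({check}) | "
        f"pressure {snapshot['janitor']['memory_pressure']}"
    )


if __name__ == "__main__":
    print(summarize_health(SAMPLE_HEALTH))
```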
 **Get Request Statistics**
 ```bash
 GET /monitor/requests?status=all&limit=50
 ```
 Query parameters:
 - `status`: Filter by `all`, `active`, `completed`, `success`, or `error`
 - `limit`: Number of completed requests to return (1-1000)
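The query parameters above can be validated on the client before a request is sent. This is a small sketch using only the standard library; the helper name is hypothetical, and the allowed values are taken from the list above.

```python
from urllib.parse import urlencode

# Allowed values as documented for the `status` parameter.
VALID_STATUSES = {"all", "active", "completed", "success", "error"}


def build_requests_url(base_url: str, status: str = "all", limit: int = 50) -> str:
    """Build a /monitor/requests URL, rejecting out-of-range parameters."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1000")
    query = urlencode({"status": status, "limit": limit})
    return f"{base_url.rstrip('/')}/monitor/requests?{query}"


if __name__ == "__main__":
    print(build_requests_url("http://localhost:11235", status="error", limit=20))
```

The resulting URL can be passed to any HTTP client, for example `requests.get(url, timeout=10)`.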
 **Get Browser Pool Details**
 ```bash
 GET /monitor/browsers
 ```
 Returns detailed information about all active browsers:
 ```json
 {
  "browsers": [
    {
      "type": "permanent",
      "sig": "abc12345",
      "age_seconds": 7234,
      "last_used_seconds": 5,
      "memory_mb": 270,
      "hits": 1247,
      "killable": false
    },
    {
      "type": "hot",
      "sig": "def67890",
      "age_seconds": 2701,
      "last_used_seconds": 120,
      "memory_mb": 180,
      "hits": 89,
      "killable": true
    }
  ],
  "summary": {
    "total_count": 5,
    "total_memory_mb": 990,
    "reuse_rate_percent": 87.3
  }
 }
 ```
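One possible use of this payload is picking out browsers that have been idle long enough to be worth killing manually. The sketch below assumes the field names shown above (`sig`, `killable`, `last_used_seconds`); the idle threshold and function name are illustrative choices, not Crawl4AI defaults.

```python
# Trimmed sample based on the documented /monitor/browsers response.
SAMPLE_BROWSERS = {
    "browsers": [
        {"type": "permanent", "sig": "abc12345", "last_used_seconds": 5, "killable": False},
        {"type": "hot", "sig": "def67890", "last_used_seconds": 120, "killable": True},
    ]
}


def idle_kill_candidates(payload: dict, max_idle_seconds: int = 600) -> list:
    """Return signatures of killable browsers idle longer than the threshold."""
    return [
        browser["sig"]
        for browser in payload["browsers"]
        # Browsers flagged as not killable (e.g. the permanent one) are skipped.
        if browser.get("killable") and browser["last_used_seconds"] > max_idle_seconds
    ]


if __name__ == "__main__":
    print(idle_kill_candidates(SAMPLE_BROWSERS, max_idle_seconds=60))  # ['def67890']
```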
 **Get Endpoint Performance Statistics**
 ```bash
 GET /monitor/endpoints/stats
 ```
 Returns aggregated metrics per endpoint:
 ```json
 {
  "/crawl": {
    "count": 1523,
    "avg_latency_ms": 2341.5,
    "success_rate_percent": 98.2,
    "pool_hit_rate_percent": 89.1,
    "errors": 27
  },
  "/md": {
    "count": 891,
    "avg_latency_ms": 1823.7,
    "success_rate_percent": 99.4,
    "pool_hit_rate_percent": 92.3,
    "errors": 5
  }
 }
 ```
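A short sketch of working with these per-endpoint statistics: it flags endpoints under a success-rate threshold and recomputes the rate from `count` and `errors`. The relationship between those two fields is an assumption; it happens to match the sample numbers above, but the source does not state it.

```python
# Sample copied from the documented /monitor/endpoints/stats response.
SAMPLE_STATS = {
    "/crawl": {"count": 1523, "success_rate_percent": 98.2, "errors": 27},
    "/md": {"count": 891, "success_rate_percent": 99.4, "errors": 5},
}


def failing_endpoints(stats: dict, min_success: float = 95.0) -> dict:
    """Map each endpoint below the threshold to its reported success rate."""
    return {
        endpoint: metrics["success_rate_percent"]
        for endpoint, metrics in stats.items()
        if metrics["success_rate_percent"] < min_success
    }


def derived_success_rate(metrics: dict) -> float:
    """Recompute the success rate, assuming it equals (count - errors) / count."""
    count = metrics["count"]
    if count == 0:
        return 100.0
    return round(100.0 * (count - metrics["errors"]) / count, 1)


if __name__ == "__main__":
    print(failing_endpoints(SAMPLE_STATS, min_success=99.0))
    print({ep: derived_success_rate(m) for ep, m in SAMPLE_STATS.items()})
```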
 **Get Timeline Data**
 ```bash
 GET /monitor/timeline?metric=memory&window=5m
 ```
 Parameters:
 - `metric`: `memory`, `requests`, or `browsers`
 - `window`: Currently only `5m` (5-minute window, 5-second resolution)
 Returns time-series data for charts:
 ```json
 {
  "timestamps": [1699564800, 1699564805, 1699564810, ...],
  "values": [42.1, 43.5, 41.8, ...]
 }
 ```
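Because the timeline arrives as two parallel arrays, a client typically pairs them up before charting. The sketch below assumes the timestamps are Unix epoch seconds and uses only the three sample points visible above (the real response contains more); the function names are hypothetical.

```python
from datetime import datetime, timezone

# Only the three points shown in the sample response; the rest is elided there.
SAMPLE_TIMELINE = {
    "timestamps": [1699564800, 1699564805, 1699564810],
    "values": [42.1, 43.5, 41.8],
}


def timeline_points(payload: dict) -> list:
    """Pair each timestamp (as a UTC datetime) with its value."""
    timestamps, values = payload["timestamps"], payload["values"]
    if len(timestamps) != len(values):
        raise ValueError("timestamps and values must have the same length")
    return [
        (datetime.fromtimestamp(ts, tz=timezone.utc), value)
        for ts, value in zip(timestamps, values)
    ]


def timeline_summary(payload: dict) -> dict:
    """Compute min, max, and mean of the series."""
    values = payload["values"]
    if not values:
        raise ValueError("timeline is empty")
    return {"min": min(values), "max": max(values), "avg": sum(values) / len(values)}


if __name__ == "__main__":
    for when, value in timeline_points(SAMPLE_TIMELINE):
        print(when.isoformat(), value)
    print(timeline_summary(SAMPLE_TIMELINE))
```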
 #### Logs
 **Get Janitor Events**
 ```bash
 GET /monitor/logs/janitor?limit=100
 ```
 **Get Error Log**
 ```bash
 GET /monitor/logs/errors?limit=100
 ```
 ---
-*(Deployment Scenarios and Complete Examples sections remain the same, maybe update links if examples moved)*
+### WebSocket Streaming
 For real-time monitoring in your own dashboards or applications:
 ```bash
 WS /monitor/ws
 ```
 **Connection Example (Python):**
 ```python
 import asyncio
 import websockets
 import json
 async def monitor_server():
    uri = "ws://localhost:11235/monitor/ws"
    async with websockets.connect(uri) as websocket:
        print("Connected to Crawl4AI monitor")
        while True:
             # Wait for the next update (the server pushes one about every 2 seconds)
            data = await websocket.recv()
            update = json.loads(data)
            # Extract key metrics
            health = update['health']
            active_requests = len(update['requests']['active'])
            browsers = len(update['browsers'])
            print(f"Memory: {health['container']['memory_percent']:.1f}% | "
                  f"Active: {active_requests} | "
                  f"Browsers: {browsers}")
            # Check for high memory pressure
            if health['janitor']['memory_pressure'] == 'HIGH':
                print("⚠️  HIGH MEMORY PRESSURE - Consider cleanup")
 asyncio.run(monitor_server())
 ```
 **Update Payload Structure:**
 ```json
 {
  "timestamp": 1699564823.456,
  "health": { /* System health snapshot */ },
  "requests": {
    "active": [ /* Currently running */ ],
    "completed": [ /* Last 10 completed */ ]
  },
  "browsers": [ /* All active browsers */ ],
  "timeline": {
    "memory": { /* Last 5 minutes */ },
    "requests": { /* Request rate */ },
    "browsers": { /* Pool composition */ }
  },
  "janitor": [ /* Last 10 cleanup events */ ],
  "errors": [ /* Last 10 errors */ ]
 }
 ```
 ---
 ### Control Actions
 Take manual control when needed:
 **Force Immediate Cleanup**
 ```bash
 POST /monitor/actions/cleanup
 ```
 Kills all cold pool browsers immediately (useful when memory is tight):
 ```json
 {
  "success": true,
  "killed_browsers": 3
 }
 ```
 **Kill Specific Browser**
 ```bash
 POST /monitor/actions/kill_browser
 Content-Type: application/json
 {
  "sig": "abc12345"  // First 8 chars of browser signature
 }
 ```
 Response:
 ```json
 {
  "success": true,
  "killed_sig": "abc12345",
  "pool_type": "hot"
 }
 ```
 **Restart Browser**
 ```bash
 POST /monitor/actions/restart_browser
 Content-Type: application/json
 {
  "sig": "permanent"  // Or first 8 chars of signature
 }
 ```
 For the permanent browser, this closes and reinitializes it. For hot or cold browsers, it kills the browser and lets new requests create a fresh one.
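The cleanup, kill, and restart actions above can be wrapped in a small helper. This is a sketch using only the standard library: it builds the request object without sending it, so it can be inspected first. The helper name is hypothetical; the paths and the `sig` field come from the examples above.

```python
import json
from typing import Optional
from urllib.request import Request


def build_action_request(base_url: str, action: str, sig: Optional[str] = None) -> Request:
    """Build a POST request for /monitor/actions/<action>.

    `sig` is sent as a JSON body when given (kill_browser, restart_browser);
    cleanup takes no body.
    """
    url = f"{base_url.rstrip('/')}/monitor/actions/{action}"
    if sig is None:
        return Request(url, data=b"", method="POST")
    body = json.dumps({"sig": sig}).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")


if __name__ == "__main__":
    req = build_action_request("http://localhost:11235", "kill_browser", sig="abc12345")
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

To send it, pass the request to `urllib.request.urlopen(req, timeout=10)` and decode the JSON response.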
 **Reset Statistics**
 ```bash
 POST /monitor/stats/reset
 ```
 Clears endpoint counters (useful for starting fresh after testing).
 ---
 ### Production Integration
 #### Integration with Existing Monitoring Systems
 **Prometheus Integration:**
 ```bash
 # Scrape metrics endpoint
 curl http://localhost:11235/metrics
 ```
 **Custom Dashboard Integration:**
 ```python
 # Example: Push metrics to your monitoring system
 import asyncio
 import websockets
 import json
 from your_monitoring import push_metric  # Placeholder: replace with your own metrics client
 async def integrate_monitoring():
     async with websockets.connect("ws://localhost:11235/monitor/ws") as ws:
         while True:
             # Each message is one JSON update from the monitor
             data = json.loads(await ws.recv())
             # Push to your monitoring system
             push_metric("crawl4ai.memory.percent",
                        data['health']['container']['memory_percent'])
             push_metric("crawl4ai.active_requests",
                        len(data['requests']['active']))
             push_metric("crawl4ai.browser_count",
                        len(data['browsers']))
 asyncio.run(integrate_monitoring())
 ```
 **Alerting Example:**
 ```python
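 import time
 import requests
 BASE_URL = "http://localhost:11235"
 def send_alert(message):
     """Placeholder notifier: replace with email, Slack, PagerDuty, etc."""
     print(f"ALERT: {message}")
 def check_health():
     """Poll the monitor endpoints and alert on issues"""
     # Timeouts keep a hung server from blocking the polling loop
     health = requests.get(f"{BASE_URL}/monitor/health", timeout=10).json()
     # Alert on high memory
     if health['container']['memory_percent'] > 85:
         send_alert(f"High memory: {health['container']['memory_percent']}%")
     # Alert on low per-endpoint success rate
     stats = requests.get(f"{BASE_URL}/monitor/endpoints/stats", timeout=10).json()
     for endpoint, metrics in stats.items():
         if metrics['success_rate_percent'] < 95:
             send_alert(f"{endpoint} success rate: {metrics['success_rate_percent']}%")
 # Run every minute; an unreachable monitor is itself worth an alert
 while True:
     try:
         check_health()
     except requests.RequestException as exc:
         send_alert(f"Monitor unreachable: {exc}")
     time.sleep(60)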
 ```
 **Log Aggregation:**
 ```python
 import requests
 from datetime import datetime
 def log_to_system(record):
     """Placeholder sink: replace with your logging pipeline."""
     print(record)
 def aggregate_errors():
     """Fetch and aggregate errors for logging system"""
     response = requests.get("http://localhost:11235/monitor/logs/errors?limit=100", timeout=10)
     response.raise_for_status()  # Fail loudly on non-2xx responses
     errors = response.json()['errors']
     for error in errors:
         log_to_system({
             # Assumes `timestamp` is a Unix epoch value in seconds
             'timestamp': datetime.fromtimestamp(error['timestamp']),
             'service': 'crawl4ai',
             'endpoint': error['endpoint'],
             'url': error['url'],
             'message': error['error'],
             'request_id': error['request_id']
         })
 ```
 #### Key Metrics to Track
 For production self-hosted deployments, monitor these metrics:
 1. **Memory Usage Trends**
   - Track `container.memory_percent` over time
   - Alert when consistently above 80%
   - Prevents OOM kills
 2. **Request Success Rates**
   - Monitor per-endpoint success rates
   - Alert when below 95%
   - Indicates crawling issues
 3. **Average Latency**
   - Track `avg_latency_ms` per endpoint
   - Detect performance degradation
   - Optimize slow endpoints
 4. **Browser Pool Efficiency**
   - Monitor `reuse_rate_percent`
   - Should be >80% for good efficiency
   - Low rates indicate pool churn
 5. **Error Frequency**
   - Count errors per time window
   - Alert on sudden spikes
   - Track error patterns
 6. **Janitor Activity**
   - Monitor cleanup frequency
   - Excessive cleanup indicates memory pressure
   - Adjust pool settings if needed
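The thresholds in the list above can be gathered into a single check. This is a minimal sketch, assuming the response shapes documented earlier for the health and endpoint statistics endpoints; the function name is hypothetical. It looks at one sample only, whereas the "consistently above 80%" rule would need the check averaged over a time window.

```python
def evaluate_thresholds(health: dict, endpoint_stats: dict, reuse_rate_percent: float) -> list:
    """Return alert messages for one sample, using the thresholds listed above."""
    alerts = []
    memory = health["container"]["memory_percent"]
    if memory > 80:
        alerts.append(f"memory at {memory}% (threshold 80%)")
    # Sort endpoints so the alert order is stable between runs.
    for endpoint, metrics in sorted(endpoint_stats.items()):
        rate = metrics["success_rate_percent"]
        if rate < 95:
            alerts.append(f"{endpoint} success rate {rate}% (threshold 95%)")
    if reuse_rate_percent < 80:
        alerts.append(f"browser reuse rate {reuse_rate_percent}% (threshold 80%)")
    return alerts


if __name__ == "__main__":
    sample_health = {"container": {"memory_percent": 91.0}}
    sample_stats = {"/crawl": {"success_rate_percent": 90.0}}
    for message in evaluate_thresholds(sample_health, sample_stats, reuse_rate_percent=50.0):
        print(message)
```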
 ---
 ### Quick Health Check
 For simple uptime monitoring:
 ```bash
 curl http://localhost:11235/health
 ```
 Returns:
 ```json
 {
  "status": "healthy",
  "version": "0.7.4"
 }
 ```
 Other useful endpoints:
 - `/metrics` - Prometheus metrics
 - `/schema` - Full API schema
 ---
@@ -1350,22 +1818,46 @@ We're here to help you succeed with Crawl4AI! Here's how to get support:
 ## Summary
-In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:
+Congratulations! You now have everything you need to self-host your own Crawl4AI infrastructure with complete control and visibility.
 - Building and running the Docker container
 - Configuring the environment  
 - Using the interactive playground for testing
 - Making API requests with proper typing
 - Using the Python SDK
 - Leveraging specialized endpoints for screenshots, PDFs, and JavaScript execution
 - Connecting via the Model Context Protocol (MCP)
 - Monitoring your deployment
-The new playground interface at `http://localhost:11235/playground` makes it much easier to test configurations and generate the corresponding JSON for API requests.
+**What You've Learned:**
 - ✅ Multiple deployment options (Docker Hub, Docker Compose, manual builds)
 - ✅ Environment configuration and LLM integration
 - ✅ Using the interactive playground for testing
 - ✅ Making API requests with proper typing (SDK and REST)
 - ✅ Specialized endpoints (screenshots, PDFs, JavaScript execution)
 - ✅ MCP integration for AI-assisted development
 - ✅ **Real-time monitoring dashboard** for operational transparency
 - ✅ **Monitor API** for programmatic control and integration
 - ✅ Production deployment best practices
-For AI application developers, the MCP integration allows tools like Claude Code to directly access Crawl4AI's capabilities without complex API handling.
+**Why This Matters:**
-Remember, the examples in the `examples` folder are your friends - they show real-world usage patterns that you can adapt for your needs.
+By self-hosting Crawl4AI, you:
 - 🔒 **Own Your Data**: Everything stays in your infrastructure
 - 📊 **See Everything**: Real-time dashboard shows exactly what's happening
 - 💰 **Control Costs**: Scale within your resources, no per-request fees
 - ⚡ **Maximize Performance**: Direct access with smart browser pooling (10x memory efficiency)
 - 🛡️ **Stay Secure**: Keep sensitive workflows behind your firewall
 - 🔧 **Customize Freely**: Full control over configs, strategies, and optimizations
-Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀
+**Next Steps:**
 1. **Start Simple**: Deploy with Docker Hub image and test with the playground
 2. **Monitor Everything**: Open `http://localhost:11235/monitor` to watch your server
 3. **Integrate**: Connect your applications using the Python SDK or REST API
 4. **Scale Smart**: Use the monitoring data to optimize your deployment
 5. **Go Production**: Set up alerting, log aggregation, and automated cleanup
 **Key Resources:**
 - 🎮 **Playground**: `http://localhost:11235/playground` - Interactive testing
 - 📊 **Monitor Dashboard**: `http://localhost:11235/monitor` - Real-time visibility
 - 📖 **Architecture Docs**: `deploy/docker/ARCHITECTURE.md` - Deep technical dive
 - 💬 **Discord Community**: Get help and share experiences
 - ⭐ **GitHub**: Report issues, contribute, show support
 Remember: The monitoring dashboard is your window into your infrastructure. Use it to understand performance, troubleshoot issues, and optimize your deployment. The examples in the `examples` folder show real-world usage patterns you can adapt.
 **You're now in control of your web crawling destiny!** 🚀
 Happy crawling! 🕷️
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -18,7 +18,7 @@ nav:
    - "Marketplace Admin": "marketplace/admin/index.html"
  - Setup & Installation:
    - "Installation": "core/installation.md"
-    - "Docker Deployment": "core/docker-deployment.md"
+    - "Self-Hosting Guide": "core/self-hosting.md"
  - "Blog & Changelog":
    - "Blog Home": "blog/index.md"
    - "Changelog": "https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md"