# Docker Server API Reference

The Crawl4AI Docker server provides a comprehensive REST API for web crawling, content extraction, and processing. This guide covers all available endpoints with practical examples.

## 🚀 Quick Start

### Base URL

```
http://localhost:11235
```

### Authentication

Most endpoints require JWT authentication. Get your token first:

```bash
curl -X POST http://localhost:11235/token \
  -H "Content-Type: application/json" \
  -d '{"email": "your@email.com"}'
```

### Interactive Documentation

Visit `http://localhost:11235/docs` for interactive Swagger UI documentation.

---

## 📑 Table of Contents

### Core Crawling

- [POST /crawl](#post-crawl) - Main crawling endpoint
- [POST /crawl/stream](#post-crawlstream) - Streaming crawl endpoint
- [POST /crawl/http](#post-crawlhttp) - HTTP-only crawling endpoint
- [POST /crawl/http/stream](#post-crawlhttpstream) - HTTP-only streaming crawl endpoint
- [POST /seed](#post-seed) - URL discovery and seeding

### Content Extraction

- [POST /md](#post-md) - Extract markdown from URL
- [POST /html](#post-html) - Get clean HTML content
- [POST /screenshot](#post-screenshot) - Capture page screenshots
- [POST /pdf](#post-pdf) - Export page as PDF
- [POST /execute_js](#post-execute_js) - Execute JavaScript on page

### Dispatcher Management

- [GET /dispatchers](#get-dispatchers) - List available dispatchers
- [GET /dispatchers/default](#get-dispatchersdefault) - Get default dispatcher
- [GET /dispatchers/stats](#get-dispatchersstats) - Get dispatcher statistics

### Adaptive Crawling

- [POST /adaptive/crawl](#post-adaptivecrawl) - Adaptive crawl with auto-discovery
- [GET /adaptive/status/{task_id}](#get-adaptivestatustask_id) - Check adaptive crawl status

### Monitoring & Profiling

- [GET /monitoring/health](#get-monitoringhealth) - Health check endpoint
- [GET /monitoring/stats](#get-monitoringstats) - Get current statistics
- [GET /monitoring/stats/stream](#get-monitoringstatsstream) - Real-time statistics stream (SSE)
- [GET /monitoring/stats/urls](#get-monitoringstatsurls) - URL-specific statistics
- [POST /monitoring/stats/reset](#post-monitoringstatsreset) - Reset statistics
- [POST /monitoring/profile/start](#post-monitoringprofilestart) - Start profiling session
- [GET /monitoring/profile/{session_id}](#get-monitoringprofilesession_id) - Get profiling results
- [GET /monitoring/profile](#get-monitoringprofile) - List profiling sessions
- [DELETE /monitoring/profile/{session_id}](#delete-monitoringprofilesession_id) - Delete session
- [POST /monitoring/profile/cleanup](#post-monitoringprofilecleanup) - Cleanup old sessions

### Utility Endpoints

- [POST /token](#post-token) - Get authentication token
- [GET /health](#get-health) - Health check
- [GET /metrics](#get-metrics) - Prometheus metrics
- [GET /schema](#get-schema) - Get API schemas
- [GET /llm/{url}](#get-llmurl) - LLM-friendly format

---

## Core Crawling Endpoints

### POST /crawl

Main endpoint for crawling single or multiple URLs.

#### Request

**Headers:**

```
Content-Type: application/json
Authorization: Bearer <token>
```

**Body:**

```json
{
  "urls": ["https://example.com"],
  "browser_config": {
    "headless": true,
    "viewport_width": 1920,
    "viewport_height": 1080
  },
  "crawler_config": {
    "word_count_threshold": 10,
    "wait_until": "networkidle",
    "screenshot": true
  },
  "dispatcher": "memory_adaptive"
}
```

#### Response

```json
{
  "success": true,
  "results": [
    {
      "url": "https://example.com",
      "html": "...",
      "markdown": "# Example Domain\n\nThis domain is for use in...",
      "cleaned_html": "<div>...</div>",
      "screenshot": "base64_encoded_image_data",
      "success": true,
      "status_code": 200,
      "extracted_content": {},
      "metadata": {
        "title": "Example Domain",
        "description": "Example Domain Description"
      },
      "links": {
        "internal": ["https://example.com/about"],
        "external": ["https://other.com"]
      },
      "media": {
        "images": [{"src": "image.jpg", "alt": "Image"}]
      }
    }
  ]
}
```

#### Examples

=== "Python"

    ```python
    import requests

    # Get token first
    token_response = requests.post(
        "http://localhost:11235/token",
        json={"email": "your@email.com"}
    )
    token = token_response.json()["access_token"]

    # Crawl with basic config
    response = requests.post(
        "http://localhost:11235/crawl",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json"
        },
        json={
            "urls": ["https://example.com"],
            "browser_config": {"headless": True},
            "crawler_config": {"screenshot": True}
        }
    )

    data = response.json()
    if data["success"]:
        result = data["results"][0]
        print(f"Title: {result['metadata']['title']}")
        print(f"Markdown length: {len(result['markdown'])}")
    ```

=== "cURL"

    ```bash
    # Get token
    TOKEN=$(curl -X POST http://localhost:11235/token \
      -H "Content-Type: application/json" \
      -d '{"email": "your@email.com"}' | jq -r '.access_token')

    # Crawl URL
    curl -X POST http://localhost:11235/crawl \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "urls": ["https://example.com"],
        "browser_config": {"headless": true},
        "crawler_config": {"screenshot": true}
      }'
    ```

=== "JavaScript"

    ```javascript
    // Get token
    const tokenResponse = await fetch('http://localhost:11235/token', {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({email: 'your@email.com'})
    });
    const {access_token} = await tokenResponse.json();

    // Crawl URL
    const response = await fetch('http://localhost:11235/crawl', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${access_token}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        urls: ['https://example.com'],
        browser_config: {headless: true},
        crawler_config: {screenshot: true}
      })
    });

    const data = await response.json();
    console.log('Results:', data.results);
    ```

#### Configuration Options

**Browser Config:**

```json
{
  "headless": true,               // Run browser in headless mode
  "viewport_width": 1920,         // Browser viewport width
  "viewport_height": 1080,        // Browser viewport height
  "user_agent": "custom agent",   // Custom user agent
  "accept_downloads": false,      // Allow file downloads
  "use_managed_browser": false,   // Use system browser
  "java_script_enabled": true     // Enable JavaScript execution
}
```

**Crawler Config:**

```json
{
  "word_count_threshold": 10,       // Minimum words per block
  "wait_until": "networkidle",      // When to consider page loaded
  "wait_for": "div.content",        // CSS selector to wait for
  "delay_before_return": 0.5,       // Delay before returning (seconds)
  "screenshot": true,               // Capture screenshot
  "pdf": false,                     // Generate PDF
  "remove_overlay_elements": true,  // Remove popups/modals
  "simulate_user": false,           // Simulate user interaction
  "magic": false,                   // Auto-handle overlays
  "adjust_viewport_to_content": false, // Auto-adjust viewport
  "page_timeout": 60000,            // Page load timeout (ms)
  "js_code": "console.log('hi')",   // Execute custom JS
  "css_selector": ".content",       // Extract specific element
  "excluded_tags": ["nav", "footer"], // Tags to exclude
  "exclude_external_links": true    // Remove external links
}
```

**Dispatcher Options:**

- `memory_adaptive` - Dynamically adjusts concurrency based on available memory (default)
- `semaphore` - Fixed concurrency limit (see the sketch after this list)
---

### POST /crawl/stream

Streaming endpoint for real-time crawl progress.

#### Request

Same as `/crawl` endpoint.

#### Response

Server-Sent Events (SSE) stream:

```
data: {"type": "progress", "url": "https://example.com", "status": "started"}
data: {"type": "progress", "url": "https://example.com", "status": "fetching"}
data: {"type": "result", "url": "https://example.com", "data": {...}}
data: {"type": "complete", "success": true}
```

#### Examples

=== "Python"

    ```python
    import requests
    import json

    response = requests.post(
        "http://localhost:11235/crawl/stream",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json"
        },
        json={"urls": ["https://example.com"]},
        stream=True
    )

    for line in response.iter_lines():
        if line:
            line = line.decode('utf-8')
            if line.startswith('data: '):
                data = json.loads(line[6:])
                print(f"Event: {data.get('type')} - {data}")
    ```

=== "JavaScript"

    ```javascript
    // EventSource only supports GET, so POST and read the SSE stream via fetch
    const response = await fetch('http://localhost:11235/crawl/stream', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${access_token}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({urls: ['https://example.com']})
    });

    const reader = response.body.getReader();
    const decoder = new TextDecoder();

    while (true) {
      const {done, value} = await reader.read();
      if (done) break;
      for (const line of decoder.decode(value).split('\n')) {
        if (line.startsWith('data: ')) {
          // The stream ends after the "complete" event
          console.log('Progress:', JSON.parse(line.slice(6)));
        }
      }
    }
    ```

---

### POST /seed

Discover and seed URLs from a website.

#### Request

```json
{
  "url": "https://www.nbcnews.com",
  "config": {
    "max_urls": 20,
    "filter_type": "domain",
    "exclude_external": true
  }
}
```

**Filter Types:**

- `all` - Include all discovered URLs
- `domain` - Only URLs from the same domain
- `subdomain` - Only URLs from the same subdomain

#### Response

```json
{
  "seed_url": [
    "https://www.nbcnews.com/news/page1",
    "https://www.nbcnews.com/news/page2",
    "https://www.nbcnews.com/about"
  ],
  "count": 3
}
```

#### Examples

=== "Python"

    ```python
    response = requests.post(
        "http://localhost:11235/seed",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "url": "https://www.nbcnews.com",
            "config": {
                "max_urls": 20,
                "filter_type": "domain"
            }
        }
    )

    data = response.json()
    urls = data["seed_url"]
    print(f"Found {len(urls)} URLs")
    ```

=== "cURL"

    ```bash
    curl -X POST http://localhost:11235/seed \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://www.nbcnews.com",
        "config": {
          "max_urls": 20,
          "filter_type": "domain"
        }
      }'
    ```
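A common pattern is to chain discovery into crawling: feed the `seed_url` list from `/seed` straight into `/crawl`. A minimal sketch (token obtained as in the Quick Start):

```python
import requests

BASE = "http://localhost:11235"
headers = {"Authorization": f"Bearer {token}"}

# 1. Discover up to 10 same-domain URLs
seed = requests.post(
    f"{BASE}/seed",
    headers=headers,
    json={
        "url": "https://www.nbcnews.com",
        "config": {"max_urls": 10, "filter_type": "domain"}
    }
).json()

# 2. Crawl everything that was discovered
crawl = requests.post(
    f"{BASE}/crawl",
    headers=headers,
    json={"urls": seed["seed_url"], "dispatcher": "memory_adaptive"}
).json()

print(f"Discovered {seed['count']} URLs, crawled {len(crawl['results'])} pages")
```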
---

### POST /crawl/http

Fast HTTP-only crawling endpoint for static content and APIs.

#### Request

**Headers:**

```
Content-Type: application/json
Authorization: Bearer <token>
```

**Body:**

```json
{
  "urls": ["https://api.example.com/data"],
  "http_config": {
    "method": "GET",
    "headers": {"Accept": "application/json"},
    "timeout": 30,
    "follow_redirects": true,
    "verify_ssl": true
  },
  "crawler_config": {
    "word_count_threshold": 10,
    "extraction_strategy": "NoExtractionStrategy"
  },
  "dispatcher": "memory_adaptive"
}
```

#### Response

```json
{
  "success": true,
  "results": [
    {
      "url": "https://api.example.com/data",
      "html": "...",
      "markdown": "# API Response\n\n...",
      "cleaned_html": "<div>...</div>",
      "success": true,
      "status_code": 200,
      "metadata": {
        "title": "API Data",
        "description": "JSON response data"
      },
      "links": {
        "internal": [],
        "external": []
      },
      "media": {
        "images": []
      }
    }
  ],
  "server_processing_time_s": 0.15,
  "server_memory_delta_mb": 1.2
}
```

#### Configuration Options

**HTTP Config:**

```json
{
  "method": "GET",              // HTTP method (GET, POST, PUT, etc.)
  "headers": {                  // Custom HTTP headers
    "User-Agent": "Crawl4AI/1.0",
    "Accept": "application/json"
  },
  "data": "form=data",          // Form data for POST requests
  "json": {"key": "value"},     // JSON data for POST requests
  "timeout": 30,                // Request timeout in seconds
  "follow_redirects": true,     // Follow HTTP redirects
  "verify_ssl": true,           // Verify SSL certificates
  "params": {"key": "value"}    // URL query parameters
}
```

**Crawler Config:**

```json
{
  "word_count_threshold": 10,          // Minimum words per block
  "extraction_strategy": "NoExtractionStrategy", // Use lightweight extraction
  "remove_overlay_elements": false,    // No overlays in HTTP responses
  "css_selector": ".content",          // Extract specific elements
  "excluded_tags": ["script", "style"] // Tags to exclude
}
```

#### Examples

=== "Python"

    ```python
    import requests

    # Get token first
    token_response = requests.post(
        "http://localhost:11235/token",
        json={"email": "your@email.com"}
    )
    token = token_response.json()["access_token"]

    # Fast HTTP-only crawl
    response = requests.post(
        "http://localhost:11235/crawl/http",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json"
        },
        json={
            "urls": ["https://httpbin.org/json"],
            "http_config": {
                "method": "GET",
                "headers": {"Accept": "application/json"},
                "timeout": 10
            },
            "crawler_config": {
                "extraction_strategy": "NoExtractionStrategy"
            }
        }
    )

    data = response.json()
    if data["success"]:
        result = data["results"][0]
        print(f"Status: {result['status_code']}")
        print(f"Response time: {data['server_processing_time_s']:.2f}s")
        print(f"Content length: {len(result['html'])} chars")
    ```

=== "cURL"

    ```bash
    # Get token
    TOKEN=$(curl -X POST http://localhost:11235/token \
      -H "Content-Type: application/json" \
      -d '{"email": "your@email.com"}' | jq -r '.access_token')

    # HTTP-only crawl
    curl -X POST http://localhost:11235/crawl/http \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "urls": ["https://httpbin.org/json"],
        "http_config": {
          "method": "GET",
          "headers": {"Accept": "application/json"},
          "timeout": 10
        },
        "crawler_config": {
          "extraction_strategy": "NoExtractionStrategy"
        }
      }'
    ```

=== "JavaScript"

    ```javascript
    // Get token
    const tokenResponse = await fetch('http://localhost:11235/token', {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({email: 'your@email.com'})
    });
    const {access_token} = await tokenResponse.json();

    // HTTP-only crawl
    const response = await fetch('http://localhost:11235/crawl/http', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${access_token}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        urls: ['https://httpbin.org/json'],
        http_config: {
          method: 'GET',
          headers: {'Accept': 'application/json'},
          timeout: 10
        },
        crawler_config: {
          extraction_strategy: 'NoExtractionStrategy'
        }
      })
    });

    const data = await response.json();
    console.log('HTTP Crawl Results:', data.results);
    console.log(`Processed in ${data.server_processing_time_s}s`);
    ```

#### Use Cases

- **API Endpoints**: Crawl REST APIs and GraphQL endpoints
- **Static Websites**: Fast crawling of HTML pages without JavaScript
- **JSON/XML Feeds**: Extract data from RSS feeds and API responses
- **Sitemaps**: Process XML sitemaps and structured data
- **Headless CMS**: Crawl content management system APIs

#### Performance Benefits

- **1000x Faster**: No browser startup or JavaScript execution
- **Lower Resource Usage**: Minimal memory and CPU overhead
- **Higher Throughput**: Process thousands of URLs per minute
- **Cost Effective**: Ideal for large-scale data collection

---

### POST /crawl/http/stream

Streaming HTTP-only crawling with real-time progress updates.

#### Request

Same as `/crawl/http` endpoint.

#### Response

Server-Sent Events (SSE) stream:

```
data: {"type": "progress", "url": "https://api.example.com", "status": "started"}
data: {"type": "progress", "url": "https://api.example.com", "status": "fetching"}
data: {"type": "result", "url": "https://api.example.com", "data": {...}}
data: {"type": "complete", "success": true, "total_urls": 1}
```

#### Examples

=== "Python"

    ```python
    import requests
    import json

    response = requests.post(
        "http://localhost:11235/crawl/http/stream",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json"
        },
        json={
            "urls": ["https://httpbin.org/json", "https://httpbin.org/uuid"],
            "http_config": {"timeout": 5}
        },
        stream=True
    )

    for line in response.iter_lines():
        if line:
            line = line.decode('utf-8')
            if line.startswith('data: '):
                data = json.loads(line[6:])
                print(f"Event: {data.get('type')} - URL: {data.get('url', 'N/A')}")

                if data['type'] == 'result':
                    result = data['data']
                    print(f"  Status: {result['status_code']}")
                elif data['type'] == 'complete':
                    print(f"  Total processed: {data['total_urls']}")
                    break
    ```

=== "JavaScript"

    ```javascript
    // EventSource cannot send a POST body, so read the SSE stream via fetch
    const response = await fetch('http://localhost:11235/crawl/http/stream', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${token}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        urls: ['https://httpbin.org/json'],
        http_config: {timeout: 5}
      })
    });

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let buffer = '';

    while (true) {
      const {done, value} = await reader.read();
      if (done) break;
      buffer += decoder.decode(value, {stream: true});
      const lines = buffer.split('\n');
      buffer = lines.pop();  // keep any partial line for the next chunk

      for (const line of lines) {
        if (!line.startsWith('data: ')) continue;
        const data = JSON.parse(line.slice(6));
        switch (data.type) {
          case 'progress':
            console.log(`Progress: ${data.url} - ${data.status}`);
            break;
          case 'result':
            console.log(`Result: ${data.url} - Status ${data.data.status_code}`);
            break;
          case 'complete':
            console.log(`Complete: ${data.total_urls} URLs processed`);
            break;
        }
      }
    }
    ```

---

## Content Extraction Endpoints

### POST /md

Extract markdown content from a URL.

#### Request

```json
{
  "url": "https://example.com",
  "f": "markdown",
  "q": ""
}
```

Here `f` selects the markdown filter to apply, and `q` supplies an optional query used by query-based filters.

#### Response

```json
{
  "markdown": "# Example Domain\n\nThis domain is for use in...",
  "title": "Example Domain",
  "url": "https://example.com"
}
```

#### Examples

=== "Python"

    ```python
    response = requests.post(
        "http://localhost:11235/md",
        headers={"Authorization": f"Bearer {token}"},
        json={"url": "https://example.com"}
    )

    markdown = response.json()["markdown"]
    print(markdown)
    ```

---

### POST /html

Get clean HTML content.

#### Request

```json
{
  "url": "https://example.com",
  "only_text": false
}
```

#### Response

```json
{
  "html": "<html><head><title>Example Domain</title></head><body>...</body></html>",
  "url": "https://example.com"
}
```
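#### Examples

A minimal Python sketch in the same style as the other endpoints (token obtained as in the Quick Start):

```python
import requests

response = requests.post(
    "http://localhost:11235/html",
    headers={"Authorization": f"Bearer {token}"},
    json={"url": "https://example.com", "only_text": False}
)

# The sanitized HTML comes back as a plain string
html = response.json()["html"]
print(f"Received {len(html)} characters of clean HTML")
```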
", "url": "https://example.com" } ``` --- ### POST /screenshot Capture page screenshot. #### Request ```json { "url": "https://example.com", "options": { "full_page": true, "format": "png" } } ``` #### Response ```json { "screenshot": "base64_encoded_image_data", "format": "png", "url": "https://example.com" } ``` #### Examples === "Python" ```python import base64 response = requests.post( "http://localhost:11235/screenshot", headers={"Authorization": f"Bearer {token}"}, json={ "url": "https://example.com", "options": {"full_page": True} } ) screenshot_b64 = response.json()["screenshot"] screenshot_data = base64.b64decode(screenshot_b64) with open("screenshot.png", "wb") as f: f.write(screenshot_data) ``` --- ### POST /pdf Export page as PDF. #### Request ```json { "url": "https://example.com", "options": { "format": "A4", "print_background": true } } ``` #### Response ```json { "pdf": "base64_encoded_pdf_data", "url": "https://example.com" } ``` --- ### POST /execute_js Execute JavaScript on a page. #### Request ```json { "url": "https://example.com", "js_code": "document.querySelector('h1').textContent", "wait_for": "h1" } ``` #### Response ```json { "result": "Example Domain", "success": true, "url": "https://example.com" } ``` #### Examples === "Python" ```python response = requests.post( "http://localhost:11235/execute_js", headers={"Authorization": f"Bearer {token}"}, json={ "url": "https://example.com", "js_code": "document.title" } ) result = response.json()["result"] print(f"Page title: {result}") ``` --- ## Dispatcher Management ### GET /dispatchers List all available dispatcher types. #### Response ```json [ { "type": "memory_adaptive", "name": "Memory Adaptive Dispatcher", "description": "Dynamically adjusts concurrency based on system memory usage", "config": { "memory_threshold_percent": 70.0, "critical_threshold_percent": 85.0, "max_session_permit": 20 }, "features": [ "Dynamic concurrency adjustment", "Memory pressure monitoring" ] }, { "type": "semaphore", "name": "Semaphore Dispatcher", "description": "Fixed concurrency limit using semaphore", "config": { "semaphore_count": 5, "max_session_permit": 10 }, "features": [ "Fixed concurrency limit", "Simple semaphore control" ] } ] ``` #### Examples === "Python" ```python response = requests.get("http://localhost:11235/dispatchers") dispatchers = response.json() for dispatcher in dispatchers: print(f"{dispatcher['type']}: {dispatcher['name']}") ``` === "cURL" ```bash curl http://localhost:11235/dispatchers | jq ``` --- ### GET /dispatchers/default Get current default dispatcher information. #### Response ```json { "default_dispatcher": "memory_adaptive", "config": { "memory_threshold_percent": 70.0 } } ``` --- ### GET /dispatchers/stats Get dispatcher statistics and metrics. #### Response ```json { "current_dispatcher": "memory_adaptive", "active_sessions": 3, "queued_requests": 0, "memory_usage_percent": 45.2, "total_processed": 157 } ``` --- ## Adaptive Crawling ### POST /adaptive/crawl Start an adaptive crawl with automatic URL discovery. #### Request ```json { "start_url": "https://example.com", "config": { "max_depth": 2, "max_pages": 50, "adaptive_threshold": 0.5 } } ``` #### Response ```json { "task_id": "550e8400-e29b-41d4-a716-446655440000", "status": "started", "start_url": "https://example.com" } ``` --- ### GET /adaptive/status/{task_id} Check status of adaptive crawl task. 
---

### POST /execute_js

Execute JavaScript on a page.

#### Request

```json
{
  "url": "https://example.com",
  "js_code": "document.querySelector('h1').textContent",
  "wait_for": "h1"
}
```

#### Response

```json
{
  "result": "Example Domain",
  "success": true,
  "url": "https://example.com"
}
```

#### Examples

=== "Python"

    ```python
    response = requests.post(
        "http://localhost:11235/execute_js",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "url": "https://example.com",
            "js_code": "document.title"
        }
    )

    result = response.json()["result"]
    print(f"Page title: {result}")
    ```

---

## Dispatcher Management

### GET /dispatchers

List all available dispatcher types.

#### Response

```json
[
  {
    "type": "memory_adaptive",
    "name": "Memory Adaptive Dispatcher",
    "description": "Dynamically adjusts concurrency based on system memory usage",
    "config": {
      "memory_threshold_percent": 70.0,
      "critical_threshold_percent": 85.0,
      "max_session_permit": 20
    },
    "features": [
      "Dynamic concurrency adjustment",
      "Memory pressure monitoring"
    ]
  },
  {
    "type": "semaphore",
    "name": "Semaphore Dispatcher",
    "description": "Fixed concurrency limit using semaphore",
    "config": {
      "semaphore_count": 5,
      "max_session_permit": 10
    },
    "features": [
      "Fixed concurrency limit",
      "Simple semaphore control"
    ]
  }
]
```

#### Examples

=== "Python"

    ```python
    response = requests.get("http://localhost:11235/dispatchers")
    dispatchers = response.json()

    for dispatcher in dispatchers:
        print(f"{dispatcher['type']}: {dispatcher['name']}")
    ```

=== "cURL"

    ```bash
    curl http://localhost:11235/dispatchers | jq
    ```

---

### GET /dispatchers/default

Get current default dispatcher information.

#### Response

```json
{
  "default_dispatcher": "memory_adaptive",
  "config": {
    "memory_threshold_percent": 70.0
  }
}
```

---

### GET /dispatchers/stats

Get dispatcher statistics and metrics.

#### Response

```json
{
  "current_dispatcher": "memory_adaptive",
  "active_sessions": 3,
  "queued_requests": 0,
  "memory_usage_percent": 45.2,
  "total_processed": 157
}
```

---

## Adaptive Crawling

### POST /adaptive/crawl

Start an adaptive crawl with automatic URL discovery.

#### Request

```json
{
  "start_url": "https://example.com",
  "config": {
    "max_depth": 2,
    "max_pages": 50,
    "adaptive_threshold": 0.5
  }
}
```

#### Response

```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "started",
  "start_url": "https://example.com"
}
```

---

### GET /adaptive/status/{task_id}

Check status of adaptive crawl task.

#### Response

```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "pages_crawled": 23,
  "pages_queued": 15,
  "progress_percent": 46.0
}
```
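#### Examples

The two adaptive endpoints are typically used together: start a crawl, then poll the task until it leaves the `running` state. A minimal sketch (token obtained as in the Quick Start; the exact terminal status strings beyond `running` are an assumption):

```python
import time
import requests

BASE = "http://localhost:11235"
headers = {"Authorization": f"Bearer {token}"}

# Start the adaptive crawl
task = requests.post(
    f"{BASE}/adaptive/crawl",
    headers=headers,
    json={
        "start_url": "https://example.com",
        "config": {"max_depth": 2, "max_pages": 50}
    }
).json()

# Poll the status endpoint until the task finishes
while True:
    status = requests.get(
        f"{BASE}/adaptive/status/{task['task_id']}",
        headers=headers
    ).json()
    print(f"{status['status']}: {status.get('progress_percent', 0):.0f}%")
    if status["status"] != "running":
        break
    time.sleep(5)
```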
---

## Monitoring & Profiling

The monitoring endpoints provide real-time statistics, profiling capabilities, and health monitoring for your Crawl4AI instance.

### GET /monitoring/health

Health check endpoint for monitoring integration.

#### Response

```json
{
  "status": "healthy",
  "uptime_seconds": 3600,
  "timestamp": "2025-01-07T12:00:00Z"
}
```

#### Examples

=== "Python"

    ```python
    response = requests.get("http://localhost:11235/monitoring/health")
    health = response.json()

    print(f"Status: {health['status']}")
    print(f"Uptime: {health['uptime_seconds']}s")
    ```

=== "cURL"

    ```bash
    curl http://localhost:11235/monitoring/health
    ```

---

### GET /monitoring/stats

Get current crawler statistics and system metrics.

#### Response

```json
{
  "active_crawls": 2,
  "total_crawls": 150,
  "successful_crawls": 142,
  "failed_crawls": 8,
  "success_rate": 94.67,
  "avg_duration_ms": 1250.5,
  "total_bytes_processed": 15728640,
  "system_stats": {
    "cpu_percent": 45.2,
    "memory_percent": 62.8,
    "memory_used_mb": 2048,
    "memory_available_mb": 8192,
    "disk_usage_percent": 55.3,
    "active_processes": 127
  }
}
```

#### Examples

=== "Python"

    ```python
    response = requests.get("http://localhost:11235/monitoring/stats")
    stats = response.json()

    print(f"Active crawls: {stats['active_crawls']}")
    print(f"Success rate: {stats['success_rate']:.2f}%")
    print(f"CPU usage: {stats['system_stats']['cpu_percent']:.1f}%")
    print(f"Memory usage: {stats['system_stats']['memory_percent']:.1f}%")
    ```

=== "cURL"

    ```bash
    curl http://localhost:11235/monitoring/stats
    ```

---

### GET /monitoring/stats/stream

Server-Sent Events (SSE) stream of real-time statistics. Updates every 2 seconds.

#### Response

```
data: {"active_crawls": 2, "total_crawls": 150, ...}
data: {"active_crawls": 3, "total_crawls": 151, ...}
data: {"active_crawls": 2, "total_crawls": 151, ...}
```

#### Examples

=== "Python"

    ```python
    import requests
    import json

    # Stream real-time stats
    response = requests.get(
        "http://localhost:11235/monitoring/stats/stream",
        stream=True
    )

    for line in response.iter_lines():
        if line.startswith(b"data: "):
            data = json.loads(line[6:])  # Remove "data: " prefix
            print(f"Active: {data['active_crawls']}, "
                  f"Total: {data['total_crawls']}, "
                  f"CPU: {data['system_stats']['cpu_percent']:.1f}%")
    ```

=== "JavaScript"

    ```javascript
    const eventSource = new EventSource('http://localhost:11235/monitoring/stats/stream');

    eventSource.onmessage = (event) => {
      const stats = JSON.parse(event.data);
      console.log('Active crawls:', stats.active_crawls);
      console.log('CPU:', stats.system_stats.cpu_percent);
    };
    ```

---

### GET /monitoring/stats/urls

Get URL-specific statistics showing per-URL performance metrics.

#### Response

```json
[
  {
    "url": "https://example.com",
    "total_requests": 45,
    "successful_requests": 42,
    "failed_requests": 3,
    "avg_duration_ms": 850.3,
    "total_bytes_processed": 2621440,
    "last_request_time": "2025-01-07T12:00:00Z"
  },
  {
    "url": "https://python.org",
    "total_requests": 32,
    "successful_requests": 32,
    "failed_requests": 0,
    "avg_duration_ms": 1120.7,
    "total_bytes_processed": 1835008,
    "last_request_time": "2025-01-07T11:55:00Z"
  }
]
```

#### Examples

=== "Python"

    ```python
    response = requests.get("http://localhost:11235/monitoring/stats/urls")
    url_stats = response.json()

    for stat in url_stats:
        success_rate = (stat['successful_requests'] / stat['total_requests']) * 100
        print(f"\nURL: {stat['url']}")
        print(f"  Requests: {stat['total_requests']}")
        print(f"  Success rate: {success_rate:.1f}%")
        print(f"  Avg time: {stat['avg_duration_ms']:.1f}ms")
        print(f"  Data processed: {stat['total_bytes_processed'] / 1024:.1f}KB")
    ```

---

### POST /monitoring/stats/reset

Reset all statistics counters. Useful for testing or starting fresh monitoring sessions.

#### Response

```json
{
  "status": "reset",
  "previous_stats": {
    "total_crawls": 150,
    "successful_crawls": 142,
    "failed_crawls": 8
  }
}
```

#### Examples

=== "Python"

    ```python
    response = requests.post("http://localhost:11235/monitoring/stats/reset")
    result = response.json()

    print(f"Stats reset. Previous total: {result['previous_stats']['total_crawls']}")
    ```

=== "cURL"

    ```bash
    curl -X POST http://localhost:11235/monitoring/stats/reset
    ```

---

### POST /monitoring/profile/start

Start a profiling session to monitor crawler performance over time.

#### Request

```json
{
  "urls": [
    "https://example.com",
    "https://python.org"
  ],
  "duration_seconds": 60,
  "browser_config": {
    "headless": true
  },
  "crawler_config": {
    "word_count_threshold": 10
  }
}
```

#### Response

```json
{
  "session_id": "prof_abc123xyz",
  "status": "running",
  "started_at": "2025-01-07T12:00:00Z",
  "urls": [
    "https://example.com",
    "https://python.org"
  ],
  "duration_seconds": 60
}
```

#### Examples

=== "Python"

    ```python
    # Start a profiling session
    response = requests.post(
        "http://localhost:11235/monitoring/profile/start",
        json={
            "urls": ["https://example.com", "https://python.org"],
            "duration_seconds": 60,
            "crawler_config": {
                "word_count_threshold": 10
            }
        }
    )

    session = response.json()
    session_id = session["session_id"]
    print(f"Profiling session started: {session_id}")
    print(f"Status: {session['status']}")
    ```
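Rather than sleeping for a fixed interval, you can poll the session endpoint documented next until the run completes. A minimal sketch (`session` comes from the start response above; the `completed` status string matches the examples below):

```python
import time
import requests

session_id = session["session_id"]  # from the /monitoring/profile/start response

# Poll until the profiling run reports completion
while True:
    info = requests.get(
        f"http://localhost:11235/monitoring/profile/{session_id}"
    ).json()
    if info["status"] == "completed":
        print(f"Avg response time: {info['results']['avg_response_time_ms']:.1f}ms")
        break
    time.sleep(5)
```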
---

### GET /monitoring/profile/{session_id}

Get profiling session details and results.

#### Response

```json
{
  "session_id": "prof_abc123xyz",
  "status": "completed",
  "started_at": "2025-01-07T12:00:00Z",
  "completed_at": "2025-01-07T12:01:00Z",
  "duration_seconds": 60,
  "urls": ["https://example.com", "https://python.org"],
  "results": {
    "total_requests": 120,
    "successful_requests": 115,
    "failed_requests": 5,
    "avg_response_time_ms": 950.3,
    "system_metrics": {
      "avg_cpu_percent": 48.5,
      "peak_cpu_percent": 72.3,
      "avg_memory_percent": 55.2,
      "peak_memory_percent": 68.9,
      "total_bytes_processed": 5242880
    }
  }
}
```

#### Examples

=== "Python"

    ```python
    import time

    # Start session
    start_response = requests.post(
        "http://localhost:11235/monitoring/profile/start",
        json={
            "urls": ["https://example.com"],
            "duration_seconds": 30
        }
    )
    session_id = start_response.json()["session_id"]

    # Wait for completion
    time.sleep(32)

    # Get results
    result_response = requests.get(
        f"http://localhost:11235/monitoring/profile/{session_id}"
    )
    session = result_response.json()

    print(f"Session: {session_id}")
    print(f"Status: {session['status']}")

    if session['status'] == 'completed':
        results = session['results']
        print(f"\nResults:")
        print(f"  Total requests: {results['total_requests']}")
        print(f"  Success rate: {results['successful_requests'] / results['total_requests'] * 100:.1f}%")
        print(f"  Avg response time: {results['avg_response_time_ms']:.1f}ms")
        print(f"\nSystem Metrics:")
        print(f"  Avg CPU: {results['system_metrics']['avg_cpu_percent']:.1f}%")
        print(f"  Peak CPU: {results['system_metrics']['peak_cpu_percent']:.1f}%")
        print(f"  Avg Memory: {results['system_metrics']['avg_memory_percent']:.1f}%")
    ```

---

### GET /monitoring/profile

List all profiling sessions.

#### Response

```json
{
  "sessions": [
    {
      "session_id": "prof_abc123xyz",
      "status": "completed",
      "started_at": "2025-01-07T12:00:00Z",
      "completed_at": "2025-01-07T12:01:00Z",
      "duration_seconds": 60,
      "urls": ["https://example.com"]
    },
    {
      "session_id": "prof_def456uvw",
      "status": "running",
      "started_at": "2025-01-07T12:05:00Z",
      "duration_seconds": 120,
      "urls": ["https://python.org", "https://github.com"]
    }
  ]
}
```

#### Examples

=== "Python"

    ```python
    response = requests.get("http://localhost:11235/monitoring/profile")
    data = response.json()

    print(f"Total sessions: {len(data['sessions'])}")
    for session in data['sessions']:
        print(f"\n{session['session_id']}")
        print(f"  Status: {session['status']}")
        print(f"  URLs: {', '.join(session['urls'])}")
        print(f"  Duration: {session['duration_seconds']}s")
    ```

---

### DELETE /monitoring/profile/{session_id}

Delete a profiling session.

#### Response

```json
{
  "status": "deleted",
  "session_id": "prof_abc123xyz"
}
```

#### Examples

=== "Python"

    ```python
    response = requests.delete(
        f"http://localhost:11235/monitoring/profile/{session_id}"
    )

    if response.status_code == 200:
        print(f"Session {session_id} deleted")
    ```

---

### POST /monitoring/profile/cleanup

Clean up old profiling sessions.

#### Request

```json
{
  "max_age_seconds": 3600
}
```

#### Response

```json
{
  "deleted_count": 5,
  "remaining_count": 3
}
```

#### Examples

=== "Python"

    ```python
    # Delete sessions older than 1 hour
    response = requests.post(
        "http://localhost:11235/monitoring/profile/cleanup",
        json={"max_age_seconds": 3600}
    )

    result = response.json()
    print(f"Deleted {result['deleted_count']} old sessions")
    print(f"Remaining: {result['remaining_count']}")
    ```

---

### Monitoring Dashboard Demo

We provide an interactive terminal-based dashboard for monitoring.
Run it with:

```bash
python tests/docker/extended_features/demo_monitoring_dashboard.py --url http://localhost:11235
```

**Features:**

- Real-time statistics with auto-refresh
- System resource monitoring (CPU, Memory, Disk)
- URL-specific performance metrics
- Profiling session management
- Interactive commands (view, create, delete sessions)
- Color-coded status indicators

**Dashboard Commands:**

- `[D]` - Dashboard view (default)
- `[S]` - Profiling sessions view
- `[U]` - URL statistics view
- `[R]` - Reset statistics
- `[N]` - Create new profiling session (from sessions view)
- `[V]` - View session details (from sessions view)
- `[X]` - Delete session (from sessions view)
- `[Q]` - Quit

---

## Utility Endpoints

### POST /token

Get authentication token for API access.

#### Request

```json
{
  "email": "your@email.com"
}
```

#### Response

```json
{
  "email": "your@email.com",
  "access_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "token_type": "bearer"
}
```

#### Examples

=== "Python"

    ```python
    response = requests.post(
        "http://localhost:11235/token",
        json={"email": "your@email.com"}
    )

    token = response.json()["access_token"]
    ```

---

### GET /health

Health check endpoint.

#### Response

```json
{
  "status": "healthy",
  "version": "0.7.0",
  "uptime_seconds": 3600
}
```

---

### GET /metrics

Prometheus metrics endpoint (if enabled).

#### Response

```
# HELP crawl4ai_requests_total Total requests processed
# TYPE crawl4ai_requests_total counter
crawl4ai_requests_total 157.0

# HELP crawl4ai_request_duration_seconds Request duration
# TYPE crawl4ai_request_duration_seconds histogram
...
```

---

### GET /schema

Get Pydantic schemas for request/response models.

#### Response

```json
{
  "CrawlerRunConfig": {
    "type": "object",
    "properties": {
      "word_count_threshold": {"type": "integer"},
      ...
    }
  }
}
```

---

### GET /llm/{url}

Get LLM-friendly format of a URL.

#### Example

```bash
curl http://localhost:11235/llm/https://example.com
```

#### Response

```
# Example Domain

This domain is for use in illustrative examples in documents...

[Read more](https://example.com)
```

---

## Error Handling

All endpoints return standard HTTP status codes:

- **200 OK** - Request successful
- **400 Bad Request** - Invalid request parameters
- **401 Unauthorized** - Missing or invalid authentication token
- **404 Not Found** - Resource not found
- **429 Too Many Requests** - Rate limit exceeded
- **500 Internal Server Error** - Server error

### Error Response Format

```json
{
  "detail": "Error description",
  "error_code": "INVALID_URL",
  "status_code": 400
}
```

---

## Rate Limiting

The API implements rate limiting per IP address:

- Default: 100 requests per minute
- Configurable via `config.yml`
- Rate limit headers included in responses (used in the backoff sketch below):
  - `X-RateLimit-Limit`
  - `X-RateLimit-Remaining`
  - `X-RateLimit-Reset`
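When a request returns **429**, these headers tell you when to retry. A small sketch that honors `X-RateLimit-Reset` (assuming, as is common, that it carries a Unix timestamp; the fallback exponential delay is an assumption):

```python
import time
import requests

def post_with_backoff(url, max_retries=3, **kwargs):
    """Retry a POST on 429, waiting until the advertised reset time."""
    for attempt in range(max_retries):
        response = requests.post(url, **kwargs)
        if response.status_code != 429:
            return response
        reset = response.headers.get("X-RateLimit-Reset")
        # Wait until the reset timestamp, or back off exponentially
        wait = max(float(reset) - time.time(), 1) if reset else 2 ** attempt
        time.sleep(wait)
    return response
```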
---

## Best Practices

### 1. Authentication

Always store tokens securely and refresh them before expiration:

```python
class Crawl4AIClient:
    def __init__(self, email, base_url="http://localhost:11235"):
        self.email = email
        self.base_url = base_url
        self.token = None
        self.refresh_token()

    def refresh_token(self):
        response = requests.post(
            f"{self.base_url}/token",
            json={"email": self.email}
        )
        self.token = response.json()["access_token"]

    def get_headers(self):
        return {
            "Authorization": f"Bearer {self.token}",
            "Content-Type": "application/json"
        }
```

### 2. Batch Processing

For multiple URLs, send them in a single `/crawl` request:

```python
client = Crawl4AIClient("your@email.com")

response = requests.post(
    f"{client.base_url}/crawl",
    headers=client.get_headers(),
    json={
        "urls": [
            "https://example.com/page1",
            "https://example.com/page2",
            "https://example.com/page3"
        ],
        "dispatcher": "memory_adaptive"
    }
)
```

### 3. Error Handling

Always implement proper error handling:

```python
try:
    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()
    data = response.json()
except requests.exceptions.HTTPError as e:
    if e.response.status_code == 429:
        print("Rate limit exceeded, waiting...")
        time.sleep(60)
    else:
        print(f"HTTP error: {e}")
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
```

### 4. Streaming for Long-Running Tasks

Use the streaming endpoint for better progress tracking:

```python
import sseclient

response = requests.post(
    f"{client.base_url}/crawl/stream",
    headers=client.get_headers(),
    json={"urls": urls},
    stream=True
)

client_stream = sseclient.SSEClient(response)
for event in client_stream.events():
    data = json.loads(event.data)
    if data['type'] == 'progress':
        print(f"Progress: {data['status']}")
    elif data['type'] == 'result':
        process_result(data['data'])
```

---

## SDK Examples

### Complete Crawling Workflow

```python
import requests
import json
from typing import List, Dict

class Crawl4AIClient:
    def __init__(self, email: str, base_url: str = "http://localhost:11235"):
        self.base_url = base_url
        self.token = self._get_token(email)

    def _get_token(self, email: str) -> str:
        """Get authentication token"""
        response = requests.post(
            f"{self.base_url}/token",
            json={"email": email}
        )
        return response.json()["access_token"]

    def _headers(self) -> Dict[str, str]:
        """Get request headers with auth"""
        return {
            "Authorization": f"Bearer {self.token}",
            "Content-Type": "application/json"
        }

    def crawl(self, urls: List[str], **config) -> Dict:
        """Crawl one or more URLs"""
        response = requests.post(
            f"{self.base_url}/crawl",
            headers=self._headers(),
            json={"urls": urls, **config}
        )
        response.raise_for_status()
        return response.json()

    def seed_urls(self, url: str, max_urls: int = 20,
                  filter_type: str = "domain") -> List[str]:
        """Discover URLs from a website"""
        response = requests.post(
            f"{self.base_url}/seed",
            headers=self._headers(),
            json={
                "url": url,
                "config": {
                    "max_urls": max_urls,
                    "filter_type": filter_type
                }
            }
        )
        return response.json()["seed_url"]

    def screenshot(self, url: str, full_page: bool = True) -> bytes:
        """Capture screenshot and return image data"""
        import base64

        response = requests.post(
            f"{self.base_url}/screenshot",
            headers=self._headers(),
            json={
                "url": url,
                "options": {"full_page": full_page}
            }
        )
        screenshot_b64 = response.json()["screenshot"]
        return base64.b64decode(screenshot_b64)

    def get_markdown(self, url: str) -> str:
        """Extract markdown from URL"""
        response = requests.post(
            f"{self.base_url}/md",
            headers=self._headers(),
            json={"url": url}
        )
        return response.json()["markdown"]

# Usage
client = Crawl4AIClient("your@email.com")

# Seed URLs
urls = client.seed_urls("https://example.com", max_urls=10)
print(f"Found {len(urls)} URLs")

# Crawl URLs
results = client.crawl(
    urls=urls[:5],
    browser_config={"headless": True},
    crawler_config={"screenshot": True}
)

# Process results
for result in results["results"]:
    print(f"Title: {result['metadata']['title']}")
    print(f"Links: {len(result['links']['internal'])}")

# Get markdown
markdown = client.get_markdown("https://example.com")
print(markdown[:200])

# Capture screenshot
screenshot_data = client.screenshot("https://example.com")
with open("page.png", "wb") as f:
    f.write(screenshot_data)
```

---

## Configuration Reference

### Server Configuration

The server is configured via `config.yml`:

```yaml
server:
  host: "0.0.0.0"
  port: 11235
  workers: 4

security:
  enabled: true
  jwt_secret: "your-secret-key"
  token_expire_minutes: 60

rate_limiting:
  default_limit: "100/minute"
  storage_uri: "redis://localhost:6379"

observability:
  health_check:
    enabled: true
    endpoint: "/health"
  prometheus:
    enabled: true
    endpoint: "/metrics"
```

---

## Troubleshooting

### Common Issues

**1. Authentication Errors**

```
{"detail": "Invalid authentication credentials"}
```

Solution: Refresh your token.

**2. Rate Limit Exceeded**

```
{"detail": "Rate limit exceeded"}
```

Solution: Wait or implement exponential backoff.

**3. Timeout Errors**

```
{"detail": "Page load timeout"}
```

Solution: Increase `page_timeout` in `crawler_config`.

**4. Memory Issues**

```
{"detail": "Insufficient memory"}
```

Solution: Use the `semaphore` dispatcher with lower concurrency.

### Debug Mode

Enable verbose logging:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

---

## Additional Resources

- [GitHub Repository](https://github.com/unclecode/crawl4ai)
- [Full Documentation](https://docs.crawl4ai.com)
- [Discord Community](https://discord.gg/crawl4ai)
- [Issue Tracker](https://github.com/unclecode/crawl4ai/issues)

---

**Last Updated**: October 7, 2025
**API Version**: 0.7.0
**Status**: Production Ready ✅