Docker Server API Reference
The Crawl4AI Docker server provides a comprehensive REST API for web crawling, content extraction, and processing. This guide covers all available endpoints with practical examples.
🚀 Quick Start
Base URL
http://localhost:11235
Authentication
Most endpoints require JWT authentication. Get your token first:
curl -X POST http://localhost:11235/token \
-H "Content-Type: application/json" \
-d '{"email": "your@email.com"}'
Interactive Documentation
Visit http://localhost:11235/docs for interactive Swagger UI documentation.
📑 Table of Contents
Core Crawling
- POST /crawl - Main crawling endpoint
- POST /crawl/stream - Streaming crawl endpoint
- POST /crawl/http - HTTP-only crawling endpoint
- POST /crawl/http/stream - HTTP-only streaming crawl endpoint
- POST /seed - URL discovery and seeding
Content Extraction
- POST /md - Extract markdown from URL
- POST /html - Get clean HTML content
- POST /screenshot - Capture page screenshots
- POST /pdf - Export page as PDF
- POST /execute_js - Execute JavaScript on page
Dispatcher Management
- GET /dispatchers - List available dispatchers
- GET /dispatchers/default - Get default dispatcher
- GET /dispatchers/stats - Get dispatcher statistics
Adaptive Crawling
- POST /adaptive/crawl - Adaptive crawl with auto-discovery
- GET /adaptive/status/{task_id} - Check adaptive crawl status
Monitoring & Profiling
- GET /monitoring/health - Health check endpoint
- GET /monitoring/stats - Get current statistics
- GET /monitoring/stats/stream - Real-time statistics stream (SSE)
- GET /monitoring/stats/urls - URL-specific statistics
- POST /monitoring/stats/reset - Reset statistics
- POST /monitoring/profile/start - Start profiling session
- GET /monitoring/profile/{session_id} - Get profiling results
- GET /monitoring/profile - List profiling sessions
- DELETE /monitoring/profile/{session_id} - Delete session
- POST /monitoring/profile/cleanup - Cleanup old sessions
Utility Endpoints
- POST /token - Get authentication token
- GET /health - Health check
- GET /metrics - Prometheus metrics
- GET /schema - Get API schemas
- GET /llm/{url} - LLM-friendly format
Core Crawling Endpoints
POST /crawl
Main endpoint for crawling single or multiple URLs.
Request
Headers:
Content-Type: application/json
Authorization: Bearer <your_token>
Body:
{
"urls": ["https://example.com"],
"browser_config": {
"headless": true,
"viewport_width": 1920,
"viewport_height": 1080
},
"crawler_config": {
"word_count_threshold": 10,
"wait_until": "networkidle",
"screenshot": true
},
"dispatcher": "memory_adaptive"
}
Response
{
"success": true,
"results": [
{
"url": "https://example.com",
"html": "<html>...</html>",
"markdown": "# Example Domain\n\nThis domain is for use in...",
"cleaned_html": "<div>...</div>",
"screenshot": "base64_encoded_image_data",
"success": true,
"status_code": 200,
"extracted_content": {},
"metadata": {
"title": "Example Domain",
"description": "Example Domain Description"
},
"links": {
"internal": ["https://example.com/about"],
"external": ["https://other.com"]
},
"media": {
"images": [{"src": "image.jpg", "alt": "Image"}]
}
}
]
}
Examples
=== "Python" ```python import requests
# Get token first
token_response = requests.post(
"http://localhost:11235/token",
json={"email": "your@email.com"}
)
token = token_response.json()["access_token"]
# Crawl with basic config
response = requests.post(
"http://localhost:11235/crawl",
headers={
"Authorization": f"Bearer {token}",
"Content-Type": "application/json"
},
json={
"urls": ["https://example.com"],
"browser_config": {"headless": True},
"crawler_config": {"screenshot": True}
}
)
data = response.json()
if data["success"]:
result = data["results"][0]
print(f"Title: {result['metadata']['title']}")
print(f"Markdown length: {len(result['markdown'])}")
```
=== "cURL"
```bash
# Get token
TOKEN=$(curl -X POST http://localhost:11235/token \
  -H "Content-Type: application/json" \
  -d '{"email": "your@email.com"}' | jq -r '.access_token')
# Crawl URL
curl -X POST http://localhost:11235/crawl \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"browser_config": {"headless": true},
"crawler_config": {"screenshot": true}
}'
```
=== "JavaScript" ```javascript // Get token const tokenResponse = await fetch('http://localhost:11235/token', { method: 'POST', headers: {'Content-Type': 'application/json'}, body: JSON.stringify({email: 'your@email.com'}) }); const {access_token} = await tokenResponse.json();
// Crawl URL
const response = await fetch('http://localhost:11235/crawl', {
method: 'POST',
headers: {
'Authorization': `Bearer ${access_token}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
urls: ['https://example.com'],
browser_config: {headless: true},
crawler_config: {screenshot: true}
})
});
const data = await response.json();
console.log('Results:', data.results);
```
Configuration Options
Browser Config:
{
"headless": true, // Run browser in headless mode
"viewport_width": 1920, // Browser viewport width
"viewport_height": 1080, // Browser viewport height
"user_agent": "custom agent", // Custom user agent
"accept_downloads": false, // Enable file downloads
"use_managed_browser": false, // Use system browser
"java_script_enabled": true // Enable JavaScript execution
}
Crawler Config:
{
"word_count_threshold": 10, // Minimum words per block
"wait_until": "networkidle", // When to consider page loaded
"wait_for": "div.content", // CSS selector to wait for
"delay_before_return": 0.5, // Delay before returning (seconds)
"screenshot": true, // Capture screenshot
"pdf": false, // Generate PDF
"remove_overlay_elements": true,// Remove popups/modals
"simulate_user": false, // Simulate user interaction
"magic": false, // Auto-handle overlays
"adjust_viewport_to_content": false, // Auto-adjust viewport
"page_timeout": 60000, // Page load timeout (ms)
"js_code": "console.log('hi')", // Execute custom JS
"css_selector": ".content", // Extract specific element
"excluded_tags": ["nav", "footer"], // Tags to exclude
"exclude_external_links": true // Remove external links
}
Dispatcher Options:
- `memory_adaptive` - Dynamically adjusts concurrency based on available memory (default)
- `semaphore` - Fixed concurrency limit
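To pin a request to a specific dispatcher, pass its type in the top-level `dispatcher` field. A minimal sketch; the `build_crawl_payload` helper is illustrative, not part of the API:

```python
def build_crawl_payload(urls, dispatcher="memory_adaptive", **crawler_config):
    """Assemble a /crawl request body with an explicit dispatcher choice."""
    return {
        "urls": urls,
        "crawler_config": crawler_config,
        "dispatcher": dispatcher,
    }

if __name__ == "__main__":
    import requests  # assumes a running server and a valid token
    token = "<your_token>"
    response = requests.post(
        "http://localhost:11235/crawl",
        headers={"Authorization": f"Bearer {token}"},
        json=build_crawl_payload(["https://example.com"], dispatcher="semaphore"),
    )
    print(response.json()["success"])
```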
POST /crawl/stream
Streaming endpoint for real-time crawl progress.
Request
Same as /crawl endpoint.
Response
Server-Sent Events (SSE) stream:
data: {"type": "progress", "url": "https://example.com", "status": "started"}
data: {"type": "progress", "url": "https://example.com", "status": "fetching"}
data: {"type": "result", "url": "https://example.com", "data": {...}}
data: {"type": "complete", "success": true}
Examples
=== "Python" ```python import requests import json
response = requests.post(
"http://localhost:11235/crawl/stream",
headers={
"Authorization": f"Bearer {token}",
"Content-Type": "application/json"
},
json={"urls": ["https://example.com"]},
stream=True
)
for line in response.iter_lines():
if line:
line = line.decode('utf-8')
if line.startswith('data: '):
data = json.loads(line[6:])
print(f"Event: {data.get('type')} - {data}")
```
=== "JavaScript" ```javascript const eventSource = new EventSource( 'http://localhost:11235/crawl/stream' );
eventSource.onmessage = (event) => {
const data = JSON.parse(event.data);
console.log('Progress:', data);
if (data.type === 'complete') {
eventSource.close();
}
};
```
POST /seed
Discover and seed URLs from a website.
Request
{
"url": "https://www.nbcnews.com",
"config": {
"max_urls": 20,
"filter_type": "domain",
"exclude_external": true
}
}
Filter Types:
- `all` - Include all discovered URLs
- `domain` - Only URLs from the same domain
- `subdomain` - Only URLs from the same subdomain
Response
{
"seed_url": [
"https://www.nbcnews.com/news/page1",
"https://www.nbcnews.com/news/page2",
"https://www.nbcnews.com/about"
],
"count": 3
}
Examples
=== "Python" ```python response = requests.post( "http://localhost:11235/seed", headers={"Authorization": f"Bearer {token}"}, json={ "url": "https://www.nbcnews.com", "config": { "max_urls": 20, "filter_type": "domain" } } )
data = response.json()
urls = data["seed_url"]
print(f"Found {len(urls)} URLs")
```
=== "cURL"
```bash
curl -X POST http://localhost:11235/seed \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.nbcnews.com",
    "config": {
      "max_urls": 20,
      "filter_type": "domain"
    }
  }'
```
POST /crawl/http
Fast HTTP-only crawling endpoint for static content and APIs.
Request
Headers:
Content-Type: application/json
Authorization: Bearer <your_token>
Body:
{
"urls": ["https://api.example.com/data"],
"http_config": {
"method": "GET",
"headers": {"Accept": "application/json"},
"timeout": 30,
"follow_redirects": true,
"verify_ssl": true
},
"crawler_config": {
"word_count_threshold": 10,
"extraction_strategy": "NoExtractionStrategy"
},
"dispatcher": "memory_adaptive"
}
Response
{
"success": true,
"results": [
{
"url": "https://api.example.com/data",
"html": "<html>...</html>",
"markdown": "# API Response\n\n...",
"cleaned_html": "<div>...</div>",
"success": true,
"status_code": 200,
"metadata": {
"title": "API Data",
"description": "JSON response data"
},
"links": {
"internal": [],
"external": []
},
"media": {
"images": []
}
}
],
"server_processing_time_s": 0.15,
"server_memory_delta_mb": 1.2
}
Configuration Options
HTTP Config:
{
"method": "GET", // HTTP method (GET, POST, PUT, etc.)
"headers": { // Custom HTTP headers
"User-Agent": "Crawl4AI/1.0",
"Accept": "application/json"
},
"data": "form=data", // Form data for POST requests
"json": {"key": "value"}, // JSON data for POST requests
"timeout": 30, // Request timeout in seconds
"follow_redirects": true, // Follow HTTP redirects
"verify_ssl": true, // Verify SSL certificates
"params": {"key": "value"} // URL query parameters
}
Crawler Config:
{
"word_count_threshold": 10, // Minimum words per block
"extraction_strategy": "NoExtractionStrategy", // Use lightweight extraction
"remove_overlay_elements": false, // No overlays in HTTP responses
"css_selector": ".content", // Extract specific elements
"excluded_tags": ["script", "style"] // Tags to exclude
}
Examples
=== "Python" ```python import requests
# Get token first
token_response = requests.post(
"http://localhost:11235/token",
json={"email": "your@email.com"}
)
token = token_response.json()["access_token"]
# Fast HTTP-only crawl
response = requests.post(
"http://localhost:11235/crawl/http",
headers={
"Authorization": f"Bearer {token}",
"Content-Type": "application/json"
},
json={
"urls": ["https://httpbin.org/json"],
"http_config": {
"method": "GET",
"headers": {"Accept": "application/json"},
"timeout": 10
},
"crawler_config": {
"extraction_strategy": "NoExtractionStrategy"
}
}
)
data = response.json()
if data["success"]:
result = data["results"][0]
print(f"Status: {result['status_code']}")
print(f"Response time: {data['server_processing_time_s']:.2f}s")
print(f"Content length: {len(result['html'])} chars")
```
=== "cURL"
```bash
# Get token
TOKEN=$(curl -X POST http://localhost:11235/token \
  -H "Content-Type: application/json" \
  -d '{"email": "your@email.com"}' | jq -r '.access_token')
# HTTP-only crawl
curl -X POST http://localhost:11235/crawl/http \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://httpbin.org/json"],
"http_config": {
"method": "GET",
"headers": {"Accept": "application/json"},
"timeout": 10
},
"crawler_config": {
"extraction_strategy": "NoExtractionStrategy"
}
}'
```
=== "JavaScript" ```javascript // Get token const tokenResponse = await fetch('http://localhost:11235/token', { method: 'POST', headers: {'Content-Type': 'application/json'}, body: JSON.stringify({email: 'your@email.com'}) }); const {access_token} = await tokenResponse.json();
// HTTP-only crawl
const response = await fetch('http://localhost:11235/crawl/http', {
method: 'POST',
headers: {
'Authorization': `Bearer ${access_token}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
urls: ['https://httpbin.org/json'],
http_config: {
method: 'GET',
headers: {'Accept': 'application/json'},
timeout: 10
},
crawler_config: {
extraction_strategy: 'NoExtractionStrategy'
}
})
});
const data = await response.json();
console.log('HTTP Crawl Results:', data.results);
console.log(`Processed in ${data.server_processing_time_s}s`);
```
Use Cases
- API Endpoints: Crawl REST APIs and GraphQL endpoints
- Static Websites: Fast crawling of HTML pages without JavaScript
- JSON/XML Feeds: Extract data from RSS feeds and API responses
- Sitemaps: Process XML sitemaps and structured data
- Headless CMS: Crawl content management system APIs
Performance Benefits
- Dramatically Faster: skips browser startup and JavaScript execution entirely
- Lower Resource Usage: Minimal memory and CPU overhead
- Higher Throughput: Process thousands of URLs per minute
- Cost Effective: Ideal for large-scale data collection
POST /crawl/http/stream
Streaming HTTP-only crawling with real-time progress updates.
Request
Same as /crawl/http endpoint.
Response
Server-Sent Events (SSE) stream:
data: {"type": "progress", "url": "https://api.example.com", "status": "started"}
data: {"type": "progress", "url": "https://api.example.com", "status": "fetching"}
data: {"type": "result", "url": "https://api.example.com", "data": {...}}
data: {"type": "complete", "success": true, "total_urls": 1}
Examples
=== "Python" ```python import requests import json
response = requests.post(
"http://localhost:11235/crawl/http/stream",
headers={
"Authorization": f"Bearer {token}",
"Content-Type": "application/json"
},
json={
"urls": ["https://httpbin.org/json", "https://httpbin.org/uuid"],
"http_config": {"timeout": 5}
},
stream=True
)
for line in response.iter_lines():
if line:
line = line.decode('utf-8')
if line.startswith('data: '):
data = json.loads(line[6:])
print(f"Event: {data.get('type')} - URL: {data.get('url', 'N/A')}")
if data['type'] == 'result':
result = data['data']
print(f" Status: {result['status_code']}")
elif data['type'] == 'complete':
print(f" Total processed: {data['total_urls']}")
break
```
=== "JavaScript" ```javascript const eventSource = new EventSource( 'http://localhost:11235/crawl/http/stream' );
// Handle streaming events
eventSource.onmessage = (event) => {
const data = JSON.parse(event.data);
switch(data.type) {
case 'progress':
console.log(`Progress: ${data.url} - ${data.status}`);
break;
case 'result':
console.log(`Result: ${data.url} - Status ${data.data.status_code}`);
break;
case 'complete':
console.log(`Complete: ${data.total_urls} URLs processed`);
eventSource.close();
break;
}
};
// Send the request
fetch('http://localhost:11235/crawl/http/stream', {
method: 'POST',
headers: {
'Authorization': `Bearer ${token}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
urls: ['https://httpbin.org/json'],
http_config: {timeout: 5}
})
});
```
Content Extraction Endpoints
POST /md
Extract markdown content from a URL.
Request
{
"url": "https://example.com",
"f": "markdown",
"q": ""
}
Response
{
"markdown": "# Example Domain\n\nThis domain is for use in...",
"title": "Example Domain",
"url": "https://example.com"
}
Examples
=== "Python" ```python response = requests.post( "http://localhost:11235/md", headers={"Authorization": f"Bearer {token}"}, json={"url": "https://example.com"} )
markdown = response.json()["markdown"]
print(markdown)
```
POST /html
Get clean HTML content.
Request
{
"url": "https://example.com",
"only_text": false
}
Response
{
"html": "<div><h1>Example Domain</h1>...</div>",
"url": "https://example.com"
}
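The other endpoints' Python examples carry over directly; a minimal sketch for `/html` (the `html_request_body` helper is illustrative, and `token` is assumed from `POST /token`):

```python
def html_request_body(url, only_text=False):
    """Build the /html request body; set only_text to strip all markup."""
    return {"url": url, "only_text": only_text}

if __name__ == "__main__":
    import requests
    token = "<your_token>"  # from POST /token
    response = requests.post(
        "http://localhost:11235/html",
        headers={"Authorization": f"Bearer {token}"},
        json=html_request_body("https://example.com"),
    )
    print(response.json()["html"][:200])
```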
POST /screenshot
Capture page screenshot.
Request
{
"url": "https://example.com",
"options": {
"full_page": true,
"format": "png"
}
}
Response
{
"screenshot": "base64_encoded_image_data",
"format": "png",
"url": "https://example.com"
}
Examples
=== "Python" ```python import base64
response = requests.post(
"http://localhost:11235/screenshot",
headers={"Authorization": f"Bearer {token}"},
json={
"url": "https://example.com",
"options": {"full_page": True}
}
)
screenshot_b64 = response.json()["screenshot"]
screenshot_data = base64.b64decode(screenshot_b64)
with open("screenshot.png", "wb") as f:
f.write(screenshot_data)
```
POST /pdf
Export page as PDF.
Request
{
"url": "https://example.com",
"options": {
"format": "A4",
"print_background": true
}
}
Response
{
"pdf": "base64_encoded_pdf_data",
"url": "https://example.com"
}
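As with screenshots, the `pdf` field is base64-encoded and must be decoded before writing to disk. A sketch (the `decode_pdf` helper is illustrative):

```python
import base64

def decode_pdf(b64_data):
    """Decode the base64 `pdf` response field into raw bytes."""
    return base64.b64decode(b64_data)

if __name__ == "__main__":
    import requests
    token = "<your_token>"  # from POST /token
    response = requests.post(
        "http://localhost:11235/pdf",
        headers={"Authorization": f"Bearer {token}"},
        json={"url": "https://example.com", "options": {"format": "A4"}},
    )
    with open("page.pdf", "wb") as f:
        f.write(decode_pdf(response.json()["pdf"]))
```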
POST /execute_js
Execute JavaScript on a page.
Request
{
"url": "https://example.com",
"js_code": "document.querySelector('h1').textContent",
"wait_for": "h1"
}
Response
{
"result": "Example Domain",
"success": true,
"url": "https://example.com"
}
Examples
=== "Python" ```python response = requests.post( "http://localhost:11235/execute_js", headers={"Authorization": f"Bearer {token}"}, json={ "url": "https://example.com", "js_code": "document.title" } )
result = response.json()["result"]
print(f"Page title: {result}")
```
Dispatcher Management
GET /dispatchers
List all available dispatcher types.
Response
[
{
"type": "memory_adaptive",
"name": "Memory Adaptive Dispatcher",
"description": "Dynamically adjusts concurrency based on system memory usage",
"config": {
"memory_threshold_percent": 70.0,
"critical_threshold_percent": 85.0,
"max_session_permit": 20
},
"features": [
"Dynamic concurrency adjustment",
"Memory pressure monitoring"
]
},
{
"type": "semaphore",
"name": "Semaphore Dispatcher",
"description": "Fixed concurrency limit using semaphore",
"config": {
"semaphore_count": 5,
"max_session_permit": 10
},
"features": [
"Fixed concurrency limit",
"Simple semaphore control"
]
}
]
Examples
=== "Python" ```python response = requests.get("http://localhost:11235/dispatchers") dispatchers = response.json()
for dispatcher in dispatchers:
print(f"{dispatcher['type']}: {dispatcher['name']}")
```
=== "cURL"
```bash
curl http://localhost:11235/dispatchers | jq
```
GET /dispatchers/default
Get current default dispatcher information.
Response
{
"default_dispatcher": "memory_adaptive",
"config": {
"memory_threshold_percent": 70.0
}
}
GET /dispatchers/stats
Get dispatcher statistics and metrics.
Response
{
"current_dispatcher": "memory_adaptive",
"active_sessions": 3,
"queued_requests": 0,
"memory_usage_percent": 45.2,
"total_processed": 157
}
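These stats can drive simple client-side backpressure. A sketch that computes how far memory usage sits below the memory-adaptive dispatcher's default 70% threshold (the `memory_headroom` helper is illustrative):

```python
def memory_headroom(stats, threshold_percent=70.0):
    """Percentage points of memory left before the adaptive threshold."""
    return threshold_percent - stats["memory_usage_percent"]

if __name__ == "__main__":
    import requests
    stats = requests.get("http://localhost:11235/dispatchers/stats").json()
    print(f"{stats['active_sessions']} active sessions, "
          f"memory headroom: {memory_headroom(stats):.1f} points")
```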
Adaptive Crawling
POST /adaptive/crawl
Start an adaptive crawl with automatic URL discovery.
Request
{
"start_url": "https://example.com",
"config": {
"max_depth": 2,
"max_pages": 50,
"adaptive_threshold": 0.5
}
}
Response
{
"task_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "started",
"start_url": "https://example.com"
}
GET /adaptive/status/{task_id}
Check status of adaptive crawl task.
Response
{
"task_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "running",
"pages_crawled": 23,
"pages_queued": 15,
"progress_percent": 46.0
}
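A typical workflow starts the crawl and then polls the status endpoint until it finishes. The sketch below assumes `completed`, `failed`, and `cancelled` are the terminal status values; the docs above only show `started` and `running`, so verify the full set against your server version:

```python
import time

def is_terminal(status):
    """Statuses we treat as finished (assumed set; verify against your server)."""
    return status in {"completed", "failed", "cancelled"}

if __name__ == "__main__":
    import requests
    token = "<your_token>"  # from POST /token
    headers = {"Authorization": f"Bearer {token}"}
    task = requests.post(
        "http://localhost:11235/adaptive/crawl",
        headers=headers,
        json={"start_url": "https://example.com",
              "config": {"max_depth": 2, "max_pages": 50}},
    ).json()
    while True:
        status = requests.get(
            f"http://localhost:11235/adaptive/status/{task['task_id']}",
            headers=headers,
        ).json()
        print(f"{status['status']}: {status.get('progress_percent', 0):.0f}%")
        if is_terminal(status["status"]):
            break
        time.sleep(2)
```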
Monitoring & Profiling
The monitoring endpoints provide real-time statistics, profiling capabilities, and health monitoring for your Crawl4AI instance.
GET /monitoring/health
Health check endpoint for monitoring integration.
Response
{
"status": "healthy",
"uptime_seconds": 3600,
"timestamp": "2025-01-07T12:00:00Z"
}
Examples
=== "Python"
```python
response = requests.get("http://localhost:11235/monitoring/health")
health = response.json()
print(f"Status: {health['status']}")
print(f"Uptime: {health['uptime_seconds']}s")
```
=== "cURL"
```bash
curl http://localhost:11235/monitoring/health
```
GET /monitoring/stats
Get current crawler statistics and system metrics.
Response
{
"active_crawls": 2,
"total_crawls": 150,
"successful_crawls": 142,
"failed_crawls": 8,
"success_rate": 94.67,
"avg_duration_ms": 1250.5,
"total_bytes_processed": 15728640,
"system_stats": {
"cpu_percent": 45.2,
"memory_percent": 62.8,
"memory_used_mb": 2048,
"memory_available_mb": 8192,
"disk_usage_percent": 55.3,
"active_processes": 127
}
}
Examples
=== "Python" ```python response = requests.get("http://localhost:11235/monitoring/stats") stats = response.json()
print(f"Active crawls: {stats['active_crawls']}")
print(f"Success rate: {stats['success_rate']:.2f}%")
print(f"CPU usage: {stats['system_stats']['cpu_percent']:.1f}%")
print(f"Memory usage: {stats['system_stats']['memory_percent']:.1f}%")
```
=== "cURL"
```bash
curl http://localhost:11235/monitoring/stats
```
GET /monitoring/stats/stream
Server-Sent Events (SSE) stream of real-time statistics. Updates every 2 seconds.
Response
data: {"active_crawls": 2, "total_crawls": 150, ...}
data: {"active_crawls": 3, "total_crawls": 151, ...}
data: {"active_crawls": 2, "total_crawls": 151, ...}
Examples
=== "Python" ```python import requests import json
# Stream real-time stats
response = requests.get(
"http://localhost:11235/monitoring/stats/stream",
stream=True
)
for line in response.iter_lines():
if line.startswith(b"data: "):
data = json.loads(line[6:]) # Remove "data: " prefix
print(f"Active: {data['active_crawls']}, "
f"Total: {data['total_crawls']}, "
f"CPU: {data['system_stats']['cpu_percent']:.1f}%")
```
=== "JavaScript" ```javascript const eventSource = new EventSource('http://localhost:11235/monitoring/stats/stream');
eventSource.onmessage = (event) => {
const stats = JSON.parse(event.data);
console.log('Active crawls:', stats.active_crawls);
console.log('CPU:', stats.system_stats.cpu_percent);
};
```
GET /monitoring/stats/urls
Get URL-specific statistics showing per-URL performance metrics.
Response
[
{
"url": "https://example.com",
"total_requests": 45,
"successful_requests": 42,
"failed_requests": 3,
"avg_duration_ms": 850.3,
"total_bytes_processed": 2621440,
"last_request_time": "2025-01-07T12:00:00Z"
},
{
"url": "https://python.org",
"total_requests": 32,
"successful_requests": 32,
"failed_requests": 0,
"avg_duration_ms": 1120.7,
"total_bytes_processed": 1835008,
"last_request_time": "2025-01-07T11:55:00Z"
}
]
Examples
=== "Python" ```python response = requests.get("http://localhost:11235/monitoring/stats/urls") url_stats = response.json()
for stat in url_stats:
success_rate = (stat['successful_requests'] / stat['total_requests']) * 100
print(f"\nURL: {stat['url']}")
print(f" Requests: {stat['total_requests']}")
print(f" Success rate: {success_rate:.1f}%")
print(f" Avg time: {stat['avg_duration_ms']:.1f}ms")
print(f" Data processed: {stat['total_bytes_processed'] / 1024:.1f}KB")
```
POST /monitoring/stats/reset
Reset all statistics counters. Useful for testing or starting fresh monitoring sessions.
Response
{
"status": "reset",
"previous_stats": {
"total_crawls": 150,
"successful_crawls": 142,
"failed_crawls": 8
}
}
Examples
=== "Python"
```python
response = requests.post("http://localhost:11235/monitoring/stats/reset")
result = response.json()
print(f"Stats reset. Previous total: {result['previous_stats']['total_crawls']}")
```
=== "cURL"
```bash
curl -X POST http://localhost:11235/monitoring/stats/reset
```
POST /monitoring/profile/start
Start a profiling session to monitor crawler performance over time.
Request
{
"urls": [
"https://example.com",
"https://python.org"
],
"duration_seconds": 60,
"browser_config": {
"headless": true
},
"crawler_config": {
"word_count_threshold": 10
}
}
Response
{
"session_id": "prof_abc123xyz",
"status": "running",
"started_at": "2025-01-07T12:00:00Z",
"urls": [
"https://example.com",
"https://python.org"
],
"duration_seconds": 60
}
Examples
=== "Python" ```python # Start a profiling session response = requests.post( "http://localhost:11235/monitoring/profile/start", json={ "urls": ["https://example.com", "https://python.org"], "duration_seconds": 60, "crawler_config": { "word_count_threshold": 10 } } )
session = response.json()
session_id = session["session_id"]
print(f"Profiling session started: {session_id}")
print(f"Status: {session['status']}")
```
GET /monitoring/profile/{session_id}
Get profiling session details and results.
Response
{
"session_id": "prof_abc123xyz",
"status": "completed",
"started_at": "2025-01-07T12:00:00Z",
"completed_at": "2025-01-07T12:01:00Z",
"duration_seconds": 60,
"urls": ["https://example.com", "https://python.org"],
"results": {
"total_requests": 120,
"successful_requests": 115,
"failed_requests": 5,
"avg_response_time_ms": 950.3,
"system_metrics": {
"avg_cpu_percent": 48.5,
"peak_cpu_percent": 72.3,
"avg_memory_percent": 55.2,
"peak_memory_percent": 68.9,
"total_bytes_processed": 5242880
}
}
}
Examples
=== "Python" ```python import time
# Start session
start_response = requests.post(
"http://localhost:11235/monitoring/profile/start",
json={
"urls": ["https://example.com"],
"duration_seconds": 30
}
)
session_id = start_response.json()["session_id"]
# Wait for completion
time.sleep(32)
# Get results
result_response = requests.get(
f"http://localhost:11235/monitoring/profile/{session_id}"
)
session = result_response.json()
print(f"Session: {session_id}")
print(f"Status: {session['status']}")
if session['status'] == 'completed':
results = session['results']
print(f"\nResults:")
print(f" Total requests: {results['total_requests']}")
print(f" Success rate: {results['successful_requests'] / results['total_requests'] * 100:.1f}%")
print(f" Avg response time: {results['avg_response_time_ms']:.1f}ms")
print(f"\nSystem Metrics:")
print(f" Avg CPU: {results['system_metrics']['avg_cpu_percent']:.1f}%")
print(f" Peak CPU: {results['system_metrics']['peak_cpu_percent']:.1f}%")
print(f" Avg Memory: {results['system_metrics']['avg_memory_percent']:.1f}%")
```
GET /monitoring/profile
List all profiling sessions.
Response
{
"sessions": [
{
"session_id": "prof_abc123xyz",
"status": "completed",
"started_at": "2025-01-07T12:00:00Z",
"completed_at": "2025-01-07T12:01:00Z",
"duration_seconds": 60,
"urls": ["https://example.com"]
},
{
"session_id": "prof_def456uvw",
"status": "running",
"started_at": "2025-01-07T12:05:00Z",
"duration_seconds": 120,
"urls": ["https://python.org", "https://github.com"]
}
]
}
Examples
=== "Python" ```python response = requests.get("http://localhost:11235/monitoring/profile") data = response.json()
print(f"Total sessions: {len(data['sessions'])}")
for session in data['sessions']:
print(f"\n{session['session_id']}")
print(f" Status: {session['status']}")
print(f" URLs: {', '.join(session['urls'])}")
print(f" Duration: {session['duration_seconds']}s")
```
DELETE /monitoring/profile/{session_id}
Delete a profiling session.
Response
{
"status": "deleted",
"session_id": "prof_abc123xyz"
}
Examples
=== "Python" ```python response = requests.delete( f"http://localhost:11235/monitoring/profile/{session_id}" )
if response.status_code == 200:
print(f"Session {session_id} deleted")
```
POST /monitoring/profile/cleanup
Clean up old profiling sessions.
Request
{
"max_age_seconds": 3600
}
Response
{
"deleted_count": 5,
"remaining_count": 3
}
Examples
=== "Python" ```python # Delete sessions older than 1 hour response = requests.post( "http://localhost:11235/monitoring/profile/cleanup", json={"max_age_seconds": 3600} )
result = response.json()
print(f"Deleted {result['deleted_count']} old sessions")
print(f"Remaining: {result['remaining_count']}")
```
Monitoring Dashboard Demo
We provide an interactive terminal-based dashboard for monitoring. Run it with:
python tests/docker/extended_features/demo_monitoring_dashboard.py --url http://localhost:11235
Features:
- Real-time statistics with auto-refresh
- System resource monitoring (CPU, Memory, Disk)
- URL-specific performance metrics
- Profiling session management
- Interactive commands (view, create, delete sessions)
- Color-coded status indicators
Dashboard Commands:
- `[D]` - Dashboard view (default)
- `[S]` - Profiling sessions view
- `[U]` - URL statistics view
- `[R]` - Reset statistics
- `[N]` - Create new profiling session (from sessions view)
- `[V]` - View session details (from sessions view)
- `[X]` - Delete session (from sessions view)
- `[Q]` - Quit
Utility Endpoints
POST /token
Get authentication token for API access.
Request
{
"email": "your@email.com"
}
Response
{
"email": "your@email.com",
"access_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
"token_type": "bearer"
}
Examples
=== "Python" ```python response = requests.post( "http://localhost:11235/token", json={"email": "your@email.com"} )
token = response.json()["access_token"]
```
GET /health
Health check endpoint.
Response
{
"status": "healthy",
"version": "0.7.0",
"uptime_seconds": 3600
}
GET /metrics
Prometheus metrics endpoint (if enabled).
Response
# HELP crawl4ai_requests_total Total requests processed
# TYPE crawl4ai_requests_total counter
crawl4ai_requests_total 157.0
# HELP crawl4ai_request_duration_seconds Request duration
# TYPE crawl4ai_request_duration_seconds histogram
...
GET /schema
Get Pydantic schemas for request/response models.
Response
{
"CrawlerRunConfig": {
"type": "object",
"properties": {
"word_count_threshold": {"type": "integer"},
...
}
}
}
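The schema response is handy for checking which config keys your server version accepts before sending a crawl request. A sketch (the `config_fields` helper is illustrative):

```python
def config_fields(schemas, model="CrawlerRunConfig"):
    """List the property names a given model accepts, sorted for display."""
    return sorted(schemas[model]["properties"].keys())

if __name__ == "__main__":
    import requests
    schemas = requests.get("http://localhost:11235/schema").json()
    print(config_fields(schemas))
```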
GET /llm/{url}
Get LLM-friendly format of a URL.
Example
curl http://localhost:11235/llm/https://example.com
Response
# Example Domain
This domain is for use in illustrative examples in documents...
[Read more](https://example.com)
Error Handling
All endpoints return standard HTTP status codes:
- 200 OK - Request successful
- 400 Bad Request - Invalid request parameters
- 401 Unauthorized - Missing or invalid authentication token
- 404 Not Found - Resource not found
- 429 Too Many Requests - Rate limit exceeded
- 500 Internal Server Error - Server error
Error Response Format
{
"detail": "Error description",
"error_code": "INVALID_URL",
"status_code": 400
}
Rate Limiting
The API implements rate limiting per IP address:
- Default: 100 requests per minute
- Configurable via `config.yml`
- Rate limit headers included in responses:
  - `X-RateLimit-Limit`
  - `X-RateLimit-Remaining`
  - `X-RateLimit-Reset`
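These headers can drive client-side throttling. The sketch below assumes `X-RateLimit-Reset` carries an epoch timestamp; check your server's rate-limiting config, as the exact semantics may differ:

```python
def seconds_until_reset(headers, now):
    """Seconds to wait before the next request, or 0 if quota remains."""
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0
    # Out of quota: wait until the (assumed) epoch-seconds reset time
    return max(0, int(headers.get("X-RateLimit-Reset", now)) - now)
```

Call it with `response.headers` and `int(time.time())` after each request, sleeping for the returned number of seconds before retrying.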
Best Practices
1. Authentication
Always store tokens securely and refresh before expiration:
class Crawl4AIClient:
def __init__(self, email, base_url="http://localhost:11235"):
self.email = email
self.base_url = base_url
self.token = None
self.refresh_token()
def refresh_token(self):
response = requests.post(
f"{self.base_url}/token",
json={"email": self.email}
)
self.token = response.json()["access_token"]
def get_headers(self):
return {
"Authorization": f"Bearer {self.token}",
"Content-Type": "application/json"
}
2. Batch Processing
For multiple URLs, pass them all to the /crawl endpoint in a single request:
client = Crawl4AIClient("your@email.com")
response = requests.post(
f"{client.base_url}/crawl",
headers=client.get_headers(),
json={
"urls": [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3"
],
"dispatcher": "memory_adaptive"
}
)
3. Error Handling
Always implement proper error handling:
try:
response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
data = response.json()
except requests.exceptions.HTTPError as e:
if e.response.status_code == 429:
print("Rate limit exceeded, waiting...")
time.sleep(60)
else:
print(f"HTTP error: {e}")
except requests.exceptions.RequestException as e:
print(f"Request failed: {e}")
4. Streaming for Long-Running Tasks
Use streaming endpoint for better progress tracking:
import sseclient
response = requests.post(
f"{client.base_url}/crawl/stream",
headers=client.get_headers(),
json={"urls": urls},
stream=True
)
client_stream = sseclient.SSEClient(response)
for event in client_stream.events():
data = json.loads(event.data)
if data['type'] == 'progress':
print(f"Progress: {data['status']}")
elif data['type'] == 'result':
process_result(data['data'])
SDK Examples
Complete Crawling Workflow
import requests
import json
from typing import List, Dict
class Crawl4AIClient:
def __init__(self, email: str, base_url: str = "http://localhost:11235"):
self.base_url = base_url
self.token = self._get_token(email)
def _get_token(self, email: str) -> str:
"""Get authentication token"""
response = requests.post(
f"{self.base_url}/token",
json={"email": email}
)
return response.json()["access_token"]
def _headers(self) -> Dict[str, str]:
"""Get request headers with auth"""
return {
"Authorization": f"Bearer {self.token}",
"Content-Type": "application/json"
}
def crawl(self, urls: List[str], **config) -> Dict:
"""Crawl one or more URLs"""
response = requests.post(
f"{self.base_url}/crawl",
headers=self._headers(),
json={"urls": urls, **config}
)
response.raise_for_status()
return response.json()
def seed_urls(self, url: str, max_urls: int = 20, filter_type: str = "domain") -> List[str]:
"""Discover URLs from a website"""
response = requests.post(
f"{self.base_url}/seed",
headers=self._headers(),
json={
"url": url,
"config": {
"max_urls": max_urls,
"filter_type": filter_type
}
}
)
return response.json()["seed_url"]
def screenshot(self, url: str, full_page: bool = True) -> bytes:
"""Capture screenshot and return image data"""
import base64
response = requests.post(
f"{self.base_url}/screenshot",
headers=self._headers(),
json={
"url": url,
"options": {"full_page": full_page}
}
)
screenshot_b64 = response.json()["screenshot"]
return base64.b64decode(screenshot_b64)
def get_markdown(self, url: str) -> str:
"""Extract markdown from URL"""
response = requests.post(
f"{self.base_url}/md",
headers=self._headers(),
json={"url": url}
)
return response.json()["markdown"]
# Usage
client = Crawl4AIClient("your@email.com")
# Seed URLs
urls = client.seed_urls("https://example.com", max_urls=10)
print(f"Found {len(urls)} URLs")
# Crawl URLs
results = client.crawl(
urls=urls[:5],
browser_config={"headless": True},
crawler_config={"screenshot": True}
)
# Process results
for result in results["results"]:
print(f"Title: {result['metadata']['title']}")
print(f"Links: {len(result['links']['internal'])}")
# Get markdown
markdown = client.get_markdown("https://example.com")
print(markdown[:200])
# Capture screenshot
screenshot_data = client.screenshot("https://example.com")
with open("page.png", "wb") as f:
f.write(screenshot_data)
Configuration Reference
Server Configuration
The server is configured via config.yml:
server:
host: "0.0.0.0"
port: 11235
workers: 4
security:
enabled: true
jwt_secret: "your-secret-key"
token_expire_minutes: 60
rate_limiting:
default_limit: "100/minute"
storage_uri: "redis://localhost:6379"
observability:
health_check:
enabled: true
endpoint: "/health"
prometheus:
enabled: true
endpoint: "/metrics"
Troubleshooting
Common Issues
1. Authentication Errors
{"detail": "Invalid authentication credentials"}
Solution: Refresh your token
2. Rate Limit Exceeded
{"detail": "Rate limit exceeded"}
Solution: Wait or implement exponential backoff
3. Timeout Errors
{"detail": "Page load timeout"}
Solution: Increase page_timeout in crawler_config
4. Memory Issues
{"detail": "Insufficient memory"}
Solution: Use semaphore dispatcher with lower concurrency
Debug Mode
Enable verbose logging:
import logging
logging.basicConfig(level=logging.DEBUG)
Last Updated: October 7, 2025
API Version: 0.7.0
Status: Production Ready ✅