crawl4ai/deploy/docker/WEBHOOK_EXAMPLES.md
Commit 8a37710313 (author: Claude): feat: add webhook notifications for crawl job completion
Implements webhook support for the crawl job API to eliminate polling requirements.

Changes:
- Added WebhookConfig and WebhookPayload schemas to schemas.py
- Created webhook.py with WebhookDeliveryService class
- Integrated webhook notifications in api.py handle_crawl_job
- Updated job.py CrawlJobPayload to accept webhook_config
- Added webhook configuration section to config.yml
- Included comprehensive usage examples in WEBHOOK_EXAMPLES.md

Features:
- Webhook notifications on job completion (success/failure)
- Configurable data inclusion in webhook payload
- Custom webhook headers support
- Global default webhook URL configuration
- Exponential backoff retry logic (5 attempts: 1s, 2s, 4s, 8s, 16s)
- 30-second timeout per webhook call

Usage:
POST /crawl/job with optional webhook_config:
- webhook_url: URL to receive notifications
- webhook_data_in_payload: include full results (default: false)
- webhook_headers: custom headers for authentication

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-21 16:17:40 +00:00


Webhook Feature Examples

This document provides examples of how to use the webhook feature for crawl jobs in Crawl4AI.

Overview

The webhook feature allows you to receive notifications when crawl jobs complete, eliminating the need for polling. Webhooks are sent with exponential backoff retry logic to ensure reliable delivery.

Configuration

Global Configuration (config.yml)

You can configure default webhook settings in config.yml:

webhooks:
  enabled: true
  default_url: null  # Optional: default webhook URL for all jobs
  data_in_payload: false  # Optional: default behavior for including data
  retry:
    max_attempts: 5
    initial_delay_ms: 1000  # 1s, 2s, 4s, 8s, 16s exponential backoff
    max_delay_ms: 32000
    timeout_ms: 30000  # 30s timeout per webhook call
  headers:  # Optional: default headers to include
    User-Agent: "Crawl4AI-Webhook/1.0"

API Usage Examples

Example 1: Basic Webhook (Notification Only)

Send a webhook notification without including the crawl data in the payload.

Request:

curl -X POST http://localhost:11235/crawl/job \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "webhook_config": {
      "webhook_url": "https://myapp.com/webhooks/crawl-complete",
      "webhook_data_in_payload": false
    }
  }'

Response:

{
  "task_id": "crawl_a1b2c3d4"
}

Webhook Payload Received:

{
  "task_id": "crawl_a1b2c3d4",
  "task_type": "crawl",
  "status": "completed",
  "timestamp": "2025-10-21T10:30:00.000000+00:00",
  "urls": ["https://example.com"]
}

Your webhook handler should then fetch the results:

curl http://localhost:11235/crawl/job/crawl_a1b2c3d4

Example 2: Webhook with Data Included

Include the full crawl results in the webhook payload.

Request:

curl -X POST http://localhost:11235/crawl/job \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "webhook_config": {
      "webhook_url": "https://myapp.com/webhooks/crawl-complete",
      "webhook_data_in_payload": true
    }
  }'

Webhook Payload Received:

{
  "task_id": "crawl_a1b2c3d4",
  "task_type": "crawl",
  "status": "completed",
  "timestamp": "2025-10-21T10:30:00.000000+00:00",
  "urls": ["https://example.com"],
  "data": {
    "markdown": "...",
    "html": "...",
    "links": {...},
    "metadata": {...}
  }
}

Example 3: Webhook with Custom Headers

Include custom headers for authentication or identification.

Request:

curl -X POST http://localhost:11235/crawl/job \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "webhook_config": {
      "webhook_url": "https://myapp.com/webhooks/crawl-complete",
      "webhook_data_in_payload": false,
      "webhook_headers": {
        "X-Webhook-Secret": "my-secret-token",
        "X-Service-ID": "crawl4ai-production"
      }
    }
  }'

The webhook will be sent with these additional headers plus the default headers from config.
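Conceptually, the final header set is a merge of the configured defaults and the per-job headers. A minimal sketch of that merge, assuming per-job headers take precedence over defaults (the actual precedence inside WebhookDeliveryService is not confirmed here):

```python
# Sketch: merging default and per-job webhook headers.
# The precedence (per-job wins over config defaults) is an assumption,
# not confirmed against the Crawl4AI source.
default_headers = {
    "User-Agent": "Crawl4AI-Webhook/1.0",   # from config.yml
    "Content-Type": "application/json",
}
job_headers = {
    "X-Webhook-Secret": "my-secret-token",   # from webhook_config
    "X-Service-ID": "crawl4ai-production",
}

# Later keys override earlier ones, so job_headers wins on conflicts.
final_headers = {**default_headers, **job_headers}
print(final_headers)
```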

Example 4: Failure Notification

When a crawl job fails, a webhook is sent with error details.

Webhook Payload on Failure:

{
  "task_id": "crawl_a1b2c3d4",
  "task_type": "crawl",
  "status": "failed",
  "timestamp": "2025-10-21T10:30:00.000000+00:00",
  "urls": ["https://example.com"],
  "error": "Connection timeout after 30s"
}

Example 5: Using Global Default Webhook

If you set a default_url in config.yml, jobs without webhook_config will use it:

config.yml:

webhooks:
  enabled: true
  default_url: "https://myapp.com/webhooks/default"
  data_in_payload: false

Request (no webhook_config needed):

curl -X POST http://localhost:11235/crawl/job \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"]
  }'

The webhook will be sent to the default URL configured in config.yml.

Webhook Handler Example

Here's a simple Python Flask webhook handler:

from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

@app.route('/webhooks/crawl-complete', methods=['POST'])
def handle_crawl_webhook():
    payload = request.json

    task_id = payload['task_id']
    status = payload['status']

    if status == 'completed':
        # If data not in payload, fetch it
        if 'data' not in payload:
            response = requests.get(f'http://localhost:11235/crawl/job/{task_id}')
            data = response.json()
        else:
            data = payload['data']

        # Process the crawl data
        print(f"Processing crawl results for {task_id}")
        # Your business logic here...

    elif status == 'failed':
        error = payload.get('error', 'Unknown error')
        print(f"Crawl job {task_id} failed: {error}")
        # Handle failure...

    return jsonify({"status": "received"}), 200

if __name__ == '__main__':
    app.run(port=8080)
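If you send an X-Webhook-Secret header as in Example 3, the handler should verify it before trusting the payload. A minimal sketch using a constant-time comparison (the header name and secret value below are the ones assumed in Example 3, not anything Crawl4AI mandates):

```python
import hmac
from typing import Optional

# Matches the webhook_headers value used in Example 3 (illustrative only).
EXPECTED_SECRET = "my-secret-token"

def is_authentic(received_secret: Optional[str]) -> bool:
    """Verify the shared secret using a constant-time comparison
    to avoid leaking information through timing differences."""
    if received_secret is None:
        return False
    return hmac.compare_digest(received_secret, EXPECTED_SECRET)
```

In the Flask handler above, call is_authentic(request.headers.get('X-Webhook-Secret')) at the top of handle_crawl_webhook and return a 401 when it fails; note that a 4xx response tells the delivery service not to retry, per the Retry Logic section.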

Retry Logic

The webhook delivery service uses exponential backoff retry logic:

  • Attempts: Up to 5 attempts by default
  • Delays: 1s → 2s → 4s → 8s → 16s
  • Timeout: 30 seconds per attempt
  • Retry Conditions:
    • Server errors (5xx status codes)
    • Network errors
    • Timeouts
  • No Retry:
    • Client errors (4xx status codes)
    • Successful delivery (2xx status codes)

Benefits

  1. No Polling Required - Eliminates constant API calls to check job status
  2. Real-time Notifications - Immediate notification when jobs complete
  3. Reliable Delivery - Exponential backoff retries make delivery resilient to transient failures (delivery is abandoned only after the maximum number of attempts)
  4. Flexible - Choose between notification-only or full data delivery
  5. Secure - Support for custom headers for authentication
  6. Configurable - Global defaults or per-job configuration

TypeScript Client Example

interface WebhookConfig {
  webhook_url: string;
  webhook_data_in_payload?: boolean;
  webhook_headers?: Record<string, string>;
}

interface CrawlJobRequest {
  urls: string[];
  browser_config?: Record<string, any>;
  crawler_config?: Record<string, any>;
  webhook_config?: WebhookConfig;
}

async function createCrawlJob(request: CrawlJobRequest) {
  const response = await fetch('http://localhost:11235/crawl/job', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(request)
  });

  const { task_id } = await response.json();
  return task_id;
}

// Usage (top-level await requires an ES module or an enclosing async function)
const taskId = await createCrawlJob({
  urls: ['https://example.com'],
  webhook_config: {
    webhook_url: 'https://myapp.com/webhooks/crawl-complete',
    webhook_data_in_payload: false,
    webhook_headers: {
      'X-Webhook-Secret': 'my-secret'
    }
  }
});

Monitoring and Debugging

Webhook delivery attempts are logged at INFO level:

  • Successful deliveries
  • Retry attempts with delays
  • Final failures after max attempts

Check the application logs for webhook delivery status:

docker logs crawl4ai-container | grep -i webhook