crawl4ai/deploy/docker/WEBHOOK_EXAMPLES.md
ntohidi d670dcde0a feat: add webhook support for /llm/job endpoint
Add comprehensive webhook notification support for the /llm/job endpoint,
following the same pattern as the existing /crawl/job implementation.

Changes:
- Add webhook_config field to LlmJobPayload model (job.py)
- Implement webhook notifications in process_llm_extraction() with 4
  notification points: success, provider validation failure, extraction
  failure, and general exceptions (api.py)
- Store webhook_config in Redis task data for job tracking
- Initialize WebhookDeliveryService with exponential backoff retry logic
Documentation:
- Add Example 6 to WEBHOOK_EXAMPLES.md showing LLM extraction with webhooks
- Update Flask webhook handler to support both crawl and llm_extraction tasks
- Add TypeScript client examples for LLM jobs
- Add comprehensive examples to docker_webhook_example.py with schema support
- Clarify data structure differences between webhook and API responses

Testing:
- Add test_llm_webhook_feature.py with 7 validation tests (all passing)
- Verify pattern consistency with /crawl/job implementation
- Add implementation guide (WEBHOOK_LLM_JOB_IMPLEMENTATION.md)
2025-10-22 13:03:09 +02:00


Webhook Feature Examples

This document provides examples of how to use the webhook feature for crawl and LLM extraction jobs in Crawl4AI.

Overview

The webhook feature allows you to receive notifications when jobs complete or fail, eliminating the need for polling. Webhooks are sent with exponential backoff retry logic to improve delivery reliability.

Configuration

Global Configuration (config.yml)

You can configure default webhook settings in config.yml:

webhooks:
  enabled: true
  default_url: null  # Optional: default webhook URL for all jobs
  data_in_payload: false  # Optional: default behavior for including data
  retry:
    max_attempts: 5
    initial_delay_ms: 1000  # 1s, 2s, 4s, 8s, 16s exponential backoff
    max_delay_ms: 32000
    timeout_ms: 30000  # 30s timeout per webhook call
  headers:  # Optional: default headers to include
    User-Agent: "Crawl4AI-Webhook/1.0"

API Usage Examples

Example 1: Basic Webhook (Notification Only)

Send a webhook notification without including the crawl data in the payload.

Request:

curl -X POST http://localhost:11235/crawl/job \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "webhook_config": {
      "webhook_url": "https://myapp.com/webhooks/crawl-complete",
      "webhook_data_in_payload": false
    }
  }'

Response:

{
  "task_id": "crawl_a1b2c3d4"
}

Webhook Payload Received:

{
  "task_id": "crawl_a1b2c3d4",
  "task_type": "crawl",
  "status": "completed",
  "timestamp": "2025-10-21T10:30:00.000000+00:00",
  "urls": ["https://example.com"]
}

Your webhook handler should then fetch the results:

curl http://localhost:11235/crawl/job/crawl_a1b2c3d4
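
The same submit-then-fetch flow can be sketched in Python. This is a client-side illustration, not part of the library: build_job_payload is a hypothetical helper, and the server is assumed to be listening locally on port 11235.

```python
def build_job_payload(urls, webhook_url, data_in_payload=False):
    """Assemble a /crawl/job request body with a webhook_config block.

    Hypothetical helper for illustration; the real request body is just
    the JSON shown in the curl example above.
    """
    return {
        "urls": urls,
        "webhook_config": {
            "webhook_url": webhook_url,
            "webhook_data_in_payload": data_in_payload,
        },
    }

if __name__ == "__main__":
    import requests  # third-party; only needed for the actual HTTP calls

    payload = build_job_payload(
        ["https://example.com"],
        "https://myapp.com/webhooks/crawl-complete",
    )
    # Submit the job; the response contains only the task_id.
    task_id = requests.post(
        "http://localhost:11235/crawl/job", json=payload
    ).json()["task_id"]
    # Later, after the webhook notification arrives, fetch full results:
    results = requests.get(
        f"http://localhost:11235/crawl/job/{task_id}"
    ).json()
```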

Example 2: Webhook with Data Included

Include the full crawl results in the webhook payload.

Request:

curl -X POST http://localhost:11235/crawl/job \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "webhook_config": {
      "webhook_url": "https://myapp.com/webhooks/crawl-complete",
      "webhook_data_in_payload": true
    }
  }'

Webhook Payload Received:

{
  "task_id": "crawl_a1b2c3d4",
  "task_type": "crawl",
  "status": "completed",
  "timestamp": "2025-10-21T10:30:00.000000+00:00",
  "urls": ["https://example.com"],
  "data": {
    "markdown": "...",
    "html": "...",
    "links": {...},
    "metadata": {...}
  }
}

Example 3: Webhook with Custom Headers

Include custom headers for authentication or identification.

Request:

curl -X POST http://localhost:11235/crawl/job \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "webhook_config": {
      "webhook_url": "https://myapp.com/webhooks/crawl-complete",
      "webhook_data_in_payload": false,
      "webhook_headers": {
        "X-Webhook-Secret": "my-secret-token",
        "X-Service-ID": "crawl4ai-production"
      }
    }
  }'

The webhook will be sent with these additional headers plus the default headers from config.
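
Conceptually, the per-job headers are merged over the configured defaults, with per-job values winning on a name clash. A minimal sketch of that merge, assuming the default User-Agent from the config.yml example above (merge_webhook_headers is illustrative, not the server's internal API):

```python
# Default headers as configured under webhooks.headers in config.yml.
DEFAULT_HEADERS = {"User-Agent": "Crawl4AI-Webhook/1.0"}

def merge_webhook_headers(job_headers=None):
    """Combine config defaults with per-job webhook_headers.

    Per-job headers override a default header of the same name.
    """
    merged = dict(DEFAULT_HEADERS)
    merged.update(job_headers or {})
    return merged
```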

Example 4: Failure Notification

When a crawl job fails, a webhook is sent with error details.

Webhook Payload on Failure:

{
  "task_id": "crawl_a1b2c3d4",
  "task_type": "crawl",
  "status": "failed",
  "timestamp": "2025-10-21T10:30:00.000000+00:00",
  "urls": ["https://example.com"],
  "error": "Connection timeout after 30s"
}

Example 5: Using Global Default Webhook

If you set a default_url in config.yml, jobs without webhook_config will use it:

config.yml:

webhooks:
  enabled: true
  default_url: "https://myapp.com/webhooks/default"
  data_in_payload: false

Request (no webhook_config needed):

curl -X POST http://localhost:11235/crawl/job \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"]
  }'

The webhook will be sent to the default URL configured in config.yml.

Example 6: LLM Extraction Job with Webhook

Use webhooks with the LLM extraction endpoint for asynchronous processing.

Request:

curl -X POST http://localhost:11235/llm/job \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "q": "Extract the article title, author, and publication date",
    "schema": "{\"type\": \"object\", \"properties\": {\"title\": {\"type\": \"string\"}, \"author\": {\"type\": \"string\"}, \"date\": {\"type\": \"string\"}}}",
    "cache": false,
    "provider": "openai/gpt-4o-mini",
    "webhook_config": {
      "webhook_url": "https://myapp.com/webhooks/llm-complete",
      "webhook_data_in_payload": true
    }
  }'

Response:

{
  "task_id": "llm_1698765432_12345"
}

Webhook Payload Received:

{
  "task_id": "llm_1698765432_12345",
  "task_type": "llm_extraction",
  "status": "completed",
  "timestamp": "2025-10-21T10:30:00.000000+00:00",
  "urls": ["https://example.com/article"],
  "data": {
    "extracted_content": {
      "title": "Understanding Web Scraping",
      "author": "John Doe",
      "date": "2025-10-21"
    }
  }
}
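
Note the shape difference called out in the commit notes: the webhook payload nests the output under extracted_content, while fetching GET /llm/job/{task_id} returns it under result. A small hypothetical helper can accept either shape:

```python
def get_extracted(data):
    """Return the LLM output from either payload shape.

    Webhook payloads use the 'extracted_content' key; the
    GET /llm/job/{task_id} API response uses 'result'.
    """
    return data.get("extracted_content", data.get("result"))
```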

Webhook Handler Example

Here's a simple Python Flask webhook handler that supports both crawl and LLM extraction jobs:

from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

@app.route('/webhooks/crawl-complete', methods=['POST'])
def handle_crawl_webhook():
    payload = request.json

    task_id = payload['task_id']
    task_type = payload['task_type']
    status = payload['status']

    if status == 'completed':
        # If data not in payload, fetch it
        if 'data' not in payload:
            # Determine endpoint based on task type
            endpoint = 'crawl' if task_type == 'crawl' else 'llm'
            response = requests.get(f'http://localhost:11235/{endpoint}/job/{task_id}')
            response.raise_for_status()
            data = response.json()
        else:
            data = payload['data']

        # Process based on task type
        if task_type == 'crawl':
            print(f"Processing crawl results for {task_id}")
            # Handle crawl results
            results = data.get('results', [])
            for result in results:
                print(f"  - {result.get('url')}: {len(result.get('markdown', ''))} chars")

        elif task_type == 'llm_extraction':
            print(f"Processing LLM extraction for {task_id}")
            # Handle LLM extraction
            # Note: Webhook sends 'extracted_content', API returns 'result'
            extracted = data.get('extracted_content', data.get('result', {}))
            print(f"  - Extracted: {extracted}")

        # Your business logic here...

    elif status == 'failed':
        error = payload.get('error', 'Unknown error')
        print(f"{task_type} job {task_id} failed: {error}")
        # Handle failure...

    return jsonify({"status": "received"}), 200

if __name__ == '__main__':
    app.run(port=8080)

Retry Logic

The webhook delivery service uses exponential backoff retry logic:

  • Attempts: Up to 5 attempts by default
  • Delays: 1s → 2s → 4s → 8s → 16s
  • Timeout: 30 seconds per attempt
  • Retry Conditions:
    • Server errors (5xx status codes)
    • Network errors
    • Timeouts
  • No Retry:
    • Client errors (4xx status codes)
    • Successful delivery (2xx status codes)
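
Under the default config.yml values shown earlier, the backoff schedule can be sketched as a pure function (retry_delays_ms is illustrative, not the WebhookDeliveryService API):

```python
def retry_delays_ms(max_attempts=5, initial_delay_ms=1000, max_delay_ms=32000):
    """Backoff delay in ms for each attempt: doubled every step,
    capped at max_delay_ms."""
    return [
        min(initial_delay_ms * 2 ** i, max_delay_ms)
        for i in range(max_attempts)
    ]
```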

Benefits

  1. No Polling Required - Eliminates constant API calls to check job status
  2. Real-time Notifications - Immediate notification when jobs complete
  3. Reliable Delivery - Exponential backoff retries improve the odds of delivery
  4. Flexible - Choose between notification-only or full data delivery
  5. Secure - Support for custom headers for authentication
  6. Configurable - Global defaults or per-job configuration
  7. Universal Support - Works with both /crawl/job and /llm/job endpoints

TypeScript Client Example

interface WebhookConfig {
  webhook_url: string;
  webhook_data_in_payload?: boolean;
  webhook_headers?: Record<string, string>;
}

interface CrawlJobRequest {
  urls: string[];
  browser_config?: Record<string, any>;
  crawler_config?: Record<string, any>;
  webhook_config?: WebhookConfig;
}

interface LLMJobRequest {
  url: string;
  q: string;
  schema?: string;
  cache?: boolean;
  provider?: string;
  webhook_config?: WebhookConfig;
}

async function createCrawlJob(request: CrawlJobRequest) {
  const response = await fetch('http://localhost:11235/crawl/job', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(request)
  });

  const { task_id } = await response.json();
  return task_id;
}

async function createLLMJob(request: LLMJobRequest) {
  const response = await fetch('http://localhost:11235/llm/job', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(request)
  });

  const { task_id } = await response.json();
  return task_id;
}

// Usage - Crawl Job
const crawlTaskId = await createCrawlJob({
  urls: ['https://example.com'],
  webhook_config: {
    webhook_url: 'https://myapp.com/webhooks/crawl-complete',
    webhook_data_in_payload: false,
    webhook_headers: {
      'X-Webhook-Secret': 'my-secret'
    }
  }
});

// Usage - LLM Extraction Job
const llmTaskId = await createLLMJob({
  url: 'https://example.com/article',
  q: 'Extract the main points from this article',
  provider: 'openai/gpt-4o-mini',
  webhook_config: {
    webhook_url: 'https://myapp.com/webhooks/llm-complete',
    webhook_data_in_payload: true,
    webhook_headers: {
      'X-Webhook-Secret': 'my-secret'
    }
  }
});

Monitoring and Debugging

Webhook delivery attempts are logged at INFO level:

  • Successful deliveries
  • Retry attempts with delays
  • Final failures after max attempts

Check the application logs for webhook delivery status:

docker logs crawl4ai-container | grep -i webhook