feat: add webhook notifications for crawl job completion

Implements webhook support for the crawl job API to eliminate polling requirements. Changes: - Added WebhookConfig and WebhookPayload schemas to schemas.py - Created webhook.py with WebhookDeliveryService class - Integrated webhook notifications in api.py handle_crawl_job - Updated job.py CrawlJobPayload to accept webhook_config - Added webhook configuration section to config.yml - Included comprehensive usage examples in WEBHOOK_EXAMPLES.md Features: - Webhook notifications on job completion (success/failure) - Configurable data inclusion in webhook payload - Custom webhook headers support - Global default webhook URL configuration - Exponential backoff retry logic (5 attempts: 1s, 2s, 4s, 8s, 16s) - 30-second timeout per webhook call Usage: POST /crawl/job with optional webhook_config: - webhook_url: URL to receive notifications - webhook_data_in_payload: include full results (default: false) - webhook_headers: custom headers for authentication Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-21 16:17:40 +00:00
parent fdbcddbf1a
commit 8a37710313
6 changed files with 517 additions and 5 deletions
--- a/deploy/docker/WEBHOOK_EXAMPLES.md
+++ b/deploy/docker/WEBHOOK_EXAMPLES.md
@@ -0,0 +1,281 @@
+# Webhook Feature Examples
+
+This document provides examples of how to use the webhook feature for crawl jobs in Crawl4AI.
+
+## Overview
+
+The webhook feature allows you to receive notifications when crawl jobs complete, eliminating the need for polling. Webhooks are sent with exponential backoff retry logic to ensure reliable delivery.
+
+## Configuration
+
+### Global Configuration (config.yml)
+
+You can configure default webhook settings in `config.yml`:
+
+```yaml
+webhooks:
+  enabled: true
+  default_url: null  # Optional: default webhook URL for all jobs
+  data_in_payload: false  # Optional: default behavior for including data
+  retry:
+    max_attempts: 5
+    initial_delay_ms: 1000  # 1s, 2s, 4s, 8s, 16s exponential backoff
+    max_delay_ms: 32000
+    timeout_ms: 30000  # 30s timeout per webhook call
+  headers:  # Optional: default headers to include
+    User-Agent: "Crawl4AI-Webhook/1.0"
+```
+
+## API Usage Examples
+
+### Example 1: Basic Webhook (Notification Only)
+
+Send a webhook notification without including the crawl data in the payload.
+
+**Request:**
+```bash
+curl -X POST http://localhost:11235/crawl/job \
+  -H "Content-Type: application/json" \
+  -d '{
+    "urls": ["https://example.com"],
+    "webhook_config": {
+      "webhook_url": "https://myapp.com/webhooks/crawl-complete",
+      "webhook_data_in_payload": false
+    }
+  }'
+```
+
+**Response:**
+```json
+{
+  "task_id": "crawl_a1b2c3d4"
+}
+```
+
+**Webhook Payload Received:**
+```json
+{
+  "task_id": "crawl_a1b2c3d4",
+  "task_type": "crawl",
+  "status": "completed",
+  "timestamp": "2025-10-21T10:30:00.000000+00:00",
+  "urls": ["https://example.com"]
+}
+```
+
+Your webhook handler should then fetch the results:
+```bash
+curl http://localhost:11235/crawl/job/crawl_a1b2c3d4
+```
+
+### Example 2: Webhook with Data Included
+
+Include the full crawl results in the webhook payload.
+
+**Request:**
+```bash
+curl -X POST http://localhost:11235/crawl/job \
+  -H "Content-Type: application/json" \
+  -d '{
+    "urls": ["https://example.com"],
+    "webhook_config": {
+      "webhook_url": "https://myapp.com/webhooks/crawl-complete",
+      "webhook_data_in_payload": true
+    }
+  }'
+```
+
+**Webhook Payload Received:**
+```json
+{
+  "task_id": "crawl_a1b2c3d4",
+  "task_type": "crawl",
+  "status": "completed",
+  "timestamp": "2025-10-21T10:30:00.000000+00:00",
+  "urls": ["https://example.com"],
+  "data": {
+    "markdown": "...",
+    "html": "...",
+    "links": {...},
+    "metadata": {...}
+  }
+}
+```
+
+### Example 3: Webhook with Custom Headers
+
+Include custom headers for authentication or identification.
+
+**Request:**
+```bash
+curl -X POST http://localhost:11235/crawl/job \
+  -H "Content-Type: application/json" \
+  -d '{
+    "urls": ["https://example.com"],
+    "webhook_config": {
+      "webhook_url": "https://myapp.com/webhooks/crawl-complete",
+      "webhook_data_in_payload": false,
+      "webhook_headers": {
+        "X-Webhook-Secret": "my-secret-token",
+        "X-Service-ID": "crawl4ai-production"
+      }
+    }
+  }'
+```
+
+The webhook will be sent with these additional headers plus the default headers from config.
+
+### Example 4: Failure Notification
+
+When a crawl job fails, a webhook is sent with error details.
+
+**Webhook Payload on Failure:**
+```json
+{
+  "task_id": "crawl_a1b2c3d4",
+  "task_type": "crawl",
+  "status": "failed",
+  "timestamp": "2025-10-21T10:30:00.000000+00:00",
+  "urls": ["https://example.com"],
+  "error": "Connection timeout after 30s"
+}
+```
+
+### Example 5: Using Global Default Webhook
+
+If you set a `default_url` in config.yml, jobs without webhook_config will use it:
+
+**config.yml:**
+```yaml
+webhooks:
+  enabled: true
+  default_url: "https://myapp.com/webhooks/default"
+  data_in_payload: false
+```
+
+**Request (no webhook_config needed):**
+```bash
+curl -X POST http://localhost:11235/crawl/job \
+  -H "Content-Type: application/json" \
+  -d '{
+    "urls": ["https://example.com"]
+  }'
+```
+
+The webhook will be sent to the default URL configured in config.yml.
+
+## Webhook Handler Example
+
+Here's a simple Python Flask webhook handler:
+
+```python
+from flask import Flask, request, jsonify
+import requests
+
+app = Flask(__name__)
+
+@app.route('/webhooks/crawl-complete', methods=['POST'])
+def handle_crawl_webhook():
+    payload = request.json
+
+    task_id = payload['task_id']
+    status = payload['status']
+
+    if status == 'completed':
+        # If data not in payload, fetch it
+        if 'data' not in payload:
+            response = requests.get(f'http://localhost:11235/crawl/job/{task_id}')
+            data = response.json()
+        else:
+            data = payload['data']
+
+        # Process the crawl data
+        print(f"Processing crawl results for {task_id}")
+        # Your business logic here...
+
+    elif status == 'failed':
+        error = payload.get('error', 'Unknown error')
+        print(f"Crawl job {task_id} failed: {error}")
+        # Handle failure...
+
+    return jsonify({"status": "received"}), 200
+
+if __name__ == '__main__':
+    app.run(port=8080)
+```
+
+## Retry Logic
+
+The webhook delivery service uses exponential backoff retry logic:
+
+- **Attempts:** Up to 5 attempts by default
+- **Delays:** 1s → 2s → 4s → 8s → 16s
+- **Timeout:** 30 seconds per attempt
+- **Retry Conditions:**
+  - Server errors (5xx status codes)
+  - Network errors
+  - Timeouts
+- **No Retry:**
+  - Client errors (4xx status codes)
+  - Successful delivery (2xx status codes)
+
+## Benefits
+
+1. **No Polling Required** - Eliminates constant API calls to check job status
+2. **Real-time Notifications** - Immediate notification when jobs complete
+3. **Reliable Delivery** - Exponential backoff ensures webhooks are delivered
+4. **Flexible** - Choose between notification-only or full data delivery
+5. **Secure** - Support for custom headers for authentication
+6. **Configurable** - Global defaults or per-job configuration
+
+## TypeScript Client Example
+
+```typescript
+interface WebhookConfig {
+  webhook_url: string;
+  webhook_data_in_payload?: boolean;
+  webhook_headers?: Record<string, string>;
+}
+
+interface CrawlJobRequest {
+  urls: string[];
+  browser_config?: Record<string, any>;
+  crawler_config?: Record<string, any>;
+  webhook_config?: WebhookConfig;
+}
+
+async function createCrawlJob(request: CrawlJobRequest) {
+  const response = await fetch('http://localhost:11235/crawl/job', {
+    method: 'POST',
+    headers: { 'Content-Type': 'application/json' },
+    body: JSON.stringify(request)
+  });
+
+  const { task_id } = await response.json();
+  return task_id;
+}
+
+// Usage
+const taskId = await createCrawlJob({
+  urls: ['https://example.com'],
+  webhook_config: {
+    webhook_url: 'https://myapp.com/webhooks/crawl-complete',
+    webhook_data_in_payload: false,
+    webhook_headers: {
+      'X-Webhook-Secret': 'my-secret'
+    }
+  }
+});
+```
+
+## Monitoring and Debugging
+
+Webhook delivery attempts are logged at INFO level:
+- Successful deliveries
+- Retry attempts with delays
+- Final failures after max attempts
+
+Check the application logs for webhook delivery status:
+```bash
+docker logs crawl4ai-container | grep -i webhook
+```
--- a/deploy/docker/api.py
+++ b/deploy/docker/api.py
@@ -44,6 +44,7 @@ from utils import (
    get_llm_api_key,
    validate_llm_provider
 )
+from webhook import WebhookDeliveryService

 import psutil, time

@@ -567,6 +568,7 @@ async def handle_crawl_job(
    browser_config: Dict,
    crawler_config: Dict,
    config: Dict,
+    webhook_config: Optional[Dict] = None,
 ) -> Dict:
    """
    Fire-and-forget version of handle_crawl_request.
@@ -574,13 +576,24 @@ async def handle_crawl_job(
    lets /crawl/job/{task_id} polling fetch the result.
    """
    task_id = f"crawl_{uuid4().hex[:8]}"
-    await redis.hset(f"task:{task_id}", mapping={
+
+    # Store task data in Redis
+    task_data = {
        "status": TaskStatus.PROCESSING,         # <-- keep enum values consistent
        "created_at": datetime.utcnow().isoformat(),
        "url": json.dumps(urls),                 # store list as JSON string
        "result": "",
        "error": "",
-    })
+    }
+
+    # Store webhook config if provided
+    if webhook_config:
+        task_data["webhook_config"] = json.dumps(webhook_config)
+
+    await redis.hset(f"task:{task_id}", mapping=task_data)
+
+    # Initialize webhook service
+    webhook_service = WebhookDeliveryService(config)

    async def _runner():
        try:
@@ -594,6 +607,17 @@ async def handle_crawl_job(
                "status": TaskStatus.COMPLETED,
                "result": json.dumps(result),
            })
+
+            # Send webhook notification on successful completion
+            await webhook_service.notify_job_completion(
+                task_id=task_id,
+                task_type="crawl",
+                status="completed",
+                urls=urls,
+                webhook_config=webhook_config,
+                result=result
+            )
+
            await asyncio.sleep(5)  # Give Redis time to process the update
        except Exception as exc:
            await redis.hset(f"task:{task_id}", mapping={
@@ -601,5 +625,15 @@ async def handle_crawl_job(
                "error": str(exc),
            })

+            # Send webhook notification on failure
+            await webhook_service.notify_job_completion(
+                task_id=task_id,
+                task_type="crawl",
+                status="failed",
+                urls=urls,
+                webhook_config=webhook_config,
+                error=str(exc)
+            )
+
    background_tasks.add_task(_runner)
    return {"task_id": task_id}
--- a/deploy/docker/config.yml
+++ b/deploy/docker/config.yml
@@ -88,4 +88,17 @@ observability:
    enabled: True
    endpoint: "/metrics"
  health_check:
-    endpoint: "/health"
+    endpoint: "/health"
+
+# Webhook Configuration
+webhooks:
+  enabled: true
+  default_url: null  # Optional: default webhook URL for all jobs
+  data_in_payload: false  # Optional: default behavior for including data
+  retry:
+    max_attempts: 5
+    initial_delay_ms: 1000  # 1s, 2s, 4s, 8s, 16s exponential backoff
+    max_delay_ms: 32000
+    timeout_ms: 30000  # 30s timeout per webhook call
+  headers:  # Optional: default headers to include
+    User-Agent: "Crawl4AI-Webhook/1.0"
--- a/deploy/docker/job.py
+++ b/deploy/docker/job.py
@@ -12,6 +12,7 @@ from api import (
    handle_crawl_job,
    handle_task_status,
 )
+from schemas import WebhookConfig

 # ------------- dependency placeholders -------------
 _redis = None        # will be injected from server.py
@@ -43,6 +44,7 @@ class CrawlJobPayload(BaseModel):
    urls:           list[HttpUrl]
    browser_config: Dict = {}
    crawler_config: Dict = {}
+    webhook_config: Optional[WebhookConfig] = None


 # ---------- LLM job ---------------------------------------------------------
@@ -82,6 +84,10 @@ async def crawl_job_enqueue(
        background_tasks: BackgroundTasks,
        _td: Dict = Depends(lambda: _token_dep()),
 ):
+    webhook_config = None
+    if payload.webhook_config:
+        webhook_config = payload.webhook_config.dict()
+
    return await handle_crawl_job(
        _redis,
        background_tasks,
@@ -89,6 +95,7 @@ async def crawl_job_enqueue(
        payload.browser_config,
        payload.crawler_config,
        config=_config,
+        webhook_config=webhook_config,
    )


--- a/deploy/docker/schemas.py
+++ b/deploy/docker/schemas.py
@@ -1,6 +1,6 @@
 from typing import List, Optional, Dict
 from enum import Enum
-from pydantic import BaseModel, Field
+from pydantic import BaseModel, Field, HttpUrl
 from utils import FilterType


@@ -39,4 +39,22 @@ class JSEndpointRequest(BaseModel):
    scripts: List[str] = Field(
        ...,
        description="List of separated JavaScript snippets to execute"
-    )
+    )
+
+
+class WebhookConfig(BaseModel):
+    """Configuration for webhook notifications."""
+    webhook_url: HttpUrl
+    webhook_data_in_payload: bool = False
+    webhook_headers: Optional[Dict[str, str]] = None
+
+
+class WebhookPayload(BaseModel):
+    """Payload sent to webhook endpoints."""
+    task_id: str
+    task_type: str  # "crawl", "llm_extraction", etc.
+    status: str  # "completed" or "failed"
+    timestamp: str  # ISO 8601 format
+    urls: List[str]
+    error: Optional[str] = None
+    data: Optional[Dict] = None  # Included only if webhook_data_in_payload=True
--- a/deploy/docker/webhook.py
+++ b/deploy/docker/webhook.py
@@ -0,0 +1,159 @@
+"""
+Webhook delivery service for Crawl4AI.
+
+This module provides webhook notification functionality with exponential backoff retry logic.
+"""
+import asyncio
+import httpx
+import logging
+from typing import Dict, Optional
+from datetime import datetime, timezone
+
+logger = logging.getLogger(__name__)
+
+
+class WebhookDeliveryService:
+    """Handles webhook delivery with exponential backoff retry logic."""
+
+    def __init__(self, config: Dict):
+        """
+        Initialize the webhook delivery service.
+
+        Args:
+            config: Application configuration dictionary containing webhook settings
+        """
+        self.config = config.get("webhooks", {})
+        self.max_attempts = self.config.get("retry", {}).get("max_attempts", 5)
+        self.initial_delay = self.config.get("retry", {}).get("initial_delay_ms", 1000) / 1000
+        self.max_delay = self.config.get("retry", {}).get("max_delay_ms", 32000) / 1000
+        self.timeout = self.config.get("retry", {}).get("timeout_ms", 30000) / 1000
+
+    async def send_webhook(
+        self,
+        webhook_url: str,
+        payload: Dict,
+        headers: Optional[Dict[str, str]] = None
+    ) -> bool:
+        """
+        Send webhook with exponential backoff retry logic.
+
+        Args:
+            webhook_url: The URL to send the webhook to
+            payload: The JSON payload to send
+            headers: Optional custom headers
+
+        Returns:
+            bool: True if delivered successfully, False otherwise
+        """
+        default_headers = self.config.get("headers", {})
+        merged_headers = {**default_headers, **(headers or {})}
+        merged_headers["Content-Type"] = "application/json"
+
+        async with httpx.AsyncClient(timeout=self.timeout) as client:
+            for attempt in range(self.max_attempts):
+                try:
+                    logger.info(
+                        f"Sending webhook (attempt {attempt + 1}/{self.max_attempts}) to {webhook_url}"
+                    )
+
+                    response = await client.post(
+                        webhook_url,
+                        json=payload,
+                        headers=merged_headers
+                    )
+
+                    # Success or client error (don't retry client errors)
+                    if response.status_code < 500:
+                        if 200 <= response.status_code < 300:
+                            logger.info(f"Webhook delivered successfully to {webhook_url}")
+                            return True
+                        else:
+                            logger.warning(
+                                f"Webhook rejected with status {response.status_code}: {response.text[:200]}"
+                            )
+                            return False  # Client error - don't retry
+
+                    # Server error - retry with backoff
+                    logger.warning(
+                        f"Webhook failed with status {response.status_code}, will retry"
+                    )
+
+                except httpx.TimeoutException as exc:
+                    logger.error(f"Webhook timeout (attempt {attempt + 1}): {exc}")
+                except httpx.RequestError as exc:
+                    logger.error(f"Webhook request error (attempt {attempt + 1}): {exc}")
+                except Exception as exc:
+                    logger.error(f"Webhook delivery error (attempt {attempt + 1}): {exc}")
+
+                # Calculate exponential backoff delay
+                if attempt < self.max_attempts - 1:
+                    delay = min(self.initial_delay * (2 ** attempt), self.max_delay)
+                    logger.info(f"Retrying in {delay}s...")
+                    await asyncio.sleep(delay)
+
+        logger.error(
+            f"Webhook delivery failed after {self.max_attempts} attempts to {webhook_url}"
+        )
+        return False
+
+    async def notify_job_completion(
+        self,
+        task_id: str,
+        task_type: str,
+        status: str,
+        urls: list,
+        webhook_config: Optional[Dict],
+        result: Optional[Dict] = None,
+        error: Optional[str] = None
+    ):
+        """
+        Notify webhook of job completion.
+
+        Args:
+            task_id: The task identifier
+            task_type: Type of task (e.g., "crawl", "llm_extraction")
+            status: Task status ("completed" or "failed")
+            urls: List of URLs that were crawled
+            webhook_config: Webhook configuration from the job request
+            result: Optional crawl result data
+            error: Optional error message if failed
+        """
+        # Determine webhook URL
+        webhook_url = None
+        data_in_payload = self.config.get("data_in_payload", False)
+        custom_headers = None
+
+        if webhook_config:
+            webhook_url = webhook_config.get("webhook_url")
+            data_in_payload = webhook_config.get("webhook_data_in_payload", data_in_payload)
+            custom_headers = webhook_config.get("webhook_headers")
+
+        if not webhook_url:
+            webhook_url = self.config.get("default_url")
+
+        if not webhook_url:
+            logger.debug("No webhook URL configured, skipping notification")
+            return
+
+        # Check if webhooks are enabled
+        if not self.config.get("enabled", True):
+            logger.debug("Webhooks are disabled, skipping notification")
+            return
+
+        # Build payload
+        payload = {
+            "task_id": task_id,
+            "task_type": task_type,
+            "status": status,
+            "timestamp": datetime.now(timezone.utc).isoformat(),
+            "urls": urls
+        }
+
+        if error:
+            payload["error"] = error
+
+        if data_in_payload and result:
+            payload["data"] = result
+
+        # Send webhook (fire and forget - don't block on completion)
+        await self.send_webhook(webhook_url, payload, custom_headers)