Implements webhook support for the crawl job API to eliminate polling requirements. Changes: - Added WebhookConfig and WebhookPayload schemas to schemas.py - Created webhook.py with WebhookDeliveryService class - Integrated webhook notifications in api.py handle_crawl_job - Updated job.py CrawlJobPayload to accept webhook_config - Added webhook configuration section to config.yml - Included comprehensive usage examples in WEBHOOK_EXAMPLES.md Features: - Webhook notifications on job completion (success/failure) - Configurable data inclusion in webhook payload - Custom webhook headers support - Global default webhook URL configuration - Exponential backoff retry logic (5 attempts: 1s, 2s, 4s, 8s, 16s) - 30-second timeout per webhook call Usage: POST /crawl/job with optional webhook_config: - webhook_url: URL to receive notifications - webhook_data_in_payload: include full results (default: false) - webhook_headers: custom headers for authentication Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude <noreply@anthropic.com>
282 lines
6.9 KiB
Markdown
282 lines
6.9 KiB
Markdown
# Webhook Feature Examples
|
|
|
|
This document provides examples of how to use the webhook feature for crawl jobs in Crawl4AI.
|
|
|
|
## Overview
|
|
|
|
The webhook feature allows you to receive notifications when crawl jobs complete, eliminating the need for polling. Webhooks are sent with exponential backoff retry logic to ensure reliable delivery.
|
|
|
|
## Configuration
|
|
|
|
### Global Configuration (config.yml)
|
|
|
|
You can configure default webhook settings in `config.yml`:
|
|
|
|
```yaml
|
|
webhooks:
|
|
enabled: true
|
|
default_url: null # Optional: default webhook URL for all jobs
|
|
data_in_payload: false # Optional: default behavior for including data
|
|
retry:
|
|
max_attempts: 5
|
|
initial_delay_ms: 1000 # 1s, 2s, 4s, 8s, 16s exponential backoff
|
|
max_delay_ms: 32000
|
|
timeout_ms: 30000 # 30s timeout per webhook call
|
|
headers: # Optional: default headers to include
|
|
User-Agent: "Crawl4AI-Webhook/1.0"
|
|
```
|
|
|
|
## API Usage Examples
|
|
|
|
### Example 1: Basic Webhook (Notification Only)
|
|
|
|
Send a webhook notification without including the crawl data in the payload.
|
|
|
|
**Request:**
|
|
```bash
|
|
curl -X POST http://localhost:11235/crawl/job \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"urls": ["https://example.com"],
|
|
"webhook_config": {
|
|
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
|
|
"webhook_data_in_payload": false
|
|
}
|
|
}'
|
|
```
|
|
|
|
**Response:**
|
|
```json
|
|
{
|
|
"task_id": "crawl_a1b2c3d4"
|
|
}
|
|
```
|
|
|
|
**Webhook Payload Received:**
|
|
```json
|
|
{
|
|
"task_id": "crawl_a1b2c3d4",
|
|
"task_type": "crawl",
|
|
"status": "completed",
|
|
"timestamp": "2025-10-21T10:30:00.000000+00:00",
|
|
"urls": ["https://example.com"]
|
|
}
|
|
```
|
|
|
|
Your webhook handler should then fetch the results:
|
|
```bash
|
|
curl http://localhost:11235/crawl/job/crawl_a1b2c3d4
|
|
```
|
|
|
|
### Example 2: Webhook with Data Included
|
|
|
|
Include the full crawl results in the webhook payload.
|
|
|
|
**Request:**
|
|
```bash
|
|
curl -X POST http://localhost:11235/crawl/job \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"urls": ["https://example.com"],
|
|
"webhook_config": {
|
|
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
|
|
"webhook_data_in_payload": true
|
|
}
|
|
}'
|
|
```
|
|
|
|
**Webhook Payload Received:**
|
|
```json
|
|
{
|
|
"task_id": "crawl_a1b2c3d4",
|
|
"task_type": "crawl",
|
|
"status": "completed",
|
|
"timestamp": "2025-10-21T10:30:00.000000+00:00",
|
|
"urls": ["https://example.com"],
|
|
"data": {
|
|
"markdown": "...",
|
|
"html": "...",
|
|
"links": {...},
|
|
"metadata": {...}
|
|
}
|
|
}
|
|
```
|
|
|
|
### Example 3: Webhook with Custom Headers
|
|
|
|
Include custom headers for authentication or identification.
|
|
|
|
**Request:**
|
|
```bash
|
|
curl -X POST http://localhost:11235/crawl/job \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"urls": ["https://example.com"],
|
|
"webhook_config": {
|
|
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
|
|
"webhook_data_in_payload": false,
|
|
"webhook_headers": {
|
|
"X-Webhook-Secret": "my-secret-token",
|
|
"X-Service-ID": "crawl4ai-production"
|
|
}
|
|
}
|
|
}'
|
|
```
|
|
|
|
The webhook will be sent with these additional headers plus the default headers from config.
|
|
|
|
### Example 4: Failure Notification
|
|
|
|
When a crawl job fails, a webhook is sent with error details.
|
|
|
|
**Webhook Payload on Failure:**
|
|
```json
|
|
{
|
|
"task_id": "crawl_a1b2c3d4",
|
|
"task_type": "crawl",
|
|
"status": "failed",
|
|
"timestamp": "2025-10-21T10:30:00.000000+00:00",
|
|
"urls": ["https://example.com"],
|
|
"error": "Connection timeout after 30s"
|
|
}
|
|
```
|
|
|
|
### Example 5: Using Global Default Webhook
|
|
|
|
If you set a `default_url` in config.yml, jobs without webhook_config will use it:
|
|
|
|
**config.yml:**
|
|
```yaml
|
|
webhooks:
|
|
enabled: true
|
|
default_url: "https://myapp.com/webhooks/default"
|
|
data_in_payload: false
|
|
```
|
|
|
|
**Request (no webhook_config needed):**
|
|
```bash
|
|
curl -X POST http://localhost:11235/crawl/job \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"urls": ["https://example.com"]
|
|
}'
|
|
```
|
|
|
|
The webhook will be sent to the default URL configured in config.yml.
|
|
|
|
## Webhook Handler Example
|
|
|
|
Here's a simple Python Flask webhook handler:
|
|
|
|
```python
|
|
from flask import Flask, request, jsonify
|
|
import requests
|
|
|
|
app = Flask(__name__)
|
|
|
|
@app.route('/webhooks/crawl-complete', methods=['POST'])
|
|
def handle_crawl_webhook():
|
|
payload = request.json
|
|
|
|
task_id = payload['task_id']
|
|
status = payload['status']
|
|
|
|
if status == 'completed':
|
|
# If data not in payload, fetch it
|
|
if 'data' not in payload:
|
|
response = requests.get(f'http://localhost:11235/crawl/job/{task_id}')
|
|
data = response.json()
|
|
else:
|
|
data = payload['data']
|
|
|
|
# Process the crawl data
|
|
print(f"Processing crawl results for {task_id}")
|
|
# Your business logic here...
|
|
|
|
elif status == 'failed':
|
|
error = payload.get('error', 'Unknown error')
|
|
print(f"Crawl job {task_id} failed: {error}")
|
|
# Handle failure...
|
|
|
|
return jsonify({"status": "received"}), 200
|
|
|
|
if __name__ == '__main__':
|
|
app.run(port=8080)
|
|
```
|
|
|
|
## Retry Logic
|
|
|
|
The webhook delivery service uses exponential backoff retry logic:
|
|
|
|
- **Attempts:** Up to 5 attempts by default
|
|
- **Delays:** 1s → 2s → 4s → 8s → 16s
|
|
- **Timeout:** 30 seconds per attempt
|
|
- **Retry Conditions:**
|
|
- Server errors (5xx status codes)
|
|
- Network errors
|
|
- Timeouts
|
|
- **No Retry:**
|
|
- Client errors (4xx status codes)
|
|
- Successful delivery (2xx status codes)
|
|
|
|
## Benefits
|
|
|
|
1. **No Polling Required** - Eliminates constant API calls to check job status
|
|
2. **Real-time Notifications** - Immediate notification when jobs complete
|
|
3. **Reliable Delivery** - Exponential backoff ensures webhooks are delivered
|
|
4. **Flexible** - Choose between notification-only or full data delivery
|
|
5. **Secure** - Support for custom headers for authentication
|
|
6. **Configurable** - Global defaults or per-job configuration
|
|
|
|
## TypeScript Client Example
|
|
|
|
```typescript
|
|
interface WebhookConfig {
|
|
webhook_url: string;
|
|
webhook_data_in_payload?: boolean;
|
|
webhook_headers?: Record<string, string>;
|
|
}
|
|
|
|
interface CrawlJobRequest {
|
|
urls: string[];
|
|
browser_config?: Record<string, any>;
|
|
crawler_config?: Record<string, any>;
|
|
webhook_config?: WebhookConfig;
|
|
}
|
|
|
|
async function createCrawlJob(request: CrawlJobRequest) {
|
|
const response = await fetch('http://localhost:11235/crawl/job', {
|
|
method: 'POST',
|
|
headers: { 'Content-Type': 'application/json' },
|
|
body: JSON.stringify(request)
|
|
});
|
|
|
|
const { task_id } = await response.json();
|
|
return task_id;
|
|
}
|
|
|
|
// Usage
|
|
const taskId = await createCrawlJob({
|
|
urls: ['https://example.com'],
|
|
webhook_config: {
|
|
webhook_url: 'https://myapp.com/webhooks/crawl-complete',
|
|
webhook_data_in_payload: false,
|
|
webhook_headers: {
|
|
'X-Webhook-Secret': 'my-secret'
|
|
}
|
|
}
|
|
});
|
|
```
|
|
|
|
## Monitoring and Debugging
|
|
|
|
Webhook delivery attempts are logged at INFO level:
|
|
- Successful deliveries
|
|
- Retry attempts with delays
|
|
- Final failures after max attempts
|
|
|
|
Check the application logs for webhook delivery status:
|
|
```bash
|
|
docker logs crawl4ai-container | grep -i webhook
|
|
```
|