Merge branch 'develop' into feature/docker-cluster

This commit is contained in:
unclecode
2025-10-24 12:33:45 +08:00
42 changed files with 14417 additions and 708 deletions

View File

@@ -0,0 +1,251 @@
# Webhook Feature Test Script
This directory contains a comprehensive test script for the webhook feature implementation.
## Overview
The `test_webhook_feature.sh` script automates the entire process of testing the webhook feature:
1. ✅ Fetches and switches to the webhook feature branch
2. ✅ Activates the virtual environment
3. ✅ Installs all required dependencies
4. ✅ Starts Redis server in background
5. ✅ Starts Crawl4AI server in background
6. ✅ Runs webhook integration test
7. ✅ Verifies job completion via webhook
8. ✅ Cleans up and returns to original branch
## Prerequisites
- Python 3.10+
- Virtual environment already created (`venv/` in project root)
- Git repository with the webhook feature branch
- `redis-server` (script will attempt to install if missing)
- `curl` and `lsof` commands available
## Usage
### Quick Start
From the project root:
```bash
./tests/test_webhook_feature.sh
```
Or from the tests directory:
```bash
cd tests
./test_webhook_feature.sh
```
### What the Script Does
#### Step 1: Branch Management
- Saves your current branch
- Fetches the webhook feature branch from remote
- Switches to the webhook feature branch
#### Step 2: Environment Setup
- Activates your existing virtual environment
- Installs dependencies from `deploy/docker/requirements.txt`
- Installs Flask for the webhook receiver
#### Step 3: Service Startup
- Starts Redis server on port 6379
- Starts Crawl4AI server on port 11235
- Waits for server health check to pass
#### Step 4: Webhook Test
- Creates a webhook receiver on port 8080
- Submits a crawl job for `https://example.com` with webhook config
- Waits for webhook notification (60s timeout)
- Verifies webhook payload contains expected data
#### Step 5: Cleanup
- Stops webhook receiver
- Stops Crawl4AI server
- Stops Redis server
- Returns to your original branch
## Expected Output
```
[INFO] Starting webhook feature test script
[INFO] Project root: /path/to/crawl4ai
[INFO] Step 1: Fetching PR branch...
[INFO] Current branch: develop
[SUCCESS] Branch fetched
[INFO] Step 2: Switching to branch: claude/implement-webhook-crawl-feature-011CULZY1Jy8N5MUkZqXkRVp
[SUCCESS] Switched to webhook feature branch
[INFO] Step 3: Activating virtual environment...
[SUCCESS] Virtual environment activated
[INFO] Step 4: Installing server dependencies...
[SUCCESS] Dependencies installed
[INFO] Step 5a: Starting Redis...
[SUCCESS] Redis started (PID: 12345)
[INFO] Step 5b: Starting server on port 11235...
[INFO] Server started (PID: 12346)
[INFO] Waiting for server to be ready...
[SUCCESS] Server is ready!
[INFO] Step 6: Creating webhook test script...
[INFO] Running webhook test...
🚀 Submitting crawl job with webhook...
✅ Job submitted successfully, task_id: crawl_abc123
⏳ Waiting for webhook notification...
✅ Webhook received: {
"task_id": "crawl_abc123",
"task_type": "crawl",
"status": "completed",
"timestamp": "2025-10-22T00:00:00.000000+00:00",
"urls": ["https://example.com"],
"data": { ... }
}
✅ Webhook received!
Task ID: crawl_abc123
Status: completed
URLs: ['https://example.com']
✅ Data included in webhook payload
📄 Crawled 1 URL(s)
- https://example.com: 1234 chars
🎉 Webhook test PASSED!
[INFO] Step 7: Verifying test results...
[SUCCESS] ✅ Webhook test PASSED!
[SUCCESS] All tests completed successfully! 🎉
[INFO] Cleanup will happen automatically...
[INFO] Starting cleanup...
[INFO] Stopping webhook receiver...
[INFO] Stopping server...
[INFO] Stopping Redis...
[INFO] Switching back to branch: develop
[SUCCESS] Cleanup complete
```
## Troubleshooting
### Server Failed to Start
If the server fails to start, check the logs:
```bash
tail -100 /tmp/crawl4ai_server.log
```
Common issues:
- Port 11235 already in use: `lsof -ti:11235 | xargs kill -9`
- Missing dependencies: Check that all packages are installed
### Redis Connection Failed
Check if Redis is running:
```bash
redis-cli ping
# Should return: PONG
```
If not running:
```bash
redis-server --port 6379 --daemonize yes
```
### Webhook Not Received
The script has a 60-second timeout for webhook delivery. If the webhook isn't received:
1. Check server logs: `/tmp/crawl4ai_server.log`
2. Verify webhook receiver is running on port 8080
3. Check network connectivity between components
### Script Interruption
If the script is interrupted (Ctrl+C), cleanup happens automatically via trap. The script will:
- Kill all background processes
- Stop Redis
- Return to your original branch
To manually cleanup if needed:
```bash
# Kill processes by port
lsof -ti:11235 | xargs kill -9 # Server
lsof -ti:8080 | xargs kill -9 # Webhook receiver
lsof -ti:6379 | xargs kill -9 # Redis
# Return to your branch
git checkout develop # or your branch name
```
## Testing Different URLs
To test with a different URL, modify the script or create a custom test:
```python
payload = {
"urls": ["https://your-url-here.com"],
"browser_config": {"headless": True},
"crawler_config": {"cache_mode": "bypass"},
"webhook_config": {
"webhook_url": "http://localhost:8080/webhook",
"webhook_data_in_payload": True
}
}
```
## Files Generated
The script creates temporary files:
- `/tmp/crawl4ai_server.log` - Server output logs
- `/tmp/test_webhook.py` - Webhook test Python script
These are not cleaned up automatically so you can review them after the test.
## Exit Codes
- `0` - All tests passed successfully
- `1` - Test failed (check output for details)
## Safety Features
- ✅ Automatic cleanup on exit, interrupt, or error
- ✅ Returns to original branch on completion
- ✅ Kills all background processes
- ✅ Comprehensive error handling
- ✅ Colored output for easy reading
- ✅ Detailed logging at each step
## Notes
- The script uses `set -e` to exit on any command failure
- All background processes are tracked and cleaned up
- The virtual environment must exist before running
- Redis must be available (installed or installable via apt-get/brew)
## Integration with CI/CD
This script can be integrated into CI/CD pipelines:
```yaml
# Example GitHub Actions
- name: Test Webhook Feature
run: |
chmod +x tests/test_webhook_feature.sh
./tests/test_webhook_feature.sh
```
## Support
If you encounter issues:
1. Check the troubleshooting section above
2. Review server logs at `/tmp/crawl4ai_server.log`
3. Ensure all prerequisites are met
4. Open an issue with the full output of the script

View File

@@ -0,0 +1,193 @@
"""
Test script demonstrating the hooks_to_string utility and Docker client integration.
"""
import asyncio
from crawl4ai import Crawl4aiDockerClient, hooks_to_string
# Define hook functions as regular Python functions
async def auth_hook(page, context, **kwargs):
"""Add authentication cookies."""
await context.add_cookies([{
'name': 'test_cookie',
'value': 'test_value',
'domain': '.httpbin.org',
'path': '/'
}])
return page
async def scroll_hook(page, context, **kwargs):
"""Scroll to load lazy content."""
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
return page
async def viewport_hook(page, context, **kwargs):
"""Set custom viewport."""
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
async def test_hooks_utility():
"""Test the hooks_to_string utility function."""
print("=" * 60)
print("Testing hooks_to_string utility")
print("=" * 60)
# Create hooks dictionary with function objects
hooks_dict = {
"on_page_context_created": auth_hook,
"before_retrieve_html": scroll_hook
}
# Convert to string format
hooks_string = hooks_to_string(hooks_dict)
print("\n✓ Successfully converted function objects to strings")
print(f"\n✓ Converted {len(hooks_string)} hooks:")
for hook_name in hooks_string.keys():
print(f" - {hook_name}")
print("\n✓ Preview of converted hook:")
print("-" * 60)
print(hooks_string["on_page_context_created"][:200] + "...")
print("-" * 60)
return hooks_string
async def test_docker_client_with_functions():
"""Test Docker client with function objects (automatic conversion)."""
print("\n" + "=" * 60)
print("Testing Docker Client with Function Objects")
print("=" * 60)
# Note: This requires a running Crawl4AI Docker server
# Uncomment the following to test with actual server:
async with Crawl4aiDockerClient(base_url="http://localhost:11234", verbose=True) as client:
# Pass function objects directly - they'll be converted automatically
result = await client.crawl(
["https://httpbin.org/html"],
hooks={
"on_page_context_created": auth_hook,
"before_retrieve_html": scroll_hook
},
hooks_timeout=30
)
print(f"\n✓ Crawl successful: {result.success}")
print(f"✓ URL: {result.url}")
print("\n✓ Docker client accepts function objects directly")
print("✓ Automatic conversion happens internally")
print("✓ No manual string formatting needed!")
async def test_docker_client_with_strings():
"""Test Docker client with pre-converted strings."""
print("\n" + "=" * 60)
print("Testing Docker Client with String Hooks")
print("=" * 60)
# Convert hooks to strings first
hooks_dict = {
"on_page_context_created": viewport_hook,
"before_retrieve_html": scroll_hook
}
hooks_string = hooks_to_string(hooks_dict)
# Note: This requires a running Crawl4AI Docker server
# Uncomment the following to test with actual server:
async with Crawl4aiDockerClient(base_url="http://localhost:11234", verbose=True) as client:
# Pass string hooks - they'll be used as-is
result = await client.crawl(
["https://httpbin.org/html"],
hooks=hooks_string,
hooks_timeout=30
)
print(f"\n✓ Crawl successful: {result.success}")
print("\n✓ Docker client also accepts pre-converted strings")
print("✓ Backward compatible with existing code")
async def show_usage_patterns():
"""Show different usage patterns."""
print("\n" + "=" * 60)
print("Usage Patterns")
print("=" * 60)
print("\n1. Direct function usage (simplest):")
print("-" * 60)
print("""
async def my_hook(page, context, **kwargs):
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
result = await client.crawl(
["https://example.com"],
hooks={"on_page_context_created": my_hook}
)
""")
print("\n2. Convert then use:")
print("-" * 60)
print("""
hooks_dict = {"on_page_context_created": my_hook}
hooks_string = hooks_to_string(hooks_dict)
result = await client.crawl(
["https://example.com"],
hooks=hooks_string
)
""")
print("\n3. Manual string (backward compatible):")
print("-" * 60)
print("""
hooks_string = {
"on_page_context_created": '''
async def hook(page, context, **kwargs):
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
'''
}
result = await client.crawl(
["https://example.com"],
hooks=hooks_string
)
""")
async def main():
"""Run all tests."""
print("\n🚀 Crawl4AI Hooks Utility Test Suite\n")
# Test the utility function
# await test_hooks_utility()
# Show usage with Docker client
# await test_docker_client_with_functions()
await test_docker_client_with_strings()
# Show different patterns
# await show_usage_patterns()
# print("\n" + "=" * 60)
# print("✓ All tests completed successfully!")
# print("=" * 60)
# print("\nKey Benefits:")
# print(" • Write hooks as regular Python functions")
# print(" • IDE support with autocomplete and type checking")
# print(" • Automatic conversion to API format")
# print(" • Backward compatible with string hooks")
# print(" • Same utility used everywhere")
# print("\n")
if __name__ == "__main__":
asyncio.run(main())

305
tests/test_webhook_feature.sh Executable file
View File

@@ -0,0 +1,305 @@
#!/bin/bash
#############################################################################
# Webhook Feature Test Script
#
# This script tests the webhook feature implementation by:
# 1. Switching to the webhook feature branch
# 2. Installing dependencies
# 3. Starting the server
# 4. Running webhook tests
# 5. Cleaning up and returning to original branch
#
# Usage: ./test_webhook_feature.sh
#############################################################################
set -e # Exit on error
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Configuration
BRANCH_NAME="claude/implement-webhook-crawl-feature-011CULZY1Jy8N5MUkZqXkRVp"
VENV_PATH="venv"
SERVER_PORT=11235
WEBHOOK_PORT=8080
PROJECT_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
# PID files for cleanup
REDIS_PID=""
SERVER_PID=""
WEBHOOK_PID=""
#############################################################################
# Utility Functions
#############################################################################
log_info() {
echo -e "${BLUE}[INFO]${NC} $1"
}
log_success() {
echo -e "${GREEN}[SUCCESS]${NC} $1"
}
log_warning() {
echo -e "${YELLOW}[WARNING]${NC} $1"
}
log_error() {
echo -e "${RED}[ERROR]${NC} $1"
}
cleanup() {
log_info "Starting cleanup..."
# Kill webhook receiver if running
if [ ! -z "$WEBHOOK_PID" ] && kill -0 $WEBHOOK_PID 2>/dev/null; then
log_info "Stopping webhook receiver (PID: $WEBHOOK_PID)..."
kill $WEBHOOK_PID 2>/dev/null || true
fi
# Kill server if running
if [ ! -z "$SERVER_PID" ] && kill -0 $SERVER_PID 2>/dev/null; then
log_info "Stopping server (PID: $SERVER_PID)..."
kill $SERVER_PID 2>/dev/null || true
fi
# Kill Redis if running
if [ ! -z "$REDIS_PID" ] && kill -0 $REDIS_PID 2>/dev/null; then
log_info "Stopping Redis (PID: $REDIS_PID)..."
kill $REDIS_PID 2>/dev/null || true
fi
# Also kill by port if PIDs didn't work
lsof -ti:$SERVER_PORT | xargs kill -9 2>/dev/null || true
lsof -ti:$WEBHOOK_PORT | xargs kill -9 2>/dev/null || true
lsof -ti:6379 | xargs kill -9 2>/dev/null || true
# Return to original branch
if [ ! -z "$ORIGINAL_BRANCH" ]; then
log_info "Switching back to branch: $ORIGINAL_BRANCH"
git checkout $ORIGINAL_BRANCH 2>/dev/null || true
fi
log_success "Cleanup complete"
}
# Set trap to cleanup on exit
trap cleanup EXIT INT TERM
#############################################################################
# Main Script
#############################################################################
log_info "Starting webhook feature test script"
log_info "Project root: $PROJECT_ROOT"
cd "$PROJECT_ROOT"
# Step 1: Save current branch and fetch PR
log_info "Step 1: Fetching PR branch..."
ORIGINAL_BRANCH=$(git rev-parse --abbrev-ref HEAD)
log_info "Current branch: $ORIGINAL_BRANCH"
git fetch origin $BRANCH_NAME
log_success "Branch fetched"
# Step 2: Switch to new branch
log_info "Step 2: Switching to branch: $BRANCH_NAME"
git checkout $BRANCH_NAME
log_success "Switched to webhook feature branch"
# Step 3: Activate virtual environment
log_info "Step 3: Activating virtual environment..."
if [ ! -d "$VENV_PATH" ]; then
log_error "Virtual environment not found at $VENV_PATH"
log_info "Creating virtual environment..."
python3 -m venv $VENV_PATH
fi
source $VENV_PATH/bin/activate
log_success "Virtual environment activated: $(which python)"
# Step 4: Install server dependencies
log_info "Step 4: Installing server dependencies..."
pip install -q -r deploy/docker/requirements.txt
log_success "Dependencies installed"
# Check if Redis is available
log_info "Checking Redis availability..."
if ! command -v redis-server &> /dev/null; then
log_warning "Redis not found, attempting to install..."
if command -v apt-get &> /dev/null; then
sudo apt-get update && sudo apt-get install -y redis-server
elif command -v brew &> /dev/null; then
brew install redis
else
log_error "Cannot install Redis automatically. Please install Redis manually."
exit 1
fi
fi
# Step 5: Start Redis in background
log_info "Step 5a: Starting Redis..."
redis-server --port 6379 --daemonize yes
sleep 2
REDIS_PID=$(pgrep redis-server)
log_success "Redis started (PID: $REDIS_PID)"
# Step 5b: Start server in background
log_info "Step 5b: Starting server on port $SERVER_PORT..."
cd deploy/docker
# Start server in background
python3 -m uvicorn server:app --host 0.0.0.0 --port $SERVER_PORT > /tmp/crawl4ai_server.log 2>&1 &
SERVER_PID=$!
cd "$PROJECT_ROOT"
log_info "Server started (PID: $SERVER_PID)"
# Wait for server to be ready
log_info "Waiting for server to be ready..."
for i in {1..30}; do
if curl -s http://localhost:$SERVER_PORT/health > /dev/null 2>&1; then
log_success "Server is ready!"
break
fi
if [ $i -eq 30 ]; then
log_error "Server failed to start within 30 seconds"
log_info "Server logs:"
tail -50 /tmp/crawl4ai_server.log
exit 1
fi
echo -n "."
sleep 1
done
echo ""
# Step 6: Create and run webhook test
log_info "Step 6: Creating webhook test script..."
cat > /tmp/test_webhook.py << 'PYTHON_SCRIPT'
import requests
import json
import time
from flask import Flask, request, jsonify
from threading import Thread, Event
# Configuration
CRAWL4AI_BASE_URL = "http://localhost:11235"
WEBHOOK_BASE_URL = "http://localhost:8080"
# Flask app for webhook receiver
app = Flask(__name__)
webhook_received = Event()
webhook_data = {}
@app.route('/webhook', methods=['POST'])
def handle_webhook():
global webhook_data
webhook_data = request.json
webhook_received.set()
print(f"\n✅ Webhook received: {json.dumps(webhook_data, indent=2)}")
return jsonify({"status": "received"}), 200
def start_webhook_server():
app.run(host='0.0.0.0', port=8080, debug=False, use_reloader=False)
# Start webhook server in background
webhook_thread = Thread(target=start_webhook_server, daemon=True)
webhook_thread.start()
time.sleep(2)
print("🚀 Submitting crawl job with webhook...")
# Submit job with webhook
payload = {
"urls": ["https://example.com"],
"browser_config": {"headless": True},
"crawler_config": {"cache_mode": "bypass"},
"webhook_config": {
"webhook_url": f"{WEBHOOK_BASE_URL}/webhook",
"webhook_data_in_payload": True
}
}
response = requests.post(
f"{CRAWL4AI_BASE_URL}/crawl/job",
json=payload,
headers={"Content-Type": "application/json"}
)
if not response.ok:
print(f"❌ Failed to submit job: {response.text}")
exit(1)
task_id = response.json()['task_id']
print(f"✅ Job submitted successfully, task_id: {task_id}")
# Wait for webhook (with timeout)
print("⏳ Waiting for webhook notification...")
if webhook_received.wait(timeout=60):
print(f"✅ Webhook received!")
print(f" Task ID: {webhook_data.get('task_id')}")
print(f" Status: {webhook_data.get('status')}")
print(f" URLs: {webhook_data.get('urls')}")
if webhook_data.get('status') == 'completed':
if 'data' in webhook_data:
print(f" ✅ Data included in webhook payload")
results = webhook_data['data'].get('results', [])
if results:
print(f" 📄 Crawled {len(results)} URL(s)")
for result in results:
print(f" - {result.get('url')}: {len(result.get('markdown', ''))} chars")
print("\n🎉 Webhook test PASSED!")
exit(0)
else:
print(f" ❌ Job failed: {webhook_data.get('error')}")
exit(1)
else:
print("❌ Webhook not received within 60 seconds")
# Try polling as fallback
print("⏳ Trying to poll job status...")
for i in range(10):
status_response = requests.get(f"{CRAWL4AI_BASE_URL}/crawl/job/{task_id}")
if status_response.ok:
status = status_response.json()
print(f" Status: {status.get('status')}")
if status.get('status') in ['completed', 'failed']:
break
time.sleep(2)
exit(1)
PYTHON_SCRIPT
# Install Flask for webhook receiver
pip install -q flask
# Run the webhook test
log_info "Running webhook test..."
python3 /tmp/test_webhook.py &
WEBHOOK_PID=$!
# Wait for test to complete
wait $WEBHOOK_PID
TEST_EXIT_CODE=$?
# Step 7: Verify results
log_info "Step 7: Verifying test results..."
if [ $TEST_EXIT_CODE -eq 0 ]; then
log_success "✅ Webhook test PASSED!"
else
log_error "❌ Webhook test FAILED (exit code: $TEST_EXIT_CODE)"
log_info "Server logs:"
tail -100 /tmp/crawl4ai_server.log
exit 1
fi
# Step 8: Cleanup happens automatically via trap
log_success "All tests completed successfully! 🎉"
log_info "Cleanup will happen automatically..."