Release v0.7.7
- Updated version to 0.7.7 - Added comprehensive demo and release notes - Updated all documentation
This commit is contained in:
626
docs/blog/release-v0.7.7.md
Normal file
626
docs/blog/release-v0.7.7.md
Normal file
@@ -0,0 +1,626 @@
|
||||
# 🚀 Crawl4AI v0.7.7: The Self-Hosting & Monitoring Update
|
||||
|
||||
*November 14, 2025 • 10 min read*
|
||||
|
||||
---
|
||||
|
||||
Today I'm releasing Crawl4AI v0.7.7—the Self-Hosting & Monitoring Update. This release transforms Crawl4AI Docker from a simple containerized crawler into a complete self-hosting platform with enterprise-grade real-time monitoring, full operational transparency, and production-ready observability.
|
||||
|
||||
## 🎯 What's New at a Glance
|
||||
|
||||
- **📊 Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool status
|
||||
- **🔌 Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data
|
||||
- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards
|
||||
- **🎮 Control Actions**: Manual browser management (kill, restart, cleanup)
|
||||
- **🔥 Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion
|
||||
- **🧹 Janitor Cleanup System**: Automatic resource management with event logging
|
||||
- **📈 Production Metrics**: 6 critical metrics for operational excellence
|
||||
- **🏭 Integration Ready**: Prometheus, alerting, and log aggregation examples
|
||||
- **🐛 Critical Bug Fixes**: Async LLM extraction, DFS crawling, viewport config, and more
|
||||
|
||||
## 📊 Real-time Monitoring Dashboard: Complete Visibility
|
||||
|
||||
**The Problem:** Running Crawl4AI in Docker was like flying blind. Users had no visibility into what was happening inside the container—memory usage, active requests, browser pools, or errors. Troubleshooting required checking logs, and there was no way to monitor performance or manually intervene when issues occurred.
|
||||
|
||||
**My Solution:** I built a complete real-time monitoring system with an interactive dashboard, comprehensive REST API, WebSocket streaming, and manual control actions. Now you have full transparency and control over your crawling infrastructure.
|
||||
|
||||
### The Self-Hosting Value Proposition
|
||||
|
||||
Before v0.7.7, Docker was just a containerized crawler. After v0.7.7, it's a complete self-hosting platform that gives you:
|
||||
|
||||
- **🔒 Data Privacy**: Your data never leaves your infrastructure
|
||||
- **💰 Cost Control**: No per-request pricing or rate limits
|
||||
- **🎯 Full Customization**: Complete control over configurations and strategies
|
||||
- **📊 Complete Transparency**: Real-time visibility into every aspect
|
||||
- **⚡ Performance**: Direct access without network overhead
|
||||
- **🛡️ Enterprise Security**: Keep workflows behind your firewall
|
||||
|
||||
### Interactive Monitoring Dashboard
|
||||
|
||||
Access the dashboard at `http://localhost:11235/dashboard` to see:
|
||||
|
||||
- **System Health Overview**: CPU, memory, network, and uptime in real-time
|
||||
- **Live Request Tracking**: Active and completed requests with full details
|
||||
- **Browser Pool Management**: Interactive table with permanent/hot/cold browsers
|
||||
- **Janitor Events Log**: Automatic cleanup activities
|
||||
- **Error Monitoring**: Full context error logs
|
||||
|
||||
The dashboard updates every 2 seconds via WebSocket, giving you live visibility into your crawling operations.
|
||||
|
||||
## 🔌 Monitor API: Programmatic Access
|
||||
|
||||
**The Problem:** Monitoring dashboards are great for humans, but automation and integration require programmatic access.
|
||||
|
||||
**My Solution:** A comprehensive REST API that exposes all monitoring data for integration with your existing infrastructure.
|
||||
|
||||
### System Health Endpoint
|
||||
|
||||
```python
|
||||
import httpx
|
||||
import asyncio
|
||||
|
||||
async def monitor_system_health():
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.get("http://localhost:11235/monitor/health")
|
||||
health = response.json()
|
||||
|
||||
print(f"Container Metrics:")
|
||||
print(f" CPU: {health['container']['cpu_percent']:.1f}%")
|
||||
print(f" Memory: {health['container']['memory_percent']:.1f}%")
|
||||
print(f" Uptime: {health['container']['uptime_seconds']}s")
|
||||
|
||||
print(f"\nBrowser Pool:")
|
||||
print(f" Permanent: {health['pool']['permanent']['active']} active")
|
||||
print(f" Hot Pool: {health['pool']['hot']['count']} browsers")
|
||||
print(f" Cold Pool: {health['pool']['cold']['count']} browsers")
|
||||
|
||||
print(f"\nStatistics:")
|
||||
print(f" Total Requests: {health['stats']['total_requests']}")
|
||||
print(f" Success Rate: {health['stats']['success_rate_percent']:.1f}%")
|
||||
print(f" Avg Latency: {health['stats']['avg_latency_ms']:.0f}ms")
|
||||
|
||||
asyncio.run(monitor_system_health())
|
||||
```
|
||||
|
||||
### Request Tracking
|
||||
|
||||
```python
|
||||
async def track_requests():
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.get("http://localhost:11235/monitor/requests")
|
||||
requests_data = response.json()
|
||||
|
||||
print(f"Active Requests: {len(requests_data['active'])}")
|
||||
print(f"Completed Requests: {len(requests_data['completed'])}")
|
||||
|
||||
# See details of recent requests
|
||||
for req in requests_data['completed'][:5]:
|
||||
status_icon = "✅" if req['success'] else "❌"
|
||||
print(f"{status_icon} {req['endpoint']} - {req['latency_ms']:.0f}ms")
|
||||
```
|
||||
|
||||
### Browser Pool Management
|
||||
|
||||
```python
|
||||
async def monitor_browser_pool():
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.get("http://localhost:11235/monitor/browsers")
|
||||
browsers = response.json()
|
||||
|
||||
print(f"Pool Summary:")
|
||||
print(f" Total Browsers: {browsers['summary']['total_count']}")
|
||||
print(f" Total Memory: {browsers['summary']['total_memory_mb']} MB")
|
||||
print(f" Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
|
||||
|
||||
# List all browsers
|
||||
for browser in browsers['permanent']:
|
||||
print(f"🔥 Permanent: {browser['browser_id'][:8]}... | "
|
||||
f"Requests: {browser['request_count']} | "
|
||||
f"Memory: {browser['memory_mb']:.0f} MB")
|
||||
```
|
||||
|
||||
### Endpoint Performance Statistics
|
||||
|
||||
```python
|
||||
async def get_endpoint_stats():
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.get("http://localhost:11235/monitor/endpoints/stats")
|
||||
stats = response.json()
|
||||
|
||||
print("Endpoint Analytics:")
|
||||
for endpoint, data in stats.items():
|
||||
print(f" {endpoint}:")
|
||||
print(f" Requests: {data['count']}")
|
||||
print(f" Avg Latency: {data['avg_latency_ms']:.0f}ms")
|
||||
print(f" Success Rate: {data['success_rate_percent']:.1f}%")
|
||||
```
|
||||
|
||||
### Complete API Reference
|
||||
|
||||
The Monitor API includes these endpoints:
|
||||
|
||||
- `GET /monitor/health` - System health with pool statistics
|
||||
- `GET /monitor/requests` - Active and completed request tracking
|
||||
- `GET /monitor/browsers` - Browser pool details and efficiency
|
||||
- `GET /monitor/endpoints/stats` - Per-endpoint performance analytics
|
||||
- `GET /monitor/timeline?minutes=5` - Time-series data for charts
|
||||
- `GET /monitor/logs/janitor?limit=10` - Cleanup activity logs
|
||||
- `GET /monitor/logs/errors?limit=10` - Error logs with context
|
||||
- `POST /monitor/actions/cleanup` - Force immediate cleanup
|
||||
- `POST /monitor/actions/kill_browser` - Kill specific browser
|
||||
- `POST /monitor/actions/restart_browser` - Restart browser
|
||||
- `POST /monitor/stats/reset` - Reset accumulated statistics
|
||||
|
||||
## ⚡ WebSocket Streaming: Real-time Updates
|
||||
|
||||
**The Problem:** Polling the API every few seconds wastes resources and adds latency. Real-time dashboards need instant updates.
|
||||
|
||||
**My Solution:** WebSocket streaming with 2-second update intervals for building custom real-time dashboards.
|
||||
|
||||
### WebSocket Integration Example
|
||||
|
||||
```python
|
||||
import websockets
|
||||
import json
|
||||
import asyncio
|
||||
|
||||
async def monitor_realtime():
|
||||
uri = "ws://localhost:11235/monitor/ws"
|
||||
|
||||
async with websockets.connect(uri) as websocket:
|
||||
print("Connected to real-time monitoring stream")
|
||||
|
||||
while True:
|
||||
# Receive update every 2 seconds
|
||||
data = await websocket.recv()
|
||||
update = json.loads(data)
|
||||
|
||||
# Access all monitoring data
|
||||
print(f"\n--- Update at {update['timestamp']} ---")
|
||||
print(f"Memory: {update['health']['container']['memory_percent']:.1f}%")
|
||||
print(f"Active Requests: {len(update['requests']['active'])}")
|
||||
print(f"Total Browsers: {update['browsers']['summary']['total_count']}")
|
||||
|
||||
if update['errors']:
|
||||
print(f"⚠️ Recent Errors: {len(update['errors'])}")
|
||||
|
||||
asyncio.run(monitor_realtime())
|
||||
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Custom Dashboards**: Build tailored monitoring UIs for your team
|
||||
- **Real-time Alerting**: Trigger alerts instantly when metrics exceed thresholds
|
||||
- **Integration**: Feed live data into monitoring tools like Grafana
|
||||
- **Automation**: React to events in real-time without polling
|
||||
|
||||
## 🔥 Smart Browser Pool: 3-Tier Architecture
|
||||
|
||||
**The Problem:** Creating a new browser for every request is slow and memory-intensive. Traditional browser pools are static and inefficient.
|
||||
|
||||
**My Solution:** A smart 3-tier browser pool that automatically adapts to usage patterns.
|
||||
|
||||
### How It Works
|
||||
|
||||
```python
|
||||
import httpx
|
||||
|
||||
async def demonstrate_browser_pool():
|
||||
async with httpx.AsyncClient() as client:
|
||||
# Request 1-3: Default config → Uses permanent browser
|
||||
print("Phase 1: Using permanent browser")
|
||||
for i in range(3):
|
||||
await client.post(
|
||||
"http://localhost:11235/crawl",
|
||||
json={"urls": [f"https://httpbin.org/html?req={i}"]}
|
||||
)
|
||||
print(f" Request {i+1}: Reused permanent browser")
|
||||
|
||||
# Request 4-6: Custom viewport → Cold pool (first use)
|
||||
print("\nPhase 2: Custom config creates cold pool browser")
|
||||
viewport_config = {"viewport": {"width": 1280, "height": 720}}
|
||||
for i in range(4):
|
||||
await client.post(
|
||||
"http://localhost:11235/crawl",
|
||||
json={
|
||||
"urls": [f"https://httpbin.org/json?v={i}"],
|
||||
"browser_config": viewport_config
|
||||
}
|
||||
)
|
||||
if i < 2:
|
||||
print(f" Request {i+1}: Cold pool browser")
|
||||
else:
|
||||
print(f" Request {i+1}: Promoted to hot pool! (after 3 uses)")
|
||||
|
||||
# Check pool status
|
||||
response = await client.get("http://localhost:11235/monitor/browsers")
|
||||
browsers = response.json()
|
||||
|
||||
print(f"\nPool Status:")
|
||||
print(f" Permanent: {len(browsers['permanent'])} (always active)")
|
||||
print(f" Hot: {len(browsers['hot'])} (frequently used configs)")
|
||||
print(f" Cold: {len(browsers['cold'])} (on-demand)")
|
||||
print(f" Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
|
||||
|
||||
asyncio.run(demonstrate_browser_pool())
|
||||
```
|
||||
|
||||
**Pool Tiers:**
|
||||
|
||||
- **🔥 Permanent Browser**: Always-on, default configuration, instant response
|
||||
- **♨️ Hot Pool**: Browsers promoted after 3+ uses, kept warm for quick access
|
||||
- **❄️ Cold Pool**: On-demand browsers for variant configs, cleaned up when idle
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Memory Efficiency**: 10x reduction in memory usage vs creating browsers per request
|
||||
- **Performance**: Instant access to frequently-used configurations
|
||||
- **Automatic Optimization**: Pool adapts to your usage patterns
|
||||
- **Resource Management**: Janitor automatically cleans up idle browsers
|
||||
|
||||
## 🧹 Janitor System: Automatic Cleanup
|
||||
|
||||
**The Problem:** Long-running crawlers accumulate idle browsers and consume memory over time.
|
||||
|
||||
**My Solution:** An automatic janitor system that monitors and cleans up idle resources.
|
||||
|
||||
```python
|
||||
async def monitor_janitor_activity():
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.get("http://localhost:11235/monitor/logs/janitor?limit=5")
|
||||
logs = response.json()
|
||||
|
||||
print("Recent Cleanup Activities:")
|
||||
for log in logs:
|
||||
print(f" {log['timestamp']}: {log['message']}")
|
||||
|
||||
# Example output:
|
||||
# 2025-11-14 10:30:00: Cleaned up 2 cold pool browsers (idle > 5min)
|
||||
# 2025-11-14 10:25:00: Browser reuse rate: 85.3%
|
||||
# 2025-11-14 10:20:00: Hot pool browser promoted (10 requests)
|
||||
```
|
||||
|
||||
## 🎮 Control Actions: Manual Management
|
||||
|
||||
**The Problem:** Sometimes you need to manually intervene—kill a stuck browser, force cleanup, or restart resources.
|
||||
|
||||
**My Solution:** Manual control actions via the API for operational troubleshooting.
|
||||
|
||||
### Force Cleanup
|
||||
|
||||
```python
|
||||
async def force_cleanup():
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.post("http://localhost:11235/monitor/actions/cleanup")
|
||||
result = response.json()
|
||||
|
||||
print(f"Cleanup completed:")
|
||||
print(f" Browsers cleaned: {result.get('cleaned_count', 0)}")
|
||||
print(f" Memory freed: {result.get('memory_freed_mb', 0):.1f} MB")
|
||||
```
|
||||
|
||||
### Kill Specific Browser
|
||||
|
||||
```python
|
||||
async def kill_stuck_browser(browser_id: str):
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.post(
|
||||
"http://localhost:11235/monitor/actions/kill_browser",
|
||||
json={"browser_id": browser_id}
|
||||
)
|
||||
|
||||
if response.status_code == 200:
|
||||
print(f"✅ Browser {browser_id} killed successfully")
|
||||
```
|
||||
|
||||
### Reset Statistics
|
||||
|
||||
```python
|
||||
async def reset_stats():
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.post("http://localhost:11235/monitor/stats/reset")
|
||||
print("📊 Statistics reset for fresh monitoring")
|
||||
```
|
||||
|
||||
## 📈 Production Integration Patterns
|
||||
|
||||
### Prometheus Integration
|
||||
|
||||
```python
|
||||
# Export metrics for Prometheus scraping
|
||||
async def export_prometheus_metrics():
|
||||
async with httpx.AsyncClient() as client:
|
||||
health = await client.get("http://localhost:11235/monitor/health")
|
||||
data = health.json()
|
||||
|
||||
# Export in Prometheus format
|
||||
metrics = f"""
|
||||
# HELP crawl4ai_memory_usage_percent Memory usage percentage
|
||||
# TYPE crawl4ai_memory_usage_percent gauge
|
||||
crawl4ai_memory_usage_percent {data['container']['memory_percent']}
|
||||
|
||||
# HELP crawl4ai_request_success_rate Request success rate
|
||||
# TYPE crawl4ai_request_success_rate gauge
|
||||
crawl4ai_request_success_rate {data['stats']['success_rate_percent']}
|
||||
|
||||
# HELP crawl4ai_browser_pool_count Total browsers in pool
|
||||
# TYPE crawl4ai_browser_pool_count gauge
|
||||
crawl4ai_browser_pool_count {data['pool']['permanent']['active'] + data['pool']['hot']['count'] + data['pool']['cold']['count']}
|
||||
"""
|
||||
return metrics
|
||||
```
|
||||
|
||||
### Alerting Example
|
||||
|
||||
```python
|
||||
async def check_alerts():
|
||||
async with httpx.AsyncClient() as client:
|
||||
health = await client.get("http://localhost:11235/monitor/health")
|
||||
data = health.json()
|
||||
|
||||
# Memory alert
|
||||
if data['container']['memory_percent'] > 80:
|
||||
print("🚨 ALERT: Memory usage above 80%")
|
||||
# Trigger cleanup
|
||||
await client.post("http://localhost:11235/monitor/actions/cleanup")
|
||||
|
||||
# Success rate alert
|
||||
if data['stats']['success_rate_percent'] < 90:
|
||||
print("🚨 ALERT: Success rate below 90%")
|
||||
# Check error logs
|
||||
errors = await client.get("http://localhost:11235/monitor/logs/errors")
|
||||
print(f"Recent errors: {len(errors.json())}")
|
||||
|
||||
# Latency alert
|
||||
if data['stats']['avg_latency_ms'] > 5000:
|
||||
print("🚨 ALERT: Average latency above 5s")
|
||||
```
|
||||
|
||||
### Key Metrics to Track
|
||||
|
||||
```python
|
||||
CRITICAL_METRICS = {
|
||||
"memory_usage": {
|
||||
"current": "container.memory_percent",
|
||||
"target": "<80%",
|
||||
"alert_threshold": ">80%",
|
||||
"action": "Force cleanup or scale"
|
||||
},
|
||||
"success_rate": {
|
||||
"current": "stats.success_rate_percent",
|
||||
"target": ">95%",
|
||||
"alert_threshold": "<90%",
|
||||
"action": "Check error logs"
|
||||
},
|
||||
"avg_latency": {
|
||||
"current": "stats.avg_latency_ms",
|
||||
"target": "<2000ms",
|
||||
"alert_threshold": ">5000ms",
|
||||
"action": "Investigate slow requests"
|
||||
},
|
||||
"browser_reuse_rate": {
|
||||
"current": "browsers.summary.reuse_rate_percent",
|
||||
"target": ">80%",
|
||||
"alert_threshold": "<60%",
|
||||
"action": "Check pool configuration"
|
||||
},
|
||||
"total_browsers": {
|
||||
"current": "browsers.summary.total_count",
|
||||
"target": "<15",
|
||||
"alert_threshold": ">20",
|
||||
"action": "Check for browser leaks"
|
||||
},
|
||||
"error_frequency": {
|
||||
"current": "len(errors)",
|
||||
"target": "<5/hour",
|
||||
"alert_threshold": ">10/hour",
|
||||
"action": "Review error patterns"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## 🐛 Critical Bug Fixes
|
||||
|
||||
This release includes significant bug fixes that improve stability and performance:
|
||||
|
||||
### Async LLM Extraction (#1590)
|
||||
|
||||
**The Problem:** LLM extraction was blocking async execution, causing URLs to be processed sequentially instead of in parallel (issue #1055).
|
||||
|
||||
**The Fix:** Resolved the blocking issue to enable true parallel processing for LLM extraction.
|
||||
|
||||
```python
|
||||
# Before v0.7.7: Sequential processing
|
||||
# After v0.7.7: True parallel processing
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
urls = ["url1", "url2", "url3", "url4"]
|
||||
|
||||
# Now processes truly in parallel with LLM extraction
|
||||
results = await crawler.arun_many(
|
||||
urls,
|
||||
config=CrawlerRunConfig(
|
||||
extraction_strategy=LLMExtractionStrategy(...)
|
||||
)
|
||||
)
|
||||
# 4x faster for parallel LLM extraction!
|
||||
```
|
||||
|
||||
**Expected Impact:** Major performance improvement for batch LLM extraction workflows.
|
||||
|
||||
### DFS Deep Crawling (#1607)
|
||||
|
||||
**The Problem:** DFS (Depth-First Search) deep crawl strategy had implementation issues.
|
||||
|
||||
**The Fix:** Enhanced DFSDeepCrawlStrategy with proper seen URL tracking and improved documentation.
|
||||
|
||||
### Browser & Crawler Config Documentation (#1609)
|
||||
|
||||
**The Problem:** Documentation didn't match the actual `async_configs.py` implementation.
|
||||
|
||||
**The Fix:** Updated all configuration documentation to accurately reflect the current implementation.
|
||||
|
||||
### Sitemap Seeder (#1598)
|
||||
|
||||
**The Problem:** Sitemap parsing and URL normalization issues in AsyncUrlSeeder (issue #1559).
|
||||
|
||||
**The Fix:** Added comprehensive tests and fixes for sitemap namespace parsing and URL normalization.
|
||||
|
||||
### Remove Overlay Elements (#1529)
|
||||
|
||||
**The Problem:** The `remove_overlay_elements` functionality wasn't working (issue #1396).
|
||||
|
||||
**The Fix:** Fixed by properly calling the injected JavaScript function.
|
||||
|
||||
### Viewport Configuration (#1495)
|
||||
|
||||
**The Problem:** Viewport configuration wasn't working in managed browsers (issue #1490).
|
||||
|
||||
**The Fix:** Added proper viewport size configuration support for browser launch.
|
||||
|
||||
### Managed Browser CDP Timing (#1528)
|
||||
|
||||
**The Problem:** CDP (Chrome DevTools Protocol) endpoint verification had timing issues causing connection failures (issue #1445).
|
||||
|
||||
**The Fix:** Added exponential backoff for CDP endpoint verification to handle timing variations.
|
||||
|
||||
### Security Updates
|
||||
|
||||
- **pyOpenSSL**: Updated from >=24.3.0 to >=25.3.0 to address security vulnerability
|
||||
- Added verification tests for the security update
|
||||
|
||||
### Docker Fixes
|
||||
|
||||
- **Port Standardization**: Fixed inconsistent port usage (11234 vs 11235) - now standardized to 11235
|
||||
- **LLM Environment**: Fixed LLM API key handling for multi-provider support (PR #1537)
|
||||
- **Error Handling**: Improved Docker API error messages with comprehensive status codes
|
||||
- **Serialization**: Fixed `fit_html` property serialization in `/crawl` and `/crawl/stream` endpoints
|
||||
|
||||
### Other Important Fixes
|
||||
|
||||
- **arun_many Returns**: Fixed function to always return a list, even on exception (PR #1530)
|
||||
- **Webhook Serialization**: Properly serialize Pydantic HttpUrl in webhook config
|
||||
- **LLMConfig Documentation**: Fixed casing and variable name consistency (issue #1551)
|
||||
- **Python Version**: Dropped Python 3.9 support, now requires Python >=3.10
|
||||
|
||||
## 📊 Expected Real-World Impact
|
||||
|
||||
### For DevOps & Infrastructure Teams
|
||||
- **Full Visibility**: Know exactly what's happening inside your crawling infrastructure
|
||||
- **Proactive Monitoring**: Catch issues before they become problems
|
||||
- **Resource Optimization**: Identify memory leaks and performance bottlenecks
|
||||
- **Operational Control**: Manual intervention when automated systems need help
|
||||
|
||||
### For Production Deployments
|
||||
- **Enterprise Observability**: Prometheus, Grafana, and alerting integration
|
||||
- **Debugging**: Real-time logs and error tracking
|
||||
- **Capacity Planning**: Historical metrics for scaling decisions
|
||||
- **SLA Monitoring**: Track success rates and latency against targets
|
||||
|
||||
### For Development Teams
|
||||
- **Local Monitoring**: Understand crawler behavior during development
|
||||
- **Performance Testing**: Measure impact of configuration changes
|
||||
- **Troubleshooting**: Quickly identify and fix issues
|
||||
- **Learning**: See exactly how the browser pool works
|
||||
|
||||
## 🔄 Breaking Changes
|
||||
|
||||
**None!** This release is fully backward compatible.
|
||||
|
||||
- All existing Docker configurations continue to work
|
||||
- No API changes to existing endpoints
|
||||
- Monitoring is additive functionality
|
||||
- No migration required
|
||||
|
||||
## 🚀 Upgrade Instructions
|
||||
|
||||
### Docker
|
||||
|
||||
```bash
|
||||
# Pull the latest version
|
||||
docker pull unclecode/crawl4ai:0.7.7
|
||||
|
||||
# Or use the latest tag
|
||||
docker pull unclecode/crawl4ai:latest
|
||||
|
||||
# Run with monitoring enabled (default)
|
||||
docker run -d \
|
||||
-p 11235:11235 \
|
||||
--shm-size=1g \
|
||||
--name crawl4ai \
|
||||
unclecode/crawl4ai:0.7.7
|
||||
|
||||
# Access the monitoring dashboard
|
||||
open http://localhost:11235/dashboard
|
||||
```
|
||||
|
||||
### Python Package
|
||||
|
||||
```bash
|
||||
# Upgrade to latest version
|
||||
pip install --upgrade crawl4ai
|
||||
|
||||
# Or install specific version
|
||||
pip install crawl4ai==0.7.7
|
||||
```
|
||||
|
||||
## 🎬 Try the Demo
|
||||
|
||||
Run the comprehensive demo that showcases all monitoring features:
|
||||
|
||||
```bash
|
||||
python docs/releases_review/demo_v0.7.7.py
|
||||
```
|
||||
|
||||
**The demo includes:**
|
||||
1. System health overview with live metrics
|
||||
2. Request tracking with active/completed monitoring
|
||||
3. Browser pool management (permanent/hot/cold)
|
||||
4. Complete Monitor API endpoint examples
|
||||
5. WebSocket streaming demonstration
|
||||
6. Control actions (cleanup, kill, restart)
|
||||
7. Production metrics and alerting patterns
|
||||
8. Self-hosting value proposition
|
||||
|
||||
## 📚 Documentation
|
||||
|
||||
### New Documentation
|
||||
- **[Self-Hosting Guide](https://docs.crawl4ai.com/core/self-hosting/)** - Complete self-hosting documentation with monitoring
|
||||
- **Demo Script**: `docs/releases_review/demo_v0.7.7.py` - Working examples
|
||||
|
||||
### Updated Documentation
|
||||
- **Docker Deployment** → **Self-Hosting** (renamed for better positioning)
|
||||
- Added comprehensive monitoring sections
|
||||
- Production integration patterns
|
||||
- WebSocket streaming examples
|
||||
|
||||
## 💡 Pro Tips
|
||||
|
||||
1. **Start with the dashboard** - Visit `/dashboard` to get familiar with the monitoring system
|
||||
2. **Track the 6 key metrics** - Memory, success rate, latency, reuse rate, browser count, errors
|
||||
3. **Set up alerting early** - Use the Monitor API to build alerts before issues occur
|
||||
4. **Monitor browser pool efficiency** - Aim for >80% reuse rate for optimal performance
|
||||
5. **Use WebSocket for custom dashboards** - Build tailored monitoring UIs for your team
|
||||
6. **Leverage Prometheus integration** - Export metrics for long-term storage and analysis
|
||||
7. **Check janitor logs** - Understand automatic cleanup patterns
|
||||
8. **Use control actions judiciously** - Manual interventions are for exceptional cases
|
||||
|
||||
## 🙏 Acknowledgments
|
||||
|
||||
Thank you to our community for the feedback, bug reports, and feature requests that shaped this release. Special thanks to everyone who contributed to the issues that were fixed in this version.
|
||||
|
||||
The monitoring system was built based on real user needs for production deployments, and your input made it comprehensive and practical.
|
||||
|
||||
## 📞 Support & Resources
|
||||
|
||||
- **📖 Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)
|
||||
- **🐙 GitHub**: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
|
||||
- **💬 Discord**: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
|
||||
- **🐦 Twitter**: [@unclecode](https://x.com/unclecode)
|
||||
- **📊 Dashboard**: `http://localhost:11235/dashboard` (when running)
|
||||
|
||||
---
|
||||
|
||||
**Crawl4AI v0.7.7 delivers complete self-hosting with enterprise-grade monitoring. You now have full visibility and control over your web crawling infrastructure. The monitoring dashboard, comprehensive API, and WebSocket streaming give you everything needed for production deployments. Try the self-hosting platform—it's a game changer for operational excellence!**
|
||||
|
||||
**Happy crawling with full visibility!** 🕷️📊
|
||||
|
||||
*- unclecode*
|
||||
@@ -20,6 +20,25 @@ Ever wondered why your AI coding assistant struggles with your library despite c
|
||||
|
||||
## Latest Release
|
||||
|
||||
### [Crawl4AI v0.7.7 – The Self-Hosting & Monitoring Update](../blog/release-v0.7.7.md)
|
||||
*November 14, 2025*
|
||||
|
||||
Crawl4AI v0.7.7 transforms Docker into a complete self-hosting platform with enterprise-grade real-time monitoring, comprehensive observability, and full operational control. Experience complete visibility into your crawling infrastructure!
|
||||
|
||||
Key highlights:
|
||||
- **📊 Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool visibility
|
||||
- **🔌 Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data
|
||||
- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards
|
||||
- **🔥 Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion and cleanup
|
||||
- **🧹 Janitor System**: Automatic resource management with event logging
|
||||
- **🎮 Control Actions**: Manual browser management (kill, restart, cleanup) via API
|
||||
- **📈 Production Ready**: Prometheus integration, alerting patterns, and 6 critical metrics for ops excellence
|
||||
- **🐛 Critical Fixes**: Async LLM extraction (#1055), DFS crawling (#1607), viewport config, and security updates
|
||||
|
||||
[Read full release notes →](../blog/release-v0.7.7.md)
|
||||
|
||||
## Recent Releases
|
||||
|
||||
### [Crawl4AI v0.7.6 – The Webhook Infrastructure Update](../blog/release-v0.7.6.md)
|
||||
*October 22, 2025*
|
||||
|
||||
@@ -31,12 +50,9 @@ Key highlights:
|
||||
- **🔐 Custom Authentication**: Add custom headers for webhook authentication
|
||||
- **📊 Flexible Delivery**: Choose notification-only or include full data in payload
|
||||
- **⚙️ Global Configuration**: Set default webhook URL in config.yml for all jobs
|
||||
- **🎯 Zero Breaking Changes**: Fully backward compatible, webhooks are opt-in
|
||||
|
||||
[Read full release notes →](../blog/release-v0.7.6.md)
|
||||
|
||||
## Recent Releases
|
||||
|
||||
### [Crawl4AI v0.7.5 – The Docker Hooks & Security Update](../blog/release-v0.7.5.md)
|
||||
*September 29, 2025*
|
||||
|
||||
@@ -47,12 +63,9 @@ Key highlights:
|
||||
- **🤖 Enhanced LLM Integration**: Custom providers with temperature control and base_url configuration
|
||||
- **🔒 HTTPS Preservation**: Secure internal link handling for modern web applications
|
||||
- **🐍 Python 3.10+ Support**: Modern language features and enhanced performance
|
||||
- **🛠️ Bug Fixes**: Resolved multiple community-reported issues including URL processing, JWT authentication, and proxy configuration
|
||||
|
||||
[Read full release notes →](../blog/release-v0.7.5.md)
|
||||
|
||||
## Recent Releases
|
||||
|
||||
### [Crawl4AI v0.7.4 – The Intelligent Table Extraction & Performance Update](../blog/release-v0.7.4.md)
|
||||
*August 17, 2025*
|
||||
|
||||
|
||||
626
docs/md_v2/blog/releases/v0.7.7.md
Normal file
626
docs/md_v2/blog/releases/v0.7.7.md
Normal file
@@ -0,0 +1,626 @@
|
||||
# 🚀 Crawl4AI v0.7.7: The Self-Hosting & Monitoring Update
|
||||
|
||||
*November 14, 2025 • 10 min read*
|
||||
|
||||
---
|
||||
|
||||
Today I'm releasing Crawl4AI v0.7.7—the Self-Hosting & Monitoring Update. This release transforms Crawl4AI Docker from a simple containerized crawler into a complete self-hosting platform with enterprise-grade real-time monitoring, full operational transparency, and production-ready observability.
|
||||
|
||||
## 🎯 What's New at a Glance
|
||||
|
||||
- **📊 Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool status
|
||||
- **🔌 Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data
|
||||
- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards
|
||||
- **🎮 Control Actions**: Manual browser management (kill, restart, cleanup)
|
||||
- **🔥 Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion
|
||||
- **🧹 Janitor Cleanup System**: Automatic resource management with event logging
|
||||
- **📈 Production Metrics**: 6 critical metrics for operational excellence
|
||||
- **🏭 Integration Ready**: Prometheus, alerting, and log aggregation examples
|
||||
- **🐛 Critical Bug Fixes**: Async LLM extraction, DFS crawling, viewport config, and more
|
||||
|
||||
## 📊 Real-time Monitoring Dashboard: Complete Visibility
|
||||
|
||||
**The Problem:** Running Crawl4AI in Docker was like flying blind. Users had no visibility into what was happening inside the container—memory usage, active requests, browser pools, or errors. Troubleshooting required checking logs, and there was no way to monitor performance or manually intervene when issues occurred.
|
||||
|
||||
**My Solution:** I built a complete real-time monitoring system with an interactive dashboard, comprehensive REST API, WebSocket streaming, and manual control actions. Now you have full transparency and control over your crawling infrastructure.
|
||||
|
||||
### The Self-Hosting Value Proposition
|
||||
|
||||
Before v0.7.7, Docker was just a containerized crawler. After v0.7.7, it's a complete self-hosting platform that gives you:
|
||||
|
||||
- **🔒 Data Privacy**: Your data never leaves your infrastructure
|
||||
- **💰 Cost Control**: No per-request pricing or rate limits
|
||||
- **🎯 Full Customization**: Complete control over configurations and strategies
|
||||
- **📊 Complete Transparency**: Real-time visibility into every aspect
|
||||
- **⚡ Performance**: Direct access without network overhead
|
||||
- **🛡️ Enterprise Security**: Keep workflows behind your firewall
|
||||
|
||||
### Interactive Monitoring Dashboard
|
||||
|
||||
Access the dashboard at `http://localhost:11235/dashboard` to see:
|
||||
|
||||
- **System Health Overview**: CPU, memory, network, and uptime in real-time
|
||||
- **Live Request Tracking**: Active and completed requests with full details
|
||||
- **Browser Pool Management**: Interactive table with permanent/hot/cold browsers
|
||||
- **Janitor Events Log**: Automatic cleanup activities
|
||||
- **Error Monitoring**: Full context error logs
|
||||
|
||||
The dashboard updates every 2 seconds via WebSocket, giving you live visibility into your crawling operations.
|
||||
|
||||
## 🔌 Monitor API: Programmatic Access
|
||||
|
||||
**The Problem:** Monitoring dashboards are great for humans, but automation and integration require programmatic access.
|
||||
|
||||
**My Solution:** A comprehensive REST API that exposes all monitoring data for integration with your existing infrastructure.
|
||||
|
||||
### System Health Endpoint
|
||||
|
||||
```python
|
||||
import httpx
|
||||
import asyncio
|
||||
|
||||
async def monitor_system_health():
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.get("http://localhost:11235/monitor/health")
|
||||
health = response.json()
|
||||
|
||||
print(f"Container Metrics:")
|
||||
print(f" CPU: {health['container']['cpu_percent']:.1f}%")
|
||||
print(f" Memory: {health['container']['memory_percent']:.1f}%")
|
||||
print(f" Uptime: {health['container']['uptime_seconds']}s")
|
||||
|
||||
print(f"\nBrowser Pool:")
|
||||
print(f" Permanent: {health['pool']['permanent']['active']} active")
|
||||
print(f" Hot Pool: {health['pool']['hot']['count']} browsers")
|
||||
print(f" Cold Pool: {health['pool']['cold']['count']} browsers")
|
||||
|
||||
print(f"\nStatistics:")
|
||||
print(f" Total Requests: {health['stats']['total_requests']}")
|
||||
print(f" Success Rate: {health['stats']['success_rate_percent']:.1f}%")
|
||||
print(f" Avg Latency: {health['stats']['avg_latency_ms']:.0f}ms")
|
||||
|
||||
asyncio.run(monitor_system_health())
|
||||
```
|
||||
|
||||
### Request Tracking
|
||||
|
||||
```python
|
||||
async def track_requests():
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.get("http://localhost:11235/monitor/requests")
|
||||
requests_data = response.json()
|
||||
|
||||
print(f"Active Requests: {len(requests_data['active'])}")
|
||||
print(f"Completed Requests: {len(requests_data['completed'])}")
|
||||
|
||||
# See details of recent requests
|
||||
for req in requests_data['completed'][:5]:
|
||||
status_icon = "✅" if req['success'] else "❌"
|
||||
print(f"{status_icon} {req['endpoint']} - {req['latency_ms']:.0f}ms")
|
||||
```
|
||||
|
||||
### Browser Pool Management
|
||||
|
||||
```python
|
||||
async def monitor_browser_pool():
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.get("http://localhost:11235/monitor/browsers")
|
||||
browsers = response.json()
|
||||
|
||||
print(f"Pool Summary:")
|
||||
print(f" Total Browsers: {browsers['summary']['total_count']}")
|
||||
print(f" Total Memory: {browsers['summary']['total_memory_mb']} MB")
|
||||
print(f" Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
|
||||
|
||||
# List all browsers
|
||||
for browser in browsers['permanent']:
|
||||
print(f"🔥 Permanent: {browser['browser_id'][:8]}... | "
|
||||
f"Requests: {browser['request_count']} | "
|
||||
f"Memory: {browser['memory_mb']:.0f} MB")
|
||||
```
|
||||
|
||||
### Endpoint Performance Statistics
|
||||
|
||||
```python
|
||||
async def get_endpoint_stats():
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.get("http://localhost:11235/monitor/endpoints/stats")
|
||||
stats = response.json()
|
||||
|
||||
print("Endpoint Analytics:")
|
||||
for endpoint, data in stats.items():
|
||||
print(f" {endpoint}:")
|
||||
print(f" Requests: {data['count']}")
|
||||
print(f" Avg Latency: {data['avg_latency_ms']:.0f}ms")
|
||||
print(f" Success Rate: {data['success_rate_percent']:.1f}%")
|
||||
```
|
||||
|
||||
### Complete API Reference
|
||||
|
||||
The Monitor API includes these endpoints:
|
||||
|
||||
- `GET /monitor/health` - System health with pool statistics
|
||||
- `GET /monitor/requests` - Active and completed request tracking
|
||||
- `GET /monitor/browsers` - Browser pool details and efficiency
|
||||
- `GET /monitor/endpoints/stats` - Per-endpoint performance analytics
|
||||
- `GET /monitor/timeline?minutes=5` - Time-series data for charts
|
||||
- `GET /monitor/logs/janitor?limit=10` - Cleanup activity logs
|
||||
- `GET /monitor/logs/errors?limit=10` - Error logs with context
|
||||
- `POST /monitor/actions/cleanup` - Force immediate cleanup
|
||||
- `POST /monitor/actions/kill_browser` - Kill specific browser
|
||||
- `POST /monitor/actions/restart_browser` - Restart browser
|
||||
- `POST /monitor/stats/reset` - Reset accumulated statistics
|
||||
|
||||
## ⚡ WebSocket Streaming: Real-time Updates
|
||||
|
||||
**The Problem:** Polling the API every few seconds wastes resources and adds latency. Real-time dashboards need instant updates.
|
||||
|
||||
**My Solution:** WebSocket streaming with 2-second update intervals for building custom real-time dashboards.
|
||||
|
||||
### WebSocket Integration Example
|
||||
|
||||
```python
|
||||
import websockets
|
||||
import json
|
||||
import asyncio
|
||||
|
||||
async def monitor_realtime():
|
||||
uri = "ws://localhost:11235/monitor/ws"
|
||||
|
||||
async with websockets.connect(uri) as websocket:
|
||||
print("Connected to real-time monitoring stream")
|
||||
|
||||
while True:
|
||||
# Receive update every 2 seconds
|
||||
data = await websocket.recv()
|
||||
update = json.loads(data)
|
||||
|
||||
# Access all monitoring data
|
||||
print(f"\n--- Update at {update['timestamp']} ---")
|
||||
print(f"Memory: {update['health']['container']['memory_percent']:.1f}%")
|
||||
print(f"Active Requests: {len(update['requests']['active'])}")
|
||||
print(f"Total Browsers: {update['browsers']['summary']['total_count']}")
|
||||
|
||||
if update['errors']:
|
||||
print(f"⚠️ Recent Errors: {len(update['errors'])}")
|
||||
|
||||
asyncio.run(monitor_realtime())
|
||||
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Custom Dashboards**: Build tailored monitoring UIs for your team
|
||||
- **Real-time Alerting**: Trigger alerts instantly when metrics exceed thresholds
|
||||
- **Integration**: Feed live data into monitoring tools like Grafana
|
||||
- **Automation**: React to events in real-time without polling
|
||||
|
||||
## 🔥 Smart Browser Pool: 3-Tier Architecture
|
||||
|
||||
**The Problem:** Creating a new browser for every request is slow and memory-intensive. Traditional browser pools are static and inefficient.
|
||||
|
||||
**My Solution:** A smart 3-tier browser pool that automatically adapts to usage patterns.
|
||||
|
||||
### How It Works
|
||||
|
||||
```python
|
||||
import httpx
|
||||
|
||||
async def demonstrate_browser_pool():
|
||||
async with httpx.AsyncClient() as client:
|
||||
# Request 1-3: Default config → Uses permanent browser
|
||||
print("Phase 1: Using permanent browser")
|
||||
for i in range(3):
|
||||
await client.post(
|
||||
"http://localhost:11235/crawl",
|
||||
json={"urls": [f"https://httpbin.org/html?req={i}"]}
|
||||
)
|
||||
print(f" Request {i+1}: Reused permanent browser")
|
||||
|
||||
# Request 4-6: Custom viewport → Cold pool (first use)
|
||||
print("\nPhase 2: Custom config creates cold pool browser")
|
||||
viewport_config = {"viewport": {"width": 1280, "height": 720}}
|
||||
for i in range(4):
|
||||
await client.post(
|
||||
"http://localhost:11235/crawl",
|
||||
json={
|
||||
"urls": [f"https://httpbin.org/json?v={i}"],
|
||||
"browser_config": viewport_config
|
||||
}
|
||||
)
|
||||
if i < 2:
|
||||
print(f" Request {i+1}: Cold pool browser")
|
||||
else:
|
||||
print(f" Request {i+1}: Promoted to hot pool! (after 3 uses)")
|
||||
|
||||
# Check pool status
|
||||
response = await client.get("http://localhost:11235/monitor/browsers")
|
||||
browsers = response.json()
|
||||
|
||||
print(f"\nPool Status:")
|
||||
print(f" Permanent: {len(browsers['permanent'])} (always active)")
|
||||
print(f" Hot: {len(browsers['hot'])} (frequently used configs)")
|
||||
print(f" Cold: {len(browsers['cold'])} (on-demand)")
|
||||
print(f" Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
|
||||
|
||||
asyncio.run(demonstrate_browser_pool())
|
||||
```
|
||||
|
||||
**Pool Tiers:**
|
||||
|
||||
- **🔥 Permanent Browser**: Always-on, default configuration, instant response
|
||||
- **♨️ Hot Pool**: Browsers promoted after 3+ uses, kept warm for quick access
|
||||
- **❄️ Cold Pool**: On-demand browsers for variant configs, cleaned up when idle
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Memory Efficiency**: 10x reduction in memory usage vs creating browsers per request
|
||||
- **Performance**: Instant access to frequently-used configurations
|
||||
- **Automatic Optimization**: Pool adapts to your usage patterns
|
||||
- **Resource Management**: Janitor automatically cleans up idle browsers
|
||||
|
||||
## 🧹 Janitor System: Automatic Cleanup
|
||||
|
||||
**The Problem:** Long-running crawlers accumulate idle browsers and consume memory over time.
|
||||
|
||||
**My Solution:** An automatic janitor system that monitors and cleans up idle resources.
|
||||
|
||||
```python
|
||||
async def monitor_janitor_activity():
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.get("http://localhost:11235/monitor/logs/janitor?limit=5")
|
||||
logs = response.json()
|
||||
|
||||
print("Recent Cleanup Activities:")
|
||||
for log in logs:
|
||||
print(f" {log['timestamp']}: {log['message']}")
|
||||
|
||||
# Example output:
|
||||
# 2025-11-14 10:30:00: Cleaned up 2 cold pool browsers (idle > 5min)
|
||||
# 2025-11-14 10:25:00: Browser reuse rate: 85.3%
|
||||
# 2025-11-14 10:20:00: Hot pool browser promoted (10 requests)
|
||||
```
|
||||
|
||||
## 🎮 Control Actions: Manual Management
|
||||
|
||||
**The Problem:** Sometimes you need to manually intervene—kill a stuck browser, force cleanup, or restart resources.
|
||||
|
||||
**My Solution:** Manual control actions via the API for operational troubleshooting.
|
||||
|
||||
### Force Cleanup
|
||||
|
||||
```python
|
||||
async def force_cleanup():
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.post("http://localhost:11235/monitor/actions/cleanup")
|
||||
result = response.json()
|
||||
|
||||
print(f"Cleanup completed:")
|
||||
print(f" Browsers cleaned: {result.get('cleaned_count', 0)}")
|
||||
print(f" Memory freed: {result.get('memory_freed_mb', 0):.1f} MB")
|
||||
```
|
||||
|
||||
### Kill Specific Browser
|
||||
|
||||
```python
|
||||
async def kill_stuck_browser(browser_id: str):
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.post(
|
||||
"http://localhost:11235/monitor/actions/kill_browser",
|
||||
json={"browser_id": browser_id}
|
||||
)
|
||||
|
||||
if response.status_code == 200:
|
||||
print(f"✅ Browser {browser_id} killed successfully")
|
||||
```
|
||||
|
||||
### Reset Statistics
|
||||
|
||||
```python
|
||||
async def reset_stats():
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.post("http://localhost:11235/monitor/stats/reset")
|
||||
print("📊 Statistics reset for fresh monitoring")
|
||||
```
|
||||
|
||||
## 📈 Production Integration Patterns
|
||||
|
||||
### Prometheus Integration
|
||||
|
||||
```python
|
||||
# Export metrics for Prometheus scraping
|
||||
async def export_prometheus_metrics():
|
||||
async with httpx.AsyncClient() as client:
|
||||
health = await client.get("http://localhost:11235/monitor/health")
|
||||
data = health.json()
|
||||
|
||||
# Export in Prometheus format
|
||||
metrics = f"""
|
||||
# HELP crawl4ai_memory_usage_percent Memory usage percentage
|
||||
# TYPE crawl4ai_memory_usage_percent gauge
|
||||
crawl4ai_memory_usage_percent {data['container']['memory_percent']}
|
||||
|
||||
# HELP crawl4ai_request_success_rate Request success rate
|
||||
# TYPE crawl4ai_request_success_rate gauge
|
||||
crawl4ai_request_success_rate {data['stats']['success_rate_percent']}
|
||||
|
||||
# HELP crawl4ai_browser_pool_count Total browsers in pool
|
||||
# TYPE crawl4ai_browser_pool_count gauge
|
||||
crawl4ai_browser_pool_count {data['pool']['permanent']['active'] + data['pool']['hot']['count'] + data['pool']['cold']['count']}
|
||||
"""
|
||||
return metrics
|
||||
```
|
||||
|
||||
### Alerting Example
|
||||
|
||||
```python
|
||||
async def check_alerts():
|
||||
async with httpx.AsyncClient() as client:
|
||||
health = await client.get("http://localhost:11235/monitor/health")
|
||||
data = health.json()
|
||||
|
||||
# Memory alert
|
||||
if data['container']['memory_percent'] > 80:
|
||||
print("🚨 ALERT: Memory usage above 80%")
|
||||
# Trigger cleanup
|
||||
await client.post("http://localhost:11235/monitor/actions/cleanup")
|
||||
|
||||
# Success rate alert
|
||||
if data['stats']['success_rate_percent'] < 90:
|
||||
print("🚨 ALERT: Success rate below 90%")
|
||||
# Check error logs
|
||||
errors = await client.get("http://localhost:11235/monitor/logs/errors")
|
||||
print(f"Recent errors: {len(errors.json())}")
|
||||
|
||||
# Latency alert
|
||||
if data['stats']['avg_latency_ms'] > 5000:
|
||||
print("🚨 ALERT: Average latency above 5s")
|
||||
```
|
||||
|
||||
### Key Metrics to Track
|
||||
|
||||
```python
|
||||
CRITICAL_METRICS = {
|
||||
"memory_usage": {
|
||||
"current": "container.memory_percent",
|
||||
"target": "<80%",
|
||||
"alert_threshold": ">80%",
|
||||
"action": "Force cleanup or scale"
|
||||
},
|
||||
"success_rate": {
|
||||
"current": "stats.success_rate_percent",
|
||||
"target": ">95%",
|
||||
"alert_threshold": "<90%",
|
||||
"action": "Check error logs"
|
||||
},
|
||||
"avg_latency": {
|
||||
"current": "stats.avg_latency_ms",
|
||||
"target": "<2000ms",
|
||||
"alert_threshold": ">5000ms",
|
||||
"action": "Investigate slow requests"
|
||||
},
|
||||
"browser_reuse_rate": {
|
||||
"current": "browsers.summary.reuse_rate_percent",
|
||||
"target": ">80%",
|
||||
"alert_threshold": "<60%",
|
||||
"action": "Check pool configuration"
|
||||
},
|
||||
"total_browsers": {
|
||||
"current": "browsers.summary.total_count",
|
||||
"target": "<15",
|
||||
"alert_threshold": ">20",
|
||||
"action": "Check for browser leaks"
|
||||
},
|
||||
"error_frequency": {
|
||||
"current": "len(errors)",
|
||||
"target": "<5/hour",
|
||||
"alert_threshold": ">10/hour",
|
||||
"action": "Review error patterns"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## 🐛 Critical Bug Fixes
|
||||
|
||||
This release includes significant bug fixes that improve stability and performance:
|
||||
|
||||
### Async LLM Extraction (#1590)
|
||||
|
||||
**The Problem:** LLM extraction was blocking async execution, causing URLs to be processed sequentially instead of in parallel (issue #1055).
|
||||
|
||||
**The Fix:** Resolved the blocking issue to enable true parallel processing for LLM extraction.
|
||||
|
||||
```python
|
||||
# Before v0.7.7: Sequential processing
|
||||
# After v0.7.7: True parallel processing
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
urls = ["url1", "url2", "url3", "url4"]
|
||||
|
||||
# Now processes truly in parallel with LLM extraction
|
||||
results = await crawler.arun_many(
|
||||
urls,
|
||||
config=CrawlerRunConfig(
|
||||
extraction_strategy=LLMExtractionStrategy(...)
|
||||
)
|
||||
)
|
||||
# 4x faster for parallel LLM extraction!
|
||||
```
|
||||
|
||||
**Expected Impact:** Major performance improvement for batch LLM extraction workflows.
|
||||
|
||||
### DFS Deep Crawling (#1607)
|
||||
|
||||
**The Problem:** DFS (Depth-First Search) deep crawl strategy had implementation issues.
|
||||
|
||||
**The Fix:** Enhanced DFSDeepCrawlStrategy with proper seen URL tracking and improved documentation.
|
||||
|
||||
### Browser & Crawler Config Documentation (#1609)
|
||||
|
||||
**The Problem:** Documentation didn't match the actual `async_configs.py` implementation.
|
||||
|
||||
**The Fix:** Updated all configuration documentation to accurately reflect the current implementation.
|
||||
|
||||
### Sitemap Seeder (#1598)
|
||||
|
||||
**The Problem:** Sitemap parsing and URL normalization issues in AsyncUrlSeeder (issue #1559).
|
||||
|
||||
**The Fix:** Added comprehensive tests and fixes for sitemap namespace parsing and URL normalization.
|
||||
|
||||
### Remove Overlay Elements (#1529)
|
||||
|
||||
**The Problem:** The `remove_overlay_elements` functionality wasn't working (issue #1396).
|
||||
|
||||
**The Fix:** Fixed by properly calling the injected JavaScript function.
|
||||
|
||||
### Viewport Configuration (#1495)
|
||||
|
||||
**The Problem:** Viewport configuration wasn't working in managed browsers (issue #1490).
|
||||
|
||||
**The Fix:** Added proper viewport size configuration support for browser launch.
|
||||
|
||||
### Managed Browser CDP Timing (#1528)
|
||||
|
||||
**The Problem:** CDP (Chrome DevTools Protocol) endpoint verification had timing issues causing connection failures (issue #1445).
|
||||
|
||||
**The Fix:** Added exponential backoff for CDP endpoint verification to handle timing variations.
|
||||
|
||||
### Security Updates
|
||||
|
||||
- **pyOpenSSL**: Updated from >=24.3.0 to >=25.3.0 to address security vulnerability
|
||||
- Added verification tests for the security update
|
||||
|
||||
### Docker Fixes
|
||||
|
||||
- **Port Standardization**: Fixed inconsistent port usage (11234 vs 11235) - now standardized to 11235
|
||||
- **LLM Environment**: Fixed LLM API key handling for multi-provider support (PR #1537)
|
||||
- **Error Handling**: Improved Docker API error messages with comprehensive status codes
|
||||
- **Serialization**: Fixed `fit_html` property serialization in `/crawl` and `/crawl/stream` endpoints
|
||||
|
||||
### Other Important Fixes
|
||||
|
||||
- **arun_many Returns**: Fixed function to always return a list, even on exception (PR #1530)
|
||||
- **Webhook Serialization**: Properly serialize Pydantic HttpUrl in webhook config
|
||||
- **LLMConfig Documentation**: Fixed casing and variable name consistency (issue #1551)
|
||||
- **Python Version**: Dropped Python 3.9 support, now requires Python >=3.10
|
||||
|
||||
## 📊 Expected Real-World Impact
|
||||
|
||||
### For DevOps & Infrastructure Teams
|
||||
- **Full Visibility**: Know exactly what's happening inside your crawling infrastructure
|
||||
- **Proactive Monitoring**: Catch issues before they become problems
|
||||
- **Resource Optimization**: Identify memory leaks and performance bottlenecks
|
||||
- **Operational Control**: Manual intervention when automated systems need help
|
||||
|
||||
### For Production Deployments
|
||||
- **Enterprise Observability**: Prometheus, Grafana, and alerting integration
|
||||
- **Debugging**: Real-time logs and error tracking
|
||||
- **Capacity Planning**: Historical metrics for scaling decisions
|
||||
- **SLA Monitoring**: Track success rates and latency against targets
|
||||
|
||||
### For Development Teams
|
||||
- **Local Monitoring**: Understand crawler behavior during development
|
||||
- **Performance Testing**: Measure impact of configuration changes
|
||||
- **Troubleshooting**: Quickly identify and fix issues
|
||||
- **Learning**: See exactly how the browser pool works
|
||||
|
||||
## 🔄 Breaking Changes
|
||||
|
||||
**None!** This release is fully backward compatible.
|
||||
|
||||
- All existing Docker configurations continue to work
|
||||
- No API changes to existing endpoints
|
||||
- Monitoring is additive functionality
|
||||
- No migration required
|
||||
|
||||
## 🚀 Upgrade Instructions
|
||||
|
||||
### Docker
|
||||
|
||||
```bash
|
||||
# Pull the latest version
|
||||
docker pull unclecode/crawl4ai:0.7.7
|
||||
|
||||
# Or use the latest tag
|
||||
docker pull unclecode/crawl4ai:latest
|
||||
|
||||
# Run with monitoring enabled (default)
|
||||
docker run -d \
|
||||
-p 11235:11235 \
|
||||
--shm-size=1g \
|
||||
--name crawl4ai \
|
||||
unclecode/crawl4ai:0.7.7
|
||||
|
||||
# Access the monitoring dashboard
|
||||
open http://localhost:11235/dashboard
|
||||
```
|
||||
|
||||
### Python Package
|
||||
|
||||
```bash
|
||||
# Upgrade to latest version
|
||||
pip install --upgrade crawl4ai
|
||||
|
||||
# Or install specific version
|
||||
pip install crawl4ai==0.7.7
|
||||
```
|
||||
|
||||
## 🎬 Try the Demo
|
||||
|
||||
Run the comprehensive demo that showcases all monitoring features:
|
||||
|
||||
```bash
|
||||
python docs/releases_review/demo_v0.7.7.py
|
||||
```
|
||||
|
||||
**The demo includes:**
|
||||
1. System health overview with live metrics
|
||||
2. Request tracking with active/completed monitoring
|
||||
3. Browser pool management (permanent/hot/cold)
|
||||
4. Complete Monitor API endpoint examples
|
||||
5. WebSocket streaming demonstration
|
||||
6. Control actions (cleanup, kill, restart)
|
||||
7. Production metrics and alerting patterns
|
||||
8. Self-hosting value proposition
|
||||
|
||||
## 📚 Documentation
|
||||
|
||||
### New Documentation
|
||||
- **[Self-Hosting Guide](https://docs.crawl4ai.com/core/self-hosting/)** - Complete self-hosting documentation with monitoring
|
||||
- **Demo Script**: `docs/releases_review/demo_v0.7.7.py` - Working examples
|
||||
|
||||
### Updated Documentation
|
||||
- **Docker Deployment** → **Self-Hosting** (renamed for better positioning)
|
||||
- Added comprehensive monitoring sections
|
||||
- Production integration patterns
|
||||
- WebSocket streaming examples
|
||||
|
||||
## 💡 Pro Tips
|
||||
|
||||
1. **Start with the dashboard** - Visit `/dashboard` to get familiar with the monitoring system
|
||||
2. **Track the 6 key metrics** - Memory, success rate, latency, reuse rate, browser count, errors
|
||||
3. **Set up alerting early** - Use the Monitor API to build alerts before issues occur
|
||||
4. **Monitor browser pool efficiency** - Aim for >80% reuse rate for optimal performance
|
||||
5. **Use WebSocket for custom dashboards** - Build tailored monitoring UIs for your team
|
||||
6. **Leverage Prometheus integration** - Export metrics for long-term storage and analysis
|
||||
7. **Check janitor logs** - Understand automatic cleanup patterns
|
||||
8. **Use control actions judiciously** - Manual interventions are for exceptional cases
|
||||
|
||||
## 🙏 Acknowledgments
|
||||
|
||||
Thank you to our community for the feedback, bug reports, and feature requests that shaped this release. Special thanks to everyone who contributed to the issues that were fixed in this version.
|
||||
|
||||
The monitoring system was built based on real user needs for production deployments, and your input made it comprehensive and practical.
|
||||
|
||||
## 📞 Support & Resources
|
||||
|
||||
- **📖 Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)
|
||||
- **🐙 GitHub**: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
|
||||
- **💬 Discord**: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
|
||||
- **🐦 Twitter**: [@unclecode](https://x.com/unclecode)
|
||||
- **📊 Dashboard**: `http://localhost:11235/dashboard` (when running)
|
||||
|
||||
---
|
||||
|
||||
**Crawl4AI v0.7.7 delivers complete self-hosting with enterprise-grade monitoring. You now have full visibility and control over your web crawling infrastructure. The monitoring dashboard, comprehensive API, and WebSocket streaming give you everything needed for production deployments. Try the self-hosting platform—it's a game changer for operational excellence!**
|
||||
|
||||
**Happy crawling with full visibility!** 🕷️📊
|
||||
|
||||
*- unclecode*
|
||||
628
docs/releases_review/demo_v0.7.7.py
Normal file
628
docs/releases_review/demo_v0.7.7.py
Normal file
@@ -0,0 +1,628 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Crawl4AI v0.7.7 Release Demo
|
||||
============================
|
||||
|
||||
This demo showcases the major feature in v0.7.7:
|
||||
**Self-Hosting with Real-time Monitoring Dashboard**
|
||||
|
||||
Features Demonstrated:
|
||||
1. System health monitoring with live metrics
|
||||
2. Real-time request tracking (active & completed)
|
||||
3. Browser pool management (permanent/hot/cold pools)
|
||||
4. Monitor API endpoints for programmatic access
|
||||
5. WebSocket streaming for real-time updates
|
||||
6. Control actions (kill browser, cleanup, restart)
|
||||
7. Production metrics (efficiency, reuse rates, memory)
|
||||
|
||||
Prerequisites:
|
||||
- Crawl4AI Docker container running on localhost:11235
|
||||
- Python packages: pip install httpx websockets
|
||||
|
||||
Usage:
|
||||
python docs/releases_review/demo_v0.7.7.py
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import httpx
|
||||
import json
|
||||
import time
|
||||
from datetime import datetime
|
||||
from typing import Dict, Any
|
||||
|
||||
# Configuration
|
||||
CRAWL4AI_BASE_URL = "http://localhost:11235"
|
||||
MONITOR_DASHBOARD_URL = f"{CRAWL4AI_BASE_URL}/dashboard"
|
||||
|
||||
|
||||
def print_section(title: str, description: str = ""):
|
||||
"""Print a formatted section header"""
|
||||
print(f"\n{'=' * 70}")
|
||||
print(f"📊 {title}")
|
||||
if description:
|
||||
print(f"{description}")
|
||||
print(f"{'=' * 70}\n")
|
||||
|
||||
|
||||
def print_subsection(title: str):
|
||||
"""Print a formatted subsection header"""
|
||||
print(f"\n{'-' * 70}")
|
||||
print(f"{title}")
|
||||
print(f"{'-' * 70}")
|
||||
|
||||
|
||||
async def check_server_health():
|
||||
"""Check if Crawl4AI server is running"""
|
||||
try:
|
||||
async with httpx.AsyncClient(timeout=5.0) as client:
|
||||
response = await client.get(f"{CRAWL4AI_BASE_URL}/health")
|
||||
return response.status_code == 200
|
||||
except:
|
||||
return False
|
||||
|
||||
|
||||
async def demo_1_system_health_overview():
|
||||
"""Demo 1: System Health Overview - Live metrics and pool status"""
|
||||
print_section(
|
||||
"Demo 1: System Health Overview",
|
||||
"Real-time monitoring of system resources and browser pool"
|
||||
)
|
||||
|
||||
async with httpx.AsyncClient(timeout=30.0) as client:
|
||||
print("🔍 Fetching system health metrics...")
|
||||
|
||||
try:
|
||||
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/health")
|
||||
health = response.json()
|
||||
|
||||
print("\n✅ System Health Report:")
|
||||
print(f"\n🖥️ Container Metrics:")
|
||||
print(f" • CPU Usage: {health['container']['cpu_percent']:.1f}%")
|
||||
print(f" • Memory Usage: {health['container']['memory_percent']:.1f}% "
|
||||
f"({health['container']['memory_mb']:.0f} MB)")
|
||||
print(f" • Network RX: {health['container']['network_rx_mb']:.2f} MB")
|
||||
print(f" • Network TX: {health['container']['network_tx_mb']:.2f} MB")
|
||||
print(f" • Uptime: {health['container']['uptime_seconds']:.0f}s")
|
||||
|
||||
print(f"\n🌐 Browser Pool Status:")
|
||||
print(f" Permanent Browser:")
|
||||
print(f" • Active: {health['pool']['permanent']['active']}")
|
||||
print(f" • Total Requests: {health['pool']['permanent']['total_requests']}")
|
||||
|
||||
print(f" Hot Pool (Frequently Used Configs):")
|
||||
print(f" • Count: {health['pool']['hot']['count']}")
|
||||
print(f" • Total Requests: {health['pool']['hot']['total_requests']}")
|
||||
|
||||
print(f" Cold Pool (On-Demand Configs):")
|
||||
print(f" • Count: {health['pool']['cold']['count']}")
|
||||
print(f" • Total Requests: {health['pool']['cold']['total_requests']}")
|
||||
|
||||
print(f"\n📈 Overall Statistics:")
|
||||
print(f" • Total Requests: {health['stats']['total_requests']}")
|
||||
print(f" • Success Rate: {health['stats']['success_rate_percent']:.1f}%")
|
||||
print(f" • Avg Latency: {health['stats']['avg_latency_ms']:.0f}ms")
|
||||
|
||||
print(f"\n💡 Dashboard URL: {MONITOR_DASHBOARD_URL}")
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Error fetching health: {e}")
|
||||
|
||||
|
||||
async def demo_2_request_tracking():
|
||||
"""Demo 2: Real-time Request Tracking - Generate and monitor requests"""
|
||||
print_section(
|
||||
"Demo 2: Real-time Request Tracking",
|
||||
"Submit crawl jobs and watch them in real-time"
|
||||
)
|
||||
|
||||
async with httpx.AsyncClient(timeout=60.0) as client:
|
||||
print("🚀 Submitting crawl requests...")
|
||||
|
||||
# Submit multiple requests
|
||||
urls_to_crawl = [
|
||||
"https://httpbin.org/html",
|
||||
"https://httpbin.org/json",
|
||||
"https://example.com"
|
||||
]
|
||||
|
||||
tasks = []
|
||||
for url in urls_to_crawl:
|
||||
task = client.post(
|
||||
f"{CRAWL4AI_BASE_URL}/crawl",
|
||||
json={"urls": [url], "crawler_config": {}}
|
||||
)
|
||||
tasks.append(task)
|
||||
|
||||
print(f" • Submitting {len(urls_to_crawl)} requests in parallel...")
|
||||
results = await asyncio.gather(*tasks, return_exceptions=True)
|
||||
|
||||
successful = sum(1 for r in results if not isinstance(r, Exception) and r.status_code == 200)
|
||||
print(f" ✅ {successful}/{len(urls_to_crawl)} requests submitted")
|
||||
|
||||
# Check request tracking
|
||||
print("\n📊 Checking request tracking...")
|
||||
await asyncio.sleep(2) # Wait for requests to process
|
||||
|
||||
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/requests")
|
||||
requests_data = response.json()
|
||||
|
||||
print(f"\n📋 Request Status:")
|
||||
print(f" • Active Requests: {len(requests_data['active'])}")
|
||||
print(f" • Completed Requests: {len(requests_data['completed'])}")
|
||||
|
||||
if requests_data['completed']:
|
||||
print(f"\n📝 Recent Completed Requests:")
|
||||
for req in requests_data['completed'][:3]:
|
||||
status_icon = "✅" if req['success'] else "❌"
|
||||
print(f" {status_icon} {req['endpoint']} - {req['latency_ms']:.0f}ms")
|
||||
|
||||
|
||||
async def demo_3_browser_pool_management():
|
||||
"""Demo 3: Browser Pool Management - 3-tier architecture in action"""
|
||||
print_section(
|
||||
"Demo 3: Browser Pool Management",
|
||||
"Understanding permanent, hot, and cold browser pools"
|
||||
)
|
||||
|
||||
async with httpx.AsyncClient(timeout=60.0) as client:
|
||||
print("🌊 Testing browser pool with different configurations...")
|
||||
|
||||
# Test 1: Default config (permanent browser)
|
||||
print("\n🔥 Test 1: Default Config → Permanent Browser")
|
||||
for i in range(3):
|
||||
await client.post(
|
||||
f"{CRAWL4AI_BASE_URL}/crawl",
|
||||
json={"urls": [f"https://httpbin.org/html?req={i}"], "crawler_config": {}}
|
||||
)
|
||||
print(f" • Request {i+1}/3 sent (should use permanent browser)")
|
||||
|
||||
await asyncio.sleep(2)
|
||||
|
||||
# Test 2: Custom viewport (cold → hot promotion after 3 uses)
|
||||
print("\n♨️ Test 2: Custom Viewport → Cold Pool (promoting to Hot)")
|
||||
viewport_config = {"viewport": {"width": 1280, "height": 720}}
|
||||
for i in range(4):
|
||||
await client.post(
|
||||
f"{CRAWL4AI_BASE_URL}/crawl",
|
||||
json={
|
||||
"urls": [f"https://httpbin.org/json?viewport={i}"],
|
||||
"browser_config": viewport_config,
|
||||
"crawler_config": {}
|
||||
}
|
||||
)
|
||||
print(f" • Request {i+1}/4 sent (cold→hot promotion after 3rd use)")
|
||||
|
||||
await asyncio.sleep(2)
|
||||
|
||||
# Check browser pool status
|
||||
print("\n📊 Browser Pool Report:")
|
||||
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/browsers")
|
||||
browsers = response.json()
|
||||
|
||||
print(f"\n🎯 Pool Summary:")
|
||||
print(f" • Total Browsers: {browsers['summary']['total_count']}")
|
||||
print(f" • Total Memory: {browsers['summary']['total_memory_mb']} MB")
|
||||
print(f" • Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
|
||||
|
||||
print(f"\n📋 Browser Pool Details:")
|
||||
if browsers['permanent']:
|
||||
for browser in browsers['permanent']:
|
||||
print(f" 🔥 Permanent: {browser['browser_id'][:8]}... | "
|
||||
f"Requests: {browser['request_count']} | "
|
||||
f"Memory: {browser['memory_mb']:.0f} MB")
|
||||
|
||||
if browsers['hot']:
|
||||
for browser in browsers['hot']:
|
||||
print(f" ♨️ Hot: {browser['browser_id'][:8]}... | "
|
||||
f"Requests: {browser['request_count']} | "
|
||||
f"Memory: {browser['memory_mb']:.0f} MB")
|
||||
|
||||
if browsers['cold']:
|
||||
for browser in browsers['cold']:
|
||||
print(f" ❄️ Cold: {browser['browser_id'][:8]}... | "
|
||||
f"Requests: {browser['request_count']} | "
|
||||
f"Memory: {browser['memory_mb']:.0f} MB")
|
||||
|
||||
|
||||
async def demo_4_monitor_api_endpoints():
|
||||
"""Demo 4: Monitor API Endpoints - Complete API surface"""
|
||||
print_section(
|
||||
"Demo 4: Monitor API Endpoints",
|
||||
"Programmatic access to all monitoring data"
|
||||
)
|
||||
|
||||
async with httpx.AsyncClient(timeout=30.0) as client:
|
||||
print("🔌 Testing Monitor API endpoints...")
|
||||
|
||||
# Endpoint performance statistics
|
||||
print_subsection("Endpoint Performance Statistics")
|
||||
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/endpoints/stats")
|
||||
endpoint_stats = response.json()
|
||||
|
||||
print("\n📊 Per-Endpoint Analytics:")
|
||||
for endpoint, stats in endpoint_stats.items():
|
||||
print(f" {endpoint}:")
|
||||
print(f" • Requests: {stats['count']}")
|
||||
print(f" • Avg Latency: {stats['avg_latency_ms']:.0f}ms")
|
||||
print(f" • Success Rate: {stats['success_rate_percent']:.1f}%")
|
||||
|
||||
# Timeline data for charts
|
||||
print_subsection("Timeline Data (for Charts)")
|
||||
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/timeline?minutes=5")
|
||||
timeline = response.json()
|
||||
|
||||
print(f"\n📈 Timeline Metrics (last 5 minutes):")
|
||||
print(f" • Data Points: {len(timeline['memory'])}")
|
||||
if timeline['memory']:
|
||||
latest = timeline['memory'][-1]
|
||||
print(f" • Latest Memory: {latest['value']:.1f}%")
|
||||
print(f" • Timestamp: {latest['timestamp']}")
|
||||
|
||||
# Janitor logs
|
||||
print_subsection("Janitor Cleanup Events")
|
||||
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/logs/janitor?limit=3")
|
||||
janitor_logs = response.json()
|
||||
|
||||
print(f"\n🧹 Recent Cleanup Activities:")
|
||||
if janitor_logs:
|
||||
for log in janitor_logs[:3]:
|
||||
print(f" • {log['timestamp']}: {log['message']}")
|
||||
else:
|
||||
print(" (No cleanup events yet - janitor runs periodically)")
|
||||
|
||||
# Error logs
|
||||
print_subsection("Error Monitoring")
|
||||
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/logs/errors?limit=3")
|
||||
error_logs = response.json()
|
||||
|
||||
print(f"\n❌ Recent Errors:")
|
||||
if error_logs:
|
||||
for log in error_logs[:3]:
|
||||
print(f" • {log['timestamp']}: {log['error_type']}")
|
||||
print(f" {log['message'][:100]}...")
|
||||
else:
|
||||
print(" ✅ No recent errors!")
|
||||
|
||||
|
||||
async def demo_5_websocket_streaming():
|
||||
"""Demo 5: WebSocket Streaming - Real-time updates"""
|
||||
print_section(
|
||||
"Demo 5: WebSocket Streaming",
|
||||
"Live monitoring with 2-second update intervals"
|
||||
)
|
||||
|
||||
print("⚡ WebSocket Streaming Demo")
|
||||
print("\n💡 The monitoring dashboard uses WebSocket for real-time updates")
|
||||
print(f" • Connection: ws://localhost:11235/monitor/ws")
|
||||
print(f" • Update Interval: 2 seconds")
|
||||
print(f" • Data: Health, requests, browsers, memory, errors")
|
||||
|
||||
print("\n📝 Sample WebSocket Integration Code:")
|
||||
print("""
|
||||
import websockets
|
||||
import json
|
||||
|
||||
async def monitor_realtime():
|
||||
uri = "ws://localhost:11235/monitor/ws"
|
||||
async with websockets.connect(uri) as websocket:
|
||||
while True:
|
||||
data = await websocket.recv()
|
||||
update = json.loads(data)
|
||||
|
||||
print(f"Memory: {update['health']['container']['memory_percent']:.1f}%")
|
||||
print(f"Active Requests: {len(update['requests']['active'])}")
|
||||
print(f"Browser Pool: {update['health']['pool']['permanent']['active']}")
|
||||
""")
|
||||
|
||||
print("\n🌐 Open the dashboard to see WebSocket in action:")
|
||||
print(f" {MONITOR_DASHBOARD_URL}")
|
||||
|
||||
|
||||
async def demo_6_control_actions():
|
||||
"""Demo 6: Control Actions - Manual browser management"""
|
||||
print_section(
|
||||
"Demo 6: Control Actions",
|
||||
"Manual control over browser pool and cleanup"
|
||||
)
|
||||
|
||||
async with httpx.AsyncClient(timeout=30.0) as client:
|
||||
print("🎮 Testing control actions...")
|
||||
|
||||
# Force cleanup
|
||||
print_subsection("Force Immediate Cleanup")
|
||||
print("🧹 Triggering manual cleanup...")
|
||||
try:
|
||||
response = await client.post(f"{CRAWL4AI_BASE_URL}/monitor/actions/cleanup")
|
||||
if response.status_code == 200:
|
||||
result = response.json()
|
||||
print(f" ✅ Cleanup completed")
|
||||
print(f" • Browsers cleaned: {result.get('cleaned_count', 0)}")
|
||||
print(f" • Memory freed: {result.get('memory_freed_mb', 0):.1f} MB")
|
||||
else:
|
||||
print(f" ⚠️ Response: {response.status_code}")
|
||||
except Exception as e:
|
||||
print(f" ℹ️ Cleanup action: {e}")
|
||||
|
||||
# Get browser list for potential kill/restart
|
||||
print_subsection("Browser Management")
|
||||
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/browsers")
|
||||
browsers = response.json()
|
||||
|
||||
cold_browsers = browsers.get('cold', [])
|
||||
if cold_browsers:
|
||||
browser_id = cold_browsers[0]['browser_id']
|
||||
print(f"\n🎯 Example: Kill specific browser")
|
||||
print(f" POST /monitor/actions/kill_browser")
|
||||
print(f" JSON: {{'browser_id': '{browser_id[:16]}...'}}")
|
||||
print(f" → Kills the browser and frees resources")
|
||||
|
||||
print(f"\n🔄 Example: Restart browser")
|
||||
print(f" POST /monitor/actions/restart_browser")
|
||||
print(f" JSON: {{'browser_id': 'browser_id_here'}}")
|
||||
print(f" → Restart a specific browser instance")
|
||||
|
||||
# Reset statistics
|
||||
print_subsection("Reset Statistics")
|
||||
print("📊 Statistics can be reset for fresh monitoring:")
|
||||
print(f" POST /monitor/stats/reset")
|
||||
print(f" → Clears all accumulated statistics")
|
||||
|
||||
|
||||
async def demo_7_production_metrics():
|
||||
"""Demo 7: Production Metrics - Key indicators for operations"""
|
||||
print_section(
|
||||
"Demo 7: Production Metrics",
|
||||
"Critical metrics for production monitoring"
|
||||
)
|
||||
|
||||
async with httpx.AsyncClient(timeout=30.0) as client:
|
||||
print("📊 Key Production Metrics:")
|
||||
|
||||
# Overall health
|
||||
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/health")
|
||||
health = response.json()
|
||||
|
||||
# Browser efficiency
|
||||
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/browsers")
|
||||
browsers = response.json()
|
||||
|
||||
print("\n🎯 Critical Metrics to Track:")
|
||||
|
||||
print(f"\n1️⃣ Memory Usage Trends")
|
||||
print(f" • Current: {health['container']['memory_percent']:.1f}%")
|
||||
print(f" • Alert if: >80%")
|
||||
print(f" • Action: Trigger cleanup or scale")
|
||||
|
||||
print(f"\n2️⃣ Request Success Rate")
|
||||
print(f" • Current: {health['stats']['success_rate_percent']:.1f}%")
|
||||
print(f" • Target: >95%")
|
||||
print(f" • Alert if: <90%")
|
||||
|
||||
print(f"\n3️⃣ Average Latency")
|
||||
print(f" • Current: {health['stats']['avg_latency_ms']:.0f}ms")
|
||||
print(f" • Target: <2000ms")
|
||||
print(f" • Alert if: >5000ms")
|
||||
|
||||
print(f"\n4️⃣ Browser Pool Efficiency")
|
||||
print(f" • Reuse Rate: {browsers['summary']['reuse_rate_percent']:.1f}%")
|
||||
print(f" • Target: >80%")
|
||||
print(f" • Indicates: Effective browser pooling")
|
||||
|
||||
print(f"\n5️⃣ Total Browsers")
|
||||
print(f" • Current: {browsers['summary']['total_count']}")
|
||||
print(f" • Alert if: >20 (possible leak)")
|
||||
print(f" • Check: Janitor is running correctly")
|
||||
|
||||
print(f"\n6️⃣ Error Frequency")
|
||||
response = await client.get(f"{CRAWL4AI_BASE_URL}/monitor/logs/errors?limit=10")
|
||||
errors = response.json()
|
||||
print(f" • Recent Errors: {len(errors)}")
|
||||
print(f" • Alert if: >10 in last hour")
|
||||
print(f" • Action: Review error patterns")
|
||||
|
||||
print("\n💡 Integration Examples:")
|
||||
print(" • Prometheus: Scrape /monitor/health")
|
||||
print(" • Alerting: Monitor memory, success rate, latency")
|
||||
print(" • Dashboards: WebSocket streaming to custom UI")
|
||||
print(" • Log Aggregation: Collect /monitor/logs/* endpoints")
|
||||
|
||||
|
||||
async def demo_8_self_hosting_value():
|
||||
"""Demo 8: Self-Hosting Value Proposition"""
|
||||
print_section(
|
||||
"Demo 8: Why Self-Host Crawl4AI?",
|
||||
"The value proposition of owning your infrastructure"
|
||||
)
|
||||
|
||||
print("🎯 Self-Hosting Benefits:\n")
|
||||
|
||||
print("🔒 Data Privacy & Security")
|
||||
print(" • Your data never leaves your infrastructure")
|
||||
print(" • No third-party access to crawled content")
|
||||
print(" • Keep sensitive workflows behind your firewall")
|
||||
|
||||
print("\n💰 Cost Control")
|
||||
print(" • No per-request pricing or rate limits")
|
||||
print(" • Predictable infrastructure costs")
|
||||
print(" • Scale based on your actual needs")
|
||||
|
||||
print("\n🎯 Full Customization")
|
||||
print(" • Complete control over browser configs")
|
||||
print(" • Custom hooks and strategies")
|
||||
print(" • Tailored monitoring and alerting")
|
||||
|
||||
print("\n📊 Complete Transparency")
|
||||
print(" • Real-time monitoring dashboard")
|
||||
print(" • Full visibility into system performance")
|
||||
print(" • Detailed request and error tracking")
|
||||
|
||||
print("\n⚡ Performance & Flexibility")
|
||||
print(" • Direct access, no network overhead")
|
||||
print(" • Integrate with existing infrastructure")
|
||||
print(" • Custom resource allocation")
|
||||
|
||||
print("\n🛡️ Enterprise-Grade Operations")
|
||||
print(" • Prometheus integration ready")
|
||||
print(" • WebSocket for real-time dashboards")
|
||||
print(" • Full API for automation")
|
||||
print(" • Manual controls for troubleshooting")
|
||||
|
||||
print(f"\n🌐 Get Started:")
|
||||
print(f" docker pull unclecode/crawl4ai:0.7.7")
|
||||
print(f" docker run -d -p 11235:11235 --shm-size=1g unclecode/crawl4ai:0.7.7")
|
||||
print(f" # Visit: {MONITOR_DASHBOARD_URL}")
|
||||
|
||||
|
||||
def print_summary():
|
||||
"""Print comprehensive demo summary"""
|
||||
print("\n" + "=" * 70)
|
||||
print("📊 DEMO SUMMARY - Crawl4AI v0.7.7")
|
||||
print("=" * 70)
|
||||
|
||||
print("\n✨ Features Demonstrated:")
|
||||
print("=" * 70)
|
||||
print("✅ System Health Overview")
|
||||
print(" → Real-time CPU, memory, network, and uptime monitoring")
|
||||
|
||||
print("\n✅ Request Tracking")
|
||||
print(" → Active and completed request monitoring with full details")
|
||||
|
||||
print("\n✅ Browser Pool Management")
|
||||
print(" → 3-tier architecture: Permanent, Hot, and Cold pools")
|
||||
print(" → Automatic promotion and cleanup")
|
||||
|
||||
print("\n✅ Monitor API Endpoints")
|
||||
print(" → Complete REST API for programmatic access")
|
||||
print(" → Health, requests, browsers, timeline, logs, errors")
|
||||
|
||||
print("\n✅ WebSocket Streaming")
|
||||
print(" → Real-time updates every 2 seconds")
|
||||
print(" → Build custom dashboards with live data")
|
||||
|
||||
print("\n✅ Control Actions")
|
||||
print(" → Manual browser management (kill, restart)")
|
||||
print(" → Force cleanup and statistics reset")
|
||||
|
||||
print("\n✅ Production Metrics")
|
||||
print(" → 6 critical metrics for operational excellence")
|
||||
print(" → Prometheus integration patterns")
|
||||
|
||||
print("\n✅ Self-Hosting Value")
|
||||
print(" → Data privacy, cost control, full customization")
|
||||
print(" → Enterprise-grade transparency and control")
|
||||
|
||||
print("\n" + "=" * 70)
|
||||
print("🎯 What's New in v0.7.7?")
|
||||
print("=" * 70)
|
||||
print("• 📊 Complete Real-time Monitoring System")
|
||||
print("• 🌐 Interactive Web Dashboard (/dashboard)")
|
||||
print("• 🔌 Comprehensive Monitor API")
|
||||
print("• ⚡ WebSocket Streaming (2-second updates)")
|
||||
print("• 🎮 Manual Control Actions")
|
||||
print("• 📈 Production Integration Examples")
|
||||
print("• 🏭 Prometheus, Alerting, Log Aggregation")
|
||||
print("• 🔥 Smart Browser Pool (Permanent/Hot/Cold)")
|
||||
print("• 🧹 Automatic Janitor Cleanup")
|
||||
print("• 📋 Full Request & Error Tracking")
|
||||
|
||||
print("\n" + "=" * 70)
|
||||
print("💡 Why This Matters")
|
||||
print("=" * 70)
|
||||
print("Before v0.7.7: Docker was just a containerized crawler")
|
||||
print("After v0.7.7: Complete self-hosting platform with enterprise monitoring")
|
||||
print("\nYou now have:")
|
||||
print(" • Full visibility into what's happening inside")
|
||||
print(" • Real-time operational dashboards")
|
||||
print(" • Complete control over browser resources")
|
||||
print(" • Production-ready observability")
|
||||
print(" • Zero external dependencies")
|
||||
|
||||
print("\n" + "=" * 70)
|
||||
print("📚 Next Steps")
|
||||
print("=" * 70)
|
||||
print(f"1. Open the dashboard: {MONITOR_DASHBOARD_URL}")
|
||||
print("2. Read the docs: https://docs.crawl4ai.com/basic/self-hosting/")
|
||||
print("3. Try the Monitor API endpoints yourself")
|
||||
print("4. Set up Prometheus integration for production")
|
||||
print("5. Build custom dashboards with WebSocket streaming")
|
||||
|
||||
print("\n" + "=" * 70)
|
||||
print("🔗 Resources")
|
||||
print("=" * 70)
|
||||
print(f"• Dashboard: {MONITOR_DASHBOARD_URL}")
|
||||
print(f"• Health API: {CRAWL4AI_BASE_URL}/monitor/health")
|
||||
print(f"• Documentation: https://docs.crawl4ai.com/")
|
||||
print(f"• GitHub: https://github.com/unclecode/crawl4ai")
|
||||
|
||||
print("\n" + "=" * 70)
|
||||
print("🎉 You're now in control of your web crawling destiny!")
|
||||
print("=" * 70)
|
||||
|
||||
|
||||
async def main():
|
||||
"""Run all demos"""
|
||||
print("\n" + "=" * 70)
|
||||
print("🚀 Crawl4AI v0.7.7 Release Demo")
|
||||
print("=" * 70)
|
||||
print("Feature: Self-Hosting with Real-time Monitoring Dashboard")
|
||||
print("=" * 70)
|
||||
|
||||
# Check if server is running
|
||||
print("\n🔍 Checking Crawl4AI server...")
|
||||
server_running = await check_server_health()
|
||||
|
||||
if not server_running:
|
||||
print(f"❌ Cannot connect to Crawl4AI at {CRAWL4AI_BASE_URL}")
|
||||
print("\nPlease start the Docker container:")
|
||||
print(" docker pull unclecode/crawl4ai:0.7.7")
|
||||
print(" docker run -d -p 11235:11235 --shm-size=1g unclecode/crawl4ai:0.7.7")
|
||||
print("\nThen re-run this demo.")
|
||||
return
|
||||
|
||||
print(f"✅ Crawl4AI server is running!")
|
||||
print(f"📊 Dashboard available at: {MONITOR_DASHBOARD_URL}")
|
||||
|
||||
# Run all demos
|
||||
demos = [
|
||||
demo_1_system_health_overview,
|
||||
demo_2_request_tracking,
|
||||
demo_3_browser_pool_management,
|
||||
demo_4_monitor_api_endpoints,
|
||||
demo_5_websocket_streaming,
|
||||
demo_6_control_actions,
|
||||
demo_7_production_metrics,
|
||||
demo_8_self_hosting_value,
|
||||
]
|
||||
|
||||
for i, demo_func in enumerate(demos, 1):
|
||||
try:
|
||||
await demo_func()
|
||||
|
||||
if i < len(demos):
|
||||
await asyncio.sleep(2) # Brief pause between demos
|
||||
|
||||
except KeyboardInterrupt:
|
||||
print(f"\n\n⚠️ Demo interrupted by user")
|
||||
return
|
||||
except Exception as e:
|
||||
print(f"\n❌ Demo {i} error: {e}")
|
||||
print("Continuing to next demo...\n")
|
||||
continue
|
||||
|
||||
# Print comprehensive summary
|
||||
print_summary()
|
||||
|
||||
print("\n" + "=" * 70)
|
||||
print("✅ Demo completed!")
|
||||
print("=" * 70)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
try:
|
||||
asyncio.run(main())
|
||||
except KeyboardInterrupt:
|
||||
print("\n\n👋 Demo stopped by user. Thanks for trying Crawl4AI v0.7.7!")
|
||||
except Exception as e:
|
||||
print(f"\n\n❌ Demo failed: {e}")
|
||||
print("Make sure the Docker container is running:")
|
||||
print(" docker run -d -p 11235:11235 --shm-size=1g unclecode/crawl4ai:0.7.7")
|
||||
Reference in New Issue
Block a user