Commit Graph

1113 Commits

Author SHA1 Message Date
unclecode
1a22fb4d4f docs: rename Docker deployment to self-hosting guide with comprehensive monitoring documentation
Major documentation restructuring to emphasize self-hosting capabilities and fully document the real-time monitoring system.

Changes:
- Renamed docker-deployment.md → self-hosting.md to better reflect the value proposition
- Updated mkdocs.yml navigation to "Self-Hosting Guide"
- Completely rewrote introduction emphasizing self-hosting benefits:
  * Data privacy and ownership
  * Cost control and transparency
  * Performance and security advantages
  * Full customization capabilities

- Expanded "Metrics & Monitoring" → "Real-time Monitoring & Operations" with:
  * Monitoring Dashboard section documenting the /monitor UI
  * Complete feature breakdown (system health, requests, browsers, janitor, errors)
  * Monitor API Endpoints with all REST endpoints and examples
  * WebSocket Streaming integration guide with Python examples
  * Control Actions for manual browser management
  * Production Integration patterns (Prometheus, custom dashboards, alerting)
  * Key production metrics to track

- Enhanced summary section:
  * What users learned checklist
  * Why self-hosting matters
  * Clear next steps
  * Key resources with monitoring dashboard URL

The monitoring dashboard built 2-3 weeks ago is now fully documented and discoverable.
Users will understand they have complete operational visibility at http://localhost:11235/monitor
with real-time updates, browser pool management, and programmatic control via REST/WebSocket APIs.

This positions Crawl4AI as an enterprise-grade self-hosting solution with DevOps-level
monitoring capabilities, not just a Docker deployment.
2025-11-09 13:31:52 +08:00
unclecode
81b5312629 Update gitignore 2025-11-09 10:49:42 +08:00
unclecode
73a5a7b0f5 Update gitignore 2025-10-18 12:41:29 +08:00
unclecode
05921811b8 docs: add comprehensive technical architecture documentation
Created ARCHITECTURE.md as a complete technical reference for the
Crawl4AI Docker server, replacing the stress test pipeline document
with production-grade documentation.

Contents:
- System overview with architecture diagrams
- Core components deep-dive (server, API, utils)
- Smart browser pool implementation details
- Real-time monitoring system architecture
- WebSocket implementation and fallback strategy
- Memory management and container detection
- Production optimizations and code review fixes
- Deployment guides (local, Docker, production)
- Comprehensive troubleshooting section
- Debug tools and performance tuning
- Test suite documentation
- Architecture decision log (ADRs)

Target audience: Developers maintaining or extending the system
Goal: Enable rapid onboarding and confident modifications
2025-10-18 12:05:49 +08:00
unclecode
25507adb5b feat(monitor): implement code review fixes and real-time WebSocket monitoring
Backend Improvements (11 fixes applied):

Critical Fixes:
- Add lock protection for browser pool access in monitor stats
- Ensure async track_janitor_event across all call sites
- Improve error handling in monitor request tracking (already in place)

Important Fixes:
- Replace fire-and-forget Redis with background persistence worker
- Add time-based expiry for completed requests/errors (5min cleanup)
- Implement input validation for monitor route parameters
- Add 4s timeout to timeline updater to prevent hangs
- Add warning when killing browsers with active requests
- Implement monitor cleanup on shutdown with final persistence
- Document memory estimates with TODO for actual tracking

Frontend Enhancements:

WebSocket Real-time Updates:
- Add WebSocket endpoint at /monitor/ws for live monitoring
- Implement auto-reconnect with exponential backoff (max 5 attempts)
- Add graceful fallback to HTTP polling on WebSocket failure
- Send comprehensive updates every 2 seconds (health, requests, browsers, timeline, events)

UI/UX Improvements:
- Add live connection status indicator with pulsing animation
  - Green "Live" = WebSocket connected
  - Yellow "Connecting..." = Attempting connection
  - Blue "Polling" = Fallback to HTTP polling
  - Red "Disconnected" = Connection failed
- Restore original beautiful styling for all sections
- Improve request table layout with flex-grow for URL column
- Add browser type text labels alongside emojis
- Add flex layout to browser section header

Testing:
- Add test-websocket.py for WebSocket validation
- All 7 integration tests passing successfully

Summary: 563 additions across 6 files
2025-10-18 11:38:25 +08:00
unclecode
aba4036ab6 Add demo and test scripts for monitor dashboard activity
- Introduced a demo script (`demo_monitor_dashboard.py`) to showcase various monitoring features through simulated activity.
- Implemented a test script (`test_monitor_demo.py`) to generate dashboard activity and verify monitor health and endpoint statistics.
- Added a logo image to the static assets for branding purposes.
2025-10-17 22:43:06 +08:00
unclecode
e2af031b09 feat(monitor): add real-time monitoring dashboard with Redis persistence
Complete observability solution for production deployments with terminal-style UI.

**Backend Implementation:**
- `monitor.py`: Stats manager tracking requests, browsers, errors, timeline data
- `monitor_routes.py`: REST API endpoints for all monitor functionality
  - GET /monitor/health - System health snapshot
  - GET /monitor/requests - Active & completed requests
  - GET /monitor/browsers - Browser pool details
  - GET /monitor/endpoints/stats - Aggregated endpoint analytics
  - GET /monitor/timeline - Time-series data (memory, requests, browsers)
  - GET /monitor/logs/{janitor,errors} - Event logs
  - POST /monitor/actions/{cleanup,kill_browser,restart_browser} - Control actions
  - POST /monitor/stats/reset - Reset counters
- Redis persistence for endpoint stats (survives restart)
- Timeline tracking (5min window, 5s resolution, 60 data points)

**Frontend Dashboard** (`/dashboard`):
- **System Health Bar**: CPU%, Memory%, Network I/O, Uptime
- **Pool Status**: Live counts (permanent/hot/cold browsers + memory)
- **Live Activity Tabs**:
  - Requests: Active (realtime) + recent completed (last 100)
  - Browsers: Detailed table with actions (kill/restart)
  - Janitor: Cleanup event log with timestamps
  - Errors: Recent errors with stack traces
- **Endpoint Analytics**: Count, avg latency, success%, pool hit%
- **Resource Timeline**: SVG charts (memory/requests/browsers) with terminal aesthetics
- **Control Actions**: Force cleanup, restart permanent, reset stats
- **Auto-refresh**: 5s polling (toggleable)

**Integration:**
- Janitor events tracked (close_cold, close_hot, promote)
- Crawler pool promotion events logged
- Timeline updater background task (5s interval)
- Lifespan hooks for monitor initialization

**UI Design:**
- Terminal vibe matching Crawl4AI theme
- Dark background, cyan/pink accents, monospace font
- Neon glow effects on charts
- Responsive layout, hover interactions
- Cross-navigation: Playground ↔ Monitor

**Key Features:**
- Zero-config: Works out of the box with existing Redis
- Real-time visibility into pool efficiency
- Manual browser management (kill/restart)
- Historical data persistence
- DevOps-friendly UX

Routes:
- API: `/monitor/*` (backend endpoints)
- UI: `/dashboard` (static HTML)
2025-10-17 21:36:25 +08:00
unclecode
b97eaeea4c feat(docker): implement smart browser pool with 10x memory efficiency
Major refactoring to eliminate memory leaks and enable high-scale crawling:

- **Smart 3-Tier Browser Pool**:
  - Permanent browser (always-ready default config)
  - Hot pool (configs used 3+ times, longer TTL)
  - Cold pool (new/rare configs, short TTL)
  - Auto-promotion: cold → hot after 3 uses
  - 100% pool reuse achieved in tests

- **Container-Aware Memory Detection**:
  - Read cgroup v1/v2 memory limits (not host metrics)
  - Accurate memory pressure detection in Docker
  - Memory-based browser creation blocking

- **Adaptive Janitor**:
  - Dynamic cleanup intervals (10s/30s/60s based on memory)
  - Tiered TTLs: cold 30-300s, hot 120-600s
  - Aggressive cleanup at high memory pressure

- **Unified Pool Usage**:
  - All endpoints now use pool (/html, /screenshot, /pdf, /execute_js, /md, /llm)
  - Fixed config signature mismatch (permanent browser matches endpoints)
  - get_default_browser_config() helper for consistency

- **Configuration**:
  - Reduced idle_ttl: 1800s → 300s (30min → 5min)
  - Fixed port: 11234 → 11235 (match Gunicorn)

**Performance Results** (from stress tests):
- Memory: 10x reduction (500-700MB × N → 270MB permanent)
- Latency: 30-50x faster (<100ms pool hits vs 3-5s startup)
- Reuse: 100% for default config, 60%+ for variants
- Capacity: 100+ concurrent requests (vs ~20 before)
- Leak: 0 MB/cycle (stable across tests)

**Test Infrastructure**:
- 7-phase sequential test suite (tests/)
- Docker stats integration + log analysis
- Pool promotion verification
- Memory leak detection
- Full endpoint coverage

Fixes memory issues reported in production deployments.
2025-10-17 20:38:39 +08:00
unclecode
216019f29a fix(marketplace): prevent hero image overflow and secondary card stretching
- Fixed hero image to 200px height with min/max constraints
- Added object-fit: cover to hero-image img elements
- Changed secondary-featured align-items from stretch to flex-start
- Fixed secondary-card height to 118px (no flex: 1 stretching)
- Updated responsive grid layouts for wider screens
- Added flex: 1 to hero-content for better content distribution

These changes ensure a rigid, predictable layout that prevents:
1. Large images from pushing text content down
2. Single secondary cards from stretching to fill entire height
2025-10-11 12:52:04 +08:00
unclecode
abe8a92561 fix(marketplace): resolve app detail page routing and styling issues
- Fixed JavaScript errors from missing HTML elements (install-code, usage-code, integration-code)
- Added missing CSS classes for tabs, overview layout, sidebar, and integration content
- Fixed tab navigation to display horizontally in single line
- Added proper padding to tab content sections (removed from container, added to content)
- Fixed tab selector from .nav-tab to .tab-btn to match HTML structure
- Added sidebar styling with stats grid and metadata display
- Improved responsive design with mobile-friendly tab scrolling
- Fixed code block positioning for copy buttons
- Removed margin from first headings to prevent extra spacing
- Added null checks for DOM elements in JavaScript to prevent errors

These changes resolve the routing issue where clicking on apps caused page redirects,
and fix the broken layout where CSS was not properly applied to the app detail page.
2025-10-11 11:51:22 +08:00
unclecode
5a4f21fad9 fix(marketplace): isolate api under marketplace prefix 2025-10-09 22:26:15 +08:00
unclecode
2c373f0642 fix(marketplace): align admin api with backend endpoints 2025-10-08 18:42:19 +08:00
unclecode
d2c7f345ab feat(docs): add chatgpt quick link to page actions 2025-10-07 11:59:25 +08:00
unclecode
8c62277718 feat(marketplace): add sponsor logo uploads
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-06 20:58:35 +08:00
unclecode
5145d42df7 fix(docs): hide copy menu on non-markdown pages 2025-10-03 20:11:20 +08:00
Nasrin
80aa6c11d9 Merge pull request #1530 from Sjoeborg/fix/arun-many-returns-none
Fix: run_urls() returns None, crashing arun_many()
2025-10-03 12:57:06 +08:00
unclecode
749d200866 fix(marketplace): Update URLs to use /marketplace path and relative API endpoints
- Change API_BASE to relative '/api' for production
- Move marketplace to /marketplace instead of /marketplace/frontend
- Update MkDocs navigation
- Fix logo path in marketplace index
2025-10-02 17:08:50 +08:00
unclecode
408ad1b750 feat(marketplace): Add Crawl4AI marketplace with secure configuration
- Implement marketplace frontend and admin dashboard
- Add FastAPI backend with environment-based configuration
- Use .env file for secrets management
- Include data generation scripts
- Add proper CORS configuration
- Remove hardcoded password from admin login
- Update gitignore for security
2025-10-02 16:41:11 +08:00
Martin Sjöborg
35dd206925 fix: always return a list, even if we catch an exception 2025-10-02 09:21:44 +02:00
Martin Sjöborg
8d30662647 fix: remove this import as it causes python to treat "json" as a variable in the except block 2025-10-02 09:19:15 +02:00
unclecode
ef46df10da Update gitignore add local scripts folder 2025-09-30 18:31:57 +08:00
unclecode
0d8d043109 feat(docs): add brand book and page copy functionality
- Add comprehensive brand book with color system, typography, components
- Add page copy dropdown with markdown copy/view functionality
- Update mkdocs.yml with new assets and branding navigation
- Use terminal-style ASCII icons and condensed menu design
2025-09-30 18:28:05 +08:00
ntohidi
3fe49a766c fix(docker-deployment): replace console.log with print for metadata extraction 2025-09-25 14:12:59 +08:00
ntohidi
fef715a891 Merge branch 'feature/docker-hooks' into develop 2025-09-25 14:11:46 +08:00
Nasrin
69e8ca3d0d Merge pull request #1508 from unclecode/docker/base_config_overrides
#1505 fix(api): update config handling to only set base config if not provided by user
2025-09-22 18:02:14 +08:00
AHMET YILMAZ
a1950afd98 #1505 fix(api): update config handling to only set base config if not provided by user 2025-09-22 17:19:27 +08:00
Nasrin
d0eb5a6ffe Merge pull request #1501 from unclecode/fix/n-playwright-stealth
feat(StealthAdapter): fix stealth features for Playwright integration
2025-09-19 14:17:35 +08:00
ntohidi
77559f3373 feat(StealthAdapter): fix stealth features for Playwright integration. ref #1481 2025-09-18 15:39:06 +08:00
Nasrin
3899ac3d3b Merge pull request #1464 from unclecode/fix/proxy_deprecation
Fix/proxy deprecation
2025-09-16 15:48:45 +08:00
Nasrin
23431d8109 Merge pull request #1389 from unclecode/fix/deep-crawl-scoring
fix(deep-crawl): BestFirst priority inversion
2025-09-16 15:45:54 +08:00
AHMET YILMAZ
1717827732 refactor(BrowserConfig): change deprecation warning for 'proxy' parameter to UserWarning 2025-09-12 11:10:38 +08:00
Nasrin
f8eaf01ed1 Merge pull request #1467 from unclecode/fix/request-crawl-stream
Fix: request /crawl with stream: true issue
2025-09-11 17:40:43 +08:00
Nasrin
14b42b1f9a Merge pull request #1471 from unclecode/fix/adaptive-crawler-llm-config
Fix: allow custom LLM providers for adaptive crawler embedding config…
2025-09-09 12:56:33 +08:00
ntohidi
3bc56dd028 fix: allow custom LLM providers for adaptive crawler embedding config. ref: #1291
- Change embedding_llm_config from Dict to Union[LLMConfig, Dict] for type safety
  - Add backward-compatible conversion property _embedding_llm_config_dict
  - Replace all hardcoded OpenAI embedding configs with configurable options
  - Fix LLMConfig object attribute access in query expansion logic
  - Add comprehensive example demonstrating multiple provider configurations
  - Update documentation with both LLMConfig object and dictionary usage patterns

  Users can now specify any LLM provider for query expansion in embedding strategy:
  - New: embedding_llm_config=LLMConfig(provider='anthropic/claude-3', api_token='key')
  - Old: embedding_llm_config={'provider': 'openai/gpt-4', 'api_token': 'key'} (still works)
2025-09-09 12:49:55 +08:00
AHMET YILMAZ
1874a7b8d2 fix: update option labels in request builder for clarity 2025-09-05 17:06:25 +08:00
Nasrin
0482c1eafc Merge pull request #1469 from unclecode/fix/docker-jwt
Fix(auth): Fixed Docker JWT authentication
2025-09-04 15:00:15 +08:00
AHMET YILMAZ
6a3b3e9d38 Commit without API 2025-09-03 17:02:40 +08:00
Nasrin
1eacea1d2d Merge pull request #1432 from unclecode/example/web2api-example
feat: Add comprehensive website to API example with frontend
2025-09-03 16:30:39 +08:00
Nasrin
bc6d8147d2 Merge pull request #1451 from unclecode/fix/remove-python3.9-version
Remove python 3.9 from supported versions and require Python >= 3.10
2025-09-02 16:50:40 +08:00
ntohidi
487839640f fix: raise error on last attempt failure in perform_completion_with_backoff. ref #989 2025-09-02 16:49:01 +08:00
ntohidi
6772134a3a remove: delete unused yoyo snapshot subproject 2025-09-02 12:07:08 +08:00
Nasrin
ae67d66b81 Merge pull request #1454 from nafeqq-1306/docstring-changes
issue #1329: Docs are not detected due to triplequotes not being first line
2025-09-02 11:59:59 +08:00
Nasrin
af28e84a21 Merge pull request #1441 from unclecode/fix/improve-docker-error-handling
Improve docker error handling
2025-09-02 11:56:01 +08:00
Nasrin
5e7fcb17e1 Merge pull request #1448 from unclecode/fix/https-reditrect
feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling
2025-09-01 16:11:25 +08:00
ntohidi
6e728096fa fix(auth): fixed Docker JWT authentication. ref #1442 2025-09-01 12:48:16 +08:00
Nasrin
2de200c1ba Merge pull request #1433 from Thermofish/fix/excluded_selector
fix(deps): reintroduce cssselect to restore excluded_selector support (#1405)
2025-08-29 16:08:24 +08:00
nafeqq-1306
9749e2832d issue #1329 refactor(crawler): move unwanted properties to CrawlerRunConfig class 2025-08-29 10:20:47 +08:00
Soham Kukreti
70f473b84d fix: drop Python 3.9 support and require Python >=3.10.
The library no longer supports Python 3.9 and so it was important to drop all references to python 3.9.
Following changes have been made:
- pyproject.toml: set requires-python to ">=3.10"; remove 3.9 classifier
- setup.py: set python_requires to ">=3.10"; remove 3.9 classifier
- docs: update Python version mentions
  - deploy/docker/c4ai-doc-context.md: options -> 3.10, 3.11, 3.12, 3.13
2025-08-28 19:31:19 +05:30
ntohidi
bdacf61ca9 feat: update documentation for preserve_https_for_internal_links. ref #1410 2025-08-28 17:48:12 +08:00
ntohidi
f566c5a376 feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling. Ref #1410
Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.
2025-08-28 17:38:40 +08:00