crawl4ai

Author	SHA1	Message	Date
Claude	f0cfd884a9	docs: add production platform deployment PRD Comprehensive PRD for split architecture deployment on Digital Ocean: Architecture: - Separate API servers (lightweight FastAPI) - Browser worker pool (Crawl4AI + Chromium) - Redis job queue for coordination - DO Load Balancer + auto-scaling Components: - api_server.py - Job queue only, no browser - worker.py - Job processor, pulls from Redis - Dockerfiles for both images - Cloud-init configs for auto-deployment Infrastructure: - DO CLI deployment scripts - Auto-scaler daemon (queue-based) - Monitoring and alerting setup - Cost optimization strategies Includes: - Complete code structure - Deployment scripts - Testing strategy - Security setup - Rollback plan - Success metrics Cost estimate: $87-135/mo base, scales to $300/mo Target: 100-500 req/min capacity Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-22 11:05:32 +00:00
Claude	52da8d72bc	test: add comprehensive webhook feature test script Added end-to-end test script that automates webhook feature testing: Script Features (test_webhook_feature.sh): - Automatic branch switching and dependency installation - Redis and server startup/shutdown management - Webhook receiver implementation - Integration test for webhook notifications - Comprehensive cleanup and error handling - Returns to original branch after completion Test Flow: 1. Fetch and checkout webhook feature branch 2. Activate venv and install dependencies 3. Start Redis and Crawl4AI server 4. Submit crawl job with webhook config 5. Verify webhook delivery and payload 6. Clean up all processes and return to original branch Documentation: - WEBHOOK_TEST_README.md with usage instructions - Troubleshooting guide - Exit codes and safety features Usage: ./tests/test_webhook_feature.sh Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-22 00:35:07 +00:00
Claude	8b7e67566e	test: add webhook implementation validation tests Added comprehensive test suite to validate webhook implementation: - Module import verification - WebhookDeliveryService initialization - Pydantic model validation (WebhookConfig) - Payload construction logic - Exponential backoff calculation - API integration checks All tests pass (6/6), confirming implementation is correct. Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-22 00:25:35 +00:00
Claude	7388baa205	docs: add webhook example for Docker deployment Added docker_webhook_example.py demonstrating: - Submitting crawl jobs with webhook configuration - Flask-based webhook receiver implementation - Three usage patterns: 1. Webhook notification only (fetch data separately) 2. Webhook with full data in payload 3. Traditional polling approach for comparison Includes comprehensive comments explaining: - Webhook payload structure - Authentication headers setup - Error handling - Production deployment tips Example is fully functional and ready to run with Flask installed. Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-21 16:38:53 +00:00
Claude	897bc3a493	docs: add webhook documentation to Docker README Added comprehensive webhook section to README.md including: - Overview of asynchronous job queue with webhooks - Benefits and use cases - Quick start examples - Webhook authentication - Global webhook configuration - Job status polling alternative Updated table of contents and summary to include webhook feature. Maintains consistent tone and style with rest of README. Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-21 16:21:07 +00:00
Claude	8a37710313	feat: add webhook notifications for crawl job completion Implements webhook support for the crawl job API to eliminate polling requirements. Changes: - Added WebhookConfig and WebhookPayload schemas to schemas.py - Created webhook.py with WebhookDeliveryService class - Integrated webhook notifications in api.py handle_crawl_job - Updated job.py CrawlJobPayload to accept webhook_config - Added webhook configuration section to config.yml - Included comprehensive usage examples in WEBHOOK_EXAMPLES.md Features: - Webhook notifications on job completion (success/failure) - Configurable data inclusion in webhook payload - Custom webhook headers support - Global default webhook URL configuration - Exponential backoff retry logic (5 attempts: 1s, 2s, 4s, 8s, 16s) - 30-second timeout per webhook call Usage: POST /crawl/job with optional webhook_config: - webhook_url: URL to receive notifications - webhook_data_in_payload: include full results (default: false) - webhook_headers: custom headers for authentication Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-21 16:17:40 +00:00
UncleCode	fdbcddbf1a	Merge pull request #1546 from unclecode/sponsors	2025-10-17 18:07:16 +08:00
Aravind Karnam	564d437d97	docs: fix order of star history and Current sponsors	2025-10-17 15:31:29 +05:30
Aravind Karnam	9cd06ea7eb	docs: fix order of star history and Current sponsors	2025-10-17 15:30:02 +05:30
Aravind Karnam	eb257c2ba3	docs: fixed sponsorship link	2025-10-13 17:47:42 +05:30
Aravind Karnam	8d364a0731	docs: Adjust background of sponsor logo to compensate for light themes	2025-10-13 17:45:10 +05:30
Aravind Karnam	6aff0e55aa	docs: Adjust background of sponsor logo to compensate for light themes	2025-10-13 17:42:29 +05:30
Aravind Karnam	38a0742708	docs: Adjust background of sponsor logo to compensate for light themes	2025-10-13 17:41:19 +05:30
Aravind Karnam	a720a3a9fe	docs: Adjust background of sponsor logo to compensate for light themes	2025-10-13 17:32:34 +05:30
Aravind Karnam	017144c2dd	docs: Adjust background of sponsor logo to compensate for light themes	2025-10-13 17:30:22 +05:30
Aravind Karnam	32887ea40d	docs: Adjust background of sponsor logo to compensate for light themes	2025-10-13 17:13:52 +05:30
Aravind Karnam	eea41bf1ca	docs: Add a slight background to compensate light theme on github docs	2025-10-13 17:00:24 +05:30
Aravind Karnam	21c302f439	docs: Add Current sponsors section in README file	2025-10-13 16:45:16 +05:30
UncleCode	e651e045c4	Release v0.7.4: Merge release branch - Merge release/v0.7.4 into main - Version: 0.7.4 - Ready for tag and publication v0.7.4	2025-08-17 19:46:48 +08:00
UncleCode	5398acc7d2	docs: add v0.7.4 release blog post and update documentation - Add comprehensive v0.7.4 release blog post with LLMTableExtraction feature highlight - Update blog index to feature v0.7.4 as latest release - Update README.md to showcase v0.7.4 features alongside v0.7.3 - Accurately describe dispatcher fix as bug fix rather than major enhancement - Include practical code examples for new LLMTableExtraction capabilities	2025-08-17 19:45:23 +08:00
UncleCode	22c7932ba3	chore(version): update version to 0.7.4	2025-08-17 19:22:23 +08:00
UncleCode	2ab0bf27c2	refactor(utils): move memory utilities to utils and update imports	2025-08-17 19:14:55 +08:00
ntohidi	d30dc9fdc1	fix(http-crawler): bring back HTTP crawler strategy	2025-08-16 09:27:23 +08:00
ntohidi	e6044e6053	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop	2025-08-15 19:44:06 +08:00
ntohidi	a50e47adad	Merge branch 'feature/table-extraction-strategies' into develop	2025-08-15 19:41:37 +08:00
ntohidi	ada7441bd1	refactor: Update LLMTableExtraction examples and tests	2025-08-15 19:11:26 +08:00
ntohidi	9f7fee91a9	feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables BREAKING CHANGE: Table extraction now uses Strategy Design Pattern This epic commit introduces a game-changing approach to table extraction in Crawl4AI: ✨ NEW FEATURES: - LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan - Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries - Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction - Intelligent Merging: Seamlessly combines chunk results into complete tables - Header Preservation: Each chunk maintains context with original headers - Auto-retry Logic: Built-in resilience with configurable retry attempts 🏗️ ARCHITECTURE: - Strategy Design Pattern for pluggable table extraction strategies - ThreadPoolExecutor for concurrent chunk processing - Token-based chunking with configurable thresholds - Handles tables without headers gracefully ⚡ PERFORMANCE: - Process 1000+ row tables without timeout - Parallel processing with up to 5 concurrent chunks - Smart token estimation prevents LLM context overflow - Optimized for providers like Groq for massive tables 🔧 CONFIGURATION: - enable_chunking: Auto-handle large tables (default: True) - chunk_token_threshold: When to split (default: 3000 tokens) - min_rows_per_chunk: Meaningful chunk sizes (default: 10) - max_parallel_chunks: Concurrent processing (default: 5) 📚 BACKWARD COMPATIBILITY: - Existing code continues to work unchanged - DefaultTableExtraction remains the default strategy - Progressive enhancement approach This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.	2025-08-15 19:11:26 +08:00
AHMET YILMAZ	7f48655cf1	feat(browser-profiler): implement cross-platform keyboard listeners and improve quit handling	2025-08-15 19:11:26 +08:00
prokopis3	1417a67e90	chore(profile-test): fix filename typo ( test_crteate_profile.py → test_create_profile.py ) - Rename file to correct spelling - No content changes	2025-08-15 19:11:26 +08:00
prokopis3	19398d33ef	fix(browser_profiler): improve keyboard input handling - fix handling of special keys in Windows msvcrt implementation - Guard against UnicodeDecodeError from multi-byte key sequences - Filter out non-printable characters and control sequences - Add error handling to prevent coroutine crashes - Add unit test to verify keyboard input handling Key changes: - Safe UTF-8 decoding with try/except for special keys - Skip non-printable and multi-byte character sequences - Add broad exception handling in keyboard listener Test runs on Windows only due to msvcrt dependency.	2025-08-15 19:11:26 +08:00
prokopis3	263d362daa	fix(browser_profiler): cross-platform 'q' to quit This commit introduces platform-specific handling for the 'q' key press to quit the browser profiler, ensuring compatibility with both Windows and Unix-like systems. It also adds a check to see if the browser process has already exited, terminating the input listener if so. - Implemented `msvcrt` for Windows to capture keyboard input without requiring a newline. - Retained `termios`, `tty`, and `select` for Unix-like systems. - Added a check for browser process termination to gracefully exit the input listener. - Updated logger messages to use colored output for better user experience.	2025-08-15 19:11:26 +08:00
ntohidi	bac92a47e4	refactor: Update LLMTableExtraction examples and tests	2025-08-15 18:47:31 +08:00
ntohidi	a51545c883	feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables BREAKING CHANGE: Table extraction now uses Strategy Design Pattern This epic commit introduces a game-changing approach to table extraction in Crawl4AI: ✨ NEW FEATURES: - LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan - Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries - Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction - Intelligent Merging: Seamlessly combines chunk results into complete tables - Header Preservation: Each chunk maintains context with original headers - Auto-retry Logic: Built-in resilience with configurable retry attempts 🏗️ ARCHITECTURE: - Strategy Design Pattern for pluggable table extraction strategies - ThreadPoolExecutor for concurrent chunk processing - Token-based chunking with configurable thresholds - Handles tables without headers gracefully ⚡ PERFORMANCE: - Process 1000+ row tables without timeout - Parallel processing with up to 5 concurrent chunks - Smart token estimation prevents LLM context overflow - Optimized for providers like Groq for massive tables 🔧 CONFIGURATION: - enable_chunking: Auto-handle large tables (default: True) - chunk_token_threshold: When to split (default: 3000 tokens) - min_rows_per_chunk: Meaningful chunk sizes (default: 10) - max_parallel_chunks: Concurrent processing (default: 5) 📚 BACKWARD COMPATIBILITY: - Existing code continues to work unchanged - DefaultTableExtraction remains the default strategy - Progressive enhancement approach This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.	2025-08-14 18:21:24 +08:00
Nasrin	11b310edef	Merge pull request #1378 from unclecode/fix/exit_with_q Cross Platform fix for browser profiler	2025-08-13 14:16:47 +08:00
Nasrin	926e41aab8	Merge pull request #1378 from unclecode/fix/exit_with_q Cross Platform fix for browser profiler	2025-08-13 14:16:47 +08:00
Nasrin	489981e670	Merge pull request #1390 from unclecode/fix/docker-raw-html Check for raw: and raw:// URLs before auto-appending https:// prefix	2025-08-13 13:56:33 +08:00
Nasrin	b92be4ef66	Merge pull request #1371 from unclecode/bug/proxy_config #1057 : enhance ProxyConfig initialization to support dict and string…	2025-08-12 16:55:52 +08:00
Nasrin	7c0edaf266	Merge pull request #1384 from unclecode/fix/update_docker_examples docs: remove CRAWL4AI_API_TOKEN references and use correct endpoints in Docker example scripts (#1015)	2025-08-12 16:53:42 +08:00
ntohidi	dfcfd8ae57	fix(dispatcher): enable true concurrency for fast-completing tasks in arun_many. REF: #560 The MemoryAdaptiveDispatcher was processing tasks sequentially despite max_session_permit > 1 due to fetching only one task per event loop iteration. This particularly affected raw:// URLs which complete in microseconds. Changes: - Replace single task fetch with greedy slot filling using get_nowait() - Fill all available slots (up to max_session_permit) immediately - Break on empty queue instead of waiting with timeout This ensures proper parallelization for all task types, especially ultra-fast operations like raw HTML processing.	2025-08-12 16:51:22 +08:00
ntohidi	955110a8b0	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop	2025-08-12 12:22:25 +08:00
Soham Kukreti	f30811b524	fix: Check for raw: and raw:// URLs before auto-appending https:// prefix - Add raw HTML URL validation alongside http/https checks - Fix URL preprocessing logic to handle raw: and raw:// prefixes - Update error message and add comprehensive test cases	2025-08-11 22:10:53 +05:30
ntohidi	8146d477e9	Merge branch 'main' into develop	2025-08-11 18:56:15 +08:00
ntohidi	96c4b0de67	fix(browser_manager): serialize new_page on persistent context to avoid races ref #1198 - Add _page_lock and guarded creation; handle empty context.pages safely - Prevents BrowserContext.new_page “Target page/context closed” during concurrent arun_many	2025-08-11 18:55:43 +08:00
Nasrin	57c14db7cb	Merge pull request #1381 from unclecode/fix/base-tag-link-resolution fix: Implement base tag support in link extraction (#1147)	2025-08-11 18:32:32 +08:00
Soham Kukreti	cd2dd68e4c	docs: remove CRAWL4AI_API_TOKEN references and use correct endpoints in Docker example scripts (#1015) - Remove deprecated API token authentication from all Docker examples - Fix async job endpoints: /crawl -> /crawl/job for submission, /task/{id} -> /crawl/job/{id} for polling - Fix sync endpoint: /crawl_sync -> /crawl (synchronous) - Remove non-existent /crawl_direct endpoint - Update request format to use new structure with browser_config and crawler_config - Fix response handling for both async and sync calls - Update extraction strategy format to use proper nested structure - Add Ollama connectivity check before running tests - Update test schemas and selectors for current website structures This makes the Docker examples work out-of-the-box with the current API structure.	2025-08-09 19:37:22 +05:30
UncleCode	f0ce7b2710	feat: add v0.7.3 release notes, changelog updates, and documentation for new features	2025-08-09 21:04:18 +08:00
UncleCode	21f79fe166	Release v0.7.3: Merge release branch - Merge release/v0.7.3 into main - Version: 0.7.3 - Ready for tag and publication v0.7.3	2025-08-09 20:11:35 +08:00
unclecode	a9a2d798b4	feat: update sponsorship tier details and add custom arrangements note	2025-08-09 20:10:32 +08:00
unclecode	612270fcb0	feat: add scheduling link to contact information in SPONSORS.md	2025-08-09 20:05:59 +08:00
unclecode	bc099fdd76	Merge branch 'main' into release/v0.7.3	2025-08-09 19:30:46 +08:00
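
Commit 8a37710313 above describes webhook delivery with exponential backoff (5 attempts: 1s, 2s, 4s, 8s, 16s). The following is a minimal sketch of that retry schedule, not the code in webhook.py. The names backoff_delays, deliver_with_retry, and send are hypothetical, and the sketch assumes no wait is needed after the final failed attempt, which the commit message does not specify.

```python
import time
from typing import Callable, List


def backoff_delays(max_attempts: int = 5, initial_delay: float = 1.0) -> List[float]:
    """Return the doubling delay schedule: 1s, 2s, 4s, 8s, 16s by default."""
    return [initial_delay * (2 ** attempt) for attempt in range(max_attempts)]


def deliver_with_retry(
    send: Callable[[], bool],
    max_attempts: int = 5,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send() until it returns True or max_attempts is exhausted.

    send is a hypothetical callable that performs one HTTP POST (for
    example with a 30-second timeout) and returns True on success.
    """
    delays = backoff_delays(max_attempts, initial_delay)
    for attempt, delay in enumerate(delays):
        try:
            if send():
                return True
        except Exception:
            # Treat any delivery error as a failed attempt and retry.
            pass
        # Assumption: do not sleep after the last attempt.
        if attempt < max_attempts - 1:
            sleep(delay)
    return False
```

Passing sleep as a parameter keeps the schedule testable without real waiting.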

1053 Commits
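
Commit dfcfd8ae57 above replaces a single task fetch per loop iteration with greedy slot filling via get_nowait(). The idea can be sketched as below. This is an illustration only, not the MemoryAdaptiveDispatcher code: the function fill_slots and its parameters are hypothetical, and the use of a plain asyncio.Queue is an assumption.

```python
import asyncio
from typing import Any, List


def fill_slots(queue: asyncio.Queue, active_count: int, max_session_permit: int) -> List[Any]:
    """Take as many queued tasks as there are free slots, without waiting.

    Stops when all slots are filled or the queue is empty, instead of
    fetching one task and then blocking on a timeout.
    """
    picked: List[Any] = []
    while active_count + len(picked) < max_session_permit:
        try:
            picked.append(queue.get_nowait())
        except asyncio.QueueEmpty:
            # Nothing left to schedule right now; break rather than wait.
            break
    return picked
```

With one task already running and max_session_permit set to 3, a call against a queue holding three URLs would return the first two and leave the third queued.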