crawl4ai

Author	SHA1	Message	Date
Nasrin	af28e84a21	Merge pull request #1441 from unclecode/fix/improve-docker-error-handling Improve docker error handling	2025-09-02 11:56:01 +08:00
Nasrin	5e7fcb17e1	Merge pull request #1448 from unclecode/fix/https-reditrect feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling	2025-09-01 16:11:25 +08:00
ntohidi	6e728096fa	fix(auth): fixed Docker JWT authentication. ref #1442	2025-09-01 12:48:16 +08:00
Nasrin	2de200c1ba	Merge pull request #1433 from Thermofish/fix/excluded_selector fix(deps): reintroduce cssselect to restore excluded_selector support (#1405)	2025-08-29 16:08:24 +08:00
nafeqq-1306	9749e2832d	issue #1329 refactor(crawler): move unwanted properties to CrawlerRunConfig class	2025-08-29 10:20:47 +08:00
Soham Kukreti	70f473b84d	fix: drop Python 3.9 support and require Python >=3.10. The library no longer supports Python 3.9 and so it was important to drop all references to python 3.9. Following changes have been made: - pyproject.toml: set requires-python to ">=3.10"; remove 3.9 classifier - setup.py: set python_requires to ">=3.10"; remove 3.9 classifier - docs: update Python version mentions - deploy/docker/c4ai-doc-context.md: options -> 3.10, 3.11, 3.12, 3.13	2025-08-28 19:31:19 +05:30
ntohidi	bdacf61ca9	feat: update documentation for preserve_https_for_internal_links. ref #1410	2025-08-28 17:48:12 +08:00
ntohidi	f566c5a376	feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling. Ref #1410 Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.	2025-08-28 17:38:40 +08:00
AHMET YILMAZ	4ed33fce9e	Remove deprecated test for 'proxy' parameter in BrowserConfig and update .gitignore to include test_scripts directory.	2025-08-28 17:26:10 +08:00
AHMET YILMAZ	f7a3366f72	#1375 : refactor(proxy) Deprecate 'proxy' parameter in BrowserConfig and enhance proxy string parsing - Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials. - Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility. - Added warnings for deprecated usage and clarified behavior when both parameters are provided. - Updated documentation and tests to reflect changes in proxy configuration handling.	2025-08-28 17:21:49 +08:00
Nasrin	4e1c4bd24e	Merge pull request #1436 from unclecode/fix/docker-filter fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy	2025-08-27 11:08:42 +08:00
Soham Kukreti	2ad3fb5fc8	feat(docker): improve docker error handling - Return comprehensive error messages along with status codes for api internal errors. - Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints - Add sanitization to ensure fit_html is always JSON-serializable (string or None) - Add comprehensive error handling test suite.	2025-08-26 23:18:35 +05:30
Nasrin	cce3390a2d	Merge pull request #1426 from unclecode/fix/update-quickstart-and-adaptive-strategies-docs Update Quickstart and Adaptive Strategies documentation	2025-08-26 16:53:47 +08:00
Nasrin	4fe2d01361	Merge pull request #1440 from unclecode/feature/docker-llm-parameters feat(docker): Add temperature and base_url parameters for LLM configuration	2025-08-26 16:48:17 +08:00
ntohidi	159207b86f	feat(docker): Add temperature and base_url parameters for LLM configuration. ref #1035 Implement hierarchical configuration for LLM parameters with support for: - Temperature control (0.0-2.0) to adjust response creativity - Custom base_url for proxy servers and alternative endpoints - 4-tier priority: request params > provider env > global env > defaults Add helper functions in utils.py, update API schemas and handlers, support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.), and provide comprehensive documentation with examples.	2025-08-26 16:44:07 +08:00
ntohidi	38f3ea42a7	fix(logger): ensure logger is a Logger instance in crawling strategies. ref #1437	2025-08-26 12:06:56 +08:00
ntohidi	102352eac4	fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy (ref #1419 ) - Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params - Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization - Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors - Ensure filter chains work correctly with Docker client and REST API The issue occurred because: 1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization 2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors Changes: - async_configs.py: Comment out __slots__ serialization logic (lines 100-109) - filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store as public attributes - models.py: Convert property descriptors to strings in model_dump() instead of including them directly	2025-08-25 14:04:08 +08:00
James T. Wood	f2da460bb9	fix(dependencies): add cssselect to project dependencies Fixes bug reported in issue #1405 [Bug]: Excluded selector (excluded_selector) doesn't work This commit reintroduces the cssselect library which was removed by PR (https://github.com/unclecode/crawl4ai/pull/1368) and merged via (`437395e490`). Integration tested against 0.7.4 Docker container. Reintroducing the cssselect package eliminated the errors seen in logs and restored excluded_selector functionality. Refs: #1405	2025-08-24 22:12:20 -04:00
Soham Kukreti	b1dff5a4d3	feat: Add comprehensive website to API example with frontend This commit adds a complete, web scraping API example that demonstrates how to get structured data from any website and use it like an API using the crawl4ai library with a minimalist frontend interface. Core Functionality - AI-powered web scraping with plain English queries - Dual scraping approaches: Schema-based (faster) and LLM-based (flexible) - Intelligent schema caching for improved performance - Custom LLM model support with API key management - Automatic duplicate request prevention Modern Frontend Interface - Minimalist black-and-white design inspired by modern web apps - Responsive layout with smooth animations and transitions - Three main pages: Scrape Data, Models Management, API Request History - Real-time results display with JSON formatting - Copy-to-clipboard functionality for extracted data - Toast notifications for user feedback - Auto-scroll to results when scraping starts Model Management System - Web-based model configuration interface - Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.) - Simplified configuration requiring only provider and API token - Add, list, and delete model configurations - Secure storage of API keys in local JSON files API Request History - Automatic saving of all API requests and responses - Display of request history with URL, query, and cURL commands - Duplicate prevention (same URL + query combinations) - Request deletion functionality - Clean, simplified display focusing on essential information Technical Implementation Backend (FastAPI) - RESTful API with comprehensive endpoints - Pydantic models for request/response validation - Async web scraping with crawl4ai library - Error handling with detailed error messages - File-based storage for models and request history Frontend (Vanilla JS/CSS/HTML) - No framework dependencies - pure HTML, CSS, JavaScript - Modern CSS Grid and Flexbox layouts - Custom dropdown styling with SVG arrows - Responsive design for mobile and desktop - Smooth scrolling and animations Core Library Integration - WebScraperAgent class for orchestration - ModelConfig class for LLM configuration management - Schema generation and caching system - LLM extraction strategy support - Browser configuration with headless mode	2025-08-24 18:52:37 +05:30
ntohidi	40ab287c90	fix(utils): Improve URL normalization by avoiding quote/unquote to preserve '+' signs. ref #1332	2025-08-22 12:05:21 +08:00
Soham Kukreti	c09a57644f	docs: update adaptive crawler docs and cache defaults; remove deprecated examples (#1330 ) - Replace BaseStrategy with CrawlStrategy in custom strategy examples (DomainSpecificStrategy, HybridStrategy) - Remove “Custom Link Scoring” and “Caching Strategy” sections no longer aligned with current library - Revise memory pruning example to use adaptive.get_relevant_content and index-based retention of top 500 docs - Correct Quickstart note: default cache mode is CacheMode.BYPASS; instruct enabling with CacheMode.ENABLED	2025-08-21 19:11:31 +05:30
ntohidi	90af453506	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop	2025-08-21 14:10:01 +08:00
Nasrin	8bb0e68cce	Merge pull request #1422 from unclecode/fix/docker-llmEnvFile fix(docker): Fix LLM API key handling for multi-provider support	2025-08-21 14:05:06 +08:00
ntohidi	95051020f4	fix(docker): Fix LLM API key handling for multi-provider support Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers due to a hardcoded api_key_env fallback in config.yml. This caused authentication errors when using non-OpenAI providers like Gemini. Changes: - Remove api_key_env from config.yml to let litellm handle provider-specific env vars - Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys - Update validate_llm_provider() to trust litellm's built-in key detection - Update documentation to reflect the new automatic key handling The fix leverages litellm's existing capability to automatically find the correct environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.) without manual configuration. ref #1291	2025-08-21 14:01:04 +08:00
ntohidi	69961cf40b	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop	2025-08-20 16:56:19 +08:00
Nasrin	ef174a4c7a	Merge pull request #1104 from emmanuel-ferdman/main fix(docker-api): migrate to modern datetime library API	2025-08-20 10:57:39 +08:00
Nasrin	f4206d6ba1	Merge pull request #1369 from NezarAli/main Fix examples in README.md	2025-08-18 14:22:54 +08:00
ntohidi	9447054a65	docs: update Docker instructions to use the latest release tag	2025-08-18 14:20:05 +08:00
Nasrin	dad7c51481	Merge pull request #1398 from unclecode/fix/update-url-seeding-docs Update URL seeding examples to use proper async context managers	2025-08-18 13:00:26 +08:00
ntohidi	f4a432829e	fix(crawler): Removed the incorrect reference in browser_config variable #1310	2025-08-18 10:59:14 +08:00
UncleCode	e651e045c4	Release v0.7.4: Merge release branch - Merge release/v0.7.4 into main - Version: 0.7.4 - Ready for tag and publication v0.7.4	2025-08-17 19:46:48 +08:00
UncleCode	5398acc7d2	docs: add v0.7.4 release blog post and update documentation - Add comprehensive v0.7.4 release blog post with LLMTableExtraction feature highlight - Update blog index to feature v0.7.4 as latest release - Update README.md to showcase v0.7.4 features alongside v0.7.3 - Accurately describe dispatcher fix as bug fix rather than major enhancement - Include practical code examples for new LLMTableExtraction capabilities	2025-08-17 19:45:23 +08:00
UncleCode	22c7932ba3	chore(version): update version to 0.7.4	2025-08-17 19:22:23 +08:00
UncleCode	2ab0bf27c2	refactor(utils): move memory utilities to utils and update imports	2025-08-17 19:14:55 +08:00
ntohidi	d30dc9fdc1	fix(http-crawler): bring back HTTP crawler strategy	2025-08-16 09:27:23 +08:00
ntohidi	e6044e6053	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop	2025-08-15 19:44:06 +08:00
ntohidi	a50e47adad	Merge branch 'feature/table-extraction-strategies' into develop	2025-08-15 19:41:37 +08:00
ntohidi	ada7441bd1	refactor: Update LLMTableExtraction examples and tests	2025-08-15 19:11:26 +08:00
ntohidi	9f7fee91a9	feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables BREAKING CHANGE: Table extraction now uses Strategy Design Pattern This epic commit introduces a game-changing approach to table extraction in Crawl4AI: ✨ NEW FEATURES: - LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan - Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries - Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction - Intelligent Merging: Seamlessly combines chunk results into complete tables - Header Preservation: Each chunk maintains context with original headers - Auto-retry Logic: Built-in resilience with configurable retry attempts 🏗️ ARCHITECTURE: - Strategy Design Pattern for pluggable table extraction strategies - ThreadPoolExecutor for concurrent chunk processing - Token-based chunking with configurable thresholds - Handles tables without headers gracefully ⚡ PERFORMANCE: - Process 1000+ row tables without timeout - Parallel processing with up to 5 concurrent chunks - Smart token estimation prevents LLM context overflow - Optimized for providers like Groq for massive tables 🔧 CONFIGURATION: - enable_chunking: Auto-handle large tables (default: True) - chunk_token_threshold: When to split (default: 3000 tokens) - min_rows_per_chunk: Meaningful chunk sizes (default: 10) - max_parallel_chunks: Concurrent processing (default: 5) 📚 BACKWARD COMPATIBILITY: - Existing code continues to work unchanged - DefaultTableExtraction remains the default strategy - Progressive enhancement approach This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.	2025-08-15 19:11:26 +08:00
AHMET YILMAZ	7f48655cf1	feat(browser-profiler): implement cross-platform keyboard listeners and improve quit handling	2025-08-15 19:11:26 +08:00
prokopis3	1417a67e90	chore(profile-test): fix filename typo ( test_crteate_profile.py → test_create_profile.py ) - Rename file to correct spelling - No content changes	2025-08-15 19:11:26 +08:00
prokopis3	19398d33ef	fix(browser_profiler): improve keyboard input handling - fix handling of special keys in Windows msvcrt implementation - Guard against UnicodeDecodeError from multi-byte key sequences - Filter out non-printable characters and control sequences - Add error handling to prevent coroutine crashes - Add unit test to verify keyboard input handling Key changes: - Safe UTF-8 decoding with try/except for special keys - Skip non-printable and multi-byte character sequences - Add broad exception handling in keyboard listener Test runs on Windows only due to msvcrt dependency.	2025-08-15 19:11:26 +08:00
prokopis3	263d362daa	fix(browser_profiler): cross-platform 'q' to quit This commit introduces platform-specific handling for the 'q' key press to quit the browser profiler, ensuring compatibility with both Windows and Unix-like systems. It also adds a check to see if the browser process has already exited, terminating the input listener if so. - Implemented `msvcrt` for Windows to capture keyboard input without requiring a newline. - Retained `termios`, `tty`, and `select` for Unix-like systems. - Added a check for browser process termination to gracefully exit the input listener. - Updated logger messages to use colored output for better user experience.	2025-08-15 19:11:26 +08:00
ntohidi	bac92a47e4	refactor: Update LLMTableExtraction examples and tests	2025-08-15 18:47:31 +08:00
ntohidi	a51545c883	feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables BREAKING CHANGE: Table extraction now uses Strategy Design Pattern This epic commit introduces a game-changing approach to table extraction in Crawl4AI: ✨ NEW FEATURES: - LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan - Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries - Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction - Intelligent Merging: Seamlessly combines chunk results into complete tables - Header Preservation: Each chunk maintains context with original headers - Auto-retry Logic: Built-in resilience with configurable retry attempts 🏗️ ARCHITECTURE: - Strategy Design Pattern for pluggable table extraction strategies - ThreadPoolExecutor for concurrent chunk processing - Token-based chunking with configurable thresholds - Handles tables without headers gracefully ⚡ PERFORMANCE: - Process 1000+ row tables without timeout - Parallel processing with up to 5 concurrent chunks - Smart token estimation prevents LLM context overflow - Optimized for providers like Groq for massive tables 🔧 CONFIGURATION: - enable_chunking: Auto-handle large tables (default: True) - chunk_token_threshold: When to split (default: 3000 tokens) - min_rows_per_chunk: Meaningful chunk sizes (default: 10) - max_parallel_chunks: Concurrent processing (default: 5) 📚 BACKWARD COMPATIBILITY: - Existing code continues to work unchanged - DefaultTableExtraction remains the default strategy - Progressive enhancement approach This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.	2025-08-14 18:21:24 +08:00
Soham Kukreti	ecbe5ffb84	docs: Update URL seeding examples to use proper async context managers - Wrap all AsyncUrlSeeder usage with async context managers - Update URL seeding adventure example to use "sitemap+cc" source, focus on course posts, and add stream=True parameter to fix runtime error	2025-08-13 18:16:46 +05:30
Nasrin	11b310edef	Merge pull request #1378 from unclecode/fix/exit_with_q Cross Platform fix for browser profiler	2025-08-13 14:16:47 +08:00
Nasrin	926e41aab8	Merge pull request #1378 from unclecode/fix/exit_with_q Cross Platform fix for browser profiler	2025-08-13 14:16:47 +08:00
Nasrin	489981e670	Merge pull request #1390 from unclecode/fix/docker-raw-html Check for raw: and raw:// URLs before auto-appending https:// prefix	2025-08-13 13:56:33 +08:00
Nasrin	b92be4ef66	Merge pull request #1371 from unclecode/bug/proxy_config #1057 : enhance ProxyConfig initialization to support dict and string…	2025-08-12 16:55:52 +08:00
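
Commit 159207b86f in the log above describes a 4-tier priority for LLM parameters: request params > provider env > global env > defaults, with temperature limited to 0.0-2.0. The sketch below shows one way that lookup order could work. It is not the helper code added to utils.py in that commit: the function names, the "provider/model" string format, and the choice to ignore out-of-range environment values are all assumptions made for illustration. Only the environment variable names LLM_TEMPERATURE and OPENAI_TEMPERATURE come from the commit message.

```python
import os
from typing import Mapping, Optional


def _parse_temperature(raw: Optional[str]) -> Optional[float]:
    """Return a float in [0.0, 2.0], or None if raw is missing or invalid."""
    if raw is None:
        return None
    try:
        value = float(raw)
    except ValueError:
        return None
    return value if 0.0 <= value <= 2.0 else None


def resolve_temperature(
    provider: str,
    request_value: Optional[float] = None,
    env: Optional[Mapping[str, str]] = None,
    default: Optional[float] = None,
) -> Optional[float]:
    """Hypothetical resolver: request > provider env > global env > default."""
    env = os.environ if env is None else env

    # Tier 1: an explicit value in the request wins outright.
    if request_value is not None:
        if not 0.0 <= request_value <= 2.0:
            raise ValueError("temperature must be between 0.0 and 2.0")
        return request_value

    # Tier 2: provider-specific variable, e.g. OPENAI_TEMPERATURE.
    # Assumes the provider is written as "openai/some-model".
    prefix = provider.split("/", 1)[0].upper()
    provider_value = _parse_temperature(env.get(f"{prefix}_TEMPERATURE"))
    if provider_value is not None:
        return provider_value

    # Tier 3: global variable shared by all providers.
    global_value = _parse_temperature(env.get("LLM_TEMPERATURE"))
    if global_value is not None:
        return global_value

    # Tier 4: caller-supplied default (None here means "let the provider decide").
    return default
```

Passing the environment as a mapping keeps the function easy to test. With `{"OPENAI_TEMPERATURE": "0.2", "LLM_TEMPERATURE": "0.9"}`, an OpenAI provider resolves to 0.2 and any other provider to 0.9, unless the request supplies its own value.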


1171 Commits
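
Commits 9f7fee91a9 and a51545c883 describe token-based chunking for large tables: split at row boundaries, repeat the original header in every chunk, and respect chunk_token_threshold (default 3000) and min_rows_per_chunk (default 10). The following is a minimal sketch of that idea, not the LLMTableExtraction implementation. The function names and the four-characters-per-token estimate are assumptions, and the parallel processing and merging steps are omitted.

```python
from typing import List, Sequence

Row = Sequence[str]


def estimate_tokens(row: Row) -> int:
    """Crude token estimate (assumption: about 4 characters per token)."""
    return max(1, sum(len(cell) for cell in row) // 4)


def chunk_rows(
    header: Row,
    rows: Sequence[Row],
    chunk_token_threshold: int = 3000,
    min_rows_per_chunk: int = 10,
) -> List[List[Row]]:
    """Split rows into chunks at row boundaries, repeating the header in each."""
    chunks: List[List[Row]] = []
    current: List[Row] = []
    current_tokens = estimate_tokens(header)

    for row in rows:
        row_tokens = estimate_tokens(row)
        over_budget = current_tokens + row_tokens > chunk_token_threshold
        # Close the chunk only once it holds the minimum number of rows,
        # so a few very long rows do not produce tiny chunks.
        if over_budget and len(current) >= min_rows_per_chunk:
            chunks.append([header, *current])
            current = []
            current_tokens = estimate_tokens(header)
        current.append(row)
        current_tokens += row_tokens

    if current:
        chunks.append([header, *current])
    return chunks
```

Each returned chunk starts with the header row, so it can be sent to an LLM on its own. The results could then be merged by dropping the repeated header from every chunk after the first.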