crawl4ai

Author	SHA1	Message	Date
unclecode	13a414802b	Add set_defaults/get_defaults/reset_defaults to config classes	2026-01-31 11:44:07 +00:00
unclecode	19b9140c68	Improve CDP connection handling	2026-01-31 11:07:26 +00:00
unclecode	911bbce8b1	Fix agenerate_schema() JSON parsing for Anthropic models Strip markdown code fences (```json ... ```) from LLM responses before json.loads() in agenerate_schema(). Anthropic models wrap JSON output in markdown fences when litellm silently drops the unsupported response_format parameter, causing json.loads("") parse failures. - Add _strip_markdown_fences() helper to extraction_strategy.py - Apply fence stripping + empty response check in agenerate_schema() - Separate JSONDecodeError for clearer error messages - Add 34 tests: unit, real API integration (Anthropic/OpenAI/Groq against quotes.toscrape.com), and regression parametrized	2026-01-29 11:38:53 +00:00
unclecode	0a17fe8f19	Improve page tracking with global CDP endpoint-based tracking - Use class-level tracking keyed by normalized CDP URL - All BrowserManager instances connecting to same browser share tracking - For CDP connections, always create new pages (cross-connection page sharing isn't reliable in Playwright) - For managed browsers, page reuse works within same process - Normalize CDP URLs to handle different formats (http, ws, query params)	2026-01-28 09:30:20 +00:00
unclecode	9b52c1490b	Fix page reuse race condition when create_isolated_context=False When using create_isolated_context=False with concurrent crawls, multiple tasks would reuse the same page (pages[0]) causing navigation race conditions and "Page.content: Unable to retrieve content because the page is navigating" errors. Changes: - Add _pages_in_use set to track pages currently being used by crawls - Rewrite get_page() to only reuse pages that are not in use - Create new pages when all existing pages are busy - Add release_page() method to release pages after crawl completes - Update cleanup paths to release pages before closing This maintains context sharing (cookies, localStorage) while ensuring each concurrent crawl gets its own isolated page for navigation. Includes integration tests verifying: - Single and sequential crawls still work - Concurrent crawls don't cause race conditions - High concurrency (10 simultaneous crawls) works - Page tracking state remains consistent	2026-01-28 01:43:21 +00:00
unclecode	94e19a4c72	Enhance browser profile management capabilities	2026-01-24 08:02:52 +00:00
unclecode	f6897d1429	Add cancellation support for deep crawl strategies - Add should_cancel callback parameter to BFS, DFS, and BestFirst strategies - Add cancel() method for immediate cancellation (thread-safe) - Add cancelled property to check cancellation status - Add _check_cancellation() internal method supporting both sync/async callbacks - Reset cancel event on strategy reuse for multiple crawls - Include cancelled flag in state notifications via on_state_change - Handle callback exceptions gracefully (fail-open, log warning) - Add comprehensive test suite with 26 tests covering all edge cases This enables external callers (e.g., cloud platforms) to stop a running deep crawl mid-execution and retrieve partial results.	2026-01-22 06:08:25 +00:00
unclecode	418bfcfd3b	Fix redirected_url containing raw HTML content for raw: URLs When using raw: URLs without a base_url, redirected_url was incorrectly set to the entire raw HTML string (potentially 300KB+) instead of None. Changes: - async_crawler_strategy.py: Don't fall back to url for raw:/file:// URLs in fast path, browser path, and HTTP strategy - async_crawler_strategy.py: Skip page.url assignment for local content (would return "about:blank") - async_webcrawler.py: Don't fall back to url for raw: URLs in crawl result and cached result paths - Add comprehensive test suite for redirected_url handling	2026-01-20 00:45:15 +00:00
ntohidi	acfab80dd4	Enhance authentication flow by implementing JWT token retrieval and adding authorization headers to API requests	2026-01-12 13:46:32 +01:00
unclecode	2550f3d2d5	Add browser pipeline support for raw:/file:// URLs - Add process_in_browser parameter to CrawlerRunConfig - Route raw:/file:// URLs through _crawl_web() when browser operations needed - Use page.set_content() instead of goto() for local content - Fix cookie handling for non-HTTP URLs in browser_manager - Auto-detect browser requirements: js_code, wait_for, screenshot, etc. - Maintain fast path for raw:/file:// without browser params Fixes #310	2025-12-27 12:32:42 +00:00
unclecode	9e7f5aa44b	Updates on proxy rotation and proxy configuration	2025-12-26 12:45:57 +00:00
unclecode	fde4e9f0c6	Add prefetch mode for two-phase deep crawling - Add `prefetch` parameter to CrawlerRunConfig - Add `quick_extract_links()` function for fast link extraction - Add short-circuit in aprocess_html() for prefetch mode - Add 42 tests (unit, integration, regression) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-25 01:55:08 +00:00
unclecode	31ebf37252	Add crash recovery for deep crawl strategies Add optional resume_state and on_state_change parameters to all deep crawl strategies (BFS, DFS, Best-First) for cloud deployment crash recovery. Features: - resume_state: Pass saved state to resume from checkpoint - on_state_change: Async callback fired after each URL for real-time state persistence to external storage (Redis, DB, etc.) - export_state(): Get last captured state manually - Zero overhead when features are disabled (None defaults) State includes visited URLs, pending queue/stack, depths, and pages_crawled count. All state is JSON-serializable.	2025-12-22 14:51:10 +00:00
unclecode	48426f73f0	Some debugging for caching	2025-12-21 04:48:03 +00:00
unclecode	02acad1dc6	Fix CDP connection handling: support WS URLs and proper cleanup Changes to browser_manager.py: 1. _verify_cdp_ready(): Support multiple URL formats - WebSocket URLs (ws://, wss://): Skip HTTP verification, Playwright handles directly - HTTP URLs with query params: Properly parse with urlparse to preserve query string - Fixes issue where naive f"{cdp_url}/json/version" broke WS URLs and query params 2. close(): Proper cleanup when cdp_cleanup_on_close=True - Close all sessions (pages) - Close all contexts - Call browser.close() to disconnect (doesn't terminate browser, just releases connection) - Wait 1 second for CDP connection to fully release - Stop Playwright instance to prevent memory leaks This enables: - Connecting to specific browsers via WS URL - Reusing the same browser with multiple sequential connections - No user wait needed between connections (internal 1s delay handles it) Added tests/browser/test_cdp_cleanup_reuse.py with comprehensive tests.	2025-12-18 22:04:52 +08:00
unclecode	8ae908bede	Add browser_context_id and target_id parameters to BrowserConfig Enable Crawl4AI to connect to pre-created CDP browser contexts, which is essential for cloud browser services that pre-create isolated contexts. Changes: - Add browser_context_id and target_id parameters to BrowserConfig - Update from_kwargs() and to_dict() methods - Modify BrowserManager.start() to use existing context when provided - Add _get_page_by_target_id() helper method - Update get_page() to handle pre-existing targets - Add test for browser_context_id functionality This enables cloud services to: 1. Create isolated CDP contexts before Crawl4AI connects 2. Pass context/target IDs to BrowserConfig 3. Have Crawl4AI reuse existing contexts instead of creating new ones	2025-12-13 02:42:48 +00:00
Nasrin	5a8fb57795	Merge pull request #1648 from christopher-w-murphy/fix/content-relevance-filter [Fix]: Docker server does not decode ContentRelevanceFilter	2025-12-03 18:36:07 +08:00
ntohidi	df4d87ed78	refactor: replace PyPDF2 with pypdf across the codebase. ref #1412	2025-12-03 10:59:18 +01:00
ntohidi	07ccf13be6	Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268	2025-12-02 13:00:54 +01:00
Chris Murphy	6893094f58	parameterized tests	2025-12-01 16:19:19 -05:00
Chris Murphy	33a3cc3933	reproduced AttributeError from #1642	2025-12-01 11:31:07 -05:00
Rachel Bushrian	7771ed3894	Merge branch 'develop' into fix/wrong_url_raw	2025-11-24 13:54:07 +02:00
Nasrin	b207ae2848	Merge pull request #1528 from unclecode/fix/managed-browser-cdp-timing Add CDP endpoint verification with exponential backoff for managed browsers	2025-11-12 23:53:57 +08:00
AHMET YILMAZ	80745bceb9	#1559: Add tests for sitemap parsing and URL normalization in AsyncUrlSeeder	2025-11-10 14:15:54 +08:00
ntohidi	a30548a98f	This commit resolves issue #1055 where LLM extraction was blocking async execution, causing URLs to be processed sequentially instead of in parallel. Changes: - Added aperform_completion_with_backoff() using litellm.acompletion for async LLM calls - Implemented arun() method in ExtractionStrategy base class with thread pool fallback - Created async arun() and aextract() methods in LLMExtractionStrategy using asyncio.gather - Updated AsyncWebCrawler.arun() to detect and use arun() when available - Added comprehensive test suite to verify parallel execution Impact: - LLM extraction now runs truly in parallel across multiple URLs - Significant performance improvement for multi-URL crawls with LLM strategies - Backward compatible - existing extraction strategies continue to work - No breaking changes to public API Technical details: - Uses litellm.acompletion for non-blocking LLM calls - Leverages asyncio.gather for concurrent chunk processing - Maintains backward compatibility via asyncio.to_thread fallback - Works seamlessly with MemoryAdaptiveDispatcher and other dispatchers	2025-11-06 11:22:45 +01:00
Nasrin	2c918155aa	Merge pull request #1529 from unclecode/fix/remove_overlay_elements Fix remove_overlay_elements functionality by calling injected JS function.	2025-11-06 00:10:32 +08:00
Claude	613097d121	test: add verification tests for pyOpenSSL security update - Add lightweight security test to verify version requirements - Add comprehensive integration test for crawl4ai functionality - Tests verify pyOpenSSL >= 25.3.0 and cryptography >= 45.0.7 - All tests passing: security vulnerability is resolved Related to #1545 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-23 06:57:25 +00:00
ntohidi	b71d624168	Merge branch 'implement-webhook-crawl-feature-011CULZY1Jy8N5MUkZqXkRVp' into develop	2025-10-22 13:12:25 +02:00
Claude	52da8d72bc	test: add comprehensive webhook feature test script Added end-to-end test script that automates webhook feature testing: Script Features (test_webhook_feature.sh): - Automatic branch switching and dependency installation - Redis and server startup/shutdown management - Webhook receiver implementation - Integration test for webhook notifications - Comprehensive cleanup and error handling - Returns to original branch after completion Test Flow: 1. Fetch and checkout webhook feature branch 2. Activate venv and install dependencies 3. Start Redis and Crawl4AI server 4. Submit crawl job with webhook config 5. Verify webhook delivery and payload 6. Clean up all processes and return to original branch Documentation: - WEBHOOK_TEST_README.md with usage instructions - Troubleshooting guide - Exit codes and safety features Usage: ./tests/test_webhook_feature.sh Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-22 00:35:07 +00:00
ntohidi	a3f057e19f	feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377 Add hooks_to_string() utility function that converts Python function objects to string representations for the Docker API, enabling developers to write hooks as regular Python functions instead of strings. Core Changes: - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource() - Docker client now accepts both function objects and strings for hooks - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request() - New hooks and hooks_timeout parameters in client.crawl() method Documentation: - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py) - Updated main Docker deployment guide with comprehensive hooks section - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)	2025-10-13 12:34:08 +08:00
Soham Kukreti	2dc6588573	fix: remove_overlay_elements functionality by calling injected JS function. ref: #1396 - Fix critical bug where overlay removal JS function was injected but never called - Change remove_overlay_elements() to properly execute the injected async function - Wrap JS execution in async to handle the async overlay removal logic - Add test_remove_overlay_elements() test case to verify functionality works - Ensure overlay elements (cookie banners, popups, modals) are actually removed The remove_overlay_elements feature now works as intended: - Before: Function definition injected but never executed (silent failure) - After: Function injected and called, successfully removing overlay elements	2025-09-29 20:40:08 +05:30
Soham Kukreti	34c0996ee4	fix: Add CDP endpoint verification with exponential backoff for managed browsers (#1445) browser_manager: - Add CDP endpoint verification with retry logic and exponential backoff - Call verification before connecting to CDP in `start()` method - Graceful handling of timing issues during browser startup test_cdp_strategy: - Fix cookie persistence test by adding storage state management - Fix session management test to work with managed browser architecture - Add comprehensive CDP timing tests covering: - Fast startup scenarios - Delayed browser startup simulation - Exponential backoff behavior validation - Concurrent browser connections - Stress testing with multiple successive startups - Retry count verification Impact: - Eliminates browser startup failures due to CDP timing issues - Provides robust fallback with automatic retries - Maintains fast startup when CDP is immediately available - Comprehensive test coverage ensures reliability Resolves CDP connection timing issues in managed browser mode.	2025-09-29 19:31:09 +05:30
ntohidi	fef715a891	Merge branch 'feature/docker-hooks' into develop	2025-09-25 14:11:46 +08:00
Nasrin	3899ac3d3b	Merge pull request #1464 from unclecode/fix/proxy_deprecation Fix/proxy deprecation	2025-09-16 15:48:45 +08:00
Nasrin	23431d8109	Merge pull request #1389 from unclecode/fix/deep-crawl-scoring fix(deep-crawl): BestFirst priority inversion	2025-09-16 15:45:54 +08:00
Nasrin	f8eaf01ed1	Merge pull request #1467 from unclecode/fix/request-crawl-stream Fix: request /crawl with stream: true issue	2025-09-11 17:40:43 +08:00
ntohidi	3bc56dd028	fix: allow custom LLM providers for adaptive crawler embedding config. ref: #1291 - Change embedding_llm_config from Dict to Union[LLMConfig, Dict] for type safety - Add backward-compatible conversion property _embedding_llm_config_dict - Replace all hardcoded OpenAI embedding configs with configurable options - Fix LLMConfig object attribute access in query expansion logic - Add comprehensive example demonstrating multiple provider configurations - Update documentation with both LLMConfig object and dictionary usage patterns Users can now specify any LLM provider for query expansion in embedding strategy: - New: embedding_llm_config=LLMConfig(provider='anthropic/claude-3', api_token='key') - Old: embedding_llm_config={'provider': 'openai/gpt-4', 'api_token': 'key'} (still works)	2025-09-09 12:49:55 +08:00
AHMET YILMAZ	6a3b3e9d38	Commit without API	2025-09-03 17:02:40 +08:00
Nasrin	af28e84a21	Merge pull request #1441 from unclecode/fix/improve-docker-error-handling Improve docker error handling	2025-09-02 11:56:01 +08:00
rbushria	edd0b576b1	Fix: Use correct URL variable for raw HTML extraction (#1116) - Prevents full HTML content from being passed as URL to extraction strategies - Added unit tests to verify raw HTML and regular URL processing Fix: Wrong URL variable used for extraction of raw HTML	2025-09-01 23:15:56 +03:00
Nasrin	5e7fcb17e1	Merge pull request #1448 from unclecode/fix/https-reditrect feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling	2025-09-01 16:11:25 +08:00
ntohidi	f566c5a376	feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling. Ref #1410 Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.	2025-08-28 17:38:40 +08:00
AHMET YILMAZ	f7a3366f72	#1375: refactor(proxy) Deprecate 'proxy' parameter in BrowserConfig and enhance proxy string parsing - Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials. - Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility. - Added warnings for deprecated usage and clarified behavior when both parameters are provided. - Updated documentation and tests to reflect changes in proxy configuration handling.	2025-08-28 17:21:49 +08:00
Nasrin	4e1c4bd24e	Merge pull request #1436 from unclecode/fix/docker-filter fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy	2025-08-27 11:08:42 +08:00
Soham Kukreti	2ad3fb5fc8	feat(docker): improve docker error handling - Return comprehensive error messages along with status codes for api internal errors. - Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints - Add sanitization to ensure fit_html is always JSON-serializable (string or None) - Add comprehensive error handling test suite.	2025-08-26 23:18:35 +05:30
ntohidi	159207b86f	feat(docker): Add temperature and base_url parameters for LLM configuration. ref #1035 Implement hierarchical configuration for LLM parameters with support for: - Temperature control (0.0-2.0) to adjust response creativity - Custom base_url for proxy servers and alternative endpoints - 4-tier priority: request params > provider env > global env > defaults Add helper functions in utils.py, update API schemas and handlers, support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.), and provide comprehensive documentation with examples.	2025-08-26 16:44:07 +08:00
ntohidi	102352eac4	fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy (ref #1419) - Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params - Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization - Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors - Ensure filter chains work correctly with Docker client and REST API The issue occurred because: 1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization 2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors Changes: - async_configs.py: Comment out __slots__ serialization logic (lines 100-109) - filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store as public attributes - models.py: Convert property descriptors to strings in model_dump() instead of including them directly	2025-08-25 14:04:08 +08:00
UncleCode	2ab0bf27c2	refactor(utils): move memory utilities to utils and update imports	2025-08-17 19:14:55 +08:00
ntohidi	bac92a47e4	refactor: Update LLMTableExtraction examples and tests	2025-08-15 18:47:31 +08:00
ntohidi	a51545c883	feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables BREAKING CHANGE: Table extraction now uses Strategy Design Pattern This epic commit introduces a game-changing approach to table extraction in Crawl4AI: ✨ NEW FEATURES: - LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan - Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries - Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction - Intelligent Merging: Seamlessly combines chunk results into complete tables - Header Preservation: Each chunk maintains context with original headers - Auto-retry Logic: Built-in resilience with configurable retry attempts 🏗️ ARCHITECTURE: - Strategy Design Pattern for pluggable table extraction strategies - ThreadPoolExecutor for concurrent chunk processing - Token-based chunking with configurable thresholds - Handles tables without headers gracefully ⚡ PERFORMANCE: - Process 1000+ row tables without timeout - Parallel processing with up to 5 concurrent chunks - Smart token estimation prevents LLM context overflow - Optimized for providers like Groq for massive tables 🔧 CONFIGURATION: - enable_chunking: Auto-handle large tables (default: True) - chunk_token_threshold: When to split (default: 3000 tokens) - min_rows_per_chunk: Meaningful chunk sizes (default: 10) - max_parallel_chunks: Concurrent processing (default: 5) 📚 BACKWARD COMPATIBILITY: - Existing code continues to work unchanged - DefaultTableExtraction remains the default strategy - Progressive enhancement approach This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.	2025-08-14 18:21:24 +08:00
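
Commit 911bbce8b1 in the list above describes stripping markdown code fences from LLM responses before calling json.loads(), checking for an empty response, and reporting JSON decode errors separately. The sketch below illustrates that idea only. It is not the repository's code: strip_markdown_fences() and parse_llm_json() are hypothetical names, and the real _strip_markdown_fences() helper in extraction_strategy.py may behave differently.

```python
import json
import re

# Three backticks, built programmatically so this example stays fence-safe.
FENCE = "`" * 3

# Matches an optional language tag after the opening fence and captures the body.
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_markdown_fences(text: str) -> str:
    """Return the body of a fenced block, or the stripped text if unfenced."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()


def parse_llm_json(raw: str):
    """Parse an LLM reply as JSON, tolerating a surrounding markdown fence."""
    cleaned = strip_markdown_fences(raw or "")
    if not cleaned:
        # Empty replies get their own message instead of a confusing decode error.
        raise ValueError("LLM returned an empty response")
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM response is not valid JSON: {exc}") from exc
```

Under these assumptions, a reply wrapped in a json fence and the same reply without a fence both parse to the same object, while a fence with no body raises the "empty response" error.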


169 Commits
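
Commit 02acad1dc6 notes that building the verification URL with a naive f"{cdp_url}/json/version" broke WebSocket URLs and URLs with query parameters, and commit 34c0996ee4 adds retries with exponential backoff. A minimal sketch of both ideas follows. The function names cdp_version_url() and backoff_delays(), and the delay values, are assumptions made for illustration; they are not taken from browser_manager.py.

```python
from typing import List, Optional
from urllib.parse import urlparse, urlunparse


def cdp_version_url(cdp_url: str) -> Optional[str]:
    """Return the /json/version URL to probe, or None for ws:// and wss:// URLs.

    WebSocket endpoints are skipped because, per the commit message,
    Playwright handles them directly without an HTTP check.
    """
    parts = urlparse(cdp_url)
    if parts.scheme in ("ws", "wss"):
        return None
    # Append to the path component only, so the query string is preserved.
    path = parts.path.rstrip("/") + "/json/version"
    return urlunparse(parts._replace(path=path))


def backoff_delays(retries: int, base: float = 0.5, factor: float = 2.0) -> List[float]:
    """Hypothetical delay schedule: base, base*factor, base*factor**2, ..."""
    return [base * factor**attempt for attempt in range(retries)]


print(cdp_version_url("http://localhost:9222?token=abc"))
# http://localhost:9222/json/version?token=abc
print(backoff_delays(4))
# [0.5, 1.0, 2.0, 4.0]
```

The naive string concatenation would have produced http://localhost:9222?token=abc/json/version, which places the path inside the query string; editing the parsed path avoids that.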