crawl4ai

Author	SHA1	Message	Date
ntohidi	07ccf13be6	Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268	2025-12-02 13:00:54 +01:00
Rachel Bushrian	7771ed3894	Merge branch 'develop' into fix/wrong_url_raw	2025-11-24 13:54:07 +02:00
Nasrin	b207ae2848	Merge pull request #1528 from unclecode/fix/managed-browser-cdp-timing Add CDP endpoint verification with exponential backoff for managed browsers	2025-11-12 23:53:57 +08:00
AHMET YILMAZ	80745bceb9	#1559 :Add tests for sitemap parsing and URL normalization in AsyncUrlSeeder	2025-11-10 14:15:54 +08:00
ntohidi	a30548a98f	This commit resolves issue #1055 where LLM extraction was blocking async execution, causing URLs to be processed sequentially instead of in parallel. Changes: - Added aperform_completion_with_backoff() using litellm.acompletion for async LLM calls - Implemented arun() method in ExtractionStrategy base class with thread pool fallback - Created async arun() and aextract() methods in LLMExtractionStrategy using asyncio.gather - Updated AsyncWebCrawler.arun() to detect and use arun() when available - Added comprehensive test suite to verify parallel execution Impact: - LLM extraction now runs truly in parallel across multiple URLs - Significant performance improvement for multi-URL crawls with LLM strategies - Backward compatible - existing extraction strategies continue to work - No breaking changes to public API Technical details: - Uses litellm.acompletion for non-blocking LLM calls - Leverages asyncio.gather for concurrent chunk processing - Maintains backward compatibility via asyncio.to_thread fallback - Works seamlessly with MemoryAdaptiveDispatcher and other dispatchers	2025-11-06 11:22:45 +01:00
Nasrin	2c918155aa	Merge pull request #1529 from unclecode/fix/remove_overlay_elements Fix remove_overlay_elements functionality by calling injected JS function.	2025-11-06 00:10:32 +08:00
Claude	613097d121	test: add verification tests for pyOpenSSL security update - Add lightweight security test to verify version requirements - Add comprehensive integration test for crawl4ai functionality - Tests verify pyOpenSSL >= 25.3.0 and cryptography >= 45.0.7 - All tests passing: security vulnerability is resolved Related to #1545 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-23 06:57:25 +00:00
ntohidi	b71d624168	Merge branch 'implement-webhook-crawl-feature-011CULZY1Jy8N5MUkZqXkRVp' into develop	2025-10-22 13:12:25 +02:00
Claude	52da8d72bc	test: add comprehensive webhook feature test script Added end-to-end test script that automates webhook feature testing: Script Features (test_webhook_feature.sh): - Automatic branch switching and dependency installation - Redis and server startup/shutdown management - Webhook receiver implementation - Integration test for webhook notifications - Comprehensive cleanup and error handling - Returns to original branch after completion Test Flow: 1. Fetch and checkout webhook feature branch 2. Activate venv and install dependencies 3. Start Redis and Crawl4AI server 4. Submit crawl job with webhook config 5. Verify webhook delivery and payload 6. Clean up all processes and return to original branch Documentation: - WEBHOOK_TEST_README.md with usage instructions - Troubleshooting guide - Exit codes and safety features Usage: ./tests/test_webhook_feature.sh Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-22 00:35:07 +00:00
ntohidi	a3f057e19f	feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377 Add hooks_to_string() utility function that converts Python function objects to string representations for the Docker API, enabling developers to write hooks as regular Python functions instead of strings. Core Changes: - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource() - Docker client now accepts both function objects and strings for hooks - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request() - New hooks and hooks_timeout parameters in client.crawl() method Documentation: - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py) - Updated main Docker deployment guide with comprehensive hooks section - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)	2025-10-13 12:34:08 +08:00
Soham Kukreti	2dc6588573	fix: remove_overlay_elements functionality by calling injected JS function. ref: #1396 - Fix critical bug where overlay removal JS function was injected but never called - Change remove_overlay_elements() to properly execute the injected async function - Wrap JS execution in async to handle the async overlay removal logic - Add test_remove_overlay_elements() test case to verify functionality works - Ensure overlay elements (cookie banners, popups, modals) are actually removed The remove_overlay_elements feature now works as intended: - Before: Function definition injected but never executed (silent failure) - After: Function injected and called, successfully removing overlay elements	2025-09-29 20:40:08 +05:30
Soham Kukreti	34c0996ee4	fix: Add CDP endpoint verification with exponential backoff for managed browsers (#1445 ) browser_manager: - Add CDP endpoint verification with retry logic and exponential backoff - Call verification before connecting to CDP in `start()` method - Graceful handling of timing issues during browser startup test_cdp_strategy: - Fix cookie persistence test by adding storage state management - Fix session management test to work with managed browser architecture - Add comprehensive CDP timing tests covering: - Fast startup scenarios - Delayed browser startup simulation - Exponential backoff behavior validation - Concurrent browser connections - Stress testing with multiple successive startups - Retry count verification Impact: - Eliminates browser startup failures due to CDP timing issues - Provides robust fallback with automatic retries - Maintains fast startup when CDP is immediately available - Comprehensive test coverage ensures reliability Resolves CDP connection timing issues in managed browser mode.	2025-09-29 19:31:09 +05:30
ntohidi	fef715a891	Merge branch 'feature/docker-hooks' into develop	2025-09-25 14:11:46 +08:00
Nasrin	3899ac3d3b	Merge pull request #1464 from unclecode/fix/proxy_deprecation Fix/proxy deprecation	2025-09-16 15:48:45 +08:00
Nasrin	23431d8109	Merge pull request #1389 from unclecode/fix/deep-crawl-scoring fix(deep-crawl): BestFirst priority inversion	2025-09-16 15:45:54 +08:00
Nasrin	f8eaf01ed1	Merge pull request #1467 from unclecode/fix/request-crawl-stream Fix: request /crawl with stream: true issue	2025-09-11 17:40:43 +08:00
ntohidi	3bc56dd028	fix: allow custom LLM providers for adaptive crawler embedding config. ref: #1291 - Change embedding_llm_config from Dict to Union[LLMConfig, Dict] for type safety - Add backward-compatible conversion property _embedding_llm_config_dict - Replace all hardcoded OpenAI embedding configs with configurable options - Fix LLMConfig object attribute access in query expansion logic - Add comprehensive example demonstrating multiple provider configurations - Update documentation with both LLMConfig object and dictionary usage patterns Users can now specify any LLM provider for query expansion in embedding strategy: - New: embedding_llm_config=LLMConfig(provider='anthropic/claude-3', api_token='key') - Old: embedding_llm_config={'provider': 'openai/gpt-4', 'api_token': 'key'} (still works)	2025-09-09 12:49:55 +08:00
AHMET YILMAZ	6a3b3e9d38	Commit without API	2025-09-03 17:02:40 +08:00
Nasrin	af28e84a21	Merge pull request #1441 from unclecode/fix/improve-docker-error-handling Improve docker error handling	2025-09-02 11:56:01 +08:00
rbushria	edd0b576b1	Fix: Use correct URL variable for raw HTML extraction (#1116 ) - Prevents full HTML content from being passed as URL to extraction strategies - Added unit tests to verify raw HTML and regular URL processing Fix: Wrong URL variable used for extraction of raw html	2025-09-01 23:15:56 +03:00
Nasrin	5e7fcb17e1	Merge pull request #1448 from unclecode/fix/https-reditrect feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling	2025-09-01 16:11:25 +08:00
ntohidi	f566c5a376	feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling. Ref #1410 Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.	2025-08-28 17:38:40 +08:00
AHMET YILMAZ	f7a3366f72	#1375 : refactor(proxy) Deprecate 'proxy' parameter in BrowserConfig and enhance proxy string parsing - Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials. - Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility. - Added warnings for deprecated usage and clarified behavior when both parameters are provided. - Updated documentation and tests to reflect changes in proxy configuration handling.	2025-08-28 17:21:49 +08:00
Nasrin	4e1c4bd24e	Merge pull request #1436 from unclecode/fix/docker-filter fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy	2025-08-27 11:08:42 +08:00
Soham Kukreti	2ad3fb5fc8	feat(docker): improve docker error handling - Return comprehensive error messages along with status codes for api internal errors. - Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints - Add sanitization to ensure fit_html is always JSON-serializable (string or None) - Add comprehensive error handling test suite.	2025-08-26 23:18:35 +05:30
ntohidi	159207b86f	feat(docker): Add temperature and base_url parameters for LLM configuration. ref #1035 Implement hierarchical configuration for LLM parameters with support for: - Temperature control (0.0-2.0) to adjust response creativity - Custom base_url for proxy servers and alternative endpoints - 4-tier priority: request params > provider env > global env > defaults Add helper functions in utils.py, update API schemas and handlers, support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.), and provide comprehensive documentation with examples.	2025-08-26 16:44:07 +08:00
ntohidi	102352eac4	fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy (ref #1419 ) - Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params - Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization - Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors - Ensure filter chains work correctly with Docker client and REST API The issue occurred because: 1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization 2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors Changes: - async_configs.py: Comment out __slots__ serialization logic (lines 100-109) - filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store as public attributes - models.py: Convert property descriptors to strings in model_dump() instead of including them directly	2025-08-25 14:04:08 +08:00
UncleCode	2ab0bf27c2	refactor(utils): move memory utilities to utils and update imports	2025-08-17 19:14:55 +08:00
ntohidi	bac92a47e4	refactor: Update LLMTableExtraction examples and tests	2025-08-15 18:47:31 +08:00
ntohidi	a51545c883	feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables BREAKING CHANGE: Table extraction now uses Strategy Design Pattern This epic commit introduces a game-changing approach to table extraction in Crawl4AI: ✨ NEW FEATURES: - LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan - Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries - Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction - Intelligent Merging: Seamlessly combines chunk results into complete tables - Header Preservation: Each chunk maintains context with original headers - Auto-retry Logic: Built-in resilience with configurable retry attempts 🏗️ ARCHITECTURE: - Strategy Design Pattern for pluggable table extraction strategies - ThreadPoolExecutor for concurrent chunk processing - Token-based chunking with configurable thresholds - Handles tables without headers gracefully ⚡ PERFORMANCE: - Process 1000+ row tables without timeout - Parallel processing with up to 5 concurrent chunks - Smart token estimation prevents LLM context overflow - Optimized for providers like Groq for massive tables 🔧 CONFIGURATION: - enable_chunking: Auto-handle large tables (default: True) - chunk_token_threshold: When to split (default: 3000 tokens) - min_rows_per_chunk: Meaningful chunk sizes (default: 10) - max_parallel_chunks: Concurrent processing (default: 5) 📚 BACKWARD COMPATIBILITY: - Existing code continues to work unchanged - DefaultTableExtraction remains the default strategy - Progressive enhancement approach This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.	2025-08-14 18:21:24 +08:00
Nasrin	926e41aab8	Merge pull request #1378 from unclecode/fix/exit_with_q Cross Platform fix for browser profiler	2025-08-13 14:16:47 +08:00
Nasrin	489981e670	Merge pull request #1390 from unclecode/fix/docker-raw-html Check for raw: and raw:// URLs before auto-appending https:// prefix	2025-08-13 13:56:33 +08:00
Nasrin	b92be4ef66	Merge pull request #1371 from unclecode/bug/proxy_config #1057 : enhance ProxyConfig initialization to support dict and string…	2025-08-12 16:55:52 +08:00
Nasrin	7c0edaf266	Merge pull request #1384 from unclecode/fix/update_docker_examples docs: remove CRAWL4AI_API_TOKEN references and use correct endpoints in Docker example scripts (#1015)	2025-08-12 16:53:42 +08:00
ntohidi	955110a8b0	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop	2025-08-12 12:22:25 +08:00
Soham Kukreti	f30811b524	fix: Check for raw: and raw:// URLs before auto-appending https:// prefix - Add raw HTML URL validation alongside http/https checks - Fix URL preprocessing logic to handle raw: and raw:// prefixes - Update error message and add comprehensive test cases	2025-08-11 22:10:53 +05:30
ntohidi	96c4b0de67	fix(browser_manager): serialize new_page on persistent context to avoid races ref #1198 - Add _page_lock and guarded creation; handle empty context.pages safely - Prevents BrowserContext.new_page “Target page/context closed” during concurrent arun_many	2025-08-11 18:55:43 +08:00
ntohidi	88a9fbbb7e	fix(deep-crawl): BestFirst priority inversion; remove pre-scoring truncation. ref #1253 Use negative scores in PQ to visit high-score URLs first and drop link cap prior to scoring; add test for ordering.	2025-08-11 18:16:57 +08:00
ntohidi	be63c98db3	feat(docker): add user-provided hooks support to Docker API Implements comprehensive hooks functionality allowing users to provide custom Python functions as strings that execute at specific points in the crawling pipeline. Key Features: - Support for all 8 crawl4ai hook points: • on_browser_created: Initialize browser settings • on_page_context_created: Configure page context • before_goto: Pre-navigation setup • after_goto: Post-navigation processing • on_user_agent_updated: User agent modification handling • on_execution_started: Crawl execution initialization • before_retrieve_html: Pre-extraction processing • before_return_html: Final HTML processing Implementation Details: - Created UserHookManager for validation, compilation, and safe execution - Added IsolatedHookWrapper for error isolation and timeout protection - AST-based validation ensures code structure correctness - Sandboxed execution with restricted builtins for security - Configurable timeout (1-120 seconds) prevents infinite loops - Comprehensive error handling ensures hooks don't crash main process - Execution tracking with detailed statistics and logging API Changes: - Added HookConfig schema with code and timeout fields - Extended CrawlRequest with optional hooks parameter - Added /hooks/info endpoint for hook discovery - Updated /crawl and /crawl/stream endpoints to support hooks Safety Features: - Malformed hooks return clear validation errors - Hook errors are isolated and reported without stopping crawl - Execution statistics track success/failure/timeout rates - All hook results are JSON-serializable Testing: - Comprehensive test suite covering all 8 hooks - Error handling and timeout scenarios validated - Authentication, performance, and content extraction examples - 100% success rate in production testing Documentation: - Added extensive hooks section to docker-deployment.md - Security warnings about user-provided code risks - Real-world examples using httpbin.org, GitHub, BBC - Best practices and troubleshooting guide ref #1377	2025-08-11 13:25:17 +08:00
Soham Kukreti	cd2dd68e4c	docs: remove CRAWL4AI_API_TOKEN references and use correct endpoints in Docker example scripts (#1015 ) - Remove deprecated API token authentication from all Docker examples - Fix async job endpoints: /crawl -> /crawl/job for submission, /task/{id} -> /crawl/job/{id} for polling - Fix sync endpoint: /crawl_sync -> /crawl (synchronous) - Remove non-existent /crawl_direct endpoint - Update request format to use new structure with browser_config and crawler_config - Fix response handling for both async and sync calls - Update extraction strategy format to use proper nested structure - Add Ollama connectivity check before running tests - Update test schemas and selectors for current website structures This makes the Docker examples work out-of-the-box with the current API structure.	2025-08-09 19:37:22 +05:30
Soham Kukreti	18ad3ef159	fix: Implement base tag support in link extraction (#1147 ) - Extract base href from <head><base> tag using XPath in _process_element method - Use base URL as the primary URL for link normalization when present - Add error handling with logging for malformed or problematic base tags - Maintain backward compatibility when no base tag is present - Add test to verify the functionality of the base tag extraction.	2025-08-08 20:11:57 +05:30
AHMET YILMAZ	89cf5aba2b	#1057 : enhance ProxyConfig initialization to support dict and string formats	2025-08-06 18:34:58 +08:00
Nasrin	64f37792a7	Merge pull request #1170 from prokopis3/fix/create-profile fix(browser_profiler): cross-platform 'q' to quit - create profile	2025-08-06 16:29:14 +08:00
ntohidi	437395e490	Merge branch 'feat/undetected-browser' into develop-future	2025-08-06 15:03:30 +08:00
ntohidi	ff6ea41ac3	feat(docker): add flexible LLM provider configuration - Support LLM_PROVIDER env var to override default provider (openai/gpt-4o-mini) - Add optional 'provider' parameter to API endpoints for per-request overrides - Implement provider validation to ensure API keys exist - Update documentation and examples with new configuration options Closes the need to hardcode providers in config.yml	2025-08-05 14:09:54 +08:00
ntohidi	7a6ad547f0	Squashed commit of the following: commit 2def6524cdacb69c72760bf55a41089257c0bb07 Author: ntohidi <nasrin@kidocode.com> Date: Mon Aug 4 18:59:10 2025 +0800 refactor: consolidate WebScrapingStrategy to use LXML implementation only BREAKING CHANGE: None - full backward compatibility maintained This commit simplifies the content scraping architecture by removing the redundant BeautifulSoup-based WebScrapingStrategy implementation and making it an alias for LXMLWebScrapingStrategy. Changes: - Remove ~1000 lines of BeautifulSoup-based WebScrapingStrategy code - Make WebScrapingStrategy an alias for LXMLWebScrapingStrategy - Update LXMLWebScrapingStrategy to inherit directly from ContentScrapingStrategy - Add required methods (scrap, ascrap, process_element, _log) to LXMLWebScrapingStrategy - Maintain 100% backward compatibility - existing code continues to work Code changes: - crawl4ai/content_scraping_strategy.py: Remove WebScrapingStrategy class, add alias - crawl4ai/async_configs.py: Remove WebScrapingStrategy from imports - crawl4ai/__init__.py: Update imports to show alias relationship - crawl4ai/types.py: Update type definitions - crawl4ai/legacy/web_crawler.py: Update import to use alias - tests/async/test_content_scraper_strategy.py: Update to use LXMLWebScrapingStrategy - docs/examples/scraping_strategies_performance.py: Update to use single strategy Documentation updates: - docs/md_v2/core/content-selection.md: Update scraping modes section - docs/md_v2/migration/webscraping-strategy-migration.md: Add migration guide - CHANGELOG.md: Document the refactoring under [Unreleased] Benefits: - 10-20x faster HTML parsing for large documents - Reduced memory usage and simplified codebase - Consistent parsing behavior - No migration required for existing users All existing code using WebScrapingStrategy continues to work without modification, while benefiting from LXML's superior performance.	2025-08-04 19:02:01 +08:00
ntohidi	307fe28b32	fix: Correct URL matcher fallback behavior and improve memory monitoring Fix critical issue where unmatched URLs incorrectly used the first config instead of failing safely. Also clarify that configs without url_matcher match ALL URLs by design, and improve memory usage monitoring. Bug fixes: - Change select_config() to return None when no config matches instead of using first config - Add proper error handling in dispatchers when no config matches a URL - Return failed CrawlResult with "No matching configuration found" error message - Fix is_match() to return True when url_matcher is None (matches all URLs) - Import and use get_true_memory_usage_percent() for more accurate memory monitoring Behavior clarification: - CrawlerRunConfig with url_matcher=None matches ALL URLs (not nothing) - This is the intended behavior for default/fallback configurations - Enables clean pattern: specific configs first, default config last Documentation updates: - Clarify that configs without url_matcher match everything - Explain "No matching configuration found" error when no default config - Add examples showing proper default config usage - Update all relevant docs: multi-url-crawling.md, arun_many.md, parameters.md - Simplify API config examples by removing extraction_strategy Demo and test updates: - Update demo_multi_config_clean.py with commented default config to show behavior - Change example URL to w3schools.com to demonstrate no-match scenario - Uncomment all test URLs in test_multi_config.py for comprehensive testing Breaking changes: None - this restores the intended behavior This ensures URLs only get processed with appropriate configs, preventing issues like HTML pages being processed with PDF extraction strategies.	2025-08-03 16:50:54 +08:00
ntohidi	a03e68fa2f	feat: Add URL-specific crawler configurations for multi-URL crawling Implement dynamic configuration selection based on URL patterns to optimize crawling for different content types. This feature enables users to apply different crawling strategies (PDF extraction, content filtering, JavaScript execution) based on URL matching patterns. Key additions: - Add url_matcher and match_mode parameters to CrawlerRunConfig - Implement is_match() method supporting string patterns, functions, and mixed lists - Add MatchMode enum for OR/AND logic when combining multiple matchers - Update AsyncWebCrawler.arun_many() to accept List[CrawlerRunConfig] - Add select_config() method to dispatchers for runtime config selection - First matching config wins, with fallback to default Pattern matching supports: - Glob-style strings: .pdf, /blog/, api* - Lambda functions: lambda url: 'github.com' in url - Mixed patterns with AND/OR logic for complex matching This enables optimal per-URL configuration: - PDFs: Use PDFContentScrapingStrategy without JavaScript - Blogs: Apply content filtering to reduce noise - APIs: Skip JavaScript, use JSON extraction - Dynamic sites: Execute only necessary JavaScript Breaking changes: None - fully backward compatible	2025-08-02 19:10:36 +08:00
ntohidi	cf8badfe27	feat: cleanup unused code and enhance documentation for v0.7.1 - Remove unused StealthConfig from browser_manager.py - Update LinkPreviewConfig import path in __init__.py and examples - Fix infinity handling in content_scraping_strategy.py (use 0 instead of float('inf')) - Remove sanitize_json_data functions from API endpoints - Add comprehensive C4A Script documentation to release notes - Update v0.7.0 release notes with improved code examples - Create v0.7.1 release notes focusing on cleanup and documentation improvements - Update demo files with corrected import paths and examples - Fix virtual scroll and adaptive crawling examples across documentation 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-07-17 11:35:16 +02:00
unclecode	6a728cbe5b	feat: add stealth mode and enhance undetected browser support - Add playwright-stealth integration with enable_stealth parameter in BrowserConfig - Merge undetected browser strategy into main async_crawler_strategy.py using adapter pattern - Add browser adapters (BrowserAdapter, PlaywrightAdapter, UndetectedAdapter) for flexible browser switching - Update install.py to install both playwright and patchright browsers automatically - Add comprehensive documentation for anti-bot features (stealth mode + undetected browser) - Create examples demonstrating stealth mode usage and comparison tests - Update pyproject.toml and requirements.txt with patchright>=1.49.0 and other dependencies - Remove duplicate/unused dependencies (alphashape, cssselect, pyperclip, shapely, selenium) - Add dependency checker tool in tests/check_dependencies.py Breaking changes: None - all existing functionality preserved 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-07-17 16:59:10 +08:00

1 2 3

149 Commits