crawl4ai

Author	SHA1	Message	Date
ntohidi	e6044e6053	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop	2025-08-15 19:44:06 +08:00
ntohidi	a50e47adad	Merge branch 'feature/table-extraction-strategies' into develop	2025-08-15 19:41:37 +08:00
ntohidi	ada7441bd1	refactor: Update LLMTableExtraction examples and tests	2025-08-15 19:11:26 +08:00
ntohidi	9f7fee91a9	feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables BREAKING CHANGE: Table extraction now uses Strategy Design Pattern This epic commit introduces a game-changing approach to table extraction in Crawl4AI: ✨ NEW FEATURES: - LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan - Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries - Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction - Intelligent Merging: Seamlessly combines chunk results into complete tables - Header Preservation: Each chunk maintains context with original headers - Auto-retry Logic: Built-in resilience with configurable retry attempts 🏗️ ARCHITECTURE: - Strategy Design Pattern for pluggable table extraction strategies - ThreadPoolExecutor for concurrent chunk processing - Token-based chunking with configurable thresholds - Handles tables without headers gracefully ⚡ PERFORMANCE: - Process 1000+ row tables without timeout - Parallel processing with up to 5 concurrent chunks - Smart token estimation prevents LLM context overflow - Optimized for providers like Groq for massive tables 🔧 CONFIGURATION: - enable_chunking: Auto-handle large tables (default: True) - chunk_token_threshold: When to split (default: 3000 tokens) - min_rows_per_chunk: Meaningful chunk sizes (default: 10) - max_parallel_chunks: Concurrent processing (default: 5) 📚 BACKWARD COMPATIBILITY: - Existing code continues to work unchanged - DefaultTableExtraction remains the default strategy - Progressive enhancement approach This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.	2025-08-15 19:11:26 +08:00
AHMET YILMAZ	7f48655cf1	feat(browser-profiler): implement cross-platform keyboard listeners and improve quit handling	2025-08-15 19:11:26 +08:00
prokopis3	1417a67e90	chore(profile-test): fix filename typo (test_crteate_profile.py → test_create_profile.py) - Rename file to correct spelling - No content changes	2025-08-15 19:11:26 +08:00
prokopis3	19398d33ef	fix(browser_profiler): improve keyboard input handling - fix handling of special keys in Windows msvcrt implementation - Guard against UnicodeDecodeError from multi-byte key sequences - Filter out non-printable characters and control sequences - Add error handling to prevent coroutine crashes - Add unit test to verify keyboard input handling Key changes: - Safe UTF-8 decoding with try/except for special keys - Skip non-printable and multi-byte character sequences - Add broad exception handling in keyboard listener Test runs on Windows only due to msvcrt dependency.	2025-08-15 19:11:26 +08:00
prokopis3	263d362daa	fix(browser_profiler): cross-platform 'q' to quit This commit introduces platform-specific handling for the 'q' key press to quit the browser profiler, ensuring compatibility with both Windows and Unix-like systems. It also adds a check to see if the browser process has already exited, terminating the input listener if so. - Implemented `msvcrt` for Windows to capture keyboard input without requiring a newline. - Retained `termios`, `tty`, and `select` for Unix-like systems. - Added a check for browser process termination to gracefully exit the input listener. - Updated logger messages to use colored output for better user experience.	2025-08-15 19:11:26 +08:00
ntohidi	bac92a47e4	refactor: Update LLMTableExtraction examples and tests	2025-08-15 18:47:31 +08:00
ntohidi	a51545c883	feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables BREAKING CHANGE: Table extraction now uses Strategy Design Pattern This epic commit introduces a game-changing approach to table extraction in Crawl4AI: ✨ NEW FEATURES: - LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan - Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries - Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction - Intelligent Merging: Seamlessly combines chunk results into complete tables - Header Preservation: Each chunk maintains context with original headers - Auto-retry Logic: Built-in resilience with configurable retry attempts 🏗️ ARCHITECTURE: - Strategy Design Pattern for pluggable table extraction strategies - ThreadPoolExecutor for concurrent chunk processing - Token-based chunking with configurable thresholds - Handles tables without headers gracefully ⚡ PERFORMANCE: - Process 1000+ row tables without timeout - Parallel processing with up to 5 concurrent chunks - Smart token estimation prevents LLM context overflow - Optimized for providers like Groq for massive tables 🔧 CONFIGURATION: - enable_chunking: Auto-handle large tables (default: True) - chunk_token_threshold: When to split (default: 3000 tokens) - min_rows_per_chunk: Meaningful chunk sizes (default: 10) - max_parallel_chunks: Concurrent processing (default: 5) 📚 BACKWARD COMPATIBILITY: - Existing code continues to work unchanged - DefaultTableExtraction remains the default strategy - Progressive enhancement approach This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.	2025-08-14 18:21:24 +08:00
Nasrin	11b310edef	Merge pull request #1378 from unclecode/fix/exit_with_q Cross Platform fix for browser profiler	2025-08-13 14:16:47 +08:00
Nasrin	926e41aab8	Merge pull request #1378 from unclecode/fix/exit_with_q Cross Platform fix for browser profiler	2025-08-13 14:16:47 +08:00
Nasrin	489981e670	Merge pull request #1390 from unclecode/fix/docker-raw-html Check for raw: and raw:// URLs before auto-appending https:// prefix	2025-08-13 13:56:33 +08:00
Nasrin	b92be4ef66	Merge pull request #1371 from unclecode/bug/proxy_config #1057: enhance ProxyConfig initialization to support dict and string…	2025-08-12 16:55:52 +08:00
Nasrin	7c0edaf266	Merge pull request #1384 from unclecode/fix/update_docker_examples docs: remove CRAWL4AI_API_TOKEN references and use correct endpoints in Docker example scripts (#1015)	2025-08-12 16:53:42 +08:00
ntohidi	dfcfd8ae57	fix(dispatcher): enable true concurrency for fast-completing tasks in arun_many. REF: #560 The MemoryAdaptiveDispatcher was processing tasks sequentially despite max_session_permit > 1 due to fetching only one task per event loop iteration. This particularly affected raw:// URLs which complete in microseconds. Changes: - Replace single task fetch with greedy slot filling using get_nowait() - Fill all available slots (up to max_session_permit) immediately - Break on empty queue instead of waiting with timeout This ensures proper parallelization for all task types, especially ultra-fast operations like raw HTML processing.	2025-08-12 16:51:22 +08:00
ntohidi	955110a8b0	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop	2025-08-12 12:22:25 +08:00
Soham Kukreti	f30811b524	fix: Check for raw: and raw:// URLs before auto-appending https:// prefix - Add raw HTML URL validation alongside http/https checks - Fix URL preprocessing logic to handle raw: and raw:// prefixes - Update error message and add comprehensive test cases	2025-08-11 22:10:53 +05:30
ntohidi	8146d477e9	Merge branch 'main' into develop	2025-08-11 18:56:15 +08:00
ntohidi	96c4b0de67	fix(browser_manager): serialize new_page on persistent context to avoid races ref #1198 - Add _page_lock and guarded creation; handle empty context.pages safely - Prevents BrowserContext.new_page “Target page/context closed” during concurrent arun_many	2025-08-11 18:55:43 +08:00
Nasrin	57c14db7cb	Merge pull request #1381 from unclecode/fix/base-tag-link-resolution fix: Implement base tag support in link extraction (#1147)	2025-08-11 18:32:32 +08:00
Soham Kukreti	cd2dd68e4c	docs: remove CRAWL4AI_API_TOKEN references and use correct endpoints in Docker example scripts (#1015) - Remove deprecated API token authentication from all Docker examples - Fix async job endpoints: /crawl -> /crawl/job for submission, /task/{id} -> /crawl/job/{id} for polling - Fix sync endpoint: /crawl_sync -> /crawl (synchronous) - Remove non-existent /crawl_direct endpoint - Update request format to use new structure with browser_config and crawler_config - Fix response handling for both async and sync calls - Update extraction strategy format to use proper nested structure - Add Ollama connectivity check before running tests - Update test schemas and selectors for current website structures This makes the Docker examples work out-of-the-box with the current API structure.	2025-08-09 19:37:22 +05:30
UncleCode	f0ce7b2710	feat: add v0.7.3 release notes, changelog updates, and documentation for new features	2025-08-09 21:04:18 +08:00
UncleCode	21f79fe166	Release v0.7.3: Merge release branch - Merge release/v0.7.3 into main - Version: 0.7.3 - Ready for tag and publication v0.7.3	2025-08-09 20:11:35 +08:00
unclecode	a9a2d798b4	feat: update sponsorship tier details and add custom arrangements note	2025-08-09 20:10:32 +08:00
unclecode	612270fcb0	feat: add scheduling link to contact information in SPONSORS.md	2025-08-09 20:05:59 +08:00
unclecode	bc099fdd76	Merge branch 'main' into release/v0.7.3	2025-08-09 19:30:46 +08:00
unclecode	18504d782e	Add Founding Sponsors section and update README with detailed project information - Introduced a new section in SPONSORS.md to recognize the first 50 sponsors as Founding Sponsors. - Updated README-first.md to include comprehensive project details, features, installation instructions, and advanced usage examples. - Highlighted the recent version 0.7.0 release with new features and improvements. - Added a sponsorship program with tiered benefits and a mission statement to promote data democratization.	2025-08-09 19:11:32 +08:00
unclecode	ad547607b9	feat: add GitHub Sponsors support with 4 tiers - Add FUNDING.yml to enable sponsor button - Add sponsor section to README with tier overview - Create SPONSORS.md for sponsor recognition - Set up 4 tiers: Believer, Builder, Growing Team, Data Infrastructure Partner	2025-08-09 17:57:47 +08:00
Soham Kukreti	18ad3ef159	fix: Implement base tag support in link extraction (#1147) - Extract base href from <head><base> tag using XPath in _process_element method - Use base URL as the primary URL for link normalization when present - Add error handling with logging for malformed or problematic base tags - Maintain backward compatibility when no base tag is present - Add test to verify the functionality of the base tag extraction.	2025-08-08 20:11:57 +05:30
AHMET YILMAZ	0541b61405	feat(browser-profiler): implement cross-platform keyboard listeners and improve quit handling	2025-08-08 11:18:34 +08:00
AHMET YILMAZ	b61b2ee676	feat(browser-profiler): implement cross-platform keyboard listeners and improve quit handling	2025-08-08 11:18:34 +08:00
AHMET YILMAZ	89cf5aba2b	#1057: enhance ProxyConfig initialization to support dict and string formats	2025-08-06 18:34:58 +08:00
ntohidi	6b0b5301ba	Release v0.7.3: - Updated version to 0.7.3 - Added release notes - Updated documentation	2025-08-06 17:52:01 +08:00
Nasrin	6735c68288	Merge pull request #1170 from prokopis3/fix/create-profile fix(browser_profiler): cross-platform 'q' to quit - create profile	2025-08-06 16:29:14 +08:00
Nasrin	64f37792a7	Merge pull request #1170 from prokopis3/fix/create-profile fix(browser_profiler): cross-platform 'q' to quit - create profile	2025-08-06 16:29:14 +08:00
ntohidi	a5bcac4c9d	feat(docs): enhance table data access example with a real url	2025-08-06 15:19:37 +08:00
Nasrin	45d8327d23	Merge pull request #1366 from unclecode/fix/update-tables-documentation docs: Update README.md and modify Media and Tables Documentation.(#1271)	2025-08-06 15:15:24 +08:00
ntohidi	437395e490	Merge branch 'feat/undetected-browser' into develop-future	2025-08-06 15:03:30 +08:00
Soham Kukreti	fddae303fb	docs: Update README.md and modify Media and Tables Documentation.(#1271) - Update Table-to-DataFrame Extraction example in README.md - Replace old method of accessing tables via result.media directly with result.tables in the documentation - Remove tables section from links & media page. - Add tables section to crawler result page.	2025-08-05 23:29:19 +05:30
ntohidi	ff6ea41ac3	feat(docker): add flexible LLM provider configuration - Support LLM_PROVIDER env var to override default provider (openai/gpt-4o-mini) - Add optional 'provider' parameter to API endpoints for per-request overrides - Implement provider validation to ensure API keys exist - Update documentation and examples with new configuration options Closes the need to hardcode providers in config.yml	2025-08-05 14:09:54 +08:00
ntohidi	31a435fb0e	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop	2025-08-04 19:12:19 +08:00
Nasrin	5de6a28055	Merge pull request #1361 from unclecode/fix/crawler-result-docs Update CrawlResult documentation with missing fields	2025-08-04 19:12:09 +08:00
ntohidi	de1561ad14	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop	2025-08-04 19:04:50 +08:00
Nasrin	337b588732	Merge pull request #1358 from shonenada/patch-1 Fix typos in examples.md	2025-08-04 19:04:42 +08:00
ntohidi	7a6ad547f0	Squashed commit of the following: commit 2def6524cdacb69c72760bf55a41089257c0bb07 Author: ntohidi <nasrin@kidocode.com> Date: Mon Aug 4 18:59:10 2025 +0800 refactor: consolidate WebScrapingStrategy to use LXML implementation only BREAKING CHANGE: None - full backward compatibility maintained This commit simplifies the content scraping architecture by removing the redundant BeautifulSoup-based WebScrapingStrategy implementation and making it an alias for LXMLWebScrapingStrategy. Changes: - Remove ~1000 lines of BeautifulSoup-based WebScrapingStrategy code - Make WebScrapingStrategy an alias for LXMLWebScrapingStrategy - Update LXMLWebScrapingStrategy to inherit directly from ContentScrapingStrategy - Add required methods (scrap, ascrap, process_element, _log) to LXMLWebScrapingStrategy - Maintain 100% backward compatibility - existing code continues to work Code changes: - crawl4ai/content_scraping_strategy.py: Remove WebScrapingStrategy class, add alias - crawl4ai/async_configs.py: Remove WebScrapingStrategy from imports - crawl4ai/__init__.py: Update imports to show alias relationship - crawl4ai/types.py: Update type definitions - crawl4ai/legacy/web_crawler.py: Update import to use alias - tests/async/test_content_scraper_strategy.py: Update to use LXMLWebScrapingStrategy - docs/examples/scraping_strategies_performance.py: Update to use single strategy Documentation updates: - docs/md_v2/core/content-selection.md: Update scraping modes section - docs/md_v2/migration/webscraping-strategy-migration.md: Add migration guide - CHANGELOG.md: Document the refactoring under [Unreleased] Benefits: - 10-20x faster HTML parsing for large documents - Reduced memory usage and simplified codebase - Consistent parsing behavior - No migration required for existing users All existing code using WebScrapingStrategy continues to work without modification, while benefiting from LXML's superior performance.	2025-08-04 19:02:01 +08:00
Soham Kukreti	e6692b987d	docs: Update CrawlResult documentation with missing fields. - Add missing fields: fit_html, js_execution_result, redirected_url, network_requests, console_messages, tables	2025-08-04 15:43:40 +05:30
ntohidi	307fe28b32	fix: Correct URL matcher fallback behavior and improve memory monitoring Fix critical issue where unmatched URLs incorrectly used the first config instead of failing safely. Also clarify that configs without url_matcher match ALL URLs by design, and improve memory usage monitoring. Bug fixes: - Change select_config() to return None when no config matches instead of using first config - Add proper error handling in dispatchers when no config matches a URL - Return failed CrawlResult with "No matching configuration found" error message - Fix is_match() to return True when url_matcher is None (matches all URLs) - Import and use get_true_memory_usage_percent() for more accurate memory monitoring Behavior clarification: - CrawlerRunConfig with url_matcher=None matches ALL URLs (not nothing) - This is the intended behavior for default/fallback configurations - Enables clean pattern: specific configs first, default config last Documentation updates: - Clarify that configs without url_matcher match everything - Explain "No matching configuration found" error when no default config - Add examples showing proper default config usage - Update all relevant docs: multi-url-crawling.md, arun_many.md, parameters.md - Simplify API config examples by removing extraction_strategy Demo and test updates: - Update demo_multi_config_clean.py with commented default config to show behavior - Change example URL to w3schools.com to demonstrate no-match scenario - Uncomment all test URLs in test_multi_config.py for comprehensive testing Breaking changes: None - this restores the intended behavior This ensures URLs only get processed with appropriate configs, preventing issues like HTML pages being processed with PDF extraction strategies.	2025-08-03 16:50:54 +08:00
Yaoda Liu	438a103b17	Fix typos in examples.md	2025-08-03 14:33:10 +08:00
ntohidi	a03e68fa2f	feat: Add URL-specific crawler configurations for multi-URL crawling Implement dynamic configuration selection based on URL patterns to optimize crawling for different content types. This feature enables users to apply different crawling strategies (PDF extraction, content filtering, JavaScript execution) based on URL matching patterns. Key additions: - Add url_matcher and match_mode parameters to CrawlerRunConfig - Implement is_match() method supporting string patterns, functions, and mixed lists - Add MatchMode enum for OR/AND logic when combining multiple matchers - Update AsyncWebCrawler.arun_many() to accept List[CrawlerRunConfig] - Add select_config() method to dispatchers for runtime config selection - First matching config wins, with fallback to default Pattern matching supports: - Glob-style strings: .pdf, /blog/, api* - Lambda functions: lambda url: 'github.com' in url - Mixed patterns with AND/OR logic for complex matching This enables optimal per-URL configuration: - PDFs: Use PDFContentScrapingStrategy without JavaScript - Blogs: Apply content filtering to reduce noise - APIs: Skip JavaScript, use JSON extraction - Dynamic sites: Execute only necessary JavaScript Breaking changes: None - fully backward compatible	2025-08-02 19:10:36 +08:00
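
The URL-matching behavior described in commits a03e68fa2f and 307fe28b32 (first matching config wins, a config with no url_matcher matches every URL, and no match yields None) can be illustrated with a minimal standalone sketch. It does not use crawl4ai itself: the RunConfig class, its name field, the example patterns, and this select_config function are hypothetical stand-ins, and the real CrawlerRunConfig and dispatcher code may differ.

```python
from dataclasses import dataclass
from enum import Enum
from fnmatch import fnmatchcase
from typing import Callable, List, Optional, Union

# A matcher is either a glob-style string or a predicate on the URL.
Matcher = Union[str, Callable[[str], bool]]


class MatchMode(Enum):
    OR = "or"
    AND = "and"


@dataclass
class RunConfig:
    """Hypothetical stand-in for a per-URL crawl configuration."""
    name: str
    url_matcher: Optional[Union[Matcher, List[Matcher]]] = None
    match_mode: MatchMode = MatchMode.OR

    def is_match(self, url: str) -> bool:
        # No matcher means "match everything" (default/fallback config).
        if self.url_matcher is None:
            return True
        matchers = (
            self.url_matcher
            if isinstance(self.url_matcher, list)
            else [self.url_matcher]
        )
        results = [
            m(url) if callable(m) else fnmatchcase(url, m) for m in matchers
        ]
        if self.match_mode is MatchMode.AND:
            return all(results)
        return any(results)


def select_config(url: str, configs: List[RunConfig]) -> Optional[RunConfig]:
    """Return the first matching config, or None when nothing matches."""
    for config in configs:
        if config.is_match(url):
            return config
    return None


# Specific configs first, default config last.
configs = [
    RunConfig("pdf", "*.pdf"),
    RunConfig(
        "blog",
        ["*/blog/*", lambda url: "example.com" in url],
        MatchMode.AND,
    ),
    RunConfig("default"),
]
```

With the default config in place, every URL resolves to some config. Dropping it (configs[:2]) makes an unmatched URL return None, which corresponds to the "No matching configuration found" case the commit message describes.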

1030 Commits