Commit Graph

1030 Commits

Author SHA1 Message Date
ntohidi
e6044e6053 Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-08-15 19:44:06 +08:00
ntohidi
a50e47adad Merge branch 'feature/table-extraction-strategies' into develop 2025-08-15 19:41:37 +08:00
ntohidi
ada7441bd1 refactor: Update LLMTableExtraction examples and tests 2025-08-15 19:11:26 +08:00
ntohidi
9f7fee91a9 feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables
BREAKING CHANGE: Table extraction now uses Strategy Design Pattern

This epic commit introduces a game-changing approach to table extraction in Crawl4AI:

 NEW FEATURES:
- LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan
- Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries
- Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction
- Intelligent Merging: Seamlessly combines chunk results into complete tables
- Header Preservation: Each chunk maintains context with original headers
- Auto-retry Logic: Built-in resilience with configurable retry attempts

🏗️ ARCHITECTURE:
- Strategy Design Pattern for pluggable table extraction strategies
- ThreadPoolExecutor for concurrent chunk processing
- Token-based chunking with configurable thresholds
- Handles tables without headers gracefully

 PERFORMANCE:
- Process 1000+ row tables without timeout
- Parallel processing with up to 5 concurrent chunks
- Smart token estimation prevents LLM context overflow
- Optimized for providers like Groq for massive tables

🔧 CONFIGURATION:
- enable_chunking: Auto-handle large tables (default: True)
- chunk_token_threshold: When to split (default: 3000 tokens)
- min_rows_per_chunk: Meaningful chunk sizes (default: 10)
- max_parallel_chunks: Concurrent processing (default: 5)

📚 BACKWARD COMPATIBILITY:
- Existing code continues to work unchanged
- DefaultTableExtraction remains the default strategy
- Progressive enhancement approach

This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.
2025-08-15 19:11:26 +08:00
AHMET YILMAZ
7f48655cf1 feat(browser-profiler): implement cross-platform keyboard listeners and improve quit handling 2025-08-15 19:11:26 +08:00
prokopis3
1417a67e90 chore(profile-test): fix filename typo ( test_crteate_profile.py → test_create_profile.py )
- Rename file to correct spelling
- No content changes
2025-08-15 19:11:26 +08:00
prokopis3
19398d33ef fix(browser_profiler): improve keyboard input handling
- fix handling of special keys in Windows msvcrt implementation
- Guard against UnicodeDecodeError from multi-byte key sequences
- Filter out non-printable characters and control sequences
- Add error handling to prevent coroutine crashes
- Add unit test to verify keyboard input handling

Key changes:
- Safe UTF-8 decoding with try/except for special keys
- Skip non-printable and multi-byte character sequences
- Add broad exception handling in keyboard listener

Test runs on Windows only due to msvcrt dependency.
2025-08-15 19:11:26 +08:00
prokopis3
263d362daa fix(browser_profiler): cross-platform 'q' to quit
This commit introduces platform-specific handling for the 'q' key press to quit the browser profiler, ensuring compatibility with both Windows and Unix-like systems. It also adds a check to see if the browser process has already exited, terminating the input listener if so.

- Implemented `msvcrt` for Windows to capture keyboard input without requiring a newline.
- Retained `termios`, `tty`, and `select` for Unix-like systems.
- Added a check for browser process termination to gracefully exit the input listener.
- Updated logger messages to use colored output for better user experience.
2025-08-15 19:11:26 +08:00
ntohidi
bac92a47e4 refactor: Update LLMTableExtraction examples and tests 2025-08-15 18:47:31 +08:00
ntohidi
a51545c883 feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables
BREAKING CHANGE: Table extraction now uses Strategy Design Pattern

This epic commit introduces a game-changing approach to table extraction in Crawl4AI:

 NEW FEATURES:
- LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan
- Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries
- Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction
- Intelligent Merging: Seamlessly combines chunk results into complete tables
- Header Preservation: Each chunk maintains context with original headers
- Auto-retry Logic: Built-in resilience with configurable retry attempts

🏗️ ARCHITECTURE:
- Strategy Design Pattern for pluggable table extraction strategies
- ThreadPoolExecutor for concurrent chunk processing
- Token-based chunking with configurable thresholds
- Handles tables without headers gracefully

 PERFORMANCE:
- Process 1000+ row tables without timeout
- Parallel processing with up to 5 concurrent chunks
- Smart token estimation prevents LLM context overflow
- Optimized for providers like Groq for massive tables

🔧 CONFIGURATION:
- enable_chunking: Auto-handle large tables (default: True)
- chunk_token_threshold: When to split (default: 3000 tokens)
- min_rows_per_chunk: Meaningful chunk sizes (default: 10)
- max_parallel_chunks: Concurrent processing (default: 5)

📚 BACKWARD COMPATIBILITY:
- Existing code continues to work unchanged
- DefaultTableExtraction remains the default strategy
- Progressive enhancement approach

This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.
2025-08-14 18:21:24 +08:00
Nasrin
11b310edef Merge pull request #1378 from unclecode/fix/exit_with_q
Cross Platform fix for browser profiler
2025-08-13 14:16:47 +08:00
Nasrin
926e41aab8 Merge pull request #1378 from unclecode/fix/exit_with_q
Cross Platform fix for browser profiler
2025-08-13 14:16:47 +08:00
Nasrin
489981e670 Merge pull request #1390 from unclecode/fix/docker-raw-html
Check for raw: and raw:// URLs before auto-appending https:// prefix
2025-08-13 13:56:33 +08:00
Nasrin
b92be4ef66 Merge pull request #1371 from unclecode/bug/proxy_config
#1057 : enhance ProxyConfig initialization to support dict and string…
2025-08-12 16:55:52 +08:00
Nasrin
7c0edaf266 Merge pull request #1384 from unclecode/fix/update_docker_examples
docs: remove CRAWL4AI_API_TOKEN references and use correct endpoints in Docker example scripts (#1015)
2025-08-12 16:53:42 +08:00
ntohidi
dfcfd8ae57 fix(dispatcher): enable true concurrency for fast-completing tasks in arun_many. REF: #560
The MemoryAdaptiveDispatcher was processing tasks sequentially despite
  max_session_permit > 1 due to fetching only one task per event loop iteration.
  This particularly affected raw:// URLs which complete in microseconds.

  Changes:
  - Replace single task fetch with greedy slot filling using get_nowait()
  - Fill all available slots (up to max_session_permit) immediately
  - Break on empty queue instead of waiting with timeout

  This ensures proper parallelization for all task types, especially
  ultra-fast operations like raw HTML processing.
2025-08-12 16:51:22 +08:00
ntohidi
955110a8b0 Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-08-12 12:22:25 +08:00
Soham Kukreti
f30811b524 fix: Check for raw: and raw:// URLs before auto-appending https:// prefix
- Add raw HTML URL validation alongside http/https checks
- Fix URL preprocessing logic to handle raw: and raw:// prefixes
- Update error message and add comprehensive test cases
2025-08-11 22:10:53 +05:30
ntohidi
8146d477e9 Merge branch 'main' into develop 2025-08-11 18:56:15 +08:00
ntohidi
96c4b0de67 fix(browser_manager): serialize new_page on persistent context to avoid races ref #1198
- Add _page_lock and guarded creation; handle empty context.pages safely
  - Prevents BrowserContext.new_page “Target page/context closed” during concurrent arun_many
2025-08-11 18:55:43 +08:00
Nasrin
57c14db7cb Merge pull request #1381 from unclecode/fix/base-tag-link-resolution
fix: Implement base tag support in link extraction (#1147)
2025-08-11 18:32:32 +08:00
Soham Kukreti
cd2dd68e4c docs: remove CRAWL4AI_API_TOKEN references and use correct endpoints in Docker example scripts (#1015)
- Remove deprecated API token authentication from all Docker examples
- Fix async job endpoints: /crawl -> /crawl/job for submission, /task/{id} -> /crawl/job/{id} for polling
- Fix sync endpoint: /crawl_sync -> /crawl (synchronous)
- Remove non-existent /crawl_direct endpoint
- Update request format to use new structure with browser_config and crawler_config
- Fix response handling for both async and sync calls
- Update extraction strategy format to use proper nested structure
- Add Ollama connectivity check before running tests
- Update test schemas and selectors for current website structures

This makes the Docker examples work out-of-the-box with the current API structure.
2025-08-09 19:37:22 +05:30
UncleCode
f0ce7b2710 feat: add v0.7.3 release notes, changelog updates, and documentation for new features 2025-08-09 21:04:18 +08:00
UncleCode
21f79fe166 Release v0.7.3: Merge release branch
- Merge release/v0.7.3 into main
- Version: 0.7.3
- Ready for tag and publication
v0.7.3
2025-08-09 20:11:35 +08:00
unclecode
a9a2d798b4 feat: update sponsorship tier details and add custom arrangements note 2025-08-09 20:10:32 +08:00
unclecode
612270fcb0 feat: add scheduling link to contact information in SPONSORS.md 2025-08-09 20:05:59 +08:00
unclecode
bc099fdd76 Merge branch 'main' into release/v0.7.3 2025-08-09 19:30:46 +08:00
unclecode
18504d782e Add Founding Sponsors section and update README with detailed project information
- Introduced a new section in SPONSORS.md to recognize the first 50 sponsors as Founding Sponsors.
- Updated README-first.md to include comprehensive project details, features, installation instructions, and advanced usage examples.
- Highlighted the recent version 0.7.0 release with new features and improvements.
- Added a sponsorship program with tiered benefits and a mission statement to promote data democratization.
2025-08-09 19:11:32 +08:00
unclecode
ad547607b9 feat: add GitHub Sponsors support with 4 tiers
- Add FUNDING.yml to enable sponsor button
- Add sponsor section to README with tier overview
- Create SPONSORS.md for sponsor recognition
- Set up 4 tiers: Believer, Builder, Growing Team, Data Infrastructure Partner
2025-08-09 17:57:47 +08:00
Soham Kukreti
18ad3ef159 fix: Implement base tag support in link extraction (#1147)
- Extract base href from <head><base> tag using XPath in _process_element method
- Use base URL as the primary URL for link normalization when present
- Add error handling with logging for malformed or problematic base tags
- Maintain backward compatibility when no base tag is present
- Add test to verify the functionality of the base tag extraction.
2025-08-08 20:11:57 +05:30
AHMET YILMAZ
0541b61405 feat(browser-profiler): implement cross-platform keyboard listeners and improve quit handling 2025-08-08 11:18:34 +08:00
AHMET YILMAZ
b61b2ee676 feat(browser-profiler): implement cross-platform keyboard listeners and improve quit handling 2025-08-08 11:18:34 +08:00
AHMET YILMAZ
89cf5aba2b #1057 : enhance ProxyConfig initialization to support dict and string formats 2025-08-06 18:34:58 +08:00
ntohidi
6b0b5301ba Release v0.7.3:
- Updated version to 0.7.3
- Added release notes
- Updated documentation
2025-08-06 17:52:01 +08:00
Nasrin
6735c68288 Merge pull request #1170 from prokopis3/fix/create-profile
fix(browser_profiler): cross-platform 'q' to quit - create profile
2025-08-06 16:29:14 +08:00
Nasrin
64f37792a7 Merge pull request #1170 from prokopis3/fix/create-profile
fix(browser_profiler): cross-platform 'q' to quit - create profile
2025-08-06 16:29:14 +08:00
ntohidi
a5bcac4c9d feat(docs): enhance table data access example with a real url 2025-08-06 15:19:37 +08:00
Nasrin
45d8327d23 Merge pull request #1366 from unclecode/fix/update-tables-documentation
docs: Update README.md and modify Media and Tables Documentation.(#1271)
2025-08-06 15:15:24 +08:00
ntohidi
437395e490 Merge branch 'feat/undetected-browser' into develop-future 2025-08-06 15:03:30 +08:00
Soham Kukreti
fddae303fb docs: Update README.md and modify Media and Tables Documentation.(#1271)
- Update Table-to-DataFrame Extraction example in README.md
- Replace old method of accessing tables via result.media directly with result.tables in the documentation
- Remove tables section from links & media page.
- Add tables section to crawler result page.
2025-08-05 23:29:19 +05:30
ntohidi
ff6ea41ac3 feat(docker): add flexible LLM provider configuration
- Support LLM_PROVIDER env var to override default provider (openai/gpt-4o-mini)
- Add optional 'provider' parameter to API endpoints for per-request overrides
- Implement provider validation to ensure API keys exist
- Update documentation and examples with new configuration options

Closes the need to hardcode providers in config.yml
2025-08-05 14:09:54 +08:00
ntohidi
31a435fb0e Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-08-04 19:12:19 +08:00
Nasrin
5de6a28055 Merge pull request #1361 from unclecode/fix/crawler-result-docs
Update CrawlResult documentation with missing fields
2025-08-04 19:12:09 +08:00
ntohidi
de1561ad14 Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-08-04 19:04:50 +08:00
Nasrin
337b588732 Merge pull request #1358 from shonenada/patch-1
Fix typos in examples.md
2025-08-04 19:04:42 +08:00
ntohidi
7a6ad547f0 Squashed commit of the following:
commit 2def6524cdacb69c72760bf55a41089257c0bb07
Author: ntohidi <nasrin@kidocode.com>
Date:   Mon Aug 4 18:59:10 2025 +0800

    refactor: consolidate WebScrapingStrategy to use LXML implementation only

    BREAKING CHANGE: None - full backward compatibility maintained

    This commit simplifies the content scraping architecture by removing the
    redundant BeautifulSoup-based WebScrapingStrategy implementation and making
    it an alias for LXMLWebScrapingStrategy.

    Changes:
    - Remove ~1000 lines of BeautifulSoup-based WebScrapingStrategy code
    - Make WebScrapingStrategy an alias for LXMLWebScrapingStrategy
    - Update LXMLWebScrapingStrategy to inherit directly from ContentScrapingStrategy
    - Add required methods (scrap, ascrap, process_element, _log) to LXMLWebScrapingStrategy
    - Maintain 100% backward compatibility - existing code continues to work

    Code changes:
    - crawl4ai/content_scraping_strategy.py: Remove WebScrapingStrategy class, add alias
    - crawl4ai/async_configs.py: Remove WebScrapingStrategy from imports
    - crawl4ai/__init__.py: Update imports to show alias relationship
    - crawl4ai/types.py: Update type definitions
    - crawl4ai/legacy/web_crawler.py: Update import to use alias
    - tests/async/test_content_scraper_strategy.py: Update to use LXMLWebScrapingStrategy
    - docs/examples/scraping_strategies_performance.py: Update to use single strategy

    Documentation updates:
    - docs/md_v2/core/content-selection.md: Update scraping modes section
    - docs/md_v2/migration/webscraping-strategy-migration.md: Add migration guide
    - CHANGELOG.md: Document the refactoring under [Unreleased]

    Benefits:
    - 10-20x faster HTML parsing for large documents
    - Reduced memory usage and simplified codebase
    - Consistent parsing behavior
    - No migration required for existing users

    All existing code using WebScrapingStrategy continues to work without
    modification, while benefiting from LXML's superior performance.
2025-08-04 19:02:01 +08:00
Soham Kukreti
e6692b987d docs: Update CrawlResult documentation with missing fields.
- Add missing fields: fit_html, js_execution_result, redirected_url, network_requests, console_messages, tables
2025-08-04 15:43:40 +05:30
ntohidi
307fe28b32 fix: Correct URL matcher fallback behavior and improve memory monitoring
Fix critical issue where unmatched URLs incorrectly used the first config instead of failing safely. Also clarify that configs without url_matcher match ALL URLs by design, and improve memory usage monitoring.

Bug fixes:
- Change select_config() to return None when no config matches instead of using first config
- Add proper error handling in dispatchers when no config matches a URL
- Return failed CrawlResult with "No matching configuration found" error message
- Fix is_match() to return True when url_matcher is None (matches all URLs)
- Import and use get_true_memory_usage_percent() for more accurate memory monitoring

Behavior clarification:
- CrawlerRunConfig with url_matcher=None matches ALL URLs (not nothing)
- This is the intended behavior for default/fallback configurations
- Enables clean pattern: specific configs first, default config last

Documentation updates:
- Clarify that configs without url_matcher match everything
- Explain "No matching configuration found" error when no default config
- Add examples showing proper default config usage
- Update all relevant docs: multi-url-crawling.md, arun_many.md, parameters.md
- Simplify API config examples by removing extraction_strategy

Demo and test updates:
- Update demo_multi_config_clean.py with commented default config to show behavior
- Change example URL to w3schools.com to demonstrate no-match scenario
- Uncomment all test URLs in test_multi_config.py for comprehensive testing

Breaking changes: None - this restores the intended behavior

This ensures URLs only get processed with appropriate configs, preventing
issues like HTML pages being processed with PDF extraction strategies.
2025-08-03 16:50:54 +08:00
Yaoda Liu
438a103b17 Fix typos in examples.md 2025-08-03 14:33:10 +08:00
ntohidi
a03e68fa2f feat: Add URL-specific crawler configurations for multi-URL crawling
Implement dynamic configuration selection based on URL patterns to optimize crawling for different content types. This feature enables users to apply different crawling strategies (PDF extraction, content filtering, JavaScript execution) based on URL matching patterns.

Key additions:
- Add url_matcher and match_mode parameters to CrawlerRunConfig
- Implement is_match() method supporting string patterns, functions, and mixed lists
- Add MatchMode enum for OR/AND logic when combining multiple matchers
- Update AsyncWebCrawler.arun_many() to accept List[CrawlerRunConfig]
- Add select_config() method to dispatchers for runtime config selection
- First matching config wins, with fallback to default

Pattern matching supports:
- Glob-style strings: *.pdf, */blog/*, *api*
- Lambda functions: lambda url: 'github.com' in url
- Mixed patterns with AND/OR logic for complex matching

This enables optimal per-URL configuration:
- PDFs: Use PDFContentScrapingStrategy without JavaScript
- Blogs: Apply content filtering to reduce noise
- APIs: Skip JavaScript, use JSON extraction
- Dynamic sites: Execute only necessary JavaScript

Breaking changes: None - fully backward compatible
2025-08-02 19:10:36 +08:00