Commit Graph

1205 Commits

Author SHA1 Message Date
ntohidi
102352eac4 fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy (ref #1419)
- Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params
  - Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization
  - Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors
  - Ensure filter chains work correctly with Docker client and REST API

  The issue occurred because:
  1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization
  2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors

  Changes:
  - async_configs.py: Comment out __slots__ serialization logic (lines 100-109)
  - filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store as public attributes
  - models.py: Convert property descriptors to strings in model_dump() instead of including them directly
2025-08-25 14:04:08 +08:00
James T. Wood
f2da460bb9 fix(dependencies): add cssselect to project dependencies
Fixes bug reported in issue #1405
[Bug]: Excluded selector (excluded_selector) doesn't work

This commit reintroduces the cssselect library which was removed by PR (https://github.com/unclecode/crawl4ai/pull/1368) and merged via (437395e490).

Integration tested against 0.7.4 Docker container. Reintroducing cssselector package eliminated errors seen in logs and excluded_selector functionality was restored.

Refs: #1405
2025-08-24 22:12:20 -04:00
Soham Kukreti
b1dff5a4d3 feat: Add comprehensive website to API example with frontend
This commit adds a complete, web scraping API example that demonstrates how to get structured data from any website and use it like an API using the crawl4ai library with a minimalist frontend interface.

Core Functionality
- AI-powered web scraping with plain English queries
- Dual scraping approaches: Schema-based (faster) and LLM-based (flexible)
- Intelligent schema caching for improved performance
- Custom LLM model support with API key management
- Automatic duplicate request prevention

Modern Frontend Interface
- Minimalist black-and-white design inspired by modern web apps
- Responsive layout with smooth animations and transitions
- Three main pages: Scrape Data, Models Management, API Request History
- Real-time results display with JSON formatting
- Copy-to-clipboard functionality for extracted data
- Toast notifications for user feedback
- Auto-scroll to results when scraping starts

Model Management System
- Web-based model configuration interface
- Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.)
- Simplified configuration requiring only provider and API token
- Add, list, and delete model configurations
- Secure storage of API keys in local JSON files

API Request History
- Automatic saving of all API requests and responses
- Display of request history with URL, query, and cURL commands
- Duplicate prevention (same URL + query combinations)
- Request deletion functionality
- Clean, simplified display focusing on essential information

Technical Implementation

Backend (FastAPI)
- RESTful API with comprehensive endpoints
- Pydantic models for request/response validation
- Async web scraping with crawl4ai library
- Error handling with detailed error messages
- File-based storage for models and request history

Frontend (Vanilla JS/CSS/HTML)
- No framework dependencies - pure HTML, CSS, JavaScript
- Modern CSS Grid and Flexbox layouts
- Custom dropdown styling with SVG arrows
- Responsive design for mobile and desktop
- Smooth scrolling and animations

Core Library Integration
- WebScraperAgent class for orchestration
- ModelConfig class for LLM configuration management
- Schema generation and caching system
- LLM extraction strategy support
- Browser configuration with headless mode
2025-08-24 18:52:37 +05:30
ntohidi
40ab287c90 fix(utils): Improve URL normalization by avoiding quote/unquote to preserve '+' signs. ref #1332 2025-08-22 12:05:21 +08:00
Soham Kukreti
c09a57644f docs: update adaptive crawler docs and cache defaults; remove deprecated examples (#1330)
- Replace BaseStrategy with CrawlStrategy in custom strategy examples (DomainSpecificStrategy, HybridStrategy)
- Remove “Custom Link Scoring” and “Caching Strategy” sections no longer aligned with current library
- Revise memory pruning example to use adaptive.get_relevant_content and index-based retention of top 500 docs
- Correct Quickstart note: default cache mode is CacheMode.BYPASS; instruct enabling with CacheMode.ENABLED
2025-08-21 19:11:31 +05:30
ntohidi
90af453506 Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-08-21 14:10:01 +08:00
Nasrin
8bb0e68cce Merge pull request #1422 from unclecode/fix/docker-llmEnvFile
fix(docker): Fix LLM API key handling for multi-provider support
2025-08-21 14:05:06 +08:00
ntohidi
95051020f4 fix(docker): Fix LLM API key handling for multi-provider support
Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers
due to a hardcoded api_key_env fallback in config.yml. This caused authentication
errors when using non-OpenAI providers like Gemini.

Changes:
- Remove api_key_env from config.yml to let litellm handle provider-specific env vars
- Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys
- Update validate_llm_provider() to trust litellm's built-in key detection
- Update documentation to reflect the new automatic key handling

The fix leverages litellm's existing capability to automatically find the correct
environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.)
without manual configuration.

ref #1291
2025-08-21 14:01:04 +08:00
ntohidi
69961cf40b Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-08-20 16:56:19 +08:00
Nasrin
ef174a4c7a Merge pull request #1104 from emmanuel-ferdman/main
fix(docker-api): migrate to modern datetime library API
2025-08-20 10:57:39 +08:00
Nasrin
f4206d6ba1 Merge pull request #1369 from NezarAli/main
Fix examples in README.md
2025-08-18 14:22:54 +08:00
ntohidi
9447054a65 docs: update Docker instructions to use the latest release tag 2025-08-18 14:20:05 +08:00
Nasrin
dad7c51481 Merge pull request #1398 from unclecode/fix/update-url-seeding-docs
Update URL seeding examples to use proper async context managers
2025-08-18 13:00:26 +08:00
ntohidi
f4a432829e fix(crawler): Removed the incorrect reference in browser_config variable #1310 2025-08-18 10:59:14 +08:00
UncleCode
e651e045c4 Release v0.7.4: Merge release branch
- Merge release/v0.7.4 into main
- Version: 0.7.4
- Ready for tag and publication
v0.7.4
2025-08-17 19:46:48 +08:00
UncleCode
5398acc7d2 docs: add v0.7.4 release blog post and update documentation
- Add comprehensive v0.7.4 release blog post with LLMTableExtraction feature highlight
- Update blog index to feature v0.7.4 as latest release
- Update README.md to showcase v0.7.4 features alongside v0.7.3
- Accurately describe dispatcher fix as bug fix rather than major enhancement
- Include practical code examples for new LLMTableExtraction capabilities
2025-08-17 19:45:23 +08:00
UncleCode
22c7932ba3 chore(version): update version to 0.7.4 2025-08-17 19:22:23 +08:00
UncleCode
2ab0bf27c2 refactor(utils): move memory utilities to utils and update imports 2025-08-17 19:14:55 +08:00
ntohidi
d30dc9fdc1 fix(http-crawler): bring back HTTP crawler strategy 2025-08-16 09:27:23 +08:00
ntohidi
e6044e6053 Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-08-15 19:44:06 +08:00
ntohidi
a50e47adad Merge branch 'feature/table-extraction-strategies' into develop 2025-08-15 19:41:37 +08:00
ntohidi
ada7441bd1 refactor: Update LLMTableExtraction examples and tests 2025-08-15 19:11:26 +08:00
ntohidi
9f7fee91a9 feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables
BREAKING CHANGE: Table extraction now uses Strategy Design Pattern

This epic commit introduces a game-changing approach to table extraction in Crawl4AI:

 NEW FEATURES:
- LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan
- Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries
- Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction
- Intelligent Merging: Seamlessly combines chunk results into complete tables
- Header Preservation: Each chunk maintains context with original headers
- Auto-retry Logic: Built-in resilience with configurable retry attempts

🏗️ ARCHITECTURE:
- Strategy Design Pattern for pluggable table extraction strategies
- ThreadPoolExecutor for concurrent chunk processing
- Token-based chunking with configurable thresholds
- Handles tables without headers gracefully

 PERFORMANCE:
- Process 1000+ row tables without timeout
- Parallel processing with up to 5 concurrent chunks
- Smart token estimation prevents LLM context overflow
- Optimized for providers like Groq for massive tables

🔧 CONFIGURATION:
- enable_chunking: Auto-handle large tables (default: True)
- chunk_token_threshold: When to split (default: 3000 tokens)
- min_rows_per_chunk: Meaningful chunk sizes (default: 10)
- max_parallel_chunks: Concurrent processing (default: 5)

📚 BACKWARD COMPATIBILITY:
- Existing code continues to work unchanged
- DefaultTableExtraction remains the default strategy
- Progressive enhancement approach

This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.
2025-08-15 19:11:26 +08:00
AHMET YILMAZ
7f48655cf1 feat(browser-profiler): implement cross-platform keyboard listeners and improve quit handling 2025-08-15 19:11:26 +08:00
prokopis3
1417a67e90 chore(profile-test): fix filename typo ( test_crteate_profile.py → test_create_profile.py )
- Rename file to correct spelling
- No content changes
2025-08-15 19:11:26 +08:00
prokopis3
19398d33ef fix(browser_profiler): improve keyboard input handling
- fix handling of special keys in Windows msvcrt implementation
- Guard against UnicodeDecodeError from multi-byte key sequences
- Filter out non-printable characters and control sequences
- Add error handling to prevent coroutine crashes
- Add unit test to verify keyboard input handling

Key changes:
- Safe UTF-8 decoding with try/except for special keys
- Skip non-printable and multi-byte character sequences
- Add broad exception handling in keyboard listener

Test runs on Windows only due to msvcrt dependency.
2025-08-15 19:11:26 +08:00
prokopis3
263d362daa fix(browser_profiler): cross-platform 'q' to quit
This commit introduces platform-specific handling for the 'q' key press to quit the browser profiler, ensuring compatibility with both Windows and Unix-like systems. It also adds a check to see if the browser process has already exited, terminating the input listener if so.

- Implemented `msvcrt` for Windows to capture keyboard input without requiring a newline.
- Retained `termios`, `tty`, and `select` for Unix-like systems.
- Added a check for browser process termination to gracefully exit the input listener.
- Updated logger messages to use colored output for better user experience.
2025-08-15 19:11:26 +08:00
ntohidi
bac92a47e4 refactor: Update LLMTableExtraction examples and tests 2025-08-15 18:47:31 +08:00
ntohidi
a51545c883 feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables
BREAKING CHANGE: Table extraction now uses Strategy Design Pattern

This epic commit introduces a game-changing approach to table extraction in Crawl4AI:

 NEW FEATURES:
- LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan
- Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries
- Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction
- Intelligent Merging: Seamlessly combines chunk results into complete tables
- Header Preservation: Each chunk maintains context with original headers
- Auto-retry Logic: Built-in resilience with configurable retry attempts

🏗️ ARCHITECTURE:
- Strategy Design Pattern for pluggable table extraction strategies
- ThreadPoolExecutor for concurrent chunk processing
- Token-based chunking with configurable thresholds
- Handles tables without headers gracefully

 PERFORMANCE:
- Process 1000+ row tables without timeout
- Parallel processing with up to 5 concurrent chunks
- Smart token estimation prevents LLM context overflow
- Optimized for providers like Groq for massive tables

🔧 CONFIGURATION:
- enable_chunking: Auto-handle large tables (default: True)
- chunk_token_threshold: When to split (default: 3000 tokens)
- min_rows_per_chunk: Meaningful chunk sizes (default: 10)
- max_parallel_chunks: Concurrent processing (default: 5)

📚 BACKWARD COMPATIBILITY:
- Existing code continues to work unchanged
- DefaultTableExtraction remains the default strategy
- Progressive enhancement approach

This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.
2025-08-14 18:21:24 +08:00
Soham Kukreti
ecbe5ffb84 docs: Update URL seeding examples to use proper async context managers
- Wrap all AsyncUrlSeeder usage with async context managers
- Update URL seeding adventure example to use "sitemap+cc" source, focus on course posts, and add stream=True parameter to fix runtime error
2025-08-13 18:16:46 +05:30
Nasrin
11b310edef Merge pull request #1378 from unclecode/fix/exit_with_q
Cross Platform fix for browser profiler
2025-08-13 14:16:47 +08:00
Nasrin
926e41aab8 Merge pull request #1378 from unclecode/fix/exit_with_q
Cross Platform fix for browser profiler
2025-08-13 14:16:47 +08:00
Nasrin
489981e670 Merge pull request #1390 from unclecode/fix/docker-raw-html
Check for raw: and raw:// URLs before auto-appending https:// prefix
2025-08-13 13:56:33 +08:00
Nasrin
b92be4ef66 Merge pull request #1371 from unclecode/bug/proxy_config
#1057 : enhance ProxyConfig initialization to support dict and string…
2025-08-12 16:55:52 +08:00
Nasrin
7c0edaf266 Merge pull request #1384 from unclecode/fix/update_docker_examples
docs: remove CRAWL4AI_API_TOKEN references and use correct endpoints in Docker example scripts (#1015)
2025-08-12 16:53:42 +08:00
ntohidi
dfcfd8ae57 fix(dispatcher): enable true concurrency for fast-completing tasks in arun_many. REF: #560
The MemoryAdaptiveDispatcher was processing tasks sequentially despite
  max_session_permit > 1 due to fetching only one task per event loop iteration.
  This particularly affected raw:// URLs which complete in microseconds.

  Changes:
  - Replace single task fetch with greedy slot filling using get_nowait()
  - Fill all available slots (up to max_session_permit) immediately
  - Break on empty queue instead of waiting with timeout

  This ensures proper parallelization for all task types, especially
  ultra-fast operations like raw HTML processing.
2025-08-12 16:51:22 +08:00
ntohidi
955110a8b0 Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-08-12 12:22:25 +08:00
Soham Kukreti
f30811b524 fix: Check for raw: and raw:// URLs before auto-appending https:// prefix
- Add raw HTML URL validation alongside http/https checks
- Fix URL preprocessing logic to handle raw: and raw:// prefixes
- Update error message and add comprehensive test cases
2025-08-11 22:10:53 +05:30
ntohidi
8146d477e9 Merge branch 'main' into develop 2025-08-11 18:56:15 +08:00
ntohidi
96c4b0de67 fix(browser_manager): serialize new_page on persistent context to avoid races ref #1198
- Add _page_lock and guarded creation; handle empty context.pages safely
  - Prevents BrowserContext.new_page “Target page/context closed” during concurrent arun_many
2025-08-11 18:55:43 +08:00
Nasrin
57c14db7cb Merge pull request #1381 from unclecode/fix/base-tag-link-resolution
fix: Implement base tag support in link extraction (#1147)
2025-08-11 18:32:32 +08:00
ntohidi
88a9fbbb7e fix(deep-crawl): BestFirst priority inversion; remove pre-scoring truncation. ref #1253
Use negative scores in PQ to visit high-score URLs first and drop link cap prior to scoring; add test for ordering.
2025-08-11 18:16:57 +08:00
ntohidi
be63c98db3 feat(docker): add user-provided hooks support to Docker API
Implements comprehensive hooks functionality allowing users to provide custom Python
functions as strings that execute at specific points in the crawling pipeline.

Key Features:
- Support for all 8 crawl4ai hook points:
  • on_browser_created: Initialize browser settings
  • on_page_context_created: Configure page context
  • before_goto: Pre-navigation setup
  • after_goto: Post-navigation processing
  • on_user_agent_updated: User agent modification handling
  • on_execution_started: Crawl execution initialization
  • before_retrieve_html: Pre-extraction processing
  • before_return_html: Final HTML processing

Implementation Details:
- Created UserHookManager for validation, compilation, and safe execution
- Added IsolatedHookWrapper for error isolation and timeout protection
- AST-based validation ensures code structure correctness
- Sandboxed execution with restricted builtins for security
- Configurable timeout (1-120 seconds) prevents infinite loops
- Comprehensive error handling ensures hooks don't crash main process
- Execution tracking with detailed statistics and logging

API Changes:
- Added HookConfig schema with code and timeout fields
- Extended CrawlRequest with optional hooks parameter
- Added /hooks/info endpoint for hook discovery
- Updated /crawl and /crawl/stream endpoints to support hooks

Safety Features:
- Malformed hooks return clear validation errors
- Hook errors are isolated and reported without stopping crawl
- Execution statistics track success/failure/timeout rates
- All hook results are JSON-serializable

Testing:
- Comprehensive test suite covering all 8 hooks
- Error handling and timeout scenarios validated
- Authentication, performance, and content extraction examples
- 100% success rate in production testing

Documentation:
- Added extensive hooks section to docker-deployment.md
- Security warnings about user-provided code risks
- Real-world examples using httpbin.org, GitHub, BBC
- Best practices and troubleshooting guide

ref #1377
2025-08-11 13:25:17 +08:00
Soham Kukreti
cd2dd68e4c docs: remove CRAWL4AI_API_TOKEN references and use correct endpoints in Docker example scripts (#1015)
- Remove deprecated API token authentication from all Docker examples
- Fix async job endpoints: /crawl -> /crawl/job for submission, /task/{id} -> /crawl/job/{id} for polling
- Fix sync endpoint: /crawl_sync -> /crawl (synchronous)
- Remove non-existent /crawl_direct endpoint
- Update request format to use new structure with browser_config and crawler_config
- Fix response handling for both async and sync calls
- Update extraction strategy format to use proper nested structure
- Add Ollama connectivity check before running tests
- Update test schemas and selectors for current website structures

This makes the Docker examples work out-of-the-box with the current API structure.
2025-08-09 19:37:22 +05:30
UncleCode
f0ce7b2710 feat: add v0.7.3 release notes, changelog updates, and documentation for new features 2025-08-09 21:04:18 +08:00
UncleCode
21f79fe166 Release v0.7.3: Merge release branch
- Merge release/v0.7.3 into main
- Version: 0.7.3
- Ready for tag and publication
v0.7.3
2025-08-09 20:11:35 +08:00
unclecode
a9a2d798b4 feat: update sponsorship tier details and add custom arrangements note 2025-08-09 20:10:32 +08:00
unclecode
612270fcb0 feat: add scheduling link to contact information in SPONSORS.md 2025-08-09 20:05:59 +08:00
unclecode
bc099fdd76 Merge branch 'main' into release/v0.7.3 2025-08-09 19:30:46 +08:00
unclecode
18504d782e Add Founding Sponsors section and update README with detailed project information
- Introduced a new section in SPONSORS.md to recognize the first 50 sponsors as Founding Sponsors.
- Updated README-first.md to include comprehensive project details, features, installation instructions, and advanced usage examples.
- Highlighted the recent version 0.7.0 release with new features and improvements.
- Added a sponsorship program with tiered benefits and a mission statement to promote data democratization.
2025-08-09 19:11:32 +08:00