crawl4ai

Author	SHA1	Message	Date
ntohidi	0f210f6e02	Merge branch '2025-MAY-2' into next-MAY	2025-07-08 11:46:13 +02:00
UncleCode	1a73fb60db	feat(crawl4ai): Implement adaptive crawling feature This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage. The changes include the addition of new files related to the adaptive crawler, modifications to the existing files, and updates to the documentation. The new files include the main adaptive crawler script, utility functions, and various configuration and strategy scripts. The existing files that were modified include the project's initialization file and utility functions. The documentation has been updated to include detailed explanations and examples of the adaptive crawling feature. The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool. Significant modifications: - Added adaptive_crawler.py and related scripts - Modified __init__.py and utils.py - Updated documentation with details about the adaptive crawling feature - Added tests for the new feature BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature. Refs: #123, #456	2025-07-04 15:16:53 +08:00
UncleCode	048d9b0f5b	feat: Implement nightly build script and update version handling	2025-07-03 20:53:03 +08:00
UncleCode	a353515271	feat: Add virtual scroll support for modern web scraping Add comprehensive virtual scroll handling to capture all content from pages that use DOM recycling techniques (Twitter, Instagram, etc). Key features: - New VirtualScrollConfig class for configuring virtual scroll behavior - Automatic detection of three scrolling scenarios: no change, content appended, content replaced - Intelligent HTML chunk capture and merging with deduplication - 100% content capture from virtual scroll pages - Seamless integration with existing extraction strategies - JavaScript-based detection and capture for performance - Tree-based DOM merging with text-based deduplication Documentation: - Comprehensive guide at docs/md_v2/advanced/virtual-scroll.md - API reference updates in parameters.md and page-interaction.md - Blog article explaining the solution and techniques - Complete examples with local test server Testing: - Full test suite achieving 100% capture of 1000 items - Examples for Twitter timeline, Instagram grid scenarios - Local test server with different scrolling behaviors This enables scraping of modern websites that were previously impossible to fully capture with traditional scrolling techniques.	2025-06-29 20:41:37 +08:00
UncleCode	539a324cf6	refactor(link_extractor): remove link_extractor and rename to link_preview This change removes the link_extractor module and renames it to link_preview, streamlining the codebase. The removal of 395 lines of code reduces complexity and improves maintainability. Other files have been updated to reflect this change, ensuring consistency across the project. BREAKING CHANGE: The link_extractor module has been deleted and replaced with link_preview. Update imports accordingly.	2025-06-27 21:54:22 +08:00
UncleCode	5c9c305dbf	feat: Add advanced link head extraction with three-layer scoring system (#1) Squashed commit from feature/link-extractor branch implementing comprehensive link analysis: - Extract HTML head content from discovered links with parallel processing - Three-layer scoring: Intrinsic (URL quality), Contextual (BM25), and Total scores - New LinkExtractionConfig class for type-safe configuration - Pattern-based filtering for internal/external links - Comprehensive documentation and examples	2025-06-27 20:06:04 +08:00
UncleCode	e528086341	test(async_assistant): add new tests for extract pipeline Introduced two new test files to enhance coverage for the extract pipeline functionality. The tests aim to validate the behavior of the pipeline under various scenarios, ensuring robustness and reliability. No breaking changes. Closes issue #123.	2025-06-23 10:44:27 +08:00
prokopis3	c4d625fb3c	chore(profile-test): fix filename typo ( test_crteate_profile.py → test_create_profile.py ) - Rename file to correct spelling - No content changes	2025-06-12 14:38:32 +03:00
prokopis3	ef722766f0	fix(browser_profiler): improve keyboard input handling - fix handling of special keys in Windows msvcrt implementation - Guard against UnicodeDecodeError from multi-byte key sequences - Filter out non-printable characters and control sequences - Add error handling to prevent coroutine crashes - Add unit test to verify keyboard input handling Key changes: - Safe UTF-8 decoding with try/except for special keys - Skip non-printable and multi-byte character sequences - Add broad exception handling in keyboard listener Test runs on Windows only due to msvcrt dependency.	2025-06-12 14:33:12 +03:00
ntohidi	5d9213a0e9	fix: Update JavaScript execution in AsyncPlaywrightCrawlerStrategy to handle script errors and add basic download test case. ref #1215	2025-06-12 12:21:40 +02:00
UncleCode	c0fd36982d	Update all documentation to import extraction strategies directly from crawl4ai.	2025-06-10 18:08:27 +08:00
ntohidi	4679ee023d	fix: Enhance URLPatternFilter to enforce path boundary checks for prefix matching. ref #1003	2025-06-10 11:19:18 +02:00
AHMET YILMAZ	74b06d4b80	#1167 Add PHP MIME types to ContentTypeFilter for better file handling	2025-06-09 11:49:33 +08:00
ntohidi	5ac19a61d7	feat: Implement max_scroll_steps parameter for full page scanning. ref: #1168	2025-06-05 16:40:34 +02:00
UncleCode	3048cc1ff9	feat: Add AsyncUrlSeeder for intelligent URL discovery and filtering This commit introduces AsyncUrlSeeder, a high-performance URL discovery system that enables intelligent crawling at scale by pre-discovering and filtering URLs before crawling. ## Core Features ### AsyncUrlSeeder Component - Discovers URLs from multiple sources: - Sitemaps (including nested and gzipped) - Common Crawl index - Combined sources for maximum coverage - Extracts page metadata without full crawling: - Title, description, keywords - Open Graph and Twitter Card tags - JSON-LD structured data - Language and charset information - BM25 relevance scoring for intelligent filtering: - Query-based URL discovery - Configurable score thresholds - Automatic ranking by relevance - Performance optimizations: - Async/concurrent processing with configurable workers - Rate limiting (hits per second) - Automatic caching with TTL - Streaming results for large datasets ### SeedingConfig - Comprehensive configuration for URL seeding: - Source selection (sitemap, cc, or both) - URL pattern filtering with wildcards - Live URL validation options - Metadata extraction controls - BM25 scoring parameters - Concurrency and rate limiting ### Integration with AsyncWebCrawler - Seamless pipeline: discover → filter → crawl - Direct compatibility with arun_many() - Significant resource savings by pre-filtering URLs ## Documentation - Comprehensive guide comparing URL seeding vs deep crawling - Complete API reference with parameter tables - Practical examples showing all features - Performance benchmarks and best practices - Integration patterns with AsyncWebCrawler ## Examples - url_seeder_demo.py: Interactive Rich-based demo with: - Basic discovery - Cache management - Live validation - BM25 scoring - Multi-domain discovery - Complete pipeline integration - url_seeder_quick_demo.py: Screenshot-friendly examples: - Pattern-based filtering - Metadata exploration - Smart search with BM25 ## Testing - Comprehensive test suite (test_async_url_seeder_bm25.py) - Coverage of all major features - Edge cases and error handling - Performance and consistency tests ## Implementation Details - Built on httpx with HTTP/2 support - Optional dependencies: lxml, brotli, rank_bm25 - Cache management in ~/.crawl4ai/seeder_cache/ - Logger integration with AsyncLoggerBase - Proper error handling and retry logic ## Bug Fixes - Fixed logger color compatibility (lightblack → bright_black) - Corrected URL extraction from seeder results for arun_many() - Updated all examples and documentation with proper usage This feature enables users to crawl smarter, not harder, by discovering and analyzing URLs before committing resources to crawling them.	2025-06-03 23:27:12 +08:00
João Martins	58c1e17170	Merge branch 'main' into fix-raw-url-parsing	2025-05-30 13:03:25 +01:00
Ahmed Tawfik	ce09648af1	Merge pull request #1054 from Sacristaan/feature/readme_example Fix: README.md urls list	2025-05-19 14:20:21 +08:00
Ahmed-Tawfik94	b4fc60a555	#1103 fix(url): enhance URL normalization to handle invalid schemes and trailing slashes	2025-05-19 13:51:16 +08:00
UncleCode	9499164d3c	feat(browser): improve browser profile management and cleanup Enhance browser profile handling with better process cleanup and documentation: - Add process cleanup for existing Chromium instances on Windows/Unix - Fix profile creation by passing complete browser config - Add comprehensive documentation for browser and CLI components - Add initial profile creation test - Bump version to 0.6.3 This change improves reliability when managing browser profiles and provides better documentation for developers.	2025-04-29 23:04:32 +08:00
Marc Sacristán	53245e4e0e	Fix: README.md urls list	2025-04-29 16:26:35 +02:00
UncleCode	4812f08a73	feat(docker): update Docker deployment for v0.6.0 Major updates to Docker deployment infrastructure: - Switch default port to 11235 for all services - Add MCP (Model Context Protocol) support with WebSocket/SSE endpoints - Simplify docker-compose.yml with auto-platform detection - Update documentation with new features and examples - Consolidate configuration and improve resource management BREAKING CHANGE: Default port changed from 8020 to 11235. Update your configurations and deployment scripts accordingly.	2025-04-22 22:35:25 +08:00
UncleCode	5297e362f3	feat(mcp): Implement MCP protocol and enhance server capabilities This commit introduces several significant enhancements to the Crawl4AI Docker deployment: 1. Add MCP Protocol Support: - Implement WebSocket and SSE transport layers for MCP server communication - Create mcp_bridge.py to expose existing API endpoints via MCP protocol - Add comprehensive tests for both socket and SSE transport methods 2. Enhance Docker Server Capabilities: - Add PDF generation endpoint with file saving functionality - Add screenshot capture endpoint with configurable wait time - Implement JavaScript execution endpoint for dynamic page interaction - Add intelligent file path handling for saving generated assets 3. Improve Search and Context Functionality: - Implement syntax-aware code function chunking using AST parsing - Add BM25-based intelligent document search with relevance scoring - Create separate code and documentation context endpoints - Enhance response format with structured results and scores 4. Rename and Fix File Organization: - Fix typo in test_docker_config_gen.py filename - Update import statements and dependencies - Add FileResponse for context endpoints This enhancement significantly improves the machine-to-machine communication capabilities of Crawl4AI, making it more suitable for integration with LLM agents and other automated systems. The CHANGELOG update has been applied successfully, highlighting the key features and improvements made in this release. The commit message provides a detailed explanation of all the changes, which will be helpful for tracking the project's evolution.	2025-04-21 22:22:02 +08:00
UncleCode	a58c8000aa	refactor(server): migrate to pool-based crawler management Replace crawler_manager.py with simpler crawler_pool.py implementation: - Add global page semaphore for hard concurrency cap - Implement browser pool with idle cleanup - Add playground UI for testing and stress testing - Update API handlers to use pooled crawlers - Enhance logging levels and symbols BREAKING CHANGE: Removes CrawlerManager class in favor of simpler pool-based approach	2025-04-20 20:14:26 +08:00
UncleCode	16b2318242	feat(api): implement crawler pool manager for improved resource handling Adds a new CrawlerManager class to handle browser instance pooling and failover: - Implements auto-scaling based on system resources - Adds primary/backup crawler management - Integrates memory monitoring and throttling - Adds streaming support with memory tracking - Updates API endpoints to use pooled crawlers BREAKING CHANGE: API endpoints now require CrawlerManager initialization	2025-04-18 22:26:24 +08:00
UncleCode	907cba194f	Merge branch 'next-stress' into next	2025-04-17 22:34:43 +08:00
UncleCode	921e0c46b6	feat(tests): implement high volume stress testing framework Add comprehensive stress testing solution for SDK using arun_many and dispatcher system: - Create test_stress_sdk.py for running high volume crawl tests - Add run_benchmark.py for orchestrating tests with predefined configs - Implement benchmark_report.py for generating performance reports - Add memory tracking and local test site generation - Support both streaming and batch processing modes - Add detailed documentation in README.md The framework enables testing SDK performance, concurrency handling, and memory behavior under high-volume scenarios.	2025-04-17 22:31:51 +08:00
UncleCode	7db6b468d9	feat(markdown): add content source selection for markdown generation Adds a new content_source parameter to MarkdownGenerationStrategy that allows selecting which HTML content to use for markdown generation: - cleaned_html (default): uses post-processed HTML - raw_html: uses original webpage HTML - fit_html: uses preprocessed HTML for schema extraction Changes include: - Added content_source parameter to MarkdownGenerationStrategy - Updated AsyncWebCrawler to handle HTML source selection - Added examples and tests for the new feature - Updated documentation with new parameter details BREAKING CHANGE: Renamed cleaned_html parameter to input_html in generate_markdown() method signature to better reflect its generalized purpose	2025-04-17 20:13:53 +08:00
UncleCode	94d486579c	docs(tests): clarify server URL comments in deep crawl tests Improve documentation of test configuration URLs by adding clearer comments explaining when to use each URL configuration - Docker vs development mode. No functional changes, only comment improvements.	2025-04-15 22:32:27 +08:00
UncleCode	5206c6f2d6	Modify the test file	2025-04-15 22:28:01 +08:00
UncleCode	230f22da86	refactor(proxy): move ProxyConfig to async_configs and improve LLM token handling Moved ProxyConfig class from proxy_strategy.py to async_configs.py for better organization. Improved LLM token handling with new PROVIDER_MODELS_PREFIXES. Added test cases for deep crawling and proxy rotation. Removed docker_config from BrowserConfig as it's handled separately. BREAKING CHANGE: ProxyConfig import path changed from crawl4ai.proxy_strategy to crawl4ai	2025-04-15 22:27:18 +08:00
UncleCode	ecec53a8c1	Docker tested on Windows machine.	2025-04-13 20:14:41 +08:00
UncleCode	3179d6ad0c	fix(core): improve error handling and stability in core components Enhance error handling and stability across multiple components: - Add safety checks in async_configs.py for type and params existence - Fix browser manager initialization and cleanup logic - Add default LLM config fallback in extraction strategy - Add comprehensive Docker deployment guide and server tests BREAKING CHANGE: BrowserManager.start() now automatically closes existing instances	2025-04-11 20:58:39 +08:00
unclecode	66ac07b4f3	feat(crawler): add network request and console message capturing Implement comprehensive network request and console message capturing functionality: - Add capture_network_requests and capture_console_messages config parameters - Add network_requests and console_messages fields to models - Implement Playwright event listeners to capture requests, responses, and console output - Create detailed documentation and examples - Add comprehensive tests This feature enables deep visibility into web page activity for debugging, security analysis, performance profiling, and API discovery in web applications.	2025-04-10 16:03:48 +08:00
UncleCode	a2061bf31e	feat(crawler): add MHTML capture functionality Add ability to capture web pages as MHTML format, which includes all page resources in a single file. This enables complete page archival and offline viewing. - Add capture_mhtml parameter to CrawlerRunConfig - Implement MHTML capture using CDP in AsyncPlaywrightCrawlerStrategy - Add mhtml field to CrawlResult and AsyncCrawlResponse models - Add comprehensive tests for MHTML capture functionality - Update documentation with MHTML capture details - Add exclude_all_images option for better memory management Breaking changes: None	2025-04-09 15:39:04 +08:00
UncleCode	555455d710	feat(browser): implement browser pooling and page pre-warming Adds a new BrowserManager implementation with browser pooling and page pre-warming capabilities: - Adds support for managing multiple browser instances per configuration - Implements page pre-warming for improved performance - Adds configurable behavior for when no browsers are available - Includes comprehensive status reporting and monitoring - Maintains backward compatibility with existing API - Adds demo script showcasing new features BREAKING CHANGE: BrowserManager API now returns a strategy instance along with page and context	2025-03-31 21:55:07 +08:00
UncleCode	bb02398086	refactor(browser): improve browser strategy architecture and lifecycle management Major refactoring of browser strategy implementations to improve code organization and reliability: - Move CrawlResultContainer and RunManyReturn types from async_webcrawler to models.py - Simplify browser lifecycle management in AsyncWebCrawler - Standardize browser strategy interface with _generate_page method - Improve headless mode handling and browser args construction - Clean up Docker and Playwright strategy implementations - Fix session management and context handling across strategies BREAKING CHANGE: Browser strategy interface has changed with new _generate_page method requirement	2025-03-30 20:58:39 +08:00
UncleCode	3ff7eec8f3	refactor(browser): consolidate browser strategy implementations Moves common browser functionality into BaseBrowserStrategy class to reduce code duplication and improve maintainability. Key changes: - Adds shared browser argument building and session management to base class - Standardizes storage state handling across strategies - Improves process cleanup and error handling - Consolidates CDP URL management and container lifecycle BREAKING CHANGE: Changes browser_mode="custom" to "cdp" for consistency	2025-03-28 22:47:28 +08:00
UncleCode	64f20ab44a	refactor(docker): update Dockerfile and browser strategy to use Chromium	2025-03-28 15:59:02 +08:00
UncleCode	c635f6b9a2	refactor(browser): reorganize browser strategies and improve Docker implementation Reorganize browser strategy code into separate modules for better maintainability and separation of concerns. Improve Docker implementation with: - Add Alpine and Debian-based Dockerfiles for better container options - Enhance Docker registry to share configuration with BuiltinBrowserStrategy - Add CPU and memory limits to container configuration - Improve error handling and logging - Update documentation and examples BREAKING CHANGE: DockerConfig, DockerRegistry, and DockerUtils have been moved to new locations and their APIs have been updated.	2025-03-27 21:35:13 +08:00
UncleCode	7f93e88379	refactor(tests): remove unused imports in test_docker_browser.py	2025-03-26 15:19:29 +08:00
UncleCode	8c08521301	feat(browser): add Docker-based browser automation strategy Implements a new browser strategy that runs Chrome in Docker containers, providing better isolation and cross-platform consistency. Features include: - Connect and launch modes for different container configurations - Persistent storage support for maintaining browser state - Container registry for efficient reuse - Comprehensive test suite for Docker browser functionality This addition allows users to run browser automation workloads in isolated containers, improving security and resource management.	2025-03-24 21:36:58 +08:00
UncleCode	462d5765e2	fix(browser): improve storage state persistence in CDP strategy Enhance storage state persistence mechanism in CDP browser strategy by: - Explicitly saving storage state for each browser context - Using proper file path for storage state - Removing unnecessary sleep delay Also includes test improvements: - Simplified test configurations in playwright tests - Temporarily disabled some CDP tests	2025-03-23 21:06:41 +08:00
UncleCode	0094cac675	refactor(browser): improve parallel crawling and browser management Remove PagePoolConfig in favor of direct page management in browser strategies. Add get_pages() method for efficient parallel page creation. Improve storage state handling and persistence. Add comprehensive parallel crawling tests and performance analysis. BREAKING CHANGE: Removed PagePoolConfig class and related functionality.	2025-03-23 18:53:24 +08:00
UncleCode	4ab0893ffb	feat(browser): implement modular browser management system Adds a new browser management system with strategy pattern implementation: - Introduces BrowserManager class with strategy pattern support - Adds PlaywrightBrowserStrategy, CDPBrowserStrategy, and BuiltinBrowserStrategy - Implements BrowserProfileManager for profile management - Adds PagePoolConfig for browser page pooling - Includes comprehensive test suite for all browser strategies BREAKING CHANGE: Browser management has been moved to browser/ module. Direct usage of browser_manager.py and browser_profiler.py is deprecated.	2025-03-21 22:50:00 +08:00
UncleCode	6432ff1257	feat(browser): add builtin browser management system Implements a persistent browser management system that allows running a single shared browser instance that can be reused across multiple crawler sessions. Key changes include: - Added browser_mode config option with 'builtin', 'dedicated', and 'custom' modes - Implemented builtin browser management in BrowserProfiler - Added CLI commands for managing builtin browser (start, stop, status, restart, view) - Modified browser process handling to support detached processes - Added automatic builtin browser setup during package installation BREAKING CHANGE: The browser_mode config option changes how browser instances are managed	2025-03-20 12:13:59 +08:00
UncleCode	b750542e6d	feat(crawler): optimize single URL handling and add performance comparison Add special handling for single URL requests in Docker API to use arun() instead of arun_many() Add new example script demonstrating performance differences between sequential and parallel crawling Update cache mode from aggressive to bypass in examples and tests Remove unused dependencies (zstandard, msgpack) BREAKING CHANGE: Changed default cache_mode from aggressive to bypass in examples	2025-03-13 22:15:15 +08:00
UncleCode	dc36997a08	feat(schema): improve HTML preprocessing for schema generation Add new preprocess_html_for_schema utility function to better handle HTML cleaning for schema generation. This replaces the previous optimize_html function in the GoogleSearchCrawler and includes smarter attribute handling and pattern detection. Other changes: - Update default provider to gpt-4o - Add DEFAULT_PROVIDER_API_KEY constant - Make LLMConfig creation more flexible with create_llm_config helper - Add new dependencies: zstandard and msgpack This change improves schema generation reliability while reducing noise in the processed HTML.	2025-03-12 22:40:46 +08:00
UncleCode	1630fbdafe	feat(monitor): add real-time crawler monitoring system with memory management Implements a comprehensive monitoring and visualization system for tracking web crawler operations in real-time. The system includes: - Terminal-based dashboard with rich UI for displaying task statuses - Memory pressure monitoring and adaptive dispatch control - Queue statistics and performance metrics tracking - Detailed task progress visualization - Stress testing framework for memory management This addition helps operators track crawler performance and manage memory usage more effectively.	2025-03-12 19:05:24 +08:00
UncleCode	a68cbb232b	feat(browser): add standalone CDP browser launch and lxml extraction strategy Add new features to enhance browser automation and HTML extraction: - Add CDP browser launch capability with customizable ports and profiles - Implement JsonLxmlExtractionStrategy for faster HTML parsing - Add CLI command 'crwl cdp' for launching standalone CDP browsers - Support connecting to external CDP browsers via URL - Optimize selector caching and context-sensitive queries BREAKING CHANGE: LLMConfig import path changed from crawl4ai.types to crawl4ai	2025-03-07 20:55:56 +08:00
UncleCode	baee4949d3	refactor(llm): rename LlmConfig to LLMConfig for consistency Rename LlmConfig to LLMConfig across the codebase to follow consistent naming conventions. Update all imports and usages to use the new name. Update documentation and examples to reflect the change. BREAKING CHANGE: LlmConfig has been renamed to LLMConfig. Users need to update their imports and usage.	2025-03-05 14:17:04 +08:00
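
Note on 4679ee023d (path boundary checks for prefix matching): the commit message does not show the code. The following is only a minimal sketch of what a boundary-aware prefix check can look like. The function name `path_has_prefix` is hypothetical and is not part of crawl4ai's `URLPatternFilter`. The idea is that a prefix such as `/api` should match `/api` and `/api/users` but not `/apiv2/users`.

```python
from urllib.parse import urlsplit


def path_has_prefix(url: str, prefix: str) -> bool:
    """Return True if the URL path equals `prefix` or lies beneath it.

    Hypothetical helper for illustration; not crawl4ai's implementation.
    """
    path = urlsplit(url).path or "/"
    prefix = prefix.rstrip("/")
    if prefix == "":
        # A root prefix ("/" or "") matches every path.
        return True
    # Require an exact match or a "/" right after the prefix, so that
    # "/api" does not accidentally match "/apiv2".
    return path == prefix or path.startswith(prefix + "/")
```

A plain `str.startswith(prefix)` would accept `/apiv2/users` for the prefix `/api`; appending the separator before comparing is what enforces the boundary.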

146 Commits
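
Note on a353515271 (virtual scroll support): the commit message names three scrolling scenarios (no change, content appended, content replaced) and a merge step with deduplication. The sketch below illustrates that classification on plain lists of item strings. It is a simplification under stated assumptions: the commit describes JavaScript-based detection and tree-based DOM merging, whereas `classify_scroll`, `merge_chunks`, and `ScrollOutcome` here are hypothetical names operating on already-extracted text.

```python
from __future__ import annotations

from enum import Enum


class ScrollOutcome(Enum):
    NO_CHANGE = "no_change"
    APPENDED = "appended"
    REPLACED = "replaced"


def classify_scroll(before: list[str], after: list[str]) -> ScrollOutcome:
    """Compare the visible items before and after one scroll step."""
    if after == before:
        return ScrollOutcome.NO_CHANGE
    # Appended: every old item is still present, in order, with new ones after.
    if len(after) > len(before) and after[: len(before)] == before:
        return ScrollOutcome.APPENDED
    # Anything else is treated as replaced (DOM recycling swapped items out).
    return ScrollOutcome.REPLACED


def merge_chunks(chunks: list[list[str]]) -> list[str]:
    """Merge captured chunks in order, keeping the first copy of each item."""
    seen: set[str] = set()
    merged: list[str] = []
    for chunk in chunks:
        for item in chunk:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged
```

When a step is classified as replaced, the previous chunk has to be captured before scrolling further, since its items are no longer in the page; `merge_chunks` then removes the overlap between consecutive captures.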