crawl4ai

Author	SHA1	Message	Date
UncleCode	a353515271	feat: Add virtual scroll support for modern web scraping Add comprehensive virtual scroll handling to capture all content from pages that use DOM recycling techniques (Twitter, Instagram, etc). Key features: - New VirtualScrollConfig class for configuring virtual scroll behavior - Automatic detection of three scrolling scenarios: no change, content appended, content replaced - Intelligent HTML chunk capture and merging with deduplication - 100% content capture from virtual scroll pages - Seamless integration with existing extraction strategies - JavaScript-based detection and capture for performance - Tree-based DOM merging with text-based deduplication Documentation: - Comprehensive guide at docs/md_v2/advanced/virtual-scroll.md - API reference updates in parameters.md and page-interaction.md - Blog article explaining the solution and techniques - Complete examples with local test server Testing: - Full test suite achieving 100% capture of 1000 items - Examples for Twitter timeline, Instagram grid scenarios - Local test server with different scrolling behaviors This enables scraping of modern websites that were previously impossible to fully capture with traditional scrolling techniques.	2025-06-29 20:41:37 +08:00
UncleCode	3f9424e884	Update CHANGELOG	2025-06-03 23:27:31 +08:00
UncleCode	9b5ccac76e	feat(extraction): add RegexExtractionStrategy for pattern-based extraction Add new RegexExtractionStrategy for fast, zero-LLM extraction of common data types: - Built-in patterns for emails, URLs, phones, dates, and more - Support for custom regex patterns - LLM-assisted pattern generation utility - Optimized HTML preprocessing with fit_html field - Enhanced network response body capture Breaking changes: None	2025-05-02 21:15:24 +08:00
UncleCode	ccec40ed17	feat(models): add dedicated tables field to CrawlResult - Add tables field to CrawlResult model while maintaining backward compatibility - Update async_webcrawler.py to extract tables from media and pass to tables field - Update crypto_analysis_example.py to use the new tables field - Add /config/dump examples to demo_docker_api.py - Bump version to 0.6.1	2025-04-24 18:36:25 +08:00
UncleCode	ad4dfb21e1	Remoce "rc1"	2025-04-23 21:00:00 +08:00
UncleCode	c98ffe2130	Update CHANGELOG	2025-04-22 22:36:41 +08:00
UncleCode	4812f08a73	feat(docker): update Docker deployment for v0.6.0 Major updates to Docker deployment infrastructure: - Switch default port to 11235 for all services - Add MCP (Model Context Protocol) support with WebSocket/SSE endpoints - Simplify docker-compose.yml with auto-platform detection - Update documentation with new features and examples - Consolidate configuration and improve resource management BREAKING CHANGE: Default port changed from 8020 to 11235. Update your configurations and deployment scripts accordingly.	2025-04-22 22:35:25 +08:00
UncleCode	0007aea204	Update changelog	2025-04-21 23:21:49 +08:00
UncleCode	5297e362f3	feat(mcp): Implement MCP protocol and enhance server capabilities This commit introduces several significant enhancements to the Crawl4AI Docker deployment: 1. Add MCP Protocol Support: - Implement WebSocket and SSE transport layers for MCP server communication - Create mcp_bridge.py to expose existing API endpoints via MCP protocol - Add comprehensive tests for both socket and SSE transport methods 2. Enhance Docker Server Capabilities: - Add PDF generation endpoint with file saving functionality - Add screenshot capture endpoint with configurable wait time - Implement JavaScript execution endpoint for dynamic page interaction - Add intelligent file path handling for saving generated assets 3. Improve Search and Context Functionality: - Implement syntax-aware code function chunking using AST parsing - Add BM25-based intelligent document search with relevance scoring - Create separate code and documentation context endpoints - Enhance response format with structured results and scores 4. Rename and Fix File Organization: - Fix typo in test_docker_config_gen.py filename - Update import statements and dependencies - Add FileResponse for context endpoints This enhancement significantly improves the machine-to-machine communication capabilities of Crawl4AI, making it more suitable for integration with LLM agents and other automated systems. The CHANGELOG update has been applied successfully, highlighting the key features and improvements made in this release. The commit message provides a detailed explanation of all the changes, which will be helpful for tracking the project's evolution.	2025-04-21 22:22:02 +08:00
UncleCode	7db6b468d9	feat(markdown): add content source selection for markdown generation Adds a new content_source parameter to MarkdownGenerationStrategy that allows selecting which HTML content to use for markdown generation: - cleaned_html (default): uses post-processed HTML - raw_html: uses original webpage HTML - fit_html: uses preprocessed HTML for schema extraction Changes include: - Added content_source parameter to MarkdownGenerationStrategy - Updated AsyncWebCrawler to handle HTML source selection - Added examples and tests for the new feature - Updated documentation with new parameter details BREAKING CHANGE: Renamed cleaned_html parameter to input_html in generate_markdown() method signature to better reflect its generalized purpose	2025-04-17 20:13:53 +08:00
UncleCode	a31d7b86be	feat(changelog): update CHANGELOG for version 0.5.0.post5 with new features, changes, fixes, and breaking changes	2025-03-14 15:26:37 +08:00
UncleCode	f334daa979	feat(deep-crawling): add max_pages and score_threshold parameters for improved crawling control	2025-03-03 21:54:58 +08:00
UncleCode	d024749633	refactor(deep-crawl): add max_pages limit and improve crawl control Add max_pages parameter to all deep crawling strategies to limit total pages crawled. Add score_threshold parameter to BFS/DFS strategies for quality control. Remove legacy parameter handling in AsyncWebCrawler. Improve error handling and logging in crawl strategies. BREAKING CHANGE: Removed support for legacy parameters in AsyncWebCrawler.run_many()	2025-03-03 21:51:11 +08:00
UncleCode	367cd71db9	feat(core): release version 0.5.0 with deep crawling and CLI This major release adds deep crawling capabilities, memory-adaptive dispatcher, multiple crawling strategies, Docker deployment, and a new CLI. It also includes significant improvements to proxy handling, PDF processing, and LLM integration. BREAKING CHANGES: - Add memory-adaptive dispatcher as default for arun_many() - Move max_depth to CrawlerRunConfig - Replace ScrapingMode enum with strategy pattern - Update BrowserContext API - Make model fields optional with defaults - Remove content_filter parameter from CrawlerRunConfig - Remove synchronous WebCrawler and old CLI - Update Docker deployment configuration - Replace FastFilterChain with FilterChain - Change license to Apache 2.0 with attribution clause	2025-02-21 19:55:02 +08:00
UncleCode	45809d1c91	Merge branch 'vr0.4.3b2'	2025-01-22 20:51:46 +08:00
UncleCode	2d69bf2366	refactor(models): rename final_url to redirected_url for consistency Renames the final_url field to redirected_url across all components to maintain consistent terminology throughout the codebase. This change affects: - AsyncCrawlResponse model - AsyncPlaywrightCrawlerStrategy - Documentation and examples No functional changes, purely naming consistency improvement.	2025-01-22 17:14:24 +08:00
UncleCode	16b8d4945b	feat(release): prepare v0.4.3 beta release Prepare the v0.4.3 beta release with major feature additions and improvements: - Add JsonXPathExtractionStrategy and LLMContentFilter to exports - Update version to 0.4.3b1 - Improve documentation for dispatchers and markdown generation - Update development status to Beta - Reorganize changelog format BREAKING CHANGE: Memory threshold in MemoryAdaptiveDispatcher increased to 90% and SemaphoreDispatcher parameter renamed to max_session_permit	2025-01-21 21:03:11 +08:00
UncleCode	d09c611d15	feat(robots): add robots.txt compliance support Add support for checking and respecting robots.txt rules before crawling websites: - Implement RobotsParser class with SQLite caching - Add check_robots_txt parameter to CrawlerRunConfig - Integrate robots.txt checking in AsyncWebCrawler - Update documentation with robots.txt compliance examples - Add tests for robot parser functionality The cache uses WAL mode for better concurrency and has a default TTL of 7 days.	2025-01-21 17:54:13 +08:00
UncleCode	9247877037	feat(proxy): add proxy configuration support to CrawlerRunConfig Add proxy_config parameter to CrawlerRunConfig to support dynamic proxy configuration per crawl request. This enables users to specify different proxy settings for each crawl operation without modifying the browser config. - Added proxy_config parameter to CrawlerRunConfig - Updated BrowserManager to apply proxy settings from CrawlerRunConfig - Updated proxy-security documentation with new usage examples	2025-01-20 22:14:05 +08:00
UncleCode	2cec527a22	feat(extraction): add LLM-powered schema generation utility Adds new static method generate_schema() to JsonElementExtractionStrategy classes that can automatically generate extraction schemas using LLM (OpenAI or Ollama). This provides a convenient way to bootstrap extraction schemas while maintaining the performance benefits of selector-based extraction. Key changes: - Added generate_schema() static method to base extraction strategy - Added support for both CSS and XPath schema generation - Updated documentation with examples and best practices - Added new prompt templates for schema generation	2025-01-20 17:28:00 +08:00
UncleCode	3427ead8b8	Update CHANGELOG	2025-01-06 15:13:43 +08:00
UncleCode	171ce25ba6	Fixe typo in CHANGELOG	2024-12-31 19:49:00 +08:00
UncleCode	553a4622bf	chore: prepare for version 0.4.24	2024-12-31 19:18:36 +08:00
UncleCode	849765712f	Enhance Crawl4AI with new features and documentation - Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags. - Introduced Managed Browsers for enhanced crawling experience. - Updated documentation for clearer navigation on configuration. - Changed 'text_only' to 'text_mode' in configuration and methods. - Improved performance and relevance in content filtering strategies.	2024-12-19 21:02:29 +08:00
UncleCode	393bb911c0	Enhance crawler strategies with new features - ReImplemented JsonXPathExtractionStrategy for enhanced JSON data extraction. - Updated existing extraction strategies for better performance. - Improved handling of response status codes during crawls.	2024-12-17 22:40:10 +08:00
UncleCode	c51e901f68	feat: Enhance AsyncPlaywrightCrawlerStrategy with text-only and light modes, dynamic viewport adjustment, and session management ### New Features: - Text-Only Mode: Added support for text-only crawling by disabling images, JavaScript, GPU, and other non-essential features. - Light Mode: Optimized browser settings to reduce resource usage and improve efficiency during crawling. - Dynamic Viewport Adjustment: Automatically adjusts viewport dimensions based on content size, ensuring accurate rendering and scaling. - Full Page Scanning: Introduced a feature to scroll and capture dynamic content for pages with infinite scroll or lazy-loading elements. - Session Management: Added `create_session` method for creating and managing browser sessions with unique IDs. ### Improvements: - Unified viewport handling across contexts by dynamically setting dimensions using `self.viewport_width` and `self.viewport_height`. - Enhanced logging and error handling for viewport adjustments, page scanning, and content evaluation. - Reduced resource usage with additional browser flags for both `light_mode` and `text_only` configurations. - Improved handling of cookies, headers, and proxies in session creation. ### Refactoring: - Removed hardcoded viewport dimensions and replaced them with dynamic configurations. - Cleaned up unused and commented-out code for better readability and maintainability. - Introduced defaults for frequently used parameters like `delay_before_return_html`. ### Fixes: - Resolved potential inconsistencies in viewport handling. - Improved robustness of content loading and dynamic adjustments to avoid failures and timeouts. ### Docs Update: - Updated schema usage in `quickstart_async.py` example: - Changed `OpenAIModelFee.schema()` to `OpenAIModelFee.model_json_schema()` for compatibility. - Enhanced LLM extraction instruction documentation. This commit introduces significant enhancements to improve efficiency, flexibility, and reliability of the crawler strategy.	2024-12-08 20:04:44 +08:00
unclecode	293f299c08	Add PruningContentFilter with unit tests and update documentation - Introduced the PruningContentFilter for better content relevance. - Implemented comprehensive unit tests for verification of functionality. - Enhanced existing BM25ContentFilter tests for edge case coverage. - Updated documentation to include usage examples for new filter.	2024-12-01 19:17:33 +08:00
UncleCode	f9c98a377d	Enhance Docker support and improve installation process - Added new Docker commands for platform-specific builds. - Updated README with comprehensive installation and setup instructions. - Introduced `post_install` method in setup script for automation. - Refined migration processes with enhanced error logging. - Bump version to 0.3.746 and updated dependencies.	2024-11-29 20:52:51 +08:00
UncleCode	3ff0b0b2c4	feat: update changelog for version 0.3.743 with new features, improvements, and contributor acknowledgments	2024-11-28 12:48:07 +08:00
UncleCode	a59c107b23	Update changelog for 0.3.74	2024-11-17 18:42:43 +08:00
UncleCode	ae7ebc0bd8	chore: update .gitignore and enhance changelog with major feature additions and examples	2024-11-15 20:16:13 +08:00
UncleCode	7f1ae5adcf	Update changelog	2024-11-14 22:51:51 +08:00
UncleCode	c38ac29edb	perf(crawler): major performance improvements & raw HTML support - Switch to lxml parser (~4x speedup) - Add raw HTML & local file crawling support - Fix cache headers & async cleanup - Add browser process monitoring - Optimize BeautifulSoup operations - Pre-compile regex patterns Breaking: Raw HTML handling requires new URL prefixes Fixes: #256, #253	2024-11-13 19:40:40 +08:00
UncleCode	61b93ebf36	Update change log	2024-11-13 15:38:30 +08:00
UncleCode	67a23c3182	feat(core): Release v0.3.73 with Browser Takeover and Docker Support Major changes: - Add browser takeover feature using CDP for authentic browsing - Implement Docker support with full API server documentation - Enhance Mockdown with tag preservation system - Improve parallel crawling performance This release focuses on authenticity and scalability, introducing the ability to use users' own browsers while providing containerized deployment options. Breaking changes include modified browser handling and API response structure. See CHANGELOG.md for detailed migration guide.	2024-11-05 20:04:18 +08:00
unclecode	fbdf870fbf	Update CHANGELOG	2024-11-04 14:10:27 +08:00
unclecode	54d5a3a259	Improved database management and error handling, updated README instructions, refined .gitignore, enhanced async web crawling capabilities, and updated dependencies.	2024-11-04 13:22:13 +08:00
UncleCode	bcfe83f702	feat: enhance crawler with overlay removal and improved screenshot capabilities • Add smart overlay removal system for handling popups and modals • Improve screenshot functionality with configurable timing controls • Implement URL normalization and enhanced link processing • Add custom base directory support for cache storage • Refine external content filtering and social media domain handling This commit significantly improves the crawler's ability to handle modern websites by automatically removing intrusive overlays and providing better screenshot capabilities. URL handling is now more robust with proper normalization and duplicate detection. The cache system is more flexible with customizable base directory support. Breaking changes: None Issue numbers: None	2024-10-24 20:22:47 +08:00
UncleCode	60ba131ac8	[v0.3.72] Enhance content extraction and proxy support - Add ContentCleaningStrategy for improved content extraction - Implement advanced proxy configuration with authentication - Enhance image source detection and handling - Add fit_markdown and fit_html for refined content output - Improve external link and image handling flexibility	2024-10-22 20:19:22 +08:00
UncleCode	04d16e6d2b	Fix Base64 image parsing in WebScrappingStrategy (issue 182) - Add support for extracting Base64 encoded images - Improve image format detection to include Base64 images - Enhance compatibility with locally saved HTML files using Base64 image encoding	2024-10-20 19:25:25 +08:00
UncleCode	6ec4cb33ca	Enhance Markdown generation and external content control - Integrate customized html2text library for flexible Markdown output - Add options to exclude external links and images - Improve content scraping efficiency and error handling - Update AsyncPlaywrightCrawlerStrategy for faster closing - Enhance CosineStrategy with generic embedding model loading	2024-10-20 18:56:58 +08:00
UncleCode	e7cd8a1c2d	Update Changelog	2024-10-19 18:37:12 +08:00
UncleCode	b8147b64e0	chore: Bump version to 0.3.71 and improve error handling - Update version number to 0.3.71 - Add sleep_on_close option to AsyncPlaywrightCrawlerStrategy - Enhance context creation with additional options - Improve error message formatting and visibility - Update quickstart documentation	2024-10-18 13:31:12 +08:00
UncleCode	768aa06ceb	feat(crawler): Enhance stealth and flexibility, improve error handling - Implement playwright_stealth for better bot detection avoidance - Add user simulation and navigator override options - Improve iframe processing and browser selection - Enhance error reporting and debugging capabilities - Optimize image processing and parallel crawling - Add new example for user simulation feature - Added support for including links in Markdown content, by definin g a new flag `include_links_on_markdown` in `crawl` method.	2024-10-17 21:37:48 +08:00
unclecode	2b73bdf6b0	Update changelog	2024-10-14 21:04:02 +08:00
unclecode	68e9144ce3	feat: Enhance crawling control and LLM extraction flexibility - Add before_retrieve_html hook and delay_before_return_html option - Implement flexible page_timeout for smart_wait function - Support extra_args and custom headers in LLM extraction - Allow arbitrary kwargs in AsyncWebCrawler initialization - Improve perform_completion_with_backoff for custom API calls - Update examples with new features and diverse LLM providers	2024-10-12 14:48:22 +08:00
unclecode	9b2b267820	CHANGELOG UPDATE	2024-10-12 13:42:56 +08:00
unclecode	ff3524d9b1	feat(v0.3.6): Add screenshot capture, delayed content, and custom timeouts - Implement screenshot capture functionality - Add delayed content retrieval method - Introduce custom page timeout parameter - Enhance LLM support with multiple providers - Improve database schema auto-updates - Optimize image processing in WebScrappingStrategy - Update error handling and logging - Expand examples in quickstart_async.py	2024-10-12 13:42:42 +08:00
unclecode	4750810a67	Enhance AsyncWebCrawler with smart waiting and screenshot capabilities - Implement smart_wait function in AsyncPlaywrightCrawlerStrategy - Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler - Improve error handling and timeout management in crawling process - Fix typo in CrawlResult model (responser_headers -> response_headers) - Update .gitignore to exclude additional files - Adjust import path in test_basic_crawling.py	2024-10-02 17:34:56 +08:00
unclecode	e5e6a34e80	## [v0.2.77] - 2024-08-04 Significant improvements in text processing and performance: - 🚀 Dependency reduction: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy. - 🤖 Transformer upgrade: Implemented text sequence classification using a transformer model for labeling text chunks. - ⚡ Performance enhancement: Improved model loading speed due to removal of spaCy dependency. - 🔧 Future-proofing: Laid groundwork for potential complete removal of spaCy dependency in future versions. These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.	2024-08-04 14:54:18 +08:00

1 2

67 Commits