From 553a4622bf750de67264eec984c291489f914e17 Mon Sep 17 00:00:00 2001
From: UncleCode
Date: Tue, 31 Dec 2024 19:18:36 +0800
Subject: [PATCH] chore: prepare for version 0.4.24

---
 CHANGELOG.md            | 71 +++++++++++++++++++++++++++++++++++++++--
 README.md               | 24 ++++++--------
 crawl4ai/__version__.py |  2 +-
 3 files changed, 80 insertions(+), 17 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index a3bc68de..ccd1e098 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,6 +1,73 @@
 # Changelog
 
-## [0.4.1] December 8, 2024
+All notable changes to Crawl4AI will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [0.4.24] - 2024-12-31
+
+### Added
+- **Browser and SSL Handling**
+  - SSL certificate validation options in extraction strategies
+  - Custom certificate paths support
+  - Configurable certificate validation skipping
+  - Enhanced response status code handling with retry logic
+
+- **Content Processing**
+  - New content filtering system with regex support
+  - Advanced chunking strategies for large content
+  - Memory-efficient parallel processing
+  - Configurable chunk size optimization
+
+- **JSON Extraction**
+  - Complex JSONPath expression support
+  - JSON-LD and Microdata extraction
+  - RDFa parsing capabilities
+  - Advanced data transformation pipeline
+
+- **Field Types**
+  - New field types: `computed`, `conditional`, `aggregate`, `template`
+  - Field inheritance system
+  - Reusable field definitions
+  - Custom validation rules
+
+### Changed
+- **Performance**
+  - Optimized selector compilation with caching
+  - Improved HTML parsing efficiency
+  - Enhanced memory management for large documents
+  - Batch processing optimizations
+
+- **Error Handling**
+  - More detailed error messages and categorization
+  - Enhanced debugging capabilities
+  - Improved performance metrics tracking
+  - Better error recovery mechanisms
+
+### Deprecated
+- Old field computation method using `eval`
+- Direct browser manipulation without proper SSL handling
+- Simple text-based content filtering
+
+### Removed
+- Legacy extraction patterns without proper error handling
+- Unsafe eval-based field computation
+- Direct DOM manipulation without sanitization
+
+### Fixed
+- Memory leaks in large document processing
+- SSL certificate validation issues
+- Incorrect handling of nested JSON structures
+- Performance bottlenecks in parallel processing
+
+### Security
+- Improved input validation and sanitization
+- Safe expression evaluation system
+- Enhanced resource protection
+- Rate limiting implementation
+
+## [0.4.1] - 2024-12-08
 
 ### **File: `crawl4ai/async_crawler_strategy.py`**
 
@@ -980,6 +1047,6 @@ These changes focus on refining the existing codebase, resulting in a more stabl
 - Maintaining the semantic context of inline tags (e.g., abbreviation, DEL, INS) for improved LLM-friendliness.
 - Updated Dockerfile to ensure compatibility across multiple platforms (Hopefully!).
 
-## [0.2.4] - 2024-06-17
+## [v0.2.4] - 2024-06-17
 ### Fixed
 - Fix issue #22: Use MD5 hash for caching HTML files to handle long URLs
diff --git a/README.md b/README.md
index 36ee81a9..2f807eec 100644
--- a/README.md
+++ b/README.md
@@ -11,9 +11,9 @@
 
 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
 
-[✨ Check out latest update v0.4.2](#-recent-updates)
+[✨ Check out latest update v0.4.24](#-recent-updates)
 
-🎉 **Version 0.4.2 is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)
+🎉 **Version 0.4.24 is out!** Major improvements in extraction strategies with enhanced JSON handling, SSL security, and Amazon product extraction. Plus, a completely revamped content filtering system! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)
 
 ## 🧐 Why Crawl4AI?
 
@@ -626,19 +626,15 @@ async def test_news_crawl():
 
 ## ✨ Recent Updates
 
-- 🔧 **Configurable Crawlers and Browsers**: Simplified crawling with `BrowserConfig` and `CrawlerRunConfig`, making setups cleaner and more scalable.
-- 🔐 **Session Management Enhancements**: Import/export local storage for personalized crawling with seamless session reuse.
-- 📸 **Supercharged Screenshots**: Take lightning-fast, full-page screenshots of very long pages.
-- 📜 **Full-Page PDF Export**: Convert any web page into a PDF for easy sharing or archiving.
-- 🖼️ **Lazy Load Handling**: Improved support for websites with lazy-loaded images. The crawler now waits for all images to fully load, ensuring no content is missed.
-- ⚡ **Text-Only Mode**: New mode for fast, lightweight crawling. Disables images, JavaScript, and GPU rendering, improving speed by 3-4x for text-focused crawls.
-- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to fit page content, ensuring accurate rendering and capturing of all elements.
-- 🔄 **Full-Page Scanning**: Added scrolling support for pages with infinite scroll or dynamic content loading. Ensures every part of the page is captured.
-- 🧑‍💻 **Session Reuse**: Introduced `create_session` for efficient crawling by reusing the same browser session across multiple requests.
-- 🌟 **Light Mode**: Optimized browser performance by disabling unnecessary features like extensions, background timers, and sync processes.
+- 🔒 **Enhanced SSL & Security**: New SSL certificate handling with custom paths and validation options for secure crawling
+- 🔍 **Smart Content Filtering**: Advanced filtering system with regex support and efficient chunking strategies
+- 📦 **Improved JSON Extraction**: Support for complex JSONPath, JSON-LD, and Microdata extraction
+- 🏗️ **New Field Types**: Added `computed`, `conditional`, `aggregate`, and `template` field types
+- ⚡ **Performance Boost**: Optimized caching, parallel processing, and memory management
+- 🐛 **Better Error Handling**: Enhanced debugging capabilities with detailed error tracking
+- 🔐 **Security Features**: Improved input validation and safe expression evaluation
 
-
-Read the full details of this release in our [0.4.2 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.2.md).
+Read the full details of this release in our [0.4.24 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
 
 ## 📖 Documentation & Roadmap
 
diff --git a/crawl4ai/__version__.py b/crawl4ai/__version__.py
index 38af4878..73e5c025 100644
--- a/crawl4ai/__version__.py
+++ b/crawl4ai/__version__.py
@@ -1,2 +1,2 @@
 # crawl4ai/_version.py
-__version__ = "0.4.23"
+__version__ = "0.4.24"
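The changelog's "Safe expression evaluation system", replacing the deprecated `eval`-based field computation, can be illustrated with an AST whitelist. This is a minimal sketch of the general technique, not crawl4ai's actual implementation; `safe_eval` and the supported operator set are assumptions for illustration:

```python
import ast
import operator

# Whitelist of permitted binary operators; anything outside it is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expr: str, variables: dict):
    """Evaluate a simple arithmetic expression without calling eval().

    Only numeric constants, known variable names, and whitelisted binary
    operators are allowed, so expressions like __import__('os') are rejected.
    (Hypothetical helper, not a crawl4ai API.)
    """
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id in variables:
            return variables[node.id]
        raise ValueError(f"Disallowed expression node: {ast.dump(node)}")

    return _eval(ast.parse(expr, mode="eval"))

# e.g. a "computed" field deriving a total from previously extracted fields
print(safe_eval("price * qty + shipping", {"price": 9.5, "qty": 3, "shipping": 4.0}))  # 32.5
```

Any node type outside the whitelist (function calls, attribute access, subscripts) raises `ValueError` instead of executing, which is the property the `eval` deprecation is after.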
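The SSL items under "Added" (custom certificate paths, configurable validation skipping) describe standard TLS client-context configuration. A stdlib sketch of the two modes, with generic names since crawl4ai's own option names are not shown in this patch:

```python
import ssl
from typing import Optional

def make_ssl_context(ca_path: Optional[str] = None,
                     skip_verify: bool = False) -> ssl.SSLContext:
    """Build a client TLS context: strict by default, optionally relaxed.

    ca_path: custom CA bundle file (the "custom certificate paths" case);
    None falls back to the system trust store.
    skip_verify: disable validation entirely (the "configurable certificate
    validation skipping" case); only appropriate for trusted test targets.
    (Hypothetical helper, not a crawl4ai API.)
    """
    if skip_verify:
        ctx = ssl.create_default_context()
        ctx.check_hostname = False          # must be cleared before CERT_NONE
        ctx.verify_mode = ssl.CERT_NONE
        return ctx
    return ssl.create_default_context(cafile=ca_path)

strict = make_ssl_context()
relaxed = make_ssl_context(skip_verify=True)
print(strict.verify_mode == ssl.CERT_REQUIRED)   # True
print(relaxed.verify_mode == ssl.CERT_NONE)      # True
```

Note the ordering in the relaxed branch: `check_hostname` must be disabled before `verify_mode` is set to `CERT_NONE`, or the `ssl` module raises `ValueError`.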
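The "advanced chunking strategies" and "memory-efficient parallel processing" items amount to streaming fixed-size windows over large content rather than materializing it all at once. A generator sketch under that assumption (`chunk_text` is an illustrative name, not the library's):

```python
from typing import Iterator

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> Iterator[str]:
    """Yield fixed-size chunks lazily.

    Consecutive chunks overlap by `overlap` characters so content that
    straddles a boundary appears in both neighbours; because this is a
    generator, memory use stays proportional to one chunk, not the document.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        yield text[start:start + chunk_size]

print(list(chunk_text("abcdefghij", chunk_size=4, overlap=2)))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The "configurable chunk size optimization" mentioned in the changelog would then be a matter of tuning `chunk_size` and `overlap` per extraction strategy.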
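The new field types (`computed`, `conditional`, `aggregate`, `template`) suggest a declarative schema resolved against each extracted record. The following miniature shows one plausible shape; the spec layout, keys, and `resolve_field` name are hypothetical, not crawl4ai's actual schema:

```python
def resolve_field(spec: dict, record: dict):
    """Resolve one declarative field spec against an extracted record.

    The four kinds mirror the changelog's field-type names; everything
    else here is an illustrative assumption.
    """
    kind = spec["type"]
    if kind == "computed":      # derive a value from other fields
        return spec["func"](record)
    if kind == "conditional":   # pick a value based on a predicate
        return spec["then"] if spec["when"](record) else spec["else"]
    if kind == "aggregate":     # fold a list field into one value
        return spec["func"](record[spec["source"]])
    if kind == "template":      # fill a format string from the record
        return spec["pattern"].format(**record)
    raise ValueError(f"Unknown field type: {kind}")

record = {"price": 20.0, "qty": 2, "ratings": [4, 5, 3]}
total = resolve_field({"type": "computed", "func": lambda r: r["price"] * r["qty"]}, record)
label = resolve_field({"type": "conditional", "when": lambda r: r["qty"] > 1,
                       "then": "bulk", "else": "single"}, record)
avg = resolve_field({"type": "aggregate", "source": "ratings",
                     "func": lambda xs: sum(xs) / len(xs)}, record)
print(total, label, avg)  # 40.0 bulk 4.0
```

Passing callables instead of expression strings is what makes this safe; combined with the AST-whitelist approach above the strings themselves could also be accepted without `eval`.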