chore: prepare for version 0.4.24

This commit is contained in:
UncleCode
2024-12-31 19:18:36 +08:00
parent 6f81ef006d
commit 553a4622bf
3 changed files with 80 additions and 17 deletions

View File

@@ -1,6 +1,73 @@
# Changelog
## [0.4.1] December 8, 2024
All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.4.24] - 2024-12-31
### Added
- **Browser and SSL Handling**
- SSL certificate validation options in extraction strategies
- Custom certificate paths support
- Configurable certificate validation skipping
- Enhanced response status code handling with retry logic
- **Content Processing**
- New content filtering system with regex support
- Advanced chunking strategies for large content
- Memory-efficient parallel processing
- Configurable chunk size optimization
- **JSON Extraction**
- Complex JSONPath expression support
- JSON-LD and Microdata extraction
- RDFa parsing capabilities
- Advanced data transformation pipeline
- **Field Types**
- New field types: `computed`, `conditional`, `aggregate`, `template`
- Field inheritance system
- Reusable field definitions
- Custom validation rules
### Changed
- **Performance**
- Optimized selector compilation with caching
- Improved HTML parsing efficiency
- Enhanced memory management for large documents
- Batch processing optimizations
- **Error Handling**
- More detailed error messages and categorization
- Enhanced debugging capabilities
- Improved performance metrics tracking
- Better error recovery mechanisms
### Deprecated
- Old field computation method using `eval`
- Direct browser manipulation without proper SSL handling
- Simple text-based content filtering
### Removed
- Legacy extraction patterns without proper error handling
- Unsafe eval-based field computation
- Direct DOM manipulation without sanitization
### Fixed
- Memory leaks in large document processing
- SSL certificate validation issues
- Incorrect handling of nested JSON structures
- Performance bottlenecks in parallel processing
### Security
- Improved input validation and sanitization
- Safe expression evaluation system
- Enhanced resource protection
- Rate limiting implementation
## [0.4.1] - 2024-12-08
### **File: `crawl4ai/async_crawler_strategy.py`**
@@ -980,6 +1047,6 @@ These changes focus on refining the existing codebase, resulting in a more stabl
- Maintaining the semantic context of inline tags (e.g., abbreviation, DEL, INS) for improved LLM-friendliness.
- Updated Dockerfile to ensure compatibility across multiple platforms (Hopefully!).
## [0.2.4] - 2024-06-17
## [v0.2.4] - 2024-06-17
### Fixed
- Fix issue #22: Use MD5 hash for caching HTML files to handle long URLs

View File

@@ -11,9 +11,9 @@
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
[✨ Check out latest update v0.4.2](#-recent-updates)
[✨ Check out latest update v0.4.24](#-recent-updates)
🎉 **Version 0.4.2 is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)
🎉 **Version 0.4.24 is out!** Major improvements in extraction strategies with enhanced JSON handling, SSL security, and Amazon product extraction. Plus, a completely revamped content filtering system! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)
## 🧐 Why Crawl4AI?
@@ -626,19 +626,15 @@ async def test_news_crawl():
## ✨ Recent Updates
- 🔧 **Configurable Crawlers and Browsers**: Simplified crawling with `BrowserConfig` and `CrawlerRunConfig`, making setups cleaner and more scalable.
- 🔐 **Session Management Enhancements**: Import/export local storage for personalized crawling with seamless session reuse.
- 📸 **Supercharged Screenshots**: Take lightning-fast, full-page screenshots of very long pages.
- 📜 **Full-Page PDF Export**: Convert any web page into a PDF for easy sharing or archiving.
- 🖼️ **Lazy Load Handling**: Improved support for websites with lazy-loaded images. The crawler now waits for all images to fully load, ensuring no content is missed.
- ⚡ **Text-Only Mode**: New mode for fast, lightweight crawling. Disables images, JavaScript, and GPU rendering, improving speed by 3-4x for text-focused crawls.
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to fit page content, ensuring accurate rendering and capturing of all elements.
- 🔄 **Full-Page Scanning**: Added scrolling support for pages with infinite scroll or dynamic content loading. Ensures every part of the page is captured.
- 🧑‍💻 **Session Reuse**: Introduced `create_session` for efficient crawling by reusing the same browser session across multiple requests.
- 🌟 **Light Mode**: Optimized browser performance by disabling unnecessary features like extensions, background timers, and sync processes.
- 🔒 **Enhanced SSL & Security**: New SSL certificate handling with custom paths and validation options for secure crawling
- 🔍 **Smart Content Filtering**: Advanced filtering system with regex support and efficient chunking strategies
- 📦 **Improved JSON Extraction**: Support for complex JSONPath, JSON-LD, and Microdata extraction
- 🏗️ **New Field Types**: Added `computed`, `conditional`, `aggregate`, and `template` field types
-**Performance Boost**: Optimized caching, parallel processing, and memory management
- 🐛 **Better Error Handling**: Enhanced debugging capabilities with detailed error tracking
- 🔐 **Security Features**: Improved input validation and safe expression evaluation
Read the full details of this release in our [0.4.2 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.2.md).
Read the full details of this release in our [0.4.24 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
## 📖 Documentation & Roadmap

View File

@@ -1,2 +1,2 @@
# crawl4ai/_version.py
__version__ = "0.4.23"
__version__ = "0.4.24"