chore: prepare for version 0.4.24
This commit is contained in:
71
CHANGELOG.md
71
CHANGELOG.md
@@ -1,6 +1,73 @@
|
||||
# Changelog
|
||||
|
||||
## [0.4.1] December 8, 2024
|
||||
All notable changes to Crawl4AI will be documented in this file.
|
||||
|
||||
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
||||
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
||||
|
||||
## [0.4.24] - 2024-12-31
|
||||
|
||||
### Added
|
||||
- **Browser and SSL Handling**
|
||||
- SSL certificate validation options in extraction strategies
|
||||
- Custom certificate paths support
|
||||
- Configurable certificate validation skipping
|
||||
- Enhanced response status code handling with retry logic
|
||||
|
||||
- **Content Processing**
|
||||
- New content filtering system with regex support
|
||||
- Advanced chunking strategies for large content
|
||||
- Memory-efficient parallel processing
|
||||
- Configurable chunk size optimization
|
||||
|
||||
- **JSON Extraction**
|
||||
- Complex JSONPath expression support
|
||||
- JSON-LD and Microdata extraction
|
||||
- RDFa parsing capabilities
|
||||
- Advanced data transformation pipeline
|
||||
|
||||
- **Field Types**
|
||||
- New field types: `computed`, `conditional`, `aggregate`, `template`
|
||||
- Field inheritance system
|
||||
- Reusable field definitions
|
||||
- Custom validation rules
|
||||
|
||||
### Changed
|
||||
- **Performance**
|
||||
- Optimized selector compilation with caching
|
||||
- Improved HTML parsing efficiency
|
||||
- Enhanced memory management for large documents
|
||||
- Batch processing optimizations
|
||||
|
||||
- **Error Handling**
|
||||
- More detailed error messages and categorization
|
||||
- Enhanced debugging capabilities
|
||||
- Improved performance metrics tracking
|
||||
- Better error recovery mechanisms
|
||||
|
||||
### Deprecated
|
||||
- Old field computation method using `eval`
|
||||
- Direct browser manipulation without proper SSL handling
|
||||
- Simple text-based content filtering
|
||||
|
||||
### Removed
|
||||
- Legacy extraction patterns without proper error handling
|
||||
- Unsafe eval-based field computation
|
||||
- Direct DOM manipulation without sanitization
|
||||
|
||||
### Fixed
|
||||
- Memory leaks in large document processing
|
||||
- SSL certificate validation issues
|
||||
- Incorrect handling of nested JSON structures
|
||||
- Performance bottlenecks in parallel processing
|
||||
|
||||
### Security
|
||||
- Improved input validation and sanitization
|
||||
- Safe expression evaluation system
|
||||
- Enhanced resource protection
|
||||
- Rate limiting implementation
|
||||
|
||||
## [0.4.1] - 2024-12-08
|
||||
|
||||
### **File: `crawl4ai/async_crawler_strategy.py`**
|
||||
|
||||
@@ -980,6 +1047,6 @@ These changes focus on refining the existing codebase, resulting in a more stabl
|
||||
- Maintaining the semantic context of inline tags (e.g., abbreviation, DEL, INS) for improved LLM-friendliness.
|
||||
- Updated Dockerfile to ensure compatibility across multiple platforms (Hopefully!).
|
||||
|
||||
## [0.2.4] - 2024-06-17
|
||||
## [v0.2.4] - 2024-06-17
|
||||
### Fixed
|
||||
- Fix issue #22: Use MD5 hash for caching HTML files to handle long URLs
|
||||
|
||||
24
README.md
24
README.md
@@ -11,9 +11,9 @@
|
||||
|
||||
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
|
||||
|
||||
[✨ Check out latest update v0.4.2](#-recent-updates)
|
||||
[✨ Check out latest update v0.4.24](#-recent-updates)
|
||||
|
||||
🎉 **Version 0.4.2 is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)
|
||||
🎉 **Version 0.4.24 is out!** Major improvements in extraction strategies with enhanced JSON handling, SSL security, and Amazon product extraction. Plus, a completely revamped content filtering system! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)
|
||||
|
||||
## 🧐 Why Crawl4AI?
|
||||
|
||||
@@ -626,19 +626,15 @@ async def test_news_crawl():
|
||||
|
||||
## ✨ Recent Updates
|
||||
|
||||
- 🔧 **Configurable Crawlers and Browsers**: Simplified crawling with `BrowserConfig` and `CrawlerRunConfig`, making setups cleaner and more scalable.
|
||||
- 🔐 **Session Management Enhancements**: Import/export local storage for personalized crawling with seamless session reuse.
|
||||
- 📸 **Supercharged Screenshots**: Take lightning-fast, full-page screenshots of very long pages.
|
||||
- 📜 **Full-Page PDF Export**: Convert any web page into a PDF for easy sharing or archiving.
|
||||
- 🖼️ **Lazy Load Handling**: Improved support for websites with lazy-loaded images. The crawler now waits for all images to fully load, ensuring no content is missed.
|
||||
- ⚡ **Text-Only Mode**: New mode for fast, lightweight crawling. Disables images, JavaScript, and GPU rendering, improving speed by 3-4x for text-focused crawls.
|
||||
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to fit page content, ensuring accurate rendering and capturing of all elements.
|
||||
- 🔄 **Full-Page Scanning**: Added scrolling support for pages with infinite scroll or dynamic content loading. Ensures every part of the page is captured.
|
||||
- 🧑💻 **Session Reuse**: Introduced `create_session` for efficient crawling by reusing the same browser session across multiple requests.
|
||||
- 🌟 **Light Mode**: Optimized browser performance by disabling unnecessary features like extensions, background timers, and sync processes.
|
||||
- 🔒 **Enhanced SSL & Security**: New SSL certificate handling with custom paths and validation options for secure crawling
|
||||
- 🔍 **Smart Content Filtering**: Advanced filtering system with regex support and efficient chunking strategies
|
||||
- 📦 **Improved JSON Extraction**: Support for complex JSONPath, JSON-LD, and Microdata extraction
|
||||
- 🏗️ **New Field Types**: Added `computed`, `conditional`, `aggregate`, and `template` field types
|
||||
- ⚡ **Performance Boost**: Optimized caching, parallel processing, and memory management
|
||||
- 🐛 **Better Error Handling**: Enhanced debugging capabilities with detailed error tracking
|
||||
- 🔐 **Security Features**: Improved input validation and safe expression evaluation
|
||||
|
||||
|
||||
Read the full details of this release in our [0.4.2 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.2.md).
|
||||
Read the full details of this release in our [0.4.24 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
|
||||
|
||||
## 📖 Documentation & Roadmap
|
||||
|
||||
|
||||
@@ -1,2 +1,2 @@
|
||||
# crawl4ai/_version.py
|
||||
__version__ = "0.4.23"
|
||||
__version__ = "0.4.24"
|
||||
|
||||
Reference in New Issue
Block a user