CHANGELOG UPDATE

unclecode
2024-10-12 13:42:56 +08:00
parent ff3524d9b1
commit 9b2b267820

@@ -1,37 +1,51 @@
 # Changelog
-## [0.3.6] - 2024-10-12
-### Added
-- New `.tests/` directory added to `.gitignore`
-- Screenshot functionality:
-  - Added `screenshot` column to the database schema
-  - Implemented `take_screenshot` method in `AsyncPlaywrightCrawlerStrategy`
-  - Added option to capture screenshots when crawling
-- Delayed content retrieval:
-  - New `get_delayed_content` method in `AsyncCrawlResponse`
-- Database schema updates:
-  - Auto-update mechanism for database schema
-  - New columns: 'media', 'links', 'metadata', 'screenshot'
-- LLM extraction examples in `quickstart_async.py`:
-  - Support for OpenAI, Hugging Face, and Ollama models
+## [v0.3.6] - 2024-10-12
+### 1. Screenshot Capture
+- **What's new**: Added ability to capture screenshots during crawling.
+- **Why it matters**: You can now visually verify the content of crawled pages, which is useful for debugging and content verification.
+- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
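The `screenshot=True` option named in the new "Screenshot Capture" entry could be used roughly as below. This is a minimal sketch, not the project's own example: it assumes `AsyncWebCrawler` is the crawler class exported by `crawl4ai` and that the capture is exposed as a base64-encoded string on `result.screenshot`, neither of which is stated in the changelog. The helper names `decode_screenshot` and `crawl_with_screenshot` are hypothetical.

```python
import asyncio
import base64
from pathlib import Path


def decode_screenshot(encoded: str) -> bytes:
    """Turn a base64-encoded screenshot string into raw image bytes."""
    return base64.b64decode(encoded)


async def crawl_with_screenshot(url: str, out_path: str = "page.png") -> bool:
    """Crawl `url` with screenshots enabled; return True if an image was saved."""
    # Imported lazily so decode_screenshot works without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, screenshot=True)

    # Assumption: the capture lives in `result.screenshot` as base64 text.
    encoded = getattr(result, "screenshot", None)
    if not encoded:
        return False
    Path(out_path).write_bytes(decode_screenshot(encoded))
    return True
```

With crawl4ai installed, `asyncio.run(crawl_with_screenshot("https://example.com"))` would run the crawl and write `page.png` if a capture came back.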
-### Changed
-- Updated version number to 0.3.6 in `__init__.py`
-- Improved error handling and logging in various components
-- Enhanced `WebScrappingStrategy` to handle image processing more efficiently
-- Modified `AsyncPlaywrightCrawlerStrategy` to support custom timeout values
+### 2. Delayed Content Retrieval
+- **What's new**: Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
+- **Why it matters**: Allows you to retrieve content after a specified delay, useful for pages that load content dynamically.
+- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
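The idea behind "Delayed Content Retrieval" (wait, then read the page again) can be illustrated without a browser. The sketch below is not crawl4ai code: `FakePage` is a hypothetical stand-in for a page that fills in its content shortly after load, and `get_delayed_content` here is a free function that borrows the changelog's name only for readability. Whether the library's method must be awaited is not stated in the changelog.

```python
import asyncio


class FakePage:
    """Hypothetical stand-in for a page whose content changes after load."""

    def __init__(self) -> None:
        self._html = "<div id='app'>loading</div>"

    async def finish_loading(self, after: float) -> None:
        # Simulates client-side rendering that completes `after` seconds later.
        await asyncio.sleep(after)
        self._html = "<div id='app'>ready</div>"

    async def content(self) -> str:
        return self._html


async def get_delayed_content(page: FakePage, delay: float) -> str:
    """Wait `delay` seconds, then return the page content at that moment."""
    await asyncio.sleep(delay)
    return await page.content()


async def demo() -> tuple:
    page = FakePage()
    loader = asyncio.create_task(page.finish_loading(after=0.05))
    immediate = await page.content()  # read before rendering finishes
    delayed = await get_delayed_content(page, delay=0.2)
    await loader
    return immediate, delayed
```

The immediate read sees the placeholder markup, while the delayed read sees the rendered markup. That is the situation the changelog entry describes for dynamically loaded pages.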
-### Fixed
-- Adjusted image processing in `WebScrappingStrategy` to prevent premature decomposition of img tags
+### 3. Custom Page Timeout
+- **What's new**: Added `page_timeout` parameter to control page load timeout.
+- **Why it matters**: Gives you more control over crawling behavior, especially for slow-loading pages.
+- **How to use**: Set `page_timeout=your_desired_timeout` (in milliseconds) when calling `crawler.arun()`.
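Because `page_timeout` is given in milliseconds, a small conversion helper can guard against passing seconds by mistake. The following is a minimal sketch under assumptions: `seconds_to_ms` and `crawl_slow_page` are hypothetical names, and `AsyncWebCrawler` is assumed to be the crawler class exported by `crawl4ai`. Only the `page_timeout` keyword on `crawler.arun()` comes from the changelog entry.

```python
import asyncio


def seconds_to_ms(seconds: float) -> int:
    """Convert a timeout in seconds to whole milliseconds."""
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return int(round(seconds * 1000))


async def crawl_slow_page(url: str, timeout_seconds: float = 90):
    """Crawl `url`, allowing up to `timeout_seconds` for the page to load."""
    # Imported lazily so seconds_to_ms can be used without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        return await crawler.arun(
            url=url,
            page_timeout=seconds_to_ms(timeout_seconds),  # milliseconds
        )
```

For a 90-second budget this passes `page_timeout=90000`.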
-### Removed
-- Removed `pypi_build.sh` from version control (added to `.gitignore`)
+### 4. Enhanced LLM Support
+- **What's new**: Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
+- **Why it matters**: Provides more flexibility in choosing AI models for content extraction.
+- **How to use**: Specify the desired provider when using `LLMExtractionStrategy`.
-### Developer Notes
-- Added examples for using different LLM providers in `quickstart_async.py`
-- Improved error messages for better debugging
-- Enhanced type hinting throughout the codebase
+## Improvements
+### 1. Database Schema Auto-updates
+- **What's new**: Automatic database schema updates.
+- **Why it matters**: Ensures your database stays compatible with the latest version without manual intervention.
+### 2. Enhanced Error Handling
+- **What's new**: Improved error messages and logging.
+- **Why it matters**: Makes debugging easier with more informative error messages.
+### 3. Optimized Image Processing
+- **What's new**: Refined image handling in `WebScrappingStrategy`.
+- **Why it matters**: Improves the accuracy of content extraction for pages with images.
+## Bug Fixes
+- Fixed an issue where image tags were being prematurely removed during content extraction.
+## Developer Notes
+- Added examples for using different LLM providers in `quickstart_async.py`.
+- Enhanced type hinting throughout the codebase for better development experience.
+We're constantly working to improve crawl4ai. These updates aim to provide you with more control, flexibility, and reliability in your web crawling tasks. As always, we appreciate your feedback and suggestions for future improvements!
 ## [v0.3.5] - 2024-09-02