Update changelog

unclecode
2024-10-14 21:04:02 +08:00
parent 6aa803d712
commit 2b73bdf6b0


@@ -1,6 +1,6 @@
 # Changelog
-## [v0.3.6] - 2024-10-12 - Part 1
+## [v0.3.6] - 2024-10-12
 ### 1. Improved Crawling Control
 - **New Hook**: Added `before_retrieve_html` hook in `AsyncPlaywrightCrawlerStrategy`.
@@ -8,73 +8,75 @@
 - Useful for pages with delayed content loading.
 - **Flexible Timeout**: `smart_wait` function now uses `page_timeout` (default 60 seconds) instead of a fixed 30-second timeout.
 - Provides better handling for slow-loading pages.
-### 2. Enhanced LLM Extraction Strategy
-- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
-- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
-- Enables more flexibility when interacting with different LLM APIs.
-### 3. AsyncWebCrawler Improvements
-- **Flexible Initialization**: `AsyncWebCrawler` now accepts arbitrary keyword arguments.
-- These are passed directly to the crawler strategy, allowing for more customized setups.
-### 4. Utility Function Enhancements
-- **Improved API Interaction**: `perform_completion_with_backoff` function now supports additional arguments.
-- Allows for more customized API calls to LLM providers.
-## Examples and Documentation
-- Updated `quickstart_async.py` with examples of using custom headers in LLM extraction.
-- Added more diverse examples of LLM provider usage, including OpenAI, Hugging Face, and Ollama.
-## Developer Notes
-- Refactored code for better maintainability and flexibility.
-- Enhanced error handling and logging for improved debugging experience.
-## [v0.3.6] - 2024-10-12 - Part 2
-### 1. Screenshot Capture
-- **What's new**: Added ability to capture screenshots during crawling.
-- **Why it matters**: You can now visually verify the content of crawled pages, which is useful for debugging and content verification.
-- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
-### 2. Delayed Content Retrieval
-- **What's new**: Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
-- **Why it matters**: Allows you to retrieve content after a specified delay, useful for pages that load content dynamically.
-- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
-### 3. Custom Page Timeout
-- **What's new**: Added `page_timeout` parameter to control page load timeout.
-- **Why it matters**: Gives you more control over crawling behavior, especially for slow-loading pages.
 - **How to use**: Set `page_timeout=your_desired_timeout` (in milliseconds) when calling `crawler.arun()`.
-### 4. Enhanced LLM Support
-- **What's new**: Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
-- **Why it matters**: Provides more flexibility in choosing AI models for content extraction.
-- **How to use**: Specify the desired provider when using `LLMExtractionStrategy`.
-## Improvements
+### 2. Browser Type Selection
+- Added support for different browser types (Chromium, Firefox, WebKit).
+- Users can now specify the browser type when initializing AsyncWebCrawler.
+- **How to use**: Set `browser_type="firefox"` or `browser_type="webkit"` when initializing AsyncWebCrawler.
+### 3. Screenshot Capture
+- Added ability to capture screenshots during crawling.
+- Useful for debugging and content verification.
+- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
-### 1. Database Schema Auto-updates
-- **What's new**: Automatic database schema updates.
-- **Why it matters**: Ensures your database stays compatible with the latest version without manual intervention.
+### 4. Enhanced LLM Extraction Strategy
+- Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
+- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
+- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
+- **How to use**: Specify the desired provider and custom arguments when using `LLMExtractionStrategy`.
-### 2. Enhanced Error Handling
-- **What's new**: Improved error messages and logging.
-- **Why it matters**: Makes debugging easier with more informative error messages.
+### 5. iframe Content Extraction
+- New feature to process and extract content from iframes.
+- **How to use**: Set `process_iframes=True` in the crawl method.
-### 3. Optimized Image Processing
-- **What's new**: Refined image handling in `WebScrappingStrategy`.
-- **Why it matters**: Improves the accuracy of content extraction for pages with images.
+### 6. Delayed Content Retrieval
+- Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
+- Allows retrieval of content after a specified delay, useful for dynamically loaded content.
+- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
+## Improvements and Optimizations
+### 1. AsyncWebCrawler Enhancements
+- **Flexible Initialization**: Now accepts arbitrary keyword arguments, passed directly to the crawler strategy.
+- Allows for more customized setups.
+### 2. Image Processing Optimization
+- Enhanced image handling in WebScrappingStrategy.
+- Added filtering for small, invisible, or irrelevant images.
+- Improved image scoring system for better content relevance.
+- Implemented JavaScript-based image dimension updating for more accurate representation.
+### 3. Database Schema Auto-updates
+- Automatic database schema updates ensure compatibility with the latest version.
+### 4. Enhanced Error Handling and Logging
+- Improved error messages and logging for easier debugging.
+### 5. Content Extraction Refinements
+- Refined HTML sanitization process.
+- Improved handling of base64 encoded images.
+- Enhanced Markdown conversion process.
+- Optimized content extraction algorithms.
+### 6. Utility Function Enhancements
+- `perform_completion_with_backoff` function now supports additional arguments for more customized API calls to LLM providers.
 ## Bug Fixes
 - Fixed an issue where image tags were being prematurely removed during content extraction.
+## Examples and Documentation
+- Updated `quickstart_async.py` with examples of:
+- Using custom headers in LLM extraction.
+- Different LLM provider usage (OpenAI, Hugging Face, Ollama).
+- Custom browser type usage.
 ## Developer Notes
+- Refactored code for better maintainability, flexibility, and performance.
+- Enhanced type hinting throughout the codebase for improved development experience.
+- Expanded error handling for more robust operation.
-- Added examples for using different LLM providers in `quickstart_async.py`.
+These updates significantly enhance the flexibility, accuracy, and robustness of crawl4ai, providing users with more control and options for their web crawling and content extraction tasks.
-- Enhanced type hinting throughout the codebase for better development experience.
 ## [v0.3.5] - 2024-09-02