Update changelog

unclecode
2024-10-14 21:04:02 +08:00
parent 6aa803d712
commit 2b73bdf6b0


@@ -1,6 +1,6 @@
# Changelog
## [v0.3.6] - 2024-10-12 - Part 1
## [v0.3.6] - 2024-10-12
### 1. Improved Crawling Control
- **New Hook**: Added `before_retrieve_html` hook in `AsyncPlaywrightCrawlerStrategy`.
@@ -8,73 +8,75 @@
- Useful for pages with delayed content loading.
- **Flexible Timeout**: `smart_wait` function now uses `page_timeout` (default 60 seconds) instead of a fixed 30-second timeout.
- Provides better handling for slow-loading pages.
### 2. Enhanced LLM Extraction Strategy
- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
- Enables more flexibility when interacting with different LLM APIs.
### 3. AsyncWebCrawler Improvements
- **Flexible Initialization**: `AsyncWebCrawler` now accepts arbitrary keyword arguments.
- These are passed directly to the crawler strategy, allowing for more customized setups.
### 4. Utility Function Enhancements
- **Improved API Interaction**: `perform_completion_with_backoff` function now supports additional arguments.
- Allows for more customized API calls to LLM providers.
## Examples and Documentation
- Updated `quickstart_async.py` with examples of using custom headers in LLM extraction.
- Added more diverse examples of LLM provider usage, including OpenAI, Hugging Face, and Ollama.
## Developer Notes
- Refactored code for better maintainability and flexibility.
- Enhanced error handling and logging for improved debugging experience.
## [v0.3.6] - 2024-10-12 - Part 2
### 1. Screenshot Capture
- **What's new**: Added ability to capture screenshots during crawling.
- **Why it matters**: You can now visually verify the content of crawled pages, which is useful for debugging and content verification.
- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
### 2. Delayed Content Retrieval
- **What's new**: Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
- **Why it matters**: Allows you to retrieve content after a specified delay, useful for pages that load content dynamically.
- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
### 3. Custom Page Timeout
- **What's new**: Added `page_timeout` parameter to control page load timeout.
- **Why it matters**: Gives you more control over crawling behavior, especially for slow-loading pages.
- **How to use**: Set `page_timeout=your_desired_timeout` (in milliseconds) when calling `crawler.arun()`.
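Because `page_timeout` is expressed in milliseconds, a small conversion helper can make call sites easier to read. The following is a minimal sketch, not the project's own code: `seconds_to_ms`, `crawl_slow_page`, and the example URL are illustrative names, and the `crawl4ai` import is deferred so the helper works even where the library is not installed.

```python
import asyncio


def seconds_to_ms(seconds: float) -> int:
    """Convert seconds to whole milliseconds (hypothetical helper, not part of crawl4ai)."""
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return int(round(seconds * 1000))


async def crawl_slow_page(url: str, timeout_seconds: float = 90):
    # Deferred import: assumes crawl4ai v0.3.6 or later is installed.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler(verbose=True) as crawler:
        # page_timeout is given in milliseconds, as stated in the entry above.
        return await crawler.arun(url=url, page_timeout=seconds_to_ms(timeout_seconds))


# Example call (needs network access and an installed Playwright browser):
# result = asyncio.run(crawl_slow_page("https://example.com", timeout_seconds=90))
```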
### 4. Enhanced LLM Support
- **What's new**: Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
- **Why it matters**: Provides more flexibility in choosing AI models for content extraction.
- **How to use**: Specify the desired provider when using `LLMExtractionStrategy`.
### 2. Browser Type Selection
- Added support for different browser types (Chromium, Firefox, WebKit).
- Users can now specify the browser type when initializing `AsyncWebCrawler`.
- **How to use**: Set `browser_type="firefox"` or `browser_type="webkit"` when initializing `AsyncWebCrawler`.
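A possible way to use the browser selection is sketched below. The validation helper `normalize_browser_type` and the wrapper `crawl_with_browser` are assumptions added for illustration; only the `browser_type` keyword and the three browser names come from the entry above.

```python
import asyncio

# Browser names listed in the changelog entry above.
SUPPORTED_BROWSERS = ("chromium", "firefox", "webkit")


def normalize_browser_type(name: str) -> str:
    """Lower-case and validate a browser name (hypothetical helper)."""
    key = name.strip().lower()
    if key not in SUPPORTED_BROWSERS:
        raise ValueError(f"unsupported browser type: {name!r}")
    return key


async def crawl_with_browser(url: str, browser_type: str = "firefox"):
    # Deferred import: assumes crawl4ai v0.3.6 or later is installed.
    from crawl4ai import AsyncWebCrawler

    browser = normalize_browser_type(browser_type)
    async with AsyncWebCrawler(browser_type=browser, verbose=True) as crawler:
        return await crawler.arun(url=url)


# Example call (needs the matching Playwright browser to be installed):
# result = asyncio.run(crawl_with_browser("https://example.com", "webkit"))
```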
## Improvements
### 3. Screenshot Capture
- Added ability to capture screenshots during crawling.
- Useful for debugging and content verification.
- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
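The sketch below shows one way to persist a captured screenshot. It assumes that the crawl result exposes the image as a base64-encoded string in `result.screenshot`; that attribute, the helper `save_screenshot`, and the output path are assumptions to check against the installed version.

```python
import asyncio
import base64
from pathlib import Path


def save_screenshot(encoded: str, path: str) -> int:
    """Decode a base64 string and write it to disk; returns the byte count (hypothetical helper)."""
    data = base64.b64decode(encoded)
    Path(path).write_bytes(data)
    return len(data)


async def crawl_and_capture(url: str, path: str = "screenshot.png") -> int:
    # Deferred import: assumes crawl4ai v0.3.6 or later is installed.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url=url, screenshot=True)
    # Assumption: result.screenshot holds base64 data when capture succeeded.
    if result.screenshot:
        return save_screenshot(result.screenshot, path)
    return 0


# Example call (needs network access and an installed Playwright browser):
# written = asyncio.run(crawl_and_capture("https://example.com"))
```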
### 1. Database Schema Auto-updates
- **What's new**: Automatic database schema updates.
- **Why it matters**: Ensures your database stays compatible with the latest version without manual intervention.
### 4. Enhanced LLM Extraction Strategy
- Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
- **How to use**: Specify the desired provider and custom arguments when using `LLMExtractionStrategy`.
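As a hedged illustration of combining a provider with custom arguments, the sketch below builds the `extra_args` dictionary separately and passes it to `LLMExtractionStrategy`. Placing headers under an `extra_headers` key, the provider string `openai/gpt-4o`, the instruction text, and the helper names are assumptions modeled on typical usage, not confirmed details of the API.

```python
import asyncio
import os
from typing import Dict, Optional


def build_extra_args(extra_headers: Optional[Dict[str, str]] = None,
                     **provider_options) -> Dict[str, object]:
    """Merge provider options and optional headers into one dict (hypothetical helper)."""
    args: Dict[str, object] = dict(provider_options)
    if extra_headers:
        # Assumption: headers are forwarded under the "extra_headers" key.
        args["extra_headers"] = dict(extra_headers)
    return args


async def extract_with_llm(url: str, provider: str = "openai/gpt-4o",
                           api_token: Optional[str] = None,
                           extra_headers: Optional[Dict[str, str]] = None):
    # Deferred imports: assume crawl4ai v0.3.6 or later is installed.
    from crawl4ai import AsyncWebCrawler
    from crawl4ai.extraction_strategy import LLMExtractionStrategy

    strategy = LLMExtractionStrategy(
        provider=provider,
        api_token=api_token or os.getenv("OPENAI_API_KEY"),
        instruction="Summarize the main points of the page.",
        extra_args=build_extra_args(extra_headers),
    )
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url=url, extraction_strategy=strategy)
    return result.extracted_content


# Example call (needs a valid API token and network access):
# content = asyncio.run(extract_with_llm("https://example.com"))
```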
### 2. Enhanced Error Handling
- **What's new**: Improved error messages and logging.
- **Why it matters**: Makes debugging easier with more informative error messages.
### 5. iframe Content Extraction
- New feature to process and extract content from iframes.
- **How to use**: Set `process_iframes=True` in the crawl method.
### 3. Optimized Image Processing
- **What's new**: Refined image handling in `WebScrappingStrategy`.
- **Why it matters**: Improves the accuracy of content extraction for pages with images.
### 6. Delayed Content Retrieval
- Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
- Allows retrieval of content after a specified delay, useful for dynamically loaded content.
- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
## Improvements and Optimizations
### 1. AsyncWebCrawler Enhancements
- **Flexible Initialization**: Now accepts arbitrary keyword arguments, passed directly to the crawler strategy.
- Allows for more customized setups.
### 2. Image Processing Optimization
- Enhanced image handling in `WebScrappingStrategy`.
- Added filtering for small, invisible, or irrelevant images.
- Improved image scoring system for better content relevance.
- Implemented JavaScript-based image dimension updating for more accurate representation.
### 3. Database Schema Auto-updates
- Automatic database schema updates ensure compatibility with the latest version.
### 4. Enhanced Error Handling and Logging
- Improved error messages and logging for easier debugging.
### 5. Content Extraction Refinements
- Refined HTML sanitization process.
- Improved handling of base64 encoded images.
- Enhanced Markdown conversion process.
- Optimized content extraction algorithms.
### 6. Utility Function Enhancements
- `perform_completion_with_backoff` function now supports additional arguments for more customized API calls to LLM providers.
## Bug Fixes
- Fixed an issue where image tags were being prematurely removed during content extraction.
## Examples and Documentation
- Updated `quickstart_async.py` with examples of:
- Using custom headers in LLM extraction.
- Different LLM provider usage (OpenAI, Hugging Face, Ollama).
- Custom browser type usage.
## Developer Notes
- Refactored code for better maintainability, flexibility, and performance.
- Enhanced type hinting throughout the codebase for improved development experience.
- Expanded error handling for more robust operation.
- Added examples for using different LLM providers in `quickstart_async.py`.
- Enhanced type hinting throughout the codebase for better development experience.
These updates significantly enhance the flexibility, accuracy, and robustness of crawl4ai, providing users with more control and options for their web crawling and content extraction tasks.
## [v0.3.5] - 2024-09-02