Update changelog
114
CHANGELOG.md
@@ -1,6 +1,6 @@
# Changelog

## [v0.3.6] - 2024-10-12 - Part 1
## [v0.3.6] - 2024-10-12

### 1. Improved Crawling Control
- **New Hook**: Added `before_retrieve_html` hook in `AsyncPlaywrightCrawlerStrategy`.
@@ -8,73 +8,75 @@
  - Useful for pages with delayed content loading.
- **Flexible Timeout**: `smart_wait` function now uses `page_timeout` (default 60 seconds) instead of a fixed 30-second timeout.
  - Provides better handling for slow-loading pages.

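A minimal sketch of how the `before_retrieve_html` hook might be registered. It assumes that the strategy exposes a `set_hook(name, coroutine)` method, that the hook is called with a Playwright-style `page` object, and that the strategy lives in `crawl4ai.async_crawler_strategy`; none of these details are stated in this changelog. The function names are illustrative, and the `crawl4ai` imports are deferred so the hook body can be exercised without the library installed.

```python
import asyncio


async def scroll_before_retrieve_html(page):
    """Hook body: scroll to the bottom so lazy content can load before the HTML is read.

    Assumption: the hook receives a Playwright-style page object.
    """
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")


async def crawl_with_hook(url: str):
    """Hypothetical helper that wires the hook into a crawl."""
    # Deferred imports: crawl4ai is only needed when actually crawling.
    from crawl4ai import AsyncWebCrawler
    from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy

    strategy = AsyncPlaywrightCrawlerStrategy()
    # Assumption: hooks are registered by name through set_hook().
    strategy.set_hook("before_retrieve_html", scroll_before_retrieve_html)

    async with AsyncWebCrawler(crawler_strategy=strategy) as crawler:
        return await crawler.arun(url=url)

# Usage (requires crawl4ai and a browser install):
#   result = asyncio.run(crawl_with_hook("https://example.com"))
```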
### 2. Enhanced LLM Extraction Strategy
- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
  - Enables more flexibility when interacting with different LLM APIs.

### 3. AsyncWebCrawler Improvements
- **Flexible Initialization**: `AsyncWebCrawler` now accepts arbitrary keyword arguments.
  - These are passed directly to the crawler strategy, allowing for more customized setups.

### 4. Utility Function Enhancements
- **Improved API Interaction**: `perform_completion_with_backoff` function now supports additional arguments.
  - Allows for more customized API calls to LLM providers.

## Examples and Documentation
- Updated `quickstart_async.py` with examples of using custom headers in LLM extraction.
- Added more diverse examples of LLM provider usage, including OpenAI, Hugging Face, and Ollama.

## Developer Notes
- Refactored code for better maintainability and flexibility.
- Enhanced error handling and logging for improved debugging experience.

## [v0.3.6] - 2024-10-12 - Part 2

### 1. Screenshot Capture
- **What's new**: Added ability to capture screenshots during crawling.
- **Why it matters**: You can now visually verify the content of crawled pages, which is useful for debugging and content verification.
- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.

### 2. Delayed Content Retrieval
- **What's new**: Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
- **Why it matters**: Allows you to retrieve content after a specified delay, useful for pages that load content dynamically.
- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.

### 3. Custom Page Timeout
- **What's new**: Added `page_timeout` parameter to control page load timeout.
- **Why it matters**: Gives you more control over crawling behavior, especially for slow-loading pages.
- **How to use**: Set `page_timeout=your_desired_timeout` (in milliseconds) when calling `crawler.arun()`.

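Because `page_timeout` is given in milliseconds while timeouts are often reasoned about in seconds, a small conversion helper can prevent unit mistakes. The sketch below is illustrative only: `seconds_to_page_timeout` and `crawl_slow_page` are hypothetical names, the 90-second default is arbitrary, and the `crawl4ai` import is deferred so the helper can be used on its own.

```python
def seconds_to_page_timeout(seconds: float) -> int:
    """Convert a timeout in seconds to the millisecond value page_timeout expects."""
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return round(seconds * 1000)


async def crawl_slow_page(url: str, timeout_seconds: float = 90):
    """Hypothetical helper: crawl a slow page with a longer page load timeout."""
    from crawl4ai import AsyncWebCrawler  # deferred import

    async with AsyncWebCrawler() as crawler:
        return await crawler.arun(
            url=url,
            page_timeout=seconds_to_page_timeout(timeout_seconds),
        )
```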
### 4. Enhanced LLM Support
- **What's new**: Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
- **Why it matters**: Provides more flexibility in choosing AI models for content extraction.
- **How to use**: Specify the desired provider when using `LLMExtractionStrategy`.
### 2. Browser Type Selection
- Added support for different browser types (Chromium, Firefox, WebKit).
- Users can now specify the browser type when initializing AsyncWebCrawler.
- **How to use**: Set `browser_type="firefox"` or `browser_type="webkit"` when initializing AsyncWebCrawler.

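As a rough illustration of browser selection, the sketch below validates the requested browser before passing it to `AsyncWebCrawler`. The `"firefox"` and `"webkit"` values come from the entry above; the lowercase `"chromium"` string and both helper names are assumptions made for this example.

```python
# "chromium" is assumed to follow the same lowercase naming as the two documented values.
SUPPORTED_BROWSER_TYPES = ("chromium", "firefox", "webkit")


def normalize_browser_type(name: str) -> str:
    """Lowercase and validate a browser type string."""
    value = name.strip().lower()
    if value not in SUPPORTED_BROWSER_TYPES:
        raise ValueError(f"unsupported browser type: {name!r}")
    return value


async def crawl_with_browser(url: str, browser_type: str = "firefox"):
    """Hypothetical helper: crawl a URL with the chosen browser engine."""
    from crawl4ai import AsyncWebCrawler  # deferred import

    async with AsyncWebCrawler(browser_type=normalize_browser_type(browser_type)) as crawler:
        return await crawler.arun(url=url)
```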
## Improvements
### 3. Screenshot Capture
- Added ability to capture screenshots during crawling.
- Useful for debugging and content verification.
- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.

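A short sketch of capturing and saving a screenshot. It assumes the crawl result exposes the image as a base64-encoded string in a `screenshot` attribute, which this changelog does not state; `decode_screenshot`, `crawl_and_capture`, and the output path are hypothetical.

```python
import base64
from pathlib import Path


def decode_screenshot(encoded: str) -> bytes:
    """Decode a base64-encoded screenshot into raw image bytes."""
    return base64.b64decode(encoded)


async def crawl_and_capture(url: str, path: str = "page.png"):
    """Hypothetical helper: crawl with screenshot=True and write the image to disk."""
    from crawl4ai import AsyncWebCrawler  # deferred import

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, screenshot=True)

    # Assumption: result.screenshot holds base64 text, or is empty when capture failed.
    if result.screenshot:
        Path(path).write_bytes(decode_screenshot(result.screenshot))
    return result
```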
### 1. Database Schema Auto-updates
- **What's new**: Automatic database schema updates.
- **Why it matters**: Ensures your database stays compatible with the latest version without manual intervention.
### 4. Enhanced LLM Extraction Strategy
- Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
- **How to use**: Specify the desired provider and custom arguments when using `LLMExtractionStrategy`.

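The following sketch shows one way provider selection, extra arguments, and custom headers might be combined. Only the `extra_args` parameter is confirmed by this changelog; the `provider`, `api_token`, and `instruction` parameter names, the `crawl4ai.extraction_strategy` module path, and the convention of carrying headers under an `extra_headers` key inside `extra_args` are assumptions. The helper names are hypothetical.

```python
from typing import Any, Dict, Optional


def build_llm_strategy_kwargs(
    provider: str,
    api_token: Optional[str] = None,
    extra_headers: Optional[Dict[str, str]] = None,
    **extra_args: Any,
) -> Dict[str, Any]:
    """Assemble keyword arguments for LLMExtractionStrategy (parameter names assumed)."""
    kwargs: Dict[str, Any] = {"provider": provider, "api_token": api_token}
    merged = dict(extra_args)
    if extra_headers:
        # Assumption: custom headers travel inside extra_args under "extra_headers".
        merged["extra_headers"] = dict(extra_headers)
    if merged:
        kwargs["extra_args"] = merged
    return kwargs


async def extract_with_llm(url: str, instruction: str, **strategy_options: Any):
    """Hypothetical helper: run a crawl with an LLM-based extraction strategy."""
    from crawl4ai import AsyncWebCrawler  # deferred imports
    from crawl4ai.extraction_strategy import LLMExtractionStrategy

    strategy = LLMExtractionStrategy(
        instruction=instruction,
        **build_llm_strategy_kwargs(**strategy_options),
    )
    async with AsyncWebCrawler() as crawler:
        return await crawler.arun(url=url, extraction_strategy=strategy)
```

Keeping the argument assembly in a plain function makes it easy to switch providers or add headers without touching the crawl logic.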
### 2. Enhanced Error Handling
- **What's new**: Improved error messages and logging.
- **Why it matters**: Makes debugging easier with more informative error messages.
### 5. iframe Content Extraction
- New feature to process and extract content from iframes.
- **How to use**: Set `process_iframes=True` in the crawl method.

### 3. Optimized Image Processing
- **What's new**: Refined image handling in `WebScrappingStrategy`.
- **Why it matters**: Improves the accuracy of content extraction for pages with images.
### 6. Delayed Content Retrieval
- Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
- Allows retrieval of content after a specified delay, useful for dynamically loaded content.
- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.

## Improvements and Optimizations

### 1. AsyncWebCrawler Enhancements
- **Flexible Initialization**: Now accepts arbitrary keyword arguments, passed directly to the crawler strategy.
  - Allows for more customized setups.

### 2. Image Processing Optimization
- Enhanced image handling in WebScrappingStrategy.
- Added filtering for small, invisible, or irrelevant images.
- Improved image scoring system for better content relevance.
- Implemented JavaScript-based image dimension updating for more accurate representation.

### 3. Database Schema Auto-updates
- Automatic database schema updates ensure compatibility with the latest version.

### 4. Enhanced Error Handling and Logging
- Improved error messages and logging for easier debugging.

### 5. Content Extraction Refinements
- Refined HTML sanitization process.
- Improved handling of base64 encoded images.
- Enhanced Markdown conversion process.
- Optimized content extraction algorithms.

### 6. Utility Function Enhancements
- `perform_completion_with_backoff` function now supports additional arguments for more customized API calls to LLM providers.

## Bug Fixes

- Fixed an issue where image tags were being prematurely removed during content extraction.

## Examples and Documentation
- Updated `quickstart_async.py` with examples of:
  - Using custom headers in LLM extraction.
  - Different LLM provider usage (OpenAI, Hugging Face, Ollama).
  - Custom browser type usage.

## Developer Notes
- Refactored code for better maintainability, flexibility, and performance.
- Enhanced type hinting throughout the codebase for improved development experience.
- Expanded error handling for more robust operation.

- Added examples for using different LLM providers in `quickstart_async.py`.
- Enhanced type hinting throughout the codebase for better development experience.

These updates significantly enhance the flexibility, accuracy, and robustness of crawl4ai, providing users with more control and options for their web crawling and content extraction tasks.

## [v0.3.5] - 2024-09-02