Update changelog
114
CHANGELOG.md
@@ -1,6 +1,6 @@
|
|||||||
# Changelog
|
# Changelog
|
||||||
|
|
||||||
## [v0.3.6] - 2024-10-12 - Part 1
|
## [v0.3.6] - 2024-10-12
|
||||||
|
|
||||||
### 1. Improved Crawling Control
|
### 1. Improved Crawling Control
|
||||||
- **New Hook**: Added `before_retrieve_html` hook in `AsyncPlaywrightCrawlerStrategy`.
|
- **New Hook**: Added `before_retrieve_html` hook in `AsyncPlaywrightCrawlerStrategy`.
|
||||||
@@ -8,73 +8,75 @@
|
|||||||
- Useful for pages with delayed content loading.
|
- Useful for pages with delayed content loading.
|
||||||
- **Flexible Timeout**: `smart_wait` function now uses `page_timeout` (default 60 seconds) instead of a fixed 30-second timeout.
|
- **Flexible Timeout**: `smart_wait` function now uses `page_timeout` (default 60 seconds) instead of a fixed 30-second timeout.
|
||||||
- Provides better handling for slow-loading pages.
|
- Provides better handling for slow-loading pages.
|
||||||
|
|
||||||
### 2. Enhanced LLM Extraction Strategy
|
|
||||||
- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
|
|
||||||
- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
|
|
||||||
- Enables more flexibility when interacting with different LLM APIs.
|
|
||||||
|
|
||||||
### 3. AsyncWebCrawler Improvements
|
|
||||||
- **Flexible Initialization**: `AsyncWebCrawler` now accepts arbitrary keyword arguments.
|
|
||||||
- These are passed directly to the crawler strategy, allowing for more customized setups.
|
|
||||||
|
|
||||||
### 4. Utility Function Enhancements
|
|
||||||
- **Improved API Interaction**: `perform_completion_with_backoff` function now supports additional arguments.
|
|
||||||
- Allows for more customized API calls to LLM providers.
|
|
||||||
|
|
||||||
## Examples and Documentation
|
|
||||||
- Updated `quickstart_async.py` with examples of using custom headers in LLM extraction.
|
|
||||||
- Added more diverse examples of LLM provider usage, including OpenAI, Hugging Face, and Ollama.
|
|
||||||
|
|
||||||
## Developer Notes
|
|
||||||
- Refactored code for better maintainability and flexibility.
|
|
||||||
- Enhanced error handling and logging for improved debugging experience.
|
|
||||||
|
|
||||||
## [v0.3.6] - 2024-10-12 - Part 2
|
|
||||||
|
|
||||||
### 1. Screenshot Capture
|
|
||||||
- **What's new**: Added ability to capture screenshots during crawling.
|
|
||||||
- **Why it matters**: You can now visually verify the content of crawled pages, which is useful for debugging and content verification.
|
|
||||||
- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
|
|
||||||
|
|
||||||
### 2. Delayed Content Retrieval
|
|
||||||
- **What's new**: Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
|
|
||||||
- **Why it matters**: Allows you to retrieve content after a specified delay, useful for pages that load content dynamically.
|
|
||||||
- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
|
|
||||||
|
|
||||||
### 3. Custom Page Timeout
|
|
||||||
- **What's new**: Added `page_timeout` parameter to control page load timeout.
|
|
||||||
- **Why it matters**: Gives you more control over crawling behavior, especially for slow-loading pages.
|
|
||||||
- **How to use**: Set `page_timeout=your_desired_timeout` (in milliseconds) when calling `crawler.arun()`.
|
- **How to use**: Set `page_timeout=your_desired_timeout` (in milliseconds) when calling `crawler.arun()`.
|
||||||
|
|
||||||
### 4. Enhanced LLM Support
|
### 2. Browser Type Selection
|
||||||
- **What's new**: Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
|
- Added support for different browser types (Chromium, Firefox, WebKit).
|
||||||
- **Why it matters**: Provides more flexibility in choosing AI models for content extraction.
|
- Users can now specify the browser type when initializing AsyncWebCrawler.
|
||||||
- **How to use**: Specify the desired provider when using `LLMExtractionStrategy`.
|
- **How to use**: Set `browser_type="firefox"` or `browser_type="webkit"` when initializing AsyncWebCrawler.
|
||||||
|
|
||||||
## Improvements
|
### 3. Screenshot Capture
|
||||||
|
- Added ability to capture screenshots during crawling.
|
||||||
|
- Useful for debugging and content verification.
|
||||||
|
- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
|
||||||
|
|
||||||
### 1. Database Schema Auto-updates
|
### 4. Enhanced LLM Extraction Strategy
|
||||||
- **What's new**: Automatic database schema updates.
|
- Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
|
||||||
- **Why it matters**: Ensures your database stays compatible with the latest version without manual intervention.
|
- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
|
||||||
|
- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
|
||||||
|
- **How to use**: Specify the desired provider and custom arguments when using `LLMExtractionStrategy`.
|
||||||
|
|
||||||
### 2. Enhanced Error Handling
|
### 5. iframe Content Extraction
|
||||||
- **What's new**: Improved error messages and logging.
|
- New feature to process and extract content from iframes.
|
||||||
- **Why it matters**: Makes debugging easier with more informative error messages.
|
- **How to use**: Set `process_iframes=True` in the crawl method.
|
||||||
|
|
||||||
### 3. Optimized Image Processing
|
### 6. Delayed Content Retrieval
|
||||||
- **What's new**: Refined image handling in `WebScrappingStrategy`.
|
- Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
|
||||||
- **Why it matters**: Improves the accuracy of content extraction for pages with images.
|
- Allows retrieval of content after a specified delay, useful for dynamically loaded content.
|
||||||
|
- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
|
||||||
|
|
||||||
|
## Improvements and Optimizations
|
||||||
|
|
||||||
|
### 1. AsyncWebCrawler Enhancements
|
||||||
|
- **Flexible Initialization**: Now accepts arbitrary keyword arguments, passed directly to the crawler strategy.
|
||||||
|
- Allows for more customized setups.
|
||||||
|
|
||||||
|
### 2. Image Processing Optimization
|
||||||
|
- Enhanced image handling in WebScrappingStrategy.
|
||||||
|
- Added filtering for small, invisible, or irrelevant images.
|
||||||
|
- Improved image scoring system for better content relevance.
|
||||||
|
- Implemented JavaScript-based image dimension updating for more accurate representation.
|
||||||
|
|
||||||
|
### 3. Database Schema Auto-updates
|
||||||
|
- Automatic database schema updates ensure compatibility with the latest version.
|
||||||
|
|
||||||
|
### 4. Enhanced Error Handling and Logging
|
||||||
|
- Improved error messages and logging for easier debugging.
|
||||||
|
|
||||||
|
### 5. Content Extraction Refinements
|
||||||
|
- Refined HTML sanitization process.
|
||||||
|
- Improved handling of base64 encoded images.
|
||||||
|
- Enhanced Markdown conversion process.
|
||||||
|
- Optimized content extraction algorithms.
|
||||||
|
|
||||||
|
### 6. Utility Function Enhancements
|
||||||
|
- `perform_completion_with_backoff` function now supports additional arguments for more customized API calls to LLM providers.
|
||||||
|
|
||||||
## Bug Fixes
|
## Bug Fixes
|
||||||
|
|
||||||
- Fixed an issue where image tags were being prematurely removed during content extraction.
|
- Fixed an issue where image tags were being prematurely removed during content extraction.
|
||||||
|
|
||||||
|
## Examples and Documentation
|
||||||
|
- Updated `quickstart_async.py` with examples of:
|
||||||
|
- Using custom headers in LLM extraction.
|
||||||
|
- Different LLM provider usage (OpenAI, Hugging Face, Ollama).
|
||||||
|
- Custom browser type usage.
|
||||||
|
|
||||||
## Developer Notes
|
## Developer Notes
|
||||||
|
- Refactored code for better maintainability, flexibility, and performance.
|
||||||
|
- Enhanced type hinting throughout the codebase for improved development experience.
|
||||||
|
- Expanded error handling for more robust operation.
|
||||||
|
|
||||||
- Added examples for using different LLM providers in `quickstart_async.py`.
|
These updates significantly enhance the flexibility, accuracy, and robustness of crawl4ai, providing users with more control and options for their web crawling and content extraction tasks.
|
||||||
- Enhanced type hinting throughout the codebase for better development experience.
|
|
||||||
|
|
||||||
|
|
||||||
## [v0.3.5] - 2024-09-02
|
## [v0.3.5] - 2024-09-02
|
||||||
|
|
||||||
|
|||||||
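
The "How to use" notes in the updated changelog can be combined into one call. The following is a minimal sketch, not the project's own example: `build_arun_options` and `crawl_with_new_options` are hypothetical helper names, and the import path `from crawl4ai import AsyncWebCrawler` and the use of `AsyncWebCrawler` as an async context manager are assumptions not stated in the changelog. The parameter names (`browser_type`, `screenshot`, `page_timeout`, `process_iframes`) are taken from the changelog text.

```python
import asyncio


def build_arun_options(page_timeout_seconds: float = 60.0,
                       screenshot: bool = True,
                       process_iframes: bool = False) -> dict:
    """Collect the per-crawl options named in the changelog into one dict.

    Hypothetical helper, not part of crawl4ai.
    """
    return {
        "screenshot": screenshot,
        # The changelog states that page_timeout is given in milliseconds.
        "page_timeout": int(page_timeout_seconds * 1000),
        "process_iframes": process_iframes,
    }


async def crawl_with_new_options(url: str, browser_type: str = "firefox"):
    """Run one crawl with the options described in the v0.3.6 entry."""
    # Imported lazily so the helper above works without crawl4ai installed.
    # Assumption: AsyncWebCrawler is importable from the package root and
    # can be used as an async context manager.
    from crawl4ai import AsyncWebCrawler

    options = build_arun_options(page_timeout_seconds=90, process_iframes=True)
    # browser_type is set at initialization; the rest are passed to arun().
    async with AsyncWebCrawler(browser_type=browser_type) as crawler:
        return await crawler.arun(url=url, **options)
```

To try it, call `asyncio.run(crawl_with_new_options("https://example.com"))` in an environment where crawl4ai and the chosen browser are installed. Keeping the option dictionary in a separate function makes the seconds-to-milliseconds conversion explicit and checkable without launching a browser.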