diff --git a/CHANGELOG.md b/CHANGELOG.md
index 197fa32b..a377d794 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,6 +1,6 @@
 # Changelog
 
-## [v0.3.6] - 2024-10-12 - Part 1
+## [v0.3.6] - 2024-10-12
 
 ### 1. Improved Crawling Control
 - **New Hook**: Added `before_retrieve_html` hook in `AsyncPlaywrightCrawlerStrategy`.
@@ -8,73 +8,75 @@
   - Useful for pages with delayed content loading.
 - **Flexible Timeout**: `smart_wait` function now uses `page_timeout` (default 60 seconds) instead of a fixed 30-second timeout.
   - Provides better handling for slow-loading pages.
-
-### 2. Enhanced LLM Extraction Strategy
-- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
-- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
-  - Enables more flexibility when interacting with different LLM APIs.
-
-### 3. AsyncWebCrawler Improvements
-- **Flexible Initialization**: `AsyncWebCrawler` now accepts arbitrary keyword arguments.
-  - These are passed directly to the crawler strategy, allowing for more customized setups.
-
-### 4. Utility Function Enhancements
-- **Improved API Interaction**: `perform_completion_with_backoff` function now supports additional arguments.
-  - Allows for more customized API calls to LLM providers.
-
-## Examples and Documentation
-- Updated `quickstart_async.py` with examples of using custom headers in LLM extraction.
-- Added more diverse examples of LLM provider usage, including OpenAI, Hugging Face, and Ollama.
-
-## Developer Notes
-- Refactored code for better maintainability and flexibility.
-- Enhanced error handling and logging for improved debugging experience.
-
-## [v0.3.6] - 2024-10-12 - Part 2
-
-### 1. Screenshot Capture
-- **What's new**: Added ability to capture screenshots during crawling.
-- **Why it matters**: You can now visually verify the content of crawled pages, which is useful for debugging and content verification.
-- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
-
-### 2. Delayed Content Retrieval
-- **What's new**: Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
-- **Why it matters**: Allows you to retrieve content after a specified delay, useful for pages that load content dynamically.
-- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
-
-### 3. Custom Page Timeout
-- **What's new**: Added `page_timeout` parameter to control page load timeout.
-- **Why it matters**: Gives you more control over crawling behavior, especially for slow-loading pages.
 - **How to use**: Set `page_timeout=your_desired_timeout` (in milliseconds) when calling `crawler.arun()`.
 
-### 4. Enhanced LLM Support
-- **What's new**: Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
-- **Why it matters**: Provides more flexibility in choosing AI models for content extraction.
-- **How to use**: Specify the desired provider when using `LLMExtractionStrategy`.
+### 2. Browser Type Selection
+- Added support for different browser types (Chromium, Firefox, WebKit).
+- Users can now specify the browser type when initializing AsyncWebCrawler.
+- **How to use**: Set `browser_type="firefox"` or `browser_type="webkit"` when initializing AsyncWebCrawler.
 
-## Improvements
+### 3. Screenshot Capture
+- Added ability to capture screenshots during crawling.
+- Useful for debugging and content verification.
+- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
 
-### 1. Database Schema Auto-updates
-- **What's new**: Automatic database schema updates.
-- **Why it matters**: Ensures your database stays compatible with the latest version without manual intervention.
+### 4. Enhanced LLM Extraction Strategy
+- Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
+- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
+- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
+- **How to use**: Specify the desired provider and custom arguments when using `LLMExtractionStrategy`.
 
-### 2. Enhanced Error Handling
-- **What's new**: Improved error messages and logging.
-- **Why it matters**: Makes debugging easier with more informative error messages.
+### 5. iframe Content Extraction
+- New feature to process and extract content from iframes.
+- **How to use**: Set `process_iframes=True` in the crawl method.
 
-### 3. Optimized Image Processing
-- **What's new**: Refined image handling in `WebScrappingStrategy`.
-- **Why it matters**: Improves the accuracy of content extraction for pages with images.
+### 6. Delayed Content Retrieval
+- Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
+- Allows retrieval of content after a specified delay, useful for dynamically loaded content.
+- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
+
+## Improvements and Optimizations
+
+### 1. AsyncWebCrawler Enhancements
+- **Flexible Initialization**: Now accepts arbitrary keyword arguments, passed directly to the crawler strategy.
+- Allows for more customized setups.
+
+### 2. Image Processing Optimization
+- Enhanced image handling in WebScrappingStrategy.
+- Added filtering for small, invisible, or irrelevant images.
+- Improved image scoring system for better content relevance.
+- Implemented JavaScript-based image dimension updating for more accurate representation.
+
+### 3. Database Schema Auto-updates
+- Automatic database schema updates ensure compatibility with the latest version.
+
+### 4. Enhanced Error Handling and Logging
+- Improved error messages and logging for easier debugging.
+
+### 5. Content Extraction Refinements
+- Refined HTML sanitization process.
+- Improved handling of base64 encoded images.
+- Enhanced Markdown conversion process.
+- Optimized content extraction algorithms.
+
+### 6. Utility Function Enhancements
+- `perform_completion_with_backoff` function now supports additional arguments for more customized API calls to LLM providers.
 
 ## Bug Fixes
-
 - Fixed an issue where image tags were being prematurely removed during content extraction.
 
+## Examples and Documentation
+- Updated `quickstart_async.py` with examples of:
+  - Using custom headers in LLM extraction.
+  - Different LLM provider usage (OpenAI, Hugging Face, Ollama).
+  - Custom browser type usage.
+
 ## Developer Notes
+- Refactored code for better maintainability, flexibility, and performance.
+- Enhanced type hinting throughout the codebase for improved development experience.
+- Expanded error handling for more robust operation.
 
-- Added examples for using different LLM providers in `quickstart_async.py`.
-- Enhanced type hinting throughout the codebase for better development experience.
-
+These updates significantly enhance the flexibility, accuracy, and robustness of crawl4ai, providing users with more control and options for their web crawling and content extraction tasks.
 
 ## [v0.3.5] - 2024-09-02