Update changelog

2024-10-14 21:04:02 +08:00
parent 6aa803d712
commit 2b73bdf6b0
1 changed file with 58 additions and 56 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,6 +1,6 @@
 # Changelog
-## [v0.3.6] - 2024-10-12 - Part 1
+## [v0.3.6] - 2024-10-12 
 ### 1. Improved Crawling Control
 - **New Hook**: Added `before_retrieve_html` hook in `AsyncPlaywrightCrawlerStrategy`.
@@ -8,73 +8,75 @@
  - Useful for pages with delayed content loading.
 - **Flexible Timeout**: `smart_wait` function now uses `page_timeout` (default 60 seconds) instead of a fixed 30-second timeout.
  - Provides better handling for slow-loading pages.
 ### 2. Enhanced LLM Extraction Strategy
 - **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
 - **Custom Headers**: Users can now pass custom headers to the extraction strategy.
  - Enables more flexibility when interacting with different LLM APIs.
 ### 3. AsyncWebCrawler Improvements
 - **Flexible Initialization**: `AsyncWebCrawler` now accepts arbitrary keyword arguments.
  - These are passed directly to the crawler strategy, allowing for more customized setups.
 ### 4. Utility Function Enhancements
 - **Improved API Interaction**: `perform_completion_with_backoff` function now supports additional arguments.
  - Allows for more customized API calls to LLM providers.
 ## Examples and Documentation
 - Updated `quickstart_async.py` with examples of using custom headers in LLM extraction.
 - Added more diverse examples of LLM provider usage, including OpenAI, Hugging Face, and Ollama.
 ## Developer Notes
 - Refactored code for better maintainability and flexibility.
 - Enhanced error handling and logging for improved debugging experience.
 ## [v0.3.6] - 2024-10-12 - Part 2
 ### 1. Screenshot Capture
 - **What's new**: Added ability to capture screenshots during crawling.
 - **Why it matters**: You can now visually verify the content of crawled pages, which is useful for debugging and content verification.
 - **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
 ### 2. Delayed Content Retrieval
 - **What's new**: Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
 - **Why it matters**: Allows you to retrieve content after a specified delay, useful for pages that load content dynamically.
 - **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
 ### 3. Custom Page Timeout
 - **What's new**: Added `page_timeout` parameter to control page load timeout.
 - **Why it matters**: Gives you more control over crawling behavior, especially for slow-loading pages.
 - **How to use**: Set `page_timeout=your_desired_timeout` (in milliseconds) when calling `crawler.arun()`.
-### 4. Enhanced LLM Support
+### 2. Browser Type Selection
-- **What's new**: Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
+- Added support for different browser types (Chromium, Firefox, WebKit).
-- **Why it matters**: Provides more flexibility in choosing AI models for content extraction.
+- Users can now specify the browser type when initializing AsyncWebCrawler.
-- **How to use**: Specify the desired provider when using `LLMExtractionStrategy`.
+- **How to use**: Set `browser_type="firefox"` or `browser_type="webkit"` when initializing AsyncWebCrawler.
-## Improvements
+### 3. Screenshot Capture
 - Added ability to capture screenshots during crawling.
 - Useful for debugging and content verification.
 - **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
-### 1. Database Schema Auto-updates
+### 4. Enhanced LLM Extraction Strategy
-- **What's new**: Automatic database schema updates.
+- Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
-- **Why it matters**: Ensures your database stays compatible with the latest version without manual intervention.
+- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
 - **Custom Headers**: Users can now pass custom headers to the extraction strategy.
 - **How to use**: Specify the desired provider and custom arguments when using `LLMExtractionStrategy`.
-### 2. Enhanced Error Handling
+### 5. iframe Content Extraction
-- **What's new**: Improved error messages and logging.
+- New feature to process and extract content from iframes.
-- **Why it matters**: Makes debugging easier with more informative error messages.
+- **How to use**: Set `process_iframes=True` in the crawl method.
-### 3. Optimized Image Processing
+### 6. Delayed Content Retrieval
-- **What's new**: Refined image handling in `WebScrappingStrategy`.
+- Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
-- **Why it matters**: Improves the accuracy of content extraction for pages with images.
+- Allows retrieval of content after a specified delay, useful for dynamically loaded content.
 - **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
 ## Improvements and Optimizations
 ### 1. AsyncWebCrawler Enhancements
 - **Flexible Initialization**: Now accepts arbitrary keyword arguments, passed directly to the crawler strategy.
 - Allows for more customized setups.
 ### 2. Image Processing Optimization
 - Enhanced image handling in WebScrappingStrategy.
 - Added filtering for small, invisible, or irrelevant images.
 - Improved image scoring system for better content relevance.
 - Implemented JavaScript-based image dimension updating for more accurate representation.
 ### 3. Database Schema Auto-updates
 - Automatic database schema updates ensure compatibility with the latest version.
 ### 4. Enhanced Error Handling and Logging
 - Improved error messages and logging for easier debugging.
 ### 5. Content Extraction Refinements
 - Refined HTML sanitization process.
 - Improved handling of base64 encoded images.
 - Enhanced Markdown conversion process.
 - Optimized content extraction algorithms.
 ### 6. Utility Function Enhancements
 - `perform_completion_with_backoff` function now supports additional arguments for more customized API calls to LLM providers.
 ## Bug Fixes
 - Fixed an issue where image tags were being prematurely removed during content extraction.
 ## Examples and Documentation
 - Updated `quickstart_async.py` with examples of:
  - Using custom headers in LLM extraction.
  - Different LLM provider usage (OpenAI, Hugging Face, Ollama).
  - Custom browser type usage.
 ## Developer Notes
 - Refactored code for better maintainability, flexibility, and performance.
 - Enhanced type hinting throughout the codebase for improved development experience.
 - Expanded error handling for more robust operation.
-- Added examples for using different LLM providers in `quickstart_async.py`.
+These updates significantly enhance the flexibility, accuracy, and robustness of crawl4ai, providing users with more control and options for their web crawling and content extraction tasks.
 - Enhanced type hinting throughout the codebase for better development experience.
 ## [v0.3.5] - 2024-09-02
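
As a rough illustration of the v0.3.6 "How to use" notes in this diff, the sketch below combines the options the changelog names: `browser_type` at initialization, and `screenshot`, `page_timeout` (in milliseconds), and `process_iframes` passed to `crawler.arun()`. The helper names `build_arun_options` and `crawl_with_firefox` are hypothetical, and the `from crawl4ai import AsyncWebCrawler` import path and async context manager usage are assumptions about the installed package, not something this commit shows.

```python
import asyncio


def build_arun_options(screenshot=True, page_timeout_ms=60000, process_iframes=True):
    """Collect the per-crawl options named in the v0.3.6 changelog entry.

    Hypothetical helper: it only builds a dict of keyword arguments.
    """
    return {
        "screenshot": screenshot,            # capture a screenshot during the crawl
        "page_timeout": page_timeout_ms,     # page load timeout, in milliseconds
        "process_iframes": process_iframes,  # also extract content from iframes
    }


async def crawl_with_firefox(url):
    """Crawl one URL with Firefox, using the options above (assumed API usage)."""
    # Imported lazily so the helper above works without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler

    # The changelog says extra keyword arguments are passed to the crawler strategy.
    async with AsyncWebCrawler(browser_type="firefox") as crawler:
        return await crawler.arun(url=url, **build_arun_options())


# Example call (needs crawl4ai, a Playwright Firefox install, and network access):
# result = asyncio.run(crawl_with_firefox("https://example.com"))
```

Keeping the option dict separate from the crawl call makes it easy to check or adjust the timeout and flags without launching a browser.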