Update changelog

2024-10-14 21:04:02 +08:00
parent 6aa803d712
commit 2b73bdf6b0
1 changed file with 58 additions and 56 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,6 +1,6 @@
 # Changelog

-## [v0.3.6] - 2024-10-12 - Part 1
+## [v0.3.6] - 2024-10-12 

 ### 1. Improved Crawling Control
 - **New Hook**: Added `before_retrieve_html` hook in `AsyncPlaywrightCrawlerStrategy`.
@@ -8,73 +8,75 @@
  - Useful for pages with delayed content loading.
 - **Flexible Timeout**: `smart_wait` function now uses `page_timeout` (default 60 seconds) instead of a fixed 30-second timeout.
  - Provides better handling for slow-loading pages.
-
-### 2. Enhanced LLM Extraction Strategy
-- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
-- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
-  - Enables more flexibility when interacting with different LLM APIs.
-
-### 3. AsyncWebCrawler Improvements
-- **Flexible Initialization**: `AsyncWebCrawler` now accepts arbitrary keyword arguments.
-  - These are passed directly to the crawler strategy, allowing for more customized setups.
-
-### 4. Utility Function Enhancements
-- **Improved API Interaction**: `perform_completion_with_backoff` function now supports additional arguments.
-  - Allows for more customized API calls to LLM providers.
-
-## Examples and Documentation
-- Updated `quickstart_async.py` with examples of using custom headers in LLM extraction.
-- Added more diverse examples of LLM provider usage, including OpenAI, Hugging Face, and Ollama.
-
-## Developer Notes
-- Refactored code for better maintainability and flexibility.
-- Enhanced error handling and logging for improved debugging experience.
-
-## [v0.3.6] - 2024-10-12 - Part 2
-
-### 1. Screenshot Capture
-- **What's new**: Added ability to capture screenshots during crawling.
-- **Why it matters**: You can now visually verify the content of crawled pages, which is useful for debugging and content verification.
-- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
-
-### 2. Delayed Content Retrieval
-- **What's new**: Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
-- **Why it matters**: Allows you to retrieve content after a specified delay, useful for pages that load content dynamically.
-- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
-
-### 3. Custom Page Timeout
-- **What's new**: Added `page_timeout` parameter to control page load timeout.
-- **Why it matters**: Gives you more control over crawling behavior, especially for slow-loading pages.
 - **How to use**: Set `page_timeout=your_desired_timeout` (in milliseconds) when calling `crawler.arun()`.

-### 4. Enhanced LLM Support
-- **What's new**: Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
-- **Why it matters**: Provides more flexibility in choosing AI models for content extraction.
-- **How to use**: Specify the desired provider when using `LLMExtractionStrategy`.
+### 2. Browser Type Selection
+- Added support for different browser types (Chromium, Firefox, WebKit).
+- Users can now specify the browser type when initializing AsyncWebCrawler.
+- **How to use**: Set `browser_type="firefox"` or `browser_type="webkit"` when initializing AsyncWebCrawler.

-## Improvements
+### 3. Screenshot Capture
+- Added ability to capture screenshots during crawling.
+- Useful for debugging and content verification.
+- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.

-### 1. Database Schema Auto-updates
-- **What's new**: Automatic database schema updates.
-- **Why it matters**: Ensures your database stays compatible with the latest version without manual intervention.
+### 4. Enhanced LLM Extraction Strategy
+- Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
+- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
+- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
+- **How to use**: Specify the desired provider and custom arguments when using `LLMExtractionStrategy`.

-### 2. Enhanced Error Handling
-- **What's new**: Improved error messages and logging.
-- **Why it matters**: Makes debugging easier with more informative error messages.
+### 5. iframe Content Extraction
+- New feature to process and extract content from iframes.
+- **How to use**: Set `process_iframes=True` in the crawl method.

-### 3. Optimized Image Processing
-- **What's new**: Refined image handling in `WebScrappingStrategy`.
-- **Why it matters**: Improves the accuracy of content extraction for pages with images.
+### 6. Delayed Content Retrieval
+- Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
+- Allows retrieval of content after a specified delay, useful for dynamically loaded content.
+- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
+
+## Improvements and Optimizations
+
+### 1. AsyncWebCrawler Enhancements
+- **Flexible Initialization**: Now accepts arbitrary keyword arguments, passed directly to the crawler strategy.
+- Allows for more customized setups.
+
+### 2. Image Processing Optimization
+- Enhanced image handling in WebScrappingStrategy.
+- Added filtering for small, invisible, or irrelevant images.
+- Improved image scoring system for better content relevance.
+- Implemented JavaScript-based image dimension updating for more accurate representation.
+
+### 3. Database Schema Auto-updates
+- Automatic database schema updates ensure compatibility with the latest version.
+
+### 4. Enhanced Error Handling and Logging
+- Improved error messages and logging for easier debugging.
+
+### 5. Content Extraction Refinements
+- Refined HTML sanitization process.
+- Improved handling of base64 encoded images.
+- Enhanced Markdown conversion process.
+- Optimized content extraction algorithms.
+
+### 6. Utility Function Enhancements
+- `perform_completion_with_backoff` function now supports additional arguments for more customized API calls to LLM providers.

 ## Bug Fixes
-
 - Fixed an issue where image tags were being prematurely removed during content extraction.

+## Examples and Documentation
+- Updated `quickstart_async.py` with examples of:
+  - Using custom headers in LLM extraction.
+  - Different LLM provider usage (OpenAI, Hugging Face, Ollama).
+  - Custom browser type usage.
+
 ## Developer Notes
+- Refactored code for better maintainability, flexibility, and performance.
+- Enhanced type hinting throughout the codebase for improved development experience.
+- Expanded error handling for more robust operation.

-- Added examples for using different LLM providers in `quickstart_async.py`.
-- Enhanced type hinting throughout the codebase for better development experience.
-
+These updates significantly enhance the flexibility, accuracy, and robustness of crawl4ai, providing users with more control and options for their web crawling and content extraction tasks.

 ## [v0.3.5] - 2024-09-02
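
The "How to use" notes added to the v0.3.6 entry in this diff can be combined roughly as in the sketch below. It is only an illustration under stated assumptions: the helper names `crawler_init_kwargs`, `arun_kwargs`, and `crawl` are hypothetical, and the `url` keyword and the `async with` usage are assumed rather than taken from the changelog. The parameters `browser_type`, `page_timeout` (milliseconds), `screenshot`, and `process_iframes` are the ones the changelog names.

```python
def crawler_init_kwargs(browser_type: str = "chromium") -> dict:
    """Build keyword arguments for AsyncWebCrawler(...).

    The changelog lists Chromium, Firefox, and WebKit as supported browser types.
    """
    allowed = {"chromium", "firefox", "webkit"}
    if browser_type not in allowed:
        raise ValueError(f"unsupported browser_type: {browser_type!r}")
    return {"browser_type": browser_type}


def arun_kwargs(
    url: str,
    page_timeout_s: float = 60.0,
    screenshot: bool = False,
    process_iframes: bool = False,
) -> dict:
    """Build keyword arguments for crawler.arun(...).

    The changelog states page_timeout is given in milliseconds,
    so the value in seconds is converted here.
    """
    return {
        "url": url,  # assumption: arun takes the target as a `url` keyword
        "page_timeout": round(page_timeout_s * 1000),
        "screenshot": screenshot,
        "process_iframes": process_iframes,
    }


async def crawl(url: str):
    # Imported lazily so the helpers above work without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler

    # Assumption: the crawler can be used as an async context manager.
    async with AsyncWebCrawler(**crawler_init_kwargs("firefox")) as crawler:
        return await crawler.arun(
            **arun_kwargs(url, screenshot=True, process_iframes=True)
        )
```

With crawl4ai and a Playwright browser installed, something like `asyncio.run(crawl("https://example.com"))` would run the crawl. Keeping the option building in plain functions means the timeout conversion and browser-type check can be verified without launching a browser.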