[v0.3.72] Enhance content extraction and proxy support

- Add ContentCleaningStrategy for improved content extraction
- Implement advanced proxy configuration with authentication
- Enhance image source detection and handling
- Add fit_markdown and fit_html for refined content output
- Improve external link and image handling flexibility
2024-10-22 20:19:22 +08:00
parent 04d16e6d2b
commit 60ba131ac8
6 changed files with 260 additions and 3 deletions
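
Reviewer sketch (not part of the commit): the changelog entry in this commit describes an automatic fallback through several image source attributes and special handling of `srcset`. The snippet below is a minimal illustration of that idea, not the project's implementation. The function names, the attribute order, and the choice of the first `srcset` candidate are all assumptions; the changelog names only `src`, `data-src`, and `srcset`.

```python
from typing import Mapping, Optional

# Hypothetical lookup order; the real project may check more attributes
# or use a different priority.
CANDIDATE_ATTRS = ("src", "data-src", "srcset")


def first_srcset_url(srcset: str) -> Optional[str]:
    """Return the URL of the first candidate in a srcset value.

    Naive parsing: splits on commas, so URLs that themselves contain
    commas (for example data URIs) are not handled.
    """
    for candidate in srcset.split(","):
        parts = candidate.strip().split()
        if parts:
            # Each candidate looks like "url [descriptor]"; keep the URL.
            return parts[0]
    return None


def pick_image_source(attrs: Mapping[str, str]) -> Optional[str]:
    """Walk the candidate attributes and return the first usable URL."""
    for name in CANDIDATE_ATTRS:
        value = (attrs.get(name) or "").strip()
        if not value:
            continue  # attribute missing or blank, try the next one
        if name == "srcset":
            url = first_srcset_url(value)
            if url:
                return url
            continue
        return value
    return None
```

For a lazy-loaded image whose `src` is blank, `pick_image_source({"src": " ", "data-src": "b.png"})` falls through to `data-src`.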
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,39 @@
 # Changelog

+## [v0.3.72] - 2024-10-22
+
+### Added
+- New `ContentCleaningStrategy` class:
+  - Smart content extraction based on text density and element scoring
+  - Automatic removal of boilerplate content
+  - DOM tree analysis for better content identification
+  - Configurable thresholds for content detection
+- Advanced proxy support:
+  - Added `proxy_config` option for authenticated proxy connections
+  - Support for username/password in proxy configuration
+- New content output formats:
+  - `fit_markdown`: Optimized markdown output with main content focus
+  - `fit_html`: Clean HTML with only essential content
+
+### Enhanced
+- Image source detection:
+  - Support for multiple image source attributes (`src`, `data-src`, `srcset`, etc.)
+  - Automatic fallback through potential source attributes
+  - Smart handling of srcset attribute
+- External content handling:
+  - Made external link exclusion optional (disabled by default)
+  - Improved detection and handling of social media links
+  - Better control over external image filtering
+
+### Fixed
+- Image extraction reliability with multiple source attribute checks
+- External link and image handling logic for better accuracy
+
+### Developer Notes
+- The new `ContentCleaningStrategy` uses configurable thresholds for customization
+- Proxy configuration now supports more complex authentication scenarios
+- Content extraction process now provides both regular and optimized outputs
+
 ## [v0.3.72] - 2024-10-20

 ### Fixed