feat: Add virtual scroll support for modern web scraping
Add comprehensive virtual scroll handling to capture all content from pages that use DOM recycling techniques (Twitter, Instagram, etc). Key features: - New VirtualScrollConfig class for configuring virtual scroll behavior - Automatic detection of three scrolling scenarios: no change, content appended, content replaced - Intelligent HTML chunk capture and merging with deduplication - 100% content capture from virtual scroll pages - Seamless integration with existing extraction strategies - JavaScript-based detection and capture for performance - Tree-based DOM merging with text-based deduplication Documentation: - Comprehensive guide at docs/md_v2/advanced/virtual-scroll.md - API reference updates in parameters.md and page-interaction.md - Blog article explaining the solution and techniques - Complete examples with local test server Testing: - Full test suite achieving 100% capture of 1000 items - Examples for Twitter timeline, Instagram grid scenarios - Local test server with different scrolling behaviors This enables scraping of modern websites that were previously impossible to fully capture with traditional scrolling techniques.
This commit is contained in:
14
CHANGELOG.md
14
CHANGELOG.md
@@ -5,6 +5,20 @@ All notable changes to Crawl4AI will be documented in this file.
|
||||
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
||||
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
||||
|
||||
## [0.7.x] - 2025-06-29
|
||||
|
||||
### Added
|
||||
- **Virtual Scroll Support**: New `VirtualScrollConfig` for handling virtualized scrolling on modern websites
|
||||
- Automatically detects and handles three scrolling scenarios:
|
||||
- Content unchanged (continue scrolling)
|
||||
- Content appended (traditional infinite scroll)
|
||||
- Content replaced (true virtual scroll - Twitter/Instagram style)
|
||||
- Captures ALL content from pages that replace DOM elements during scroll
|
||||
- Intelligent deduplication based on normalized text content
|
||||
- Configurable scroll amount, count, and wait times
|
||||
- Seamless integration with existing extraction strategies
|
||||
- Comprehensive examples including Twitter timeline, Instagram grid, and mixed content scenarios
|
||||
|
||||
## [Unreleased]
|
||||
|
||||
### Added
|
||||
|
||||
Reference in New Issue
Block a user