feat: Add virtual scroll support for modern web scraping

Add comprehensive virtual scroll handling to capture all content from pages that use DOM recycling techniques (Twitter, Instagram, etc).

Key features:
- New VirtualScrollConfig class for configuring virtual scroll behavior
- Automatic detection of three scrolling scenarios: no change, content appended, content replaced
- Intelligent HTML chunk capture and merging with deduplication
- 100% content capture from virtual scroll pages
- Seamless integration with existing extraction strategies
- JavaScript-based detection and capture for performance
- Tree-based DOM merging with text-based deduplication

Documentation:
- Comprehensive guide at docs/md_v2/advanced/virtual-scroll.md
- API reference updates in parameters.md and page-interaction.md
- Blog article explaining the solution and techniques
- Complete examples with local test server

Testing:
- Full test suite achieving 100% capture of 1000 items
- Examples for Twitter timeline, Instagram grid scenarios
- Local test server with different scrolling behaviors

This enables scraping of modern websites that were previously impossible to fully capture with traditional scrolling techniques.
2025-06-29 20:41:37 +08:00
parent 539a324cf6
commit a353515271
18 changed files with 2194 additions and 6 deletions
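
The commit message names three scrolling scenarios (no change, content appended, content replaced) and a merge step with text-based deduplication. The sketch below illustrates that logic in plain Python on lists of item texts. It is not the code from this commit, which according to the message does the detection in JavaScript and merges DOM trees. The function names `classify_scroll` and `merge_chunks` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List


def classify_scroll(before: List[str], after: List[str]) -> str:
    """Classify what one scroll step did to the container's visible items.

    Hypothetical helper: `before` and `after` are the item texts seen in the
    scroll container before and after a single scroll.
    """
    if after == before:
        return "no_change"
    # Appended: every earlier item is still present, in order, with new ones after it.
    if len(after) > len(before) and after[: len(before)] == before:
        return "appended"
    # Anything else means earlier items were dropped or recycled.
    return "replaced"


def merge_chunks(chunks: Iterable[List[str]]) -> List[str]:
    """Merge captured chunks in order, keeping the first occurrence of each item."""
    seen = set()
    merged = []
    for chunk in chunks:
        for text in chunk:
            # Compare on whitespace-normalised text so trivial spacing
            # differences do not produce duplicates.
            key = " ".join(text.split())
            if key not in seen:
                seen.add(key)
                merged.append(text)
    return merged


if __name__ == "__main__":
    snapshots = [
        ["post 1", "post 2", "post 3"],
        ["post 3", "post 4", "post 5"],  # window moved: older posts were recycled
        ["post 5", "post 6"],
    ]
    print(classify_scroll(snapshots[0], snapshots[1]))
    print(merge_chunks(snapshots))
```

Under these assumptions, only the "replaced" case needs per-scroll capture and merging. In the "appended" case the final DOM already holds every item, which matches the `scan_full_page` guidance in the documentation added below.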
--- a/docs/md_v2/api/parameters.md
+++ b/docs/md_v2/api/parameters.md
@@ -169,7 +169,46 @@ Use these for link-level content filtering (often to keep crawls “internal”

 ---

-## 2.2 Helper Methods
+
+### H) **Virtual Scroll Configuration**
+
+| **Parameter**                | **Type / Default**           | **What It Does**                                                                                                                    |
+|------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
+| **`virtual_scroll_config`**  | `VirtualScrollConfig or dict` (None) | Configuration for handling virtualized scrolling on sites like Twitter/Instagram where content is replaced rather than appended. |
+
+When sites use virtual scrolling (content replaced as you scroll), use `VirtualScrollConfig`:
+
+```python
+from crawl4ai import CrawlerRunConfig, VirtualScrollConfig
+
+virtual_config = VirtualScrollConfig(
+    container_selector="#timeline",    # CSS selector for scrollable container
+    scroll_count=30,                   # Maximum number of scrolls to perform
+    scroll_by="container_height",      # How much to scroll: "container_height", "page_height", or pixels (e.g. 500)
+    wait_after_scroll=0.5              # Seconds to wait after each scroll for content to load
+)
+
+config = CrawlerRunConfig(
+    virtual_scroll_config=virtual_config
+)
+```
+
+**VirtualScrollConfig Parameters:**
+
+| **Parameter**          | **Type / Default**        | **What It Does**                                                                          |
+|------------------------|---------------------------|-------------------------------------------------------------------------------------------|
+| **`container_selector`** | `str` (required)        | CSS selector for the scrollable container (e.g., `"#feed"`, `".timeline"`)              |
+| **`scroll_count`**     | `int` (10)               | Maximum number of scrolls to perform                                                      |
+| **`scroll_by`**        | `str or int` ("container_height") | Scroll amount: `"container_height"`, `"page_height"`, or pixels (e.g., `500`)   |
+| **`wait_after_scroll`** | `float` (0.5)           | Time in seconds to wait after each scroll for new content to load                        |
+
+**When to use Virtual Scroll vs `scan_full_page`:**
+
+- Use `virtual_scroll_config` when content is **replaced** during scroll (Twitter, Instagram)
+- Use `scan_full_page` when content is **appended** during scroll (traditional infinite scroll)
+
+See [Virtual Scroll documentation](../advanced/virtual-scroll.md) for detailed examples.
+
+---
+
+## 2.2 Helper Methods

 Both `BrowserConfig` and `CrawlerRunConfig` provide a `clone()` method to create modified copies: