feat: Add virtual scroll support for modern web scraping

Add comprehensive virtual scroll handling to capture all content from pages that use DOM recycling techniques (Twitter, Instagram, etc).

Key features:
- New VirtualScrollConfig class for configuring virtual scroll behavior
- Automatic detection of three scrolling scenarios: no change, content appended, content replaced
- Intelligent HTML chunk capture and merging with deduplication
- 100% content capture from virtual scroll pages
- Seamless integration with existing extraction strategies
- JavaScript-based detection and capture for performance
- Tree-based DOM merging with text-based deduplication

Documentation:
- Comprehensive guide at docs/md_v2/advanced/virtual-scroll.md
- API reference updates in parameters.md and page-interaction.md
- Blog article explaining the solution and techniques
- Complete examples with local test server

Testing:
- Full test suite achieving 100% capture of 1000 items
- Examples for Twitter timeline, Instagram grid scenarios
- Local test server with different scrolling behaviors

This enables scraping of modern websites that were previously impossible to fully capture with traditional scrolling techniques.
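
The three scrolling scenarios named in the message (no change, content appended, content replaced) can be illustrated with a small sketch. This is not the code from this commit, which performs detection in JavaScript inside the page; it is a simplified Python illustration that assumes each scroll step can be summarized as an ordered list of item texts. The names `ScrollOutcome` and `classify_scroll` are hypothetical.

```python
from enum import Enum


class ScrollOutcome(Enum):
    """Hypothetical labels for the three scenarios in the commit message."""
    NO_CHANGE = "no_change"
    APPENDED = "appended"
    REPLACED = "replaced"


def classify_scroll(before: list[str], after: list[str]) -> ScrollOutcome:
    """Compare the item texts visible before and after one scroll step."""
    if after == before:
        # Nothing new was rendered, e.g. the end of the feed was reached.
        return ScrollOutcome.NO_CHANGE
    # Appended: every previously visible item is still present in the same
    # position, and new items follow it.
    if len(after) > len(before) and after[: len(before)] == before:
        return ScrollOutcome.APPENDED
    # Otherwise at least some earlier items were recycled out of the DOM,
    # so the earlier snapshot has to be kept and merged later.
    return ScrollOutcome.REPLACED


same = classify_scroll(["a", "b"], ["a", "b"])
grown = classify_scroll(["a", "b"], ["a", "b", "c"])
recycled = classify_scroll(["a", "b"], ["b", "c"])
```

Only the replaced case requires capturing intermediate HTML chunks; in the other two cases the final DOM already holds everything that was rendered.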
2025-06-29 20:41:37 +08:00
parent 539a324cf6
commit a353515271
18 changed files with 2194 additions and 6 deletions
--- a/docs/md_v2/core/examples.md
+++ b/docs/md_v2/core/examples.md
@@ -28,6 +28,7 @@ This page provides a comprehensive list of example scripts that demonstrate vari
 | Example | Description | Link |
 |---------|-------------|------|
 | Deep Crawling | An extensive tutorial on deep crawling capabilities, demonstrating BFS and BestFirst strategies, stream vs. non-stream execution, filters, scorers, and advanced configurations. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/deepcrawl_example.py) |
+| Virtual Scroll | Comprehensive examples for handling virtualized scrolling on sites like Twitter and Instagram. Demonstrates different scrolling scenarios with a local test server. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/virtual_scroll_example.py) |
 | Dispatcher | Shows how to use the crawl dispatcher for advanced workload management. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/dispatcher_example.py) |
 | Storage State | Tutorial on managing browser storage state for persistence. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/storage_state_tutorial.md) |
 | Network Console Capture | Demonstrates how to capture and analyze network requests and console logs. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/network_console_capture_example.py) |
--- a/docs/md_v2/core/page-interaction.md
+++ b/docs/md_v2/core/page-interaction.md
@@ -340,4 +340,45 @@ Crawl4AI’s **page interaction** features let you:
 3. **Handle** multi-step flows (like “Load More”) with partial reloads or persistent sessions.  
 4. Combine with **structured extraction** for dynamic sites.

-With these tools, you can scrape modern, interactive webpages confidently. For advanced hooking, user simulation, or in-depth config, check the [API reference](../api/parameters.md) or related advanced docs. Happy scripting!
+With these tools, you can scrape modern, interactive webpages confidently. For advanced hooking, user simulation, or in-depth config, check the [API reference](../api/parameters.md) or related advanced docs. Happy scripting!
+
+---
+
+## 9. Virtual Scrolling
+
+For sites that use **virtual scrolling** (where content is replaced rather than appended as you scroll, like Twitter or Instagram), Crawl4AI provides a dedicated `VirtualScrollConfig`:
+
+```python
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig
+
+async def crawl_twitter_timeline():
+    # Configure virtual scroll for Twitter-like feeds
+    virtual_config = VirtualScrollConfig(
+        container_selector="[data-testid='primaryColumn']",  # Twitter's main column
+        scroll_count=30,                # Scroll 30 times
+        scroll_by="container_height",   # Scroll by container height each time
+        wait_after_scroll=1.0          # Wait 1 second after each scroll
+    )
+    
+    config = CrawlerRunConfig(
+        virtual_scroll_config=virtual_config
+    )
+    
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            url="https://twitter.com/search?q=AI",
+            config=config
+        )
+        # result.html now contains the merged, deduplicated tweets captured across all scroll steps
+```
+
+### Virtual Scroll vs JavaScript Scrolling
+
+| Feature | Virtual Scroll | JS Code Scrolling |
+|---------|---------------|-------------------|
+| **Use Case** | Content replaced during scroll | Content appended or simple scroll |
+| **Configuration** | `VirtualScrollConfig` object | `js_code` with scroll commands |
+| **Automatic Merging** | Yes - merges all unique content | No - captures final state only |
+| **Best For** | Twitter, Instagram, virtual tables | Traditional pages, load more buttons |
+
+For detailed examples and configuration options, see the [Virtual Scroll documentation](../advanced/virtual-scroll.md).
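
The "Automatic Merging" row in the table above can be made concrete with a minimal sketch of text-based deduplication. It is an illustration under stated assumptions, not the implementation added by this commit: the commit message describes tree-based DOM merging, whereas this sketch assumes each captured chunk has already been reduced to a list of item strings. The function `merge_chunks` and the sample data are hypothetical.

```python
def merge_chunks(chunks: list[list[str]]) -> list[str]:
    """Merge per-scroll chunks, keeping the first occurrence of each item.

    Items are compared by normalized text (collapsed whitespace,
    case-insensitive), so an item that stays visible across two
    consecutive scroll steps is kept only once.
    """
    seen: set[str] = set()
    merged: list[str] = []
    for chunk in chunks:
        for item in chunk:
            key = " ".join(item.split()).lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(item)
    return merged


# Three overlapping snapshots, as a virtual list might render them
# while recycling DOM nodes during scrolling.
chunks = [
    ["Item 1", "Item 2", "Item 3"],
    ["Item 3", "Item 4"],
    ["Item 4", "Item 5"],
]
merged = merge_chunks(chunks)
```

Keeping the first occurrence preserves the order in which items appeared on the page. By contrast, a plain `js_code` scroll captures only the last snapshot (here, `Item 4` and `Item 5`).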