feat: Add virtual scroll support for modern web scraping

Add virtual scroll handling to capture all content from pages that recycle DOM nodes while scrolling (e.g., Twitter, Instagram).

Key features:
- New VirtualScrollConfig class for configuring virtual scroll behavior
- Automatic detection of three scrolling scenarios: no change, content appended, content replaced
- Intelligent HTML chunk capture and merging with deduplication
- Captures every item on the virtual scroll test pages (1000 of 1000 in the local test suite)
- Seamless integration with existing extraction strategies
- JavaScript-based detection and capture for performance
- Tree-based DOM merging with text-based deduplication

Documentation:
- Comprehensive guide at docs/md_v2/advanced/virtual-scroll.md
- API reference updates in parameters.md and page-interaction.md
- Blog article explaining the solution and techniques
- Complete examples with local test server

Testing:
- Full test suite achieving 100% capture of 1000 items
- Examples for Twitter timeline, Instagram grid scenarios
- Local test server with different scrolling behaviors

This enables scraping of modern websites that were previously impossible to fully capture with traditional scrolling techniques.
UncleCode
2025-06-29 20:41:37 +08:00
parent 539a324cf6
commit a353515271
18 changed files with 2194 additions and 6 deletions


@@ -169,7 +169,46 @@ Use these for link-level content filtering (often to keep crawls “internal”
---
## 2.2 Helper Methods
### H) **Virtual Scroll Configuration**

| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| **`virtual_scroll_config`** | `VirtualScrollConfig or dict` (None) | Configuration for handling virtualized scrolling on sites like Twitter/Instagram where content is replaced rather than appended. |

When sites use virtual scrolling (content replaced as you scroll), use `VirtualScrollConfig`:

```python
from crawl4ai import CrawlerRunConfig, VirtualScrollConfig

virtual_config = VirtualScrollConfig(
    container_selector="#timeline",  # CSS selector for scrollable container
    scroll_count=30,                 # Maximum number of scrolls to perform
    scroll_by="container_height",    # How much to scroll: "container_height", "page_height", or pixels (e.g. 500)
    wait_after_scroll=0.5            # Seconds to wait after each scroll for content to load
)

# Attach the virtual scroll settings to the run configuration
config = CrawlerRunConfig(
    virtual_scroll_config=virtual_config
)
```
**VirtualScrollConfig Parameters:**

| **Parameter** | **Type / Default** | **What It Does** |
|------------------------|---------------------------|-------------------------------------------------------------------------------------------|
| **`container_selector`** | `str` (required) | CSS selector for the scrollable container (e.g., `"#feed"`, `".timeline"`) |
| **`scroll_count`** | `int` (10) | Maximum number of scrolls to perform |
| **`scroll_by`** | `str or int` ("container_height") | Scroll amount: `"container_height"`, `"page_height"`, or pixels (e.g., `500`) |
| **`wait_after_scroll`** | `float` (0.5) | Time in seconds to wait after each scroll for new content to load |
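
The three accepted forms of `scroll_by` can be read as a mapping to a pixel distance. The sketch below illustrates that reading only. `resolve_scroll_pixels` and its `container_height` / `page_height` arguments are hypothetical names, not part of the library, and the real resolution may differ (for example, it may happen in the browser rather than in Python).

```python
from typing import Union


def resolve_scroll_pixels(
    scroll_by: Union[str, int],
    container_height: int,
    page_height: int,
) -> int:
    """Hypothetical helper: turn a scroll_by value into a pixel distance."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(scroll_by, bool):
        raise TypeError("scroll_by must be a str or an int, not bool")
    if isinstance(scroll_by, int):
        if scroll_by <= 0:
            raise ValueError("pixel scroll amount must be positive")
        return scroll_by  # explicit pixel count, e.g. 500
    if scroll_by == "container_height":
        return container_height  # one container viewport per scroll
    if scroll_by == "page_height":
        return page_height  # one page viewport per scroll
    raise ValueError(f"unsupported scroll_by value: {scroll_by!r}")


# Example: a 600px-tall container inside a 900px-tall page
distances = [
    resolve_scroll_pixels(value, container_height=600, page_height=900)
    for value in ("container_height", "page_height", 500)
]
```

Scrolling by one container height per step is a natural default under this reading, since each step then reveals roughly one new screen of items.
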

**When to use Virtual Scroll vs scan_full_page:**

- Use `virtual_scroll_config` when content is **replaced** during scroll (Twitter, Instagram)
- Use `scan_full_page` when content is **appended** during scroll (traditional infinite scroll)

See [Virtual Scroll documentation](../../advanced/virtual-scroll.md) for detailed examples.

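
To make the replaced-versus-appended distinction concrete, here is a minimal sketch that classifies one scroll step and merges captured chunks with text-based deduplication. It is an illustration, not the library's implementation. It works on plain lists of item text rather than HTML, and `classify_scroll` and `merge_chunks` are hypothetical names.

```python
from typing import Iterable, List


def classify_scroll(before: List[str], after: List[str]) -> str:
    """Compare the item texts visible before and after one scroll step."""
    if after == before:
        return "no_change"  # nothing new loaded
    if after[: len(before)] == before:
        return "appended"  # old items kept, new ones added at the end
    return "replaced"  # some old items left the DOM (virtual scroll)


def merge_chunks(chunks: Iterable[List[str]]) -> List[str]:
    """Merge captured chunks in order, dropping items whose text was already seen."""
    seen = set()
    merged = []
    for chunk in chunks:
        for text in chunk:
            key = " ".join(text.split())  # normalize whitespace before comparing
            if key and key not in seen:
                seen.add(key)
                merged.append(text)
    return merged


# Three overlapping captures from a simulated virtual-scroll feed
captures = [
    ["post 1", "post 2", "post 3"],
    ["post 3", "post 4", "post 5"],
    ["post 5", "post 6"],
]
steps = [classify_scroll(a, b) for a, b in zip(captures, captures[1:])]
all_posts = merge_chunks(captures)
```

In this sketch a `"replaced"` step is the case where `virtual_scroll_config` applies, while a page that only ever yields `"appended"` steps can be handled with `scan_full_page`.
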
---
## 2.2 Helper Methods
Both `BrowserConfig` and `CrawlerRunConfig` provide a `clone()` method to create modified copies: