feat: Add virtual scroll support for modern web scraping

Add comprehensive virtual scroll handling to capture all content from pages that use DOM recycling techniques (Twitter, Instagram, etc.).

Key features:
- New VirtualScrollConfig class for configuring virtual scroll behavior
- Automatic detection of three scrolling scenarios: no change, content appended, content replaced
- Intelligent HTML chunk capture and merging with deduplication
- 100% content capture from virtual scroll pages in the included test suite (1000 of 1000 items)
- Seamless integration with existing extraction strategies
- JavaScript-based detection and capture for performance
- Tree-based DOM merging with text-based deduplication
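
The scenario detection and deduplicated merge listed above can be sketched in plain Python. This is a minimal illustration, not the code in this commit: `classify_scroll_change` and `merge_chunks` are hypothetical helpers, and each captured chunk is modeled as a list of item text strings rather than real HTML.

```python
from typing import List


def classify_scroll_change(before: List[str], after: List[str]) -> str:
    """Classify what one scroll step did to the visible items.

    Hypothetical helper for illustration only; not part of crawl4ai's API.
    """
    if after == before:
        return "no_change"
    # Appended: every earlier item is still present, in order, with new items after.
    if len(after) > len(before) and after[: len(before)] == before:
        return "appended"
    # Anything else means at least some earlier items were swapped out.
    return "replaced"


def merge_chunks(chunks: List[List[str]]) -> List[str]:
    """Merge captured chunks in order, dropping items whose text was already seen."""
    seen = set()
    merged = []
    for chunk in chunks:
        for item in chunk:
            # Normalize whitespace and case so trivially different copies collapse.
            key = " ".join(item.split()).lower()
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged
```

Only the "replaced" case needs intermediate captures; in the other two cases the final page state already holds everything that was loaded.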

Documentation:
- Comprehensive guide at docs/md_v2/advanced/virtual-scroll.md
- API reference updates in parameters.md and page-interaction.md
- Blog article explaining the solution and techniques
- Complete examples with local test server

Testing:
- Full test suite achieving 100% capture of 1000 items
- Examples for Twitter timeline, Instagram grid scenarios
- Local test server with different scrolling behaviors

This enables scraping of modern websites that were previously impossible to fully capture with traditional scrolling techniques.
UncleCode
2025-06-29 20:41:37 +08:00
parent 539a324cf6
commit a353515271
18 changed files with 2194 additions and 6 deletions

@@ -28,6 +28,7 @@ This page provides a comprehensive list of example scripts that demonstrate vari
| Example | Description | Link |
|---------|-------------|------|
| Deep Crawling | An extensive tutorial on deep crawling capabilities, demonstrating BFS and BestFirst strategies, stream vs. non-stream execution, filters, scorers, and advanced configurations. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/deepcrawl_example.py) |
| Virtual Scroll | Comprehensive examples for handling virtualized scrolling on sites like Twitter and Instagram. Demonstrates different scrolling scenarios with a local test server. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/virtual_scroll_example.py) |
| Dispatcher | Shows how to use the crawl dispatcher for advanced workload management. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/dispatcher_example.py) |
| Storage State | Tutorial on managing browser storage state for persistence. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/storage_state_tutorial.md) |
| Network Console Capture | Demonstrates how to capture and analyze network requests and console logs. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/network_console_capture_example.py) |

@@ -340,4 +340,45 @@ Crawl4AI's **page interaction** features let you:
3. **Handle** multi-step flows (like “Load More”) with partial reloads or persistent sessions.
4. Combine with **structured extraction** for dynamic sites.
With these tools, you can scrape modern, interactive webpages confidently. For advanced hooking, user simulation, or in-depth config, check the [API reference](../api/parameters.md) or related advanced docs. Happy scripting!

---

## 9. Virtual Scrolling

For sites that use **virtual scrolling** (where content is replaced rather than appended as you scroll, like Twitter or Instagram), Crawl4AI provides a dedicated `VirtualScrollConfig`:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig

async def crawl_twitter_timeline():
    # Configure virtual scroll for Twitter-like feeds
    virtual_config = VirtualScrollConfig(
        container_selector="[data-testid='primaryColumn']",  # Twitter's main column
        scroll_count=30,                # Scroll 30 times
        scroll_by="container_height",   # Scroll by container height each time
        wait_after_scroll=1.0           # Wait 1 second after each scroll
    )

    config = CrawlerRunConfig(
        virtual_scroll_config=virtual_config
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://twitter.com/search?q=AI",
            config=config
        )
        # result.html now contains ALL tweets from the virtual scroll
```
### Virtual Scroll vs JavaScript Scrolling

| Feature | Virtual Scroll | JS Code Scrolling |
|---------|---------------|-------------------|
| **Use Case** | Content replaced during scroll | Content appended or simple scroll |
| **Configuration** | `VirtualScrollConfig` object | `js_code` with scroll commands |
| **Automatic Merging** | Yes - merges all unique content | No - captures final state only |
| **Best For** | Twitter, Instagram, virtual tables | Traditional pages, load more buttons |

For detailed examples and configuration options, see the [Virtual Scroll documentation](../advanced/virtual-scroll.md).
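
The "Automatic Merging" row of the comparison can be illustrated with a small, self-contained simulation. It does not call Crawl4AI; `simulate_virtual_scroll` is a hypothetical function, and the item count, window size, and half-window scroll step are assumptions chosen for the sketch. It models a virtualized list that keeps only a window of items in the DOM, then compares capturing the final state with capturing and merging after every scroll.

```python
def simulate_virtual_scroll(total=1000, window=50):
    """Compare final-state capture with per-scroll capture plus merging.

    Hypothetical simulation: the "DOM" is just the slice of items
    currently visible, and earlier items are recycled away on scroll.
    """
    items = [f"item-{i}" for i in range(total)]
    step = window // 2  # assumed scroll step, so consecutive windows overlap

    final_state = []  # what a single capture at the end would see
    merged = []       # what capture-and-merge accumulates
    seen = set()

    for start in range(0, total, step):
        visible = items[start:start + window]  # items currently in the DOM
        final_state = visible                  # each scroll replaces the previous view
        for item in visible:
            if item not in seen:               # skip items from the overlapping region
                seen.add(item)
                merged.append(item)

    return final_state, merged


final_state, merged = simulate_virtual_scroll()
print(f"final state only: {len(final_state)} items")
print(f"merged captures:  {len(merged)} items")
```

Under these assumptions the final-state capture holds only the last window, while the merged capture recovers every item exactly once, which is the behavior the table attributes to virtual scroll handling.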