feat: Add virtual scroll support for modern web scraping
Add comprehensive virtual scroll handling to capture all content from pages that use DOM recycling techniques (Twitter, Instagram, etc.).

Key features:
- New VirtualScrollConfig class for configuring virtual scroll behavior
- Automatic detection of three scrolling scenarios: no change, content appended, content replaced
- Intelligent HTML chunk capture and merging with deduplication
- 100% content capture from virtual scroll pages
- Seamless integration with existing extraction strategies
- JavaScript-based detection and capture for performance
- Tree-based DOM merging with text-based deduplication

Documentation:
- Comprehensive guide at docs/md_v2/advanced/virtual-scroll.md
- API reference updates in parameters.md and page-interaction.md
- Blog article explaining the solution and techniques
- Complete examples with local test server

Testing:
- Full test suite achieving 100% capture of 1000 items
- Examples for Twitter timeline, Instagram grid scenarios
- Local test server with different scrolling behaviors

This enables scraping of modern websites that were previously impossible to fully capture with traditional scrolling techniques.
@@ -28,6 +28,7 @@ This page provides a comprehensive list of example scripts that demonstrate vari
| Example | Description | Link |
|---------|-------------|------|
| Deep Crawling | An extensive tutorial on deep crawling capabilities, demonstrating BFS and BestFirst strategies, stream vs. non-stream execution, filters, scorers, and advanced configurations. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/deepcrawl_example.py) |
| Virtual Scroll | Comprehensive examples for handling virtualized scrolling on sites like Twitter and Instagram. Demonstrates different scrolling scenarios with a local test server. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/virtual_scroll_example.py) |
| Dispatcher | Shows how to use the crawl dispatcher for advanced workload management. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/dispatcher_example.py) |
| Storage State | Tutorial on managing browser storage state for persistence. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/storage_state_tutorial.md) |
| Network Console Capture | Demonstrates how to capture and analyze network requests and console logs. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/network_console_capture_example.py) |

@@ -340,4 +340,45 @@ Crawl4AI’s **page interaction** features let you:
3. **Handle** multi-step flows (like “Load More”) with partial reloads or persistent sessions.
4. Combine with **structured extraction** for dynamic sites.

With these tools, you can scrape modern, interactive webpages confidently. For advanced hooking, user simulation, or in-depth config, check the [API reference](../api/parameters.md) or related advanced docs. Happy scripting!

---

## 9. Virtual Scrolling

Some sites, such as Twitter or Instagram, use **virtual scrolling**: as you scroll, items that leave the viewport are removed from the DOM and replaced by new ones rather than appended. For these sites, Crawl4AI provides a dedicated `VirtualScrollConfig`:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig


async def crawl_twitter_timeline():
    # Configure virtual scroll for Twitter-like feeds
    virtual_config = VirtualScrollConfig(
        container_selector="[data-testid='primaryColumn']",  # Twitter's main column
        scroll_count=30,                # Scroll 30 times
        scroll_by="container_height",   # Scroll by container height each time
        wait_after_scroll=1.0,          # Wait 1 second after each scroll
    )

    config = CrawlerRunConfig(
        virtual_scroll_config=virtual_config
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://twitter.com/search?q=AI",
            config=config,
        )
        # result.html now holds the merged content captured across all scrolls
        return result


if __name__ == "__main__":
    asyncio.run(crawl_twitter_timeline())
```
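
Virtual scroll handling has to tell apart three outcomes after each scroll step: nothing changed, new content was appended, or existing content was replaced. The sketch below shows one way such a classification could work on lists of visible item texts. It is an illustration only, not Crawl4AI's detection code (which runs as JavaScript in the page); `classify_scroll` and its return labels are hypothetical names chosen for this example.

```python
from typing import List


def classify_scroll(before: List[str], after: List[str]) -> str:
    """Classify what a scroll step did to the visible items.

    `before` and `after` are the item texts rendered in the container
    before and after the scroll (hypothetical inputs for illustration).
    """
    if after == before:
        return "no_change"
    # Appended: everything that was visible is still there, in the same
    # order, and new items follow it.
    if len(after) > len(before) and after[: len(before)] == before:
        return "appended"
    # Anything else means earlier items were recycled out of the DOM.
    return "replaced"


print(classify_scroll(["a", "b"], ["a", "b"]))       # no_change
print(classify_scroll(["a", "b"], ["a", "b", "c"]))  # appended
print(classify_scroll(["a", "b"], ["b", "c"]))       # replaced
```

Only the "replaced" case needs per-scroll capture and merging; in the other two cases the final DOM already contains everything that was rendered.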

### Virtual Scroll vs JavaScript Scrolling

| Feature | Virtual Scroll | JS Code Scrolling |
|---------|---------------|-------------------|
| **Use Case** | Content replaced during scroll | Content appended or simple scroll |
| **Configuration** | `VirtualScrollConfig` object | `js_code` with scroll commands |
| **Automatic Merging** | Yes - merges all unique content | No - captures final state only |
| **Best For** | Twitter, Instagram, virtual tables | Traditional pages, load more buttons |
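
The "Automatic Merging" row is the main practical difference. As a rough illustration of what text-based deduplication across captured chunks can look like, the sketch below merges per-scroll lists of item texts while preserving first-seen order. `merge_chunks` is a hypothetical helper written for this example; Crawl4AI's own merge works on the DOM tree and may differ in detail.

```python
from typing import Iterable, List


def merge_chunks(chunks: Iterable[List[str]]) -> List[str]:
    """Merge per-scroll captures, keeping each distinct item once."""
    seen = set()
    merged: List[str] = []
    for chunk in chunks:
        for text in chunk:
            # Normalize whitespace so the same item rendered twice with
            # different spacing still counts as a duplicate.
            key = " ".join(text.split())
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


captures = [
    ["item 1", "item 2", "item 3"],  # initial viewport
    ["item 3", "item 4", "item 5"],  # after first scroll (overlaps by one)
    ["item 5 ", "item 6"],           # after second scroll
]
print(merge_chunks(captures))
# ['item 1', 'item 2', 'item 3', 'item 4', 'item 5', 'item 6']
```

With plain `js_code` scrolling on a virtualized page, only the last of these captures would survive, which is why the final-state approach loses earlier items.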

For detailed examples and configuration options, see the [Virtual Scroll documentation](../advanced/virtual-scroll.md).