Add comprehensive virtual scroll handling to capture all content from pages that use DOM recycling techniques (Twitter, Instagram, etc). Key features: - New VirtualScrollConfig class for configuring virtual scroll behavior - Automatic detection of three scrolling scenarios: no change, content appended, content replaced - Intelligent HTML chunk capture and merging with deduplication - 100% content capture from virtual scroll pages - Seamless integration with existing extraction strategies - JavaScript-based detection and capture for performance - Tree-based DOM merging with text-based deduplication Documentation: - Comprehensive guide at docs/md_v2/advanced/virtual-scroll.md - API reference updates in parameters.md and page-interaction.md - Blog article explaining the solution and techniques - Complete examples with local test server Testing: - Full test suite achieving 100% capture of 1000 items - Examples for Twitter timeline, Instagram grid scenarios - Local test server with different scrolling behaviors This enables scraping of modern websites that were previously impossible to fully capture with traditional scrolling techniques.
108 lines
3.7 KiB
YAML
108 lines
3.7 KiB
YAML
site_name: Crawl4AI Documentation (v0.6.x)
|
|
site_favicon: docs/md_v2/favicon.ico
|
|
site_description: 🚀🤖 Crawl4AI, Open-source LLM-Friendly Web Crawler & Scraper
|
|
site_url: https://docs.crawl4ai.com
|
|
repo_url: https://github.com/unclecode/crawl4ai
|
|
repo_name: unclecode/crawl4ai
|
|
docs_dir: docs/md_v2
|
|
|
|
nav:
|
|
- Home: 'index.md'
|
|
- "Ask AI": "core/ask-ai.md"
|
|
- "Quick Start": "core/quickstart.md"
|
|
- "Code Examples": "core/examples.md"
|
|
- Apps:
|
|
- "Demo Apps": "apps/index.md"
|
|
- "C4A-Script Editor": "apps/c4a-script/index.html"
|
|
- "LLM Context Builder": "apps/llmtxt/index.html"
|
|
- Setup & Installation:
|
|
- "Installation": "core/installation.md"
|
|
- "Docker Deployment": "core/docker-deployment.md"
|
|
- "Blog & Changelog":
|
|
- "Blog Home": "blog/index.md"
|
|
- "Changelog": "https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md"
|
|
- Core:
|
|
- "Command Line Interface": "core/cli.md"
|
|
- "Simple Crawling": "core/simple-crawling.md"
|
|
- "Deep Crawling": "core/deep-crawling.md"
|
|
- "C4A-Script": "core/c4a-script.md"
|
|
- "Crawler Result": "core/crawler-result.md"
|
|
- "Browser, Crawler & LLM Config": "core/browser-crawler-config.md"
|
|
- "Markdown Generation": "core/markdown-generation.md"
|
|
- "Fit Markdown": "core/fit-markdown.md"
|
|
- "Page Interaction": "core/page-interaction.md"
|
|
- "Content Selection": "core/content-selection.md"
|
|
- "Cache Modes": "core/cache-modes.md"
|
|
- "Local Files & Raw HTML": "core/local-files.md"
|
|
- "Link & Media": "core/link-media.md"
|
|
- Advanced:
|
|
- "Overview": "advanced/advanced-features.md"
|
|
- "Virtual Scroll": "advanced/virtual-scroll.md"
|
|
- "File Downloading": "advanced/file-downloading.md"
|
|
- "Lazy Loading": "advanced/lazy-loading.md"
|
|
- "Hooks & Auth": "advanced/hooks-auth.md"
|
|
- "Proxy & Security": "advanced/proxy-security.md"
|
|
- "Session Management": "advanced/session-management.md"
|
|
- "Multi-URL Crawling": "advanced/multi-url-crawling.md"
|
|
- "Crawl Dispatcher": "advanced/crawl-dispatcher.md"
|
|
- "Identity Based Crawling": "advanced/identity-based-crawling.md"
|
|
- "SSL Certificate": "advanced/ssl-certificate.md"
|
|
- "Network & Console Capture": "advanced/network-console-capture.md"
|
|
- Extraction:
|
|
- "LLM-Free Strategies": "extraction/no-llm-strategies.md"
|
|
- "LLM Strategies": "extraction/llm-strategies.md"
|
|
- "Clustering Strategies": "extraction/clustring-strategies.md"
|
|
- "Chunking": "extraction/chunking.md"
|
|
- API Reference:
|
|
- "AsyncWebCrawler": "api/async-webcrawler.md"
|
|
- "arun()": "api/arun.md"
|
|
- "arun_many()": "api/arun_many.md"
|
|
- "Browser, Crawler & LLM Config": "api/parameters.md"
|
|
- "CrawlResult": "api/crawl-result.md"
|
|
- "Strategies": "api/strategies.md"
|
|
- "C4A-Script Reference": "api/c4a-script-reference.md"
|
|
|
|
theme:
|
|
name: 'terminal'
|
|
palette: 'dark'
|
|
custom_dir: docs/md_v2/overrides
|
|
color_mode: 'dark'
|
|
icon:
|
|
repo: fontawesome/brands/github
|
|
|
|
plugins:
|
|
- search
|
|
|
|
markdown_extensions:
|
|
- pymdownx.highlight:
|
|
anchor_linenums: true
|
|
- pymdownx.inlinehilite
|
|
- pymdownx.snippets
|
|
- pymdownx.superfences
|
|
- admonition
|
|
- pymdownx.details
|
|
- attr_list
|
|
- tables
|
|
|
|
extra:
|
|
version: !ENV [CRAWL4AI_VERSION, 'development']
|
|
|
|
extra_css:
|
|
- assets/layout.css
|
|
- assets/styles.css
|
|
- assets/highlight.css
|
|
- assets/dmvendor.css
|
|
- assets/feedback-overrides.css
|
|
|
|
extra_javascript:
|
|
- https://www.googletagmanager.com/gtag/js?id=G-58W0K2ZQ25
|
|
- assets/gtag.js
|
|
- assets/highlight.min.js
|
|
- assets/highlight_init.js
|
|
- https://buttons.github.io/buttons.js
|
|
- assets/toc.js
|
|
- assets/github_stats.js
|
|
- assets/selection_ask_ai.js
|
|
- assets/copy_code.js
|
|
- assets/floating_ask_ai_button.js
|
|
- assets/mobile_menu.js |