crawl4ai

Author	SHA1	Message	Date
unclecode	8576331d4e	Add Shadow DOM flattening and reorder js_code execution pipeline - Add `flatten_shadow_dom` option to CrawlerRunConfig that serializes shadow DOM content into the light DOM before HTML capture. Uses a recursive serializer that resolves <slot> projections and strips only shadow-scoped <style> tags. Also injects an init script to force-open closed shadow roots via attachShadow patching. - Move `js_code` execution to after `wait_for` + `delay_before_return_html` so user scripts run on the fully-hydrated page. Add `js_code_before_wait` for the less common case of triggering loading before waiting. - Add JS snippet (flatten_shadow_dom.js), integration test, example, and documentation across all relevant doc files.	2026-02-18 06:43:00 +00:00
unclecode	3fc7730aaf	Add remove_consent_popups flag and fix from_kwargs dict deserialization. Add CrawlerRunConfig.remove_consent_popups (bool, default False) that targets GDPR/cookie consent popups from 70+ known CMP providers including OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Usercentrics, Sourcepoint, Google FundingChoices, and many more. The JS strategy uses a 5-phase approach: (1) Click "Accept All" buttons (cleanest dismissal, sets cookies); (2) Try CMP JavaScript APIs (__tcfapi, Didomi, Cookiebot, Osano, Klaro); (3) Remove known CMP containers by selector (~120 selectors); (4) Handle iframe-based and shadow DOM CMPs; (5) Restore body scroll and remove CMP body classes. Also fix from_kwargs() in CrawlerRunConfig and BrowserConfig to auto-deserialize dict values using the existing from_serializable_dict() infrastructure. Previously, strategy objects like markdown_generator arriving as {"type": "DefaultMarkdownGenerator", "params": {...}} from JSON APIs were passed through as raw dicts, causing crashes when the crawler later called methods on them.	2026-02-11 12:46:47 +00:00
UncleCode	a353515271	feat: Add virtual scroll support for modern web scraping. Add comprehensive virtual scroll handling to capture all content from pages that use DOM recycling techniques (Twitter, Instagram, etc). Key features: new VirtualScrollConfig class for configuring virtual scroll behavior; automatic detection of three scrolling scenarios (no change, content appended, content replaced); intelligent HTML chunk capture and merging with deduplication; 100% content capture from virtual scroll pages; seamless integration with existing extraction strategies; JavaScript-based detection and capture for performance; tree-based DOM merging with text-based deduplication. Documentation: comprehensive guide at docs/md_v2/advanced/virtual-scroll.md; API reference updates in parameters.md and page-interaction.md; blog article explaining the solution and techniques; complete examples with local test server. Testing: full test suite achieving 100% capture of 1000 items; examples for Twitter timeline and Instagram grid scenarios; local test server with different scrolling behaviors. This enables scraping of modern websites that were previously impossible to fully capture with traditional scrolling techniques.	2025-06-29 20:41:37 +08:00
UncleCode	c0fd36982d	Update all documentation to import extraction strategies directly from crawl4ai.	2025-06-10 18:08:27 +08:00
UncleCode	ca3e33122e	refactor(docs): reorganize documentation structure and update styles. Reorganize documentation into core/advanced/extraction sections for better navigation. Update terminal theme styles and add rich library for better CLI output. Remove redundant tutorial files and consolidate content into core sections. Add personal story to index page for project context. BREAKING CHANGE: Documentation structure has been significantly reorganized.	2025-01-07 20:49:50 +08:00
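
Commit a353515271 names three scrolling scenarios (no change, content appended, content replaced) and a merge step with text-based deduplication. The sketch below illustrates that idea only. It is not the crawl4ai implementation, which the commit describes as JavaScript-based detection with tree-based DOM merging. Here each captured chunk is simplified to a list of item texts, and `classify_scroll` and `merge_chunks` are hypothetical names.

```python
from typing import List


def classify_scroll(before: List[str], after: List[str]) -> str:
    """Classify what one scroll step did to the visible items (simplified)."""
    if after == before:
        return "no_change"   # nothing new was rendered
    if after[:len(before)] == before:
        return "appended"    # old items kept, new ones added at the end
    return "replaced"        # DOM recycling: old items were swapped out


def merge_chunks(chunks: List[List[str]]) -> List[str]:
    """Merge captured chunks in order, dropping items whose text was already seen."""
    seen = set()
    merged = []
    for chunk in chunks:
        for item in chunk:
            # Assumed dedup key: whitespace-collapsed, lowercased text.
            key = " ".join(item.split()).lower()
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged
```

In the "replaced" case a crawler would need to capture a chunk before each scroll, since earlier items no longer exist in the DOM. The merge step then removes the overlap between consecutive captures.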

5 Commits
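
Commit 3fc7730aaf describes `from_kwargs()` auto-deserializing dict values of the form `{"type": ..., "params": {...}}` so that strategy objects are no longer passed through as raw dicts. The following is a minimal, self-contained sketch of that pattern under stated assumptions. The `REGISTRY` lookup, the `RunConfig` class, and the stand-in `DefaultMarkdownGenerator` with its `options` field are hypothetical and do not reproduce crawl4ai's actual classes or its `from_serializable_dict()` logic.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class DefaultMarkdownGenerator:
    """Hypothetical stand-in for a strategy class (not the real crawl4ai class)."""
    options: Optional[Dict[str, Any]] = None


# Assumed registry mapping serialized type names to classes.
REGISTRY = {"DefaultMarkdownGenerator": DefaultMarkdownGenerator}


def from_serializable_dict(value: Any) -> Any:
    """Rebuild an object from {"type": ..., "params": {...}}; pass other values through."""
    if (
        isinstance(value, dict)
        and set(value) == {"type", "params"}
        and value["type"] in REGISTRY
    ):
        params = {k: from_serializable_dict(v) for k, v in value["params"].items()}
        return REGISTRY[value["type"]](**params)
    return value


@dataclass
class RunConfig:
    """Hypothetical config with one strategy field and one plain flag."""
    markdown_generator: Any = None
    remove_consent_popups: bool = False

    @classmethod
    def from_kwargs(cls, kwargs: Dict[str, Any]) -> "RunConfig":
        # Deserialize every value first so strategies arrive as objects, not dicts.
        return cls(**{k: from_serializable_dict(v) for k, v in kwargs.items()})


cfg = RunConfig.from_kwargs({
    "markdown_generator": {
        "type": "DefaultMarkdownGenerator",
        "params": {"options": {"ignore_links": True}},
    },
    "remove_consent_popups": True,
})
```

After this call, `cfg.markdown_generator` is an instance rather than a dict, so later method calls on it would not fail the way the commit message describes for the old behavior.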