Add Shadow DOM flattening and reorder js_code execution pipeline

- Add `flatten_shadow_dom` option to CrawlerRunConfig that serializes
  shadow DOM content into the light DOM before HTML capture. Uses a
  recursive serializer that resolves `<slot>` projections and strips
  only shadow-scoped `<style>` tags. Also injects an init script to
  force-open closed shadow roots via attachShadow patching.

- Move `js_code` execution to after `wait_for` + `delay_before_return_html`
  so user scripts run on the fully-hydrated page. Add `js_code_before_wait`
  for the less common case of triggering loading before waiting.

- Add JS snippet (flatten_shadow_dom.js), integration test, example,
  and documentation across all relevant doc files.
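
The reordered pipeline can be sketched as a config (a minimal sketch; the parameter values and selectors are illustrative, and `js_code_before_wait` is the new option named above):

```python
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    # 1. Runs before the wait phase -- use it to trigger loading
    js_code_before_wait="window.scrollTo(0, document.body.scrollHeight);",
    # 2. Then the crawler waits for the page to settle
    wait_for="css:.content-loaded",
    delay_before_return_html=2.0,
    # 3. js_code now runs last, on the fully-hydrated page
    js_code="document.querySelector('#expand-all')?.click();",
)
```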
unclecode
2026-02-18 06:43:00 +00:00
parent 4fb02f8b50
commit 8576331d4e
11 changed files with 522 additions and 66 deletions


@@ -183,6 +183,55 @@ if __name__ == "__main__":
---
## 3.1 Flattening Shadow DOM
Sites built with **Web Components** (Stencil, Lit, Shoelace, Angular Elements, etc.) render content inside [Shadow DOM](https://developer.mozilla.org/en-US/docs/Web/API/Web_components/Using_shadow_DOM) — an encapsulated sub-tree that is invisible to normal page serialization. The browser renders it on screen, but `page.content()` never includes it.
Set `flatten_shadow_dom=True` to walk all shadow trees, resolve `<slot>` projections, and produce a single flat HTML document:
```python
config = CrawlerRunConfig(
    # Flatten shadow DOM into the main document
    flatten_shadow_dom=True,
    # Give web components time to hydrate
    wait_until="load",
    delay_before_return_html=3.0,
)
```
**Full example** — crawling a product page where specs live inside shadow roots:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def main():
    config = CrawlerRunConfig(
        flatten_shadow_dom=True,
        wait_until="load",
        delay_before_return_html=3.0,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://store.boschrexroth.com/en/us/p/hydraulic-cylinder-r900999011",
            config=config,
        )
        # Without flatten_shadow_dom: ~1 KB of markdown (breadcrumbs only)
        # With flatten_shadow_dom: ~33 KB (full product specs, downloads, etc.)
        print(len(result.markdown.raw_markdown))


if __name__ == "__main__":
    asyncio.run(main())
```
When `flatten_shadow_dom=True` is set, Crawl4AI also injects an init script that force-opens **closed** shadow roots (by patching `Element.prototype.attachShadow`), so even components that use `mode: 'closed'` become accessible.
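The injected script is conceptually equivalent to the following (a sketch of the technique, not the exact contents of `flatten_shadow_dom.js`; the JS is shown as a Python string so it could be passed to a browser init hook):

```python
# Sketch: patch attachShadow so closed shadow roots are created open.
# The real init script shipped in flatten_shadow_dom.js may differ.
FORCE_OPEN_SHADOW_ROOTS = """
(() => {
  const originalAttachShadow = Element.prototype.attachShadow;
  Element.prototype.attachShadow = function (init) {
    // Force every shadow root open so it stays reachable for serialization
    return originalAttachShadow.call(this, { ...init, mode: 'open' });
  };
})();
"""
```

Because the patch runs before any page script, components that request `mode: 'closed'` still get a working shadow root, just one that remains queryable via `element.shadowRoot`.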
> **Tip**: Web components need JavaScript to run before they render content (a process called *hydration*). Use `wait_until="load"` and a `delay_before_return_html` of 2–5 seconds to ensure components are fully hydrated before flattening.
For a complete runnable example, see [`shadow_dom_crawling.py`](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/shadow_dom_crawling.py).
---
## 4. Structured Extraction Examples
You can combine content selection with a more advanced extraction strategy. For instance, a **CSS-based** or **LLM-based** extraction strategy can run on the filtered HTML.
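
For example, a CSS-based strategy (here `JsonCssExtractionStrategy`, with an illustrative schema whose selectors are placeholders for your target site) can be layered on top of content selection:

```python
from crawl4ai import CrawlerRunConfig, JsonCssExtractionStrategy

# Illustrative schema -- adjust selectors to the site you are crawling
schema = {
    "name": "Products",
    "baseSelector": "div.product-card",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
    ],
}

config = CrawlerRunConfig(
    css_selector="main.catalog",  # select the region first...
    extraction_strategy=JsonCssExtractionStrategy(schema),  # ...then extract from it
)
```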