Add Shadow DOM flattening and reorder js_code execution pipeline

- Add `flatten_shadow_dom` option to CrawlerRunConfig that serializes
  shadow DOM content into the light DOM before HTML capture. Uses a
  recursive serializer that resolves <slot> projections and strips
  only shadow-scoped <style> tags. Also injects an init script to
  force-open closed shadow roots via attachShadow patching.

- Move `js_code` execution to after `wait_for` + `delay_before_return_html`
  so user scripts run on the fully-hydrated page. Add `js_code_before_wait`
  for the less common case of triggering loading before waiting.

- Add JS snippet (flatten_shadow_dom.js), integration test, example,
  and documentation across all relevant doc files.
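The serializer logic in the first bullet can be sketched in plain Python on a toy node tree. This is a hypothetical `Node` class illustrating the idea — splice each shadow root's children into its host, resolve `<slot>`s to the host's light-DOM children, and drop shadow-scoped `<style>` tags — not the actual JS implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)
    shadow: list = field(default_factory=list)  # shadow-root children, if any

def flatten(node, slotted=None):
    """Serialize a node, splicing shadow content into the light DOM."""
    if node.tag == "#text":
        return node.text
    if node.tag == "slot":
        # Project the enclosing host's light-DOM children into the slot
        return "".join(flatten(c) for c in (slotted or []))
    if node.shadow:
        # Render the shadow tree in place of the light DOM; strip
        # shadow-scoped <style> tags on the way through.
        inner = "".join(
            flatten(c, slotted=node.children)
            for c in node.shadow
            if c.tag != "style"
        )
    else:
        inner = "".join(flatten(c, slotted=slotted) for c in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"

# A host element whose visible text lives entirely in its shadow root:
host = Node(
    "product-card",
    children=[Node("#text", text="Hydraulic Cylinder")],  # projected via <slot>
    shadow=[
        Node("style", children=[Node("#text", text=".x{color:red}")]),
        Node("h2", children=[Node("slot")]),
    ],
)
print(flatten(host))  # <product-card><h2>Hydraulic Cylinder</h2></product-card>
```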
This commit is contained in:
unclecode
2026-02-18 06:43:00 +00:00
parent 4fb02f8b50
commit 8576331d4e
11 changed files with 522 additions and 66 deletions

View File

@@ -0,0 +1,77 @@
"""
Shadow DOM Crawling Example
============================
Demonstrates how to use `flatten_shadow_dom=True` to extract content
hidden inside Shadow DOM trees on sites built with Web Components
(Stencil, Lit, Shoelace, Angular Elements, etc.).
Shadow DOM creates encapsulated sub-trees that are invisible to the
normal page serialization (page.content() / outerHTML). The
`flatten_shadow_dom` option walks these trees and produces a single
flat HTML document that includes all shadow content.
This example crawls a Bosch Rexroth product page where the product
description, technical specs, and downloads are rendered entirely
inside Shadow DOM by Stencil.js web components.
"""
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

URL = "https://store.boschrexroth.com/en/us/p/hydraulic-cylinder-r900999011"


async def main():
    browser_config = BrowserConfig(headless=True)

    # ── 1. Baseline: without shadow DOM flattening ──────────────────
    print("=" * 60)
    print("Without flatten_shadow_dom (baseline)")
    print("=" * 60)
    config = CrawlerRunConfig(
        wait_until="load",
        delay_before_return_html=3.0,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(URL, config=config)
        md = result.markdown.raw_markdown if result.markdown else ""
        print(f"Markdown length: {len(md)} chars")
        print(f"Has product description: {'mill type design' in md.lower()}")
        print(f"Has technical specs: {'CDH1' in md}")
        print(f"Has downloads section: {'Downloads' in md}")
        print()

    # ── 2. With shadow DOM flattening ───────────────────────────────
    print("=" * 60)
    print("With flatten_shadow_dom=True")
    print("=" * 60)
    config = CrawlerRunConfig(
        wait_until="load",
        delay_before_return_html=3.0,
        flatten_shadow_dom=True,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(URL, config=config)
        md = result.markdown.raw_markdown if result.markdown else ""
        print(f"Markdown length: {len(md)} chars")
        print(f"Has product description: {'mill type design' in md.lower()}")
        print(f"Has technical specs: {'CDH1' in md}")
        print(f"Has downloads section: {'Downloads' in md}")
        print()

        # Show the product content section
        idx = md.find("Product Description")
        if idx >= 0:
            print("── Extracted product content ──")
            print(md[idx:idx + 1200])


if __name__ == "__main__":
    asyncio.run(main())

View File

@@ -152,7 +152,8 @@ Use these for controlling whether you read or write from a local content cache.
| **Parameter** | **Type / Default** | **What It Does** |
|----------------------------|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| **`js_code`** | `str or list[str]` (None) | JavaScript to run **after** `wait_for` and `delay_before_return_html`, on the fully-loaded page. E.g. `"document.querySelector('button')?.click();"`. |
| **`js_code_before_wait`** | `str or list[str]` (None) | JavaScript to run **before** `wait_for`. Use for triggering loading that `wait_for` then checks (e.g. clicking a tab, then waiting for its content). |
| **`c4a_script`** | `str or list[str]` (None) | C4A script that compiles to JavaScript. Alternative to writing raw JS. |
| **`js_only`** | `bool` (False) | If `True`, indicates we're reusing an existing session and only applying JS. No full reload. |
| **`ignore_body_visibility`** | `bool` (True) | Skip checking if `<body>` is visible. Usually best to keep `True`. |
@@ -160,6 +161,7 @@ Use these for controlling whether you read or write from a local content cache.
| **`scroll_delay`** | `float` (0.2) | Delay between scroll steps if `scan_full_page=True`. |
| **`max_scroll_steps`** | `int or None` (None) | Maximum number of scroll steps during full page scan. If None, scrolls until entire page is loaded. |
| **`process_iframes`** | `bool` (False) | Inlines iframe content for single-page extraction. |
| **`flatten_shadow_dom`** | `bool` (False) | Flattens Shadow DOM content into the light DOM before HTML capture. Resolves slots, strips shadow-scoped styles, and force-opens closed shadow roots. Essential for sites built with Web Components (Stencil, Lit, Shoelace, etc.). |
| **`remove_overlay_elements`** | `bool` (False) | Removes potential modals/popups blocking the main content. |
| **`remove_consent_popups`** | `bool` (False) | Removes GDPR/cookie consent popups from known CMP providers (OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Sourcepoint, FundingChoices, etc.). Tries clicking "Accept All" first, then falls back to DOM removal. |
| **`simulate_user`** | `bool` (False) | Simulate user interactions (mouse movements) to avoid bot detection. |
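Several of the parameters above combine naturally when crawling a heavy client-side page. A sketch of one such config (values are illustrative):

```python
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    scan_full_page=True,          # auto-scroll to trigger lazy loading
    scroll_delay=0.3,             # pause between scroll steps
    process_iframes=True,         # inline iframe content
    flatten_shadow_dom=True,      # serialize shadow trees into the light DOM
    remove_overlay_elements=True, # strip blocking modals/popups
)
```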

View File

@@ -1781,12 +1781,14 @@ run_cfg = CrawlerRunConfig(
### D) **Page Interaction**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------------------|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| **`js_code`** | `str or list[str]` (None) | JavaScript to run **after** `wait_for` and `delay_before_return_html`, on the fully-loaded page. E.g. `"document.querySelector('button')?.click();"`. |
| **`js_code_before_wait`** | `str or list[str]` (None) | JavaScript to run **before** `wait_for`. Use for triggering loading that `wait_for` then checks. |
| **`js_only`** | `bool` (False) | If `True`, indicates we're reusing an existing session and only applying JS. No full reload. |
| **`ignore_body_visibility`** | `bool` (True) | Skip checking if `<body>` is visible. Usually best to keep `True`. |
| **`scan_full_page`** | `bool` (False) | If `True`, auto-scroll the page to load dynamic content (infinite scroll). |
| **`scroll_delay`** | `float` (0.2) | Delay between scroll steps if `scan_full_page=True`. |
| **`process_iframes`** | `bool` (False) | Inlines iframe content for single-page extraction. |
| **`flatten_shadow_dom`** | `bool` (False) | Flattens Shadow DOM content into the light DOM before HTML capture. Resolves slots, strips shadow-scoped styles, and force-opens closed shadow roots. Essential for sites built with Web Components. |
| **`remove_overlay_elements`** | `bool` (False) | Removes potential modals/popups blocking the main content. |
| **`remove_consent_popups`** | `bool` (False) | Removes GDPR/cookie consent popups from known CMP providers (OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Sourcepoint, FundingChoices, etc.). Tries clicking "Accept All" first, then falls back to DOM removal. |
| **`simulate_user`** | `bool` (False) | Simulate user interactions (mouse movements) to avoid bot detection. |
@@ -2813,6 +2815,46 @@ async def main():
if __name__ == "__main__":
asyncio.run(main())
```
## 3.1 Flattening Shadow DOM
Sites built with **Web Components** (Stencil, Lit, Shoelace, Angular Elements, etc.) render content inside Shadow DOM — an encapsulated sub-tree invisible to `page.content()`. Set `flatten_shadow_dom=True` to extract it:
```python
config = CrawlerRunConfig(
    flatten_shadow_dom=True,
    wait_until="load",
    delay_before_return_html=3.0,  # give components time to hydrate
)
```
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def main():
    config = CrawlerRunConfig(
        flatten_shadow_dom=True,
        wait_until="load",
        delay_before_return_html=3.0,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://store.boschrexroth.com/en/us/p/hydraulic-cylinder-r900999011",
            config=config,
        )
        # Without flatten_shadow_dom: ~1 KB markdown (breadcrumbs only)
        # With flatten_shadow_dom: ~33 KB (product description, specs, downloads)
        print(len(result.markdown.raw_markdown))


if __name__ == "__main__":
    asyncio.run(main())
```
When enabled, Crawl4AI also injects an init script that force-opens closed shadow roots. The flattener resolves `<slot>` projections and strips shadow-scoped `<style>` tags, producing clean HTML for the downstream scraping/markdown pipeline.
**Execution order**: `flatten_shadow_dom` runs right before HTML capture, after all waits and JS execution:
```
js_code_before_wait → wait_for → delay → js_code → flatten_shadow_dom → page capture
```
For a full runnable example, see [`shadow_dom_crawling.py`](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/shadow_dom_crawling.py).
## 4. Structured Extraction Examples
### 4.1 Pattern-Based with `JsonCssExtractionStrategy`
```python

View File

@@ -255,16 +255,22 @@ class CrawlerRunConfig:
- Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
- Defaults to `CacheMode.BYPASS`.
6.**`js_code`**, **`js_code_before_wait`**, & **`c4a_script`**:
- `js_code`: JavaScript to run **after** `wait_for` completes — on the fully-loaded page.
- `js_code_before_wait`: JavaScript to run **before** `wait_for` — for triggering loading that `wait_for` then checks.
- `c4a_script`: C4A script that compiles to JavaScript.
- Great for "Load More" buttons or user interactions.
7.**`wait_for`**:
- A CSS or JS expression to wait for before extracting content.
- Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.
8.**`flatten_shadow_dom`**:
- If `True`, flattens Shadow DOM content into the light DOM before HTML capture.
- Essential for sites built with Web Components (Stencil, Lit, Shoelace, etc.).
- Also force-opens closed shadow roots. See [Flattening Shadow DOM](content-selection.md#31-flattening-shadow-dom).
9.**`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
- If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.
- The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string).
- Use `force_viewport_screenshot=True` to capture only the visible viewport instead of the full page. This is faster and produces smaller images when you don't need a full-page screenshot.
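A config fragment requesting all three artifact types at once (the result fields are those described above):

```python
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    screenshot=True,      # -> result.screenshot (base64 string)
    pdf=True,             # -> result.pdf (bytes)
    capture_mhtml=True,   # -> result.mhtml (string)
)
```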

View File

@@ -183,6 +183,55 @@ if __name__ == "__main__":
---
## 3.1 Flattening Shadow DOM
Sites built with **Web Components** (Stencil, Lit, Shoelace, Angular Elements, etc.) render content inside [Shadow DOM](https://developer.mozilla.org/en-US/docs/Web/API/Web_components/Using_shadow_DOM) — an encapsulated sub-tree that is invisible to normal page serialization. The browser renders it on screen, but `page.content()` never includes it.
Set `flatten_shadow_dom=True` to walk all shadow trees, resolve `<slot>` projections, and produce a single flat HTML document:
```python
config = CrawlerRunConfig(
    # Flatten shadow DOM into the main document
    flatten_shadow_dom=True,
    # Give web components time to hydrate
    wait_until="load",
    delay_before_return_html=3.0,
)
```
**Full example** — crawling a product page where specs live inside shadow roots:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def main():
    config = CrawlerRunConfig(
        flatten_shadow_dom=True,
        wait_until="load",
        delay_before_return_html=3.0,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://store.boschrexroth.com/en/us/p/hydraulic-cylinder-r900999011",
            config=config,
        )
        # Without flatten_shadow_dom: ~1 KB of markdown (breadcrumbs only)
        # With flatten_shadow_dom: ~33 KB (full product specs, downloads, etc.)
        print(len(result.markdown.raw_markdown))


if __name__ == "__main__":
    asyncio.run(main())
```
When `flatten_shadow_dom=True` is set, Crawl4AI also injects an init script that force-opens **closed** shadow roots (by patching `Element.prototype.attachShadow`), so even components that use `mode: 'closed'` become accessible.
> **Tip**: Web components need JavaScript to run before they render content (a process called *hydration*). Use `wait_until="load"` and a `delay_before_return_html` of 2-5 seconds to ensure components are fully hydrated before flattening.
For a complete runnable example, see [`shadow_dom_crawling.py`](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/shadow_dom_crawling.py).
---
## 4. Structured Extraction Examples
You can combine content selection with a more advanced extraction strategy. For instance, a **CSS-based** or **LLM-based** extraction strategy can run on the filtered HTML.

View File

@@ -15,8 +15,9 @@ Below is a quick overview of how to do it.
### Basic Execution
**`js_code`** in **`CrawlerRunConfig`** accepts either a single JS string or a list of JS snippets. It runs **after** `wait_for` and `delay_before_return_html` — so the page is fully loaded when your code executes.
**Example**: We'll scroll to the bottom of the page, then optionally click a "Load More" button.
```python
import asyncio
@@ -55,10 +56,36 @@ if __name__ == "__main__":
```
**Relevant `CrawlerRunConfig` params**:
- **`js_code`**: JavaScript to run **after** `wait_for` and `delay_before_return_html` complete. Runs on the fully-loaded page.
- **`js_code_before_wait`**: JavaScript to run **before** `wait_for`. Use when you need to trigger loading that `wait_for` then checks.
- **`js_only`**: If set to `True` on subsequent calls, indicates we're continuing an existing session without a new full navigation.
- **`session_id`**: If you want to keep the same page across multiple calls, specify an ID.
### Execution Order
Understanding when your JavaScript runs relative to other pipeline steps:
```
1. Page navigation (page.goto)
2. js_code_before_wait ← triggers loading / clicks tabs
3. wait_for ← waits for content to appear
4. delay_before_return_html ← extra safety margin
5. js_code ← runs on the fully-loaded page
6. flatten_shadow_dom ← if enabled
7. page.content() ← HTML capture
```
If you need JS to trigger something and then wait for the result, use `js_code_before_wait` + `wait_for`:
```python
config = CrawlerRunConfig(
    # Click a tab first
    js_code_before_wait="document.querySelector('#specs-tab')?.click();",
    # Then wait for the tab content to appear
    wait_for="css:#specs-panel .content",
)
```
---
## 2. Wait Conditions
@@ -317,35 +344,55 @@ When done, check `result.extracted_content` for the JSON.
---
## 7. Shadow DOM Flattening
Sites built with **Web Components** (Stencil, Lit, Shoelace, etc.) render content inside Shadow DOM — an encapsulated sub-tree that is invisible to normal page serialization. Set `flatten_shadow_dom=True` to extract it:
```python
config = CrawlerRunConfig(
    flatten_shadow_dom=True,
    wait_until="load",
    delay_before_return_html=3.0,  # give components time to hydrate
)
```
This walks all shadow trees, resolves `<slot>` projections, and produces flat HTML. It also force-opens closed shadow roots via an init script. For details and a full example, see [Flattening Shadow DOM](content-selection.md#31-flattening-shadow-dom) and [`shadow_dom_crawling.py`](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/shadow_dom_crawling.py).
---
## 8. Relevant `CrawlerRunConfig` Parameters
Below are the key interaction-related parameters in `CrawlerRunConfig`. For a full list, see [Configuration Parameters](../api/parameters.md).
- **`js_code`**: JavaScript to run after `wait_for` + `delay_before_return_html`, on the fully-loaded page.
- **`js_code_before_wait`**: JavaScript to run before `wait_for`. For triggering loading that `wait_for` then checks.
- **`js_only`**: If `True`, no new page navigation—only JS in the existing session.
- **`wait_for`**: CSS (`"css:..."`) or JS (`"js:..."`) expression to wait for.
- **`session_id`**: Reuse the same page across calls.
- **`cache_mode`**: Whether to read/write from the cache or bypass.
- **`flatten_shadow_dom`**: Flatten Shadow DOM content into the light DOM before capture.
- **`process_iframes`**: Inline iframe content into the main document.
- **`remove_overlay_elements`**: Remove certain popups automatically.
- **`remove_consent_popups`**: Remove GDPR/cookie consent popups from known CMP providers (OneTrust, Cookiebot, Didomi, etc.).
- **`simulate_user`, `override_navigator`, `magic`**: Anti-bot or "human-like" interactions.
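For instance, a two-call session sketch using `session_id` + `js_only` from the list above (the session id, selector, and JS are illustrative):

```python
from crawl4ai import CrawlerRunConfig

# First call: full navigation, keep the page alive under a session id
first = CrawlerRunConfig(
    session_id="product-session",
    wait_for="css:.loaded",
)

# Second call: same live page, JS only — no re-navigation
second = CrawlerRunConfig(
    session_id="product-session",
    js_only=True,
    js_code="document.querySelector('.load-more')?.click();",
)
```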
---
## 9. Conclusion
Crawl4AI's **page interaction** features let you:
1. **Execute JavaScript** for scrolling, clicks, or form filling.
2. **Wait** for CSS or custom JS conditions before capturing data.
3. **Handle** multi-step flows (like “Load More”) with partial reloads or persistent sessions.
4. **Flatten Shadow DOM** on Web Component sites to extract hidden content.
5. Combine with **structured extraction** for dynamic sites.
With these tools, you can scrape modern, interactive webpages confidently. For advanced hooking, user simulation, or in-depth config, check the [API reference](../api/parameters.md) or related advanced docs. Happy scripting!
---
## 10. Virtual Scrolling
For sites that use **virtual scrolling** (where content is replaced rather than appended as you scroll, like Twitter or Instagram), Crawl4AI provides a dedicated `VirtualScrollConfig`: