Add Shadow DOM flattening and reorder js_code execution pipeline

- Add `flatten_shadow_dom` option to CrawlerRunConfig that serializes
  shadow DOM content into the light DOM before HTML capture. Uses a
  recursive serializer that resolves <slot> projections and strips
  only shadow-scoped <style> tags. Also injects an init script to
  force-open closed shadow roots via attachShadow patching.

- Move `js_code` execution to after `wait_for` + `delay_before_return_html`
  so user scripts run on the fully-hydrated page. Add `js_code_before_wait`
  for the less common case of triggering loading before waiting.

- Add JS snippet (flatten_shadow_dom.js), integration test, example,
  and documentation across all relevant doc files.
unclecode
2026-02-18 06:43:00 +00:00
parent 4fb02f8b50
commit 8576331d4e
11 changed files with 522 additions and 66 deletions

View File

@@ -152,7 +152,8 @@ Use these for controlling whether you read or write from a local content cache.
| **Parameter** | **Type / Default** | **What It Does** |
|----------------------------|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| **`js_code`** | `str or list[str]` (None) | JavaScript to run after load. E.g. `"document.querySelector('button')?.click();"`. |
| **`js_code`** | `str or list[str]` (None) | JavaScript to run **after** `wait_for` and `delay_before_return_html`, on the fully-loaded page. E.g. `"document.querySelector('button')?.click();"`. |
| **`js_code_before_wait`** | `str or list[str]` (None) | JavaScript to run **before** `wait_for`. Use for triggering loading that `wait_for` then checks (e.g. clicking a tab, then waiting for its content). |
| **`c4a_script`** | `str or list[str]` (None) | C4A script that compiles to JavaScript. Alternative to writing raw JS. |
| **`js_only`** | `bool` (False) | If `True`, indicates we're reusing an existing session and only applying JS. No full reload. |
| **`ignore_body_visibility`** | `bool` (True) | Skip checking if `<body>` is visible. Usually best to keep `True`. |
@@ -160,6 +161,7 @@ Use these for controlling whether you read or write from a local content cache.
| **`scroll_delay`** | `float` (0.2) | Delay between scroll steps if `scan_full_page=True`. |
| **`max_scroll_steps`** | `int or None` (None) | Maximum number of scroll steps during full page scan. If None, scrolls until entire page is loaded. |
| **`process_iframes`** | `bool` (False) | Inlines iframe content for single-page extraction. |
| **`flatten_shadow_dom`** | `bool` (False) | Flattens Shadow DOM content into the light DOM before HTML capture. Resolves slots, strips shadow-scoped styles, and force-opens closed shadow roots. Essential for sites built with Web Components (Stencil, Lit, Shoelace, etc.). |
| **`remove_overlay_elements`** | `bool` (False) | Removes potential modals/popups blocking the main content. |
| **`remove_consent_popups`** | `bool` (False) | Removes GDPR/cookie consent popups from known CMP providers (OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Sourcepoint, FundingChoices, etc.). Tries clicking "Accept All" first, then falls back to DOM removal. |
| **`simulate_user`** | `bool` (False) | Simulate user interactions (mouse movements) to avoid bot detection. |
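Combining the two options from the rows above (the tab and panel selectors here are hypothetical, not from the source) looks like this:

```python
config = CrawlerRunConfig(
    # Before waiting: click the tab that triggers lazy loading
    js_code_before_wait="document.querySelector('#specs-tab')?.click();",
    # Then wait until the tab's content has appeared
    wait_for="css:#specs-panel .content",
    # After the wait: interact with the fully-loaded page
    js_code="window.scrollTo(0, document.body.scrollHeight);",
)
```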

View File

@@ -1781,12 +1781,14 @@ run_cfg = CrawlerRunConfig(
### D) **Page Interaction**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------------------|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| **`js_code`** | `str or list[str]` (None) | JavaScript to run after load. E.g. `"document.querySelector('button')?.click();"`. |
| **`js_only`** | `bool` (False) | If `True`, indicates were reusing an existing session and only applying JS. No full reload. |
| **`js_code`** | `str or list[str]` (None) | JavaScript to run **after** `wait_for` and `delay_before_return_html`, on the fully-loaded page. E.g. `"document.querySelector('button')?.click();"`. |
| **`js_code_before_wait`** | `str or list[str]` (None) | JavaScript to run **before** `wait_for`. Use for triggering loading that `wait_for` then checks. |
| **`js_only`** | `bool` (False) | If `True`, indicates we're reusing an existing session and only applying JS. No full reload. |
| **`ignore_body_visibility`** | `bool` (True) | Skip checking if `<body>` is visible. Usually best to keep `True`. |
| **`scan_full_page`** | `bool` (False) | If `True`, auto-scroll the page to load dynamic content (infinite scroll). |
| **`scroll_delay`** | `float` (0.2) | Delay between scroll steps if `scan_full_page=True`. |
| **`process_iframes`** | `bool` (False) | Inlines iframe content for single-page extraction. |
| **`flatten_shadow_dom`** | `bool` (False) | Flattens Shadow DOM content into the light DOM before HTML capture. Resolves slots, strips shadow-scoped styles, and force-opens closed shadow roots. Essential for sites built with Web Components. |
| **`remove_overlay_elements`** | `bool` (False) | Removes potential modals/popups blocking the main content. |
| **`remove_consent_popups`** | `bool` (False) | Removes GDPR/cookie consent popups from known CMP providers (OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Sourcepoint, FundingChoices, etc.). Tries clicking "Accept All" first, then falls back to DOM removal. |
| **`simulate_user`** | `bool` (False) | Simulate user interactions (mouse movements) to avoid bot detection. |
@@ -2813,6 +2815,46 @@ async def main():
if __name__ == "__main__":
asyncio.run(main())
```
## 3.1 Flattening Shadow DOM
Sites built with **Web Components** (Stencil, Lit, Shoelace, Angular Elements, etc.) render content inside Shadow DOM — an encapsulated sub-tree invisible to `page.content()`. Set `flatten_shadow_dom=True` to extract it:
```python
config = CrawlerRunConfig(
flatten_shadow_dom=True,
wait_until="load",
delay_before_return_html=3.0, # give components time to hydrate
)
```
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main():
config = CrawlerRunConfig(
flatten_shadow_dom=True,
wait_until="load",
delay_before_return_html=3.0,
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://store.boschrexroth.com/en/us/p/hydraulic-cylinder-r900999011",
config=config,
)
# Without flatten_shadow_dom: ~1 KB markdown (breadcrumbs only)
# With flatten_shadow_dom: ~33 KB (product description, specs, downloads)
print(len(result.markdown.raw_markdown))
if __name__ == "__main__":
asyncio.run(main())
```
When enabled, Crawl4AI also injects an init script that force-opens closed shadow roots. The flattener resolves `<slot>` projections and strips shadow-scoped `<style>` tags, producing clean HTML for the downstream scraping/markdown pipeline.
**Execution order**: `flatten_shadow_dom` runs right before HTML capture, after all waits and JS execution:
```
js_code_before_wait → wait_for → delay → js_code → flatten_shadow_dom → page capture
```
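The same ordering can be sketched as a tiny Python model (a toy sketch, not the crawler's real internals) that makes it easy to see which steps are skipped when an option is unset:

```python
# Toy model of the interaction pipeline order shown above.
# This is NOT crawl4ai's implementation -- just a sketch that
# makes the step ordering explicit.
def pipeline_order(config: dict) -> list:
    steps = ["page.goto"]
    if config.get("js_code_before_wait"):
        steps.append("js_code_before_wait")      # triggers loading first
    if config.get("wait_for"):
        steps.append("wait_for")                 # then wait for content
    if config.get("delay_before_return_html"):
        steps.append("delay_before_return_html") # extra safety margin
    if config.get("js_code"):
        steps.append("js_code")                  # runs on the hydrated page
    if config.get("flatten_shadow_dom"):
        steps.append("flatten_shadow_dom")       # right before capture
    steps.append("page.content")
    return steps

order = pipeline_order({
    "js_code_before_wait": "document.querySelector('#tab')?.click();",
    "wait_for": "css:#panel",
    "delay_before_return_html": 3.0,
    "js_code": "window.scrollTo(0, document.body.scrollHeight);",
    "flatten_shadow_dom": True,
})
print(" -> ".join(order))
```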
For a full runnable example, see [`shadow_dom_crawling.py`](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/shadow_dom_crawling.py).
## 4. Structured Extraction Examples
### 4.1 Pattern-Based with `JsonCssExtractionStrategy`
```python

View File

@@ -255,16 +255,22 @@ class CrawlerRunConfig:
- Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
- Defaults to `CacheMode.BYPASS`.
6.**`js_code`** & **`c4a_script`**:
- `js_code`: A string or list of JavaScript strings to execute.
6.**`js_code`**, **`js_code_before_wait`**, & **`c4a_script`**:
- `js_code`: JavaScript to run **after** `wait_for` completes — on the fully-loaded page.
- `js_code_before_wait`: JavaScript to run **before** `wait_for` — for triggering loading that `wait_for` then checks.
- `c4a_script`: C4A script that compiles to JavaScript.
- Great for "Load More" buttons or user interactions.
- Great for "Load More" buttons or user interactions.
7.**`wait_for`**:
- A CSS or JS expression to wait for before extracting content.
- Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.
8.**`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
8.**`flatten_shadow_dom`**:
- If `True`, flattens Shadow DOM content into the light DOM before HTML capture.
- Essential for sites built with Web Components (Stencil, Lit, Shoelace, etc.).
- Also force-opens closed shadow roots. See [Flattening Shadow DOM](content-selection.md#31-flattening-shadow-dom).
9.**`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
- If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.
- The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string).
- Use `force_viewport_screenshot=True` to capture only the visible viewport instead of the full page. This is faster and produces smaller images when you don't need a full-page screenshot.
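A minimal sketch combining options 6, 7, 8, and 9 above (the selectors are illustrative placeholders):

```python
config = CrawlerRunConfig(
    js_code="document.querySelector('.load-more')?.click();",
    wait_for="css:.main-loaded",
    flatten_shadow_dom=True,   # flatten web-component content before capture
    screenshot=True,           # result.screenshot will hold a base64 image
)
```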

View File

@@ -183,6 +183,55 @@ if __name__ == "__main__":
---
## 3.1 Flattening Shadow DOM
Sites built with **Web Components** (Stencil, Lit, Shoelace, Angular Elements, etc.) render content inside [Shadow DOM](https://developer.mozilla.org/en-US/docs/Web/API/Web_components/Using_shadow_DOM) — an encapsulated sub-tree that is invisible to normal page serialization. The browser renders it on screen, but `page.content()` never includes it.
Set `flatten_shadow_dom=True` to walk all shadow trees, resolve `<slot>` projections, and produce a single flat HTML document:
```python
config = CrawlerRunConfig(
# Flatten shadow DOM into the main document
flatten_shadow_dom=True,
# Give web components time to hydrate
wait_until="load",
delay_before_return_html=3.0,
)
```
**Full example** — crawling a product page where specs live inside shadow roots:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main():
config = CrawlerRunConfig(
flatten_shadow_dom=True,
wait_until="load",
delay_before_return_html=3.0,
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://store.boschrexroth.com/en/us/p/hydraulic-cylinder-r900999011",
config=config,
)
# Without flatten_shadow_dom: ~1 KB of markdown (breadcrumbs only)
# With flatten_shadow_dom: ~33 KB (full product specs, downloads, etc.)
print(len(result.markdown.raw_markdown))
if __name__ == "__main__":
asyncio.run(main())
```
When `flatten_shadow_dom=True` is set, Crawl4AI also injects an init script that force-opens **closed** shadow roots (by patching `Element.prototype.attachShadow`), so even components that use `mode: 'closed'` become accessible.
> **Tip**: Web components need JavaScript to run before they render content (a process called *hydration*). Use `wait_until="load"` and a `delay_before_return_html` of 2–5 seconds to ensure components are fully hydrated before flattening.
For a complete runnable example, see [`shadow_dom_crawling.py`](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/shadow_dom_crawling.py).
---
## 4. Structured Extraction Examples
You can combine content selection with a more advanced extraction strategy. For instance, a **CSS-based** or **LLM-based** extraction strategy can run on the filtered HTML.

View File

@@ -15,8 +15,9 @@ Below is a quick overview of how to do it.
### Basic Execution
**`js_code`** in **`CrawlerRunConfig`** accepts either a single JS string or a list of JS snippets.
**Example**: Well scroll to the bottom of the page, then optionally click a “Load More” button.
**`js_code`** in **`CrawlerRunConfig`** accepts either a single JS string or a list of JS snippets. It runs **after** `wait_for` and `delay_before_return_html` — so the page is fully loaded when your code executes.
**Example**: We'll scroll to the bottom of the page, then optionally click a "Load More" button.
```python
import asyncio
@@ -55,10 +56,36 @@ if __name__ == "__main__":
```
**Relevant `CrawlerRunConfig` params**:
- **`js_code`**: A string or list of strings with JavaScript to run after the page loads.
- **`js_only`**: If set to `True` on subsequent calls, indicates were continuing an existing session without a new full navigation.
- **`js_code`**: JavaScript to run **after** `wait_for` and `delay_before_return_html` complete. Runs on the fully-loaded page.
- **`js_code_before_wait`**: JavaScript to run **before** `wait_for`. Use when you need to trigger loading that `wait_for` then checks.
- **`js_only`**: If set to `True` on subsequent calls, indicates we're continuing an existing session without a new full navigation.
- **`session_id`**: If you want to keep the same page across multiple calls, specify an ID.
### Execution Order
Understanding when your JavaScript runs relative to other pipeline steps:
```
1. Page navigation (page.goto)
2. js_code_before_wait ← triggers loading / clicks tabs
3. wait_for ← waits for content to appear
4. delay_before_return_html ← extra safety margin
5. js_code ← runs on the fully-loaded page
6. flatten_shadow_dom ← if enabled
7. page.content() ← HTML capture
```
If you need JS to trigger something and then wait for the result, use `js_code_before_wait` + `wait_for`:
```python
config = CrawlerRunConfig(
# Click a tab first
js_code_before_wait="document.querySelector('#specs-tab')?.click();",
# Then wait for the tab content to appear
wait_for="css:#specs-panel .content",
)
```
---
## 2. Wait Conditions
@@ -317,35 +344,55 @@ When done, check `result.extracted_content` for the JSON.
---
## 7. Relevant `CrawlerRunConfig` Parameters
## 7. Shadow DOM Flattening
Sites built with **Web Components** (Stencil, Lit, Shoelace, etc.) render content inside Shadow DOM — an encapsulated sub-tree that is invisible to normal page serialization. Set `flatten_shadow_dom=True` to extract it:
```python
config = CrawlerRunConfig(
flatten_shadow_dom=True,
wait_until="load",
delay_before_return_html=3.0, # give components time to hydrate
)
```
This walks all shadow trees, resolves `<slot>` projections, and produces flat HTML. It also force-opens closed shadow roots via an init script. For details and a full example, see [Flattening Shadow DOM](content-selection.md#31-flattening-shadow-dom) and [`shadow_dom_crawling.py`](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/shadow_dom_crawling.py).
---
## 8. Relevant `CrawlerRunConfig` Parameters
Below are the key interaction-related parameters in `CrawlerRunConfig`. For a full list, see [Configuration Parameters](../api/parameters.md).
- **`js_code`**: JavaScript to run after initial load.
- **`js_only`**: If `True`, no new page navigation—only JS in the existing session.
- **`wait_for`**: CSS (`"css:..."`) or JS (`"js:..."`) expression to wait for.
- **`session_id`**: Reuse the same page across calls.
- **`cache_mode`**: Whether to read/write from the cache or bypass.
- **`js_code`**: JavaScript to run after `wait_for` + `delay_before_return_html`, on the fully-loaded page.
- **`js_code_before_wait`**: JavaScript to run before `wait_for`. For triggering loading that `wait_for` then checks.
- **`js_only`**: If `True`, no new page navigation—only JS in the existing session.
- **`wait_for`**: CSS (`"css:..."`) or JS (`"js:..."`) expression to wait for.
- **`session_id`**: Reuse the same page across calls.
- **`cache_mode`**: Whether to read/write from the cache or bypass.
- **`flatten_shadow_dom`**: Flatten Shadow DOM content into the light DOM before capture.
- **`process_iframes`**: Inline iframe content into the main document.
- **`remove_overlay_elements`**: Remove certain popups automatically.
- **`remove_consent_popups`**: Remove GDPR/cookie consent popups from known CMP providers (OneTrust, Cookiebot, Didomi, etc.).
- **`simulate_user`, `override_navigator`, `magic`**: Anti-bot or "human-like" interactions.
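For instance, a follow-up call that reuses an existing session and only runs JS (the session id and selectors here are illustrative):

```python
followup = CrawlerRunConfig(
    session_id="product-page",  # same page as the first call
    js_only=True,               # no new navigation, just run JS
    js_code="document.querySelector('.load-more')?.click();",
    wait_for="css:.item:nth-of-type(20)",
)
```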
---
## 8. Conclusion
## 9. Conclusion
Crawl4AIs **page interaction** features let you:
Crawl4AI's **page interaction** features let you:
1. **Execute JavaScript** for scrolling, clicks, or form filling.
2. **Wait** for CSS or custom JS conditions before capturing data.
3. **Handle** multi-step flows (like “Load More”) with partial reloads or persistent sessions.
4. Combine with **structured extraction** for dynamic sites.
4. **Flatten Shadow DOM** on Web Component sites to extract hidden content.
5. Combine with **structured extraction** for dynamic sites.
With these tools, you can scrape modern, interactive webpages confidently. For advanced hooking, user simulation, or in-depth config, check the [API reference](../api/parameters.md) or related advanced docs. Happy scripting!
---
## 9. Virtual Scrolling
## 10. Virtual Scrolling
For sites that use **virtual scrolling** (where content is replaced rather than appended as you scroll, like Twitter or Instagram), Crawl4AI provides a dedicated `VirtualScrollConfig`: