Add Shadow DOM flattening and reorder js_code execution pipeline

- Add `flatten_shadow_dom` option to CrawlerRunConfig that serializes
  shadow DOM content into the light DOM before HTML capture. Uses a
  recursive serializer that resolves <slot> projections and strips
  only shadow-scoped <style> tags. Also injects an init script to
  force-open closed shadow roots via attachShadow patching.

- Move `js_code` execution to after `wait_for` + `delay_before_return_html`
  so user scripts run on the fully-hydrated page. Add `js_code_before_wait`
  for the less common case of triggering loading before waiting.

- Add JS snippet (flatten_shadow_dom.js), integration test, example,
  and documentation across all relevant doc files.
This commit is contained in:
unclecode
2026-02-18 06:43:00 +00:00
parent 4fb02f8b50
commit 8576331d4e
11 changed files with 522 additions and 66 deletions

View File

@@ -255,16 +255,22 @@ class CrawlerRunConfig:
- Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
- Defaults to `CacheMode.BYPASS`.
6.**`js_code`** & **`c4a_script`**:
- `js_code`: A string or list of JavaScript strings to execute.
6.**`js_code`**, **`js_code_before_wait`**, & **`c4a_script`**:
- `js_code`: JavaScript to run **after** `wait_for` completes — on the fully-loaded page.
- `js_code_before_wait`: JavaScript to run **before** `wait_for` — for triggering loading that `wait_for` then checks.
- `c4a_script`: C4A script that compiles to JavaScript.
- Great for "Load More" buttons or user interactions.
- Great for "Load More" buttons or user interactions.
7.**`wait_for`**:
- A CSS or JS expression to wait for before extracting content.
- Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.
8.**`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
8.**`flatten_shadow_dom`**:
- If `True`, flattens Shadow DOM content into the light DOM before HTML capture.
- Essential for sites built with Web Components (Stencil, Lit, Shoelace, etc.).
- Also force-opens closed shadow roots. See [Flattening Shadow DOM](content-selection.md#31-flattening-shadow-dom).
9.**`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
- If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.
- The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string).
- Use `force_viewport_screenshot=True` to capture only the visible viewport instead of the full page. This is faster and produces smaller images when you don't need a full-page screenshot.