Add remove_consent_popups flag and fix from_kwargs dict deserialization

Add CrawlerRunConfig.remove_consent_popups (bool, default False) that
targets GDPR/cookie consent popups from 70+ known CMP providers including
OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Usercentrics,
Sourcepoint, Google FundingChoices, and many more.

The JS strategy uses a 5-phase approach:
1. Click "Accept All" buttons (cleanest dismissal, sets cookies)
2. Try CMP JavaScript APIs (__tcfapi, Didomi, Cookiebot, Osano, Klaro)
3. Remove known CMP containers by selector (~120 selectors)
4. Handle iframe-based and shadow DOM CMPs
5. Restore body scroll and remove CMP body classes

Also fix from_kwargs() in CrawlerRunConfig and BrowserConfig to
auto-deserialize dict values using the existing from_serializable_dict()
infrastructure. Previously, strategy objects like markdown_generator
arriving as {"type": "DefaultMarkdownGenerator", "params": {...}} from
JSON APIs were passed through as raw dicts, causing crashes when the
crawler later called methods on them.
This commit is contained in:
unclecode
2026-02-11 12:46:47 +00:00
parent 44b8afb6dc
commit 3fc7730aaf
7 changed files with 792 additions and 6 deletions

View File

@@ -153,8 +153,10 @@ Some sites embed content in `<iframe>` tags. If you want that inline:
```python
config = CrawlerRunConfig(
# Merge iframe content into the final output
process_iframes=True,
remove_overlay_elements=True
process_iframes=True,
remove_overlay_elements=True,
# Remove GDPR/cookie consent popups (OneTrust, Cookiebot, etc.)
remove_consent_popups=True
)
```

View File

@@ -326,8 +326,9 @@ Below are the key interaction-related parameters in `CrawlerRunConfig`. For a fu
- **`wait_for`**: CSS (`"css:..."`) or JS (`"js:..."`) expression to wait for.
- **`session_id`**: Reuse the same page across calls.
- **`cache_mode`**: Whether to read/write from the cache or bypass.
- **`remove_overlay_elements`**: Remove certain popups automatically.
- **`simulate_user`, `override_navigator`, `magic`**: Anti-bot or “human-like” interactions.
- **`remove_overlay_elements`**: Remove certain popups automatically.
- **`remove_consent_popups`**: Remove GDPR/cookie consent popups from known CMP providers (OneTrust, Cookiebot, Didomi, etc.).
- **`simulate_user`, `override_navigator`, `magic`**: Anti-bot or "human-like" interactions.
---