Add remove_consent_popups flag and fix from_kwargs dict deserialization

Add CrawlerRunConfig.remove_consent_popups (bool, default False) that
targets GDPR/cookie consent popups from 70+ known CMP providers including
OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Usercentrics,
Sourcepoint, Google FundingChoices, and many more.

The JS strategy uses a 5-phase approach:
1. Click "Accept All" buttons (cleanest dismissal, sets cookies)
2. Try CMP JavaScript APIs (__tcfapi, Didomi, Cookiebot, Osano, Klaro)
3. Remove known CMP containers by selector (~120 selectors)
4. Handle iframe-based and shadow DOM CMPs
5. Restore body scroll and remove CMP body classes

Also fix from_kwargs() in CrawlerRunConfig and BrowserConfig to
auto-deserialize dict values using the existing from_serializable_dict()
infrastructure. Previously, strategy objects like markdown_generator
arriving as {"type": "DefaultMarkdownGenerator", "params": {...}} from
JSON APIs were passed through as raw dicts, causing crashes when the
crawler later called methods on them.
This commit is contained in:
unclecode
2026-02-11 12:46:47 +00:00
parent 44b8afb6dc
commit 3fc7730aaf
7 changed files with 792 additions and 6 deletions

View File

@@ -159,6 +159,7 @@ Use these for controlling whether you read or write from a local content cache.
| **`max_scroll_steps`** | `int or None` (None) | Maximum number of scroll steps during full page scan. If None, scrolls until entire page is loaded. |
| **`process_iframes`** | `bool` (False) | Inlines iframe content for single-page extraction. |
| **`remove_overlay_elements`** | `bool` (False) | Removes potential modals/popups blocking the main content. |
| **`remove_consent_popups`** | `bool` (False) | Removes GDPR/cookie consent popups from known CMP providers (OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Sourcepoint, FundingChoices, etc.). Tries clicking "Accept All" first, then falls back to DOM removal. |
| **`simulate_user`** | `bool` (False) | Simulate user interactions (mouse movements) to avoid bot detection. |
| **`override_navigator`** | `bool` (False) | Override `navigator` properties in JS for stealth. |
| **`magic`** | `bool` (False) | Automatic handling of popups/consent banners. Experimental. |

View File

@@ -1787,6 +1787,7 @@ run_cfg = CrawlerRunConfig(
| **`scroll_delay`** | `float` (0.2) | Delay between scroll steps if `scan_full_page=True`. |
| **`process_iframes`** | `bool` (False) | Inlines iframe content for single-page extraction. |
| **`remove_overlay_elements`** | `bool` (False) | Removes potential modals/popups blocking the main content. |
| **`remove_consent_popups`** | `bool` (False) | Removes GDPR/cookie consent popups from known CMP providers (OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Sourcepoint, FundingChoices, etc.). Tries clicking "Accept All" first, then falls back to DOM removal. |
| **`simulate_user`** | `bool` (False) | Simulate user interactions (mouse movements) to avoid bot detection. |
| **`override_navigator`** | `bool` (False) | Override `navigator` properties in JS for stealth. |
| **`magic`** | `bool` (False) | Automatic handling of popups/consent banners. Experimental. |
@@ -3344,8 +3345,9 @@ Below are the key interaction-related parameters in `CrawlerRunConfig`. For a fu
- **`wait_for`**: CSS (`"css:..."`) or JS (`"js:..."`) expression to wait for.
- **`session_id`**: Reuse the same page across calls.
- **`cache_mode`**: Whether to read/write from the cache or bypass.
- **`remove_overlay_elements`**: Remove certain popups automatically.
- **`simulate_user`, `override_navigator`, `magic`**: Anti-bot or “human-like” interactions.
- **`remove_overlay_elements`**: Remove certain popups automatically.
- **`remove_consent_popups`**: Remove GDPR/cookie consent popups from known CMP providers (OneTrust, Cookiebot, Didomi, etc.).
- **`simulate_user`, `override_navigator`, `magic`**: Anti-bot or "human-like" interactions.
## 8. Conclusion
1. **Execute JavaScript** for scrolling, clicks, or form filling.
2. **Wait** for CSS or custom JS conditions before capturing data.

View File

@@ -153,8 +153,10 @@ Some sites embed content in `<iframe>` tags. If you want that inline:
```python
config = CrawlerRunConfig(
# Merge iframe content into the final output
process_iframes=True,
remove_overlay_elements=True
process_iframes=True,
remove_overlay_elements=True,
# Remove GDPR/cookie consent popups (OneTrust, Cookiebot, etc.)
remove_consent_popups=True
)
```

View File

@@ -326,8 +326,9 @@ Below are the key interaction-related parameters in `CrawlerRunConfig`. For a fu
- **`wait_for`**: CSS (`"css:..."`) or JS (`"js:..."`) expression to wait for.
- **`session_id`**: Reuse the same page across calls.
- **`cache_mode`**: Whether to read/write from the cache or bypass.
- **`remove_overlay_elements`**: Remove certain popups automatically.
- **`simulate_user`, `override_navigator`, `magic`**: Anti-bot or “human-like” interactions.
- **`remove_overlay_elements`**: Remove certain popups automatically.
- **`remove_consent_popups`**: Remove GDPR/cookie consent popups from known CMP providers (OneTrust, Cookiebot, Didomi, etc.).
- **`simulate_user`, `override_navigator`, `magic`**: Anti-bot or "human-like" interactions.
---