Add remove_consent_popups flag and fix from_kwargs dict deserialization
Add CrawlerRunConfig.remove_consent_popups (bool, default False) that
targets GDPR/cookie consent popups from 70+ known CMP providers including
OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Usercentrics,
Sourcepoint, Google FundingChoices, and many more.
The JS strategy uses a 5-phase approach:
1. Click "Accept All" buttons (cleanest dismissal, sets cookies)
2. Try CMP JavaScript APIs (__tcfapi, Didomi, Cookiebot, Osano, Klaro)
3. Remove known CMP containers by selector (~120 selectors)
4. Handle iframe-based and shadow DOM CMPs
5. Restore body scroll and remove CMP body classes
Also fix from_kwargs() in CrawlerRunConfig and BrowserConfig to
auto-deserialize dict values using the existing from_serializable_dict()
infrastructure. Previously, strategy objects like markdown_generator
arriving as {"type": "DefaultMarkdownGenerator", "params": {...}} from
JSON APIs were passed through as raw dicts, causing crashes when the
crawler later called methods on them.
This commit is contained in:
@@ -159,6 +159,7 @@ Use these for controlling whether you read or write from a local content cache.
|
||||
| **`max_scroll_steps`** | `int or None` (None) | Maximum number of scroll steps during full page scan. If None, scrolls until entire page is loaded. |
|
||||
| **`process_iframes`** | `bool` (False) | Inlines iframe content for single-page extraction. |
|
||||
| **`remove_overlay_elements`** | `bool` (False) | Removes potential modals/popups blocking the main content. |
|
||||
| **`remove_consent_popups`** | `bool` (False) | Removes GDPR/cookie consent popups from known CMP providers (OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Sourcepoint, FundingChoices, etc.). Tries clicking "Accept All" first, then falls back to DOM removal. |
|
||||
| **`simulate_user`** | `bool` (False) | Simulate user interactions (mouse movements) to avoid bot detection. |
|
||||
| **`override_navigator`** | `bool` (False) | Override `navigator` properties in JS for stealth. |
|
||||
| **`magic`** | `bool` (False) | Automatic handling of popups/consent banners. Experimental. |
|
||||
|
||||
@@ -1787,6 +1787,7 @@ run_cfg = CrawlerRunConfig(
|
||||
| **`scroll_delay`** | `float` (0.2) | Delay between scroll steps if `scan_full_page=True`. |
|
||||
| **`process_iframes`** | `bool` (False) | Inlines iframe content for single-page extraction. |
|
||||
| **`remove_overlay_elements`** | `bool` (False) | Removes potential modals/popups blocking the main content. |
|
||||
| **`remove_consent_popups`** | `bool` (False) | Removes GDPR/cookie consent popups from known CMP providers (OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Sourcepoint, FundingChoices, etc.). Tries clicking "Accept All" first, then falls back to DOM removal. |
|
||||
| **`simulate_user`** | `bool` (False) | Simulate user interactions (mouse movements) to avoid bot detection. |
|
||||
| **`override_navigator`** | `bool` (False) | Override `navigator` properties in JS for stealth. |
|
||||
| **`magic`** | `bool` (False) | Automatic handling of popups/consent banners. Experimental. |
|
||||
@@ -3344,8 +3345,9 @@ Below are the key interaction-related parameters in `CrawlerRunConfig`. For a fu
|
||||
- **`wait_for`**: CSS (`"css:..."`) or JS (`"js:..."`) expression to wait for.
|
||||
- **`session_id`**: Reuse the same page across calls.
|
||||
- **`cache_mode`**: Whether to read/write from the cache or bypass.
|
||||
- **`remove_overlay_elements`**: Remove certain popups automatically.
|
||||
- **`simulate_user`, `override_navigator`, `magic`**: Anti-bot or “human-like” interactions.
|
||||
- **`remove_overlay_elements`**: Remove certain popups automatically.
|
||||
- **`remove_consent_popups`**: Remove GDPR/cookie consent popups from known CMP providers (OneTrust, Cookiebot, Didomi, etc.).
|
||||
- **`simulate_user`, `override_navigator`, `magic`**: Anti-bot or "human-like" interactions.
|
||||
## 8. Conclusion
|
||||
1. **Execute JavaScript** for scrolling, clicks, or form filling.
|
||||
2. **Wait** for CSS or custom JS conditions before capturing data.
|
||||
|
||||
@@ -153,8 +153,10 @@ Some sites embed content in `<iframe>` tags. If you want that inline:
|
||||
```python
|
||||
config = CrawlerRunConfig(
|
||||
# Merge iframe content into the final output
|
||||
process_iframes=True,
|
||||
remove_overlay_elements=True
|
||||
process_iframes=True,
|
||||
remove_overlay_elements=True,
|
||||
# Remove GDPR/cookie consent popups (OneTrust, Cookiebot, etc.)
|
||||
remove_consent_popups=True
|
||||
)
|
||||
```
|
||||
|
||||
|
||||
@@ -326,8 +326,9 @@ Below are the key interaction-related parameters in `CrawlerRunConfig`. For a fu
|
||||
- **`wait_for`**: CSS (`"css:..."`) or JS (`"js:..."`) expression to wait for.
|
||||
- **`session_id`**: Reuse the same page across calls.
|
||||
- **`cache_mode`**: Whether to read/write from the cache or bypass.
|
||||
- **`remove_overlay_elements`**: Remove certain popups automatically.
|
||||
- **`simulate_user`, `override_navigator`, `magic`**: Anti-bot or “human-like” interactions.
|
||||
- **`remove_overlay_elements`**: Remove certain popups automatically.
|
||||
- **`remove_consent_popups`**: Remove GDPR/cookie consent popups from known CMP providers (OneTrust, Cookiebot, Didomi, etc.).
|
||||
- **`simulate_user`, `override_navigator`, `magic`**: Anti-bot or "human-like" interactions.
|
||||
|
||||
---
|
||||
|
||||
|
||||
Reference in New Issue
Block a user