Add remove_consent_popups flag and fix from_kwargs dict deserialization
Add CrawlerRunConfig.remove_consent_popups (bool, default False) that
targets GDPR/cookie consent popups from 70+ known CMP providers including
OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Usercentrics,
Sourcepoint, Google FundingChoices, and many more.
The JS strategy uses a 5-phase approach:
1. Click "Accept All" buttons (cleanest dismissal, sets cookies)
2. Try CMP JavaScript APIs (__tcfapi, Didomi, Cookiebot, Osano, Klaro)
3. Remove known CMP containers by selector (~120 selectors)
4. Handle iframe-based and shadow DOM CMPs
5. Restore body scroll and remove CMP body classes
Also fix from_kwargs() in CrawlerRunConfig and BrowserConfig to
auto-deserialize dict values using the existing from_serializable_dict()
infrastructure. Previously, strategy objects like markdown_generator
arriving as {"type": "DefaultMarkdownGenerator", "params": {...}} from
JSON APIs were passed through as raw dicts, causing crashes when the
crawler later called methods on them.
This commit is contained in:
@@ -153,8 +153,10 @@ Some sites embed content in `<iframe>` tags. If you want that inline:
|
||||
```python
|
||||
config = CrawlerRunConfig(
|
||||
# Merge iframe content into the final output
|
||||
process_iframes=True,
|
||||
remove_overlay_elements=True
|
||||
process_iframes=True,
|
||||
remove_overlay_elements=True,
|
||||
# Remove GDPR/cookie consent popups (OneTrust, Cookiebot, etc.)
|
||||
remove_consent_popups=True
|
||||
)
|
||||
```
|
||||
|
||||
|
||||
Reference in New Issue
Block a user