Replace hardcoded parameter listings in BrowserConfig.from_kwargs() and
CrawlerRunConfig.from_kwargs() with a generic approach that filters
input kwargs to valid __init__ params and passes them through. This:
- Makes set_defaults() work with from_kwargs() (previously ignored)
- Fixes default mismatches (word_count_threshold was 200 vs __init__=1,
markdown_generator was None vs __init__=DefaultMarkdownGenerator())
- Eliminates ~160 lines of duplicated default values
- Auto-supports new params without updating from_kwargs
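The generic pass-through can be sketched as follows (a minimal sketch; `ExampleConfig` and its fields stand in for BrowserConfig/CrawlerRunConfig and are not the library's real names):

```python
import inspect

class ExampleConfig:
    """Stand-in for BrowserConfig / CrawlerRunConfig (illustrative fields)."""
    def __init__(self, headless=True, viewport_width=1080):
        self.headless = headless
        self.viewport_width = viewport_width

    @classmethod
    def from_kwargs(cls, kwargs: dict) -> "ExampleConfig":
        # Accept only names that __init__ actually declares; ignore the rest.
        # Defaults come from __init__ itself, so they can never drift.
        valid = set(inspect.signature(cls.__init__).parameters) - {"self"}
        return cls(**{k: v for k, v in kwargs.items() if k in valid})

cfg = ExampleConfig.from_kwargs({"headless": False, "unknown_param": 1})
```

Because the filter is derived from the signature at call time, any parameter added to `__init__` later is picked up automatically.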
Add CrawlerRunConfig.remove_consent_popups (bool, default False), which
dismisses GDPR/cookie consent popups from 70+ known CMP providers,
including OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Usercentrics,
Sourcepoint, Google FundingChoices, and many more.
The JS strategy uses a 5-phase approach:
1. Click "Accept All" buttons (cleanest dismissal, sets cookies)
2. Try CMP JavaScript APIs (__tcfapi, Didomi, Cookiebot, Osano, Klaro)
3. Remove known CMP containers by selector (~120 selectors)
4. Handle iframe-based and shadow DOM CMPs
5. Restore body scroll and remove CMP body classes
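The five phases above can be sketched as an ordered, best-effort pipeline (a Python sketch with hypothetical phase names; the real strategy runs JavaScript inside the page):

```python
# Hypothetical phase names mirroring the ordering above; each phase is
# executed by a callable that does its best and reports whether it acted.
CONSENT_PHASES = [
    "click_accept_buttons",     # phase 1: cleanest, sets consent cookies
    "call_cmp_apis",            # phase 2: __tcfapi / Didomi / Cookiebot ...
    "remove_known_containers",  # phase 3: known CMP selectors
    "handle_iframes_shadow",    # phase 4: iframe / shadow DOM CMPs
    "restore_page_state",       # phase 5: body scroll + CMP body classes
]

def run_consent_phases(execute_phase):
    """Run every phase in order; a failing phase never blocks later ones."""
    acted = []
    for phase in CONSENT_PHASES:
        try:
            if execute_phase(phase):
                acted.append(phase)
        except Exception:
            continue  # best-effort: keep going even if one phase throws
    return acted
```

The key design point is that later phases are progressively blunter fallbacks, while the cleanup phase always gets a chance to run.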
Also fix from_kwargs() in CrawlerRunConfig and BrowserConfig to
auto-deserialize dict values using the existing from_serializable_dict()
infrastructure. Previously, strategy objects like markdown_generator
arriving as {"type": "DefaultMarkdownGenerator", "params": {...}} from
JSON APIs were passed through as raw dicts, causing crashes when the
crawler later called methods on them.
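A minimal sketch of that rehydration step (the registry and object shapes here are illustrative; the real code goes through the from_serializable_dict() infrastructure):

```python
# Illustrative registry mapping serialized type names to constructors.
STRATEGY_REGISTRY = {
    "DefaultMarkdownGenerator": lambda **params: ("markdown_generator", params),
}

def deserialize_kwargs(kwargs: dict) -> dict:
    """Rehydrate {"type": ..., "params": ...} dicts into objects; pass
    everything else through unchanged."""
    out = {}
    for key, value in kwargs.items():
        if (isinstance(value, dict) and "type" in value
                and value["type"] in STRATEGY_REGISTRY):
            out[key] = STRATEGY_REGISTRY[value["type"]](**value.get("params", {}))
        else:
            out[key] = value
    return out
```

Plain scalars and unknown dicts stay untouched, so only recognizably serialized strategy objects are converted.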
- Add tests for device_scale_factor (config + integration)
- Add tests for redirected_status_code (model + redirect + raw HTML)
- Document device_scale_factor in browser config docs and API reference
- Document redirected_status_code in crawler result docs and API reference
- Add TristanDonze and charlaie to CONTRIBUTORS.md
- Update PR-TODOLIST with session results
Applied manually due to conflicts (PR based on older code).
Also fixed missing variable initialization for non-goto paths
(file://, raw:, js_only) that would have caused NameError.
Closes #1434
- Added PRs #475, #462, #416, #335, #332, #312
- Flagged #475 as duplicate of merged #1296
- Corrected author for #1450 (rbushri)
- Updated total count to ~63 open PRs
- Updated date to 2026-02-06
The previous recycle logic waited for all refcounts to hit 0 before
recycling, which never happened under sustained concurrent load (with
20+ concurrent crawls, at least one context was always active).
New approach:
- Add _browser_version to config signature — bump it to force new contexts
- When threshold is hit: bump version, move old sigs to _pending_cleanup
- New requests get new contexts automatically (different signature)
- Old contexts drain naturally and get cleaned up when refcount hits 0
- Safety cap: max 3 pending browsers draining at once
This means recycling now works under any load pattern — no blocking,
no waiting for quiet moments. Old and new browsers coexist briefly
during transitions.
Includes 12 new tests covering version bumps, concurrent recycling,
safety cap, and edge cases.
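The version-bump mechanism can be sketched like this (class and attribute names are illustrative stand-ins, not the library's actual implementation):

```python
class BrowserPool:
    """Sketch of version-bump recycling: bumping the version changes every
    signature, so new requests miss the cache and get fresh contexts."""
    def __init__(self, max_pending=3):
        self._version = 0
        self._contexts = {}        # signature -> context object
        self._refcounts = {}       # signature -> active users
        self._pending_cleanup = set()
        self._max_pending = max_pending

    def _signature(self, config_key):
        # The version is part of the signature.
        return (config_key, self._version)

    def recycle(self):
        if len(self._pending_cleanup) >= self._max_pending:
            return False  # safety cap: too many browsers still draining
        self._pending_cleanup.update(self._contexts)
        self._version += 1  # all future signatures now differ
        return True

    def acquire(self, config_key):
        sig = self._signature(config_key)
        self._contexts.setdefault(sig, object())
        self._refcounts[sig] = self._refcounts.get(sig, 0) + 1
        return sig

    def release(self, sig):
        self._refcounts[sig] -= 1
        if self._refcounts[sig] == 0 and sig in self._pending_cleanup:
            self._pending_cleanup.discard(sig)
            self._contexts.pop(sig, None)  # old context drained; clean up
```

Nothing ever blocks: old contexts drain at their own pace while new traffic lands on the bumped version.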
The first definition (with tags/questions fields) was immediately
overwritten by the second simpler definition — pure dead code.
Removes 61 lines of unused prompt text.
Inspired by PR #931 (stevenaldinger).
contexts_by_config accumulated browser contexts unboundedly in long-running
crawlers (Docker API). Two root causes fixed:
1. _make_config_signature() hashed ~60 CrawlerRunConfig fields but only 7
affect the browser context (proxy_config, locale, timezone_id, geolocation,
override_navigator, simulate_user, magic). Switched from blacklist to
whitelist — non-context fields like word_count_threshold, css_selector,
screenshot, verbose no longer cause unnecessary context creation.
2. No eviction mechanism existed between close() calls. Added refcount
tracking (_context_refcounts, incremented under _contexts_lock in
get_page, decremented in release_page_with_context) and LRU eviction
(_evict_lru_context_locked) that caps contexts at _max_contexts=20,
evicting only idle contexts (refcount==0) oldest-first.
Also fixed: storage_state path leaked a temporary context every request
(now explicitly closed after clone_runtime_state).
Closes #943. Credit to @Martichou for the investigation in #1640.
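Both fixes can be sketched together (a minimal sketch; `make_config_signature` and `evict_lru` are illustrative names, and the whitelist is the seven fields listed above):

```python
from collections import OrderedDict

# Whitelist: the only fields that actually affect the browser context.
CONTEXT_FIELDS = ("proxy_config", "locale", "timezone_id", "geolocation",
                  "override_navigator", "simulate_user", "magic")

def make_config_signature(config: dict) -> tuple:
    """Hash only context-relevant fields; word_count_threshold, css_selector,
    screenshot, verbose, etc. no longer create new contexts."""
    return tuple((f, repr(config.get(f))) for f in CONTEXT_FIELDS)

def evict_lru(contexts: OrderedDict, refcounts: dict, max_contexts: int = 20):
    """Evict idle (refcount == 0) contexts oldest-first until under the cap;
    busy contexts are never evicted."""
    for sig in list(contexts):
        if len(contexts) <= max_contexts:
            break
        if refcounts.get(sig, 0) == 0:
            contexts.pop(sig)
```

The whitelist stops the leak at the source (fewer distinct signatures), and LRU eviction bounds whatever variety remains.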
- PR #1714: Replace tf-playwright-stealth with playwright-stealth
- PR #1721: Respect <base> tag in html2text for relative links
- PR #1719: Include GoogleSearchCrawler script.js in package data
- PR #1717: Allow local embeddings by removing OpenAI fallback
- Fix: Extract <base href> from raw HTML before head gets stripped
- Close duplicates: #1703, #1698, #1697, #1710, #1720
- Update CONTRIBUTORS.md and PR-TODOLIST.md
Pass the normalized absolute URL instead of the raw href to
can_process_url() in BFS, BFF, and DFS deep crawl strategies.
This ensures URL validation and filter chain evaluation operate
on consistent, fully-qualified URLs.
Fixes #1743
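The normalization step can be sketched with the stdlib (a sketch, not the strategies' actual code; dropping the fragment is an assumption here):

```python
from urllib.parse import urljoin, urldefrag

def normalize_href(base_url: str, href: str) -> str:
    """Resolve a raw href against the page URL and drop the fragment,
    so can_process_url() and filters always see a fully-qualified URL."""
    absolute = urljoin(base_url, href)
    return urldefrag(absolute)[0]
```

Relative links like `../c` and already-absolute links both come out as consistent absolute URLs, so a filter chain never sees the same target in two different spellings.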
- Add TheRedRad to CONTRIBUTORS.md for PR #1694
- Document force_viewport_screenshot in API parameters reference
- Add viewport screenshot note in browser-crawler-config guide
- Add viewport-only screenshot example in screenshot docs
- Replace raw eval() in _compute_field() with AST-validated
_safe_eval_expression() that blocks __import__, dunder attribute
access, and import statements while preserving safe transforms
- Add ALLOWED_DESERIALIZE_TYPES allowlist to from_serializable_dict()
preventing arbitrary class instantiation from API input
- Update security contact email and add v0.8.1 security fixes to
SECURITY.md with researcher acknowledgment
- Add 17 security tests covering both fixes
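The AST-validation idea can be sketched like this (a simplified sketch, not the library's `_safe_eval_expression`; real hardening needs a broader node allowlist than shown here):

```python
import ast

def safe_eval_expression(expr: str, names: dict):
    """Parse first, reject dangerous nodes, then evaluate with no builtins."""
    tree = ast.parse(expr, mode="eval")  # statements (incl. import) won't parse
    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise ValueError("dunder attribute access is not allowed")
        if isinstance(node, ast.Name) and node.id == "__import__":
            raise ValueError("__import__ is not allowed")
    return eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}}, names)
```

Safe transforms such as `value.strip().upper()` still work, while escape routes like `value.__class__` are rejected before evaluation.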
Strip markdown code fences (```json ... ```) from LLM responses before
json.loads() in agenerate_schema(). Anthropic models wrap JSON output
in markdown fences when litellm silently drops the unsupported
response_format parameter, causing json.loads("") parse failures.
- Add _strip_markdown_fences() helper to extraction_strategy.py
- Apply fence stripping + empty response check in agenerate_schema()
- Separate JSONDecodeError for clearer error messages
- Add 34 tests: unit, real API integration (Anthropic/OpenAI/Groq
against quotes.toscrape.com), and regression parametrized
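The fence-stripping helper can be sketched as follows (a sketch of the idea; the real `_strip_markdown_fences()` may differ in edge-case handling):

```python
import re

def strip_markdown_fences(text: str) -> str:
    """Remove a wrapping ```json ... ``` (or bare ```) fence if present;
    unfenced responses pass through unchanged."""
    text = text.strip()
    match = re.match(r"^```[a-zA-Z0-9]*\s*\n(.*?)\n?```$", text, re.DOTALL)
    return match.group(1).strip() if match else text
```

Running this before json.loads() makes the parser indifferent to whether the model honored response_format or fell back to fenced markdown.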
- Use class-level tracking keyed by normalized CDP URL
- All BrowserManager instances connecting to same browser share tracking
- For CDP connections, always create new pages (cross-connection page
sharing isn't reliable in Playwright)
- For managed browsers, page reuse works within same process
- Normalize CDP URLs to handle different formats (http, ws, query params)
When using create_isolated_context=False with concurrent crawls, multiple
tasks would reuse the same page (pages[0]) causing navigation race
conditions and "Page.content: Unable to retrieve content because the
page is navigating" errors.
Changes:
- Add _pages_in_use set to track pages currently being used by crawls
- Rewrite get_page() to only reuse pages that are not in use
- Create new pages when all existing pages are busy
- Add release_page() method to release pages after crawl completes
- Update cleanup paths to release pages before closing
This maintains context sharing (cookies, localStorage) while ensuring
each concurrent crawl gets its own isolated page for navigation.
Includes integration tests verifying:
- Single and sequential crawls still work
- Concurrent crawls don't cause race conditions
- High concurrency (10 simultaneous crawls) works
- Page tracking state remains consistent
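The busy-page bookkeeping can be sketched like this (a minimal sketch; `PagePool` is illustrative, and strings stand in for Playwright Page objects):

```python
class PagePool:
    """Reuse only idle pages; mint a new page when all existing ones are busy."""
    def __init__(self):
        self._pages = []
        self._in_use = set()
        self._counter = 0

    def get_page(self):
        for page in self._pages:
            if page not in self._in_use:   # idle page: safe to reuse
                self._in_use.add(page)
                return page
        self._counter += 1
        page = f"page-{self._counter}"     # stand-in for a new Playwright Page
        self._pages.append(page)
        self._in_use.add(page)
        return page

    def release_page(self, page):
        self._in_use.discard(page)         # page becomes reusable again
```

Two concurrent crawls can never collide on the same page, yet a sequential workload keeps reusing a single page and its shared context state.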
- Add storage_state.json to all KEEP_PATTERNS levels
- This file contains unencrypted cookies in Playwright format
- Critical for cross-machine profile portability (local -> cloud)
- Add --password-store=basic and --use-mock-keychain flags when creating
profiles to prevent OS keychain encryption of cookies
- Without this, cookies are encrypted with machine-specific keys and
profiles can't be transferred between machines (local -> cloud)
Also adds direct CLI commands for profile management:
- crwl profiles create <name>
- crwl profiles list
- crwl profiles delete <name>
The interactive menu (crwl profiles) still works as before.