- Added PRs #475, #462, #416, #335, #332, #312
- Flagged #475 as duplicate of merged #1296
- Corrected author for #1450 (rbushri)
- Updated total count to ~63 open PRs
- Updated date to 2026-02-06
The previous recycle logic waited for all refcounts to hit 0 before
recycling, which never happened under sustained concurrent load (20+
crawls always had at least one active).
New approach:
- Add _browser_version to config signature — bump it to force new contexts
- When threshold is hit: bump version, move old sigs to _pending_cleanup
- New requests get new contexts automatically (different signature)
- Old contexts drain naturally and get cleaned up when refcount hits 0
- Safety cap: max 3 pending browsers draining at once
This means recycling now works under any load pattern — no blocking,
no waiting for quiet moments. Old and new browsers coexist briefly
during transitions.
Includes 12 new tests covering version bumps, concurrent recycling,
safety cap, and edge cases.
The first definition (with tags/questions fields) was immediately
overwritten by the second simpler definition — pure dead code.
Removes 61 lines of unused prompt text.
Inspired by PR #931 (stevenaldinger).
contexts_by_config accumulated browser contexts unboundedly in long-running
crawlers (Docker API). Two root causes fixed:
1. _make_config_signature() hashed ~60 CrawlerRunConfig fields but only 7
affect the browser context (proxy_config, locale, timezone_id, geolocation,
override_navigator, simulate_user, magic). Switched from blacklist to
whitelist — non-context fields like word_count_threshold, css_selector,
screenshot, verbose no longer cause unnecessary context creation.
2. No eviction mechanism existed between close() calls. Added refcount
tracking (_context_refcounts, incremented under _contexts_lock in
get_page, decremented in release_page_with_context) and LRU eviction
(_evict_lru_context_locked) that caps contexts at _max_contexts=20,
evicting only idle contexts (refcount==0) oldest-first.
Also fixed: storage_state path leaked a temporary context every request
(now explicitly closed after clone_runtime_state).
Closes#943. Credit to @Martichou for the investigation in #1640.
- PR #1714: Replace tf-playwright-stealth with playwright-stealth
- PR #1721: Respect <base> tag in html2text for relative links
- PR #1719: Include GoogleSearchCrawler script.js in package data
- PR #1717: Allow local embeddings by removing OpenAI fallback
- Fix: Extract <base href> from raw HTML before head gets stripped
- Close duplicates: #1703, #1698, #1697, #1710, #1720
- Update CONTRIBUTORS.md and PR-TODOLIST.md
Pass the normalized absolute URL instead of the raw href to
can_process_url() in BFS, BFF, and DFS deep crawl strategies.
This ensures URL validation and filter chain evaluation operate
on consistent, fully-qualified URLs.
Fixes#1743
- Add TheRedRad to CONTRIBUTORS.md for PR #1694
- Document force_viewport_screenshot in API parameters reference
- Add viewport screenshot note in browser-crawler-config guide
- Add viewport-only screenshot example in screenshot docs
- Replace raw eval() in _compute_field() with AST-validated
_safe_eval_expression() that blocks __import__, dunder attribute
access, and import statements while preserving safe transforms
- Add ALLOWED_DESERIALIZE_TYPES allowlist to from_serializable_dict()
preventing arbitrary class instantiation from API input
- Update security contact email and add v0.8.1 security fixes to
SECURITY.md with researcher acknowledgment
- Add 17 security tests covering both fixes
Strip markdown code fences (```json ... ```) from LLM responses before
json.loads() in agenerate_schema(). Anthropic models wrap JSON output
in markdown fences when litellm silently drops the unsupported
response_format parameter, causing json.loads("") parse failures.
- Add _strip_markdown_fences() helper to extraction_strategy.py
- Apply fence stripping + empty response check in agenerate_schema()
- Separate JSONDecodeError for clearer error messages
- Add 34 tests: unit, real API integration (Anthropic/OpenAI/Groq
against quotes.toscrape.com), and regression parametrized
- Use class-level tracking keyed by normalized CDP URL
- All BrowserManager instances connecting to same browser share tracking
- For CDP connections, always create new pages (cross-connection page
sharing isn't reliable in Playwright)
- For managed browsers, page reuse works within same process
- Normalize CDP URLs to handle different formats (http, ws, query params)
When using create_isolated_context=False with concurrent crawls, multiple
tasks would reuse the same page (pages[0]) causing navigation race
conditions and "Page.content: Unable to retrieve content because the
page is navigating" errors.
Changes:
- Add _pages_in_use set to track pages currently being used by crawls
- Rewrite get_page() to only reuse pages that are not in use
- Create new pages when all existing pages are busy
- Add release_page() method to release pages after crawl completes
- Update cleanup paths to release pages before closing
This maintains context sharing (cookies, localStorage) while ensuring
each concurrent crawl gets its own isolated page for navigation.
Includes integration tests verifying:
- Single and sequential crawls still work
- Concurrent crawls don't cause race conditions
- High concurrency (10 simultaneous crawls) works
- Page tracking state remains consistent
- Add storage_state.json to all KEEP_PATTERNS levels
- This file contains unencrypted cookies in Playwright format
- Critical for cross-machine profile portability (local -> cloud)
- Add --password-store=basic and --use-mock-keychain flags when creating
profiles to prevent OS keychain encryption of cookies
- Without this, cookies are encrypted with machine-specific keys and
profiles can't be transferred between machines (local -> cloud)
Also adds direct CLI commands for profile management:
- crwl profiles create <name>
- crwl profiles list
- crwl profiles delete <name>
The interactive menu (crwl profiles) still works as before.
- Add Section 11 "Cancellation Support for Deep Crawls" to deep-crawling.md
- Document should_cancel callback, cancel() method, and cancelled property
- Include complete example for cloud platform job cancellation
- Add docs/examples/deep_crawl_cancellation.py with 6 comprehensive examples
- Update summary section to mention cancellation feature
- Add should_cancel callback parameter to BFS, DFS, and BestFirst strategies
- Add cancel() method for immediate cancellation (thread-safe)
- Add cancelled property to check cancellation status
- Add _check_cancellation() internal method supporting both sync/async callbacks
- Reset cancel event on strategy reuse for multiple crawls
- Include cancelled flag in state notifications via on_state_change
- Handle callback exceptions gracefully (fail-open, log warning)
- Add comprehensive test suite with 26 tests covering all edge cases
This enables external callers (e.g., cloud platforms) to stop a running
deep crawl mid-execution and retrieve partial results.