The first definition (with tags/questions fields) was immediately
overwritten by the second, simpler definition, leaving pure dead code.
Removes 61 lines of unused prompt text.
Inspired by PR #931 (stevenaldinger).
contexts_by_config accumulated browser contexts without bound in long-running
crawlers (Docker API). Two root causes were fixed:
1. _make_config_signature() hashed ~60 CrawlerRunConfig fields, but only 7
affect the browser context (proxy_config, locale, timezone_id, geolocation,
override_navigator, simulate_user, magic). Switched from a blacklist to a
whitelist, so non-context fields like word_count_threshold, css_selector,
screenshot, and verbose no longer cause unnecessary context creation.
2. No eviction mechanism existed between close() calls. Added refcount
tracking (_context_refcounts, incremented under _contexts_lock in
get_page, decremented in release_page_with_context) and LRU eviction
(_evict_lru_context_locked) that caps contexts at _max_contexts=20,
evicting only idle contexts (refcount==0) oldest-first (sketched below).
Also fixed: the storage_state path leaked a temporary context on every
request (it is now explicitly closed after clone_runtime_state).
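A minimal sketch of both fixes. The field list comes from this commit, but
the dict layout and the function bodies below are assumptions, not the
shipped BrowserManager code:

```python
import hashlib
import json

# The 7 whitelisted fields named above; all other CrawlerRunConfig
# fields are ignored when computing the context signature.
CONTEXT_FIELDS = (
    "proxy_config", "locale", "timezone_id", "geolocation",
    "override_navigator", "simulate_user", "magic",
)

def make_config_signature(config) -> str:
    # Hash only fields that affect the browser context, so changing
    # e.g. css_selector or screenshot reuses an existing context.
    payload = {name: repr(getattr(config, name, None)) for name in CONTEXT_FIELDS}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def evict_lru_contexts_locked(contexts, refcounts, last_used, max_contexts=20):
    # Caller holds _contexts_lock. Only idle contexts (refcount == 0) are
    # eligible, oldest first; returns victims to close outside the lock.
    victims = []
    while len(contexts) > max_contexts:
        idle = [sig for sig in contexts if refcounts.get(sig, 0) == 0]
        if not idle:
            break  # every context is in use; nothing safe to evict
        oldest = min(idle, key=lambda sig: last_used.get(sig, 0.0))
        victims.append(contexts.pop(oldest))
        refcounts.pop(oldest, None)
        last_used.pop(oldest, None)
    return victims
```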
Closes #943. Credit to @Martichou for the investigation in #1640.
- PR #1714: Replace tf-playwright-stealth with playwright-stealth
- PR #1721: Respect <base> tag in html2text for relative links
- PR #1719: Include GoogleSearchCrawler script.js in package data
- PR #1717: Allow local embeddings by removing OpenAI fallback
- Fix: Extract <base href> from raw HTML before the <head> is stripped
- Close duplicates: #1703, #1698, #1697, #1710, #1720
- Update CONTRIBUTORS.md and PR-TODOLIST.md
Pass the normalized absolute URL instead of the raw href to
can_process_url() in the BFS, DFS, and Best-First deep crawl strategies.
This ensures URL validation and filter chain evaluation operate
on consistent, fully-qualified URLs.
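Illustratively, the resolution amounts to a urljoin before validation
(variable names here are hypothetical; the strategies' internals may differ):

```python
from urllib.parse import urljoin

page_url = "https://example.com/docs/"
raw_href = "../blog/post?ref=1"

# Resolve the raw href against the page URL first, so can_process_url()
# and the filter chain always see a fully-qualified URL.
absolute_url = urljoin(page_url, raw_href)
print(absolute_url)  # https://example.com/blog/post?ref=1
```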
Fixes #1743
- Add TheRedRad to CONTRIBUTORS.md for PR #1694
- Document force_viewport_screenshot in API parameters reference
- Add viewport screenshot note in browser-crawler-config guide
- Add viewport-only screenshot example in screenshot docs
- Replace raw eval() in _compute_field() with AST-validated
_safe_eval_expression() that blocks __import__, dunder attribute
access, and import statements while preserving safe transforms
(see the sketch after this list)
- Add ALLOWED_DESERIALIZE_TYPES allowlist to from_serializable_dict(),
preventing arbitrary class instantiation from API input
- Update security contact email and add v0.8.1 security fixes to
SECURITY.md with researcher acknowledgment
- Add 17 security tests covering both fixes
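A minimal sketch of the AST-validation approach, with hypothetical names
and a simplified check set; the shipped _safe_eval_expression() may differ:

```python
import ast

# Builtins that must never be reachable from a field expression.
BLOCKED_NAMES = {"__import__", "eval", "exec", "compile", "open", "globals", "locals"}

def safe_eval_expression(expr: str, value):
    # mode="eval" rejects statements (including `import ...`) with a
    # SyntaxError, so only plain expressions reach the checks below.
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        # Block dangerous builtins such as __import__.
        if isinstance(node, ast.Name) and node.id in BLOCKED_NAMES:
            raise ValueError(f"{node.id!r} is not allowed in field expressions")
        # Block dunder attribute access like ''.__class__.__mro__.
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise ValueError("dunder attribute access is not allowed")
    # Evaluate with builtins stripped; only `value` is in scope, so safe
    # transforms like value.strip().lower() keep working.
    return eval(compile(tree, "<field>", "eval"), {"__builtins__": {}}, {"value": value})
```

Under this sketch, safe_eval_expression("value.strip().lower()", "  Hi ")
returns "hi", while "__import__('os').system('id')" raises ValueError.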
Strip markdown code fences (```json ... ```) from LLM responses before
json.loads() in agenerate_schema(). Anthropic models wrap JSON output
in markdown fences when litellm silently drops the unsupported
response_format parameter, causing json.loads("") parse failures.
- Add _strip_markdown_fences() helper to extraction_strategy.py (sketched below)
- Apply fence stripping + empty response check in agenerate_schema()
- Separate JSONDecodeError for clearer error messages
- Add 34 tests: unit, real API integration (Anthropic/OpenAI/Groq
against quotes.toscrape.com), and parametrized regression tests
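A minimal sketch of the stripping-and-parsing flow, under assumed helper
names; the actual extraction_strategy.py code may differ in detail:

```python
import json

def strip_markdown_fences(text: str) -> str:
    # Remove a surrounding ```json ... ``` (or bare ```) fence.
    text = text.strip()
    if text.startswith("```"):
        newline = text.find("\n")
        # Drop the opening fence line, e.g. ``` or ```json.
        text = text[newline + 1:] if newline != -1 else ""
    if text.rstrip().endswith("```"):
        text = text.rstrip()[:-3]
    return text.strip()

def parse_schema_response(response_text: str) -> dict:
    cleaned = strip_markdown_fences(response_text)
    if not cleaned:
        raise ValueError("LLM returned an empty response")
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError as exc:
        # A dedicated error path gives a clearer message than a bare traceback.
        raise ValueError(f"LLM response is not valid JSON: {exc}") from exc
```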
- Use class-level tracking keyed by normalized CDP URL
- All BrowserManager instances connecting to same browser share tracking
- For CDP connections, always create new pages (cross-connection page
sharing isn't reliable in Playwright)
- For managed browsers, page reuse works within same process
- Normalize CDP URLs to handle different formats (http, ws, query params);
  see the sketch below
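A normalization sketch; the real tracking-key format is an internal detail,
and 9222 here is only an assumed default DevTools port:

```python
from urllib.parse import urlparse

def normalize_cdp_url(cdp_url: str) -> str:
    # Map http://localhost:9222/, ws://localhost:9222/devtools/browser/<id>,
    # and variants with query params onto one host:port tracking key.
    parsed = urlparse(cdp_url)
    host = (parsed.hostname or "localhost").lower()
    port = parsed.port or 9222  # assumed default DevTools port
    return f"{host}:{port}"

assert normalize_cdp_url("http://localhost:9222") == \
       normalize_cdp_url("ws://LOCALHOST:9222/devtools/browser/abc?x=1")
```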
When using create_isolated_context=False with concurrent crawls, multiple
tasks would reuse the same page (pages[0]), causing navigation race
conditions and "Page.content: Unable to retrieve content because the
page is navigating" errors.
Changes:
- Add _pages_in_use set to track pages currently being used by crawls
- Rewrite get_page() to only reuse pages that are not in use
- Create new pages when all existing pages are busy
- Add release_page() method to release pages after crawl completes
- Update cleanup paths to release pages before closing
This maintains context sharing (cookies, localStorage) while ensuring
each concurrent crawl gets its own isolated page for navigation, as
sketched below.
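A minimal sketch of the tracking scheme, assuming names like _pages_in_use
and release_page() from this commit; the real BrowserManager logic differs:

```python
import asyncio

class PagePool:
    """Hands out non-busy pages from one shared browser context."""

    def __init__(self, context):
        self._context = context      # shared cookies / localStorage
        self._pages_in_use = set()   # pages currently owned by a crawl
        self._lock = asyncio.Lock()

    async def get_page(self):
        async with self._lock:
            # Reuse an existing page only if no crawl currently owns it.
            for page in self._context.pages:
                if page not in self._pages_in_use:
                    self._pages_in_use.add(page)
                    return page
            # All pages busy: open a fresh page in the same context, so
            # state is shared but navigation stays isolated per crawl.
            page = await self._context.new_page()
            self._pages_in_use.add(page)
            return page

    def release_page(self, page) -> None:
        # Called after a crawl completes (and before cleanup closes pages).
        self._pages_in_use.discard(page)
```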
Includes integration tests verifying:
- Single and sequential crawls still work
- Concurrent crawls don't cause race conditions
- High concurrency (10 simultaneous crawls) works
- Page tracking state remains consistent
- Add storage_state.json to all KEEP_PATTERNS levels
- This file contains unencrypted cookies in Playwright format
- Critical for cross-machine profile portability (local -> cloud)
- Add --password-store=basic and --use-mock-keychain flags when creating
profiles to prevent OS keychain encryption of cookies (see the launch
sketch after this list)
- Without these flags, cookies are encrypted with machine-specific keys and
profiles can't be transferred between machines (local -> cloud)
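A launch sketch using Playwright's launch_persistent_context; the flags are
from this change, while the surrounding setup is illustrative:

```python
import asyncio
from playwright.async_api import async_playwright

async def create_portable_profile(user_data_dir: str) -> None:
    async with async_playwright() as p:
        context = await p.chromium.launch_persistent_context(
            user_data_dir,
            headless=False,
            args=[
                # Store cookies unencrypted so the profile can be copied
                # between machines (local -> cloud).
                "--password-store=basic",
                "--use-mock-keychain",
            ],
        )
        # ... user logs in interactively; cookies land in the profile dir ...
        await context.close()

asyncio.run(create_portable_profile("./my-profile"))
```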
Also adds direct CLI commands for profile management:
- crwl profiles create <name>
- crwl profiles list
- crwl profiles delete <name>
The interactive menu (crwl profiles) still works as before.
- Add Section 11 "Cancellation Support for Deep Crawls" to deep-crawling.md
- Document should_cancel callback, cancel() method, and cancelled property
- Include complete example for cloud platform job cancellation
- Add docs/examples/deep_crawl_cancellation.py with 6 comprehensive examples
- Update summary section to mention cancellation feature
- Add should_cancel callback parameter to BFS, DFS, and BestFirst strategies
- Add cancel() method for immediate cancellation (thread-safe)
- Add cancelled property to check cancellation status
- Add _check_cancellation() internal method supporting both sync/async callbacks
- Reset cancel event on strategy reuse for multiple crawls
- Include cancelled flag in state notifications via on_state_change
- Handle callback exceptions gracefully (fail-open, log warning)
- Add comprehensive test suite with 26 tests covering all edge cases
This enables external callers (e.g., cloud platforms) to stop a running
deep crawl mid-execution and retrieve partial results; a usage sketch
follows.
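A usage sketch based on the names above (should_cancel, cancel(),
cancelled); exact signatures and import paths may differ:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

job_cancelled = False  # e.g. flipped by a cloud platform's job-control API

def should_cancel() -> bool:
    # Sync callback polled during the crawl; async callbacks work too.
    return job_cancelled

async def main():
    strategy = BFSDeepCrawlStrategy(max_depth=2, should_cancel=should_cancel)
    config = CrawlerRunConfig(deep_crawl_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)
    if strategy.cancelled:
        print(f"Stopped early; kept {len(results)} partial results")

asyncio.run(main())
```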
Replace hardcoded version string with import from crawl4ai.__version__
to ensure health endpoint reports correct version.
Fixes #1686
Reviewers: @chansearrington