Commit Graph

1356 Commits

Author SHA1 Message Date
unclecode
8576331d4e Add Shadow DOM flattening and reorder js_code execution pipeline
- Add `flatten_shadow_dom` option to CrawlerRunConfig that serializes
  shadow DOM content into the light DOM before HTML capture. Uses a
  recursive serializer that resolves <slot> projections and strips
  only shadow-scoped <style> tags. Also injects an init script to
  force-open closed shadow roots via attachShadow patching.

- Move `js_code` execution to after `wait_for` + `delay_before_return_html`
  so user scripts run on the fully-hydrated page. Add `js_code_before_wait`
  for the less common case of triggering loading before waiting.

- Add JS snippet (flatten_shadow_dom.js), integration test, example,
  and documentation across all relevant doc files.
2026-02-18 06:43:00 +00:00
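The slot-resolution logic described above can be sketched in Python over a toy node tree. This is only an illustration of the idea; the actual implementation is the `flatten_shadow_dom.js` snippet running in the browser, and the dict-based node model here is invented for the example:

```python
# Toy model: a node is a dict with "tag", optional "children" (light DOM),
# optional "shadow" (shadow root children), and for <slot> nodes an
# optional "slotted" list of projected light-DOM nodes.
def serialize(node):
    """Serialize a node, splicing shadow DOM content into the output
    and replacing <slot> placeholders with their projected content."""
    if node["tag"] == "#text":
        return node["text"]
    if node["tag"] == "slot":
        # A slot renders whatever light-DOM nodes were assigned to it.
        return "".join(serialize(c) for c in node.get("slotted", []))
    if node.get("shadow") is not None:
        # Shadow root wins: render its tree (which may contain <slot>s)
        # instead of the raw light-DOM children.
        inner = "".join(serialize(c) for c in node["shadow"])
    else:
        inner = "".join(serialize(c) for c in node.get("children", []))
    return f"<{node['tag']}>{inner}</{node['tag']}>"

text = lambda s: {"tag": "#text", "text": s}
card = {
    "tag": "my-card",
    "children": [text("ignored light DOM")],
    "shadow": [
        {"tag": "h2", "children": [text("Title")]},
        {"tag": "slot", "slotted": [text("projected body")]},
    ],
}
print(serialize(card))  # <my-card><h2>Title</h2>projected body</my-card>
```

The key property matches the commit: content inside the shadow root becomes ordinary markup in the captured HTML, with slot projections resolved in place.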
unclecode
4fb02f8b50 Warn LLM against hashed/generated CSS class names in schema prompts
Replace vague "handle dynamic class names appropriately" with explicit
rule: never use auto-generated class names (.styles_card__xK9r2, etc.)
as they break on every site rebuild. Prefer data-* attributes, semantic
tags, ARIA attributes, and stable meaningful class names instead.
2026-02-17 12:02:58 +00:00
unclecode
d267c650cb Add source (sibling selector) support to JSON extraction strategies
Many sites (e.g. Hacker News) split a single item's data across sibling
elements. Field selectors only search descendants, making sibling data
unreachable. The new "source" field key navigates to a sibling element
before running the selector: {"source": "+ tr"} finds the next sibling
<tr>, then extracts from there.

- Add _resolve_source abstract method to JsonElementExtractionStrategy
- Implement in all 4 subclasses (CSS/BS4, XPath/lxml, two lxml/CSS)
- Modify _extract_field to resolve source before type dispatch
- Update CSS and XPath LLM prompts with source docs and HN example
- Default generate_schema validate=True so schemas are checked on creation
- Add schema validation with feedback loop for auto-refinement
- Add messages param to completion helpers for multi-turn refinement
- Document source field and schema validation in docs
- Add 14 unit tests covering CSS, XPath, backward compat, edge cases
2026-02-17 09:04:40 +00:00
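The sibling-navigation idea behind `"source"` can be sketched with the stdlib XML parser on a Hacker-News-shaped layout. The real strategies use BeautifulSoup/lxml CSS selectors; the `extract_field` helper and the index-based sibling hop here are illustrative only, and only the `"+ tr"` next-sibling form is handled:

```python
import xml.etree.ElementTree as ET

HTML = """<table>
  <tr class="athing"><td class="title">Example story</td></tr>
  <tr><td class="subtext">99 points</td></tr>
</table>"""

table = ET.fromstring(HTML)
rows = list(table.findall("tr"))

def extract_field(row_index, field):
    """Resolve an optional sibling 'source' before running the selector,
    mirroring {"source": "+ tr"}: hop to the next sibling row first."""
    row = rows[row_index]
    if field.get("source") == "+ tr":            # next-sibling navigation
        row = rows[row_index + 1]
    # Poor man's class selector: find a <td> whose class matches.
    cls = field["selector"].lstrip(".")
    for td in row.iter("td"):
        if td.get("class") == cls:
            return td.text
    return None

print(extract_field(0, {"selector": ".title"}))                      # Example story
print(extract_field(0, {"selector": ".subtext", "source": "+ tr"}))  # 99 points
```

Without `"source"`, the points field is unreachable because field selectors only search descendants of the item row.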
unclecode
ccd24aa824 Fix fallback fetch: run when all proxies crash, skip re-check, never return None
Three related fixes to the anti-bot proxy retry + fallback pipeline:

1. Allow fallback_fetch_function to run when crawl_result is None (all proxies
   threw exceptions like browser crashes). Previously fallback only ran when
   crawl_result existed but was blocked — exception-only failures bypassed it.

2. Skip is_blocked() re-check after successful fallback. Real unblocked pages
   may contain anti-bot script markers (e.g. PerimeterX JS on Walmart) that
   trigger false positives, overriding success=True back to False.

3. Always return a CrawlResult with crawl_stats, never None. When all proxies
   and fallback fail, create a minimal failed result so callers get stats
   about what was attempted instead of AttributeError on None.

Also: if aprocess_html fails during fallback (dead browser can't run
Page.evaluate for consent popup removal), fall back to raw HTML result
instead of silently discarding the successfully-fetched fallback content.
2026-02-15 10:55:00 +00:00
unclecode
45d8e1450f Fix proxy escalation: don't re-raise on first proxy exception when chain has alternatives
When proxy_config is a list (escalation chain) and the first proxy throws
an exception (timeout, connection error, browser crash), the retry loop
now continues to the next proxy instead of immediately re-raising.

Previously, exceptions on _p_idx==0 and _attempt==0 were always re-raised,
which broke the entire escalation chain — ISP/Residential/fallback proxies
were never tried. This made the proxy list effectively useless for sites
where the first-tier proxy fails with an exception rather than a blocked
response.

The raise is preserved when there's only a single proxy and single attempt
(len(proxy_list) <= 1 and max_attempts <= 1) so that simple non-chain
crawls still get immediate error propagation.
2026-02-15 09:55:55 +00:00
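The retry-loop behavior described above can be sketched as follows. This is a minimal standalone model of the control flow, not the library's actual loop; `crawl_with_escalation` and `fetch` are invented names for the example:

```python
def crawl_with_escalation(proxy_list, fetch, max_attempts=1):
    """Try each proxy in order; re-raise an exception immediately only
    when there is a single proxy and a single attempt (no chain)."""
    last_error = None
    for _attempt in range(max_attempts):
        for _p_idx, proxy in enumerate(proxy_list):
            try:
                return fetch(proxy)
            except Exception as exc:
                if len(proxy_list) <= 1 and max_attempts <= 1:
                    raise                # simple non-chain crawl: propagate now
                last_error = exc         # otherwise continue escalating
    raise last_error

calls = []
def fetch(proxy):
    calls.append(proxy)
    if proxy == "datacenter":
        raise TimeoutError("first-tier proxy timed out")
    return f"html via {proxy}"

print(crawl_with_escalation(["datacenter", "residential"], fetch))
print(calls)  # both proxies were tried, in order
```

Before the fix, the exception on the first proxy would have been re-raised unconditionally, so the `"residential"` tier would never have been reached.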
unclecode
d028a889d0 Make proxy_config a property so direct assignment also normalizes
Setting config.proxy_config = [ProxyConfig.DIRECT, ...] after
construction now goes through the same normalization as __init__,
converting "direct" sentinels to None. Fixes crash when proxy_config
is assigned directly instead of passed to the constructor.
2026-02-14 13:16:36 +00:00
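The property-based normalization can be sketched like this. `RunConfig` is a stand-in class for the example, not the real `CrawlerRunConfig`:

```python
class RunConfig:
    """Route direct assignment through the same normalization as the
    constructor by making proxy_config a property."""
    def __init__(self, proxy_config=None):
        self.proxy_config = proxy_config       # goes through the setter too

    @property
    def proxy_config(self):
        return self._proxy_config

    @proxy_config.setter
    def proxy_config(self, value):
        if isinstance(value, list):
            # "direct" sentinels mean "no proxy for this hop" -> None
            value = [None if v in ("direct", None) else v for v in value]
        self._proxy_config = value

cfg = RunConfig()
cfg.proxy_config = ["direct", {"server": "http://proxy:8080"}]
print(cfg.proxy_config)   # [None, {'server': 'http://proxy:8080'}]
```

Because `__init__` assigns through the public attribute, construction and post-construction assignment share one code path, which is the crash the commit fixes.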
unclecode
879553955c Add ProxyConfig.DIRECT sentinel for direct-then-proxy escalation
Allow "direct" or None in proxy_config list to explicitly try
without a proxy before escalating to proxy servers. The retry
loop already handled None as direct — this exposes it as a
clean user-facing API via ProxyConfig.DIRECT.
2026-02-14 10:25:07 +00:00
unclecode
875207287e Unify proxy_config to accept list, add crawl_stats tracking
- proxy_config on CrawlerRunConfig now accepts a single ProxyConfig or
  a list of ProxyConfig tried in order (first-come-first-served)
- Remove is_fallback from ProxyConfig and fallback_proxy_configs from
  CrawlerRunConfig — proxy escalation handled entirely by list order
- Add _get_proxy_list() normalizer for the retry loop
- Add CrawlResult.crawl_stats with attempts, retries, proxies_used,
  fallback_fetch_used, and resolved_by for billing and observability
- Set success=False with error_message when all attempts are blocked
- Simplify retry loop — no more is_fallback stashing logic
- Update docs and tests to reflect new API
2026-02-14 07:53:46 +00:00
unclecode
72b546c48d Add anti-bot detection, retry, and fallback system
Automatically detect when crawls are blocked by anti-bot systems
(Akamai, Cloudflare, PerimeterX, DataDome, Imperva, etc.) and
escalate through configurable retry and fallback strategies.

New features on CrawlerRunConfig:
- max_retries: retry rounds when blocking is detected
- fallback_proxy_configs: list of fallback proxies tried each round
- fallback_fetch_function: async last-resort function returning raw HTML

New field on ProxyConfig:
- is_fallback: skip proxy on first attempt, activate only when blocked

Escalation chain per round: main proxy → fallback proxies in order.
After all rounds: fallback_fetch_function as last resort.

Detection uses tiered heuristics — structural HTML markers (high
confidence) trigger on any page, generic patterns only on short
error pages to avoid false positives.
2026-02-14 05:24:07 +00:00
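The tiered detection heuristic can be sketched as below. The marker strings and the 2000-character threshold are invented for the example; the real detector's markers and cutoffs are not shown in this log:

```python
def is_blocked(html, status=200):
    """Tiered heuristic: structural markers (high confidence) fire on any
    page; generic phrases count only on short pages, which are likely
    error interstitials, to avoid false positives on real content."""
    STRUCTURAL = ("cf-challenge", "_px-captcha", "datadome")   # illustrative
    GENERIC = ("access denied", "verify you are a human")      # illustrative
    lower = html.lower()
    if any(m in lower for m in STRUCTURAL):
        return True
    if len(html) < 2000 and any(p in lower for p in GENERIC):
        return True
    return False

long_page = "<html>" + "real content " * 500 + "verify you are a human</html>"
print(is_blocked(long_page))                               # False: long page, generic phrase only
print(is_blocked("<html><h1>Access Denied</h1></html>"))   # True: short error page
```

The length gate is what keeps a long article that merely quotes "verify you are a human" from being flagged, while a short interstitial with the same phrase still triggers.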
unclecode
fdd989785f Sync sec-ch-ua with User-Agent and keep WebGL alive in stealth mode
Fix a bug where magic mode and per-request UA overrides would change
the User-Agent header without updating the sec-ch-ua (browser hint)
header to match. Anti-bot systems like Akamai detect this mismatch
as a bot signal.

Changes:
- Regenerate browser_hint via UAGen.generate_client_hints() whenever
  the UA is changed at crawl time (magic mode or explicit override)
- Re-apply updated headers to the page via set_extra_http_headers()
- Skip per-crawl UA override for persistent contexts where the UA is
  locked at launch time by Playwright's protocol layer
- Move --disable-gpu flags behind enable_stealth check so WebGL works
  via SwiftShader when stealth mode is active (missing WebGL is a
  detectable headless signal)
- Clean up old test scripts, add clean anti-bot test
2026-02-13 04:10:47 +00:00
unclecode
112f44a97d Fix proxy auth for persistent browser contexts
Chromium's --proxy-server CLI flag silently ignores inline credentials
(user:pass@server). For persistent contexts, crawl4ai was embedding
credentials in this flag via ManagedBrowser.build_browser_flags(),
causing proxy auth to fail and the browser to fall back to direct
connection.

Fix: Use Playwright's launch_persistent_context(proxy=...) API instead
of subprocess + CDP when use_persistent_context=True. This handles
proxy authentication properly via the HTTP CONNECT handshake. The
non-persistent and CDP paths remain unchanged.

Changes:
- Strip credentials from --proxy-server flag in build_browser_flags()
- Add launch_persistent_context() path in BrowserManager.start()
- Add cleanup path in BrowserManager.close()
- Guard create_browser_context() when self.browser is None
- Add regression tests covering all 4 proxy/persistence combinations
2026-02-12 11:19:29 +00:00
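The credential-stripping step can be sketched with `urllib.parse`. The function name is illustrative; the idea is simply to keep `--proxy-server` credential-free and hand the username/password to the context's proxy settings instead:

```python
from urllib.parse import urlsplit, urlunsplit

def strip_proxy_credentials(proxy_url):
    """Chromium's --proxy-server flag silently ignores inline user:pass,
    so split the credentials out and return them separately for the
    persistent context's proxy={"server", "username", "password"} config."""
    parts = urlsplit(proxy_url)
    host = parts.hostname + (f":{parts.port}" if parts.port else "")
    bare = urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
    return bare, parts.username, parts.password

bare, user, pwd = strip_proxy_credentials("http://alice:s3cret@proxy.example.com:8080")
print(bare)        # http://proxy.example.com:8080
print(user, pwd)   # alice s3cret
```

With the bare URL in the CLI flag and the credentials supplied via the Playwright proxy config, authentication happens over the HTTP CONNECT handshake as the commit describes.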
unclecode
1a24ac785e Refactor from_kwargs to respect set_defaults and use __init__ defaults
Replace hardcoded parameter listings in BrowserConfig.from_kwargs() and
CrawlerRunConfig.from_kwargs() with a generic approach that filters
input kwargs to valid __init__ params and passes them through. This:

- Makes set_defaults() work with from_kwargs() (previously ignored)
- Fixes default mismatches (word_count_threshold was 200 vs __init__=1,
  markdown_generator was None vs __init__=DefaultMarkdownGenerator())
- Eliminates ~160 lines of duplicated default values
- Auto-supports new params without updating from_kwargs
2026-02-11 13:35:36 +00:00
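The generic approach can be sketched with `inspect.signature`. `CrawlerConfig` is a tiny stand-in class with made-up parameters; the point is that `__init__` remains the single source of truth for defaults:

```python
import inspect

class CrawlerConfig:
    def __init__(self, word_count_threshold=1, screenshot=False, verbose=True):
        self.word_count_threshold = word_count_threshold
        self.screenshot = screenshot
        self.verbose = verbose

    @classmethod
    def from_kwargs(cls, kwargs):
        """Filter unknown keys and let __init__ supply every default,
        instead of hardcoding a second (drift-prone) copy of them."""
        valid = set(inspect.signature(cls.__init__).parameters) - {"self"}
        return cls(**{k: v for k, v in kwargs.items() if k in valid})

cfg = CrawlerConfig.from_kwargs({"screenshot": True, "not_a_param": 42})
print(cfg.screenshot, cfg.word_count_threshold)   # True 1
```

Any new `__init__` parameter is picked up automatically, which is how the commit eliminates the duplicated default listings and the mismatches between them.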
unclecode
3fc7730aaf Add remove_consent_popups flag and fix from_kwargs dict deserialization
Add CrawlerRunConfig.remove_consent_popups (bool, default False) that
targets GDPR/cookie consent popups from 70+ known CMP providers including
OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Usercentrics,
Sourcepoint, Google FundingChoices, and many more.

The JS strategy uses a 5-phase approach:
1. Click "Accept All" buttons (cleanest dismissal, sets cookies)
2. Try CMP JavaScript APIs (__tcfapi, Didomi, Cookiebot, Osano, Klaro)
3. Remove known CMP containers by selector (~120 selectors)
4. Handle iframe-based and shadow DOM CMPs
5. Restore body scroll and remove CMP body classes

Also fix from_kwargs() in CrawlerRunConfig and BrowserConfig to
auto-deserialize dict values using the existing from_serializable_dict()
infrastructure. Previously, strategy objects like markdown_generator
arriving as {"type": "DefaultMarkdownGenerator", "params": {...}} from
JSON APIs were passed through as raw dicts, causing crashes when the
crawler later called methods on them.
2026-02-11 12:46:47 +00:00
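The dict auto-deserialization fix can be sketched as below. The registry, its single entry, and the tuple it returns are all invented for the example; the real code routes through `from_serializable_dict()` and actual strategy classes:

```python
# Illustrative allowlisted registry mapping type names to constructors.
REGISTRY = {"DefaultMarkdownGenerator": lambda **p: ("markdown_generator", p)}

def maybe_deserialize(value):
    """Turn {"type": ..., "params": {...}} payloads arriving from a JSON
    API back into objects; pass everything else through untouched."""
    if isinstance(value, dict) and value.get("type") in REGISTRY:
        return REGISTRY[value["type"]](**value.get("params", {}))
    return value

raw = {"type": "DefaultMarkdownGenerator", "params": {"ignore_links": True}}
print(maybe_deserialize(raw))             # ('markdown_generator', {'ignore_links': True})
print(maybe_deserialize("plain value"))   # plain value
```

Without this step, the raw dict reaches the crawler, which then crashes when it calls strategy methods on it; unknown `"type"` values fall through untouched rather than instantiating arbitrary classes.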
unclecode
44b8afb6dc Improve schema generation prompt for sibling-based layouts 2026-02-10 08:34:22 +00:00
unclecode
fbc52813a4 Add tests, docs, and contributors for PRs #1463 and #1435
- Add tests for device_scale_factor (config + integration)
- Add tests for redirected_status_code (model + redirect + raw HTML)
- Document device_scale_factor in browser config docs and API reference
- Document redirected_status_code in crawler result docs and API reference
- Add TristanDonze and charlaie to CONTRIBUTORS.md
- Update PR-TODOLIST with session results
2026-02-06 09:30:19 +00:00
unclecode
37a49c5315 Merge PR #1435: Add redirected_status_code to CrawlResult
Applied manually due to conflicts (PR based on older code).
Also fixed missing variable initialization for non-goto paths
(file://, raw:, js_only) that would have caused NameError.

Closes #1434
2026-02-06 09:23:54 +00:00
unclecode
0aacafed0a Merge PR #1463: Add configurable device_scale_factor for screenshot quality 2026-02-06 09:19:42 +00:00
unclecode
719e83e105 Update PR todolist — refresh open PRs, add 6 new, classify
- Added PRs #475, #462, #416, #335, #332, #312
- Flagged #475 as duplicate of merged #1296
- Corrected author for #1450 (rbushri)
- Updated total count to ~63 open PRs
- Updated date to 2026-02-06
2026-02-06 09:06:13 +00:00
unclecode
3401dd1620 Fix browser recycling under high concurrency — version-based approach
The previous recycle logic waited for all refcounts to hit 0 before
recycling, which never happened under sustained concurrent load (20+
crawls always had at least one active).

New approach:
- Add _browser_version to config signature — bump it to force new contexts
- When threshold is hit: bump version, move old sigs to _pending_cleanup
- New requests get new contexts automatically (different signature)
- Old contexts drain naturally and get cleaned up when refcount hits 0
- Safety cap: max 3 pending browsers draining at once

This means recycling now works under any load pattern — no blocking,
no waiting for quiet moments. Old and new browsers coexist briefly
during transitions.

Includes 12 new tests covering version bumps, concurrent recycling,
safety cap, and edge cases.
2026-02-05 07:48:12 +00:00
unclecode
c046918bb4 Add memory-saving mode, browser recycling, and CDP leak fixes
- Add memory_saving_mode config: aggressive cache discard + V8 heap cap
  flags for high-volume crawling (1000+ pages)
- Add max_pages_before_recycle config: automatic browser process recycling
  after N pages to reclaim leaked memory (recommended 500-1000)
- Add default Chrome flags to disable unused features (OptimizationHints,
  MediaRouter, component updates, domain reliability)
- Fix CDP session leak: detach CDP session after viewport adjustment
- Fix session kill: only close context when refcount reaches 0, preventing
  use-after-close for shared contexts
- Add browser lifecycle and memory tests
2026-02-04 02:00:53 +00:00
ntohidi
4e56f3e00d Add contributing guide and update mkdocs navigation for community resources 2026-02-03 09:46:54 +01:00
unclecode
0bfcf080dd Add contributors from PRs #1133, #729
Credit chrizzly2309 and complete-dope for identifying bugs
that were resolved on develop.
2026-02-02 07:56:37 +00:00
unclecode
b962699c0d Add contributors from PRs #973, #1073, #931
Credit danyQe, saipavanmeruga7797, and stevenaldinger for
identifying bugs that were resolved on develop.
2026-02-02 07:14:12 +00:00
unclecode
ffd3face6b Remove duplicate PROMPT_EXTRACT_BLOCKS definition in prompts.py
The first definition (with tags/questions fields) was immediately
overwritten by the second simpler definition — pure dead code.
Removes 61 lines of unused prompt text.

Inspired by PR #931 (stevenaldinger).
2026-02-02 07:04:35 +00:00
unclecode
c790231aba Fix browser context memory leak — signature shrink + LRU eviction (#943)
contexts_by_config accumulated browser contexts unboundedly in long-running
crawlers (Docker API). Two root causes fixed:

1. _make_config_signature() hashed ~60 CrawlerRunConfig fields but only 7
   affect the browser context (proxy_config, locale, timezone_id, geolocation,
   override_navigator, simulate_user, magic). Switched from blacklist to
   whitelist — non-context fields like word_count_threshold, css_selector,
   screenshot, verbose no longer cause unnecessary context creation.

2. No eviction mechanism existed between close() calls. Added refcount
   tracking (_context_refcounts, incremented under _contexts_lock in
   get_page, decremented in release_page_with_context) and LRU eviction
   (_evict_lru_context_locked) that caps contexts at _max_contexts=20,
   evicting only idle contexts (refcount==0) oldest-first.

Also fixed: storage_state path leaked a temporary context every request
(now explicitly closed after clone_runtime_state).

Closes #943. Credit to @Martichou for the investigation in #1640.
2026-02-01 14:23:04 +00:00
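Both fixes can be sketched together: a whitelist signature built only from context-shaping fields, plus an LRU cache that evicts only idle entries. The class and its cap of 2 are a toy model (the commit's cap is `_max_contexts=20`), and `config` is a plain dict here rather than a real config object:

```python
from collections import OrderedDict

# Whitelist: only the fields that actually shape a browser context.
CONTEXT_FIELDS = ("proxy_config", "locale", "timezone_id", "geolocation",
                  "override_navigator", "simulate_user", "magic")

def make_signature(config):
    return tuple(sorted((f, repr(config.get(f))) for f in CONTEXT_FIELDS))

class ContextCache:
    """LRU sketch: evict only idle (refcount == 0) contexts, oldest
    first, once the cap is reached."""
    def __init__(self, max_contexts=20):
        self.max = max_contexts
        self.contexts = OrderedDict()   # signature -> context
        self.refcounts = {}

    def get(self, sig, factory):
        if sig not in self.contexts:
            self._evict_if_full()
            self.contexts[sig] = factory()
            self.refcounts[sig] = 0
        self.contexts.move_to_end(sig)
        self.refcounts[sig] += 1
        return self.contexts[sig]

    def release(self, sig):
        self.refcounts[sig] -= 1

    def _evict_if_full(self):
        if len(self.contexts) < self.max:
            return
        for sig in list(self.contexts):          # oldest first
            if self.refcounts[sig] == 0:
                del self.contexts[sig]
                del self.refcounts[sig]
                return

cache = ContextCache(max_contexts=2)
a = cache.get(make_signature({"locale": "en-US"}), lambda: "ctx-a")
cache.release(make_signature({"locale": "en-US"}))
b = cache.get(make_signature({"locale": "de-DE"}), lambda: "ctx-b")
c = cache.get(make_signature({"locale": "fr-FR"}), lambda: "ctx-c")  # evicts idle ctx-a
print(len(cache.contexts))   # 2
```

Because `screenshot`, `css_selector`, and similar fields are absent from the signature, configs differing only in those fields share one context, and the cap plus idle-only eviction bounds memory without closing contexts that are in use.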
unclecode
bb523b6c6c Merge PRs #1077, #1281 — bs4 deprecation and proxy auth fix
- PR #1077: Fix bs4 deprecation warning (text -> string)
- PR #1281: Fix proxy auth ERR_INVALID_AUTH_CREDENTIALS
- Comment on PR #1081 guiding author on needed DFS/BFF fixes
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 07:06:39 +00:00
unclecode
980dc73156 Merge PR #1281: Fix proxy auth ERR_INVALID_AUTH_CREDENTIALS 2026-02-01 07:05:00 +00:00
unclecode
98aea2fb46 Merge PR #1077: Fix bs4 deprecation warning (text -> string) 2026-02-01 07:04:31 +00:00
unclecode
a56dd07559 Merge PRs #1667, #1296, #1364 — CLI deep-crawl, env var, script tags
- PR #1667: Fix deep-crawl CLI outputting only the first page
- PR #1296: Fix VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY
- PR #1364: Fix script tag removal losing adjacent text
- Fix: restore .crawl4ai subfolder in VersionManager path
- Close #1150 (already fixed on develop)
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 06:53:53 +00:00
unclecode
312cef8633 Fix PR #1296: restore .crawl4ai subfolder in VersionManager path 2026-02-01 06:22:16 +00:00
unclecode
a244e4d781 Merge PR #1364: Fix script tag removal losing adjacent text in cleaned_html 2026-02-01 06:22:10 +00:00
unclecode
0f83b05a2d Merge PR #1296: Fix VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY env var 2026-02-01 06:21:40 +00:00
unclecode
37995d4d3f Merge PR #1667: Fix deep-crawl CLI outputting only the first page 2026-02-01 06:21:25 +00:00
unclecode
dc4ae73221 Merge PRs #1714, #1721, #1719, #1717 and fix base tag pipeline
- PR #1714: Replace tf-playwright-stealth with playwright-stealth
- PR #1721: Respect <base> tag in html2text for relative links
- PR #1719: Include GoogleSearchCrawler script.js in package data
- PR #1717: Allow local embeddings by removing OpenAI fallback
- Fix: Extract <base href> from raw HTML before head gets stripped
- Close duplicates: #1703, #1698, #1697, #1710, #1720
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 05:41:33 +00:00
unclecode
5cd0648d71 Merge PR #1717: Allow local embeddings by removing OpenAI fallback 2026-02-01 05:02:18 +00:00
unclecode
9172581416 Merge PR #1719: Include GoogleSearchCrawler script.js in package distribution 2026-02-01 05:02:05 +00:00
unclecode
c39e796a18 Merge PR #1721: Fix <base> tag ignored in html2text relative link resolution 2026-02-01 05:01:52 +00:00
unclecode
ccab926f1f Merge PR #1714: Replace tf-playwright-stealth with playwright-stealth 2026-02-01 05:01:31 +00:00
unclecode
43738c9ed2 Fix can_process_url() to receive normalized URL in deep crawl strategies
Pass the normalized absolute URL instead of the raw href to
can_process_url() in BFS, BFF, and DFS deep crawl strategies.
This ensures URL validation and filter chain evaluation operate
on consistent, fully-qualified URLs.

Fixes #1743
2026-02-01 03:45:52 +00:00
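The normalization-before-filtering fix can be sketched with `urllib.parse.urljoin`. The `can_process_url` filter here is an invented example of the kind of check a filter chain performs:

```python
from urllib.parse import urljoin

def normalize(base_url, href):
    """Resolve a raw href against the page URL so filters always see a
    fully-qualified absolute URL."""
    return urljoin(base_url, href)

def can_process_url(url):            # illustrative domain filter
    return url.startswith("https://example.com/")

href = "/docs/page2"
print(can_process_url(href))                                          # False on the raw href
print(can_process_url(normalize("https://example.com/docs/", href)))  # True once normalized
```

Passing the raw relative href, as before the fix, makes prefix- and domain-based filters reject links that actually belong to the crawled site.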
unclecode
ee717dc019 Add contributor for PR #1746 and fix test pytest marker
- Add ChiragBellara to CONTRIBUTORS.md for sitemap seeding fix
- Add missing @pytest.mark.asyncio decorator to seeder test
2026-02-01 03:10:32 +00:00
unclecode
7c5933e2e7 Merge PR #1746: Fix sitemap-only URL seeding avoiding Common Crawl calls 2026-02-01 02:57:06 +00:00
unclecode
5be0d2d75e Add contributor and docs for force_viewport_screenshot feature
- Add TheRedRad to CONTRIBUTORS.md for PR #1694
- Document force_viewport_screenshot in API parameters reference
- Add viewport screenshot note in browser-crawler-config guide
- Add viewport-only screenshot example in screenshot docs
2026-02-01 01:10:20 +00:00
unclecode
e19492a82e Merge PR #1694: feat: add force viewport screenshot 2026-02-01 01:05:52 +00:00
unclecode
55a2cc8181 Document set_defaults/get_defaults/reset_defaults in config guides 2026-01-31 11:46:53 +00:00
unclecode
13a414802b Add set_defaults/get_defaults/reset_defaults to config classes 2026-01-31 11:44:07 +00:00
unclecode
19b9140c68 Improve CDP connection handling 2026-01-31 11:07:26 +00:00
ChiragBellara
694ba44a04 Fix URL Seeder forcing the Common Crawl index when a "sitemap" source is requested 2026-01-30 09:33:30 -08:00
unclecode
0104db6de2 Fix critical RCE via deserialization and eval() in /crawl endpoint
- Replace raw eval() in _compute_field() with AST-validated
  _safe_eval_expression() that blocks __import__, dunder attribute
  access, and import statements while preserving safe transforms
- Add ALLOWED_DESERIALIZE_TYPES allowlist to from_serializable_dict()
  preventing arbitrary class instantiation from API input
- Update security contact email and add v0.8.1 security fixes to
  SECURITY.md with researcher acknowledgment
- Add 17 security tests covering both fixes
2026-01-30 08:46:32 +00:00
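The AST-validation approach can be sketched as follows. This is a minimal model of the technique, not the library's `_safe_eval_expression`; the `value` binding is an assumed transform input for the example:

```python
import ast

def safe_eval_expression(expr, value):
    """Sketch of AST-validated evaluation for field transforms. Parsing
    in 'eval' mode already rejects import *statements*; the walk below
    blocks dunder names (including __import__) and dunder attribute
    access, while eval runs with empty builtins."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise ValueError("dunder attribute access not allowed")
        if isinstance(node, ast.Name) and node.id.startswith("__"):
            raise ValueError("dunder name not allowed")
    return eval(compile(tree, "<transform>", "eval"),
                {"__builtins__": {}}, {"value": value})

print(safe_eval_expression("value.strip().upper()", "  hello "))  # HELLO
try:
    safe_eval_expression("().__class__.__bases__", None)
except ValueError as e:
    print("blocked:", e)
```

Safe transforms like string methods keep working, while the classic sandbox escapes through `__class__`/`__bases__` chains and `__import__` are rejected before anything is evaluated.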
Nasrin
ad5ebf166a Merge pull request #1718 from YuriNachos/fix/issue-1704-default-logger
fix: Initialize default logger in AsyncPlaywrightCrawlerStrategy (#1704)
2026-01-29 13:03:11 +01:00
Nasrin
034bddf557 Merge pull request #1733 from jose-blockchain/fix/1686-docker-health-version
Fix #1686: Docker health endpoint reports outdated version
2026-01-29 12:55:24 +01:00