Commit Graph

1362 Commits

Author SHA1 Message Date
unclecode
c854e2b899 Fix simulate_user destroying page content via ArrowDown keypress
The simulate_user/magic block fired keyboard.press("ArrowDown") and
mouse.click at fixed coords (100,100) on every page. This destroyed
content on any site with keyboard event handlers (MkDocs, React Router,
Vue, Angular, SPAs) and clicked interactive elements at that position.

Replace with mouse movement (two-point trajectory) and mouse.wheel
scroll — these generate the same anti-bot signals (mousemove, scroll
events) without triggering JS framework navigation or clicking buttons.

Also remove temporary RAW_DEBUG logging that was left from investigation.
2026-02-19 15:03:28 +00:00
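The two-point mouse trajectory described above can be sketched as a plain interpolation with jitter — a hedged illustration only; the function name, step count, and jitter range are assumptions, not the shipped implementation:

```python
import random

def two_point_trajectory(start, end, steps=12):
    """Interpolate cursor positions between two points with small jitter.
    Each point would be fed to Playwright's page.mouse.move(x, y), which
    fires mousemove events without pressing keys or clicking anything."""
    (x0, y0), (x1, y1) = start, end
    points = []
    for i in range(1, steps + 1):
        t = i / steps
        points.append((x0 + (x1 - x0) * t + random.uniform(-2, 2),
                       y0 + (y1 - y0) * t + random.uniform(-2, 2)))
    return points
```

Scrolling would then use `page.mouse.wheel(0, dy)`, which emits scroll events without dispatching keyboard events that SPA routers listen for.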
unclecode
8df3541ac4 Skip anti-bot checks and fallback for raw: URLs
raw: URLs contain caller-provided HTML (e.g. from cache), not content
fetched from a web server. Anti-bot detection, proxy retries, and
fallback fetching are meaningless for this content.

- Skip is_blocked() in retry loop and final re-check for raw: URLs
- Skip fallback_fetch_function invocation for raw: URLs
- Add RAW_DEBUG logging in browser strategy for set_content/page.content
2026-02-19 14:05:56 +00:00
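The guard this commit adds amounts to a prefix check before the anti-bot pipeline runs — a minimal sketch; the function name is assumed:

```python
RAW_PREFIX = "raw:"

def should_run_antibot_pipeline(url: str) -> bool:
    """raw: URLs wrap caller-provided HTML (e.g. from cache), so blocking
    detection, proxy retries, and fallback fetching are skipped for them."""
    return not url.startswith(RAW_PREFIX)
```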
unclecode
94a77eea30 Move test_repro_1640.py to tests/browser/ 2026-02-19 06:33:46 +00:00
unclecode
2060c7e965 Fix browser recycling deadlock under sustained concurrent load (#1640)
Three bugs in the version-based browser recycling caused requests to
hang after ~80-130 pages under concurrent load:

1. Race condition: _maybe_bump_browser_version() added ALL context
   signatures to _pending_cleanup, including those with refcount 0.
   Since no future release would trigger cleanup for idle sigs, they
   stayed in _pending_cleanup permanently. Fix: split sigs into
   active (refcount > 0, go to pending) and idle (refcount == 0,
   cleaned up immediately).

2. Finally block fragility: the first line of _crawl_web's finally
   block accessed page.context.browser.contexts, which throws if the
   browser crashed. This prevented release_page_with_context() from
   ever being called, permanently leaking the refcount. Fix: call
   release_page_with_context() first in its own try/except, then do
   best-effort cleanup of listeners and page.

3. Safety cap deadlock: when _pending_cleanup accumulated >= 3 stuck
   entries, _maybe_bump_browser_version() blocked get_page() forever
   with no timeout. Fix: 30-second timeout on the wait, after which
   stuck entries (refcount 0) are force-cleaned.

Includes regression test covering all three bugs plus multi-config
concurrent crawl scenarios.
2026-02-19 06:27:25 +00:00
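Fix 3 above — a bounded wait followed by force-cleaning of refcount-0 entries — can be sketched with `asyncio.wait_for`; all names here are assumptions, not the real manager internals:

```python
import asyncio

async def wait_for_drain_or_force_clean(pending, refcounts, drained,
                                        timeout=30.0):
    """Wait for pending contexts to drain naturally; if the timeout
    expires, force-clean entries whose refcount is already 0 so
    get_page() can never block forever."""
    try:
        await asyncio.wait_for(drained.wait(), timeout)
    except asyncio.TimeoutError:
        for sig in [s for s in pending if refcounts.get(s, 0) == 0]:
            pending.discard(sig)
    return pending
```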
unclecode
13048a106b Add Tier 3 structural integrity check to anti-bot detector
Catches silent blocks, anti-bot redirects, and incomplete renders that
pass pattern-based detection (Tiers 1/2) but are structurally broken:

- No <body> tag on pages under 50KB
- Minimal visible text after stripping scripts/styles/tags
- No semantic content elements (p, h1-6, article, section, li, td, a)
- Script-heavy shells with scripts but no real content

Uses signal scoring: 2+ signals = blocked, 1 signal on small page
(<5KB) = blocked. Skips large pages and JSON/XML data responses.
2026-02-18 06:59:22 +00:00
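The signal scoring above can be modeled as a small heuristic function — a hedged sketch only; the exact thresholds (50KB, 20 words, 5KB) and regexes are assumptions beyond what the message states:

```python
import re

def structurally_broken(html: str) -> bool:
    """Tier 3 sketch: score structural signals; 2+ signals = blocked,
    1 signal on a small page (<5KB) = blocked."""
    if len(html) > 50_000:                      # skip large pages
        return False
    stripped = html.lstrip()
    if stripped[:1] in ("{", "[") or stripped.startswith("<?xml"):
        return False                            # JSON/XML data responses
    lower = html.lower()
    text = re.sub(r"<script.*?</script>|<style.*?</style>|<[^>]+>", " ",
                  html, flags=re.S | re.I)
    sparse = len(text.split()) < 20             # minimal visible text
    signals = 0
    signals += "<body" not in lower             # no <body> tag
    signals += sparse
    signals += not re.search(r"<(p|h[1-6]|article|section|li|td|a)\b",
                             html, re.I)        # no semantic content elements
    signals += "<script" in lower and sparse    # script-heavy shell
    return signals >= 2 or (signals == 1 and len(html) < 5_000)
```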
unclecode
c9cb0160cf Add token usage tracking to generate_schema / agenerate_schema
generate_schema can make up to 5 internal LLM calls (field inference,
schema generation, validation retries) with no way to track token
consumption. Add an optional `usage: TokenUsage = None` parameter that
accumulates prompt/completion/total tokens across all calls in-place.

- _infer_target_json: accept and populate usage accumulator
- agenerate_schema: track usage after every aperform_completion call
  in the retry loop, forward usage to _infer_target_json
- generate_schema (sync): forward usage to agenerate_schema

Fully backward-compatible — omitting usage changes nothing.
2026-02-18 06:44:17 +00:00
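The in-place accumulator pattern could look like the following — a minimal sketch, with field and method names assumed rather than taken from the codebase:

```python
from dataclasses import dataclass

@dataclass
class TokenUsage:
    """Accumulates token counts across multiple LLM calls in-place."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0

    def add(self, prompt: int, completion: int) -> None:
        self.prompt_tokens += prompt
        self.completion_tokens += completion
        self.total_tokens += prompt + completion
```

A caller would pass `usage=TokenUsage()` into `generate_schema(...)` and read the totals afterward; omitting the argument leaves behavior unchanged.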
unclecode
8576331d4e Add Shadow DOM flattening and reorder js_code execution pipeline
- Add `flatten_shadow_dom` option to CrawlerRunConfig that serializes
  shadow DOM content into the light DOM before HTML capture. Uses a
  recursive serializer that resolves <slot> projections and strips
  only shadow-scoped <style> tags. Also injects an init script to
  force-open closed shadow roots via attachShadow patching.

- Move `js_code` execution to after `wait_for` + `delay_before_return_html`
  so user scripts run on the fully-hydrated page. Add `js_code_before_wait`
  for the less common case of triggering loading before waiting.

- Add JS snippet (flatten_shadow_dom.js), integration test, example,
  and documentation across all relevant doc files.
2026-02-18 06:43:00 +00:00
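The recursive serializer with `<slot>` projection can be illustrated on a toy node model — this is an explanatory Python analogue, not the shipped flatten_shadow_dom.js; the dict-based node shape is entirely invented:

```python
def serialize_flat(node):
    """Toy shadow-DOM flattening: a node is {"tag": str|None, "text": str,
    "children": [...], "shadow": [...] or absent}. Shadow content is
    serialized into the light DOM, with <slot> elements replaced by the
    host's light-DOM (projected) children."""
    if node.get("tag") is None:
        return node.get("text", "")
    shadow = node.get("shadow")
    if shadow is not None:
        inner = "".join(
            "".join(serialize_flat(c) for c in node.get("children", []))
            if child.get("tag") == "slot" else serialize_flat(child)
            for child in shadow
        )
    else:
        inner = "".join(serialize_flat(c) for c in node.get("children", []))
    return f"<{node['tag']}>{inner}</{node['tag']}>"
```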
unclecode
4fb02f8b50 Warn LLM against hashed/generated CSS class names in schema prompts
Replace vague "handle dynamic class names appropriately" with explicit
rule: never use auto-generated class names (.styles_card__xK9r2, etc.)
as they break on every site rebuild. Prefer data-* attributes, semantic
tags, ARIA attributes, and stable meaningful class names instead.
2026-02-17 12:02:58 +00:00
unclecode
d267c650cb Add source (sibling selector) support to JSON extraction strategies
Many sites (e.g. Hacker News) split a single item's data across sibling
elements. Field selectors only search descendants, making sibling data
unreachable. The new "source" field key navigates to a sibling element
before running the selector: {"source": "+ tr"} finds the next sibling
<tr>, then extracts from there.

- Add _resolve_source abstract method to JsonElementExtractionStrategy
- Implement in all 4 subclasses (CSS/BS4, XPath/lxml, and the two lxml-based CSS variants)
- Modify _extract_field to resolve source before type dispatch
- Update CSS and XPath LLM prompts with source docs and HN example
- Default generate_schema validate=True so schemas are checked on creation
- Add schema validation with feedback loop for auto-refinement
- Add messages param to completion helpers for multi-turn refinement
- Document source field and schema validation in docs
- Add 14 unit tests covering CSS, XPath, backward compat, edge cases
2026-02-17 09:04:40 +00:00
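A schema using the new "source" key might look like this — a hypothetical example for a Hacker News-style layout where the score lives in the row after the title row; the selector values are illustrative:

```python
# Hypothetical extraction schema: "source" hops to a sibling element
# before running the field selector.
schema = {
    "name": "stories",
    "baseSelector": "tr.athing",
    "fields": [
        {"name": "title", "selector": ".titleline > a", "type": "text"},
        # "+ tr" navigates to the NEXT SIBLING <tr>, then extracts there
        {"name": "points", "source": "+ tr",
         "selector": ".score", "type": "text"},
    ],
}
```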
unclecode
ccd24aa824 Fix fallback fetch: run when all proxies crash, skip re-check, never return None
Three related fixes to the anti-bot proxy retry + fallback pipeline:

1. Allow fallback_fetch_function to run when crawl_result is None (all proxies
   threw exceptions like browser crashes). Previously fallback only ran when
   crawl_result existed but was blocked — exception-only failures bypassed it.

2. Skip is_blocked() re-check after successful fallback. Real unblocked pages
   may contain anti-bot script markers (e.g. PerimeterX JS on Walmart) that
   trigger false positives, overriding success=True back to False.

3. Always return a CrawlResult with crawl_stats, never None. When all proxies
   and fallback fail, create a minimal failed result so callers get stats
   about what was attempted instead of AttributeError on None.

Also: if aprocess_html fails during fallback (dead browser can't run
Page.evaluate for consent popup removal), fall back to raw HTML result
instead of silently discarding the successfully-fetched fallback content.
2026-02-15 10:55:00 +00:00
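Fix 3 — never returning None — reduces to a final-result guard like the following sketch; the dict shape and field names are assumptions standing in for a real CrawlResult:

```python
def finalize_result(url, crawl_result, stats):
    """When every proxy and the fallback fail, return a minimal failed
    result carrying crawl_stats instead of None, so callers can inspect
    what was attempted rather than hit AttributeError."""
    if crawl_result is not None:
        return crawl_result
    return {"url": url, "success": False, "html": "",
            "error_message": "all proxies and fallback fetch failed",
            "crawl_stats": stats}
```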
unclecode
45d8e1450f Fix proxy escalation: don't re-raise on first proxy exception when chain has alternatives
When proxy_config is a list (escalation chain) and the first proxy throws
an exception (timeout, connection error, browser crash), the retry loop
now continues to the next proxy instead of immediately re-raising.

Previously, exceptions on _p_idx==0 and _attempt==0 were always re-raised,
which broke the entire escalation chain — ISP/Residential/fallback proxies
were never tried. This made the proxy list effectively useless for sites
where the first-tier proxy fails with an exception rather than a blocked
response.

The raise is preserved when there's only a single proxy and single attempt
(len(proxy_list) <= 1 and max_attempts <= 1) so that simple non-chain
crawls still get immediate error propagation.
2026-02-15 09:55:55 +00:00
unclecode
d028a889d0 Make proxy_config a property so direct assignment also normalizes
Setting config.proxy_config = [ProxyConfig.DIRECT, ...] after
construction now goes through the same normalization as __init__,
converting "direct" sentinels to None. Fixes crash when proxy_config
is assigned directly instead of passed to the constructor.
2026-02-14 13:16:36 +00:00
unclecode
879553955c Add ProxyConfig.DIRECT sentinel for direct-then-proxy escalation
Allow "direct" or None in proxy_config list to explicitly try
without a proxy before escalating to proxy servers. The retry
loop already handled None as direct — this exposes it as a
clean user-facing API via ProxyConfig.DIRECT.
2026-02-14 10:25:07 +00:00
unclecode
875207287e Unify proxy_config to accept list, add crawl_stats tracking
- proxy_config on CrawlerRunConfig now accepts a single ProxyConfig or
  a list of ProxyConfig tried in order (first-come-first-served)
- Remove is_fallback from ProxyConfig and fallback_proxy_configs from
  CrawlerRunConfig — proxy escalation handled entirely by list order
- Add _get_proxy_list() normalizer for the retry loop
- Add CrawlResult.crawl_stats with attempts, retries, proxies_used,
  fallback_fetch_used, and resolved_by for billing and observability
- Set success=False with error_message when all attempts are blocked
- Simplify retry loop — no more is_fallback stashing logic
- Update docs and tests to reflect new API
2026-02-14 07:53:46 +00:00
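The `_get_proxy_list()` normalizer described above can be sketched as follows — a hedged approximation of the single-or-list acceptance, with `None` meaning a direct connection:

```python
def get_proxy_list(proxy_config):
    """Normalize proxy_config (None | single config | ordered list) into
    the list the retry loop iterates first-come-first-served."""
    if proxy_config is None:
        return [None]                  # direct connection only
    if isinstance(proxy_config, list):
        return proxy_config or [None]  # empty list degrades to direct
    return [proxy_config]
```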
unclecode
72b546c48d Add anti-bot detection, retry, and fallback system
Automatically detect when crawls are blocked by anti-bot systems
(Akamai, Cloudflare, PerimeterX, DataDome, Imperva, etc.) and
escalate through configurable retry and fallback strategies.

New features on CrawlerRunConfig:
- max_retries: retry rounds when blocking is detected
- fallback_proxy_configs: list of fallback proxies tried each round
- fallback_fetch_function: async last-resort function returning raw HTML

New field on ProxyConfig:
- is_fallback: skip proxy on first attempt, activate only when blocked

Escalation chain per round: main proxy → fallback proxies in order.
After all rounds: fallback_fetch_function as last resort.

Detection uses tiered heuristics — structural HTML markers (high
confidence) trigger on any page, generic patterns only on short
error pages to avoid false positives.
2026-02-14 05:24:07 +00:00
unclecode
fdd989785f Sync sec-ch-ua with User-Agent and keep WebGL alive in stealth mode
Fix a bug where magic mode and per-request UA overrides would change
the User-Agent header without updating the sec-ch-ua (browser hint)
header to match. Anti-bot systems like Akamai detect this mismatch
as a bot signal.

Changes:
- Regenerate browser_hint via UAGen.generate_client_hints() whenever
  the UA is changed at crawl time (magic mode or explicit override)
- Re-apply updated headers to the page via set_extra_http_headers()
- Skip per-crawl UA override for persistent contexts where the UA is
  locked at launch time by Playwright's protocol layer
- Move --disable-gpu flags behind enable_stealth check so WebGL works
  via SwiftShader when stealth mode is active (missing WebGL is a
  detectable headless signal)
- Clean up old test scripts, add clean anti-bot test
2026-02-13 04:10:47 +00:00
unclecode
112f44a97d Fix proxy auth for persistent browser contexts
Chromium's --proxy-server CLI flag silently ignores inline credentials
(user:pass@server). For persistent contexts, crawl4ai was embedding
credentials in this flag via ManagedBrowser.build_browser_flags(),
causing proxy auth to fail and the browser to fall back to direct
connection.

Fix: Use Playwright's launch_persistent_context(proxy=...) API instead
of subprocess + CDP when use_persistent_context=True. This handles
proxy authentication properly via the HTTP CONNECT handshake. The
non-persistent and CDP paths remain unchanged.

Changes:
- Strip credentials from --proxy-server flag in build_browser_flags()
- Add launch_persistent_context() path in BrowserManager.start()
- Add cleanup path in BrowserManager.close()
- Guard create_browser_context() when self.browser is None
- Add regression tests covering all 4 proxy/persistence combinations
2026-02-12 11:19:29 +00:00
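The credential-stripping step can be sketched with the standard library — function name assumed; the idea is that `--proxy-server` gets only scheme/host/port, while credentials go through Playwright's `proxy={"server": ..., "username": ..., "password": ...}` option and the HTTP CONNECT handshake:

```python
from urllib.parse import urlsplit

def strip_proxy_credentials(server: str) -> str:
    """Chromium's --proxy-server flag silently ignores inline user:pass,
    so strip credentials when building the flag."""
    parts = urlsplit(server if "//" in server else "//" + server)
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    return f"{parts.scheme}://{host}" if parts.scheme else host
```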
unclecode
1a24ac785e Refactor from_kwargs to respect set_defaults and use __init__ defaults
Replace hardcoded parameter listings in BrowserConfig.from_kwargs() and
CrawlerRunConfig.from_kwargs() with a generic approach that filters
input kwargs to valid __init__ params and passes them through. This:

- Makes set_defaults() work with from_kwargs() (previously ignored)
- Fixes default mismatches (word_count_threshold was 200 vs __init__=1,
  markdown_generator was None vs __init__=DefaultMarkdownGenerator())
- Eliminates ~160 lines of duplicated default values
- Auto-supports new params without updating from_kwargs
2026-02-11 13:35:36 +00:00
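The generic approach can be sketched with `inspect.signature` — a minimal illustration of the pattern, not the exact crawl4ai code:

```python
import inspect

def generic_from_kwargs(cls, kwargs):
    """Filter input kwargs down to cls.__init__'s parameters and pass
    them through, so unknown keys are ignored, __init__ defaults apply,
    and new parameters are auto-supported without touching from_kwargs."""
    valid = set(inspect.signature(cls.__init__).parameters) - {"self"}
    return cls(**{k: v for k, v in kwargs.items() if k in valid})
```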
unclecode
3fc7730aaf Add remove_consent_popups flag and fix from_kwargs dict deserialization
Add CrawlerRunConfig.remove_consent_popups (bool, default False) that
targets GDPR/cookie consent popups from 70+ known CMP providers including
OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Usercentrics,
Sourcepoint, Google FundingChoices, and many more.

The JS strategy uses a 5-phase approach:
1. Click "Accept All" buttons (cleanest dismissal, sets cookies)
2. Try CMP JavaScript APIs (__tcfapi, Didomi, Cookiebot, Osano, Klaro)
3. Remove known CMP containers by selector (~120 selectors)
4. Handle iframe-based and shadow DOM CMPs
5. Restore body scroll and remove CMP body classes

Also fix from_kwargs() in CrawlerRunConfig and BrowserConfig to
auto-deserialize dict values using the existing from_serializable_dict()
infrastructure. Previously, strategy objects like markdown_generator
arriving as {"type": "DefaultMarkdownGenerator", "params": {...}} from
JSON APIs were passed through as raw dicts, causing crashes when the
crawler later called methods on them.
2026-02-11 12:46:47 +00:00
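The dict auto-deserialization fix amounts to a guard like the one below — a hedged sketch; the registry lookup and function name are assumptions standing in for the existing from_serializable_dict() infrastructure:

```python
def maybe_deserialize(value, registry):
    """Rebuild strategy objects arriving as {"type": ..., "params": ...}
    dicts from JSON APIs; pass everything else through unchanged."""
    if isinstance(value, dict) and value.keys() >= {"type", "params"}:
        return registry[value["type"]](**value["params"])
    return value
```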
unclecode
44b8afb6dc Improve schema generation prompt for sibling-based layouts 2026-02-10 08:34:22 +00:00
unclecode
fbc52813a4 Add tests, docs, and contributors for PRs #1463 and #1435
- Add tests for device_scale_factor (config + integration)
- Add tests for redirected_status_code (model + redirect + raw HTML)
- Document device_scale_factor in browser config docs and API reference
- Document redirected_status_code in crawler result docs and API reference
- Add TristanDonze and charlaie to CONTRIBUTORS.md
- Update PR-TODOLIST with session results
2026-02-06 09:30:19 +00:00
unclecode
37a49c5315 Merge PR #1435: Add redirected_status_code to CrawlResult
Applied manually due to conflicts (PR based on older code).
Also fixed missing variable initialization for non-goto paths
(file://, raw:, js_only) that would have caused NameError.

Closes #1434
2026-02-06 09:23:54 +00:00
unclecode
0aacafed0a Merge PR #1463: Add configurable device_scale_factor for screenshot quality 2026-02-06 09:19:42 +00:00
unclecode
719e83e105 Update PR todolist — refresh open PRs, add 6 new, classify
- Added PRs #475, #462, #416, #335, #332, #312
- Flagged #475 as duplicate of merged #1296
- Corrected author for #1450 (rbushri)
- Updated total count to ~63 open PRs
- Updated date to 2026-02-06
2026-02-06 09:06:13 +00:00
unclecode
3401dd1620 Fix browser recycling under high concurrency — version-based approach
The previous recycle logic waited for all refcounts to hit 0 before
recycling, which never happened under sustained concurrent load (20+
crawls always had at least one active).

New approach:
- Add _browser_version to config signature — bump it to force new contexts
- When threshold is hit: bump version, move old sigs to _pending_cleanup
- New requests get new contexts automatically (different signature)
- Old contexts drain naturally and get cleaned up when refcount hits 0
- Safety cap: max 3 pending browsers draining at once

This means recycling now works under any load pattern — no blocking,
no waiting for quiet moments. Old and new browsers coexist briefly
during transitions.

Includes 12 new tests covering version bumps, concurrent recycling,
safety cap, and edge cases.
2026-02-05 07:48:12 +00:00
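The version-bump mechanism can be modeled in a few lines — a minimal sketch with assumed names, showing why a bump retires old contexts without blocking:

```python
class RecyclingPool:
    """Version-based recycling: bumping the version changes every config
    signature, so new requests get fresh contexts while old ones drain
    in pending_cleanup until their refcount hits 0."""
    def __init__(self, threshold=500):
        self.version = 0
        self.pages = 0
        self.threshold = threshold
        self.pending_cleanup = set()

    def signature(self, config_key):
        return (config_key, self.version)

    def on_page_done(self, active_sigs):
        self.pages += 1
        if self.pages >= self.threshold:
            self.pages = 0
            self.pending_cleanup |= set(active_sigs)  # drain, don't close
            self.version += 1                          # future sigs differ
```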
unclecode
c046918bb4 Add memory-saving mode, browser recycling, and CDP leak fixes
- Add memory_saving_mode config: aggressive cache discard + V8 heap cap
  flags for high-volume crawling (1000+ pages)
- Add max_pages_before_recycle config: automatic browser process recycling
  after N pages to reclaim leaked memory (recommended 500-1000)
- Add default Chrome flags to disable unused features (OptimizationHints,
  MediaRouter, component updates, domain reliability)
- Fix CDP session leak: detach CDP session after viewport adjustment
- Fix session kill: only close context when refcount reaches 0, preventing
  use-after-close for shared contexts
- Add browser lifecycle and memory tests
2026-02-04 02:00:53 +00:00
ntohidi
4e56f3e00d Add contributing guide and update mkdocs navigation for community resources 2026-02-03 09:46:54 +01:00
unclecode
0bfcf080dd Add contributors from PRs #1133, #729
Credit chrizzly2309 and complete-dope for identifying bugs
that were resolved on develop.
2026-02-02 07:56:37 +00:00
unclecode
b962699c0d Add contributors from PRs #973, #1073, #931
Credit danyQe, saipavanmeruga7797, and stevenaldinger for
identifying bugs that were resolved on develop.
2026-02-02 07:14:12 +00:00
unclecode
ffd3face6b Remove duplicate PROMPT_EXTRACT_BLOCKS definition in prompts.py
The first definition (with tags/questions fields) was immediately
overwritten by the second simpler definition — pure dead code.
Removes 61 lines of unused prompt text.

Inspired by PR #931 (stevenaldinger).
2026-02-02 07:04:35 +00:00
unclecode
c790231aba Fix browser context memory leak — signature shrink + LRU eviction (#943)
contexts_by_config accumulated browser contexts unboundedly in long-running
crawlers (Docker API). Two root causes fixed:

1. _make_config_signature() hashed ~60 CrawlerRunConfig fields but only 7
   affect the browser context (proxy_config, locale, timezone_id, geolocation,
   override_navigator, simulate_user, magic). Switched from blacklist to
   whitelist — non-context fields like word_count_threshold, css_selector,
   screenshot, verbose no longer cause unnecessary context creation.

2. No eviction mechanism existed between close() calls. Added refcount
   tracking (_context_refcounts, incremented under _contexts_lock in
   get_page, decremented in release_page_with_context) and LRU eviction
   (_evict_lru_context_locked) that caps contexts at _max_contexts=20,
   evicting only idle contexts (refcount==0) oldest-first.

Also fixed: storage_state path leaked a temporary context every request
(now explicitly closed after clone_runtime_state).

Closes #943. Credit to @Martichou for the investigation in #1640.
2026-02-01 14:23:04 +00:00
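The whitelist signature from fix 1 can be sketched as follows — the seven field names come from the message above; the tuple-of-reprs encoding is an assumption:

```python
CONTEXT_FIELDS = (  # the only CrawlerRunConfig fields that affect a context
    "proxy_config", "locale", "timezone_id", "geolocation",
    "override_navigator", "simulate_user", "magic",
)

def make_config_signature(config: dict) -> tuple:
    """Whitelist signature: configs differing only in non-context fields
    (css_selector, screenshot, word_count_threshold, ...) map to the
    same signature and therefore share a browser context."""
    return tuple(repr(config.get(f)) for f in CONTEXT_FIELDS)
```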
unclecode
bb523b6c6c Merge PRs #1077, #1281 — bs4 deprecation and proxy auth fix
- PR #1077: Fix bs4 deprecation warning (text -> string)
- PR #1281: Fix proxy auth ERR_INVALID_AUTH_CREDENTIALS
- Comment on PR #1081 guiding author on needed DFS/BFF fixes
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 07:06:39 +00:00
unclecode
980dc73156 Merge PR #1281: Fix proxy auth ERR_INVALID_AUTH_CREDENTIALS 2026-02-01 07:05:00 +00:00
unclecode
98aea2fb46 Merge PR #1077: Fix bs4 deprecation warning (text -> string) 2026-02-01 07:04:31 +00:00
unclecode
a56dd07559 Merge PRs #1667, #1296, #1364 — CLI deep-crawl, env var, script tags
- PR #1667: Fix deep-crawl CLI outputting only the first page
- PR #1296: Fix VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY
- PR #1364: Fix script tag removal losing adjacent text
- Fix: restore .crawl4ai subfolder in VersionManager path
- Close #1150 (already fixed on develop)
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 06:53:53 +00:00
unclecode
312cef8633 Fix PR #1296: restore .crawl4ai subfolder in VersionManager path 2026-02-01 06:22:16 +00:00
unclecode
a244e4d781 Merge PR #1364: Fix script tag removal losing adjacent text in cleaned_html 2026-02-01 06:22:10 +00:00
unclecode
0f83b05a2d Merge PR #1296: Fix VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY env var 2026-02-01 06:21:40 +00:00
unclecode
37995d4d3f Merge PR #1667: Fix deep-crawl CLI outputting only the first page 2026-02-01 06:21:25 +00:00
unclecode
dc4ae73221 Merge PRs #1714, #1721, #1719, #1717 and fix base tag pipeline
- PR #1714: Replace tf-playwright-stealth with playwright-stealth
- PR #1721: Respect <base> tag in html2text for relative links
- PR #1719: Include GoogleSearchCrawler script.js in package data
- PR #1717: Allow local embeddings by removing OpenAI fallback
- Fix: Extract <base href> from raw HTML before head gets stripped
- Close duplicates: #1703, #1698, #1697, #1710, #1720
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 05:41:33 +00:00
unclecode
5cd0648d71 Merge PR #1717: Allow local embeddings by removing OpenAI fallback 2026-02-01 05:02:18 +00:00
unclecode
9172581416 Merge PR #1719: Include GoogleSearchCrawler script.js in package distribution 2026-02-01 05:02:05 +00:00
unclecode
c39e796a18 Merge PR #1721: Fix <base> tag ignored in html2text relative link resolution 2026-02-01 05:01:52 +00:00
unclecode
ccab926f1f Merge PR #1714: Replace tf-playwright-stealth with playwright-stealth 2026-02-01 05:01:31 +00:00
unclecode
43738c9ed2 Fix can_process_url() to receive normalized URL in deep crawl strategies
Pass the normalized absolute URL instead of the raw href to
can_process_url() in BFS, BFF, and DFS deep crawl strategies.
This ensures URL validation and filter chain evaluation operate
on consistent, fully-qualified URLs.

Fixes #1743
2026-02-01 03:45:52 +00:00
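The normalization step can be sketched with `urllib.parse` — a hedged illustration (fragment stripping is an added assumption, not stated by the fix):

```python
from urllib.parse import urljoin, urldefrag

def normalize_href(page_url: str, href: str) -> str:
    """Resolve a raw href against the page URL before handing it to
    can_process_url() and the filter chain, so validation always sees
    a fully-qualified absolute URL."""
    absolute, _frag = urldefrag(urljoin(page_url, href))
    return absolute
```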
unclecode
ee717dc019 Add contributor for PR #1746 and fix test pytest marker
- Add ChiragBellara to CONTRIBUTORS.md for sitemap seeding fix
- Add missing @pytest.mark.asyncio decorator to seeder test
2026-02-01 03:10:32 +00:00
unclecode
7c5933e2e7 Merge PR #1746: Fix sitemap-only URL seeding avoiding Common Crawl calls 2026-02-01 02:57:06 +00:00
unclecode
5be0d2d75e Add contributor and docs for force_viewport_screenshot feature
- Add TheRedRad to CONTRIBUTORS.md for PR #1694
- Document force_viewport_screenshot in API parameters reference
- Add viewport screenshot note in browser-crawler-config guide
- Add viewport-only screenshot example in screenshot docs
2026-02-01 01:10:20 +00:00
unclecode
e19492a82e Merge PR #1694: feat: add force viewport screenshot 2026-02-01 01:05:52 +00:00
unclecode
55a2cc8181 Document set_defaults/get_defaults/reset_defaults in config guides 2026-01-31 11:46:53 +00:00