Commit Graph

1356 Commits

Author SHA1 Message Date
unclecode
8576331d4e Add Shadow DOM flattening and reorder js_code execution pipeline
- Add `flatten_shadow_dom` option to CrawlerRunConfig that serializes
  shadow DOM content into the light DOM before HTML capture. Uses a
  recursive serializer that resolves <slot> projections and strips
  only shadow-scoped <style> tags. Also injects an init script to
  force-open closed shadow roots via attachShadow patching.

- Move `js_code` execution to after `wait_for` + `delay_before_return_html`
  so user scripts run on the fully-hydrated page. Add `js_code_before_wait`
  for the less common case of triggering loading before waiting.

- Add JS snippet (flatten_shadow_dom.js), integration test, example,
  and documentation across all relevant doc files.
2026-02-18 06:43:00 +00:00
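The slot-resolution logic described above can be sketched in Python over a toy node tree. This is only an illustration of the idea; the actual implementation is the `flatten_shadow_dom.js` snippet running in the browser, and the dict-based node model here is invented for the example:

```python
# Toy model: a node is a dict with "tag", optional "children" (light DOM),
# optional "shadow" (shadow root children), and for <slot> nodes an
# optional "slotted" list of projected light-DOM nodes.
def serialize(node):
    """Serialize a node, splicing shadow DOM content into the output
    and replacing <slot> placeholders with their projected content."""
    if node["tag"] == "#text":
        return node["text"]
    if node["tag"] == "slot":
        # A slot renders whatever light-DOM nodes were assigned to it.
        return "".join(serialize(c) for c in node.get("slotted", []))
    if node.get("shadow") is not None:
        # Shadow root wins: render its tree (which may contain <slot>s)
        # instead of the raw light-DOM children.
        inner = "".join(serialize(c) for c in node["shadow"])
    else:
        inner = "".join(serialize(c) for c in node.get("children", []))
    return f"<{node['tag']}>{inner}</{node['tag']}>"

text = lambda s: {"tag": "#text", "text": s}
card = {
    "tag": "my-card",
    "children": [text("ignored light DOM")],
    "shadow": [
        {"tag": "h2", "children": [text("Title")]},
        {"tag": "slot", "slotted": [text("projected body")]},
    ],
}
print(serialize(card))  # <my-card><h2>Title</h2>projected body</my-card>
```

The key property matches the commit: content inside the shadow root becomes ordinary markup in the captured HTML, with slot projections resolved in place.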
unclecode
4fb02f8b50 Warn LLM against hashed/generated CSS class names in schema prompts
Replace vague "handle dynamic class names appropriately" with explicit
rule: never use auto-generated class names (.styles_card__xK9r2, etc.)
as they break on every site rebuild. Prefer data-* attributes, semantic
tags, ARIA attributes, and stable meaningful class names instead.
2026-02-17 12:02:58 +00:00
unclecode
d267c650cb Add source (sibling selector) support to JSON extraction strategies
Many sites (e.g. Hacker News) split a single item's data across sibling
elements. Field selectors only search descendants, making sibling data
unreachable. The new "source" field key navigates to a sibling element
before running the selector: {"source": "+ tr"} finds the next sibling
<tr>, then extracts from there.

- Add _resolve_source abstract method to JsonElementExtractionStrategy
- Implement in all 4 subclasses (CSS/BS4, XPath/lxml, two lxml/CSS)
- Modify _extract_field to resolve source before type dispatch
- Update CSS and XPath LLM prompts with source docs and HN example
- Default generate_schema validate=True so schemas are checked on creation
- Add schema validation with feedback loop for auto-refinement
- Add messages param to completion helpers for multi-turn refinement
- Document source field and schema validation in docs
- Add 14 unit tests covering CSS, XPath, backward compat, edge cases
2026-02-17 09:04:40 +00:00
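The sibling-navigation idea behind `"source"` can be sketched with the stdlib XML parser on a Hacker-News-shaped layout. The real strategies use BeautifulSoup/lxml CSS selectors; the `extract_field` helper and the index-based sibling hop here are illustrative only, and only the `"+ tr"` next-sibling form is handled:

```python
import xml.etree.ElementTree as ET

HTML = """<table>
  <tr class="athing"><td class="title">Example story</td></tr>
  <tr><td class="subtext">99 points</td></tr>
</table>"""

table = ET.fromstring(HTML)
rows = list(table.findall("tr"))

def extract_field(row_index, field):
    """Resolve an optional sibling 'source' before running the selector,
    mirroring {"source": "+ tr"}: hop to the next sibling row first."""
    row = rows[row_index]
    if field.get("source") == "+ tr":            # next-sibling navigation
        row = rows[row_index + 1]
    # Poor man's class selector: find a <td> whose class matches.
    cls = field["selector"].lstrip(".")
    for td in row.iter("td"):
        if td.get("class") == cls:
            return td.text
    return None

print(extract_field(0, {"selector": ".title"}))                      # Example story
print(extract_field(0, {"selector": ".subtext", "source": "+ tr"}))  # 99 points
```

Without `"source"`, the points field is unreachable because field selectors only search descendants of the item row.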
unclecode
ccd24aa824 Fix fallback fetch: run when all proxies crash, skip re-check, never return None
Three related fixes to the anti-bot proxy retry + fallback pipeline:

1. Allow fallback_fetch_function to run when crawl_result is None (all proxies
   threw exceptions like browser crashes). Previously fallback only ran when
   crawl_result existed but was blocked — exception-only failures bypassed it.

2. Skip is_blocked() re-check after successful fallback. Real unblocked pages
   may contain anti-bot script markers (e.g. PerimeterX JS on Walmart) that
   trigger false positives, overriding success=True back to False.

3. Always return a CrawlResult with crawl_stats, never None. When all proxies
   and fallback fail, create a minimal failed result so callers get stats
   about what was attempted instead of AttributeError on None.

Also: if aprocess_html fails during fallback (dead browser can't run
Page.evaluate for consent popup removal), fall back to raw HTML result
instead of silently discarding the successfully-fetched fallback content.
2026-02-15 10:55:00 +00:00
unclecode
45d8e1450f Fix proxy escalation: don't re-raise on first proxy exception when chain has alternatives
When proxy_config is a list (escalation chain) and the first proxy throws
an exception (timeout, connection error, browser crash), the retry loop
now continues to the next proxy instead of immediately re-raising.

Previously, exceptions on _p_idx==0 and _attempt==0 were always re-raised,
which broke the entire escalation chain — ISP/Residential/fallback proxies
were never tried. This made the proxy list effectively useless for sites
where the first-tier proxy fails with an exception rather than a blocked
response.

The raise is preserved when there's only a single proxy and single attempt
(len(proxy_list) <= 1 and max_attempts <= 1) so that simple non-chain
crawls still get immediate error propagation.
2026-02-15 09:55:55 +00:00
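The retry-loop behavior described above can be sketched as follows. This is a minimal standalone model of the control flow, not the library's actual loop; `crawl_with_escalation` and `fetch` are invented names for the example:

```python
def crawl_with_escalation(proxy_list, fetch, max_attempts=1):
    """Try each proxy in order; re-raise an exception immediately only
    when there is a single proxy and a single attempt (no chain)."""
    last_error = None
    for _attempt in range(max_attempts):
        for _p_idx, proxy in enumerate(proxy_list):
            try:
                return fetch(proxy)
            except Exception as exc:
                if len(proxy_list) <= 1 and max_attempts <= 1:
                    raise                # simple non-chain crawl: propagate now
                last_error = exc         # otherwise continue escalating
    raise last_error

calls = []
def fetch(proxy):
    calls.append(proxy)
    if proxy == "datacenter":
        raise TimeoutError("first-tier proxy timed out")
    return f"html via {proxy}"

print(crawl_with_escalation(["datacenter", "residential"], fetch))
print(calls)  # both proxies were tried, in order
```

Before the fix, the exception on the first proxy would have been re-raised unconditionally, so the `"residential"` tier would never have been reached.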
unclecode
d028a889d0 Make proxy_config a property so direct assignment also normalizes
Setting config.proxy_config = [ProxyConfig.DIRECT, ...] after
construction now goes through the same normalization as __init__,
converting "direct" sentinels to None. Fixes crash when proxy_config
is assigned directly instead of passed to the constructor.
2026-02-14 13:16:36 +00:00
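The property-based normalization can be sketched like this. `RunConfig` is a stand-in class for the example, not the real `CrawlerRunConfig`:

```python
class RunConfig:
    """Route direct assignment through the same normalization as the
    constructor by making proxy_config a property."""
    def __init__(self, proxy_config=None):
        self.proxy_config = proxy_config       # goes through the setter too

    @property
    def proxy_config(self):
        return self._proxy_config

    @proxy_config.setter
    def proxy_config(self, value):
        if isinstance(value, list):
            # "direct" sentinels mean "no proxy for this hop" -> None
            value = [None if v in ("direct", None) else v for v in value]
        self._proxy_config = value

cfg = RunConfig()
cfg.proxy_config = ["direct", {"server": "http://proxy:8080"}]
print(cfg.proxy_config)   # [None, {'server': 'http://proxy:8080'}]
```

Because `__init__` assigns through the public attribute, construction and post-construction assignment share one code path, which is the crash the commit fixes.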
unclecode
879553955c Add ProxyConfig.DIRECT sentinel for direct-then-proxy escalation
Allow "direct" or None in proxy_config list to explicitly try
without a proxy before escalating to proxy servers. The retry
loop already handled None as direct — this exposes it as a
clean user-facing API via ProxyConfig.DIRECT.
2026-02-14 10:25:07 +00:00
unclecode
875207287e Unify proxy_config to accept list, add crawl_stats tracking
- proxy_config on CrawlerRunConfig now accepts a single ProxyConfig or
  a list of ProxyConfig tried in order (first-come-first-served)
- Remove is_fallback from ProxyConfig and fallback_proxy_configs from
  CrawlerRunConfig — proxy escalation handled entirely by list order
- Add _get_proxy_list() normalizer for the retry loop
- Add CrawlResult.crawl_stats with attempts, retries, proxies_used,
  fallback_fetch_used, and resolved_by for billing and observability
- Set success=False with error_message when all attempts are blocked
- Simplify retry loop — no more is_fallback stashing logic
- Update docs and tests to reflect new API
2026-02-14 07:53:46 +00:00
unclecode
72b546c48d Add anti-bot detection, retry, and fallback system
Automatically detect when crawls are blocked by anti-bot systems
(Akamai, Cloudflare, PerimeterX, DataDome, Imperva, etc.) and
escalate through configurable retry and fallback strategies.

New features on CrawlerRunConfig:
- max_retries: retry rounds when blocking is detected
- fallback_proxy_configs: list of fallback proxies tried each round
- fallback_fetch_function: async last-resort function returning raw HTML

New field on ProxyConfig:
- is_fallback: skip proxy on first attempt, activate only when blocked

Escalation chain per round: main proxy → fallback proxies in order.
After all rounds: fallback_fetch_function as last resort.

Detection uses tiered heuristics — structural HTML markers (high
confidence) trigger on any page, generic patterns only on short
error pages to avoid false positives.
2026-02-14 05:24:07 +00:00
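The tiered detection heuristic can be sketched as below. The marker strings and the 2000-character threshold are invented for the example; the real detector's markers and cutoffs are not shown in this log:

```python
def is_blocked(html, status=200):
    """Tiered heuristic: structural markers (high confidence) fire on any
    page; generic phrases count only on short pages, which are likely
    error interstitials, to avoid false positives on real content."""
    STRUCTURAL = ("cf-challenge", "_px-captcha", "datadome")   # illustrative
    GENERIC = ("access denied", "verify you are a human")      # illustrative
    lower = html.lower()
    if any(m in lower for m in STRUCTURAL):
        return True
    if len(html) < 2000 and any(p in lower for p in GENERIC):
        return True
    return False

long_page = "<html>" + "real content " * 500 + "verify you are a human</html>"
print(is_blocked(long_page))                               # False: long page, generic phrase only
print(is_blocked("<html><h1>Access Denied</h1></html>"))   # True: short error page
```

The length gate is what keeps a long article that merely quotes "verify you are a human" from being flagged, while a short interstitial with the same phrase still triggers.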
unclecode
fdd989785f Sync sec-ch-ua with User-Agent and keep WebGL alive in stealth mode
Fix a bug where magic mode and per-request UA overrides would change
the User-Agent header without updating the sec-ch-ua (browser hint)
header to match. Anti-bot systems like Akamai detect this mismatch
as a bot signal.

Changes:
- Regenerate browser_hint via UAGen.generate_client_hints() whenever
  the UA is changed at crawl time (magic mode or explicit override)
- Re-apply updated headers to the page via set_extra_http_headers()
- Skip per-crawl UA override for persistent contexts where the UA is
  locked at launch time by Playwright's protocol layer
- Move --disable-gpu flags behind enable_stealth check so WebGL works
  via SwiftShader when stealth mode is active (missing WebGL is a
  detectable headless signal)
- Clean up old test scripts, add clean anti-bot test
2026-02-13 04:10:47 +00:00
unclecode
112f44a97d Fix proxy auth for persistent browser contexts
Chromium's --proxy-server CLI flag silently ignores inline credentials
(user:pass@server). For persistent contexts, crawl4ai was embedding
credentials in this flag via ManagedBrowser.build_browser_flags(),
causing proxy auth to fail and the browser to fall back to direct
connection.

Fix: Use Playwright's launch_persistent_context(proxy=...) API instead
of subprocess + CDP when use_persistent_context=True. This handles
proxy authentication properly via the HTTP CONNECT handshake. The
non-persistent and CDP paths remain unchanged.

Changes:
- Strip credentials from --proxy-server flag in build_browser_flags()
- Add launch_persistent_context() path in BrowserManager.start()
- Add cleanup path in BrowserManager.close()
- Guard create_browser_context() when self.browser is None
- Add regression tests covering all 4 proxy/persistence combinations
2026-02-12 11:19:29 +00:00
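The credential-stripping step can be sketched with `urllib.parse`. The function name is illustrative; the idea is simply to keep `--proxy-server` credential-free and hand the username/password to the context's proxy settings instead:

```python
from urllib.parse import urlsplit, urlunsplit

def strip_proxy_credentials(proxy_url):
    """Chromium's --proxy-server flag silently ignores inline user:pass,
    so split the credentials out and return them separately for the
    persistent context's proxy={"server", "username", "password"} config."""
    parts = urlsplit(proxy_url)
    host = parts.hostname + (f":{parts.port}" if parts.port else "")
    bare = urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
    return bare, parts.username, parts.password

bare, user, pwd = strip_proxy_credentials("http://alice:s3cret@proxy.example.com:8080")
print(bare)        # http://proxy.example.com:8080
print(user, pwd)   # alice s3cret
```

With the bare URL in the CLI flag and the credentials supplied via the Playwright proxy config, authentication happens over the HTTP CONNECT handshake as the commit describes.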
unclecode
1a24ac785e Refactor from_kwargs to respect set_defaults and use __init__ defaults
Replace hardcoded parameter listings in BrowserConfig.from_kwargs() and
CrawlerRunConfig.from_kwargs() with a generic approach that filters
input kwargs to valid __init__ params and passes them through. This:

- Makes set_defaults() work with from_kwargs() (previously ignored)
- Fixes default mismatches (word_count_threshold was 200 vs __init__=1,
  markdown_generator was None vs __init__=DefaultMarkdownGenerator())
- Eliminates ~160 lines of duplicated default values
- Auto-supports new params without updating from_kwargs
2026-02-11 13:35:36 +00:00
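The generic approach can be sketched with `inspect.signature`. `CrawlerConfig` is a tiny stand-in class with made-up parameters; the point is that `__init__` remains the single source of truth for defaults:

```python
import inspect

class CrawlerConfig:
    def __init__(self, word_count_threshold=1, screenshot=False, verbose=True):
        self.word_count_threshold = word_count_threshold
        self.screenshot = screenshot
        self.verbose = verbose

    @classmethod
    def from_kwargs(cls, kwargs):
        """Filter unknown keys and let __init__ supply every default,
        instead of hardcoding a second (drift-prone) copy of them."""
        valid = set(inspect.signature(cls.__init__).parameters) - {"self"}
        return cls(**{k: v for k, v in kwargs.items() if k in valid})

cfg = CrawlerConfig.from_kwargs({"screenshot": True, "not_a_param": 42})
print(cfg.screenshot, cfg.word_count_threshold)   # True 1
```

Any new `__init__` parameter is picked up automatically, which is how the commit eliminates the duplicated default listings and the mismatches between them.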
unclecode
3fc7730aaf Add remove_consent_popups flag and fix from_kwargs dict deserialization
Add CrawlerRunConfig.remove_consent_popups (bool, default False) that
targets GDPR/cookie consent popups from 70+ known CMP providers including
OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Usercentrics,
Sourcepoint, Google FundingChoices, and many more.

The JS strategy uses a 5-phase approach:
1. Click "Accept All" buttons (cleanest dismissal, sets cookies)
2. Try CMP JavaScript APIs (__tcfapi, Didomi, Cookiebot, Osano, Klaro)
3. Remove known CMP containers by selector (~120 selectors)
4. Handle iframe-based and shadow DOM CMPs
5. Restore body scroll and remove CMP body classes

Also fix from_kwargs() in CrawlerRunConfig and BrowserConfig to
auto-deserialize dict values using the existing from_serializable_dict()
infrastructure. Previously, strategy objects like markdown_generator
arriving as {"type": "DefaultMarkdownGenerator", "params": {...}} from
JSON APIs were passed through as raw dicts, causing crashes when the
crawler later called methods on them.
2026-02-11 12:46:47 +00:00
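The dict auto-deserialization fix can be sketched as below. The registry, its single entry, and the tuple it returns are all invented for the example; the real code routes through `from_serializable_dict()` and actual strategy classes:

```python
# Illustrative allowlisted registry mapping type names to constructors.
REGISTRY = {"DefaultMarkdownGenerator": lambda **p: ("markdown_generator", p)}

def maybe_deserialize(value):
    """Turn {"type": ..., "params": {...}} payloads arriving from a JSON
    API back into objects; pass everything else through untouched."""
    if isinstance(value, dict) and value.get("type") in REGISTRY:
        return REGISTRY[value["type"]](**value.get("params", {}))
    return value

raw = {"type": "DefaultMarkdownGenerator", "params": {"ignore_links": True}}
print(maybe_deserialize(raw))             # ('markdown_generator', {'ignore_links': True})
print(maybe_deserialize("plain value"))   # plain value
```

Without this step, the raw dict reaches the crawler, which then crashes when it calls strategy methods on it; unknown `"type"` values fall through untouched rather than instantiating arbitrary classes.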
unclecode
44b8afb6dc Improve schema generation prompt for sibling-based layouts 2026-02-10 08:34:22 +00:00
unclecode
fbc52813a4 Add tests, docs, and contributors for PRs #1463 and #1435
- Add tests for device_scale_factor (config + integration)
- Add tests for redirected_status_code (model + redirect + raw HTML)
- Document device_scale_factor in browser config docs and API reference
- Document redirected_status_code in crawler result docs and API reference
- Add TristanDonze and charlaie to CONTRIBUTORS.md
- Update PR-TODOLIST with session results
2026-02-06 09:30:19 +00:00
unclecode
37a49c5315 Merge PR #1435: Add redirected_status_code to CrawlResult
Applied manually due to conflicts (PR based on older code).
Also fixed missing variable initialization for non-goto paths
(file://, raw:, js_only) that would have caused NameError.

Closes #1434
2026-02-06 09:23:54 +00:00
unclecode
0aacafed0a Merge PR #1463: Add configurable device_scale_factor for screenshot quality 2026-02-06 09:19:42 +00:00
unclecode
719e83e105 Update PR todolist — refresh open PRs, add 6 new, classify
- Added PRs #475, #462, #416, #335, #332, #312
- Flagged #475 as duplicate of merged #1296
- Corrected author for #1450 (rbushri)
- Updated total count to ~63 open PRs
- Updated date to 2026-02-06
2026-02-06 09:06:13 +00:00
unclecode
3401dd1620 Fix browser recycling under high concurrency — version-based approach
The previous recycle logic waited for all refcounts to hit 0 before
recycling, which never happened under sustained concurrent load (20+
crawls always had at least one active).

New approach:
- Add _browser_version to config signature — bump it to force new contexts
- When threshold is hit: bump version, move old sigs to _pending_cleanup
- New requests get new contexts automatically (different signature)
- Old contexts drain naturally and get cleaned up when refcount hits 0
- Safety cap: max 3 pending browsers draining at once

This means recycling now works under any load pattern — no blocking,
no waiting for quiet moments. Old and new browsers coexist briefly
during transitions.

Includes 12 new tests covering version bumps, concurrent recycling,
safety cap, and edge cases.
2026-02-05 07:48:12 +00:00
unclecode
c046918bb4 Add memory-saving mode, browser recycling, and CDP leak fixes
- Add memory_saving_mode config: aggressive cache discard + V8 heap cap
  flags for high-volume crawling (1000+ pages)
- Add max_pages_before_recycle config: automatic browser process recycling
  after N pages to reclaim leaked memory (recommended 500-1000)
- Add default Chrome flags to disable unused features (OptimizationHints,
  MediaRouter, component updates, domain reliability)
- Fix CDP session leak: detach CDP session after viewport adjustment
- Fix session kill: only close context when refcount reaches 0, preventing
  use-after-close for shared contexts
- Add browser lifecycle and memory tests
2026-02-04 02:00:53 +00:00
ntohidi
4e56f3e00d Add contributing guide and update mkdocs navigation for community resources 2026-02-03 09:46:54 +01:00
unclecode
0bfcf080dd Add contributors from PRs #1133, #729
Credit chrizzly2309 and complete-dope for identifying bugs
that were resolved on develop.
2026-02-02 07:56:37 +00:00
unclecode
b962699c0d Add contributors from PRs #973, #1073, #931
Credit danyQe, saipavanmeruga7797, and stevenaldinger for
identifying bugs that were resolved on develop.
2026-02-02 07:14:12 +00:00
unclecode
ffd3face6b Remove duplicate PROMPT_EXTRACT_BLOCKS definition in prompts.py
The first definition (with tags/questions fields) was immediately
overwritten by the second simpler definition — pure dead code.
Removes 61 lines of unused prompt text.

Inspired by PR #931 (stevenaldinger).
2026-02-02 07:04:35 +00:00
unclecode
c790231aba Fix browser context memory leak — signature shrink + LRU eviction (#943)
contexts_by_config accumulated browser contexts unboundedly in long-running
crawlers (Docker API). Two root causes fixed:

1. _make_config_signature() hashed ~60 CrawlerRunConfig fields but only 7
   affect the browser context (proxy_config, locale, timezone_id, geolocation,
   override_navigator, simulate_user, magic). Switched from blacklist to
   whitelist — non-context fields like word_count_threshold, css_selector,
   screenshot, verbose no longer cause unnecessary context creation.

2. No eviction mechanism existed between close() calls. Added refcount
   tracking (_context_refcounts, incremented under _contexts_lock in
   get_page, decremented in release_page_with_context) and LRU eviction
   (_evict_lru_context_locked) that caps contexts at _max_contexts=20,
   evicting only idle contexts (refcount==0) oldest-first.

Also fixed: storage_state path leaked a temporary context every request
(now explicitly closed after clone_runtime_state).

Closes #943. Credit to @Martichou for the investigation in #1640.
2026-02-01 14:23:04 +00:00
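Both fixes can be sketched together: a whitelist signature built only from context-shaping fields, plus an LRU cache that evicts only idle entries. The class and its cap of 2 are a toy model (the commit's cap is `_max_contexts=20`), and `config` is a plain dict here rather than a real config object:

```python
from collections import OrderedDict

# Whitelist: only the fields that actually shape a browser context.
CONTEXT_FIELDS = ("proxy_config", "locale", "timezone_id", "geolocation",
                  "override_navigator", "simulate_user", "magic")

def make_signature(config):
    return tuple(sorted((f, repr(config.get(f))) for f in CONTEXT_FIELDS))

class ContextCache:
    """LRU sketch: evict only idle (refcount == 0) contexts, oldest
    first, once the cap is reached."""
    def __init__(self, max_contexts=20):
        self.max = max_contexts
        self.contexts = OrderedDict()   # signature -> context
        self.refcounts = {}

    def get(self, sig, factory):
        if sig not in self.contexts:
            self._evict_if_full()
            self.contexts[sig] = factory()
            self.refcounts[sig] = 0
        self.contexts.move_to_end(sig)
        self.refcounts[sig] += 1
        return self.contexts[sig]

    def release(self, sig):
        self.refcounts[sig] -= 1

    def _evict_if_full(self):
        if len(self.contexts) < self.max:
            return
        for sig in list(self.contexts):          # oldest first
            if self.refcounts[sig] == 0:
                del self.contexts[sig]
                del self.refcounts[sig]
                return

cache = ContextCache(max_contexts=2)
a = cache.get(make_signature({"locale": "en-US"}), lambda: "ctx-a")
cache.release(make_signature({"locale": "en-US"}))
b = cache.get(make_signature({"locale": "de-DE"}), lambda: "ctx-b")
c = cache.get(make_signature({"locale": "fr-FR"}), lambda: "ctx-c")  # evicts idle ctx-a
print(len(cache.contexts))   # 2
```

Because `screenshot`, `css_selector`, and similar fields are absent from the signature, configs differing only in those fields share one context, and the cap plus idle-only eviction bounds memory without closing contexts that are in use.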
unclecode
bb523b6c6c Merge PRs #1077, #1281 — bs4 deprecation and proxy auth fix
- PR #1077: Fix bs4 deprecation warning (text -> string)
- PR #1281: Fix proxy auth ERR_INVALID_AUTH_CREDENTIALS
- Comment on PR #1081 guiding author on needed DFS/BFF fixes
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 07:06:39 +00:00
unclecode
980dc73156 Merge PR #1281: Fix proxy auth ERR_INVALID_AUTH_CREDENTIALS 2026-02-01 07:05:00 +00:00
unclecode
98aea2fb46 Merge PR #1077: Fix bs4 deprecation warning (text -> string) 2026-02-01 07:04:31 +00:00
unclecode
a56dd07559 Merge PRs #1667, #1296, #1364 — CLI deep-crawl, env var, script tags
- PR #1667: Fix deep-crawl CLI outputting only the first page
- PR #1296: Fix VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY
- PR #1364: Fix script tag removal losing adjacent text
- Fix: restore .crawl4ai subfolder in VersionManager path
- Close #1150 (already fixed on develop)
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 06:53:53 +00:00
unclecode
312cef8633 Fix PR #1296: restore .crawl4ai subfolder in VersionManager path 2026-02-01 06:22:16 +00:00
unclecode
a244e4d781 Merge PR #1364: Fix script tag removal losing adjacent text in cleaned_html 2026-02-01 06:22:10 +00:00
unclecode
0f83b05a2d Merge PR #1296: Fix VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY env var 2026-02-01 06:21:40 +00:00
unclecode
37995d4d3f Merge PR #1667: Fix deep-crawl CLI outputting only the first page 2026-02-01 06:21:25 +00:00
unclecode
dc4ae73221 Merge PRs #1714, #1721, #1719, #1717 and fix base tag pipeline
- PR #1714: Replace tf-playwright-stealth with playwright-stealth
- PR #1721: Respect <base> tag in html2text for relative links
- PR #1719: Include GoogleSearchCrawler script.js in package data
- PR #1717: Allow local embeddings by removing OpenAI fallback
- Fix: Extract <base href> from raw HTML before head gets stripped
- Close duplicates: #1703, #1698, #1697, #1710, #1720
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 05:41:33 +00:00
unclecode
5cd0648d71 Merge PR #1717: Allow local embeddings by removing OpenAI fallback 2026-02-01 05:02:18 +00:00
unclecode
9172581416 Merge PR #1719: Include GoogleSearchCrawler script.js in package distribution 2026-02-01 05:02:05 +00:00
unclecode
c39e796a18 Merge PR #1721: Fix <base> tag ignored in html2text relative link resolution 2026-02-01 05:01:52 +00:00
unclecode
ccab926f1f Merge PR #1714: Replace tf-playwright-stealth with playwright-stealth 2026-02-01 05:01:31 +00:00
unclecode
43738c9ed2 Fix can_process_url() to receive normalized URL in deep crawl strategies
Pass the normalized absolute URL instead of the raw href to
can_process_url() in BFS, BFF, and DFS deep crawl strategies.
This ensures URL validation and filter chain evaluation operate
on consistent, fully-qualified URLs.

Fixes #1743
2026-02-01 03:45:52 +00:00
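The normalization-before-filtering fix can be sketched with `urllib.parse.urljoin`. The `can_process_url` filter here is an invented example of the kind of check a filter chain performs:

```python
from urllib.parse import urljoin

def normalize(base_url, href):
    """Resolve a raw href against the page URL so filters always see a
    fully-qualified absolute URL."""
    return urljoin(base_url, href)

def can_process_url(url):            # illustrative domain filter
    return url.startswith("https://example.com/")

href = "/docs/page2"
print(can_process_url(href))                                          # False on the raw href
print(can_process_url(normalize("https://example.com/docs/", href)))  # True once normalized
```

Passing the raw relative href, as before the fix, makes prefix- and domain-based filters reject links that actually belong to the crawled site.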
unclecode
ee717dc019 Add contributor for PR #1746 and fix test pytest marker
- Add ChiragBellara to CONTRIBUTORS.md for sitemap seeding fix
- Add missing @pytest.mark.asyncio decorator to seeder test
2026-02-01 03:10:32 +00:00
unclecode
7c5933e2e7 Merge PR #1746: Fix sitemap-only URL seeding avoiding Common Crawl calls 2026-02-01 02:57:06 +00:00
unclecode
5be0d2d75e Add contributor and docs for force_viewport_screenshot feature
- Add TheRedRad to CONTRIBUTORS.md for PR #1694
- Document force_viewport_screenshot in API parameters reference
- Add viewport screenshot note in browser-crawler-config guide
- Add viewport-only screenshot example in screenshot docs
2026-02-01 01:10:20 +00:00
unclecode
e19492a82e Merge PR #1694: feat: add force viewport screenshot 2026-02-01 01:05:52 +00:00
unclecode
55a2cc8181 Document set_defaults/get_defaults/reset_defaults in config guides 2026-01-31 11:46:53 +00:00
unclecode
13a414802b Add set_defaults/get_defaults/reset_defaults to config classes 2026-01-31 11:44:07 +00:00
unclecode
19b9140c68 Improve CDP connection handling 2026-01-31 11:07:26 +00:00
ChiragBellara
694ba44a04 Fix URL Seeder forcing the Common Crawl index when a "sitemap" source is requested 2026-01-30 09:33:30 -08:00
unclecode
0104db6de2 Fix critical RCE via deserialization and eval() in /crawl endpoint
- Replace raw eval() in _compute_field() with AST-validated
  _safe_eval_expression() that blocks __import__, dunder attribute
  access, and import statements while preserving safe transforms
- Add ALLOWED_DESERIALIZE_TYPES allowlist to from_serializable_dict()
  preventing arbitrary class instantiation from API input
- Update security contact email and add v0.8.1 security fixes to
  SECURITY.md with researcher acknowledgment
- Add 17 security tests covering both fixes
2026-01-30 08:46:32 +00:00
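The AST-validation approach can be sketched as follows. This is a minimal model of the technique, not the library's `_safe_eval_expression`; the `value` binding is an assumed transform input for the example:

```python
import ast

def safe_eval_expression(expr, value):
    """Sketch of AST-validated evaluation for field transforms. Parsing
    in 'eval' mode already rejects import *statements*; the walk below
    blocks dunder names (including __import__) and dunder attribute
    access, while eval runs with empty builtins."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise ValueError("dunder attribute access not allowed")
        if isinstance(node, ast.Name) and node.id.startswith("__"):
            raise ValueError("dunder name not allowed")
    return eval(compile(tree, "<transform>", "eval"),
                {"__builtins__": {}}, {"value": value})

print(safe_eval_expression("value.strip().upper()", "  hello "))  # HELLO
try:
    safe_eval_expression("().__class__.__bases__", None)
except ValueError as e:
    print("blocked:", e)
```

Safe transforms like string methods keep working, while the classic sandbox escapes through `__class__`/`__bases__` chains and `__import__` are rejected before anything is evaluated.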
Nasrin
ad5ebf166a Merge pull request #1718 from YuriNachos/fix/issue-1704-default-logger
fix: Initialize default logger in AsyncPlaywrightCrawlerStrategy (#1704)
2026-01-29 13:03:11 +01:00
Nasrin
034bddf557 Merge pull request #1733 from jose-blockchain/fix/1686-docker-health-version
Fix #1686: Docker health endpoint reports outdated version
2026-01-29 12:55:24 +01:00