Commit Graph

1351 Commits

Author SHA1 Message Date
unclecode
d028a889d0 Make proxy_config a property so direct assignment also normalizes
Setting config.proxy_config = [ProxyConfig.DIRECT, ...] after
construction now goes through the same normalization as __init__,
converting "direct" sentinels to None. Fixes crash when proxy_config
is assigned directly instead of passed to the constructor.
2026-02-14 13:16:36 +00:00
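The pattern in the commit above can be sketched as a property whose setter runs the same normalization on every assignment. This is a minimal illustrative sketch, not crawl4ai's actual code; `DIRECT`, `_normalize_proxies`, and `RunConfig` are stand-in names.

```python
DIRECT = "direct"  # stand-in for a ProxyConfig.DIRECT sentinel

def _normalize_proxies(value):
    """Convert the DIRECT sentinel (or None) inside a list to None; wrap scalars."""
    if value is None:
        return None
    items = value if isinstance(value, list) else [value]
    return [None if v in (DIRECT, None) else v for v in items]

class RunConfig:
    def __init__(self, proxy_config=None):
        # Assignment in __init__ goes through the same setter below,
        # so construction and later direct assignment are equivalent.
        self.proxy_config = proxy_config

    @property
    def proxy_config(self):
        return self._proxy_config

    @proxy_config.setter
    def proxy_config(self, value):
        self._proxy_config = _normalize_proxies(value)

cfg = RunConfig()
cfg.proxy_config = [DIRECT, {"server": "http://proxy:8080"}]
```

Because normalization lives in the setter, there is exactly one code path for both construction and post-construction assignment, which is what closes the crash described above.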
unclecode
879553955c Add ProxyConfig.DIRECT sentinel for direct-then-proxy escalation
Allow "direct" or None in proxy_config list to explicitly try
without a proxy before escalating to proxy servers. The retry
loop already handled None as direct — this exposes it as a
clean user-facing API via ProxyConfig.DIRECT.
2026-02-14 10:25:07 +00:00
unclecode
875207287e Unify proxy_config to accept list, add crawl_stats tracking
- proxy_config on CrawlerRunConfig now accepts a single ProxyConfig or
  a list of ProxyConfig tried in order (first-come-first-served)
- Remove is_fallback from ProxyConfig and fallback_proxy_configs from
  CrawlerRunConfig — proxy escalation handled entirely by list order
- Add _get_proxy_list() normalizer for the retry loop
- Add CrawlResult.crawl_stats with attempts, retries, proxies_used,
  fallback_fetch_used, and resolved_by for billing and observability
- Set success=False with error_message when all attempts are blocked
- Simplify retry loop — no more is_fallback stashing logic
- Update docs and tests to reflect new API
2026-02-14 07:53:46 +00:00
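The first-come-first-served escalation with stats tracking might look roughly like this. It is a sketch under assumptions: `fetch`, `crawl_with_proxies`, and `CrawlStats` are hypothetical names, not the library's API, and only a subset of the listed stats fields is shown.

```python
from dataclasses import dataclass, field

@dataclass
class CrawlStats:
    attempts: int = 0
    retries: int = 0
    proxies_used: list = field(default_factory=list)
    resolved_by: str = ""

def crawl_with_proxies(fetch, proxies):
    """Try each proxy in order (None means direct); stop on first success."""
    stats = CrawlStats()
    for proxy in proxies or [None]:
        stats.attempts += 1
        stats.proxies_used.append(proxy)
        if fetch(proxy):
            stats.resolved_by = "direct" if proxy is None else "proxy"
            return True, stats
        stats.retries += 1
    # All attempts blocked: caller sets success=False with an error message.
    return False, stats
```

List order alone encodes the escalation policy, which is why a separate `is_fallback` flag becomes unnecessary.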
unclecode
72b546c48d Add anti-bot detection, retry, and fallback system
Automatically detect when crawls are blocked by anti-bot systems
(Akamai, Cloudflare, PerimeterX, DataDome, Imperva, etc.) and
escalate through configurable retry and fallback strategies.

New features on CrawlerRunConfig:
- max_retries: retry rounds when blocking is detected
- fallback_proxy_configs: list of fallback proxies tried each round
- fallback_fetch_function: async last-resort function returning raw HTML

New field on ProxyConfig:
- is_fallback: skip proxy on first attempt, activate only when blocked

Escalation chain per round: main proxy → fallback proxies in order.
After all rounds: fallback_fetch_function as last resort.

Detection uses tiered heuristics — structural HTML markers (high
confidence) trigger on any page, generic patterns only on short
error pages to avoid false positives.
2026-02-14 05:24:07 +00:00
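The tiered-heuristics idea can be sketched as follows. The marker strings and the length threshold here are invented examples for illustration, not the library's actual detection lists.

```python
# High-confidence structural markers fire on a page of any length.
STRUCTURAL_MARKERS = ['id="challenge-form"', "cf-browser-verification"]
# Generic phrases are only trusted on short error-like pages.
GENERIC_PATTERNS = ["access denied", "verify you are human"]
SHORT_PAGE_LIMIT = 2000  # chars; assumed threshold

def looks_blocked(html: str) -> bool:
    lowered = html.lower()
    if any(m.lower() in lowered for m in STRUCTURAL_MARKERS):
        return True  # structural marker: high confidence at any length
    if len(html) < SHORT_PAGE_LIMIT:
        return any(p in lowered for p in GENERIC_PATTERNS)
    return False  # long real pages: ignore generic phrases (false positives)
```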
unclecode
fdd989785f Sync sec-ch-ua with User-Agent and keep WebGL alive in stealth mode
Fix a bug where magic mode and per-request UA overrides would change
the User-Agent header without updating the sec-ch-ua (browser hint)
header to match. Anti-bot systems like Akamai detect this mismatch
as a bot signal.

Changes:
- Regenerate browser_hint via UAGen.generate_client_hints() whenever
  the UA is changed at crawl time (magic mode or explicit override)
- Re-apply updated headers to the page via set_extra_http_headers()
- Skip per-crawl UA override for persistent contexts where the UA is
  locked at launch time by Playwright's protocol layer
- Move --disable-gpu flags behind enable_stealth check so WebGL works
  via SwiftShader when stealth mode is active (missing WebGL is a
  detectable headless signal)
- Clean up old test scripts, add clean anti-bot test
2026-02-13 04:10:47 +00:00
unclecode
112f44a97d Fix proxy auth for persistent browser contexts
Chromium's --proxy-server CLI flag silently ignores inline credentials
(user:pass@server). For persistent contexts, crawl4ai was embedding
credentials in this flag via ManagedBrowser.build_browser_flags(),
causing proxy auth to fail and the browser to fall back to direct
connection.

Fix: Use Playwright's launch_persistent_context(proxy=...) API instead
of subprocess + CDP when use_persistent_context=True. This handles
proxy authentication properly via the HTTP CONNECT handshake. The
non-persistent and CDP paths remain unchanged.

Changes:
- Strip credentials from --proxy-server flag in build_browser_flags()
- Add launch_persistent_context() path in BrowserManager.start()
- Add cleanup path in BrowserManager.close()
- Guard create_browser_context() when self.browser is None
- Add regression tests covering all 4 proxy/persistence combinations
2026-02-12 11:19:29 +00:00
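Splitting inline credentials out of a proxy URL so the server part can go on a CLI flag while auth goes through the context API can be sketched with the standard library. `split_proxy_url` is a hypothetical helper; the returned dict follows the `server`/`username`/`password` shape Playwright's `proxy` parameter accepts. The sketch assumes an explicit port in the URL.

```python
from urllib.parse import urlparse

def split_proxy_url(url: str) -> dict:
    """Return a Playwright-style proxy dict from a user:pass@host:port URL."""
    p = urlparse(url)
    proxy = {"server": f"{p.scheme}://{p.hostname}:{p.port}"}
    if p.username:
        # Credentials are delivered via the HTTP CONNECT handshake,
        # never embedded in the --proxy-server flag.
        proxy["username"] = p.username
        proxy["password"] = p.password or ""
    return proxy
```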
unclecode
1a24ac785e Refactor from_kwargs to respect set_defaults and use __init__ defaults
Replace hardcoded parameter listings in BrowserConfig.from_kwargs() and
CrawlerRunConfig.from_kwargs() with a generic approach that filters
input kwargs to valid __init__ params and passes them through. This:

- Makes set_defaults() work with from_kwargs() (previously ignored)
- Fixes default mismatches (word_count_threshold was 200 vs __init__=1,
  markdown_generator was None vs __init__=DefaultMarkdownGenerator())
- Eliminates ~160 lines of duplicated default values
- Auto-supports new params without updating from_kwargs
2026-02-11 13:35:36 +00:00
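The generic approach described above can be sketched by introspecting `__init__` and filtering kwargs, so `__init__` defaults apply automatically and new parameters need no bookkeeping. The class and field names here are illustrative stand-ins.

```python
import inspect

class Config:
    def __init__(self, word_count_threshold=1, verbose=False):
        self.word_count_threshold = word_count_threshold
        self.verbose = verbose

    @classmethod
    def from_kwargs(cls, kwargs: dict) -> "Config":
        # Accept only names that __init__ actually declares; everything
        # omitted falls back to the __init__ default, never a second copy.
        valid = set(inspect.signature(cls.__init__).parameters) - {"self"}
        return cls(**{k: v for k, v in kwargs.items() if k in valid})
```

Keeping defaults in one place (the signature) is what eliminates the duplicated default values and the word_count_threshold-style mismatches.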
unclecode
3fc7730aaf Add remove_consent_popups flag and fix from_kwargs dict deserialization
Add CrawlerRunConfig.remove_consent_popups (bool, default False) that
targets GDPR/cookie consent popups from 70+ known CMP providers including
OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Usercentrics,
Sourcepoint, Google FundingChoices, and many more.

The JS strategy uses a 5-phase approach:
1. Click "Accept All" buttons (cleanest dismissal, sets cookies)
2. Try CMP JavaScript APIs (__tcfapi, Didomi, Cookiebot, Osano, Klaro)
3. Remove known CMP containers by selector (~120 selectors)
4. Handle iframe-based and shadow DOM CMPs
5. Restore body scroll and remove CMP body classes

Also fix from_kwargs() in CrawlerRunConfig and BrowserConfig to
auto-deserialize dict values using the existing from_serializable_dict()
infrastructure. Previously, strategy objects like markdown_generator
arriving as {"type": "DefaultMarkdownGenerator", "params": {...}} from
JSON APIs were passed through as raw dicts, causing crashes when the
crawler later called methods on them.
2026-02-11 12:46:47 +00:00
unclecode
44b8afb6dc Improve schema generation prompt for sibling-based layouts 2026-02-10 08:34:22 +00:00
unclecode
fbc52813a4 Add tests, docs, and contributors for PRs #1463 and #1435
- Add tests for device_scale_factor (config + integration)
- Add tests for redirected_status_code (model + redirect + raw HTML)
- Document device_scale_factor in browser config docs and API reference
- Document redirected_status_code in crawler result docs and API reference
- Add TristanDonze and charlaie to CONTRIBUTORS.md
- Update PR-TODOLIST with session results
2026-02-06 09:30:19 +00:00
unclecode
37a49c5315 Merge PR #1435: Add redirected_status_code to CrawlResult
Applied manually due to conflicts (PR based on older code).
Also fixed missing variable initialization for non-goto paths
(file://, raw:, js_only) that would have caused NameError.

Closes #1434
2026-02-06 09:23:54 +00:00
unclecode
0aacafed0a Merge PR #1463: Add configurable device_scale_factor for screenshot quality 2026-02-06 09:19:42 +00:00
unclecode
719e83e105 Update PR todolist — refresh open PRs, add 6 new, classify
- Added PRs #475, #462, #416, #335, #332, #312
- Flagged #475 as duplicate of merged #1296
- Corrected author for #1450 (rbushri)
- Updated total count to ~63 open PRs
- Updated date to 2026-02-06
2026-02-06 09:06:13 +00:00
unclecode
3401dd1620 Fix browser recycling under high concurrency — version-based approach
The previous recycle logic waited for all refcounts to hit 0 before
recycling, which never happened under sustained concurrent load (20+
crawls always had at least one active).

New approach:
- Add _browser_version to config signature — bump it to force new contexts
- When threshold is hit: bump version, move old sigs to _pending_cleanup
- New requests get new contexts automatically (different signature)
- Old contexts drain naturally and get cleaned up when refcount hits 0
- Safety cap: max 3 pending browsers draining at once

This means recycling now works under any load pattern — no blocking,
no waiting for quiet moments. Old and new browsers coexist briefly
during transitions.

Includes 12 new tests covering version bumps, concurrent recycling,
safety cap, and edge cases.
2026-02-05 07:48:12 +00:00
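The version-bump mechanism can be sketched like this: the version is part of the context signature, so bumping it routes new requests to fresh browsers while old ones drain. All names and the threshold handling are illustrative, not the actual implementation.

```python
class Recycler:
    def __init__(self, threshold=500):
        self.version = 0
        self.pages = 0
        self.threshold = threshold
        self.pending_cleanup = set()  # old signatures draining to refcount 0

    def signature(self, config_key: str) -> tuple:
        # Version is baked into the signature: a bump makes every new
        # request miss the cache and get a fresh context.
        return (config_key, self.version)

    def record_page(self, active_sigs):
        self.pages += 1
        if self.pages >= self.threshold:
            self.pending_cleanup |= set(active_sigs)  # let old contexts drain
            self.version += 1  # no blocking, no waiting for quiet moments
            self.pages = 0
```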
unclecode
c046918bb4 Add memory-saving mode, browser recycling, and CDP leak fixes
- Add memory_saving_mode config: aggressive cache discard + V8 heap cap
  flags for high-volume crawling (1000+ pages)
- Add max_pages_before_recycle config: automatic browser process recycling
  after N pages to reclaim leaked memory (recommended 500-1000)
- Add default Chrome flags to disable unused features (OptimizationHints,
  MediaRouter, component updates, domain reliability)
- Fix CDP session leak: detach CDP session after viewport adjustment
- Fix session kill: only close context when refcount reaches 0, preventing
  use-after-close for shared contexts
- Add browser lifecycle and memory tests
2026-02-04 02:00:53 +00:00
ntohidi
4e56f3e00d Add contributing guide and update mkdocs navigation for community resources 2026-02-03 09:46:54 +01:00
unclecode
0bfcf080dd Add contributors from PRs #1133, #729
Credit chrizzly2309 and complete-dope for identifying bugs
that were resolved on develop.
2026-02-02 07:56:37 +00:00
unclecode
b962699c0d Add contributors from PRs #973, #1073, #931
Credit danyQe, saipavanmeruga7797, and stevenaldinger for
identifying bugs that were resolved on develop.
2026-02-02 07:14:12 +00:00
unclecode
ffd3face6b Remove duplicate PROMPT_EXTRACT_BLOCKS definition in prompts.py
The first definition (with tags/questions fields) was immediately
overwritten by the second simpler definition — pure dead code.
Removes 61 lines of unused prompt text.

Inspired by PR #931 (stevenaldinger).
2026-02-02 07:04:35 +00:00
unclecode
c790231aba Fix browser context memory leak — signature shrink + LRU eviction (#943)
contexts_by_config accumulated browser contexts unboundedly in long-running
crawlers (Docker API). Two root causes fixed:

1. _make_config_signature() hashed ~60 CrawlerRunConfig fields but only 7
   affect the browser context (proxy_config, locale, timezone_id, geolocation,
   override_navigator, simulate_user, magic). Switched from blacklist to
   whitelist — non-context fields like word_count_threshold, css_selector,
   screenshot, verbose no longer cause unnecessary context creation.

2. No eviction mechanism existed between close() calls. Added refcount
   tracking (_context_refcounts, incremented under _contexts_lock in
   get_page, decremented in release_page_with_context) and LRU eviction
   (_evict_lru_context_locked) that caps contexts at _max_contexts=20,
   evicting only idle contexts (refcount==0) oldest-first.

Also fixed: storage_state path leaked a temporary context every request
(now explicitly closed after clone_runtime_state).

Closes #943. Credit to @Martichou for the investigation in #1640.
2026-02-01 14:23:04 +00:00
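The refcounted LRU eviction described in point 2 can be sketched as follows: only idle entries (refcount 0) are evicted, oldest first, with a hard cap on cache size. This is a simplified single-threaded sketch with illustrative names; the real code holds a lock around these operations.

```python
from collections import OrderedDict

class ContextCache:
    def __init__(self, max_contexts=20):
        self.contexts = OrderedDict()  # signature -> context object
        self.refcounts = {}
        self.max_contexts = max_contexts

    def acquire(self, sig, factory):
        if sig not in self.contexts:
            self._evict_if_full()
            self.contexts[sig] = factory()
            self.refcounts[sig] = 0
        self.contexts.move_to_end(sig)  # mark most recently used
        self.refcounts[sig] += 1
        return self.contexts[sig]

    def release(self, sig):
        self.refcounts[sig] -= 1

    def _evict_if_full(self):
        if len(self.contexts) < self.max_contexts:
            return
        for sig in list(self.contexts):  # iterates oldest-first
            if self.refcounts[sig] == 0:  # never evict an in-use context
                del self.contexts[sig]
                del self.refcounts[sig]
                return
```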
unclecode
bb523b6c6c Merge PRs #1077, #1281 — bs4 deprecation and proxy auth fix
- PR #1077: Fix bs4 deprecation warning (text -> string)
- PR #1281: Fix proxy auth ERR_INVALID_AUTH_CREDENTIALS
- Comment on PR #1081 guiding author on needed DFS/BFF fixes
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 07:06:39 +00:00
unclecode
980dc73156 Merge PR #1281: Fix proxy auth ERR_INVALID_AUTH_CREDENTIALS 2026-02-01 07:05:00 +00:00
unclecode
98aea2fb46 Merge PR #1077: Fix bs4 deprecation warning (text -> string) 2026-02-01 07:04:31 +00:00
unclecode
a56dd07559 Merge PRs #1667, #1296, #1364 — CLI deep-crawl, env var, script tags
- PR #1667: Fix deep-crawl CLI outputting only the first page
- PR #1296: Fix VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY
- PR #1364: Fix script tag removal losing adjacent text
- Fix: restore .crawl4ai subfolder in VersionManager path
- Close #1150 (already fixed on develop)
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 06:53:53 +00:00
unclecode
312cef8633 Fix PR #1296: restore .crawl4ai subfolder in VersionManager path 2026-02-01 06:22:16 +00:00
unclecode
a244e4d781 Merge PR #1364: Fix script tag removal losing adjacent text in cleaned_html 2026-02-01 06:22:10 +00:00
unclecode
0f83b05a2d Merge PR #1296: Fix VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY env var 2026-02-01 06:21:40 +00:00
unclecode
37995d4d3f Merge PR #1667: Fix deep-crawl CLI outputting only the first page 2026-02-01 06:21:25 +00:00
unclecode
dc4ae73221 Merge PRs #1714, #1721, #1719, #1717 and fix base tag pipeline
- PR #1714: Replace tf-playwright-stealth with playwright-stealth
- PR #1721: Respect <base> tag in html2text for relative links
- PR #1719: Include GoogleSearchCrawler script.js in package data
- PR #1717: Allow local embeddings by removing OpenAI fallback
- Fix: Extract <base href> from raw HTML before head gets stripped
- Close duplicates: #1703, #1698, #1697, #1710, #1720
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 05:41:33 +00:00
unclecode
5cd0648d71 Merge PR #1717: Allow local embeddings by removing OpenAI fallback 2026-02-01 05:02:18 +00:00
unclecode
9172581416 Merge PR #1719: Include GoogleSearchCrawler script.js in package distribution 2026-02-01 05:02:05 +00:00
unclecode
c39e796a18 Merge PR #1721: Fix <base> tag ignored in html2text relative link resolution 2026-02-01 05:01:52 +00:00
unclecode
ccab926f1f Merge PR #1714: Replace tf-playwright-stealth with playwright-stealth 2026-02-01 05:01:31 +00:00
unclecode
43738c9ed2 Fix can_process_url() to receive normalized URL in deep crawl strategies
Pass the normalized absolute URL instead of the raw href to
can_process_url() in BFS, BFF, and DFS deep crawl strategies.
This ensures URL validation and filter chain evaluation operate
on consistent, fully-qualified URLs.

Fixes #1743
2026-02-01 03:45:52 +00:00
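Normalizing a raw href to a fully-qualified URL before validation can be sketched with the standard library. `normalize_href` is a hypothetical helper; whether fragments are dropped is an assumption of this sketch.

```python
from urllib.parse import urljoin, urldefrag

def normalize_href(base_url: str, href: str) -> str:
    """Resolve a raw href against the page URL before filter evaluation."""
    absolute = urljoin(base_url, href)   # handles relative paths and ../
    return urldefrag(absolute).url       # drop the #fragment (assumed)
```

Passing the normalized URL (rather than the raw href) to `can_process_url()` means filters always see the same fully-qualified form, regardless of how the link was written in the page.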
unclecode
ee717dc019 Add contributor for PR #1746 and fix test pytest marker
- Add ChiragBellara to CONTRIBUTORS.md for sitemap seeding fix
- Add missing @pytest.mark.asyncio decorator to seeder test
2026-02-01 03:10:32 +00:00
unclecode
7c5933e2e7 Merge PR #1746: Fix sitemap-only URL seeding avoiding Common Crawl calls 2026-02-01 02:57:06 +00:00
unclecode
5be0d2d75e Add contributor and docs for force_viewport_screenshot feature
- Add TheRedRad to CONTRIBUTORS.md for PR #1694
- Document force_viewport_screenshot in API parameters reference
- Add viewport screenshot note in browser-crawler-config guide
- Add viewport-only screenshot example in screenshot docs
2026-02-01 01:10:20 +00:00
unclecode
e19492a82e Merge PR #1694: feat: add force viewport screenshot 2026-02-01 01:05:52 +00:00
unclecode
55a2cc8181 Document set_defaults/get_defaults/reset_defaults in config guides 2026-01-31 11:46:53 +00:00
unclecode
13a414802b Add set_defaults/get_defaults/reset_defaults to config classes 2026-01-31 11:44:07 +00:00
unclecode
19b9140c68 Improve CDP connection handling 2026-01-31 11:07:26 +00:00
ChiragBellara
694ba44a04 Fix URL Seeder forcing a Common Crawl index query when only "sitemap" seeding is requested 2026-01-30 09:33:30 -08:00
unclecode
0104db6de2 Fix critical RCE via deserialization and eval() in /crawl endpoint
- Replace raw eval() in _compute_field() with AST-validated
  _safe_eval_expression() that blocks __import__, dunder attribute
  access, and import statements while preserving safe transforms
- Add ALLOWED_DESERIALIZE_TYPES allowlist to from_serializable_dict()
  preventing arbitrary class instantiation from API input
- Update security contact email and add v0.8.1 security fixes to
  SECURITY.md with researcher acknowledgment
- Add 17 security tests covering both fixes
2026-01-30 08:46:32 +00:00
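AST-gated expression evaluation in the spirit of this fix can be sketched as below. The blocked-name set and checks are illustrative, not the actual `_safe_eval_expression()` implementation, and a production allowlist would be stricter.

```python
import ast

BLOCKED_NAMES = {"__import__", "eval", "exec", "open"}

def safe_eval_expression(expr: str, variables: dict):
    """Parse first, walk the AST, reject dangerous nodes, then evaluate."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            raise ValueError("imports are not allowed")
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise ValueError("dunder attribute access is not allowed")
        if isinstance(node, ast.Name) and node.id in BLOCKED_NAMES:
            raise ValueError(f"{node.id} is not allowed")
    # Empty __builtins__ removes the default escape hatches; only the
    # caller-supplied variables are reachable.
    return eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}}, variables)
```

Safe transforms such as `value.strip().upper()` still work, while `__import__('os')` and `value.__class__` are rejected before evaluation.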
Nasrin
ad5ebf166a Merge pull request #1718 from YuriNachos/fix/issue-1704-default-logger
fix: Initialize default logger in AsyncPlaywrightCrawlerStrategy (#1704)
2026-01-29 13:03:11 +01:00
Nasrin
034bddf557 Merge pull request #1733 from jose-blockchain/fix/1686-docker-health-version
Fix #1686: Docker health endpoint reports outdated version
2026-01-29 12:55:24 +01:00
unclecode
911bbce8b1 Fix agenerate_schema() JSON parsing for Anthropic models
Strip markdown code fences (```json ... ```) from LLM responses before
json.loads() in agenerate_schema(). Anthropic models wrap JSON output
in markdown fences when litellm silently drops the unsupported
response_format parameter, causing json.loads("") parse failures.

- Add _strip_markdown_fences() helper to extraction_strategy.py
- Apply fence stripping + empty response check in agenerate_schema()
- Separate JSONDecodeError for clearer error messages
- Add 34 tests: unit, real API integration (Anthropic/OpenAI/Groq
  against quotes.toscrape.com), and regression parametrized
2026-01-29 11:38:53 +00:00
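The fence stripping plus empty-response check can be sketched like this. It is an approximation of the described behavior, not the actual `_strip_markdown_fences()` code; `parse_llm_json` is a hypothetical wrapper.

```python
import json
import re

def strip_markdown_fences(text: str) -> str:
    """Remove a leading ```json (or bare ```) fence and its closing ```."""
    match = re.match(r"^\s*```(?:json)?\s*\n(.*?)\n?\s*```\s*$", text, re.DOTALL)
    return match.group(1) if match else text.strip()

def parse_llm_json(response: str):
    cleaned = strip_markdown_fences(response)
    if not cleaned:
        # Separate error path: an empty response is not a JSON decode error.
        raise ValueError("empty LLM response")
    return json.loads(cleaned)
```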
unclecode
0a17fe8f19 Improve page tracking with global CDP endpoint-based tracking
- Use class-level tracking keyed by normalized CDP URL
- All BrowserManager instances connecting to same browser share tracking
- For CDP connections, always create new pages (cross-connection page
  sharing isn't reliable in Playwright)
- For managed browsers, page reuse works within same process
- Normalize CDP URLs to handle different formats (http, ws, query params)
2026-01-28 09:30:20 +00:00
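URL normalization for the tracking key can be sketched as reducing every CDP endpoint variant to host:port. The exact canonical form (and the default-port fallback) is an assumption of this sketch.

```python
from urllib.parse import urlparse

def normalize_cdp_url(url: str) -> str:
    """Map http/ws variants, paths, and query params to one tracking key."""
    p = urlparse(url)
    host = p.hostname or "localhost"
    port = p.port or 9222  # Chrome's default remote-debugging port
    return f"{host}:{port}"
```

With this key, two BrowserManager instances connecting to the same browser over `http://…/json/version` and `ws://…/devtools/browser/<id>` share one tracking entry.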
unclecode
9b52c1490b Fix page reuse race condition when create_isolated_context=False
When using create_isolated_context=False with concurrent crawls, multiple
tasks would reuse the same page (pages[0]) causing navigation race
conditions and "Page.content: Unable to retrieve content because the
page is navigating" errors.

Changes:
- Add _pages_in_use set to track pages currently being used by crawls
- Rewrite get_page() to only reuse pages that are not in use
- Create new pages when all existing pages are busy
- Add release_page() method to release pages after crawl completes
- Update cleanup paths to release pages before closing

This maintains context sharing (cookies, localStorage) while ensuring
each concurrent crawl gets its own isolated page for navigation.

Includes integration tests verifying:
- Single and sequential crawls still work
- Concurrent crawls don't cause race conditions
- High concurrency (10 simultaneous crawls) works
- Page tracking state remains consistent
2026-01-28 01:43:21 +00:00
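The in-use page tracking above can be sketched with a small synchronous pool; pages are plain objects here and all names are illustrative (the real code is async and tied to Playwright pages).

```python
class PagePool:
    def __init__(self):
        self.pages = []
        self.in_use = set()

    def get_page(self, create):
        for page in self.pages:
            if id(page) not in self.in_use:  # reuse only idle pages
                self.in_use.add(id(page))
                return page
        page = create()  # all existing pages busy: open a fresh one
        self.pages.append(page)
        self.in_use.add(id(page))
        return page

    def release_page(self, page):
        self.in_use.discard(id(page))
```

Two concurrent crawls therefore never navigate the same page, while released pages are reused by later crawls, preserving the shared-context benefits (cookies, localStorage).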
unclecode
656b938ef8 Merge branch 'main' into develop 2026-01-27 01:58:45 +00:00
unclecode
55de32d925 Add CycloneDX SBOM and generation script
- Add sbom/sbom.cdx.json generated via Syft
- Add scripts/gen-sbom.sh for regenerating SBOM
- Add sbom/README.md with disclaimer
- Update .gitignore to track gen-sbom.sh
2026-01-27 01:45:42 +00:00