Commit Graph

1338 Commits

unclecode
719e83e105 Update PR todolist — refresh open PRs, add 6 new, classify
- Added PRs #475, #462, #416, #335, #332, #312
- Flagged #475 as duplicate of merged #1296
- Corrected author for #1450 (rbushri)
- Updated total count to ~63 open PRs
- Updated date to 2026-02-06
2026-02-06 09:06:13 +00:00
unclecode
3401dd1620 Fix browser recycling under high concurrency — version-based approach
The previous recycle logic waited for all refcounts to hit 0 before
recycling, which never happened under sustained concurrent load (20+
crawls always had at least one active).

New approach:
- Add _browser_version to config signature — bump it to force new contexts
- When threshold is hit: bump version, move old sigs to _pending_cleanup
- New requests get new contexts automatically (different signature)
- Old contexts drain naturally and get cleaned up when refcount hits 0
- Safety cap: max 3 pending browsers draining at once

This means recycling now works under any load pattern — no blocking,
no waiting for quiet moments. Old and new browsers coexist briefly
during transitions.

Includes 12 new tests covering version bumps, concurrent recycling,
safety cap, and edge cases.
2026-02-05 07:48:12 +00:00
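The version-bump scheme above can be sketched in miniature. This is an illustrative model, not the real crawl4ai code: `_browser_version` and `_pending_cleanup` are named in the commit message, while `RecyclingSketch`, `acquire`, and `release` are hypothetical stand-ins, and the safety cap here counts draining signatures rather than browser processes for simplicity.

```python
class RecyclingSketch:
    """Toy model of version-based recycling: bump a version folded into
    every config signature, let old contexts drain to refcount 0."""

    MAX_PENDING = 3  # safety cap from the commit message

    def __init__(self, threshold):
        self.threshold = threshold
        self._browser_version = 0      # part of every signature
        self._pages_since_recycle = 0
        self._refcounts = {}           # signature -> active page count
        self._pending_cleanup = set()  # old signatures draining out

    def _signature(self, config_key):
        # Bumping the version makes old signatures stop matching, so new
        # requests get fresh contexts without blocking active ones.
        return (config_key, self._browser_version)

    def acquire(self, config_key):
        """Hand out a context signature, recycling if the threshold is hit."""
        self._pages_since_recycle += 1
        if (self._pages_since_recycle >= self.threshold
                and len(self._pending_cleanup) < self.MAX_PENDING):
            self._pending_cleanup.update(self._refcounts)  # mark old sigs
            self._browser_version += 1                     # force new contexts
            self._pages_since_recycle = 0
        sig = self._signature(config_key)
        self._refcounts[sig] = self._refcounts.get(sig, 0) + 1
        return sig

    def release(self, sig):
        """Drop a refcount; drained pending signatures are cleaned up."""
        self._refcounts[sig] -= 1
        if self._refcounts[sig] == 0 and sig in self._pending_cleanup:
            del self._refcounts[sig]
            self._pending_cleanup.discard(sig)
```

Note how no call ever waits: old and new signatures coexist until the old refcounts naturally reach zero, which is the "no quiet moment needed" property the commit describes.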
unclecode
c046918bb4 Add memory-saving mode, browser recycling, and CDP leak fixes
- Add memory_saving_mode config: aggressive cache discard + V8 heap cap
  flags for high-volume crawling (1000+ pages)
- Add max_pages_before_recycle config: automatic browser process recycling
  after N pages to reclaim leaked memory (recommended 500-1000)
- Add default Chrome flags to disable unused features (OptimizationHints,
  MediaRouter, component updates, domain reliability)
- Fix CDP session leak: detach CDP session after viewport adjustment
- Fix session kill: only close context when refcount reaches 0, preventing
  use-after-close for shared contexts
- Add browser lifecycle and memory tests
2026-02-04 02:00:53 +00:00
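A plausible flag set for the behaviors this commit names can be sketched as follows. The switches are real Chromium command-line switches matching the described effects (aggressive cache discard, V8 heap cap, feature disabling); whether crawl4ai uses exactly these, and the helper name `build_memory_saving_flags`, are assumptions.

```python
def build_memory_saving_flags(v8_heap_mb=512):
    """Sketch: Chromium flags plausibly matching memory_saving_mode.
    The actual flag set chosen by crawl4ai may differ."""
    return [
        "--aggressive-cache-discard",                      # drop caches eagerly
        f"--js-flags=--max-old-space-size={v8_heap_mb}",   # cap the V8 heap
        "--disable-features=OptimizationHints,MediaRouter",
        "--disable-component-update",
        "--disable-domain-reliability",
    ]
```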
ntohidi
4e56f3e00d Add contributing guide and update mkdocs navigation for community resources 2026-02-03 09:46:54 +01:00
unclecode
0bfcf080dd Add contributors from PRs #1133, #729
Credit chrizzly2309 and complete-dope for identifying bugs
that were resolved on develop.
2026-02-02 07:56:37 +00:00
unclecode
b962699c0d Add contributors from PRs #973, #1073, #931
Credit danyQe, saipavanmeruga7797, and stevenaldinger for
identifying bugs that were resolved on develop.
2026-02-02 07:14:12 +00:00
unclecode
ffd3face6b Remove duplicate PROMPT_EXTRACT_BLOCKS definition in prompts.py
The first definition (with tags/questions fields) was immediately
overwritten by the second simpler definition — pure dead code.
Removes 61 lines of unused prompt text.

Inspired by PR #931 (stevenaldinger).
2026-02-02 07:04:35 +00:00
unclecode
c790231aba Fix browser context memory leak — signature shrink + LRU eviction (#943)
contexts_by_config accumulated browser contexts unboundedly in long-running
crawlers (Docker API). Two root causes fixed:

1. _make_config_signature() hashed ~60 CrawlerRunConfig fields but only 7
   affect the browser context (proxy_config, locale, timezone_id, geolocation,
   override_navigator, simulate_user, magic). Switched from blacklist to
   whitelist — non-context fields like word_count_threshold, css_selector,
   screenshot, verbose no longer cause unnecessary context creation.

2. No eviction mechanism existed between close() calls. Added refcount
   tracking (_context_refcounts, incremented under _contexts_lock in
   get_page, decremented in release_page_with_context) and LRU eviction
   (_evict_lru_context_locked) that caps contexts at _max_contexts=20,
   evicting only idle contexts (refcount==0) oldest-first.

Also fixed: storage_state path leaked a temporary context every request
(now explicitly closed after clone_runtime_state).

Closes #943. Credit to @Martichou for the investigation in #1640.
2026-02-01 14:23:04 +00:00
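The two fixes combine into a small pattern worth sketching: a whitelist signature so irrelevant config fields never split contexts, plus an LRU cache that only evicts idle entries. The field list below is the one quoted in the commit; `ContextCache` and its method names are illustrative stand-ins for the real `_context_refcounts`/`_evict_lru_context_locked` machinery (which also holds a lock the sketch omits).

```python
from collections import OrderedDict

# The seven context-affecting fields named in the commit message.
CONTEXT_FIELDS = ("proxy_config", "locale", "timezone_id", "geolocation",
                  "override_navigator", "simulate_user", "magic")

def make_config_signature(config: dict) -> tuple:
    """Whitelist approach: only context-affecting fields enter the key."""
    return tuple((f, repr(config.get(f))) for f in CONTEXT_FIELDS)

class ContextCache:
    """Illustrative LRU cache that evicts only idle (refcount == 0) contexts."""

    def __init__(self, max_contexts=20):
        self.max_contexts = max_contexts
        self._contexts = OrderedDict()  # signature -> context, LRU order
        self._refcounts = {}

    def get(self, sig, factory):
        if sig not in self._contexts:
            if len(self._contexts) >= self.max_contexts:
                self._evict_lru_idle()
            self._contexts[sig] = factory()
            self._refcounts[sig] = 0
        self._contexts.move_to_end(sig)  # mark most recently used
        self._refcounts[sig] += 1
        return self._contexts[sig]

    def release(self, sig):
        self._refcounts[sig] -= 1

    def _evict_lru_idle(self):
        for sig in list(self._contexts):  # oldest-first
            if self._refcounts[sig] == 0:
                del self._contexts[sig]
                del self._refcounts[sig]
                return
```

The whitelist is the higher-leverage fix: two configs that differ only in, say, `screenshot` or `css_selector` now hash to the same signature and share one context.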
unclecode
bb523b6c6c Merge PRs #1077, #1281 — bs4 deprecation and proxy auth fix
- PR #1077: Fix bs4 deprecation warning (text -> string)
- PR #1281: Fix proxy auth ERR_INVALID_AUTH_CREDENTIALS
- Comment on PR #1081 guiding author on needed DFS/BFF fixes
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 07:06:39 +00:00
unclecode
980dc73156 Merge PR #1281: Fix proxy auth ERR_INVALID_AUTH_CREDENTIALS 2026-02-01 07:05:00 +00:00
unclecode
98aea2fb46 Merge PR #1077: Fix bs4 deprecation warning (text -> string) 2026-02-01 07:04:31 +00:00
unclecode
a56dd07559 Merge PRs #1667, #1296, #1364 — CLI deep-crawl, env var, script tags
- PR #1667: Fix deep-crawl CLI outputting only the first page
- PR #1296: Fix VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY
- PR #1364: Fix script tag removal losing adjacent text
- Fix: restore .crawl4ai subfolder in VersionManager path
- Close #1150 (already fixed on develop)
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 06:53:53 +00:00
unclecode
312cef8633 Fix PR #1296: restore .crawl4ai subfolder in VersionManager path 2026-02-01 06:22:16 +00:00
unclecode
a244e4d781 Merge PR #1364: Fix script tag removal losing adjacent text in cleaned_html 2026-02-01 06:22:10 +00:00
unclecode
0f83b05a2d Merge PR #1296: Fix VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY env var 2026-02-01 06:21:40 +00:00
unclecode
37995d4d3f Merge PR #1667: Fix deep-crawl CLI outputting only the first page 2026-02-01 06:21:25 +00:00
unclecode
dc4ae73221 Merge PRs #1714, #1721, #1719, #1717 and fix base tag pipeline
- PR #1714: Replace tf-playwright-stealth with playwright-stealth
- PR #1721: Respect <base> tag in html2text for relative links
- PR #1719: Include GoogleSearchCrawler script.js in package data
- PR #1717: Allow local embeddings by removing OpenAI fallback
- Fix: Extract <base href> from raw HTML before head gets stripped
- Close duplicates: #1703, #1698, #1697, #1710, #1720
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 05:41:33 +00:00
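The "extract `<base href>` before the head gets stripped" fix can be sketched as a small pre-pass over the raw HTML. The regex approach and the helper name `extract_base_href` are assumptions; the real pipeline may parse rather than regex-match.

```python
import re
from urllib.parse import urljoin

def extract_base_href(raw_html: str, page_url: str) -> str:
    """Sketch: pull <base href> out of raw HTML before any cleaning step
    removes <head>, so relative links can resolve against it."""
    m = re.search(r'<base\b[^>]*\bhref=["\']([^"\']+)["\']',
                  raw_html, re.IGNORECASE)
    # A relative base href itself resolves against the page URL.
    return urljoin(page_url, m.group(1)) if m else page_url
```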
unclecode
5cd0648d71 Merge PR #1717: Allow local embeddings by removing OpenAI fallback 2026-02-01 05:02:18 +00:00
unclecode
9172581416 Merge PR #1719: Include GoogleSearchCrawler script.js in package distribution 2026-02-01 05:02:05 +00:00
unclecode
c39e796a18 Merge PR #1721: Fix <base> tag ignored in html2text relative link resolution 2026-02-01 05:01:52 +00:00
unclecode
ccab926f1f Merge PR #1714: Replace tf-playwright-stealth with playwright-stealth 2026-02-01 05:01:31 +00:00
unclecode
43738c9ed2 Fix can_process_url() to receive normalized URL in deep crawl strategies
Pass the normalized absolute URL instead of the raw href to
can_process_url() in BFS, BFF, and DFS deep crawl strategies.
This ensures URL validation and filter chain evaluation operate
on consistent, fully-qualified URLs.

Fixes #1743
2026-02-01 03:45:52 +00:00
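The essence of the fix is the ordering: resolve first, validate second. A minimal sketch using the standard library (the stand-in `can_process_url` below is a hypothetical simplification; the real strategies also run filter chains):

```python
from urllib.parse import urljoin, urlparse

def normalize_href(base_url: str, href: str) -> str:
    """Resolve a raw href against the page URL before validation."""
    return urljoin(base_url, href)

def can_process_url(url: str) -> bool:
    """Stand-in validator: only fully-qualified http(s) URLs pass."""
    p = urlparse(url)
    return p.scheme in ("http", "https") and bool(p.netloc)
```

Before the fix, a relative href like `"../about"` would reach the validator as-is and fail (or be filtered inconsistently); after it, the validator always sees the absolute form.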
unclecode
ee717dc019 Add contributor for PR #1746 and fix test pytest marker
- Add ChiragBellara to CONTRIBUTORS.md for sitemap seeding fix
- Add missing @pytest.mark.asyncio decorator to seeder test
2026-02-01 03:10:32 +00:00
unclecode
7c5933e2e7 Merge PR #1746: Fix sitemap-only URL seeding avoiding Common Crawl calls 2026-02-01 02:57:06 +00:00
unclecode
5be0d2d75e Add contributor and docs for force_viewport_screenshot feature
- Add TheRedRad to CONTRIBUTORS.md for PR #1694
- Document force_viewport_screenshot in API parameters reference
- Add viewport screenshot note in browser-crawler-config guide
- Add viewport-only screenshot example in screenshot docs
2026-02-01 01:10:20 +00:00
unclecode
e19492a82e Merge PR #1694: feat: add force viewport screenshot 2026-02-01 01:05:52 +00:00
unclecode
55a2cc8181 Document set_defaults/get_defaults/reset_defaults in config guides 2026-01-31 11:46:53 +00:00
unclecode
13a414802b Add set_defaults/get_defaults/reset_defaults to config classes 2026-01-31 11:44:07 +00:00
unclecode
19b9140c68 Improve CDP connection handling 2026-01-31 11:07:26 +00:00
ChiragBellara
694ba44a04 Fix URL Seeder forcing a Common Crawl index lookup when only "sitemap" sources are requested 2026-01-30 09:33:30 -08:00
unclecode
0104db6de2 Fix critical RCE via deserialization and eval() in /crawl endpoint
- Replace raw eval() in _compute_field() with AST-validated
  _safe_eval_expression() that blocks __import__, dunder attribute
  access, and import statements while preserving safe transforms
- Add ALLOWED_DESERIALIZE_TYPES allowlist to from_serializable_dict()
  preventing arbitrary class instantiation from API input
- Update security contact email and add v0.8.1 security fixes to
  SECURITY.md with researcher acknowledgment
- Add 17 security tests covering both fixes
2026-01-30 08:46:32 +00:00
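The AST-validation idea can be sketched briefly: parse the expression, walk the tree, and reject the dangerous constructs the commit lists (imports, dunder attribute access, `__import__`) before evaluating with empty builtins. This is a hedged illustration of the technique, not crawl4ai's actual `_safe_eval_expression`; the banned-name set and helper name are assumptions.

```python
import ast

BANNED_NAMES = {"__import__", "eval", "exec", "open", "compile"}

def safe_eval_expression(expr: str, variables: dict):
    """Sketch of an AST-validated eval: reject imports, dunder attribute
    access, and dangerous names, then evaluate with no builtins."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            raise ValueError("imports are not allowed")
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise ValueError("dunder attribute access is not allowed")
        if isinstance(node, ast.Name) and node.id in BANNED_NAMES:
            raise ValueError(f"use of {node.id!r} is not allowed")
    # Empty __builtins__ blocks anything the walk above missed.
    return eval(compile(tree, "<expr>", "eval"),
                {"__builtins__": {}}, dict(variables))
```

Safe transforms like arithmetic on extracted fields still work, while the classic escape routes (`().__class__.__mro__...`, `__import__('os')`) are rejected at parse time.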
Nasrin
ad5ebf166a Merge pull request #1718 from YuriNachos/fix/issue-1704-default-logger
fix: Initialize default logger in AsyncPlaywrightCrawlerStrategy (#1704)
2026-01-29 13:03:11 +01:00
Nasrin
034bddf557 Merge pull request #1733 from jose-blockchain/fix/1686-docker-health-version
Fix #1686: Docker health endpoint reports outdated version
2026-01-29 12:55:24 +01:00
unclecode
911bbce8b1 Fix agenerate_schema() JSON parsing for Anthropic models
Strip markdown code fences (```json ... ```) from LLM responses before
json.loads() in agenerate_schema(). Anthropic models wrap JSON output
in markdown fences when litellm silently drops the unsupported
response_format parameter, causing json.loads("") parse failures.

- Add _strip_markdown_fences() helper to extraction_strategy.py
- Apply fence stripping + empty response check in agenerate_schema()
- Separate JSONDecodeError for clearer error messages
- Add 34 tests: unit, real API integration (Anthropic/OpenAI/Groq
  against quotes.toscrape.com), and regression parametrized
2026-01-29 11:38:53 +00:00
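The fence-stripping helper can be sketched as follows. The regex and the wrapper `parse_llm_json` are illustrative; the real `_strip_markdown_fences` in extraction_strategy.py may be implemented differently.

```python
import json
import re

def strip_markdown_fences(text: str) -> str:
    """Remove a wrapping ```json ... ``` fence if present (sketch of the
    helper the commit describes)."""
    match = re.match(r"^\s*```(?:json)?\s*\n(.*?)\n?\s*```\s*$",
                     text, re.DOTALL)
    return match.group(1) if match else text.strip()

def parse_llm_json(text: str):
    """Fence-strip, reject empty responses, then parse."""
    cleaned = strip_markdown_fences(text)
    if not cleaned:
        raise ValueError("empty LLM response")
    return json.loads(cleaned)   # JSONDecodeError propagates with context
```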
unclecode
0a17fe8f19 Improve page tracking with global CDP endpoint-based tracking
- Use class-level tracking keyed by normalized CDP URL
- All BrowserManager instances connecting to same browser share tracking
- For CDP connections, always create new pages (cross-connection page
  sharing isn't reliable in Playwright)
- For managed browsers, page reuse works within same process
- Normalize CDP URLs to handle different formats (http, ws, query params)
2026-01-28 09:30:20 +00:00
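Normalizing CDP URLs so equivalent endpoints share one tracking key can be sketched with the standard library. The host:port key and the default port 9222 (Chrome's usual remote-debugging port) are assumptions about how the normalization works; the real function may keep more of the URL.

```python
from urllib.parse import urlparse

def normalize_cdp_url(url: str) -> str:
    """Sketch: collapse scheme (http vs ws), path, and query params so
    equivalent CDP endpoints map to one tracking key."""
    p = urlparse(url)
    host = p.hostname or ""
    port = p.port or 9222   # assume Chrome's default debugging port
    return f"{host}:{port}"
```

With a key like this, every `BrowserManager` pointed at the same browser, however it spells the endpoint, lands in the same class-level tracking bucket.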
unclecode
9b52c1490b Fix page reuse race condition when create_isolated_context=False
When using create_isolated_context=False with concurrent crawls, multiple
tasks would reuse the same page (pages[0]) causing navigation race
conditions and "Page.content: Unable to retrieve content because the
page is navigating" errors.

Changes:
- Add _pages_in_use set to track pages currently being used by crawls
- Rewrite get_page() to only reuse pages that are not in use
- Create new pages when all existing pages are busy
- Add release_page() method to release pages after crawl completes
- Update cleanup paths to release pages before closing

This maintains context sharing (cookies, localStorage) while ensuring
each concurrent crawl gets its own isolated page for navigation.

Includes integration tests verifying:
- Single and sequential crawls still work
- Concurrent crawls don't cause race conditions
- High concurrency (10 simultaneous crawls) works
- Page tracking state remains consistent
2026-01-28 01:43:21 +00:00
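The `_pages_in_use` idea reduces to a small pool invariant: never hand the same page to two concurrent crawls, but keep every page inside the one shared context. A synchronous toy model (the real `get_page`/`release_page` are async and live on the browser manager; `PagePool` is a hypothetical name):

```python
class PagePool:
    """Toy model of _pages_in_use: reuse an idle page from the shared
    context, otherwise open a new one."""

    def __init__(self, open_page):
        self._open_page = open_page   # callable that creates a new page
        self._pages = []              # all pages in the shared context
        self._in_use = set()          # ids of pages held by active crawls

    def get_page(self):
        for page in self._pages:
            if id(page) not in self._in_use:   # idle page: safe to reuse
                self._in_use.add(id(page))
                return page
        page = self._open_page()               # all busy: open a fresh one
        self._pages.append(page)
        self._in_use.add(id(page))
        return page

    def release_page(self, page):
        """Called after a crawl completes so the page can be reused."""
        self._in_use.discard(id(page))
```

Cookies and localStorage still flow through the shared context, but navigation races on `pages[0]` become impossible because a busy page is never returned twice.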
unclecode
656b938ef8 Merge branch 'main' into develop 2026-01-27 01:58:45 +00:00
unclecode
55de32d925 Add CycloneDX SBOM and generation script
- Add sbom/sbom.cdx.json generated via Syft
- Add scripts/gen-sbom.sh for regenerating SBOM
- Add sbom/README.md with disclaimer
- Update .gitignore to track gen-sbom.sh
2026-01-27 01:45:42 +00:00
unclecode
21e6c418be Fix: Keep storage_state.json in profile shrink
- Add storage_state.json to all KEEP_PATTERNS levels
- This file contains unencrypted cookies in Playwright format
- Critical for cross-machine profile portability (local -> cloud)
2026-01-26 13:06:31 +00:00
unclecode
18d2ef4a24 Fix: Disable cookie encryption for portable profiles
- Add --password-store=basic and --use-mock-keychain flags when creating
  profiles to prevent OS keychain encryption of cookies
- Without this, cookies are encrypted with machine-specific keys and
  profiles can't be transferred between machines (local -> cloud)

Also adds direct CLI commands for profile management:
- crwl profiles create <name>
- crwl profiles list
- crwl profiles delete <name>

The interactive menu (crwl profiles) still works as before.
2026-01-26 12:57:17 +00:00
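The two flags are real Chromium switches; assembling them into launch args can be sketched as below. The helper name `launch_args` is hypothetical, but the flags themselves are the ones the commit names.

```python
PORTABLE_PROFILE_FLAGS = [
    "--password-store=basic",   # skip GNOME Keyring / KWallet on Linux
    "--use-mock-keychain",      # skip the macOS Keychain
]

def launch_args(user_data_dir: str, extra=()):
    """Sketch: launch a profile so cookies stay in Chromium's basic store
    (unencrypted on disk) and the profile can move between machines."""
    return [f"--user-data-dir={user_data_dir}",
            *PORTABLE_PROFILE_FLAGS, *extra]
```

Without these flags the OS keychain encrypts cookie values with a machine-local key, which is exactly why a copied profile logs you out on the destination machine.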
unclecode
ef226f5787 Add: Cloud CLI module for profile management
New cloud module (crawl4ai/cloud/):
- crwl cloud auth - Authenticate with API key
- crwl cloud profiles upload - Upload local profile to cloud
- crwl cloud profiles list - List cloud profiles
- crwl cloud profiles delete - Delete cloud profile

Features:
- Stores credentials in ~/.crawl4ai/global.yml
- Auto-shrinks profiles before upload (configurable)
- Validates API key on auth
- Rich formatted output with tables and panels
2026-01-25 09:35:48 +00:00
unclecode
94e19a4c72 Enhance browser profile management capabilities 2026-01-24 08:02:52 +00:00
unclecode
79ebfce913 Refactor HTML block delimiter to use config constant 2026-01-24 04:19:50 +00:00
unclecode
2d5e5306c5 Add support for parallel URL processing in extraction utilities 2026-01-24 04:13:39 +00:00
unclecode
b0b3ca1222 Refactor extraction strategy internals and improve error handling 2026-01-24 03:10:26 +00:00
ntohidi
777d0878f2 Update security contact emails in SECURITY.md 2026-01-22 09:53:24 +01:00
unclecode
fbfbc6995c Fix deep crawl cancellation example to use DFS for precise control 2026-01-22 06:25:34 +00:00
unclecode
1e2b7fe7e6 Add documentation and example for deep crawl cancellation
- Add Section 11 "Cancellation Support for Deep Crawls" to deep-crawling.md
- Document should_cancel callback, cancel() method, and cancelled property
- Include complete example for cloud platform job cancellation
- Add docs/examples/deep_crawl_cancellation.py with 6 comprehensive examples
- Update summary section to mention cancellation feature
2026-01-22 06:10:54 +00:00
unclecode
f6897d1429 Add cancellation support for deep crawl strategies
- Add should_cancel callback parameter to BFS, DFS, and BestFirst strategies
- Add cancel() method for immediate cancellation (thread-safe)
- Add cancelled property to check cancellation status
- Add _check_cancellation() internal method supporting both sync/async callbacks
- Reset cancel event on strategy reuse for multiple crawls
- Include cancelled flag in state notifications via on_state_change
- Handle callback exceptions gracefully (fail-open, log warning)
- Add comprehensive test suite with 26 tests covering all edge cases

This enables external callers (e.g., cloud platforms) to stop a running
deep crawl mid-execution and retrieve partial results.
2026-01-22 06:08:25 +00:00
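The cancellation surface described above can be sketched compactly: an event set by `cancel()`, a `cancelled` property, and a check that accepts sync or async `should_cancel` callbacks and fails open on callback errors. `CancellableCrawl` and its `run` loop are illustrative stand-ins for the real strategies, and the sketch omits the thread-safety plumbing the commit mentions.

```python
import asyncio
import inspect

class CancellableCrawl:
    """Toy model of the cancellation API: should_cancel callback
    (sync or async), cancel(), and a cancelled property."""

    def __init__(self, should_cancel=None):
        self._should_cancel = should_cancel
        self._cancel_event = asyncio.Event()

    def cancel(self):
        """Request immediate cancellation."""
        self._cancel_event.set()

    @property
    def cancelled(self) -> bool:
        return self._cancel_event.is_set()

    async def _check_cancellation(self) -> bool:
        if self._cancel_event.is_set():
            return True
        if self._should_cancel is None:
            return False
        try:
            result = self._should_cancel()
            if inspect.isawaitable(result):   # support async callbacks too
                result = await result
            return bool(result)
        except Exception:
            return False   # fail-open: a broken callback never cancels

    async def run(self, urls):
        done = []
        for url in urls:
            if await self._check_cancellation():
                break                  # stop mid-crawl, keep partial results
            done.append(url)
        return done
```

This is the shape that lets a cloud platform poll its own job store in `should_cancel` and still get the partial results collected before the stop.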
José
c9a271a3ff Merge branch 'fix/1686-docker-health-version' of https://github.com/jose-blockchain/crawl4ai into fix/1686-docker-health-version 2026-01-20 23:45:13 +01:00