Commit Graph

1264 Commits

Author SHA1 Message Date
unclecode
530cde351f Add release notes for v0.8.0, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates
Documentation for v0.8.0 release:

- SECURITY.md: Security policy and vulnerability reporting guidelines
- RELEASE_NOTES_v0.8.0.md: Comprehensive release notes
- migration/v0.8.0-upgrade-guide.md: Step-by-step migration guide
- security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts
- CHANGELOG.md: Updated with v0.8.0 changes

Breaking changes documented:
- Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED)
- file:// URLs blocked on Docker API endpoints

Security fixes credited to Neo by ProjectDiscovery
2026-01-12 13:45:42 +00:00
ntohidi
122b4fe3f0 Add release notes for v0.7.9, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates 2026-01-12 13:46:39 +01:00
ntohidi
acfab80dd4 Enhance authentication flow by implementing JWT token retrieval and adding authorization headers to API requests 2026-01-12 13:46:32 +01:00
unclecode
f24396c23e Fix critical RCE and LFI vulnerabilities in Docker API deployment
Security fixes for vulnerabilities reported by ProjectDiscovery:

1. Remote Code Execution via Hooks (CVE pending)
   - Remove __import__ from allowed_builtins in hook_manager.py
   - Prevents arbitrary module imports (os, subprocess, etc.)
   - Hooks now disabled by default via CRAWL4AI_HOOKS_ENABLED env var

2. Local File Inclusion via file:// URLs (CVE pending)
   - Add URL scheme validation to /execute_js, /screenshot, /pdf, /html
   - Block file://, javascript:, data: and other dangerous schemes
   - Only allow http://, https://, and raw: (where appropriate)

3. Security hardening
   - Add CRAWL4AI_HOOKS_ENABLED=false as default (opt-in for hooks)
   - Add security warning comments in config.yml
   - Add validate_url_scheme() helper for consistent validation

Testing:
   - Add unit tests (test_security_fixes.py) - 16 tests
   - Add integration tests (run_security_tests.py) for live server

Affected endpoints:
   - POST /crawl (hooks disabled by default)
   - POST /crawl/stream (hooks disabled by default)
   - POST /execute_js (URL validation added)
   - POST /screenshot (URL validation added)
   - POST /pdf (URL validation added)
   - POST /html (URL validation added)

Breaking changes:
   - Hooks require CRAWL4AI_HOOKS_ENABLED=true to function
   - file:// URLs no longer work on API endpoints (use library directly)
2026-01-12 04:14:37 +00:00
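
A minimal sketch of the scheme check described above; validate_url_scheme() is named in the commit, but this signature and the allow_raw flag are assumptions:

  from urllib.parse import urlparse

  ALLOWED_SCHEMES = {"http", "https"}

  def validate_url_scheme(url: str, allow_raw: bool = False) -> str:
      # raw: is permitted only where the endpoint says it is appropriate
      if allow_raw and url.startswith("raw:"):
          return url
      scheme = urlparse(url).scheme.lower()
      if scheme not in ALLOWED_SCHEMES:
          raise ValueError(f"Blocked URL scheme: {scheme or '(none)'}")
      return url
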
unclecode
6b2dca76c3 Docs: Add multi-sample schema generation section
Add documentation explaining how to pass multiple HTML samples
to generate_schema() for stable selectors that work across pages
with varying DOM structures.

Includes:
- Problem explanation (fragile nth-child selectors)
- Solution with code example
- Key points for multi-sample queries
- Comparison table of fragile vs stable selectors
2026-01-04 12:50:08 +00:00
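
A hedged sketch of the multi-sample pattern; whether generate_schema() takes the samples as one concatenated string (shown here) or in another form is an assumption, and the file names are placeholders:

  from crawl4ai import JsonCssExtractionStrategy, LLMConfig

  samples = [open(path, encoding="utf-8").read()
             for path in ("listing_a.html", "listing_b.html")]
  # Concatenate so the LLM sees DOM variation across pages; the separator
  # comment is illustrative, not an API requirement.
  combined = "\n<!-- NEXT SAMPLE -->\n".join(samples)

  schema = JsonCssExtractionStrategy.generate_schema(
      html=combined,
      query="product title and price",
      llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
  )
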
unclecode
0d3f9e65b0 Add MEMORY.md to gitignore 2025-12-30 03:04:30 +00:00
unclecode
db61ab8559 Update URL seeder docs with smart TTL cache parameters
- Add cache_ttl_hours and validate_sitemap_lastmod to parameter table
- Document smart TTL cache validation with examples
- Add cache-related troubleshooting entries
- Update key features summary
2025-12-30 03:03:41 +00:00
unclecode
3d78001c30 Add smart TTL cache for sitemap URL seeder
- Add cache_ttl_hours and validate_sitemap_lastmod params to SeedingConfig
- New JSON cache format with metadata (version, created_at, lastmod, url_count)
- Cache validation by TTL expiry and sitemap lastmod comparison
- Auto-migration from old .jsonl to new .json format
- Fixes bug where incomplete cache was used indefinitely
2025-12-30 01:59:09 +00:00
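
A minimal usage sketch of the two new parameters; everything else is left at SeedingConfig defaults:

  import asyncio
  from crawl4ai import AsyncUrlSeeder, SeedingConfig

  async def main():
      config = SeedingConfig(
          source="sitemap",
          cache_ttl_hours=24,             # expire the cache after 24 hours
          validate_sitemap_lastmod=True,  # also re-fetch if the sitemap lastmod moved
      )
      async with AsyncUrlSeeder() as seeder:
          urls = await seeder.urls("example.com", config)
          print(len(urls))

  asyncio.run(main())
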
unclecode
2550f3d2d5 Add browser pipeline support for raw:/file:// URLs
- Add process_in_browser parameter to CrawlerRunConfig
- Route raw:/file:// URLs through _crawl_web() when browser operations needed
- Use page.set_content() instead of goto() for local content
- Fix cookie handling for non-HTTP URLs in browser_manager
- Auto-detect browser requirements: js_code, wait_for, screenshot, etc.
- Maintain fast path for raw:/file:// without browser params

Fixes #310
2025-12-27 12:32:42 +00:00
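
A minimal sketch of routing raw: content through the browser; process_in_browser comes from the commit, the sample HTML is illustrative:

  import asyncio
  from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

  async def main():
      # screenshot alone would trigger auto-detection; the flag makes it explicit
      config = CrawlerRunConfig(process_in_browser=True, screenshot=True)
      async with AsyncWebCrawler() as crawler:
          result = await crawler.arun(url="raw:<h1>Cached page</h1>", config=config)
          print(result.screenshot is not None)

  asyncio.run(main())
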
unclecode
a43256b27a Add proxy support to HTTP crawler strategy 2025-12-26 13:17:28 +00:00
unclecode
9e7f5aa44b Updates on proxy rotation and proxy configuration 2025-12-26 12:45:57 +00:00
unclecode
fde4e9f0c6 Add prefetch mode for two-phase deep crawling
- Add `prefetch` parameter to CrawlerRunConfig
- Add `quick_extract_links()` function for fast link extraction
- Add short-circuit in aprocess_html() for prefetch mode
- Add 42 tests (unit, integration, regression)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-25 01:55:08 +00:00
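
A minimal two-phase sketch, assuming prefetch mode still populates result.links:

  import asyncio
  from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

  async def main():
      async with AsyncWebCrawler() as crawler:
          # Phase 1: fast link discovery, skipping markdown/extraction work
          seed = await crawler.arun("https://example.com",
                                    config=CrawlerRunConfig(prefetch=True))
          # Phase 2: full processing only for the links worth keeping
          for link in seed.links.get("internal", [])[:5]:
              await crawler.arun(link["href"], config=CrawlerRunConfig())

  asyncio.run(main())
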
unclecode
3937efcf0b Add base_url parameter to CrawlerRunConfig for raw HTML processing
When processing raw: HTML (e.g., from cache), the URL parameter is meaningless
for markdown link resolution. This adds a base_url parameter that can be set
explicitly to provide proper URL resolution context.

Changes:
- Add base_url parameter to CrawlerRunConfig.__init__
- Add base_url to CrawlerRunConfig.from_kwargs
- Update aprocess_html to use base_url for markdown generation

Usage:
  config = CrawlerRunConfig(base_url='https://example.com')
  result = await crawler.arun(url=f'raw:{html}', config=config)
2025-12-24 06:05:55 +00:00
unclecode
624e34164d Fix: HTTP strategy raw: URL parsing truncates at # character
The AsyncHTTPCrawlerStrategy.crawl() method used urlparse() to extract
content from raw: URLs. This caused HTML with CSS color codes like #eee
to be truncated because # is treated as a URL fragment delimiter.

Before: raw:body{background:#eee} -> parsed.path = 'body{background:'
After:  raw:body{background:#eee} -> raw_content = 'body{background:#eee}'

Fix: Strip the raw: or raw:// prefix directly instead of using urlparse,
matching how the browser strategy handles it.
2025-12-24 04:31:57 +00:00
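
A minimal sketch of the prefix-stripping fix; the helper name is invented for illustration:

  def strip_raw_prefix(url: str) -> str:
      # urlparse() treats '#' as a fragment delimiter, so strip directly
      if url.startswith("raw://"):
          return url[len("raw://"):]
      return url[len("raw:"):]

  assert strip_raw_prefix("raw:body{background:#eee}") == "body{background:#eee}"
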
unclecode
31ebf37252 Add crash recovery for deep crawl strategies
Add optional resume_state and on_state_change parameters to all deep
crawl strategies (BFS, DFS, Best-First) for cloud deployment crash
recovery.

Features:
- resume_state: Pass saved state to resume from checkpoint
- on_state_change: Async callback fired after each URL for real-time
  state persistence to external storage (Redis, DB, etc.)
- export_state(): Get last captured state manually
- Zero overhead when features are disabled (None defaults)

State includes visited URLs, pending queue/stack, depths, and
pages_crawled count. All state is JSON-serializable.
2025-12-22 14:51:10 +00:00
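
A minimal checkpointing sketch with the new parameters; the file-based callback stands in for Redis/DB persistence:

  import json
  from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

  async def persist(state: dict):
      # fired after each URL; everything in state is JSON-serializable
      with open("crawl_state.json", "w") as f:
          json.dump(state, f)

  strategy = BFSDeepCrawlStrategy(
      max_depth=2,
      resume_state=None,        # or a previously saved state dict to resume
      on_state_change=persist,  # real-time checkpoint after each URL
  )
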
unclecode
67e03d64b8 Add PDF and MHTML support for raw: and file:// URLs
- Replace _generate_screenshot_from_html with _generate_media_from_html
- New method handles screenshot, PDF, and MHTML in one browser session
- Update raw: and file:// URL handlers to use new method
- Enables cached HTML to generate all media types
2025-12-22 01:24:51 +00:00
unclecode
444cb14f82 Add _generate_screenshot_from_html for raw: and file:// URLs
Implements the missing method that was being called but never defined.
Now raw: and file:// URLs can generate screenshots by:
1. Loading HTML into a browser page via page.set_content()
2. Taking screenshot using existing take_screenshot() method
3. Cleaning up the page afterward

This enables cached HTML to be rendered with screenshots in crawl4ai-cloud.
2025-12-22 01:10:20 +00:00
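
A hedged sketch of the same three steps using plain Playwright (the in-tree method reuses crawl4ai's existing take_screenshot() instead):

  from playwright.async_api import async_playwright

  async def screenshot_from_html(html: str) -> bytes:
      async with async_playwright() as p:
          browser = await p.chromium.launch()
          page = await browser.new_page()
          try:
              await page.set_content(html)    # 1. load HTML without navigating
              return await page.screenshot()  # 2. capture the rendered page
          finally:
              await page.close()              # 3. clean up afterward
              await browser.close()
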
unclecode
48426f73f0 Some debugging for caching 2025-12-21 04:48:03 +00:00
unclecode
f6b29a8f9f Update gitignore 2025-12-21 04:48:03 +00:00
unclecode
02acad1dc6 Fix CDP connection handling: support WS URLs and proper cleanup
Changes to browser_manager.py:

1. _verify_cdp_ready(): Support multiple URL formats
   - WebSocket URLs (ws://, wss://): Skip HTTP verification, Playwright handles directly
   - HTTP URLs with query params: Properly parse with urlparse to preserve query string
   - Fixes issue where naive f"{cdp_url}/json/version" broke WS URLs and query params

2. close(): Proper cleanup when cdp_cleanup_on_close=True
   - Close all sessions (pages)
   - Close all contexts
   - Call browser.close() to disconnect (doesn't terminate browser, just releases connection)
   - Wait 1 second for CDP connection to fully release
   - Stop Playwright instance to prevent memory leaks

This enables:
- Connecting to specific browsers via WS URL
- Reusing the same browser with multiple sequential connections
- No user wait needed between connections (internal 1s delay handles it)

Added tests/browser/test_cdp_cleanup_reuse.py with comprehensive tests.
2025-12-18 22:04:52 +08:00
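
A minimal sketch of the URL handling in item 1; the helper name is invented for illustration:

  from typing import Optional
  from urllib.parse import urlparse

  def cdp_version_url(cdp_url: str) -> Optional[str]:
      parsed = urlparse(cdp_url)
      if parsed.scheme in ("ws", "wss"):
          return None  # skip HTTP verification; Playwright consumes WS URLs directly
      # rebuild instead of naively appending, so any ?query string survives
      url = f"{parsed.scheme}://{parsed.netloc}/json/version"
      return f"{url}?{parsed.query}" if parsed.query else url
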
unclecode
d10ca38599 Add init_scripts support to BrowserConfig for pre-page-load JS injection
This adds the ability to inject JavaScript that runs before any page loads,
useful for stealth evasions (canvas/audio fingerprinting, userAgentData).

- Add init_scripts parameter to BrowserConfig (list of JS strings)
- Apply init_scripts in setup_context() via context.add_init_script()
- Update from_kwargs() and to_dict() for serialization
2025-12-14 01:58:11 +00:00
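
A minimal usage sketch; the fingerprint-evasion snippet is illustrative:

  from crawl4ai import AsyncWebCrawler, BrowserConfig

  browser_config = BrowserConfig(
      init_scripts=[
          # runs in every new context before any page script executes
          "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});",
      ],
  )
  crawler = AsyncWebCrawler(config=browser_config)
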
unclecode
ecedb6113e Add context caching to create_isolated_context branch
Uses contexts_by_config cache (same as non-CDP mode) to reuse contexts
for multiple URLs with same config. Still creates new page per crawl
for navigation isolation. Benefits batch/deep crawls.
2025-12-13 08:58:21 +00:00
unclecode
55eb968a8d Add create_isolated_context flag for concurrent CDP crawls
When True, forces creation of a new browser context instead of reusing
the default context. Essential for concurrent crawls on the same browser
to prevent navigation conflicts.
2025-12-13 08:29:05 +00:00
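
A hedged usage sketch; that the flag lives on BrowserConfig next to the other CDP options is an assumption, and the cdp_url is a placeholder:

  from crawl4ai import BrowserConfig

  config = BrowserConfig(
      cdp_url="ws://browser-pool:9222/devtools/browser/abc123",  # placeholder
      create_isolated_context=True,  # fresh context per connection
  )
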
unclecode
6185d3cb32 Revert context matching attempts - Playwright cannot see CDP-created contexts 2025-12-13 07:57:29 +00:00
unclecode
8014805c17 Fix: use CDP to find context by browserContextId for concurrent sessions 2025-12-13 07:02:23 +00:00
unclecode
c1e485e0b0 Fix: use target_id to find correct page in get_page 2025-12-13 06:51:54 +00:00
unclecode
b2e4a1f2e3 Fix: find context by target_id for concurrent CDP connections 2025-12-13 06:41:13 +00:00
unclecode
d22825eea4 Fix: add cdp_cleanup_on_close to from_kwargs 2025-12-13 06:33:26 +00:00
unclecode
66941a59e8 Add cdp_cleanup_on_close flag to prevent memory leaks in cloud/server scenarios 2025-12-13 06:25:25 +00:00
unclecode
8ae908bede Add browser_context_id and target_id parameters to BrowserConfig
Enable Crawl4AI to connect to pre-created CDP browser contexts, which is
essential for cloud browser services that pre-create isolated contexts.

Changes:
- Add browser_context_id and target_id parameters to BrowserConfig
- Update from_kwargs() and to_dict() methods
- Modify BrowserManager.start() to use existing context when provided
- Add _get_page_by_target_id() helper method
- Update get_page() to handle pre-existing targets
- Add test for browser_context_id functionality

This enables cloud services to:
1. Create isolated CDP contexts before Crawl4AI connects
2. Pass context/target IDs to BrowserConfig
3. Have Crawl4AI reuse existing contexts instead of creating new ones
2025-12-13 02:42:48 +00:00
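
A minimal usage sketch with placeholder IDs supplied by the cloud service:

  import asyncio
  from crawl4ai import AsyncWebCrawler, BrowserConfig

  async def main():
      config = BrowserConfig(
          cdp_url="http://localhost:9222",
          browser_context_id="PRE_CREATED_CONTEXT_ID",  # placeholder
          target_id="PRE_CREATED_TARGET_ID",            # placeholder
      )
      async with AsyncWebCrawler(config=config) as crawler:
          await crawler.arun("https://example.com")

  asyncio.run(main())
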
ntohidi
306ddcbf3d Merge branch 'main' into develop 2025-12-11 11:18:30 +01:00
Nasrin
a87e8c1c9e Release/v0.7.8 (#1662)
* Fix: Use correct URL variable for raw HTML extraction (#1116)

- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing

Fix: Wrong URL variable used for extraction of raw html

* Fix #1181: Preserve whitespace in code blocks during HTML scraping

  The remove_empty_elements_fast() method was removing whitespace-only
  span elements inside <pre> and <code> tags, causing import statements
  like "import torch" to become "importtorch". Now skips elements inside
  code blocks where whitespace is significant.

* Refactor Pydantic model configuration to use ConfigDict for arbitrary types

* Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621

* Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638

* fix: ensure BrowserConfig.to_dict serializes proxy_config

* feat: make LLM backoff configurable end-to-end

- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides

* reproduced AttributeError from #1642

* pass timeout parameter to docker client request

* added missing deep crawling objects to init

* generalized query in ContentRelevanceFilter to be a str or list

* import modules from enhanceable deserialization

* parameterized tests

* Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268

* refactor: replace PyPDF2 with pypdf across the codebase. ref #1412

* announcement: add application form for cloud API closed beta

* Release v0.7.8: Stability & Bug Fix Release

- Updated version to 0.7.8
- Introduced a focused stability release addressing 11 community-reported bugs.
- Key fixes include Docker API improvements, LLM extraction enhancements, URL handling corrections, and dependency updates.
- Added detailed release notes for v0.7.8 in the blog and created a dedicated verification script to ensure all fixes are functioning as intended.
- Updated documentation to reflect recent changes and improvements.

* docs: add section for Crawl4AI Cloud API closed beta with application link

* fix: add disk cleanup step to Docker workflow

---------

Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: Aravind Karnam <aravind.karanam@gmail.com>
2025-12-11 11:04:52 +01:00
UncleCode
835e3c56fe Add disk cleanup step in Docker release workflow
Added a step to free up disk space before the build process.
2025-12-11 09:49:27 +01:00
Nasrin
5a8fb57795 Merge pull request #1648 from christopher-w-murphy/fix/content-relevance-filter
[Fix]: Docker server does not decode ContentRelevanceFilter
2025-12-03 18:36:07 +08:00
ntohidi
df4d87ed78 refactor: replace PyPDF2 with pypdf across the codebase. ref #1412 2025-12-03 10:59:18 +01:00
Nasrin
f32cfc6db0 Merge pull request #1645 from unclecode/fix/configurable-backoff
Make LLM backoff configurable end-to-end
2025-12-02 21:07:49 +08:00
Nasrin
d06c39e8ab Merge pull request #1641 from unclecode/fix/serialize-proxy-config
Fix BrowserConfig proxy_config serialization
2025-12-02 21:06:02 +08:00
ntohidi
afc31e144a Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-12-02 13:01:11 +01:00
ntohidi
07ccf13be6 Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268 2025-12-02 13:00:54 +01:00
Aravind
3a07c5962c Sponsors/new (#1643) 2025-12-02 00:49:39 +01:00
Chris Murphy
6893094f58 parameterized tests 2025-12-01 16:19:19 -05:00
Chris Murphy
3a8f8298d3 import modules from enhanceable deserialization 2025-12-01 16:18:59 -05:00
Chris Murphy
e95e8e1a97 generalized query in ContentRelevanceFilter to be a str or list 2025-12-01 16:16:31 -05:00
Chris Murphy
eb76df2c0d added missing deep crawling objects to init 2025-12-01 16:15:58 -05:00
Chris Murphy
6ec6bc4d8a pass timeout parameter to docker client request 2025-12-01 16:15:27 -05:00
Chris Murphy
33a3cc3933 reproduced AttributeError from #1642 2025-12-01 11:31:07 -05:00
Soham Kukreti
7a133e22cc feat: make LLM backoff configurable end-to-end
- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides
2025-11-28 18:50:04 +05:30
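
A hedged sketch of the configurable backoff; the field names below are assumptions, since the commit only says LLMConfig grew delay/attempt/factor fields:

  from crawl4ai import LLMConfig

  llm_config = LLMConfig(
      provider="openai/gpt-4o-mini",
      backoff_base_delay=1.0,   # hypothetical name
      backoff_max_attempts=5,   # hypothetical name
      backoff_factor=2.0,       # hypothetical name
  )
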
Nasrin
dcb77c94bf Merge pull request #1623 from unclecode/fix/deprecated_pydantic
Refactor Pydantic model configuration to use ConfigDict for arbitrary…
2025-11-27 20:05:42 +08:00
Soham Kukreti
a0c5f0f79a fix: ensure BrowserConfig.to_dict serializes proxy_config 2025-11-26 17:44:06 +05:30
ntohidi
b36c6daa5c Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638 2025-11-25 11:51:59 +01:00