From 530cde351f27a0e09180b07bfd72e9df38bc2a70 Mon Sep 17 00:00:00 2001 From: unclecode Date: Mon, 12 Jan 2026 13:45:42 +0000 Subject: [PATCH] Add release notes for v0.8.0, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates Documentation for v0.8.0 release: - SECURITY.md: Security policy and vulnerability reporting guidelines - RELEASE_NOTES_v0.8.0.md: Comprehensive release notes - migration/v0.8.0-upgrade-guide.md: Step-by-step migration guide - security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts - CHANGELOG.md: Updated with v0.8.0 changes Breaking changes documented: - Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED) - file:// URLs blocked on Docker API endpoints Security fixes credited to Neo by ProjectDiscovery --- CHANGELOG.md | 40 +++ SECURITY.md | 122 ++++++++ docs/CHANGES_NOTES_FOR_V.0.7.9_RELEASE.md | 335 ---------------------- docs/RELEASE_NOTES_v0.8.0.md | 243 ++++++++++++++++ docs/migration/v0.8.0-upgrade-guide.md | 301 +++++++++++++++++++ docs/security/GHSA-DRAFT-RCE-LFI.md | 171 +++++++++++ 6 files changed, 877 insertions(+), 335 deletions(-) create mode 100644 SECURITY.md delete mode 100644 docs/CHANGES_NOTES_FOR_V.0.7.9_RELEASE.md create mode 100644 docs/RELEASE_NOTES_v0.8.0.md create mode 100644 docs/migration/v0.8.0-upgrade-guide.md create mode 100644 docs/security/GHSA-DRAFT-RCE-LFI.md diff --git a/CHANGELOG.md b/CHANGELOG.md index ce63516f..433160eb 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,6 +5,46 @@ All notable changes to Crawl4AI will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). 
+## [0.8.0] - 2026-01-12 + +### Security +- **🔒 CRITICAL: Remote Code Execution Fix**: Removed `__import__` from hook allowed builtins + - Prevents arbitrary module imports in user-provided hook code + - Hooks now disabled by default via `CRAWL4AI_HOOKS_ENABLED` environment variable + - Credit: Neo by ProjectDiscovery +- **🔒 HIGH: Local File Inclusion Fix**: Added URL scheme validation to Docker API endpoints + - Blocks `file://`, `javascript:`, `data:` URLs on `/execute_js`, `/screenshot`, `/pdf`, `/html` + - Only allows `http://`, `https://`, and `raw:` URLs + - Credit: Neo by ProjectDiscovery + +### Breaking Changes +- **Docker API: Hooks disabled by default**: Set `CRAWL4AI_HOOKS_ENABLED=true` to enable +- **Docker API: file:// URLs blocked**: Use Python library directly for local file processing + +### Added +- **🚀 init_scripts for BrowserConfig**: Pre-page-load JavaScript injection for stealth evasions +- **🔄 CDP Connection Improvements**: WebSocket URL support, proper cleanup, browser reuse +- **💾 Crash Recovery for Deep Crawl**: `resume_state` and `on_state_change` for BFS/DFS/Best-First strategies +- **📄 PDF/MHTML for raw:/file:// URLs**: Generate PDFs and MHTML from cached HTML content +- **📸 Screenshots for raw:/file:// URLs**: Render cached HTML and capture screenshots +- **🔗 base_url Parameter**: Proper URL resolution for raw: HTML processing +- **⚡ Prefetch Mode**: Two-phase deep crawling with fast link extraction +- **🔀 Enhanced Proxy Support**: Improved proxy rotation and sticky sessions +- **🌐 HTTP Strategy Proxy Support**: Non-browser crawler now supports proxies +- **🖥️ Browser Pipeline for raw:/file://**: New `process_in_browser` parameter +- **📋 Smart TTL Cache for Sitemap Seeder**: `cache_ttl_hours` and `validate_sitemap_lastmod` parameters +- **📚 Security Documentation**: Added SECURITY.md with vulnerability reporting guidelines + +### Fixed +- **raw: URL Parsing**: Fixed truncation at `#` character (CSS color codes like `#eee`) +- **Caching 
System**: Various improvements to cache validation and persistence + +### Documentation +- Multi-sample schema generation section +- URL seeder smart TTL cache parameters +- v0.8.0 migration guide +- Security policy and disclosure process + ## [Unreleased] ### Added diff --git a/SECURITY.md b/SECURITY.md new file mode 100644 index 00000000..745e0780 --- /dev/null +++ b/SECURITY.md @@ -0,0 +1,122 @@ +# Security Policy + +## Supported Versions + +| Version | Supported | +| ------- | ------------------ | +| 0.8.x | :white_check_mark: | +| 0.7.x | :x: (upgrade recommended) | +| < 0.7 | :x: | + +## Reporting a Vulnerability + +We take security vulnerabilities seriously. If you discover a security issue, please report it responsibly. + +### How to Report + +**DO NOT** open a public GitHub issue for security vulnerabilities. + +Instead, please report via one of these methods: + +1. **GitHub Security Advisories (Preferred)** + - Go to [Security Advisories](https://github.com/unclecode/crawl4ai/security/advisories) + - Click "New draft security advisory" + - Fill in the details + +2. **Email** + - Send details to: security@crawl4ai.com + - Use subject: `[SECURITY] Brief description` + - Include: + - Description of the vulnerability + - Steps to reproduce + - Potential impact + - Any suggested fixes + +### What to Expect + +- **Acknowledgment**: Within 48 hours +- **Initial Assessment**: Within 7 days +- **Resolution Timeline**: Depends on severity + - Critical: 24-72 hours + - High: 7 days + - Medium: 30 days + - Low: 90 days + +### Disclosure Policy + +- We follow responsible disclosure practices +- We will coordinate with you on disclosure timing +- Credit will be given to reporters (unless anonymity is requested) +- We may request CVE assignment for significant vulnerabilities + +## Security Best Practices for Users + +### Docker API Deployment + +If you're running the Crawl4AI Docker API in production: + +1. 
**Enable Authentication** + ```yaml + # config.yml + security: + enabled: true + jwt_enabled: true + ``` + ```bash + # Set a strong secret key + export SECRET_KEY="your-secure-random-key-here" + ``` + +2. **Hooks are Disabled by Default** (v0.8.0+) + - Only enable if you trust all API users + - Set `CRAWL4AI_HOOKS_ENABLED=true` only when necessary + +3. **Network Security** + - Run behind a reverse proxy (nginx, traefik) + - Use HTTPS in production + - Restrict access to trusted IPs if possible + +4. **Container Security** + - Run as non-root user (default in our container) + - Use read-only filesystem where possible + - Limit container resources + +### Library Usage + +When using Crawl4AI as a Python library: + +1. **Validate URLs** before crawling untrusted input +2. **Sanitize extracted content** before using in other systems +3. **Be cautious with hooks** - they execute arbitrary code + +## Known Security Issues + +### Fixed in v0.8.0 + +| ID | Severity | Description | Fix | +|----|----------|-------------|-----| +| CVE-pending-1 | CRITICAL | RCE via hooks `__import__` | Removed from allowed builtins | +| CVE-pending-2 | HIGH | LFI via `file://` URLs | URL scheme validation added | + +See [Security Advisory](https://github.com/unclecode/crawl4ai/security/advisories) for details. 
+ +## Security Features + +### v0.8.0+ + +- **URL Scheme Validation**: Blocks `file://`, `javascript:`, `data:` URLs on API +- **Hooks Disabled by Default**: Opt-in via `CRAWL4AI_HOOKS_ENABLED=true` +- **Restricted Hook Builtins**: No `__import__`, `eval`, `exec`, `open` +- **JWT Authentication**: Optional but recommended for production +- **Rate Limiting**: Configurable request limits +- **Security Headers**: X-Frame-Options, CSP, HSTS when enabled + +## Acknowledgments + +We thank the following security researchers for responsibly disclosing vulnerabilities: + +- **Neo by ProjectDiscovery** - RCE and LFI vulnerabilities (December 2025) + +--- + +*Last updated: January 2026* diff --git a/docs/CHANGES_NOTES_FOR_V.0.7.9_RELEASE.md b/docs/CHANGES_NOTES_FOR_V.0.7.9_RELEASE.md deleted file mode 100644 index 2a6110f4..00000000 --- a/docs/CHANGES_NOTES_FOR_V.0.7.9_RELEASE.md +++ /dev/null @@ -1,335 +0,0 @@ -# Crawl4AI Release Notes v0.7.9 - -**Period**: December 13, 2025 - January 12, 2026 -**Total Commits**: 18 -**Status**: DRAFT - Pending review - ---- - -## Breaking Changes ⚠️ - -### 1. Docker API Hooks Disabled by Default -- **Commit**: `f24396c` (Jan 12, 2026) -- **Reason**: Security fix for RCE vulnerability -- **Impact**: Hooks no longer work unless `CRAWL4AI_HOOKS_ENABLED=true` is set -- **Migration**: Set environment variable `CRAWL4AI_HOOKS_ENABLED=true` in Docker container - -### 2. 
file:// URLs Blocked on Docker API -- **Commit**: `f24396c` (Jan 12, 2026) -- **Reason**: Security fix for LFI vulnerability -- **Impact**: Endpoints `/execute_js`, `/screenshot`, `/pdf`, `/html` reject `file://` URLs -- **Migration**: Use the library directly for local file processing - ---- - -## Security Fixes 🔒 - -### Fix Critical RCE and LFI Vulnerabilities in Docker API -- **Commit**: `f24396c` -- **Date**: January 12, 2026 -- **Severity**: CRITICAL -- **Files Changed**: - - `deploy/docker/config.yml` - - `deploy/docker/hook_manager.py` - - `deploy/docker/server.py` - - `deploy/docker/tests/run_security_tests.py` - - `deploy/docker/tests/test_security_fixes.py` - -**Details**: -1. **Remote Code Execution via Hooks (CVE pending)** - - Removed `__import__` from allowed_builtins in hook_manager.py - - Prevents arbitrary module imports (os, subprocess, etc.) - - Hooks now disabled by default via `CRAWL4AI_HOOKS_ENABLED` env var - -2. **Local File Inclusion via file:// URLs (CVE pending)** - - Added URL scheme validation to `/execute_js`, `/screenshot`, `/pdf`, `/html` - - Blocks `file://`, `javascript:`, `data:` and other dangerous schemes - - Only allows `http://`, `https://`, and `raw:` (where appropriate) - -3. **Security hardening** - - Added `CRAWL4AI_HOOKS_ENABLED=false` as default (opt-in for hooks) - - Added security warning comments in config.yml - - Added `validate_url_scheme()` helper for consistent validation - ---- - -## New Features ✨ - -### 1. init_scripts Support for BrowserConfig -- **Commit**: `d10ca38` -- **Date**: December 14, 2025 -- **Files Changed**: - - `crawl4ai/async_configs.py` - - `crawl4ai/browser_manager.py` - -**Description**: Pre-page-load JavaScript injection capability. Useful for stealth evasions (canvas/audio fingerprinting, userAgentData). - -**Usage**: -```python -config = BrowserConfig( - init_scripts=["Object.defineProperty(navigator, 'webdriver', {get: () => false})"] -) -``` - ---- - -### 2. 
CDP Connection Improvements -- **Commit**: `02acad1` -- **Date**: December 18, 2025 -- **Files Changed**: - - `crawl4ai/browser_manager.py` - - `tests/browser/test_cdp_cleanup_reuse.py` - -**Description**: -- Support WebSocket URLs (ws://, wss://) for CDP connections -- Proper cleanup when `cdp_cleanup_on_close=True` -- Enables reusing the same browser with multiple sequential connections - ---- - -### 3. Crash Recovery for Deep Crawl Strategies -- **Commit**: `31ebf37` -- **Date**: December 22, 2025 -- **Files Changed**: - - `crawl4ai/deep_crawling/bff_strategy.py` - - `crawl4ai/deep_crawling/bfs_strategy.py` - - `crawl4ai/deep_crawling/dfs_strategy.py` - - `tests/deep_crawling/test_deep_crawl_resume.py` - - `tests/deep_crawling/test_deep_crawl_resume_integration.py` - -**Description**: Optional `resume_state` and `on_state_change` parameters for all deep crawl strategies (BFS, DFS, Best-First) enabling cloud deployment crash recovery. - -**Features**: -- `resume_state`: Pass saved state to resume from checkpoint -- `on_state_change`: Async callback fired after each URL for real-time state persistence -- `export_state()`: Get last captured state manually -- Zero overhead when features are disabled (None defaults) -- State is JSON-serializable - ---- - -### 4. PDF and MHTML Support for raw:/file:// URLs -- **Commit**: `67e03d6` -- **Date**: December 22, 2025 -- **Files Changed**: - - `crawl4ai/async_crawler_strategy.py` - -**Description**: Generate PDFs and MHTML from cached HTML content. Replaced `_generate_screenshot_from_html` with unified `_generate_media_from_html` method. - ---- - -### 5. Screenshots for raw:/file:// URLs -- **Commit**: `444cb14` -- **Date**: December 22, 2025 -- **Files Changed**: - - `crawl4ai/async_crawler_strategy.py` - -**Description**: Enables cached HTML to be rendered with screenshots. Loads HTML into browser via `page.set_content()` and takes screenshot. - ---- - -### 6. 
base_url Parameter for CrawlerRunConfig -- **Commit**: `3937efc` -- **Date**: December 24, 2025 -- **Files Changed**: - - `crawl4ai/async_configs.py` - - `crawl4ai/async_webcrawler.py` - -**Description**: When processing raw: HTML (e.g., from cache), provides proper URL resolution context for markdown link generation. - -**Usage**: -```python -config = CrawlerRunConfig(base_url='https://example.com') -result = await crawler.arun(url='raw:{html}', config=config) -``` - ---- - -### 7. Prefetch Mode for Two-Phase Deep Crawling -- **Commit**: `fde4e9f` -- **Date**: December 25, 2025 -- **Files Changed**: - - `crawl4ai/async_configs.py` - - `crawl4ai/async_webcrawler.py` - - `crawl4ai/utils.py` - - `tests/test_prefetch_integration.py` - - `tests/test_prefetch_mode.py` - - `tests/test_prefetch_regression.py` - -**Description**: -- Added `prefetch` parameter to CrawlerRunConfig -- Added `quick_extract_links()` function for fast link extraction -- Short-circuit in `aprocess_html()` for prefetch mode -- 42 tests added (unit, integration, regression) - ---- - -### 8. Proxy Rotation and Configuration Updates -- **Commit**: `9e7f5aa` -- **Date**: December 26, 2025 -- **Files Changed**: - - `crawl4ai/async_configs.py` - - `crawl4ai/async_webcrawler.py` - - `crawl4ai/proxy_strategy.py` - - `tests/proxy/test_sticky_sessions.py` - -**Description**: Enhanced proxy rotation and proxy configuration options. - ---- - -### 9. Proxy Support for HTTP Crawler Strategy -- **Commit**: `a43256b` -- **Date**: December 26, 2025 -- **Files Changed**: - - `crawl4ai/async_crawler_strategy.py` - -**Description**: Added proxy support to the HTTP (non-browser) crawler strategy. - ---- - -### 10. 
Browser Pipeline Support for raw:/file:// URLs -- **Commit**: `2550f3d` -- **Date**: December 27, 2025 -- **Files Changed**: - - `crawl4ai/async_configs.py` - - `crawl4ai/async_crawler_strategy.py` - - `crawl4ai/browser_manager.py` - - `tests/test_raw_html_browser.py` - - `tests/test_raw_html_edge_cases.py` - -**Description**: -- Added `process_in_browser` parameter to CrawlerRunConfig -- Routes raw:/file:// URLs through browser when browser operations needed -- Uses `page.set_content()` instead of `goto()` for local content -- Auto-detects browser requirements: js_code, wait_for, screenshot, etc. -- Maintains fast path for raw:/file:// without browser params - -**Fixes**: #310 - ---- - -### 11. Smart TTL Cache for Sitemap URL Seeder -- **Commit**: `3d78001` -- **Date**: December 30, 2025 -- **Files Changed**: - - `crawl4ai/async_configs.py` - - `crawl4ai/async_url_seeder.py` - -**Description**: -- Added `cache_ttl_hours` and `validate_sitemap_lastmod` params to SeedingConfig -- New JSON cache format with metadata (version, created_at, lastmod, url_count) -- Cache validation by TTL expiry and sitemap lastmod comparison -- Auto-migration from old .jsonl to new .json format -- Fixes bug where incomplete cache was used indefinitely - ---- - -## Bug Fixes 🐛 - -### 1. HTTP Strategy raw: URL Parsing Truncates at # Character -- **Commit**: `624e341` -- **Date**: December 24, 2025 -- **Files Changed**: - - `crawl4ai/async_crawler_strategy.py` - -**Problem**: `urlparse()` treated `#` as URL fragment delimiter, breaking CSS color codes like `#eee`. - -**Before**: `raw:body{background:#eee}` → parsed to `body{background:` -**After**: `raw:body{background:#eee}` → parsed to `body{background:#eee}` - -**Fix**: Strip `raw:` prefix directly instead of using urlparse. - ---- - -### 2. 
Caching Debugging and Fixes -- **Commit**: `48426f7` -- **Date**: December 21, 2025 -- **Files Changed**: - - `crawl4ai/async_configs.py` - - `crawl4ai/async_database.py` - - `crawl4ai/async_webcrawler.py` - - `crawl4ai/cache_validator.py` - - `crawl4ai/models.py` - - `crawl4ai/utils.py` - - `tests/cache_validation/` (multiple test files) - -**Description**: Various debugging and improvements to the caching system. - ---- - -## Documentation Updates 📚 - -### 1. Multi-Sample Schema Generation Section -- **Commit**: `6b2dca7` -- **Date**: January 4, 2026 -- **Files Changed**: - - `docs/md_v2/extraction/no-llm-strategies.md` - -**Description**: Added documentation explaining how to pass multiple HTML samples to `generate_schema()` for stable selectors that work across pages with varying DOM structures. - ---- - -### 2. URL Seeder Docs - Smart TTL Cache Parameters -- **Commit**: `db61ab8` -- **Date**: December 30, 2025 -- **Files Changed**: - - `docs/md_v2/core/url-seeding.md` - -**Description**: -- Added `cache_ttl_hours` and `validate_sitemap_lastmod` to parameter table -- Documented smart TTL cache validation with examples -- Added cache-related troubleshooting entries - ---- - -## Other Changes 🔧 - -| Date | Commit | Description | -|------|--------|-------------| -| Dec 30, 2025 | `0d3f9e6` | Add MEMORY.md to gitignore | -| Dec 21, 2025 | `f6b29a8` | Update gitignore | - ---- - -## Files Changed Summary - -### Core Library -- `crawl4ai/async_configs.py` - Multiple changes (init_scripts, base_url, prefetch, proxy, process_in_browser, cache TTL) -- `crawl4ai/async_webcrawler.py` - Prefetch mode, base_url, proxy -- `crawl4ai/async_crawler_strategy.py` - Media generation, proxy, raw URL fixes -- `crawl4ai/browser_manager.py` - init_scripts, CDP cleanup, cookie handling -- `crawl4ai/async_url_seeder.py` - Smart TTL cache -- `crawl4ai/proxy_strategy.py` - Proxy rotation -- `crawl4ai/deep_crawling/*.py` - Crash recovery for all strategies -- `crawl4ai/utils.py` - 
quick_extract_links, cache fixes - -### Docker Deployment -- `deploy/docker/server.py` - Security fixes -- `deploy/docker/hook_manager.py` - RCE fix -- `deploy/docker/config.yml` - Security warnings - -### Documentation -- `docs/md_v2/extraction/no-llm-strategies.md` - Schema generation -- `docs/md_v2/core/url-seeding.md` - Cache parameters - -### Tests Added -- `tests/test_prefetch_*.py` - 42 prefetch tests -- `tests/test_raw_html_*.py` - Raw HTML browser tests -- `tests/deep_crawling/test_deep_crawl_resume*.py` - Resume tests -- `tests/browser/test_cdp_cleanup_reuse.py` - CDP tests -- `tests/proxy/test_sticky_sessions.py` - Proxy tests -- `tests/cache_validation/*.py` - Cache tests -- `deploy/docker/tests/test_security_fixes.py` - Security tests -- `deploy/docker/tests/run_security_tests.py` - Security integration tests - ---- - -## Questions for Main Developer - -1. [ ] Are there any other breaking changes not captured here? -2. [ ] Should the security fixes get their own patch release? -3. [ ] Any features that need additional documentation? - ---- - -*Generated: January 12, 2026* diff --git a/docs/RELEASE_NOTES_v0.8.0.md b/docs/RELEASE_NOTES_v0.8.0.md new file mode 100644 index 00000000..bdae30e3 --- /dev/null +++ b/docs/RELEASE_NOTES_v0.8.0.md @@ -0,0 +1,243 @@ +# Crawl4AI v0.8.0 Release Notes + +**Release Date**: January 2026 +**Previous Version**: v0.7.6 +**Status**: Release Candidate + +--- + +## Highlights + +- **Critical Security Fixes** for Docker API deployment +- **11 New Features** including crash recovery, prefetch mode, and proxy improvements +- **Breaking Changes** - see migration guide below + +--- + +## Breaking Changes + +### 1. Docker API: Hooks Disabled by Default + +**What changed**: Hooks are now disabled by default on the Docker API. + +**Why**: Security fix for Remote Code Execution (RCE) vulnerability. + +**Who is affected**: Users of the Docker API who use the `hooks` parameter in `/crawl` requests. 
+ +**Migration**: +```bash +# To re-enable hooks (only if you trust all API users): +export CRAWL4AI_HOOKS_ENABLED=true +``` + +### 2. Docker API: file:// URLs Blocked + +**What changed**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` now reject `file://` URLs. + +**Why**: Security fix for Local File Inclusion (LFI) vulnerability. + +**Who is affected**: Users who were reading local files via the Docker API. + +**Migration**: Use the Python library directly for local file processing: +```python +# Instead of API call with file:// URL, use library: +from crawl4ai import AsyncWebCrawler +async with AsyncWebCrawler() as crawler: + result = await crawler.arun(url="file:///path/to/file.html") +``` + +--- + +## Security Fixes + +### Critical: Remote Code Execution via Hooks (CVE Pending) + +**Severity**: CRITICAL (CVSS 10.0) +**Affected**: Docker API deployment (all versions before v0.8.0) +**Vector**: `POST /crawl` with malicious `hooks` parameter + +**Details**: The `__import__` builtin was available in hook code, allowing attackers to import `os`, `subprocess`, etc. and execute arbitrary commands. + +**Fix**: +1. Removed `__import__` from allowed builtins +2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`) + +### High: Local File Inclusion via file:// URLs (CVE Pending) + +**Severity**: HIGH (CVSS 8.6) +**Affected**: Docker API deployment (all versions before v0.8.0) +**Vector**: `POST /execute_js` (and other endpoints) with `file:///etc/passwd` + +**Details**: API endpoints accepted `file://` URLs, allowing attackers to read arbitrary files from the server. + +**Fix**: URL scheme validation now only allows `http://`, `https://`, and `raw:` URLs. + +### Credits + +Discovered by **Neo by ProjectDiscovery** ([projectdiscovery.io](https://projectdiscovery.io)) - December 2025 + +--- + +## New Features + +### 1. init_scripts Support for BrowserConfig + +Pre-page-load JavaScript injection for stealth evasions. 
+ +```python +config = BrowserConfig( + init_scripts=[ + "Object.defineProperty(navigator, 'webdriver', {get: () => false})" + ] +) +``` + +### 2. CDP Connection Improvements + +- WebSocket URL support (`ws://`, `wss://`) +- Proper cleanup with `cdp_cleanup_on_close=True` +- Browser reuse across multiple connections + +### 3. Crash Recovery for Deep Crawl Strategies + +All deep crawl strategies (BFS, DFS, Best-First) now support crash recovery: + +```python +from crawl4ai.deep_crawling import BFSDeepCrawlStrategy + +strategy = BFSDeepCrawlStrategy( + max_depth=3, + resume_state=saved_state, # Resume from checkpoint + on_state_change=save_callback # Persist state in real-time +) +``` + +### 4. PDF and MHTML for raw:/file:// URLs + +Generate PDFs and MHTML from cached HTML content. + +### 5. Screenshots for raw:/file:// URLs + +Render cached HTML and capture screenshots. + +### 6. base_url Parameter for CrawlerRunConfig + +Proper URL resolution for raw: HTML processing: + +```python +config = CrawlerRunConfig(base_url='https://example.com') +result = await crawler.arun(url='raw:{html}', config=config) +``` + +### 7. Prefetch Mode for Two-Phase Deep Crawling + +Fast link extraction without full page processing: + +```python +config = CrawlerRunConfig(prefetch=True) +``` + +### 8. Proxy Rotation and Configuration + +Enhanced proxy rotation with sticky sessions support. + +### 9. Proxy Support for HTTP Strategy + +Non-browser crawler now supports proxies. + +### 10. Browser Pipeline for raw:/file:// URLs + +New `process_in_browser` parameter for browser operations on local content: + +```python +config = CrawlerRunConfig( + process_in_browser=True, # Force browser processing + screenshot=True +) +result = await crawler.arun(url='raw:...', config=config) +``` + +### 11. 
Smart TTL Cache for Sitemap URL Seeder + +Intelligent cache invalidation for sitemaps: + +```python +config = SeedingConfig( + cache_ttl_hours=24, + validate_sitemap_lastmod=True +) +``` + +--- + +## Bug Fixes + +### raw: URL Parsing Truncates at # Character + +**Problem**: CSS color codes like `#eee` were being truncated. + +**Before**: `raw:body{background:#eee}` → `body{background:` +**After**: `raw:body{background:#eee}` → `body{background:#eee}` + +### Caching System Improvements + +Various fixes to cache validation and persistence. + +--- + +## Documentation Updates + +- Multi-sample schema generation documentation +- URL seeder smart TTL cache parameters +- Security documentation (SECURITY.md) + +--- + +## Upgrade Guide + +### From v0.7.x to v0.8.0 + +1. **Update the package**: + ```bash + pip install --upgrade crawl4ai + ``` + +2. **Docker API users**: + - Hooks are now disabled by default + - If you need hooks: `export CRAWL4AI_HOOKS_ENABLED=true` + - `file://` URLs no longer work on API (use library directly) + +3. **Review security settings**: + ```yaml + # config.yml - recommended for production + security: + enabled: true + jwt_enabled: true + ``` + +4. **Test your integration** before deploying to production + +### Breaking Change Checklist + +- [ ] Check if you use `hooks` parameter in API calls +- [ ] Check if you use `file://` URLs via the API +- [ ] Update environment variables if needed +- [ ] Review security configuration + +--- + +## Full Changelog + +See [CHANGELOG.md](../CHANGELOG.md) for complete version history. + +--- + +## Contributors + +Thanks to all contributors who made this release possible. + +Special thanks to **Neo by ProjectDiscovery** for responsible security disclosure. 
+ +--- + +*For questions or issues, please open a [GitHub Issue](https://github.com/unclecode/crawl4ai/issues).* diff --git a/docs/migration/v0.8.0-upgrade-guide.md b/docs/migration/v0.8.0-upgrade-guide.md new file mode 100644 index 00000000..a3104752 --- /dev/null +++ b/docs/migration/v0.8.0-upgrade-guide.md @@ -0,0 +1,301 @@ +# Migration Guide: Upgrading to Crawl4AI v0.8.0 + +This guide helps you upgrade from v0.7.x to v0.8.0, with special attention to breaking changes and security updates. + +## Quick Summary + +| Change | Impact | Action Required | +|--------|--------|-----------------| +| Hooks disabled by default | Docker API users with hooks | Set `CRAWL4AI_HOOKS_ENABLED=true` | +| file:// URLs blocked | Docker API users reading local files | Use Python library directly | +| Security fixes | All Docker API users | Update immediately | + +--- + +## Step 1: Update the Package + +### PyPI Installation + +```bash +pip install --upgrade crawl4ai +``` + +### Docker Installation + +```bash +docker pull unclecode/crawl4ai:latest +# or +docker pull unclecode/crawl4ai:0.8.0 +``` + +### From Source + +```bash +git pull origin main +pip install -e . +``` + +--- + +## Step 2: Check for Breaking Changes + +### Are You Affected? 
+ +**You ARE affected if you:** +- Use the Docker API deployment +- Use the `hooks` parameter in `/crawl` requests +- Use `file://` URLs via API endpoints + +**You are NOT affected if you:** +- Only use Crawl4AI as a Python library +- Don't use hooks in your API calls +- Don't use `file://` URLs via the API + +--- + +## Step 3: Migrate Hooks Usage + +### Before v0.8.0 + +Hooks worked by default: + +```bash +# This worked without any configuration +curl -X POST http://localhost:11235/crawl \ + -H "Content-Type: application/json" \ + -d '{ + "urls": ["https://example.com"], + "hooks": { + "code": { + "on_page_context_created": "async def hook(page, context, **kwargs):\n await context.add_cookies([...])\n return page" + } + } + }' +``` + +### After v0.8.0 + +You must explicitly enable hooks: + +**Option A: Environment Variable (Recommended)** +```bash +# In your Docker run command or docker-compose.yml +export CRAWL4AI_HOOKS_ENABLED=true +``` + +```yaml +# docker-compose.yml +services: + crawl4ai: + image: unclecode/crawl4ai:0.8.0 + environment: + - CRAWL4AI_HOOKS_ENABLED=true +``` + +**Option B: For Kubernetes** +```yaml +env: + - name: CRAWL4AI_HOOKS_ENABLED + value: "true" +``` + +### Security Warning + +Only enable hooks if: +- You trust all users who can access the API +- The API is not exposed to the public internet +- You have other authentication/authorization in place + +--- + +## Step 4: Migrate file:// URL Usage + +### Before v0.8.0 + +```bash +# This worked via API +curl -X POST http://localhost:11235/execute_js \ + -d '{"url": "file:///var/data/page.html", "scripts": ["document.title"]}' +``` + +### After v0.8.0 + +**Option A: Use the Python Library Directly** + +```python +from crawl4ai import AsyncWebCrawler, CrawlerRunConfig + +async def process_local_file(): + async with AsyncWebCrawler() as crawler: + result = await crawler.arun( + url="file:///var/data/page.html", + config=CrawlerRunConfig(js_code=["document.title"]) + ) + return result +``` + 
+**Option B: Use raw: Protocol for HTML Content**
+
+If you have the HTML content, you can still use the API:
+
+```bash
+# JSON-encode the file with jq so quotes and newlines in the HTML stay valid JSON
+jq -Rs '{url: ("raw:" + .)}' /var/data/page.html |
+curl -X POST http://localhost:11235/html \
+  -H "Content-Type: application/json" \
+  --data-binary @-
+```
+
+**Option C: Create a Preprocessing Service**
+
+```python
+# preprocessing_service.py (expose on a trusted network only: it reads any path it is given)
+from fastapi import FastAPI
+from crawl4ai import AsyncWebCrawler
+
+app = FastAPI()
+
+@app.post("/process-local")
+async def process_local(file_path: str):
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(url=f"file://{file_path}")
+        return result.model_dump()
+```
+
+---
+
+## Step 5: Review Security Configuration
+
+### Recommended Production Settings
+
+```yaml
+# config.yml
+security:
+  enabled: true
+  jwt_enabled: true
+  https_redirect: true  # If behind HTTPS proxy
+  trusted_hosts:
+    - "your-domain.com"
+    - "api.your-domain.com"
+```
+
+### Environment Variables
+
+```bash
+# Required for JWT authentication
+export SECRET_KEY="your-secure-random-key-minimum-32-characters"
+
+# Only if you need hooks
+export CRAWL4AI_HOOKS_ENABLED=true
+```
+
+### Generate a Secure Secret Key
+
+```python
+import secrets
+print(secrets.token_urlsafe(32))
+```
+
+---
+
+## Step 6: Test Your Integration
+
+### Quick Validation Script
+
+```python
+import asyncio
+import aiohttp
+
+async def test_upgrade():
+    base_url = "http://localhost:11235"
+
+    # Test 1: Basic crawl should work
+    async with aiohttp.ClientSession() as session:
+        async with session.post(
+            f"{base_url}/crawl",
+            json={"urls": ["https://example.com"]}
+        ) as resp:
+            assert resp.status == 200, "Basic crawl failed"
+            print("✓ Basic crawl works")
+
+    # Test 2: Hooks should be blocked (unless enabled)
+    async with aiohttp.ClientSession() as session:
+        async with session.post(
+            f"{base_url}/crawl",
+            json={
+                "urls": ["https://example.com"],
+                "hooks": {"code": {"on_page_context_created": "async def hook(page, context, **kwargs): return page"}}
+            }
+        ) as resp:
+            if resp.status == 403:
+                print("✓ Hooks correctly blocked (default)")
+            elif resp.status == 200:
+                print("! Hooks enabled - ensure this is intentional")
+
+    # Test 3: file:// should be blocked
+    async with aiohttp.ClientSession() as session:
+        async with session.post(
+            f"{base_url}/execute_js",
+            json={"url": "file:///etc/passwd", "scripts": ["1"]}
+        ) as resp:
+            assert resp.status == 400, "file:// should be blocked"
+            print("✓ file:// URLs correctly blocked")
+
+asyncio.run(test_upgrade())
+```
+
+---
+
+## Troubleshooting
+
+### "Hooks are disabled" Error
+
+**Symptom**: API returns 403 with "Hooks are disabled"
+
+**Solution**: Set `CRAWL4AI_HOOKS_ENABLED=true` if you need hooks
+
+### "URL must start with http://, https://" Error
+
+**Symptom**: API returns 400 when using `file://` URLs
+
+**Solution**: Use the Python library directly or the `raw:` protocol
+
+### Authentication Errors After Enabling JWT
+
+**Symptom**: API returns 401 Unauthorized
+
+**Solution**:
+1. Get a token: `POST /token` with your email
+2. Include token in requests: `Authorization: Bearer <token>`
+
+---
+
+## Rollback Plan
+
+If you need to roll back:
+
+```bash
+# PyPI
+pip install crawl4ai==0.7.6
+
+# Docker
+docker pull unclecode/crawl4ai:0.7.6
+```
+
+**Warning**: Rolling back re-exposes the security vulnerabilities. Only do this temporarily while fixing integration issues.
+
+---
+
+## Getting Help
+
+- **GitHub Issues**: [github.com/unclecode/crawl4ai/issues](https://github.com/unclecode/crawl4ai/issues)
+- **Security Issues**: See [SECURITY.md](../../SECURITY.md)
+- **Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)
+
+---
+
+## Changelog Reference
+
+For the complete list of changes, see:
+- [Release Notes v0.8.0](../RELEASE_NOTES_v0.8.0.md)
+- [CHANGELOG.md](../../CHANGELOG.md)
diff --git a/docs/security/GHSA-DRAFT-RCE-LFI.md b/docs/security/GHSA-DRAFT-RCE-LFI.md
new file mode 100644
index 00000000..e2cc409f
--- /dev/null
+++ b/docs/security/GHSA-DRAFT-RCE-LFI.md
@@ -0,0 +1,171 @@
+# GitHub Security Advisory Draft
+
+> **Instructions**: Copy this content to create security advisories at:
+> https://github.com/unclecode/crawl4ai/security/advisories/new
+
+---
+
+## Advisory 1: Remote Code Execution via Hooks Parameter
+
+### Title
+Remote Code Execution in Docker API via Hooks Parameter
+
+### Severity
+Critical
+
+### CVSS Score
+10.0 (CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H)
+
+### CWE
+CWE-94 (Improper Control of Generation of Code)
+
+### Package
+crawl4ai (Docker API deployment)
+
+### Affected Versions
+< 0.8.0
+
+### Patched Versions
+0.8.0
+
+### Description
+
+A critical remote code execution vulnerability exists in the Crawl4AI Docker API deployment. The `/crawl` endpoint accepts a `hooks` parameter containing Python code that is executed using `exec()`. The `__import__` builtin was included in the allowed builtins, allowing attackers to import arbitrary modules and execute system commands.
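To make the role of `__import__` concrete, here is a minimal sketch of how an `exec()` call with a restricted builtins table behaves with and without it. This is not the code in `hook_manager.py`: `SAFE_BUILTINS` and `run_hook_source` are hypothetical names chosen for illustration, and the payload calls the harmless `os.getcwd()` rather than a shell command.

```python
# Illustrative sketch only; not the project's hook_manager.py.
# SAFE_BUILTINS and run_hook_source are hypothetical names.
SAFE_BUILTINS = {"len": len, "str": str, "print": print}


def run_hook_source(source: str, allow_import: bool) -> str:
    """Exec `source` against a restricted builtins table and report the outcome."""
    builtins_table = dict(SAFE_BUILTINS)
    if allow_import:
        # Pre-fix situation described above: __import__ is reachable from hook code.
        builtins_table["__import__"] = __import__
    try:
        exec(source, {"__builtins__": builtins_table})
    except (ImportError, NameError) as exc:
        return f"blocked: {type(exc).__name__}"
    return "executed"


payload = "__import__('os').getcwd()"
print(run_hook_source(payload, allow_import=True))
print(run_hook_source(payload, allow_import=False))
print(run_hook_source("import os", allow_import=False))
```

With `__import__` absent, both the direct call and a plain `import` statement fail before any module is loaded. A trimmed builtins table is generally not regarded as a complete sandbox on its own, which is consistent with the release also turning hooks off by default.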
+ +**Attack Vector:** +```json +POST /crawl +{ + "urls": ["https://example.com"], + "hooks": { + "code": { + "on_page_context_created": "async def hook(page, context, **kwargs):\n __import__('os').system('malicious_command')\n return page" + } + } +} +``` + +### Impact + +An unauthenticated attacker can: +- Execute arbitrary system commands +- Read/write files on the server +- Exfiltrate sensitive data (environment variables, API keys) +- Pivot to internal network services +- Completely compromise the server + +### Mitigation + +1. **Upgrade to v0.8.0** (recommended) +2. If unable to upgrade immediately: + - Disable the Docker API + - Block `/crawl` endpoint at network level + - Add authentication to the API + +### Fix Details + +1. Removed `__import__` from `allowed_builtins` in `hook_manager.py` +2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`) +3. Users must explicitly opt-in to enable hooks + +### Credits + +Discovered by Neo by ProjectDiscovery (https://projectdiscovery.io) + +### References + +- [Release Notes v0.8.0](https://github.com/unclecode/crawl4ai/blob/main/docs/RELEASE_NOTES_v0.8.0.md) +- [Migration Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/migration/v0.8.0-upgrade-guide.md) + +--- + +## Advisory 2: Local File Inclusion via file:// URLs + +### Title +Local File Inclusion in Docker API via file:// URLs + +### Severity +High + +### CVSS Score +8.6 (CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:N/A:N) + +### CWE +CWE-22 (Improper Limitation of a Pathname to a Restricted Directory) + +### Package +crawl4ai (Docker API deployment) + +### Affected Versions +< 0.8.0 + +### Patched Versions +0.8.0 + +### Description + +A local file inclusion vulnerability exists in the Crawl4AI Docker API. The `/execute_js`, `/screenshot`, `/pdf`, and `/html` endpoints accept `file://` URLs, allowing attackers to read arbitrary files from the server filesystem. 
+ +**Attack Vector:** +```json +POST /execute_js +{ + "url": "file:///etc/passwd", + "scripts": ["document.body.innerText"] +} +``` + +### Impact + +An unauthenticated attacker can: +- Read sensitive files (`/etc/passwd`, `/etc/shadow`, application configs) +- Access environment variables via `/proc/self/environ` +- Discover internal application structure +- Potentially read credentials and API keys + +### Mitigation + +1. **Upgrade to v0.8.0** (recommended) +2. If unable to upgrade immediately: + - Disable the Docker API + - Add authentication to the API + - Use network-level filtering + +### Fix Details + +Added URL scheme validation to block: +- `file://` URLs +- `javascript:` URLs +- `data:` URLs +- Other non-HTTP schemes + +Only `http://`, `https://`, and `raw:` URLs are now allowed. + +### Credits + +Discovered by Neo by ProjectDiscovery (https://projectdiscovery.io) + +### References + +- [Release Notes v0.8.0](https://github.com/unclecode/crawl4ai/blob/main/docs/RELEASE_NOTES_v0.8.0.md) +- [Migration Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/migration/v0.8.0-upgrade-guide.md) + +--- + +## Creating the Advisories on GitHub + +1. Go to: https://github.com/unclecode/crawl4ai/security/advisories/new + +2. Fill in the form for each advisory: + - **Ecosystem**: PyPI + - **Package name**: crawl4ai + - **Affected versions**: < 0.8.0 + - **Patched versions**: 0.8.0 + - **Severity**: Critical (for RCE), High (for LFI) + +3. After creating, GitHub will: + - Assign a GHSA ID + - Optionally request a CVE + - Notify users who have security alerts enabled + +4. Coordinate disclosure timing with the fix release
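
For readers of Advisory 2, the allow-list described under its Fix Details might be sketched roughly as follows. This is a minimal illustration under stated assumptions: the names `ALLOWED_PREFIXES`, `is_allowed_url`, and `validate_url`, and the error message, are hypothetical and are not taken from the shipped v0.8.0 code.

```python
# Illustrative only: hypothetical helpers, not the validation code shipped in v0.8.0.
ALLOWED_PREFIXES = ("http://", "https://", "raw:")


def is_allowed_url(url):
    """Return True only when the URL starts with an allowed scheme prefix."""
    # URI schemes are case-insensitive, so compare against a lowercased copy.
    return url.strip().lower().startswith(ALLOWED_PREFIXES)


def validate_url(url):
    """Raise ValueError for any URL outside the allow-list (message is an assumption)."""
    if not is_allowed_url(url):
        raise ValueError("URL must start with http://, https://, or raw:")


# Example inputs drawn from the schemes named in the advisory.
results = {
    candidate: is_allowed_url(candidate)
    for candidate in (
        "https://example.com",
        "raw:<html><body>hi</body></html>",
        "file:///etc/passwd",
        "javascript:alert(1)",
        "data:text/html,hi",
    )
}
```

An allow-list rejects any scheme it does not name, so schemes beyond `file://`, `javascript:`, and `data:` are refused without having to be listed individually.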