Add release notes for v0.8.0, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates
Documentation for v0.8.0 release:

- SECURITY.md: Security policy and vulnerability reporting guidelines
- RELEASE_NOTES_v0.8.0.md: Comprehensive release notes
- migration/v0.8.0-upgrade-guide.md: Step-by-step migration guide
- security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts
- CHANGELOG.md: Updated with v0.8.0 changes

Breaking changes documented:

- Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED)
- file:// URLs blocked on Docker API endpoints

Security fixes credited to Neo by ProjectDiscovery
40
CHANGELOG.md
@@ -5,6 +5,46 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.8.0] - 2026-01-12

### Security

- **🔒 CRITICAL: Remote Code Execution Fix**: Removed `__import__` from hook allowed builtins
  - Prevents arbitrary module imports in user-provided hook code
  - Hooks now disabled by default via `CRAWL4AI_HOOKS_ENABLED` environment variable
  - Credit: Neo by ProjectDiscovery
- **🔒 HIGH: Local File Inclusion Fix**: Added URL scheme validation to Docker API endpoints
  - Blocks `file://`, `javascript:`, `data:` URLs on `/execute_js`, `/screenshot`, `/pdf`, `/html`
  - Only allows `http://`, `https://`, and `raw:` URLs
  - Credit: Neo by ProjectDiscovery

### Breaking Changes

- **Docker API: Hooks disabled by default**: Set `CRAWL4AI_HOOKS_ENABLED=true` to enable
- **Docker API: file:// URLs blocked**: Use the Python library directly for local file processing

### Added

- **🚀 init_scripts for BrowserConfig**: Pre-page-load JavaScript injection for stealth evasions
- **🔄 CDP Connection Improvements**: WebSocket URL support, proper cleanup, browser reuse
- **💾 Crash Recovery for Deep Crawl**: `resume_state` and `on_state_change` for BFS/DFS/Best-First strategies
- **📄 PDF/MHTML for raw:/file:// URLs**: Generate PDFs and MHTML from cached HTML content
- **📸 Screenshots for raw:/file:// URLs**: Render cached HTML and capture screenshots
- **🔗 base_url Parameter**: Proper URL resolution for raw: HTML processing
- **⚡ Prefetch Mode**: Two-phase deep crawling with fast link extraction
- **🔀 Enhanced Proxy Support**: Improved proxy rotation and sticky sessions
- **🌐 HTTP Strategy Proxy Support**: Non-browser crawler now supports proxies
- **🖥️ Browser Pipeline for raw:/file://**: New `process_in_browser` parameter
- **📋 Smart TTL Cache for Sitemap Seeder**: `cache_ttl_hours` and `validate_sitemap_lastmod` parameters
- **📚 Security Documentation**: Added SECURITY.md with vulnerability reporting guidelines

### Fixed

- **raw: URL Parsing**: Fixed truncation at `#` character (CSS color codes like `#eee`)
- **Caching System**: Various improvements to cache validation and persistence

### Documentation

- Multi-sample schema generation section
- URL seeder smart TTL cache parameters
- v0.8.0 migration guide
- Security policy and disclosure process

## [Unreleased]

### Added
122
SECURITY.md
Normal file
@@ -0,0 +1,122 @@
# Security Policy

## Supported Versions

| Version | Supported |
| ------- | ------------------ |
| 0.8.x | :white_check_mark: |
| 0.7.x | :x: (upgrade recommended) |
| < 0.7 | :x: |

## Reporting a Vulnerability

We take security vulnerabilities seriously. If you discover a security issue, please report it responsibly.

### How to Report

**DO NOT** open a public GitHub issue for security vulnerabilities.

Instead, please report via one of these methods:

1. **GitHub Security Advisories (Preferred)**
   - Go to [Security Advisories](https://github.com/unclecode/crawl4ai/security/advisories)
   - Click "New draft security advisory"
   - Fill in the details

2. **Email**
   - Send details to: security@crawl4ai.com
   - Use subject: `[SECURITY] Brief description`
   - Include:
     - Description of the vulnerability
     - Steps to reproduce
     - Potential impact
     - Any suggested fixes

### What to Expect

- **Acknowledgment**: Within 48 hours
- **Initial Assessment**: Within 7 days
- **Resolution Timeline**: Depends on severity
  - Critical: 24-72 hours
  - High: 7 days
  - Medium: 30 days
  - Low: 90 days

### Disclosure Policy

- We follow responsible disclosure practices
- We will coordinate with you on disclosure timing
- Credit will be given to reporters (unless anonymity is requested)
- We may request CVE assignment for significant vulnerabilities

## Security Best Practices for Users

### Docker API Deployment

If you're running the Crawl4AI Docker API in production:

1. **Enable Authentication**
   ```yaml
   # config.yml
   security:
     enabled: true
     jwt_enabled: true
   ```
   ```bash
   # Set a strong secret key
   export SECRET_KEY="your-secure-random-key-here"
   ```

2. **Hooks Are Disabled by Default** (v0.8.0+)
   - Only enable if you trust all API users
   - Set `CRAWL4AI_HOOKS_ENABLED=true` only when necessary

3. **Network Security**
   - Run behind a reverse proxy (nginx, traefik)
   - Use HTTPS in production
   - Restrict access to trusted IPs if possible

4. **Container Security**
   - Run as a non-root user (the default in our container)
   - Use a read-only filesystem where possible
   - Limit container resources

### Library Usage

When using Crawl4AI as a Python library:

1. **Validate URLs** before crawling untrusted input
2. **Sanitize extracted content** before using it in other systems
3. **Be cautious with hooks** - they execute arbitrary code

## Known Security Issues

### Fixed in v0.8.0

| ID | Severity | Description | Fix |
|----|----------|-------------|-----|
| CVE-pending-1 | CRITICAL | RCE via hooks `__import__` | Removed from allowed builtins |
| CVE-pending-2 | HIGH | LFI via `file://` URLs | URL scheme validation added |

See [Security Advisory](https://github.com/unclecode/crawl4ai/security/advisories) for details.

## Security Features

### v0.8.0+

- **URL Scheme Validation**: Blocks `file://`, `javascript:`, `data:` URLs on the API
- **Hooks Disabled by Default**: Opt-in via `CRAWL4AI_HOOKS_ENABLED=true`
- **Restricted Hook Builtins**: No `__import__`, `eval`, `exec`, `open`
- **JWT Authentication**: Optional but recommended for production
- **Rate Limiting**: Configurable request limits
- **Security Headers**: X-Frame-Options, CSP, HSTS when enabled
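The restricted-builtins measure can be illustrated with a minimal sketch. The names below are hypothetical and this is not Crawl4AI's actual `hook_manager` code; note also that `exec`-based sandboxes have known escape vectors and should not be relied on as the only defense:

```python
# Sketch of a restricted-builtins sandbox (hypothetical names, not the
# actual hook_manager implementation).
SAFE_BUILTINS = {"len": len, "str": str, "int": int, "range": range}
# __import__, eval, exec, and open are simply absent from the mapping,
# so hook code cannot reach them by name.

def run_hook(code: str) -> dict:
    namespace = {"__builtins__": SAFE_BUILTINS}
    exec(compile(code, "<hook>", "exec"), namespace)
    return namespace

ns = run_hook("n = len('abc')")
assert ns["n"] == 3

try:
    run_hook("__import__('os').system('id')")
except NameError:
    pass  # blocked: __import__ is not in the allowed builtins
```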
## Acknowledgments

We thank the following security researchers for responsibly disclosing vulnerabilities:

- **Neo by ProjectDiscovery** - RCE and LFI vulnerabilities (December 2025)

---

*Last updated: January 2026*
@@ -1,335 +0,0 @@
# Crawl4AI Release Notes v0.7.9

**Period**: December 13, 2025 - January 12, 2026
**Total Commits**: 18
**Status**: DRAFT - Pending review

---

## Breaking Changes ⚠️

### 1. Docker API Hooks Disabled by Default
- **Commit**: `f24396c` (Jan 12, 2026)
- **Reason**: Security fix for RCE vulnerability
- **Impact**: Hooks no longer work unless `CRAWL4AI_HOOKS_ENABLED=true` is set
- **Migration**: Set the environment variable `CRAWL4AI_HOOKS_ENABLED=true` in the Docker container

### 2. file:// URLs Blocked on Docker API
- **Commit**: `f24396c` (Jan 12, 2026)
- **Reason**: Security fix for LFI vulnerability
- **Impact**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` reject `file://` URLs
- **Migration**: Use the library directly for local file processing

---

## Security Fixes 🔒

### Fix Critical RCE and LFI Vulnerabilities in Docker API
- **Commit**: `f24396c`
- **Date**: January 12, 2026
- **Severity**: CRITICAL
- **Files Changed**:
  - `deploy/docker/config.yml`
  - `deploy/docker/hook_manager.py`
  - `deploy/docker/server.py`
  - `deploy/docker/tests/run_security_tests.py`
  - `deploy/docker/tests/test_security_fixes.py`

**Details**:

1. **Remote Code Execution via Hooks (CVE pending)**
   - Removed `__import__` from allowed_builtins in hook_manager.py
   - Prevents arbitrary module imports (os, subprocess, etc.)
   - Hooks now disabled by default via the `CRAWL4AI_HOOKS_ENABLED` env var

2. **Local File Inclusion via file:// URLs (CVE pending)**
   - Added URL scheme validation to `/execute_js`, `/screenshot`, `/pdf`, `/html`
   - Blocks `file://`, `javascript:`, `data:`, and other dangerous schemes
   - Only allows `http://`, `https://`, and `raw:` (where appropriate)

3. **Security hardening**
   - Added `CRAWL4AI_HOOKS_ENABLED=false` as the default (opt-in for hooks)
   - Added security warning comments in config.yml
   - Added a `validate_url_scheme()` helper for consistent validation
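A helper of this shape can be sketched as follows. This is a minimal illustration of the rule described above; the real `validate_url_scheme()` in the Docker server may differ:

```python
from urllib.parse import urlparse

# Assumption: allowed schemes per the notes above.
ALLOWED_SCHEMES = {"http", "https", "raw"}

def validate_url_scheme(url: str) -> bool:
    # urlparse extracts the scheme; anything outside the allow-list
    # (file, javascript, data, ...) is rejected.
    return urlparse(url).scheme.lower() in ALLOWED_SCHEMES

assert validate_url_scheme("https://example.com/page")
assert validate_url_scheme("raw:<html></html>")
assert not validate_url_scheme("file:///etc/passwd")
assert not validate_url_scheme("javascript:alert(1)")
```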
---

## New Features ✨

### 1. init_scripts Support for BrowserConfig
- **Commit**: `d10ca38`
- **Date**: December 14, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/browser_manager.py`

**Description**: Pre-page-load JavaScript injection capability. Useful for stealth evasions (canvas/audio fingerprinting, userAgentData).

**Usage**:
```python
config = BrowserConfig(
    init_scripts=["Object.defineProperty(navigator, 'webdriver', {get: () => false})"]
)
```

---

### 2. CDP Connection Improvements
- **Commit**: `02acad1`
- **Date**: December 18, 2025
- **Files Changed**:
  - `crawl4ai/browser_manager.py`
  - `tests/browser/test_cdp_cleanup_reuse.py`

**Description**:
- Support WebSocket URLs (ws://, wss://) for CDP connections
- Proper cleanup when `cdp_cleanup_on_close=True`
- Enables reusing the same browser across multiple sequential connections

---

### 3. Crash Recovery for Deep Crawl Strategies
- **Commit**: `31ebf37`
- **Date**: December 22, 2025
- **Files Changed**:
  - `crawl4ai/deep_crawling/bff_strategy.py`
  - `crawl4ai/deep_crawling/bfs_strategy.py`
  - `crawl4ai/deep_crawling/dfs_strategy.py`
  - `tests/deep_crawling/test_deep_crawl_resume.py`
  - `tests/deep_crawling/test_deep_crawl_resume_integration.py`

**Description**: Optional `resume_state` and `on_state_change` parameters for all deep crawl strategies (BFS, DFS, Best-First), enabling crash recovery in cloud deployments.

**Features**:
- `resume_state`: Pass saved state to resume from a checkpoint
- `on_state_change`: Async callback fired after each URL for real-time state persistence
- `export_state()`: Get the last captured state manually
- Zero overhead when the features are disabled (None defaults)
- State is JSON-serializable
---

### 4. PDF and MHTML Support for raw:/file:// URLs
- **Commit**: `67e03d6`
- **Date**: December 22, 2025
- **Files Changed**:
  - `crawl4ai/async_crawler_strategy.py`

**Description**: Generate PDFs and MHTML from cached HTML content. Replaced `_generate_screenshot_from_html` with a unified `_generate_media_from_html` method.

---

### 5. Screenshots for raw:/file:// URLs
- **Commit**: `444cb14`
- **Date**: December 22, 2025
- **Files Changed**:
  - `crawl4ai/async_crawler_strategy.py`

**Description**: Enables cached HTML to be rendered and screenshotted. Loads the HTML into the browser via `page.set_content()` and takes a screenshot.

---

### 6. base_url Parameter for CrawlerRunConfig
- **Commit**: `3937efc`
- **Date**: December 24, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_webcrawler.py`

**Description**: When processing raw: HTML (e.g., from cache), provides proper URL resolution context for markdown link generation.

**Usage**:
```python
config = CrawlerRunConfig(base_url='https://example.com')
result = await crawler.arun(url='raw:{html}', config=config)
```

---

### 7. Prefetch Mode for Two-Phase Deep Crawling
- **Commit**: `fde4e9f`
- **Date**: December 25, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_webcrawler.py`
  - `crawl4ai/utils.py`
  - `tests/test_prefetch_integration.py`
  - `tests/test_prefetch_mode.py`
  - `tests/test_prefetch_regression.py`

**Description**:
- Added a `prefetch` parameter to CrawlerRunConfig
- Added a `quick_extract_links()` function for fast link extraction
- Short-circuit in `aprocess_html()` for prefetch mode
- 42 tests added (unit, integration, regression)
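The idea behind fast link extraction (collect hrefs, skip everything else) can be sketched with the stdlib parser. This is illustrative, not the actual `quick_extract_links()` implementation:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect hrefs only, skipping markdown/media/extraction work."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<a href="/a">A</a><p>text</p><a href="/b">B</a>')
assert parser.links == ["/a", "/b"]
```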
---

### 8. Proxy Rotation and Configuration Updates
- **Commit**: `9e7f5aa`
- **Date**: December 26, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_webcrawler.py`
  - `crawl4ai/proxy_strategy.py`
  - `tests/proxy/test_sticky_sessions.py`

**Description**: Enhanced proxy rotation and proxy configuration options.

---

### 9. Proxy Support for HTTP Crawler Strategy
- **Commit**: `a43256b`
- **Date**: December 26, 2025
- **Files Changed**:
  - `crawl4ai/async_crawler_strategy.py`

**Description**: Added proxy support to the HTTP (non-browser) crawler strategy.

---

### 10. Browser Pipeline Support for raw:/file:// URLs
- **Commit**: `2550f3d`
- **Date**: December 27, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_crawler_strategy.py`
  - `crawl4ai/browser_manager.py`
  - `tests/test_raw_html_browser.py`
  - `tests/test_raw_html_edge_cases.py`

**Description**:
- Added a `process_in_browser` parameter to CrawlerRunConfig
- Routes raw:/file:// URLs through the browser when browser operations are needed
- Uses `page.set_content()` instead of `goto()` for local content
- Auto-detects browser requirements: js_code, wait_for, screenshot, etc.
- Maintains the fast path for raw:/file:// without browser params

**Fixes**: #310

---

### 11. Smart TTL Cache for Sitemap URL Seeder
- **Commit**: `3d78001`
- **Date**: December 30, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_url_seeder.py`

**Description**:
- Added `cache_ttl_hours` and `validate_sitemap_lastmod` params to SeedingConfig
- New JSON cache format with metadata (version, created_at, lastmod, url_count)
- Cache validation by TTL expiry and sitemap lastmod comparison
- Auto-migration from the old .jsonl to the new .json format
- Fixes a bug where an incomplete cache was used indefinitely
---

## Bug Fixes 🐛

### 1. HTTP Strategy raw: URL Parsing Truncates at # Character
- **Commit**: `624e341`
- **Date**: December 24, 2025
- **Files Changed**:
  - `crawl4ai/async_crawler_strategy.py`

**Problem**: `urlparse()` treated `#` as a URL fragment delimiter, breaking CSS color codes like `#eee`.

**Before**: `raw:body{background:#eee}` → parsed to `body{background:`
**After**: `raw:body{background:#eee}` → parsed to `body{background:#eee}`

**Fix**: Strip the `raw:` prefix directly instead of using urlparse.
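The before/after behavior is easy to reproduce in isolation (a sketch of the parsing difference, not the library's actual code path):

```python
from urllib.parse import urlparse

url = "raw:body{background:#eee}"

# Before: urlparse treats '#' as a fragment delimiter and drops the rest.
assert urlparse(url).path == "body{background:"

# After: strip the 'raw:' prefix directly, keeping '#' intact.
assert url[len("raw:"):] == "body{background:#eee}"
```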
---

### 2. Caching Debugging and Fixes
- **Commit**: `48426f7`
- **Date**: December 21, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_database.py`
  - `crawl4ai/async_webcrawler.py`
  - `crawl4ai/cache_validator.py`
  - `crawl4ai/models.py`
  - `crawl4ai/utils.py`
  - `tests/cache_validation/` (multiple test files)

**Description**: Various debugging passes and improvements to the caching system.

---

## Documentation Updates 📚

### 1. Multi-Sample Schema Generation Section
- **Commit**: `6b2dca7`
- **Date**: January 4, 2026
- **Files Changed**:
  - `docs/md_v2/extraction/no-llm-strategies.md`

**Description**: Added documentation explaining how to pass multiple HTML samples to `generate_schema()` for stable selectors that work across pages with varying DOM structures.

---

### 2. URL Seeder Docs - Smart TTL Cache Parameters
- **Commit**: `db61ab8`
- **Date**: December 30, 2025
- **Files Changed**:
  - `docs/md_v2/core/url-seeding.md`

**Description**:
- Added `cache_ttl_hours` and `validate_sitemap_lastmod` to the parameter table
- Documented smart TTL cache validation with examples
- Added cache-related troubleshooting entries

---

## Other Changes 🔧

| Date | Commit | Description |
|------|--------|-------------|
| Dec 30, 2025 | `0d3f9e6` | Add MEMORY.md to gitignore |
| Dec 21, 2025 | `f6b29a8` | Update gitignore |

---

## Files Changed Summary

### Core Library
- `crawl4ai/async_configs.py` - Multiple changes (init_scripts, base_url, prefetch, proxy, process_in_browser, cache TTL)
- `crawl4ai/async_webcrawler.py` - Prefetch mode, base_url, proxy
- `crawl4ai/async_crawler_strategy.py` - Media generation, proxy, raw URL fixes
- `crawl4ai/browser_manager.py` - init_scripts, CDP cleanup, cookie handling
- `crawl4ai/async_url_seeder.py` - Smart TTL cache
- `crawl4ai/proxy_strategy.py` - Proxy rotation
- `crawl4ai/deep_crawling/*.py` - Crash recovery for all strategies
- `crawl4ai/utils.py` - quick_extract_links, cache fixes

### Docker Deployment
- `deploy/docker/server.py` - Security fixes
- `deploy/docker/hook_manager.py` - RCE fix
- `deploy/docker/config.yml` - Security warnings

### Documentation
- `docs/md_v2/extraction/no-llm-strategies.md` - Schema generation
- `docs/md_v2/core/url-seeding.md` - Cache parameters

### Tests Added
- `tests/test_prefetch_*.py` - 42 prefetch tests
- `tests/test_raw_html_*.py` - Raw HTML browser tests
- `tests/deep_crawling/test_deep_crawl_resume*.py` - Resume tests
- `tests/browser/test_cdp_cleanup_reuse.py` - CDP tests
- `tests/proxy/test_sticky_sessions.py` - Proxy tests
- `tests/cache_validation/*.py` - Cache tests
- `deploy/docker/tests/test_security_fixes.py` - Security tests
- `deploy/docker/tests/run_security_tests.py` - Security integration tests

---

## Questions for Main Developer

1. [ ] Are there any other breaking changes not captured here?
2. [ ] Should the security fixes get their own patch release?
3. [ ] Any features that need additional documentation?

---

*Generated: January 12, 2026*
243
docs/RELEASE_NOTES_v0.8.0.md
Normal file
@@ -0,0 +1,243 @@
# Crawl4AI v0.8.0 Release Notes

**Release Date**: January 2026
**Previous Version**: v0.7.6
**Status**: Release Candidate

---

## Highlights

- **Critical Security Fixes** for the Docker API deployment
- **11 New Features** including crash recovery, prefetch mode, and proxy improvements
- **Breaking Changes** - see the migration guide below

---

## Breaking Changes

### 1. Docker API: Hooks Disabled by Default

**What changed**: Hooks are now disabled by default on the Docker API.

**Why**: Security fix for a Remote Code Execution (RCE) vulnerability.

**Who is affected**: Users of the Docker API who use the `hooks` parameter in `/crawl` requests.

**Migration**:
```bash
# To re-enable hooks (only if you trust all API users):
export CRAWL4AI_HOOKS_ENABLED=true
```

### 2. Docker API: file:// URLs Blocked

**What changed**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` now reject `file://` URLs.

**Why**: Security fix for a Local File Inclusion (LFI) vulnerability.

**Who is affected**: Users who were reading local files via the Docker API.

**Migration**: Use the Python library directly for local file processing:
```python
# Instead of an API call with a file:// URL, use the library:
from crawl4ai import AsyncWebCrawler

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="file:///path/to/file.html")
```

---

## Security Fixes

### Critical: Remote Code Execution via Hooks (CVE Pending)

**Severity**: CRITICAL (CVSS 10.0)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /crawl` with a malicious `hooks` parameter

**Details**: The `__import__` builtin was available in hook code, allowing attackers to import `os`, `subprocess`, etc. and execute arbitrary commands.

**Fix**:
1. Removed `__import__` from the allowed builtins
2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)

### High: Local File Inclusion via file:// URLs (CVE Pending)

**Severity**: HIGH (CVSS 8.6)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /execute_js` (and other endpoints) with `file:///etc/passwd`

**Details**: API endpoints accepted `file://` URLs, allowing attackers to read arbitrary files from the server.

**Fix**: URL scheme validation now only allows `http://`, `https://`, and `raw:` URLs.

### Credits

Discovered by **Neo by ProjectDiscovery** ([projectdiscovery.io](https://projectdiscovery.io)) - December 2025

---

## New Features

### 1. init_scripts Support for BrowserConfig

Pre-page-load JavaScript injection for stealth evasions.

```python
config = BrowserConfig(
    init_scripts=[
        "Object.defineProperty(navigator, 'webdriver', {get: () => false})"
    ]
)
```

### 2. CDP Connection Improvements

- WebSocket URL support (`ws://`, `wss://`)
- Proper cleanup with `cdp_cleanup_on_close=True`
- Browser reuse across multiple connections

### 3. Crash Recovery for Deep Crawl Strategies

All deep crawl strategies (BFS, DFS, Best-First) now support crash recovery:

```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    resume_state=saved_state,      # Resume from checkpoint
    on_state_change=save_callback  # Persist state in real-time
)
```

### 4. PDF and MHTML for raw:/file:// URLs

Generate PDFs and MHTML from cached HTML content.

### 5. Screenshots for raw:/file:// URLs

Render cached HTML and capture screenshots.

### 6. base_url Parameter for CrawlerRunConfig

Proper URL resolution for raw: HTML processing:

```python
config = CrawlerRunConfig(base_url='https://example.com')
result = await crawler.arun(url='raw:{html}', config=config)
```

### 7. Prefetch Mode for Two-Phase Deep Crawling

Fast link extraction without full page processing:

```python
config = CrawlerRunConfig(prefetch=True)
```

### 8. Proxy Rotation and Configuration

Enhanced proxy rotation with sticky-session support.

### 9. Proxy Support for HTTP Strategy

The non-browser crawler now supports proxies.

### 10. Browser Pipeline for raw:/file:// URLs

New `process_in_browser` parameter for browser operations on local content:

```python
config = CrawlerRunConfig(
    process_in_browser=True,  # Force browser processing
    screenshot=True
)
result = await crawler.arun(url='raw:<html>...</html>', config=config)
```

### 11. Smart TTL Cache for Sitemap URL Seeder

Intelligent cache invalidation for sitemaps:

```python
config = SeedingConfig(
    cache_ttl_hours=24,
    validate_sitemap_lastmod=True
)
```

---

## Bug Fixes

### raw: URL Parsing Truncates at # Character

**Problem**: CSS color codes like `#eee` were being truncated.

**Before**: `raw:body{background:#eee}` → `body{background:`
**After**: `raw:body{background:#eee}` → `body{background:#eee}`

### Caching System Improvements

Various fixes to cache validation and persistence.

---

## Documentation Updates

- Multi-sample schema generation documentation
- URL seeder smart TTL cache parameters
- Security documentation (SECURITY.md)

---

## Upgrade Guide

### From v0.7.x to v0.8.0

1. **Update the package**:
   ```bash
   pip install --upgrade crawl4ai
   ```

2. **Docker API users**:
   - Hooks are now disabled by default
   - If you need hooks: `export CRAWL4AI_HOOKS_ENABLED=true`
   - `file://` URLs no longer work on the API (use the library directly)

3. **Review security settings**:
   ```yaml
   # config.yml - recommended for production
   security:
     enabled: true
     jwt_enabled: true
   ```

4. **Test your integration** before deploying to production

### Breaking Change Checklist

- [ ] Check whether you use the `hooks` parameter in API calls
- [ ] Check whether you use `file://` URLs via the API
- [ ] Update environment variables if needed
- [ ] Review the security configuration

---

## Full Changelog

See [CHANGELOG.md](../CHANGELOG.md) for the complete version history.

---

## Contributors

Thanks to all contributors who made this release possible.

Special thanks to **Neo by ProjectDiscovery** for responsible security disclosure.

---

*For questions or issues, please open a [GitHub Issue](https://github.com/unclecode/crawl4ai/issues).*
301
docs/migration/v0.8.0-upgrade-guide.md
Normal file
@@ -0,0 +1,301 @@
|
|||||||
|
# Migration Guide: Upgrading to Crawl4AI v0.8.0

This guide helps you upgrade from v0.7.x to v0.8.0, with special attention to breaking changes and security updates.

## Quick Summary

| Change | Impact | Action Required |
|--------|--------|-----------------|
| Hooks disabled by default | Docker API users with hooks | Set `CRAWL4AI_HOOKS_ENABLED=true` |
| `file://` URLs blocked | Docker API users reading local files | Use the Python library directly |
| Security fixes | All Docker API users | Update immediately |

---

## Step 1: Update the Package

### PyPI Installation

```bash
pip install --upgrade crawl4ai
```

### Docker Installation

```bash
docker pull unclecode/crawl4ai:latest
# or
docker pull unclecode/crawl4ai:0.8.0
```

### From Source

```bash
git pull origin main
pip install -e .
```

---

## Step 2: Check for Breaking Changes

### Are You Affected?

**You ARE affected if you:**

- Use the Docker API deployment
- Use the `hooks` parameter in `/crawl` requests
- Use `file://` URLs via API endpoints

**You are NOT affected if you:**

- Only use Crawl4AI as a Python library
- Don't use hooks in your API calls
- Don't use `file://` URLs via the API

---

## Step 3: Migrate Hooks Usage

### Before v0.8.0

Hooks worked by default:

```bash
# This worked without any configuration
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "hooks": {
      "code": {
        "on_page_context_created": "async def hook(page, context, **kwargs):\n    await context.add_cookies([...])\n    return page"
      }
    }
  }'
```

### After v0.8.0

You must explicitly enable hooks.

**Option A: Environment Variable (Recommended)**

```bash
# Set before starting the server
# (for docker run, pass it with: -e CRAWL4AI_HOOKS_ENABLED=true)
export CRAWL4AI_HOOKS_ENABLED=true
```

```yaml
# docker-compose.yml
services:
  crawl4ai:
    image: unclecode/crawl4ai:0.8.0
    environment:
      - CRAWL4AI_HOOKS_ENABLED=true
```

**Option B: Kubernetes**

```yaml
env:
  - name: CRAWL4AI_HOOKS_ENABLED
    value: "true"
```

### Security Warning

Only enable hooks if:

- You trust all users who can access the API
- The API is not exposed to the public internet
- You have additional authentication/authorization in place

---

## Step 4: Migrate file:// URL Usage

### Before v0.8.0

```bash
# This worked via the API
curl -X POST http://localhost:11235/execute_js \
  -d '{"url": "file:///var/data/page.html", "scripts": ["document.title"]}'
```

### After v0.8.0

**Option A: Use the Python Library Directly**

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def process_local_file():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="file:///var/data/page.html",
            config=CrawlerRunConfig(js_code=["document.title"])
        )
        return result
```

**Option B: Use the raw: Protocol for HTML Content**

If you already have the HTML content, you can still use the API:

```bash
# Read the file and send it as a raw: URL.
# jq builds the JSON body so quotes and newlines in the HTML are escaped safely.
HTML_CONTENT=$(cat /var/data/page.html)
curl -X POST http://localhost:11235/html \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg u "raw:$HTML_CONTENT" '{url: $u}')"
```
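If you would rather avoid shell quoting altogether, the same payload can be built in Python; `json.dumps` handles the escaping that breaks hand-built JSON strings. The helper name here is illustrative:

```python
import json
from pathlib import Path

def raw_payload(path: str) -> str:
    """Build the JSON body for the /html endpoint from a local HTML file."""
    html = Path(path).read_text(encoding="utf-8")
    # json.dumps escapes quotes and newlines that would break hand-built JSON
    return json.dumps({"url": f"raw:{html}"})
```

You can then pipe the output to curl with `-d @-`, or send it with any HTTP client.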

**Option C: Create a Preprocessing Service**

```python
# preprocessing_service.py
from fastapi import FastAPI
from crawl4ai import AsyncWebCrawler

app = FastAPI()

@app.post("/process-local")
async def process_local(file_path: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=f"file://{file_path}")
        return result.model_dump()
```

Restrict access to such a service: it reintroduces local-file reads for anyone who can call it.

---

## Step 5: Review Security Configuration

### Recommended Production Settings

```yaml
# config.yml
security:
  enabled: true
  jwt_enabled: true
  https_redirect: true  # If behind an HTTPS proxy
  trusted_hosts:
    - "your-domain.com"
    - "api.your-domain.com"
```

### Environment Variables

```bash
# Required for JWT authentication
export SECRET_KEY="your-secure-random-key-minimum-32-characters"

# Only if you need hooks
export CRAWL4AI_HOOKS_ENABLED=true
```

### Generate a Secure Secret Key

```python
import secrets
print(secrets.token_urlsafe(32))
```
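If Python is not handy on the host, OpenSSL produces an equivalent random key:

```shell
# 32 random bytes, base64-encoded (one 44-character line)
openssl rand -base64 32
```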
---

## Step 6: Test Your Integration

### Quick Validation Script

```python
import asyncio
import aiohttp

async def test_upgrade():
    base_url = "http://localhost:11235"

    # Test 1: Basic crawl should work
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{base_url}/crawl",
            json={"urls": ["https://example.com"]}
        ) as resp:
            assert resp.status == 200, "Basic crawl failed"
            print("✓ Basic crawl works")

    # Test 2: Hooks should be blocked (unless enabled)
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{base_url}/crawl",
            json={
                "urls": ["https://example.com"],
                "hooks": {"code": {"on_page_context_created": "async def hook(page, context, **kwargs): return page"}}
            }
        ) as resp:
            if resp.status == 403:
                print("✓ Hooks correctly blocked (default)")
            elif resp.status == 200:
                print("! Hooks enabled - ensure this is intentional")

    # Test 3: file:// should be blocked
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{base_url}/execute_js",
            json={"url": "file:///etc/passwd", "scripts": ["1"]}
        ) as resp:
            assert resp.status == 400, "file:// should be blocked"
            print("✓ file:// URLs correctly blocked")

asyncio.run(test_upgrade())
```

---

## Troubleshooting

### "Hooks are disabled" Error

**Symptom**: The API returns 403 with "Hooks are disabled".

**Solution**: Set `CRAWL4AI_HOOKS_ENABLED=true` if you need hooks.

### "URL must start with http://, https://" Error

**Symptom**: The API returns 400 when you use `file://` URLs.

**Solution**: Use the Python library directly or the `raw:` protocol.

### Authentication Errors After Enabling JWT

**Symptom**: The API returns 401 Unauthorized.

**Solution**:

1. Get a token: `POST /token` with your email
2. Include the token in requests: `Authorization: Bearer <token>`
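The two-step flow can be sketched with the standard library. The `/token` request shape and the `access_token` response field follow the examples in this guide and may differ in your deployment:

```python
import json
import urllib.request

def bearer_headers(token: str) -> dict:
    """Headers for authenticated requests: Authorization: Bearer <token>."""
    return {"Authorization": f"Bearer {token}",
            "Content-Type": "application/json"}

def get_token(base_url: str, email: str) -> str:
    """POST /token with your email and return the issued JWT."""
    req = urllib.request.Request(
        f"{base_url}/token",
        data=json.dumps({"email": email}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]
```

Pass `bearer_headers(get_token(base_url, email))` to every subsequent request against protected endpoints.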
---

## Rollback Plan

If you need to roll back:

```bash
# PyPI
pip install crawl4ai==0.7.6

# Docker
docker pull unclecode/crawl4ai:0.7.6
```

**Warning**: Rolling back re-exposes the security vulnerabilities. Only do this temporarily while fixing integration issues.

---

## Getting Help

- **GitHub Issues**: [github.com/unclecode/crawl4ai/issues](https://github.com/unclecode/crawl4ai/issues)
- **Security Issues**: See [SECURITY.md](../../SECURITY.md)
- **Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)

---

## Changelog Reference

For the complete list of changes, see:

- [Release Notes v0.8.0](../RELEASE_NOTES_v0.8.0.md)
- [CHANGELOG.md](../../CHANGELOG.md)

---

`docs/security/GHSA-DRAFT-RCE-LFI.md` (new file, 171 lines)

# GitHub Security Advisory Draft

> **Instructions**: Copy this content to create security advisories at
> https://github.com/unclecode/crawl4ai/security/advisories/new

---

## Advisory 1: Remote Code Execution via Hooks Parameter

### Title

Remote Code Execution in Docker API via Hooks Parameter

### Severity

Critical

### CVSS Score

10.0 (CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H)

### CWE

CWE-94 (Improper Control of Generation of Code)

### Package

crawl4ai (Docker API deployment)

### Affected Versions

< 0.8.0

### Patched Versions

0.8.0

### Description

A critical remote code execution vulnerability exists in the Crawl4AI Docker API deployment. The `/crawl` endpoint accepts a `hooks` parameter containing Python code that is executed with `exec()`. The `__import__` builtin was included in the allowed builtins, allowing attackers to import arbitrary modules and execute system commands.

**Attack Vector:**

```json
POST /crawl
{
  "urls": ["https://example.com"],
  "hooks": {
    "code": {
      "on_page_context_created": "async def hook(page, context, **kwargs):\n    __import__('os').system('malicious_command')\n    return page"
    }
  }
}
```

### Impact

An unauthenticated attacker can:

- Execute arbitrary system commands
- Read/write files on the server
- Exfiltrate sensitive data (environment variables, API keys)
- Pivot to internal network services
- Completely compromise the server

### Mitigation

1. **Upgrade to v0.8.0** (recommended)
2. If unable to upgrade immediately:
   - Disable the Docker API
   - Block the `/crawl` endpoint at the network level
   - Add authentication to the API

### Fix Details

1. Removed `__import__` from `allowed_builtins` in `hook_manager.py`
2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
3. Users must explicitly opt in to enable hooks
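The hardened pattern can be illustrated in isolation (names here are illustrative, not the actual crawl4ai internals): executing hook source against a builtins whitelist that omits `__import__` makes both `import` statements and `__import__()` calls fail at name lookup.

```python
# A deliberately small builtins whitelist; anything absent from it
# (notably __import__) is unreachable from the executed hook source.
SAFE_BUILTINS = {"len": len, "str": str, "range": range, "print": print}

def run_hook_source(source: str) -> dict:
    """Execute user-supplied hook code with restricted builtins."""
    namespace = {"__builtins__": SAFE_BUILTINS}
    exec(compile(source, "<hook>", "exec"), namespace)
    return namespace

# A benign hook compiles and runs...
ns = run_hook_source("def hook(x):\n    return len(str(x))")
# ...while "__import__('os')" in hook source raises NameError.
```

A builtins whitelist blocks this specific vector but is not a complete sandbox, which is why the fix also disables hooks by default.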

### Credits

Discovered by Neo by ProjectDiscovery (https://projectdiscovery.io)

### References

- [Release Notes v0.8.0](https://github.com/unclecode/crawl4ai/blob/main/docs/RELEASE_NOTES_v0.8.0.md)
- [Migration Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/migration/v0.8.0-upgrade-guide.md)

---

## Advisory 2: Local File Inclusion via file:// URLs

### Title

Local File Inclusion in Docker API via file:// URLs

### Severity

High

### CVSS Score

8.6 (CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:N/A:N)

### CWE

CWE-22 (Improper Limitation of a Pathname to a Restricted Directory)

### Package

crawl4ai (Docker API deployment)

### Affected Versions

< 0.8.0

### Patched Versions

0.8.0

### Description

A local file inclusion vulnerability exists in the Crawl4AI Docker API. The `/execute_js`, `/screenshot`, `/pdf`, and `/html` endpoints accept `file://` URLs, allowing attackers to read arbitrary files from the server filesystem.

**Attack Vector:**

```json
POST /execute_js
{
  "url": "file:///etc/passwd",
  "scripts": ["document.body.innerText"]
}
```

### Impact

An unauthenticated attacker can:

- Read sensitive files (`/etc/passwd`, `/etc/shadow`, application configs)
- Access environment variables via `/proc/self/environ`
- Discover the internal application structure
- Potentially read credentials and API keys

### Mitigation

1. **Upgrade to v0.8.0** (recommended)
2. If unable to upgrade immediately:
   - Disable the Docker API
   - Add authentication to the API
   - Use network-level filtering

### Fix Details

Added URL scheme validation to block:

- `file://` URLs
- `javascript:` URLs
- `data:` URLs
- Other non-HTTP schemes

Only `http://`, `https://`, and `raw:` URLs are now allowed.
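The allow-list check can be sketched as follows; this is a simplified stand-in for the actual validation, not the crawl4ai source:

```python
from urllib.parse import urlparse

# Allow-list mirroring the documented v0.8.0 behavior
ALLOWED_SCHEMES = {"http", "https", "raw"}

def is_allowed_url(url: str) -> bool:
    """Accept only allow-listed schemes; file:, javascript:, data:, etc. are rejected."""
    return urlparse(url).scheme.lower() in ALLOWED_SCHEMES
```

Allow-listing schemes is safer than block-listing known-bad ones, since any scheme added to browsers later is rejected by default.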

### Credits

Discovered by Neo by ProjectDiscovery (https://projectdiscovery.io)

### References

- [Release Notes v0.8.0](https://github.com/unclecode/crawl4ai/blob/main/docs/RELEASE_NOTES_v0.8.0.md)
- [Migration Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/migration/v0.8.0-upgrade-guide.md)

---

## Creating the Advisories on GitHub

1. Go to: https://github.com/unclecode/crawl4ai/security/advisories/new

2. Fill in the form for each advisory:
   - **Ecosystem**: PyPI
   - **Package name**: crawl4ai
   - **Affected versions**: < 0.8.0
   - **Patched versions**: 0.8.0
   - **Severity**: Critical (for the RCE), High (for the LFI)

3. After creating, GitHub will:
   - Assign a GHSA ID
   - Optionally request a CVE
   - Notify users who have security alerts enabled

4. Coordinate disclosure timing with the fix release.