# Crawl4AI v0.8.0 Release Notes

**Release Date**: January 2026
**Previous Version**: v0.7.6
**Status**: Release Candidate

---

## Highlights

- **Critical Security Fixes** for the Docker API deployment
- **11 New Features**, including crash recovery, prefetch mode, and proxy improvements
- **Breaking Changes**: see the migration guide below

---

## Breaking Changes

### 1. Docker API: Hooks Disabled by Default

**What changed**: Hooks are now disabled by default on the Docker API.

**Why**: Security fix for a Remote Code Execution (RCE) vulnerability.

**Who is affected**: Users of the Docker API who use the `hooks` parameter in `/crawl` requests.

**Migration**:

```bash
# To re-enable hooks (only if you trust all API users):
export CRAWL4AI_HOOKS_ENABLED=true
```
### 2. Docker API: file:// URLs Blocked

**What changed**: The `/execute_js`, `/screenshot`, `/pdf`, and `/html` endpoints now reject `file://` URLs.

**Why**: Security fix for a Local File Inclusion (LFI) vulnerability.

**Who is affected**: Users who were reading local files via the Docker API.

**Migration**: Use the Python library directly for local file processing:

```python
# Instead of an API call with a file:// URL, use the library directly:
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="file:///path/to/file.html")

asyncio.run(main())
```

---

## Security Fixes

### Critical: Remote Code Execution via Hooks (CVE Pending)

**Severity**: CRITICAL (CVSS 10.0)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /crawl` with a malicious `hooks` parameter

**Details**: The `__import__` builtin was available in hook code, allowing attackers to import `os`, `subprocess`, etc. and execute arbitrary commands.

**Fix**:

1. Removed `__import__` from the allowed builtins
2. Disabled hooks by default (`CRAWL4AI_HOOKS_ENABLED=false`)
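
To see why removing `__import__` matters, here is a minimal, generic sketch of executing untrusted code against a restricted builtins table. This is *not* Crawl4AI's actual hook sandbox, and note that builtins filtering alone is not a complete sandbox; it only illustrates the mechanism behind this particular fix:

```python
# Generic sketch, NOT Crawl4AI's actual sandbox implementation.
SAFE_BUILTINS = {"len": len, "range": range, "print": print}

def run_hook(code: str) -> dict:
    """Execute untrusted code with a restricted builtins table."""
    scope = {"__builtins__": SAFE_BUILTINS}  # no __import__ here
    exec(code, scope)
    return scope

# Harmless code runs fine:
run_hook("x = len(range(10))")

# An import attempt fails, because the `import` statement resolves
# through __builtins__['__import__'], which we removed:
try:
    run_hook("import os")
except ImportError:
    pass  # blocked, as intended
```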
### High: Local File Inclusion via file:// URLs (CVE Pending)

**Severity**: HIGH (CVSS 8.6)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /execute_js` (and other endpoints) with `file:///etc/passwd`

**Details**: API endpoints accepted `file://` URLs, allowing attackers to read arbitrary files from the server.

**Fix**: URL scheme validation now allows only `http://`, `https://`, and `raw:` URLs.
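
A scheme allow-list of this kind can be sketched as follows (a generic illustration, not Crawl4AI's actual validation code):

```python
from urllib.parse import urlparse

# Only these schemes are accepted; everything else is rejected.
ALLOWED_SCHEMES = {"http", "https", "raw"}

def validate_url(url: str) -> None:
    """Raise ValueError for URLs whose scheme is not on the allow-list."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"URL scheme {scheme!r} is not allowed")

validate_url("https://example.com")   # ok
# validate_url("file:///etc/passwd")  # raises ValueError
```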
### Credits

Discovered by **Neo by ProjectDiscovery** ([projectdiscovery.io](https://projectdiscovery.io)), December 2025.

---

## New Features

### 1. init_scripts Support for BrowserConfig

Inject JavaScript before page load, e.g. for stealth evasions:

```python
from crawl4ai import BrowserConfig

config = BrowserConfig(
    init_scripts=[
        "Object.defineProperty(navigator, 'webdriver', {get: () => false})"
    ]
)
```
### 2. CDP Connection Improvements

- WebSocket URL support (`ws://`, `wss://`)
- Proper cleanup with `cdp_cleanup_on_close=True`
- Browser reuse across multiple connections
### 3. Crash Recovery for Deep Crawl Strategies

All deep crawl strategies (BFS, DFS, Best-First) now support crash recovery:

```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    resume_state=saved_state,       # Resume from a checkpoint
    on_state_change=save_callback,  # Persist state in real time
)
```

### 4. PDF and MHTML for raw:/file:// URLs

Generate PDFs and MHTML from cached HTML content.

### 5. Screenshots for raw:/file:// URLs

Render cached HTML and capture screenshots.
### 6. base_url Parameter for CrawlerRunConfig

Proper URL resolution when processing `raw:` HTML:

```python
config = CrawlerRunConfig(base_url="https://example.com")
result = await crawler.arun(url=f"raw:{html}", config=config)
```
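
The effect of `base_url` is standard relative-URL resolution: links found in the raw HTML are joined against the base. In plain Python this corresponds to `urllib.parse.urljoin`:

```python
from urllib.parse import urljoin

# Links inside raw: HTML are often relative; base_url anchors them.
base = "https://example.com/docs/"
print(urljoin(base, "page.html"))   # https://example.com/docs/page.html
print(urljoin(base, "/img/a.png"))  # https://example.com/img/a.png
```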
### 7. Prefetch Mode for Two-Phase Deep Crawling

Fast link extraction without full page processing:

```python
config = CrawlerRunConfig(prefetch=True)
```
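
Conceptually, prefetch enables a two-phase pattern: a cheap first pass that only discovers links, then a full second pass over the URLs you actually want. A library-free sketch of the first phase, using only the standard library (the inline HTML is purely illustrative):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Phase 1: extract candidate links without any further processing."""

    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

collector = LinkCollector()
collector.feed('<a href="/docs">Docs</a><a href="/blog">Blog</a>')
print(collector.links)  # ['/docs', '/blog']
```

Phase 2 would then crawl the collected URLs with a normal (non-prefetch) configuration.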
### 8. Proxy Rotation and Configuration

Enhanced proxy rotation with support for sticky sessions.

### 9. Proxy Support for HTTP Strategy

The non-browser (HTTP-only) crawler strategy now supports proxies.
### 10. Browser Pipeline for raw:/file:// URLs

A new `process_in_browser` parameter forces browser operations on local content:

```python
config = CrawlerRunConfig(
    process_in_browser=True,  # Force browser processing
    screenshot=True,
)
result = await crawler.arun(url="raw:<html>...</html>", config=config)
```
### 11. Smart TTL Cache for Sitemap URL Seeder

Intelligent cache invalidation for sitemaps:

```python
config = SeedingConfig(
    cache_ttl_hours=24,
    validate_sitemap_lastmod=True,
)
```

---

## Bug Fixes

### raw: URL Parsing Truncates at # Character

**Problem**: `raw:` content containing `#` (e.g. CSS color codes like `#eee`) was truncated at the fragment separator.

**Before**: `raw:body{background:#eee}` → `body{background:`
**After**: `raw:body{background:#eee}` → `body{background:#eee}`
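
The truncation comes from treating the `raw:` payload as a URL, where `#` starts a fragment; stripping the scheme prefix directly avoids that. A generic sketch of the two behaviors (not the actual Crawl4AI parser):

```python
from urllib.parse import urlparse

raw_url = "raw:body{background:#eee}"

# Buggy approach: urlparse treats everything after '#' as a URL fragment.
truncated = urlparse(raw_url).path
print(truncated)  # body{background:

# Fixed approach: strip the scheme prefix without URL parsing.
payload = raw_url[len("raw:"):]
print(payload)  # body{background:#eee}
```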
### Caching System Improvements

Various fixes to cache validation and persistence.

---

## Documentation Updates

- Multi-sample schema generation documentation
- URL seeder smart TTL cache parameters
- Security documentation (SECURITY.md)

---

## Upgrade Guide

### From v0.7.x to v0.8.0

1. **Update the package**:

   ```bash
   pip install --upgrade crawl4ai
   ```

2. **Docker API users**:
   - Hooks are now disabled by default
   - If you need hooks: `export CRAWL4AI_HOOKS_ENABLED=true`
   - `file://` URLs no longer work on the API (use the library directly)

3. **Review security settings**:
   ```yaml
   # config.yml - recommended for production
   security:
     enabled: true
     jwt_enabled: true
   ```

4. **Test your integration** before deploying to production
### Breaking Change Checklist

- [ ] Check whether you use the `hooks` parameter in API calls
- [ ] Check whether you use `file://` URLs via the API
- [ ] Update environment variables if needed
- [ ] Review your security configuration

---

## Full Changelog

See [CHANGELOG.md](../CHANGELOG.md) for complete version history.

---

## Contributors

Thanks to all contributors who made this release possible.

Special thanks to **Neo by ProjectDiscovery** for responsible security disclosure.

---

*For questions or issues, please open a [GitHub Issue](https://github.com/unclecode/crawl4ai/issues).*