Add release notes for v0.8.0, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates
Documentation for v0.8.0 release:

- SECURITY.md: Security policy and vulnerability reporting guidelines
- RELEASE_NOTES_v0.8.0.md: Comprehensive release notes
- migration/v0.8.0-upgrade-guide.md: Step-by-step migration guide
- security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts
- CHANGELOG.md: Updated with v0.8.0 changes

Breaking changes documented:

- Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED)
- file:// URLs blocked on Docker API endpoints

Security fixes credited to Neo by ProjectDiscovery
40
CHANGELOG.md
@@ -5,6 +5,46 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.8.0] - 2026-01-12

### Security

- **🔒 CRITICAL: Remote Code Execution Fix**: Removed `__import__` from hook allowed builtins
  - Prevents arbitrary module imports in user-provided hook code
  - Hooks now disabled by default via `CRAWL4AI_HOOKS_ENABLED` environment variable
  - Credit: Neo by ProjectDiscovery
- **🔒 HIGH: Local File Inclusion Fix**: Added URL scheme validation to Docker API endpoints
  - Blocks `file://`, `javascript:`, `data:` URLs on `/execute_js`, `/screenshot`, `/pdf`, `/html`
  - Only allows `http://`, `https://`, and `raw:` URLs
  - Credit: Neo by ProjectDiscovery

### Breaking Changes

- **Docker API: Hooks disabled by default**: Set `CRAWL4AI_HOOKS_ENABLED=true` to enable
- **Docker API: file:// URLs blocked**: Use the Python library directly for local file processing

### Added

- **🚀 init_scripts for BrowserConfig**: Pre-page-load JavaScript injection for stealth evasions
- **🔄 CDP Connection Improvements**: WebSocket URL support, proper cleanup, browser reuse
- **💾 Crash Recovery for Deep Crawl**: `resume_state` and `on_state_change` for BFS/DFS/Best-First strategies
- **📄 PDF/MHTML for raw:/file:// URLs**: Generate PDFs and MHTML from cached HTML content
- **📸 Screenshots for raw:/file:// URLs**: Render cached HTML and capture screenshots
- **🔗 base_url Parameter**: Proper URL resolution for raw: HTML processing
- **⚡ Prefetch Mode**: Two-phase deep crawling with fast link extraction
- **🔀 Enhanced Proxy Support**: Improved proxy rotation and sticky sessions
- **🌐 HTTP Strategy Proxy Support**: Non-browser crawler now supports proxies
- **🖥️ Browser Pipeline for raw:/file://**: New `process_in_browser` parameter
- **📋 Smart TTL Cache for Sitemap Seeder**: `cache_ttl_hours` and `validate_sitemap_lastmod` parameters
- **📚 Security Documentation**: Added SECURITY.md with vulnerability reporting guidelines

### Fixed

- **raw: URL Parsing**: Fixed truncation at `#` character (CSS color codes like `#eee`)
- **Caching System**: Various improvements to cache validation and persistence

### Documentation

- Multi-sample schema generation section
- URL seeder smart TTL cache parameters
- v0.8.0 migration guide
- Security policy and disclosure process

## [Unreleased]

### Added
122
SECURITY.md
Normal file
@@ -0,0 +1,122 @@
# Security Policy

## Supported Versions

| Version | Supported |
| ------- | ------------------ |
| 0.8.x | :white_check_mark: |
| 0.7.x | :x: (upgrade recommended) |
| < 0.7 | :x: |

## Reporting a Vulnerability

We take security vulnerabilities seriously. If you discover a security issue, please report it responsibly.

### How to Report

**DO NOT** open a public GitHub issue for security vulnerabilities.

Instead, please report via one of these methods:

1. **GitHub Security Advisories (Preferred)**
   - Go to [Security Advisories](https://github.com/unclecode/crawl4ai/security/advisories)
   - Click "New draft security advisory"
   - Fill in the details

2. **Email**
   - Send details to: security@crawl4ai.com
   - Use subject: `[SECURITY] Brief description`
   - Include:
     - Description of the vulnerability
     - Steps to reproduce
     - Potential impact
     - Any suggested fixes

### What to Expect

- **Acknowledgment**: Within 48 hours
- **Initial Assessment**: Within 7 days
- **Resolution Timeline**: Depends on severity
  - Critical: 24-72 hours
  - High: 7 days
  - Medium: 30 days
  - Low: 90 days

### Disclosure Policy

- We follow responsible disclosure practices
- We will coordinate with you on disclosure timing
- Credit will be given to reporters (unless anonymity is requested)
- We may request CVE assignment for significant vulnerabilities

## Security Best Practices for Users

### Docker API Deployment

If you're running the Crawl4AI Docker API in production:

1. **Enable Authentication**
   ```yaml
   # config.yml
   security:
     enabled: true
     jwt_enabled: true
   ```
   ```bash
   # Set a strong secret key
   export SECRET_KEY="your-secure-random-key-here"
   ```

2. **Hooks Are Disabled by Default** (v0.8.0+)
   - Only enable if you trust all API users
   - Set `CRAWL4AI_HOOKS_ENABLED=true` only when necessary

3. **Network Security**
   - Run behind a reverse proxy (nginx, traefik)
   - Use HTTPS in production
   - Restrict access to trusted IPs if possible

4. **Container Security**
   - Run as a non-root user (the default in our container)
   - Use a read-only filesystem where possible
   - Limit container resources

### Library Usage

When using Crawl4AI as a Python library:

1. **Validate URLs** before crawling untrusted input
2. **Sanitize extracted content** before using it in other systems
3. **Be cautious with hooks** - they execute arbitrary code

## Known Security Issues

### Fixed in v0.8.0

| ID | Severity | Description | Fix |
|----|----------|-------------|-----|
| CVE-pending-1 | CRITICAL | RCE via hooks `__import__` | Removed from allowed builtins |
| CVE-pending-2 | HIGH | LFI via `file://` URLs | URL scheme validation added |

See [Security Advisory](https://github.com/unclecode/crawl4ai/security/advisories) for details.

## Security Features

### v0.8.0+

- **URL Scheme Validation**: Blocks `file://`, `javascript:`, `data:` URLs on the API
- **Hooks Disabled by Default**: Opt-in via `CRAWL4AI_HOOKS_ENABLED=true`
- **Restricted Hook Builtins**: No `__import__`, `eval`, `exec`, `open`
- **JWT Authentication**: Optional but recommended for production
- **Rate Limiting**: Configurable request limits
- **Security Headers**: X-Frame-Options, CSP, HSTS when enabled
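The restricted-builtins measure can be illustrated with a minimal sketch. The names below are hypothetical and this is not Crawl4AI's actual `hook_manager` code; note also that `exec`-based sandboxes have known escape vectors and should not be relied on as the only defense:

```python
# Sketch of a restricted-builtins sandbox (hypothetical names, not the
# actual hook_manager implementation).
SAFE_BUILTINS = {"len": len, "str": str, "int": int, "range": range}
# __import__, eval, exec, and open are simply absent from the mapping,
# so hook code cannot reach them by name.

def run_hook(code: str) -> dict:
    namespace = {"__builtins__": SAFE_BUILTINS}
    exec(compile(code, "<hook>", "exec"), namespace)
    return namespace

ns = run_hook("n = len('abc')")
assert ns["n"] == 3

try:
    run_hook("__import__('os').system('id')")
except NameError:
    pass  # blocked: __import__ is not in the allowed builtins
```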
## Acknowledgments

We thank the following security researchers for responsibly disclosing vulnerabilities:

- **Neo by ProjectDiscovery** - RCE and LFI vulnerabilities (December 2025)

---

*Last updated: January 2026*
@@ -1,335 +0,0 @@
# Crawl4AI Release Notes v0.7.9

**Period**: December 13, 2025 - January 12, 2026
**Total Commits**: 18
**Status**: DRAFT - Pending review

---

## Breaking Changes ⚠️

### 1. Docker API Hooks Disabled by Default
- **Commit**: `f24396c` (Jan 12, 2026)
- **Reason**: Security fix for RCE vulnerability
- **Impact**: Hooks no longer work unless `CRAWL4AI_HOOKS_ENABLED=true` is set
- **Migration**: Set the environment variable `CRAWL4AI_HOOKS_ENABLED=true` in the Docker container

### 2. file:// URLs Blocked on Docker API
- **Commit**: `f24396c` (Jan 12, 2026)
- **Reason**: Security fix for LFI vulnerability
- **Impact**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` reject `file://` URLs
- **Migration**: Use the library directly for local file processing

---

## Security Fixes 🔒

### Fix Critical RCE and LFI Vulnerabilities in Docker API
- **Commit**: `f24396c`
- **Date**: January 12, 2026
- **Severity**: CRITICAL
- **Files Changed**:
  - `deploy/docker/config.yml`
  - `deploy/docker/hook_manager.py`
  - `deploy/docker/server.py`
  - `deploy/docker/tests/run_security_tests.py`
  - `deploy/docker/tests/test_security_fixes.py`

**Details**:

1. **Remote Code Execution via Hooks (CVE pending)**
   - Removed `__import__` from allowed_builtins in hook_manager.py
   - Prevents arbitrary module imports (os, subprocess, etc.)
   - Hooks now disabled by default via the `CRAWL4AI_HOOKS_ENABLED` env var

2. **Local File Inclusion via file:// URLs (CVE pending)**
   - Added URL scheme validation to `/execute_js`, `/screenshot`, `/pdf`, `/html`
   - Blocks `file://`, `javascript:`, `data:`, and other dangerous schemes
   - Only allows `http://`, `https://`, and `raw:` (where appropriate)

3. **Security hardening**
   - Added `CRAWL4AI_HOOKS_ENABLED=false` as the default (opt-in for hooks)
   - Added security warning comments in config.yml
   - Added a `validate_url_scheme()` helper for consistent validation
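A helper of this shape can be sketched as follows. This is a minimal illustration of the rule described above; the real `validate_url_scheme()` in the Docker server may differ:

```python
from urllib.parse import urlparse

# Assumption: allowed schemes per the notes above.
ALLOWED_SCHEMES = {"http", "https", "raw"}

def validate_url_scheme(url: str) -> bool:
    # urlparse extracts the scheme; anything outside the allow-list
    # (file, javascript, data, ...) is rejected.
    return urlparse(url).scheme.lower() in ALLOWED_SCHEMES

assert validate_url_scheme("https://example.com/page")
assert validate_url_scheme("raw:<html></html>")
assert not validate_url_scheme("file:///etc/passwd")
assert not validate_url_scheme("javascript:alert(1)")
```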
---

## New Features ✨

### 1. init_scripts Support for BrowserConfig
- **Commit**: `d10ca38`
- **Date**: December 14, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/browser_manager.py`

**Description**: Pre-page-load JavaScript injection capability. Useful for stealth evasions (canvas/audio fingerprinting, userAgentData).

**Usage**:
```python
config = BrowserConfig(
    init_scripts=["Object.defineProperty(navigator, 'webdriver', {get: () => false})"]
)
```

---

### 2. CDP Connection Improvements
- **Commit**: `02acad1`
- **Date**: December 18, 2025
- **Files Changed**:
  - `crawl4ai/browser_manager.py`
  - `tests/browser/test_cdp_cleanup_reuse.py`

**Description**:
- Support WebSocket URLs (ws://, wss://) for CDP connections
- Proper cleanup when `cdp_cleanup_on_close=True`
- Enables reusing the same browser across multiple sequential connections

---

### 3. Crash Recovery for Deep Crawl Strategies
- **Commit**: `31ebf37`
- **Date**: December 22, 2025
- **Files Changed**:
  - `crawl4ai/deep_crawling/bff_strategy.py`
  - `crawl4ai/deep_crawling/bfs_strategy.py`
  - `crawl4ai/deep_crawling/dfs_strategy.py`
  - `tests/deep_crawling/test_deep_crawl_resume.py`
  - `tests/deep_crawling/test_deep_crawl_resume_integration.py`

**Description**: Optional `resume_state` and `on_state_change` parameters for all deep crawl strategies (BFS, DFS, Best-First), enabling crash recovery in cloud deployments.

**Features**:
- `resume_state`: Pass saved state to resume from a checkpoint
- `on_state_change`: Async callback fired after each URL for real-time state persistence
- `export_state()`: Get the last captured state manually
- Zero overhead when the features are disabled (None defaults)
- State is JSON-serializable
---

### 4. PDF and MHTML Support for raw:/file:// URLs
- **Commit**: `67e03d6`
- **Date**: December 22, 2025
- **Files Changed**:
  - `crawl4ai/async_crawler_strategy.py`

**Description**: Generate PDFs and MHTML from cached HTML content. Replaced `_generate_screenshot_from_html` with a unified `_generate_media_from_html` method.

---

### 5. Screenshots for raw:/file:// URLs
- **Commit**: `444cb14`
- **Date**: December 22, 2025
- **Files Changed**:
  - `crawl4ai/async_crawler_strategy.py`

**Description**: Enables cached HTML to be rendered and screenshotted. Loads the HTML into the browser via `page.set_content()` and takes a screenshot.

---

### 6. base_url Parameter for CrawlerRunConfig
- **Commit**: `3937efc`
- **Date**: December 24, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_webcrawler.py`

**Description**: When processing raw: HTML (e.g., from cache), provides proper URL resolution context for markdown link generation.

**Usage**:
```python
config = CrawlerRunConfig(base_url='https://example.com')
result = await crawler.arun(url='raw:{html}', config=config)
```

---

### 7. Prefetch Mode for Two-Phase Deep Crawling
- **Commit**: `fde4e9f`
- **Date**: December 25, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_webcrawler.py`
  - `crawl4ai/utils.py`
  - `tests/test_prefetch_integration.py`
  - `tests/test_prefetch_mode.py`
  - `tests/test_prefetch_regression.py`

**Description**:
- Added a `prefetch` parameter to CrawlerRunConfig
- Added a `quick_extract_links()` function for fast link extraction
- Short-circuit in `aprocess_html()` for prefetch mode
- 42 tests added (unit, integration, regression)
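The idea behind fast link extraction (collect hrefs, skip everything else) can be sketched with the stdlib parser. This is illustrative, not the actual `quick_extract_links()` implementation:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect hrefs only, skipping markdown/media/extraction work."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<a href="/a">A</a><p>text</p><a href="/b">B</a>')
assert parser.links == ["/a", "/b"]
```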
---

### 8. Proxy Rotation and Configuration Updates
- **Commit**: `9e7f5aa`
- **Date**: December 26, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_webcrawler.py`
  - `crawl4ai/proxy_strategy.py`
  - `tests/proxy/test_sticky_sessions.py`

**Description**: Enhanced proxy rotation and proxy configuration options.

---

### 9. Proxy Support for HTTP Crawler Strategy
- **Commit**: `a43256b`
- **Date**: December 26, 2025
- **Files Changed**:
  - `crawl4ai/async_crawler_strategy.py`

**Description**: Added proxy support to the HTTP (non-browser) crawler strategy.

---

### 10. Browser Pipeline Support for raw:/file:// URLs
- **Commit**: `2550f3d`
- **Date**: December 27, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_crawler_strategy.py`
  - `crawl4ai/browser_manager.py`
  - `tests/test_raw_html_browser.py`
  - `tests/test_raw_html_edge_cases.py`

**Description**:
- Added a `process_in_browser` parameter to CrawlerRunConfig
- Routes raw:/file:// URLs through the browser when browser operations are needed
- Uses `page.set_content()` instead of `goto()` for local content
- Auto-detects browser requirements: js_code, wait_for, screenshot, etc.
- Maintains the fast path for raw:/file:// without browser params

**Fixes**: #310

---

### 11. Smart TTL Cache for Sitemap URL Seeder
- **Commit**: `3d78001`
- **Date**: December 30, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_url_seeder.py`

**Description**:
- Added `cache_ttl_hours` and `validate_sitemap_lastmod` params to SeedingConfig
- New JSON cache format with metadata (version, created_at, lastmod, url_count)
- Cache validation by TTL expiry and sitemap lastmod comparison
- Auto-migration from the old .jsonl to the new .json format
- Fixes a bug where an incomplete cache was used indefinitely
---

## Bug Fixes 🐛

### 1. HTTP Strategy raw: URL Parsing Truncates at # Character
- **Commit**: `624e341`
- **Date**: December 24, 2025
- **Files Changed**:
  - `crawl4ai/async_crawler_strategy.py`

**Problem**: `urlparse()` treated `#` as a URL fragment delimiter, breaking CSS color codes like `#eee`.

**Before**: `raw:body{background:#eee}` → parsed to `body{background:`
**After**: `raw:body{background:#eee}` → parsed to `body{background:#eee}`

**Fix**: Strip the `raw:` prefix directly instead of using urlparse.
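The before/after behavior is easy to reproduce in isolation (a sketch of the parsing difference, not the library's actual code path):

```python
from urllib.parse import urlparse

url = "raw:body{background:#eee}"

# Before: urlparse treats '#' as a fragment delimiter and drops the rest.
assert urlparse(url).path == "body{background:"

# After: strip the 'raw:' prefix directly, keeping '#' intact.
assert url[len("raw:"):] == "body{background:#eee}"
```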
---

### 2. Caching Debugging and Fixes
- **Commit**: `48426f7`
- **Date**: December 21, 2025
- **Files Changed**:
  - `crawl4ai/async_configs.py`
  - `crawl4ai/async_database.py`
  - `crawl4ai/async_webcrawler.py`
  - `crawl4ai/cache_validator.py`
  - `crawl4ai/models.py`
  - `crawl4ai/utils.py`
  - `tests/cache_validation/` (multiple test files)

**Description**: Various debugging passes and improvements to the caching system.

---

## Documentation Updates 📚

### 1. Multi-Sample Schema Generation Section
- **Commit**: `6b2dca7`
- **Date**: January 4, 2026
- **Files Changed**:
  - `docs/md_v2/extraction/no-llm-strategies.md`

**Description**: Added documentation explaining how to pass multiple HTML samples to `generate_schema()` for stable selectors that work across pages with varying DOM structures.

---

### 2. URL Seeder Docs - Smart TTL Cache Parameters
- **Commit**: `db61ab8`
- **Date**: December 30, 2025
- **Files Changed**:
  - `docs/md_v2/core/url-seeding.md`

**Description**:
- Added `cache_ttl_hours` and `validate_sitemap_lastmod` to the parameter table
- Documented smart TTL cache validation with examples
- Added cache-related troubleshooting entries

---

## Other Changes 🔧

| Date | Commit | Description |
|------|--------|-------------|
| Dec 30, 2025 | `0d3f9e6` | Add MEMORY.md to gitignore |
| Dec 21, 2025 | `f6b29a8` | Update gitignore |

---

## Files Changed Summary

### Core Library
- `crawl4ai/async_configs.py` - Multiple changes (init_scripts, base_url, prefetch, proxy, process_in_browser, cache TTL)
- `crawl4ai/async_webcrawler.py` - Prefetch mode, base_url, proxy
- `crawl4ai/async_crawler_strategy.py` - Media generation, proxy, raw URL fixes
- `crawl4ai/browser_manager.py` - init_scripts, CDP cleanup, cookie handling
- `crawl4ai/async_url_seeder.py` - Smart TTL cache
- `crawl4ai/proxy_strategy.py` - Proxy rotation
- `crawl4ai/deep_crawling/*.py` - Crash recovery for all strategies
- `crawl4ai/utils.py` - quick_extract_links, cache fixes

### Docker Deployment
- `deploy/docker/server.py` - Security fixes
- `deploy/docker/hook_manager.py` - RCE fix
- `deploy/docker/config.yml` - Security warnings

### Documentation
- `docs/md_v2/extraction/no-llm-strategies.md` - Schema generation
- `docs/md_v2/core/url-seeding.md` - Cache parameters

### Tests Added
- `tests/test_prefetch_*.py` - 42 prefetch tests
- `tests/test_raw_html_*.py` - Raw HTML browser tests
- `tests/deep_crawling/test_deep_crawl_resume*.py` - Resume tests
- `tests/browser/test_cdp_cleanup_reuse.py` - CDP tests
- `tests/proxy/test_sticky_sessions.py` - Proxy tests
- `tests/cache_validation/*.py` - Cache tests
- `deploy/docker/tests/test_security_fixes.py` - Security tests
- `deploy/docker/tests/run_security_tests.py` - Security integration tests

---

## Questions for Main Developer

1. [ ] Are there any other breaking changes not captured here?
2. [ ] Should the security fixes get their own patch release?
3. [ ] Any features that need additional documentation?

---

*Generated: January 12, 2026*
243
docs/RELEASE_NOTES_v0.8.0.md
Normal file
@@ -0,0 +1,243 @@
# Crawl4AI v0.8.0 Release Notes

**Release Date**: January 2026
**Previous Version**: v0.7.6
**Status**: Release Candidate

---

## Highlights

- **Critical Security Fixes** for the Docker API deployment
- **11 New Features** including crash recovery, prefetch mode, and proxy improvements
- **Breaking Changes** - see the migration guide below

---

## Breaking Changes

### 1. Docker API: Hooks Disabled by Default

**What changed**: Hooks are now disabled by default on the Docker API.

**Why**: Security fix for a Remote Code Execution (RCE) vulnerability.

**Who is affected**: Users of the Docker API who use the `hooks` parameter in `/crawl` requests.

**Migration**:
```bash
# To re-enable hooks (only if you trust all API users):
export CRAWL4AI_HOOKS_ENABLED=true
```

### 2. Docker API: file:// URLs Blocked

**What changed**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` now reject `file://` URLs.

**Why**: Security fix for a Local File Inclusion (LFI) vulnerability.

**Who is affected**: Users who were reading local files via the Docker API.

**Migration**: Use the Python library directly for local file processing:
```python
# Instead of an API call with a file:// URL, use the library:
from crawl4ai import AsyncWebCrawler

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="file:///path/to/file.html")
```

---

## Security Fixes

### Critical: Remote Code Execution via Hooks (CVE Pending)

**Severity**: CRITICAL (CVSS 10.0)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /crawl` with a malicious `hooks` parameter

**Details**: The `__import__` builtin was available in hook code, allowing attackers to import `os`, `subprocess`, etc. and execute arbitrary commands.

**Fix**:
1. Removed `__import__` from the allowed builtins
2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)

### High: Local File Inclusion via file:// URLs (CVE Pending)

**Severity**: HIGH (CVSS 8.6)
**Affected**: Docker API deployment (all versions before v0.8.0)
**Vector**: `POST /execute_js` (and other endpoints) with `file:///etc/passwd`

**Details**: API endpoints accepted `file://` URLs, allowing attackers to read arbitrary files from the server.

**Fix**: URL scheme validation now only allows `http://`, `https://`, and `raw:` URLs.

### Credits

Discovered by **Neo by ProjectDiscovery** ([projectdiscovery.io](https://projectdiscovery.io)) - December 2025

---

## New Features

### 1. init_scripts Support for BrowserConfig

Pre-page-load JavaScript injection for stealth evasions.

```python
config = BrowserConfig(
    init_scripts=[
        "Object.defineProperty(navigator, 'webdriver', {get: () => false})"
    ]
)
```

### 2. CDP Connection Improvements

- WebSocket URL support (`ws://`, `wss://`)
- Proper cleanup with `cdp_cleanup_on_close=True`
- Browser reuse across multiple connections

### 3. Crash Recovery for Deep Crawl Strategies

All deep crawl strategies (BFS, DFS, Best-First) now support crash recovery:

```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    resume_state=saved_state,      # Resume from checkpoint
    on_state_change=save_callback  # Persist state in real-time
)
```

### 4. PDF and MHTML for raw:/file:// URLs

Generate PDFs and MHTML from cached HTML content.

### 5. Screenshots for raw:/file:// URLs

Render cached HTML and capture screenshots.

### 6. base_url Parameter for CrawlerRunConfig

Proper URL resolution for raw: HTML processing:

```python
config = CrawlerRunConfig(base_url='https://example.com')
result = await crawler.arun(url='raw:{html}', config=config)
```

### 7. Prefetch Mode for Two-Phase Deep Crawling

Fast link extraction without full page processing:

```python
config = CrawlerRunConfig(prefetch=True)
```

### 8. Proxy Rotation and Configuration

Enhanced proxy rotation with sticky-session support.

### 9. Proxy Support for HTTP Strategy

The non-browser crawler now supports proxies.

### 10. Browser Pipeline for raw:/file:// URLs

New `process_in_browser` parameter for browser operations on local content:

```python
config = CrawlerRunConfig(
    process_in_browser=True,  # Force browser processing
    screenshot=True
)
result = await crawler.arun(url='raw:<html>...</html>', config=config)
```

### 11. Smart TTL Cache for Sitemap URL Seeder

Intelligent cache invalidation for sitemaps:

```python
config = SeedingConfig(
    cache_ttl_hours=24,
    validate_sitemap_lastmod=True
)
```

---

## Bug Fixes

### raw: URL Parsing Truncates at # Character

**Problem**: CSS color codes like `#eee` were being truncated.

**Before**: `raw:body{background:#eee}` → `body{background:`
**After**: `raw:body{background:#eee}` → `body{background:#eee}`

### Caching System Improvements

Various fixes to cache validation and persistence.

---

## Documentation Updates

- Multi-sample schema generation documentation
- URL seeder smart TTL cache parameters
- Security documentation (SECURITY.md)

---

## Upgrade Guide

### From v0.7.x to v0.8.0

1. **Update the package**:
   ```bash
   pip install --upgrade crawl4ai
   ```

2. **Docker API users**:
   - Hooks are now disabled by default
   - If you need hooks: `export CRAWL4AI_HOOKS_ENABLED=true`
   - `file://` URLs no longer work on the API (use the library directly)

3. **Review security settings**:
   ```yaml
   # config.yml - recommended for production
   security:
     enabled: true
     jwt_enabled: true
   ```

4. **Test your integration** before deploying to production

### Breaking Change Checklist

- [ ] Check whether you use the `hooks` parameter in API calls
- [ ] Check whether you use `file://` URLs via the API
- [ ] Update environment variables if needed
- [ ] Review the security configuration

---

## Full Changelog

See [CHANGELOG.md](../CHANGELOG.md) for the complete version history.

---

## Contributors

Thanks to all contributors who made this release possible.

Special thanks to **Neo by ProjectDiscovery** for responsible security disclosure.

---

*For questions or issues, please open a [GitHub Issue](https://github.com/unclecode/crawl4ai/issues).*
301
docs/migration/v0.8.0-upgrade-guide.md
Normal file
@@ -0,0 +1,301 @@
|
|||||||
|
# Migration Guide: Upgrading to Crawl4AI v0.8.0

This guide helps you upgrade from v0.7.x to v0.8.0, with special attention to breaking changes and security updates.

## Quick Summary

| Change | Impact | Action Required |
|--------|--------|-----------------|
| Hooks disabled by default | Docker API users with hooks | Set `CRAWL4AI_HOOKS_ENABLED=true` |
| `file://` URLs blocked | Docker API users reading local files | Use the Python library directly |
| Security fixes | All Docker API users | Update immediately |

---

## Step 1: Update the Package

### PyPI Installation

```bash
pip install --upgrade crawl4ai
```

### Docker Installation

```bash
docker pull unclecode/crawl4ai:latest
# or
docker pull unclecode/crawl4ai:0.8.0
```

### From Source

```bash
git pull origin main
pip install -e .
```

---

## Step 2: Check for Breaking Changes

### Are You Affected?

**You ARE affected if you:**

- Use the Docker API deployment
- Use the `hooks` parameter in `/crawl` requests
- Use `file://` URLs via API endpoints

**You are NOT affected if you:**

- Only use Crawl4AI as a Python library
- Don't use hooks in your API calls
- Don't use `file://` URLs via the API

---

## Step 3: Migrate Hooks Usage

### Before v0.8.0

Hooks worked by default:

```bash
# This worked without any configuration
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "hooks": {
      "code": {
        "on_page_context_created": "async def hook(page, context, **kwargs):\n    await context.add_cookies([...])\n    return page"
      }
    }
  }'
```

### After v0.8.0

You must explicitly enable hooks.

**Option A: Environment Variable (Recommended)**

```bash
# Set before starting the server
# (for docker run, pass it with: -e CRAWL4AI_HOOKS_ENABLED=true)
export CRAWL4AI_HOOKS_ENABLED=true
```

```yaml
# docker-compose.yml
services:
  crawl4ai:
    image: unclecode/crawl4ai:0.8.0
    environment:
      - CRAWL4AI_HOOKS_ENABLED=true
```

**Option B: Kubernetes**

```yaml
env:
  - name: CRAWL4AI_HOOKS_ENABLED
    value: "true"
```

### Security Warning

Only enable hooks if:

- You trust all users who can access the API
- The API is not exposed to the public internet
- You have additional authentication/authorization in place

---

## Step 4: Migrate file:// URL Usage

### Before v0.8.0

```bash
# This worked via the API
curl -X POST http://localhost:11235/execute_js \
  -d '{"url": "file:///var/data/page.html", "scripts": ["document.title"]}'
```

### After v0.8.0

**Option A: Use the Python Library Directly**

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def process_local_file():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="file:///var/data/page.html",
            config=CrawlerRunConfig(js_code=["document.title"])
        )
        return result
```

**Option B: Use the raw: Protocol for HTML Content**

If you already have the HTML content, you can still use the API:

```bash
# Read the file and send it as a raw: URL.
# jq builds the JSON body so quotes and newlines in the HTML are escaped safely.
HTML_CONTENT=$(cat /var/data/page.html)
curl -X POST http://localhost:11235/html \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg u "raw:$HTML_CONTENT" '{url: $u}')"
```
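If you would rather avoid shell quoting altogether, the same payload can be built in Python; `json.dumps` handles the escaping that breaks hand-built JSON strings. The helper name here is illustrative:

```python
import json
from pathlib import Path

def raw_payload(path: str) -> str:
    """Build the JSON body for the /html endpoint from a local HTML file."""
    html = Path(path).read_text(encoding="utf-8")
    # json.dumps escapes quotes and newlines that would break hand-built JSON
    return json.dumps({"url": f"raw:{html}"})
```

You can then pipe the output to curl with `-d @-`, or send it with any HTTP client.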

**Option C: Create a Preprocessing Service**

```python
# preprocessing_service.py
from fastapi import FastAPI
from crawl4ai import AsyncWebCrawler

app = FastAPI()

@app.post("/process-local")
async def process_local(file_path: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=f"file://{file_path}")
        return result.model_dump()
```

Restrict access to such a service: it reintroduces local-file reads for anyone who can call it.

---

## Step 5: Review Security Configuration

### Recommended Production Settings

```yaml
# config.yml
security:
  enabled: true
  jwt_enabled: true
  https_redirect: true  # If behind an HTTPS proxy
  trusted_hosts:
    - "your-domain.com"
    - "api.your-domain.com"
```

### Environment Variables

```bash
# Required for JWT authentication
export SECRET_KEY="your-secure-random-key-minimum-32-characters"

# Only if you need hooks
export CRAWL4AI_HOOKS_ENABLED=true
```

### Generate a Secure Secret Key

```python
import secrets
print(secrets.token_urlsafe(32))
```
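If Python is not handy on the host, OpenSSL produces an equivalent random key:

```shell
# 32 random bytes, base64-encoded (one 44-character line)
openssl rand -base64 32
```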
---

## Step 6: Test Your Integration

### Quick Validation Script

```python
import asyncio
import aiohttp

async def test_upgrade():
    base_url = "http://localhost:11235"

    # Test 1: Basic crawl should work
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{base_url}/crawl",
            json={"urls": ["https://example.com"]}
        ) as resp:
            assert resp.status == 200, "Basic crawl failed"
            print("✓ Basic crawl works")

    # Test 2: Hooks should be blocked (unless enabled)
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{base_url}/crawl",
            json={
                "urls": ["https://example.com"],
                "hooks": {"code": {"on_page_context_created": "async def hook(page, context, **kwargs): return page"}}
            }
        ) as resp:
            if resp.status == 403:
                print("✓ Hooks correctly blocked (default)")
            elif resp.status == 200:
                print("! Hooks enabled - ensure this is intentional")

    # Test 3: file:// should be blocked
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{base_url}/execute_js",
            json={"url": "file:///etc/passwd", "scripts": ["1"]}
        ) as resp:
            assert resp.status == 400, "file:// should be blocked"
            print("✓ file:// URLs correctly blocked")

asyncio.run(test_upgrade())
```

---

## Troubleshooting

### "Hooks are disabled" Error

**Symptom**: The API returns 403 with "Hooks are disabled".

**Solution**: Set `CRAWL4AI_HOOKS_ENABLED=true` if you need hooks.

### "URL must start with http://, https://" Error

**Symptom**: The API returns 400 when you use `file://` URLs.

**Solution**: Use the Python library directly or the `raw:` protocol.

### Authentication Errors After Enabling JWT

**Symptom**: The API returns 401 Unauthorized.

**Solution**:

1. Get a token: `POST /token` with your email
2. Include the token in requests: `Authorization: Bearer <token>`
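The two-step flow can be sketched with the standard library. The `/token` request shape and the `access_token` response field follow the examples in this guide and may differ in your deployment:

```python
import json
import urllib.request

def bearer_headers(token: str) -> dict:
    """Headers for authenticated requests: Authorization: Bearer <token>."""
    return {"Authorization": f"Bearer {token}",
            "Content-Type": "application/json"}

def get_token(base_url: str, email: str) -> str:
    """POST /token with your email and return the issued JWT."""
    req = urllib.request.Request(
        f"{base_url}/token",
        data=json.dumps({"email": email}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]
```

Pass `bearer_headers(get_token(base_url, email))` to every subsequent request against protected endpoints.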
---

## Rollback Plan

If you need to roll back:

```bash
# PyPI
pip install crawl4ai==0.7.6

# Docker
docker pull unclecode/crawl4ai:0.7.6
```

**Warning**: Rolling back re-exposes the security vulnerabilities. Only do this temporarily while fixing integration issues.

---

## Getting Help

- **GitHub Issues**: [github.com/unclecode/crawl4ai/issues](https://github.com/unclecode/crawl4ai/issues)
- **Security Issues**: See [SECURITY.md](../../SECURITY.md)
- **Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)

---

## Changelog Reference

For the complete list of changes, see:

- [Release Notes v0.8.0](../RELEASE_NOTES_v0.8.0.md)
- [CHANGELOG.md](../../CHANGELOG.md)

---

`docs/security/GHSA-DRAFT-RCE-LFI.md` (new file, 171 lines)

# GitHub Security Advisory Draft

> **Instructions**: Copy this content to create security advisories at
> https://github.com/unclecode/crawl4ai/security/advisories/new

---

## Advisory 1: Remote Code Execution via Hooks Parameter

### Title

Remote Code Execution in Docker API via Hooks Parameter

### Severity

Critical

### CVSS Score

10.0 (CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H)

### CWE

CWE-94 (Improper Control of Generation of Code)

### Package

crawl4ai (Docker API deployment)

### Affected Versions

< 0.8.0

### Patched Versions

0.8.0

### Description

A critical remote code execution vulnerability exists in the Crawl4AI Docker API deployment. The `/crawl` endpoint accepts a `hooks` parameter containing Python code that is executed with `exec()`. The `__import__` builtin was included in the allowed builtins, allowing attackers to import arbitrary modules and execute system commands.

**Attack Vector:**

```json
POST /crawl
{
  "urls": ["https://example.com"],
  "hooks": {
    "code": {
      "on_page_context_created": "async def hook(page, context, **kwargs):\n    __import__('os').system('malicious_command')\n    return page"
    }
  }
}
```

### Impact

An unauthenticated attacker can:

- Execute arbitrary system commands
- Read/write files on the server
- Exfiltrate sensitive data (environment variables, API keys)
- Pivot to internal network services
- Completely compromise the server

### Mitigation

1. **Upgrade to v0.8.0** (recommended)
2. If unable to upgrade immediately:
   - Disable the Docker API
   - Block the `/crawl` endpoint at the network level
   - Add authentication to the API

### Fix Details

1. Removed `__import__` from `allowed_builtins` in `hook_manager.py`
2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
3. Users must explicitly opt in to enable hooks
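The hardened pattern can be illustrated in isolation (names here are illustrative, not the actual crawl4ai internals): executing hook source against a builtins whitelist that omits `__import__` makes both `import` statements and `__import__()` calls fail at name lookup.

```python
# A deliberately small builtins whitelist; anything absent from it
# (notably __import__) is unreachable from the executed hook source.
SAFE_BUILTINS = {"len": len, "str": str, "range": range, "print": print}

def run_hook_source(source: str) -> dict:
    """Execute user-supplied hook code with restricted builtins."""
    namespace = {"__builtins__": SAFE_BUILTINS}
    exec(compile(source, "<hook>", "exec"), namespace)
    return namespace

# A benign hook compiles and runs...
ns = run_hook_source("def hook(x):\n    return len(str(x))")
# ...while "__import__('os')" in hook source raises NameError.
```

A builtins whitelist blocks this specific vector but is not a complete sandbox, which is why the fix also disables hooks by default.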

### Credits

Discovered by Neo by ProjectDiscovery (https://projectdiscovery.io)

### References

- [Release Notes v0.8.0](https://github.com/unclecode/crawl4ai/blob/main/docs/RELEASE_NOTES_v0.8.0.md)
- [Migration Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/migration/v0.8.0-upgrade-guide.md)

---

## Advisory 2: Local File Inclusion via file:// URLs

### Title

Local File Inclusion in Docker API via file:// URLs

### Severity

High

### CVSS Score

8.6 (CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:N/A:N)

### CWE

CWE-22 (Improper Limitation of a Pathname to a Restricted Directory)

### Package

crawl4ai (Docker API deployment)

### Affected Versions

< 0.8.0

### Patched Versions

0.8.0

### Description

A local file inclusion vulnerability exists in the Crawl4AI Docker API. The `/execute_js`, `/screenshot`, `/pdf`, and `/html` endpoints accept `file://` URLs, allowing attackers to read arbitrary files from the server filesystem.

**Attack Vector:**

```json
POST /execute_js
{
  "url": "file:///etc/passwd",
  "scripts": ["document.body.innerText"]
}
```

### Impact

An unauthenticated attacker can:

- Read sensitive files (`/etc/passwd`, `/etc/shadow`, application configs)
- Access environment variables via `/proc/self/environ`
- Discover the internal application structure
- Potentially read credentials and API keys

### Mitigation

1. **Upgrade to v0.8.0** (recommended)
2. If unable to upgrade immediately:
   - Disable the Docker API
   - Add authentication to the API
   - Use network-level filtering

### Fix Details

Added URL scheme validation to block:

- `file://` URLs
- `javascript:` URLs
- `data:` URLs
- Other non-HTTP schemes

Only `http://`, `https://`, and `raw:` URLs are now allowed.
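The allow-list check can be sketched as follows; this is a simplified stand-in for the actual validation, not the crawl4ai source:

```python
from urllib.parse import urlparse

# Allow-list mirroring the documented v0.8.0 behavior
ALLOWED_SCHEMES = {"http", "https", "raw"}

def is_allowed_url(url: str) -> bool:
    """Accept only allow-listed schemes; file:, javascript:, data:, etc. are rejected."""
    return urlparse(url).scheme.lower() in ALLOWED_SCHEMES
```

Allow-listing schemes is safer than block-listing known-bad ones, since any scheme added to browsers later is rejected by default.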

### Credits

Discovered by Neo by ProjectDiscovery (https://projectdiscovery.io)

### References

- [Release Notes v0.8.0](https://github.com/unclecode/crawl4ai/blob/main/docs/RELEASE_NOTES_v0.8.0.md)
- [Migration Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/migration/v0.8.0-upgrade-guide.md)

---

## Creating the Advisories on GitHub

1. Go to: https://github.com/unclecode/crawl4ai/security/advisories/new

2. Fill in the form for each advisory:
   - **Ecosystem**: PyPI
   - **Package name**: crawl4ai
   - **Affected versions**: < 0.8.0
   - **Patched versions**: 0.8.0
   - **Severity**: Critical (for the RCE), High (for the LFI)

3. After creating, GitHub will:
   - Assign a GHSA ID
   - Optionally request a CVE
   - Notify users who have security alerts enabled

4. Coordinate disclosure timing with the fix release.