Release v0.8.0: Crash Recovery, Prefetch Mode & Security Fixes (#1712)

* Fix: Use correct URL variable for raw HTML extraction (#1116)
  - Prevents full HTML content from being passed as URL to extraction strategies
  - Added unit tests to verify raw HTML and regular URL processing

  Fix: Wrong URL variable used for extraction of raw html

* Fix #1181: Preserve whitespace in code blocks during HTML scraping

  The remove_empty_elements_fast() method was removing whitespace-only span elements inside <pre> and <code> tags, causing import statements like "import torch" to become "importtorch". Now skips elements inside code blocks where whitespace is significant.

* Refactor Pydantic model configuration to use ConfigDict for arbitrary types

* Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621

* Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638

* fix: ensure BrowserConfig.to_dict serializes proxy_config

* feat: make LLM backoff configurable end-to-end
  - extend LLMConfig with backoff delay/attempt/factor fields and thread them through LLMExtractionStrategy, LLMContentFilter, table extraction, and Docker API handlers
  - expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff and document them in the md_v2 guides

* reproduced AttributeError from #1642

* pass timeout parameter to docker client request

* added missing deep crawling objects to init

* generalized query in ContentRelevanceFilter to be a str or list

* import modules from enhanceable deserialization

* parameterized tests

* Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268

* refactor: replace PyPDF2 with pypdf across the codebase. ref #1412

* Add browser_context_id and target_id parameters to BrowserConfig

  Enable Crawl4AI to connect to pre-created CDP browser contexts, which is essential for cloud browser services that pre-create isolated contexts.

  Changes:
  - Add browser_context_id and target_id parameters to BrowserConfig
  - Update from_kwargs() and to_dict() methods
  - Modify BrowserManager.start() to use existing context when provided
  - Add _get_page_by_target_id() helper method
  - Update get_page() to handle pre-existing targets
  - Add test for browser_context_id functionality

  This enables cloud services to:
  1. Create isolated CDP contexts before Crawl4AI connects
  2. Pass context/target IDs to BrowserConfig
  3. Have Crawl4AI reuse existing contexts instead of creating new ones

* Add cdp_cleanup_on_close flag to prevent memory leaks in cloud/server scenarios

* Fix: add cdp_cleanup_on_close to from_kwargs

* Fix: find context by target_id for concurrent CDP connections

* Fix: use target_id to find correct page in get_page

* Fix: use CDP to find context by browserContextId for concurrent sessions

* Revert context matching attempts - Playwright cannot see CDP-created contexts

* Add create_isolated_context flag for concurrent CDP crawls

  When True, forces creation of a new browser context instead of reusing the default context. Essential for concurrent crawls on the same browser to prevent navigation conflicts.

* Add context caching to create_isolated_context branch

  Uses contexts_by_config cache (same as non-CDP mode) to reuse contexts for multiple URLs with same config. Still creates new page per crawl for navigation isolation. Benefits batch/deep crawls.

* Add init_scripts support to BrowserConfig for pre-page-load JS injection

  This adds the ability to inject JavaScript that runs before any page loads, useful for stealth evasions (canvas/audio fingerprinting, userAgentData).
  - Add init_scripts parameter to BrowserConfig (list of JS strings)
  - Apply init_scripts in setup_context() via context.add_init_script()
  - Update from_kwargs() and to_dict() for serialization

* Fix CDP connection handling: support WS URLs and proper cleanup

  Changes to browser_manager.py:
  1. _verify_cdp_ready(): Support multiple URL formats
     - WebSocket URLs (ws://, wss://): Skip HTTP verification, Playwright handles directly
     - HTTP URLs with query params: Properly parse with urlparse to preserve query string
     - Fixes issue where naive f"{cdp_url}/json/version" broke WS URLs and query params
  2. close(): Proper cleanup when cdp_cleanup_on_close=True
     - Close all sessions (pages)
     - Close all contexts
     - Call browser.close() to disconnect (doesn't terminate browser, just releases connection)
     - Wait 1 second for CDP connection to fully release
     - Stop Playwright instance to prevent memory leaks

  This enables:
  - Connecting to specific browsers via WS URL
  - Reusing the same browser with multiple sequential connections
  - No user wait needed between connections (internal 1s delay handles it)

  Added tests/browser/test_cdp_cleanup_reuse.py with comprehensive tests.

* Update gitignore

* Some debugging for caching

* Add _generate_screenshot_from_html for raw: and file:// URLs

  Implements the missing method that was being called but never defined. Now raw: and file:// URLs can generate screenshots by:
  1. Loading HTML into a browser page via page.set_content()
  2. Taking screenshot using existing take_screenshot() method
  3. Cleaning up the page afterward

  This enables cached HTML to be rendered with screenshots in crawl4ai-cloud.

* Add PDF and MHTML support for raw: and file:// URLs
  - Replace _generate_screenshot_from_html with _generate_media_from_html
  - New method handles screenshot, PDF, and MHTML in one browser session
  - Update raw: and file:// URL handlers to use new method
  - Enables cached HTML to generate all media types

* Add crash recovery for deep crawl strategies

  Add optional resume_state and on_state_change parameters to all deep crawl strategies (BFS, DFS, Best-First) for cloud deployment crash recovery.

  Features:
  - resume_state: Pass saved state to resume from checkpoint
  - on_state_change: Async callback fired after each URL for real-time state persistence to external storage (Redis, DB, etc.)
  - export_state(): Get last captured state manually
  - Zero overhead when features are disabled (None defaults)

  State includes visited URLs, pending queue/stack, depths, and pages_crawled count. All state is JSON-serializable.

* Fix: HTTP strategy raw: URL parsing truncates at # character

  The AsyncHTTPCrawlerStrategy.crawl() method used urlparse() to extract content from raw: URLs. This caused HTML with CSS color codes like #eee to be truncated because # is treated as a URL fragment delimiter.

  Before: raw:body{background:#eee} -> parsed.path = 'body{background:'
  After: raw:body{background:#eee} -> raw_content = 'body{background:#eee}'

  Fix: Strip the raw: or raw:// prefix directly instead of using urlparse, matching how the browser strategy handles it.

* Add base_url parameter to CrawlerRunConfig for raw HTML processing

  When processing raw: HTML (e.g., from cache), the URL parameter is meaningless for markdown link resolution. This adds a base_url parameter that can be set explicitly to provide proper URL resolution context.

  Changes:
  - Add base_url parameter to CrawlerRunConfig.__init__
  - Add base_url to CrawlerRunConfig.from_kwargs
  - Update aprocess_html to use base_url for markdown generation

  Usage:
    config = CrawlerRunConfig(base_url='https://example.com')
    result = await crawler.arun(url='raw:{html}', config=config)

* Add prefetch mode for two-phase deep crawling
  - Add `prefetch` parameter to CrawlerRunConfig
  - Add `quick_extract_links()` function for fast link extraction
  - Add short-circuit in aprocess_html() for prefetch mode
  - Add 42 tests (unit, integration, regression)

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Updates on proxy rotation and proxy configuration

* Add proxy support to HTTP crawler strategy

* Add browser pipeline support for raw:/file:// URLs
  - Add process_in_browser parameter to CrawlerRunConfig
  - Route raw:/file:// URLs through _crawl_web() when browser operations needed
  - Use page.set_content() instead of goto() for local content
  - Fix cookie handling for non-HTTP URLs in browser_manager
  - Auto-detect browser requirements: js_code, wait_for, screenshot, etc.
  - Maintain fast path for raw:/file:// without browser params

  Fixes #310

* Add smart TTL cache for sitemap URL seeder
  - Add cache_ttl_hours and validate_sitemap_lastmod params to SeedingConfig
  - New JSON cache format with metadata (version, created_at, lastmod, url_count)
  - Cache validation by TTL expiry and sitemap lastmod comparison
  - Auto-migration from old .jsonl to new .json format
  - Fixes bug where incomplete cache was used indefinitely

* Update URL seeder docs with smart TTL cache parameters
  - Add cache_ttl_hours and validate_sitemap_lastmod to parameter table
  - Document smart TTL cache validation with examples
  - Add cache-related troubleshooting entries
  - Update key features summary

* Add MEMORY.md to gitignore

* Docs: Add multi-sample schema generation section

  Add documentation explaining how to pass multiple HTML samples to generate_schema() for stable selectors that work across pages with varying DOM structures.

  Includes:
  - Problem explanation (fragile nth-child selectors)
  - Solution with code example
  - Key points for multi-sample queries
  - Comparison table of fragile vs stable selectors

* Fix critical RCE and LFI vulnerabilities in Docker API deployment

  Security fixes for vulnerabilities reported by ProjectDiscovery:
  1. Remote Code Execution via Hooks (CVE pending)
     - Remove __import__ from allowed_builtins in hook_manager.py
     - Prevents arbitrary module imports (os, subprocess, etc.)
     - Hooks now disabled by default via CRAWL4AI_HOOKS_ENABLED env var
  2. Local File Inclusion via file:// URLs (CVE pending)
     - Add URL scheme validation to /execute_js, /screenshot, /pdf, /html
     - Block file://, javascript:, data: and other dangerous schemes
     - Only allow http://, https://, and raw: (where appropriate)
  3. Security hardening
     - Add CRAWL4AI_HOOKS_ENABLED=false as default (opt-in for hooks)
     - Add security warning comments in config.yml
     - Add validate_url_scheme() helper for consistent validation

  Testing:
  - Add unit tests (test_security_fixes.py) - 16 tests
  - Add integration tests (run_security_tests.py) for live server

  Affected endpoints:
  - POST /crawl (hooks disabled by default)
  - POST /crawl/stream (hooks disabled by default)
  - POST /execute_js (URL validation added)
  - POST /screenshot (URL validation added)
  - POST /pdf (URL validation added)
  - POST /html (URL validation added)

  Breaking changes:
  - Hooks require CRAWL4AI_HOOKS_ENABLED=true to function
  - file:// URLs no longer work on API endpoints (use library directly)

* Enhance authentication flow by implementing JWT token retrieval and adding authorization headers to API requests

* Add release notes for v0.7.9, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates

* Add release notes for v0.8.0, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates

  Documentation for v0.8.0 release:
  - SECURITY.md: Security policy and vulnerability reporting guidelines
  - RELEASE_NOTES_v0.8.0.md: Comprehensive release notes
  - migration/v0.8.0-upgrade-guide.md: Step-by-step migration guide
  - security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts
  - CHANGELOG.md: Updated with v0.8.0 changes

  Breaking changes documented:
  - Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED)
  - file:// URLs blocked on Docker API endpoints

  Security fixes credited to Neo by ProjectDiscovery

* Add examples for deep crawl crash recovery and prefetch mode in documentation

* Release v0.8.0: The v0.8.0 Update
  - Updated version to 0.8.0
  - Added comprehensive demo and release notes
  - Updated all documentation

* Update security researcher acknowledgment with a hyperlink for Neo by ProjectDiscovery

* Add async agenerate_schema method for schema generation
  - Extract prompt building to shared _build_schema_prompt() method
  - Add agenerate_schema() async version using aperform_completion_with_backoff
  - Refactor generate_schema() to use shared prompt builder
  - Fixes Gemini/Vertex AI compatibility in async contexts (FastAPI)

* Fix: Enable litellm.drop_params for O-series/GPT-5 model compatibility

  O-series (o1, o3) and GPT-5 models only support temperature=1. Setting litellm.drop_params=True auto-drops unsupported parameters instead of throwing UnsupportedParamsError. Fixes temperature=0.01 error for these models in LLM extraction.

---------

Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: unclecode <unclecode@kidocode.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 14:19:15 +01:00
parent c85f56b085
commit f6f7f1b551
58 changed files with 11942 additions and 2411 deletions
--- a/docs/RELEASE_NOTES_v0.8.0.md
+++ b/docs/RELEASE_NOTES_v0.8.0.md
@@ -0,0 +1,243 @@
+# Crawl4AI v0.8.0 Release Notes
+
+**Release Date**: January 2026
+**Previous Version**: v0.7.6
+**Status**: Release Candidate
+
+---
+
+## Highlights
+
+- **Critical Security Fixes** for Docker API deployment
+- **11 New Features** including crash recovery, prefetch mode, and proxy improvements
+- **Breaking Changes** - see migration guide below
+
+---
+
+## Breaking Changes
+
+### 1. Docker API: Hooks Disabled by Default
+
+**What changed**: Hooks are now disabled by default on the Docker API.
+
+**Why**: Security fix for Remote Code Execution (RCE) vulnerability.
+
+**Who is affected**: Users of the Docker API who use the `hooks` parameter in `/crawl` requests.
+
+**Migration**:
+```bash
+# To re-enable hooks (only if you trust all API users):
+export CRAWL4AI_HOOKS_ENABLED=true
+```
+
+### 2. Docker API: file:// URLs Blocked
+
+**What changed**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` now reject `file://` URLs.
+
+**Why**: Security fix for Local File Inclusion (LFI) vulnerability.
+
+**Who is affected**: Users who were reading local files via the Docker API.
+
+**Migration**: Use the Python library directly for local file processing:
+```python
+# Instead of an API call with a file:// URL, use the library:
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(url="file:///path/to/file.html")
+
+asyncio.run(main())
+```
+
+---
+
+## Security Fixes
+
+### Critical: Remote Code Execution via Hooks (CVE Pending)
+
+**Severity**: CRITICAL (CVSS 10.0)
+**Affected**: Docker API deployment (all versions before v0.8.0)
+**Vector**: `POST /crawl` with malicious `hooks` parameter
+
+**Details**: The `__import__` builtin was available in hook code, allowing attackers to import `os`, `subprocess`, etc. and execute arbitrary commands.
+
+**Fix**:
+1. Removed `__import__` from allowed builtins
+2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
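
The effect of removing `__import__` from the allowed builtins can be illustrated with a minimal sketch. The allow-list and the `run_hook_source` helper below are hypothetical and are not taken from `hook_manager.py`; the sketch only shows that, in CPython, an `import` statement fails once `__import__` is absent from the builtins handed to `exec`. Restricting builtins alone is not a complete sandbox, which is presumably why hooks are also disabled by default.

```python
import builtins

# Hypothetical allow-list for illustration; the real list may differ.
ALLOWED_BUILTINS = ("len", "str", "int", "dict", "list", "print", "isinstance")


def run_hook_source(source: str) -> dict:
    """Execute source with a reduced set of builtins and return its namespace."""
    safe_builtins = {name: getattr(builtins, name) for name in ALLOWED_BUILTINS}
    namespace = {"__builtins__": safe_builtins}
    exec(source, namespace)
    return namespace


# Ordinary code that only needs allowed builtins still runs.
ns = run_hook_source("count = len('abc')")

# An import statement looks up __import__ in the builtins and fails without it.
try:
    run_hook_source("import os")
    import_blocked = False
except ImportError:
    import_blocked = True
```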
+
+### High: Local File Inclusion via file:// URLs (CVE Pending)
+
+**Severity**: HIGH (CVSS 8.6)
+**Affected**: Docker API deployment (all versions before v0.8.0)
+**Vector**: `POST /execute_js` (and other endpoints) with `file:///etc/passwd`
+
+**Details**: API endpoints accepted `file://` URLs, allowing attackers to read arbitrary files from the server.
+
+**Fix**: URL scheme validation now only allows `http://`, `https://`, and `raw:` URLs.
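
As a rough illustration of the scheme check described above, the sketch below accepts only `http://` and `https://` URLs, plus `raw:` where the caller opts in. The commit message names a `validate_url_scheme()` helper, but the signature, the `allow_raw` parameter, and the error type used here are assumptions made for this example, not the project's actual code.

```python
ALLOWED_PREFIXES = ("http://", "https://")


def validate_url_scheme(url: str, allow_raw: bool = False) -> None:
    """Raise ValueError unless the URL starts with an allowed scheme prefix.

    Hypothetical sketch: `allow_raw` and the ValueError are assumptions.
    """
    prefixes = ALLOWED_PREFIXES + (("raw:",) if allow_raw else ())
    if not url.lower().startswith(prefixes):
        raise ValueError(f"URL scheme not allowed: {url[:40]!r}")


# Accepted: plain web URLs, and raw: content only when explicitly allowed.
validate_url_scheme("https://example.com")
validate_url_scheme("raw:<html></html>", allow_raw=True)
```

An allow-list is used rather than a block-list so that schemes nobody thought to enumerate (for example `ftp:` or `view-source:`) are rejected as well.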
+
+### Credits
+
+Discovered by **Neo by ProjectDiscovery** ([projectdiscovery.io](https://projectdiscovery.io)) - December 2025
+
+---
+
+## New Features
+
+### 1. init_scripts Support for BrowserConfig
+
+Pre-page-load JavaScript injection for stealth evasions.
+
+```python
+config = BrowserConfig(
+    init_scripts=[
+        "Object.defineProperty(navigator, 'webdriver', {get: () => false})"
+    ]
+)
+```
+
+### 2. CDP Connection Improvements
+
+- WebSocket URL support (`ws://`, `wss://`)
+- Proper cleanup with `cdp_cleanup_on_close=True`
+- Browser reuse across multiple connections
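
The commit message explains that readiness checks previously built the probe URL with a naive `f"{cdp_url}/json/version"`, which breaks WebSocket URLs and URLs carrying a query string. A minimal sketch of the corrected idea follows; the function name `version_endpoint` is hypothetical and the code is not the project's `_verify_cdp_ready()` implementation.

```python
from typing import Optional
from urllib.parse import urlparse


def version_endpoint(cdp_url: str) -> Optional[str]:
    """Return the HTTP URL to probe for readiness, or None for ws:// and wss:// URLs."""
    parsed = urlparse(cdp_url)
    if parsed.scheme in ("ws", "wss"):
        # WebSocket endpoints are handed to the browser driver directly.
        return None
    # Append to the path component only, so any query string stays at the end.
    path = parsed.path.rstrip("/") + "/json/version"
    return parsed._replace(path=path).geturl()


plain = version_endpoint("http://localhost:9222")
with_query = version_endpoint("http://host:9222?token=abc")
websocket = version_endpoint("ws://localhost:9222/devtools/browser/abc")
```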
+
+### 3. Crash Recovery for Deep Crawl Strategies
+
+All deep crawl strategies (BFS, DFS, Best-First) now support crash recovery:
+
+```python
+from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+strategy = BFSDeepCrawlStrategy(
+    max_depth=3,
+    resume_state=saved_state,  # Resume from checkpoint
+    on_state_change=save_callback  # Persist state in real-time
+)
+```
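
One possible shape for the `save_callback` above is sketched here. It assumes, as the commit message states, that the callback is async and receives a JSON-serializable state dict; the file location and helper names are made up for the example. Writing to a temporary file and renaming it means a crash in the middle of a save should not leave a half-written checkpoint behind.

```python
import asyncio
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional

# Hypothetical location; production setups might use Redis or a database instead.
STATE_FILE = Path(tempfile.gettempdir()) / "crawl_state_sketch.json"


async def save_callback(state: Dict[str, Any]) -> None:
    """Persist the latest crawl state by writing a temp file, then renaming it."""
    tmp_path = STATE_FILE.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(state))
    os.replace(tmp_path, STATE_FILE)  # atomic rename on the same filesystem


def load_saved_state() -> Optional[Dict[str, Any]]:
    """Return the last checkpoint, or None when no crawl has been saved yet."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return None


# Simulate one state update and read it back.
asyncio.run(save_callback({"pages_crawled": 2, "pending": []}))
saved_state = load_saved_state()
```

The value returned by `load_saved_state()` is what would be passed as `resume_state` on the next run.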
+
+### 4. PDF and MHTML for raw:/file:// URLs
+
+Generate PDFs and MHTML from cached HTML content.
+
+### 5. Screenshots for raw:/file:// URLs
+
+Render cached HTML and capture screenshots.
+
+### 6. base_url Parameter for CrawlerRunConfig
+
+Proper URL resolution for raw: HTML processing:
+
+```python
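+from crawl4ai import CrawlerRunConfig
+# `html` holds the raw HTML string; `crawler` is an open AsyncWebCrawler
+config = CrawlerRunConfig(base_url='https://example.com')
+result = await crawler.arun(url=f'raw:{html}', config=config)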
+```
+
+### 7. Prefetch Mode for Two-Phase Deep Crawling
+
+Fast link extraction without full page processing:
+
+```python
+config = CrawlerRunConfig(prefetch=True)
+```
+
+### 8. Proxy Rotation and Configuration
+
+Enhanced proxy rotation with sticky sessions support.
+
+### 9. Proxy Support for HTTP Strategy
+
+Non-browser crawler now supports proxies.
+
+### 10. Browser Pipeline for raw:/file:// URLs
+
+New `process_in_browser` parameter for browser operations on local content:
+
+```python
+config = CrawlerRunConfig(
+    process_in_browser=True,  # Force browser processing
+    screenshot=True
+)
+result = await crawler.arun(url='raw:<html>...</html>', config=config)
+```
+
+### 11. Smart TTL Cache for Sitemap URL Seeder
+
+Intelligent cache invalidation for sitemaps:
+
+```python
+config = SeedingConfig(
+    cache_ttl_hours=24,
+    validate_sitemap_lastmod=True
+)
+```
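
The commit message describes cache validation as a combination of TTL expiry and a comparison against the sitemap's `lastmod`. The sketch below shows one way such a check could look; the function name, its parameters, and the use of `datetime` values are assumptions for illustration and do not reflect the seeder's internal code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_cache_fresh(
    created_at: datetime,
    cache_ttl_hours: float,
    cached_lastmod: Optional[datetime] = None,
    sitemap_lastmod: Optional[datetime] = None,
    now: Optional[datetime] = None,
) -> bool:
    """Return True when the cache is within its TTL and the sitemap has not changed since."""
    now = now or datetime.now(timezone.utc)
    if now - created_at > timedelta(hours=cache_ttl_hours):
        return False  # older than the TTL
    if cached_lastmod and sitemap_lastmod and sitemap_lastmod > cached_lastmod:
        return False  # sitemap was modified after the cache was written
    return True


created = datetime(2026, 1, 1, tzinfo=timezone.utc)
fresh = is_cache_fresh(created, 24, now=created + timedelta(hours=23))
expired = is_cache_fresh(created, 24, now=created + timedelta(hours=25))
stale = is_cache_fresh(
    created,
    24,
    cached_lastmod=created,
    sitemap_lastmod=created + timedelta(hours=1),
    now=created + timedelta(hours=2),
)
```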
+
+---
+
+## Bug Fixes
+
+### raw: URL Parsing Truncates at # Character
+
+**Problem**: CSS color codes like `#eee` were being truncated.
+
+**Before**: `raw:body{background:#eee}` → `body{background:`
+**After**: `raw:body{background:#eee}` → `body{background:#eee}`
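
The truncation can be reproduced with the standard library alone. In the sketch below, `strip_raw_prefix` is a hypothetical helper that mirrors the fix described in the commit message (strip the `raw:` or `raw://` prefix directly); it is not the project's actual function.

```python
from urllib.parse import urlparse


def strip_raw_prefix(url: str) -> str:
    """Return the content after a raw:// or raw: prefix, leaving '#' untouched."""
    for prefix in ("raw://", "raw:"):  # check the longer prefix first
        if url.startswith(prefix):
            return url[len(prefix):]
    return url


url = "raw:body{background:#eee}"

# urlparse treats '#' as the start of a fragment, so the path is cut short.
truncated = urlparse(url).path

# Stripping the prefix keeps the whole payload.
intact = strip_raw_prefix(url)
```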
+
+### Caching System Improvements
+
+Various fixes to cache validation and persistence.
+
+---
+
+## Documentation Updates
+
+- Multi-sample schema generation documentation
+- URL seeder smart TTL cache parameters
+- Security documentation (SECURITY.md)
+
+---
+
+## Upgrade Guide
+
+### From v0.7.x to v0.8.0
+
+1. **Update the package**:
+   ```bash
+   pip install --upgrade crawl4ai
+   ```
+
+2. **Docker API users**:
+   - Hooks are now disabled by default
+   - If you need hooks: `export CRAWL4AI_HOOKS_ENABLED=true`
+   - `file://` URLs no longer work on the API (use the library directly)
+
+3. **Review security settings**:
+   ```yaml
+   # config.yml - recommended for production
+   security:
+     enabled: true
+     jwt_enabled: true
+   ```
+
+4. **Test your integration** before deploying to production
+
+### Breaking Change Checklist
+
+- [ ] Check if you use the `hooks` parameter in API calls
+- [ ] Check if you use `file://` URLs via the API
+- [ ] Update environment variables if needed
+- [ ] Review security configuration
+
+---
+
+## Full Changelog
+
+See [CHANGELOG.md](../CHANGELOG.md) for complete version history.
+
+---
+
+## Contributors
+
+Thanks to all contributors who made this release possible.
+
+Special thanks to **Neo by ProjectDiscovery** for responsible security disclosure.
+
+---
+
+*For questions or issues, please open a [GitHub Issue](https://github.com/unclecode/crawl4ai/issues).*
--- a/docs/blog/release-v0.8.0.md
+++ b/docs/blog/release-v0.8.0.md
@@ -0,0 +1,243 @@
+# Crawl4AI v0.8.0 Release Notes
+
+**Release Date**: January 2026
+**Previous Version**: v0.7.6
+**Status**: Release Candidate
+
+---
+
+## Highlights
+
+- **Critical Security Fixes** for Docker API deployment
+- **11 New Features** including crash recovery, prefetch mode, and proxy improvements
+- **Breaking Changes** - see migration guide below
+
+---
+
+## Breaking Changes
+
+### 1. Docker API: Hooks Disabled by Default
+
+**What changed**: Hooks are now disabled by default on the Docker API.
+
+**Why**: Security fix for Remote Code Execution (RCE) vulnerability.
+
+**Who is affected**: Users of the Docker API who use the `hooks` parameter in `/crawl` requests.
+
+**Migration**:
+```bash
+# To re-enable hooks (only if you trust all API users):
+export CRAWL4AI_HOOKS_ENABLED=true
+```
+
+### 2. Docker API: file:// URLs Blocked
+
+**What changed**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` now reject `file://` URLs.
+
+**Why**: Security fix for Local File Inclusion (LFI) vulnerability.
+
+**Who is affected**: Users who were reading local files via the Docker API.
+
+**Migration**: Use the Python library directly for local file processing:
+```python
+# Instead of an API call with a file:// URL, use the library:
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(url="file:///path/to/file.html")
+
+asyncio.run(main())
+```
+
+---
+
+## Security Fixes
+
+### Critical: Remote Code Execution via Hooks (CVE Pending)
+
+**Severity**: CRITICAL (CVSS 10.0)
+**Affected**: Docker API deployment (all versions before v0.8.0)
+**Vector**: `POST /crawl` with malicious `hooks` parameter
+
+**Details**: The `__import__` builtin was available in hook code, allowing attackers to import `os`, `subprocess`, etc. and execute arbitrary commands.
+
+**Fix**:
+1. Removed `__import__` from allowed builtins
+2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
+
+### High: Local File Inclusion via file:// URLs (CVE Pending)
+
+**Severity**: HIGH (CVSS 8.6)
+**Affected**: Docker API deployment (all versions before v0.8.0)
+**Vector**: `POST /execute_js` (and other endpoints) with `file:///etc/passwd`
+
+**Details**: API endpoints accepted `file://` URLs, allowing attackers to read arbitrary files from the server.
+
+**Fix**: URL scheme validation now only allows `http://`, `https://`, and `raw:` URLs.
+
+### Credits
+
+Discovered by **Neo by ProjectDiscovery** ([projectdiscovery.io](https://projectdiscovery.io)) - December 2025
+
+---
+
+## New Features
+
+### 1. init_scripts Support for BrowserConfig
+
+Pre-page-load JavaScript injection for stealth evasions.
+
+```python
+config = BrowserConfig(
+    init_scripts=[
+        "Object.defineProperty(navigator, 'webdriver', {get: () => false})"
+    ]
+)
+```
+
+### 2. CDP Connection Improvements
+
+- WebSocket URL support (`ws://`, `wss://`)
+- Proper cleanup with `cdp_cleanup_on_close=True`
+- Browser reuse across multiple connections
+
+### 3. Crash Recovery for Deep Crawl Strategies
+
+All deep crawl strategies (BFS, DFS, Best-First) now support crash recovery:
+
+```python
+from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+strategy = BFSDeepCrawlStrategy(
+    max_depth=3,
+    resume_state=saved_state,  # Resume from checkpoint
+    on_state_change=save_callback  # Persist state in real-time
+)
+```
+
+### 4. PDF and MHTML for raw:/file:// URLs
+
+Generate PDFs and MHTML from cached HTML content.
+
+### 5. Screenshots for raw:/file:// URLs
+
+Render cached HTML and capture screenshots.
+
+### 6. base_url Parameter for CrawlerRunConfig
+
+Proper URL resolution for raw: HTML processing:
+
+```python
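+from crawl4ai import CrawlerRunConfig
+# `html` holds the raw HTML string; `crawler` is an open AsyncWebCrawler
+config = CrawlerRunConfig(base_url='https://example.com')
+result = await crawler.arun(url=f'raw:{html}', config=config)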
+```
+
+### 7. Prefetch Mode for Two-Phase Deep Crawling
+
+Fast link extraction without full page processing:
+
+```python
+config = CrawlerRunConfig(prefetch=True)
+```
+
+### 8. Proxy Rotation and Configuration
+
+Enhanced proxy rotation with sticky sessions support.
+
+### 9. Proxy Support for HTTP Strategy
+
+Non-browser crawler now supports proxies.
+
+### 10. Browser Pipeline for raw:/file:// URLs
+
+New `process_in_browser` parameter for browser operations on local content:
+
+```python
+config = CrawlerRunConfig(
+    process_in_browser=True,  # Force browser processing
+    screenshot=True
+)
+result = await crawler.arun(url='raw:<html>...</html>', config=config)
+```
+
+### 11. Smart TTL Cache for Sitemap URL Seeder
+
+Intelligent cache invalidation for sitemaps:
+
+```python
+config = SeedingConfig(
+    cache_ttl_hours=24,
+    validate_sitemap_lastmod=True
+)
+```
+
+---
+
+## Bug Fixes
+
+### raw: URL Parsing Truncates at # Character
+
+**Problem**: CSS color codes like `#eee` were being truncated.
+
+**Before**: `raw:body{background:#eee}` → `body{background:`
+**After**: `raw:body{background:#eee}` → `body{background:#eee}`
+
+### Caching System Improvements
+
+Various fixes to cache validation and persistence.
+
+---
+
+## Documentation Updates
+
+- Multi-sample schema generation documentation
+- URL seeder smart TTL cache parameters
+- Security documentation (SECURITY.md)
+
+---
+
+## Upgrade Guide
+
+### From v0.7.x to v0.8.0
+
+1. **Update the package**:
+   ```bash
+   pip install --upgrade crawl4ai
+   ```
+
+2. **Docker API users**:
+   - Hooks are now disabled by default
+   - If you need hooks: `export CRAWL4AI_HOOKS_ENABLED=true`
+   - `file://` URLs no longer work on the API (use the library directly)
+
+3. **Review security settings**:
+   ```yaml
+   # config.yml - recommended for production
+   security:
+     enabled: true
+     jwt_enabled: true
+   ```
+
+4. **Test your integration** before deploying to production
+
+### Breaking Change Checklist
+
+- [ ] Check if you use the `hooks` parameter in API calls
+- [ ] Check if you use `file://` URLs via the API
+- [ ] Update environment variables if needed
+- [ ] Review security configuration
+
+---
+
+## Full Changelog
+
+See [CHANGELOG.md](../CHANGELOG.md) for complete version history.
+
+---
+
+## Contributors
+
+Thanks to all contributors who made this release possible.
+
+Special thanks to **Neo by ProjectDiscovery** for responsible security disclosure.
+
+---
+
+*For questions or issues, please open a [GitHub Issue](https://github.com/unclecode/crawl4ai/issues).*
--- a/docs/examples/deep_crawl_crash_recovery.py
+++ b/docs/examples/deep_crawl_crash_recovery.py
@@ -0,0 +1,297 @@
+#!/usr/bin/env python3
+"""
+Deep Crawl Crash Recovery Example
+
+This example demonstrates how to implement crash recovery for long-running
+deep crawls. The feature is useful for:
+
+- Cloud deployments with spot/preemptible instances
+- Long-running crawls that may be interrupted
+- Distributed crawling with state coordination
+
+Key concepts:
+- `on_state_change`: Callback fired after each URL is processed
+- `resume_state`: Pass saved state to continue from a checkpoint
+- `export_state()`: Get the last captured state manually
+
+Works with all strategies: BFSDeepCrawlStrategy, DFSDeepCrawlStrategy,
+BestFirstCrawlingStrategy
+"""
+
+import asyncio
+import json
+import os
+from pathlib import Path
+from typing import Dict, Any, List
+
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+
+# File to store crawl state (in production, use Redis/database)
+STATE_FILE = Path("crawl_state.json")
+
+
+async def save_state_to_file(state: Dict[str, Any]) -> None:
+    """
+    Callback to save state after each URL is processed.
+
+    In production, you might save to:
+    - Redis: await redis.set("crawl_state", json.dumps(state))
+    - Database: await db.execute("UPDATE crawls SET state = ?", json.dumps(state))
+    - S3: await s3.put_object(Bucket="crawls", Key="state.json", Body=json.dumps(state))
+    """
+    with open(STATE_FILE, "w") as f:
+        json.dump(state, f, indent=2)
+    print(f"  [State saved] Pages: {state['pages_crawled']}, Pending: {len(state['pending'])}")
+
+
+def load_state_from_file() -> Dict[str, Any] | None:
+    """Load previously saved state, if it exists."""
+    if STATE_FILE.exists():
+        with open(STATE_FILE, "r") as f:
+            return json.load(f)
+    return None
+
+
+async def example_basic_state_persistence():
+    """
+    Example 1: Basic state persistence with file storage.
+
+    The on_state_change callback is called after each URL is processed,
+    allowing you to save progress in real-time.
+    """
+    print("\n" + "=" * 60)
+    print("Example 1: Basic State Persistence")
+    print("=" * 60)
+
+    # Clean up any previous state
+    if STATE_FILE.exists():
+        STATE_FILE.unlink()
+
+    strategy = BFSDeepCrawlStrategy(
+        max_depth=2,
+        max_pages=5,
+        on_state_change=save_state_to_file,  # Save after each URL
+    )
+
+    config = CrawlerRunConfig(
+        deep_crawl_strategy=strategy,
+        verbose=False,
+    )
+
+    print("\nStarting crawl with state persistence...")
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        results = await crawler.arun("https://books.toscrape.com", config=config)
+
+    # Show final state
+    if STATE_FILE.exists():
+        with open(STATE_FILE, "r") as f:
+            final_state = json.load(f)
+
+        print(f"\nFinal state saved to {STATE_FILE}:")
+        print(f"  - Strategy: {final_state['strategy_type']}")
+        print(f"  - Pages crawled: {final_state['pages_crawled']}")
+        print(f"  - URLs visited: {len(final_state['visited'])}")
+        print(f"  - URLs pending: {len(final_state['pending'])}")
+
+    print(f"\nCrawled {len(results)} pages total")
+
+
+async def example_crash_and_resume():
+    """
+    Example 2: Simulate a crash and resume from checkpoint.
+
+    This demonstrates the full crash recovery workflow:
+    1. Start crawling with state persistence
+    2. "Crash" after N pages
+    3. Resume from saved state
+    4. Verify no duplicate work
+    """
+    print("\n" + "=" * 60)
+    print("Example 2: Crash and Resume")
+    print("=" * 60)
+
+    # Clean up any previous state
+    if STATE_FILE.exists():
+        STATE_FILE.unlink()
+
+    crash_after = 3
+    crawled_urls_phase1: List[str] = []
+
+    async def save_and_maybe_crash(state: Dict[str, Any]) -> None:
+        """Save state, then simulate crash after N pages."""
+        # Always save state first
+        await save_state_to_file(state)
+        crawled_urls_phase1.clear()
+        crawled_urls_phase1.extend(state["visited"])
+
+        # Simulate crash after reaching threshold
+        if state["pages_crawled"] >= crash_after:
+            raise Exception("Simulated crash! (This is intentional)")
+
+    # Phase 1: Start crawl that will "crash"
+    print(f"\n--- Phase 1: Crawl until 'crash' after {crash_after} pages ---")
+
+    strategy1 = BFSDeepCrawlStrategy(
+        max_depth=2,
+        max_pages=10,
+        on_state_change=save_and_maybe_crash,
+    )
+
+    config = CrawlerRunConfig(
+        deep_crawl_strategy=strategy1,
+        verbose=False,
+    )
+
+    try:
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            await crawler.arun("https://books.toscrape.com", config=config)
+    except Exception as e:
+        print(f"\n  Crash occurred: {e}")
+        print(f"  URLs crawled before crash: {len(crawled_urls_phase1)}")
+
+    # Phase 2: Resume from checkpoint
+    print("\n--- Phase 2: Resume from checkpoint ---")
+
+    saved_state = load_state_from_file()
+    if not saved_state:
+        print("  ERROR: No saved state found!")
+        return
+
+    print(f"  Loaded state: {saved_state['pages_crawled']} pages, {len(saved_state['pending'])} pending")
+
+    crawled_urls_phase2: List[str] = []
+
+    async def track_resumed_crawl(state: Dict[str, Any]) -> None:
+        """Track new URLs crawled in phase 2."""
+        await save_state_to_file(state)
+        new_urls = set(state["visited"]) - set(saved_state["visited"])
+        for url in new_urls:
+            if url not in crawled_urls_phase2:
+                crawled_urls_phase2.append(url)
+
+    strategy2 = BFSDeepCrawlStrategy(
+        max_depth=2,
+        max_pages=10,
+        resume_state=saved_state,  # Resume from checkpoint!
+        on_state_change=track_resumed_crawl,
+    )
+
+    config2 = CrawlerRunConfig(
+        deep_crawl_strategy=strategy2,
+        verbose=False,
+    )
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        results = await crawler.arun("https://books.toscrape.com", config=config2)
+
+    # Verify no duplicates
+    already_crawled = set(saved_state["visited"])
+    duplicates = set(crawled_urls_phase2) & already_crawled
+
+    print(f"\n--- Results ---")
+    print(f"  Phase 1 URLs: {len(crawled_urls_phase1)}")
+    print(f"  Phase 2 new URLs: {len(crawled_urls_phase2)}")
+    print(f"  Duplicate crawls: {len(duplicates)} (should be 0)")
+    print(f"  Total results: {len(results)}")
+
+    if len(duplicates) == 0:
+        print("\n  SUCCESS: No duplicate work after resume!")
+    else:
+        print(f"\n  WARNING: Found duplicates: {duplicates}")
+
+
+async def example_export_state():
+    """
+    Example 3: Manual state export using export_state().
+
+    If you don't need to persist state on every update, export_state()
+    returns the last captured state once the crawl completes.
+    """
+    print("\n" + "=" * 60)
+    print("Example 3: Manual State Export")
+    print("=" * 60)
+
+    strategy = BFSDeepCrawlStrategy(
+        max_depth=1,
+        max_pages=3,
+        # No callback, so no state is captured during this crawl
+    )
+
+    config = CrawlerRunConfig(
+        deep_crawl_strategy=strategy,
+        verbose=False,
+    )
+
+    print("\nCrawling without callback...")
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        results = await crawler.arun("https://books.toscrape.com", config=config)
+
+    # export_state() returns the last state captured via on_state_change.
+    # No callback was set above, so there is no state to export here;
+    # set on_state_change (as in example_state_structure) to capture it.
+    print(f"\nCrawled {len(results)} pages")
+    print("(For manual export, set on_state_change to capture state)")
+
+
+async def example_state_structure():
+    """
+    Example 4: Understanding the state structure.
+
+    Shows the complete state dictionary that gets saved.
+    """
+    print("\n" + "=" * 60)
+    print("Example 4: State Structure")
+    print("=" * 60)
+
+    captured_state = None
+
+    async def capture_state(state: Dict[str, Any]) -> None:
+        nonlocal captured_state
+        captured_state = state
+
+    strategy = BFSDeepCrawlStrategy(
+        max_depth=1,
+        max_pages=2,
+        on_state_change=capture_state,
+    )
+
+    config = CrawlerRunConfig(
+        deep_crawl_strategy=strategy,
+        verbose=False,
+    )
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        await crawler.arun("https://books.toscrape.com", config=config)
+
+    if captured_state:
+        print("\nState structure:")
+        print(json.dumps(captured_state, indent=2, default=str)[:1000] + "...")
+
+        print("\n\nKey fields:")
+        print(f"  strategy_type: '{captured_state['strategy_type']}'")
+        print(f"  visited: List of {len(captured_state['visited'])} URLs")
+        print(f"  pending: List of {len(captured_state['pending'])} queued items")
+        print(f"  depths: Dict mapping URL -> depth level")
+        print(f"  pages_crawled: {captured_state['pages_crawled']}")
+
+
+async def main():
+    """Run all examples."""
+    print("=" * 60)
+    print("Deep Crawl Crash Recovery Examples")
+    print("=" * 60)
+
+    await example_basic_state_persistence()
+    await example_crash_and_resume()
+    await example_state_structure()
+
+    # # Cleanup
+    # if STATE_FILE.exists():
+    #     STATE_FILE.unlink()
+    #     print(f"\n[Cleaned up {STATE_FILE}]")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/docs/examples/prefetch_two_phase_crawl.py
+++ b/docs/examples/prefetch_two_phase_crawl.py
@@ -0,0 +1,279 @@
+#!/usr/bin/env python3
+"""
+Prefetch Mode and Two-Phase Crawling Example
+
+Prefetch mode is a fast path that skips heavy processing and returns
+only HTML + links. This is ideal for:
+
+- Site mapping: Quickly discover all URLs
+- Selective crawling: Find URLs first, then process only what you need
+- Link validation: Check which pages exist without full processing
+- Crawl planning: Estimate size before committing resources
+
+Key concept:
+- `prefetch=True` in CrawlerRunConfig enables fast link-only extraction
+- Skips: markdown generation, content scraping, media extraction, LLM extraction
+- Returns: HTML and links dictionary
+
+Performance benefit: ~5-10x faster than full processing
+"""
+
+import asyncio
+import time
+from typing import List, Dict
+
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+
+async def example_basic_prefetch():
+    """
+    Example 1: Basic prefetch mode.
+
+    Shows how prefetch returns HTML and links without heavy processing.
+    """
+    print("\n" + "=" * 60)
+    print("Example 1: Basic Prefetch Mode")
+    print("=" * 60)
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        # Enable prefetch mode
+        config = CrawlerRunConfig(prefetch=True)
+
+        print("\nFetching with prefetch=True...")
+        result = await crawler.arun("https://books.toscrape.com", config=config)
+
+        print(f"\nResult summary:")
+        print(f"  Success: {result.success}")
+        print(f"  HTML length: {len(result.html) if result.html else 0} chars")
+        print(f"  Internal links: {len(result.links.get('internal', []))}")
+        print(f"  External links: {len(result.links.get('external', []))}")
+
+        # These should be None/empty in prefetch mode
+        print(f"\n  Skipped processing:")
+        print(f"    Markdown: {result.markdown}")
+        print(f"    Cleaned HTML: {result.cleaned_html}")
+        print(f"    Extracted content: {result.extracted_content}")
+
+        # Show some discovered links
+        internal_links = result.links.get("internal", [])
+        if internal_links:
+            print(f"\n  Sample internal links:")
+            for link in internal_links[:5]:
+                print(f"    - {link['href'][:60]}...")
+
+
+async def example_performance_comparison():
+    """
+    Example 2: Compare prefetch vs full processing performance.
+    """
+    print("\n" + "=" * 60)
+    print("Example 2: Performance Comparison")
+    print("=" * 60)
+
+    url = "https://books.toscrape.com"
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        # Warm up - first request is slower due to browser startup
+        await crawler.arun(url, config=CrawlerRunConfig())
+
+        # Prefetch mode timing
+        start = time.time()
+        prefetch_result = await crawler.arun(url, config=CrawlerRunConfig(prefetch=True))
+        prefetch_time = time.time() - start
+
+        # Full processing timing
+        start = time.time()
+        full_result = await crawler.arun(url, config=CrawlerRunConfig())
+        full_time = time.time() - start
+
+        print(f"\nTiming comparison:")
+        print(f"  Prefetch mode: {prefetch_time:.3f}s")
+        print(f"  Full processing: {full_time:.3f}s")
+        print(f"  Speedup: {full_time / prefetch_time:.1f}x faster")
+
+        print(f"\nOutput comparison:")
+        print(f"  Prefetch - Links found: {len(prefetch_result.links.get('internal', []))}")
+        print(f"  Full - Links found: {len(full_result.links.get('internal', []))}")
+        print(f"  Full - Markdown length: {len(full_result.markdown.raw_markdown) if full_result.markdown else 0}")
+
+
+async def example_two_phase_crawl():
+    """
+    Example 3: Two-phase crawling pattern.
+
+    Phase 1: Fast discovery with prefetch
+    Phase 2: Full processing on selected URLs
+    """
+    print("\n" + "=" * 60)
+    print("Example 3: Two-Phase Crawling")
+    print("=" * 60)
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        # ═══════════════════════════════════════════════════════════
+        # Phase 1: Fast URL discovery
+        # ═══════════════════════════════════════════════════════════
+        print("\n--- Phase 1: Fast Discovery ---")
+
+        prefetch_config = CrawlerRunConfig(prefetch=True)
+        start = time.time()
+        discovery = await crawler.arun("https://books.toscrape.com", config=prefetch_config)
+        discovery_time = time.time() - start
+
+        all_urls = [link["href"] for link in discovery.links.get("internal", [])]
+        print(f"  Discovered {len(all_urls)} URLs in {discovery_time:.2f}s")
+
+        # Filter to URLs we care about (e.g., book detail pages)
+        # On books.toscrape.com, book pages contain "catalogue/" but not "category/"
+        book_urls = [
+            url for url in all_urls
+            if "catalogue/" in url and "category/" not in url
+        ][:5]  # Limit to 5 for demo
+
+        print(f"  Filtered to {len(book_urls)} book pages")
+
+        # ═══════════════════════════════════════════════════════════
+        # Phase 2: Full processing on selected URLs
+        # ═══════════════════════════════════════════════════════════
+        print("\n--- Phase 2: Full Processing ---")
+
+        full_config = CrawlerRunConfig(
+            word_count_threshold=10,
+            remove_overlay_elements=True,
+        )
+
+        results = []
+        start = time.time()
+
+        for url in book_urls:
+            result = await crawler.arun(url, config=full_config)
+            if result.success:
+                results.append(result)
+                title = result.url.split("/")[-2].replace("-", " ").title()[:40]
+                md_len = len(result.markdown.raw_markdown) if result.markdown else 0
+                print(f"    Processed: {title}... ({md_len} chars)")
+
+        processing_time = time.time() - start
+        print(f"\n  Processed {len(results)} pages in {processing_time:.2f}s")
+
+        # ═══════════════════════════════════════════════════════════
+        # Summary
+        # ═══════════════════════════════════════════════════════════
+        print(f"\n--- Summary ---")
+        print(f"  Discovery phase: {discovery_time:.2f}s ({len(all_urls)} URLs)")
+        print(f"  Processing phase: {processing_time:.2f}s ({len(results)} pages)")
+        print(f"  Total time: {discovery_time + processing_time:.2f}s")
+        print(f"  URLs skipped: {len(all_urls) - len(book_urls)} (not matching filter)")
+
+
+async def example_prefetch_with_deep_crawl():
+    """
+    Example 4: Combine prefetch with deep crawl strategy.
+
+    Use prefetch mode during deep crawl for maximum speed.
+    """
+    print("\n" + "=" * 60)
+    print("Example 4: Prefetch with Deep Crawl")
+    print("=" * 60)
+
+    from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        # Deep crawl with prefetch - maximum discovery speed
+        config = CrawlerRunConfig(
+            prefetch=True,  # Fast mode
+            deep_crawl_strategy=BFSDeepCrawlStrategy(
+                max_depth=1,
+                max_pages=10,
+            )
+        )
+
+        print("\nDeep crawling with prefetch mode...")
+        start = time.time()
+
+        result_container = await crawler.arun("https://books.toscrape.com", config=config)
+
+        # Handle iterator result from deep crawl
+        if hasattr(result_container, '__iter__'):
+            results = list(result_container)
+        else:
+            results = [result_container]
+
+        elapsed = time.time() - start
+
+        # Collect all discovered links
+        all_internal_links = set()
+        all_external_links = set()
+
+        for result in results:
+            for link in result.links.get("internal", []):
+                all_internal_links.add(link["href"])
+            for link in result.links.get("external", []):
+                all_external_links.add(link["href"])
+
+        print(f"\nResults:")
+        print(f"  Pages crawled: {len(results)}")
+        print(f"  Total internal links discovered: {len(all_internal_links)}")
+        print(f"  Total external links discovered: {len(all_external_links)}")
+        print(f"  Time: {elapsed:.2f}s")
+
+
+async def example_prefetch_with_raw_html():
+    """
+    Example 5: Prefetch with raw HTML input.
+
+    You can also use prefetch mode with raw: URLs for cached content.
+    """
+    print("\n" + "=" * 60)
+    print("Example 5: Prefetch with Raw HTML")
+    print("=" * 60)
+
+    sample_html = """
+    <html>
+        <head><title>Sample Page</title></head>
+        <body>
+            <h1>Hello World</h1>
+            <nav>
+                <a href="/page1">Internal Page 1</a>
+                <a href="/page2">Internal Page 2</a>
+                <a href="https://example.com/external">External Link</a>
+            </nav>
+            <main>
+                <p>This is the main content with <a href="/page3">another link</a>.</p>
+            </main>
+        </body>
+    </html>
+    """
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        config = CrawlerRunConfig(
+            prefetch=True,
+            base_url="https://mysite.com",  # For resolving relative links
+        )
+
+        result = await crawler.arun(f"raw:{sample_html}", config=config)
+
+        print(f"\nExtracted from raw HTML:")
+        print(f"  Internal links: {len(result.links.get('internal', []))}")
+        for link in result.links.get("internal", []):
+            print(f"    - {link['href']} ({link['text']})")
+
+        print(f"\n  External links: {len(result.links.get('external', []))}")
+        for link in result.links.get("external", []):
+            print(f"    - {link['href']} ({link['text']})")
+
+
+async def main():
+    """Run all examples."""
+    print("=" * 60)
+    print("Prefetch Mode and Two-Phase Crawling Examples")
+    print("=" * 60)
+
+    await example_basic_prefetch()
+    await example_performance_comparison()
+    await example_two_phase_crawl()
+    await example_prefetch_with_deep_crawl()
+    await example_prefetch_with_raw_html()
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/docs/md_v2/blog/index.md
+++ b/docs/md_v2/blog/index.md
@@ -20,22 +20,32 @@ Ever wondered why your AI coding assistant struggles with your library despite c

 ## Latest Release

+### [Crawl4AI v0.8.0 – Crash Recovery & Prefetch Mode](../blog/release-v0.8.0.md)
+*January 2026*
+
+Crawl4AI v0.8.0 introduces crash recovery for deep crawls, a new prefetch mode for fast URL discovery, and critical security fixes for Docker deployments.
+
+Key highlights:
+- **🔄 Deep Crawl Crash Recovery**: `on_state_change` callback for real-time state persistence, `resume_state` to continue from checkpoints
+- **⚡ Prefetch Mode**: `prefetch=True` for 5-10x faster URL discovery, perfect for two-phase crawling patterns
+- **🔒 Security Fixes**: Hooks disabled by default, `file://` URLs blocked on Docker API, `__import__` removed from sandbox
+
+[Read full release notes →](../blog/release-v0.8.0.md)
+
+## Recent Releases
+
 ### [Crawl4AI v0.7.8 – Stability & Bug Fix Release](../blog/release-v0.7.8.md)
 *December 2025*

-Crawl4AI v0.7.8 is a focused stability release addressing 11 bugs reported by the community. While there are no new features, these fixes resolve important issues affecting Docker deployments, LLM extraction, URL handling, and dependency compatibility.
+Crawl4AI v0.7.8 is a focused stability release addressing 11 bugs reported by the community. It fixes issues affecting Docker deployments, LLM extraction, URL handling, and dependency compatibility.

 Key highlights:
 - **🐳 Docker API Fixes**: ContentRelevanceFilter deserialization, ProxyConfig serialization, cache folder permissions
- **🤖 LLM Improvements**: Configurable rate limiter backoff, HTML input format support, raw HTML URL handling
- **🔗 URL Handling**: Correct relative URL resolution after JavaScript redirects
+- **🤖 LLM Improvements**: Configurable rate limiter backoff, HTML input format support
 - **📦 Dependencies**: Replaced deprecated PyPDF2 with pypdf, Pydantic v2 ConfigDict compatibility
- **🧠 AdaptiveCrawler**: Fixed query expansion to actually use LLM instead of mock data

 [Read full release notes →](../blog/release-v0.7.8.md)

-## Recent Releases
-
 ### [Crawl4AI v0.7.7 – The Self-Hosting & Monitoring Update](../blog/release-v0.7.7.md)
 *November 14, 2025*

@@ -52,36 +62,22 @@ Key highlights:
 ### [Crawl4AI v0.7.6 – The Webhook Infrastructure Update](../blog/release-v0.7.6.md)
 *October 22, 2025*

-Crawl4AI v0.7.6 introduces comprehensive webhook support for the Docker job queue API, bringing real-time notifications to both crawling and LLM extraction workflows. No more polling!
+Crawl4AI v0.7.6 introduces comprehensive webhook support for the Docker job queue API, bringing real-time notifications to both crawling and LLM extraction workflows.

 Key highlights:
 - **🪝 Complete Webhook Support**: Real-time notifications for both `/crawl/job` and `/llm/job` endpoints
- **🔄 Reliable Delivery**: Exponential backoff retry mechanism (5 attempts: 1s → 2s → 4s → 8s → 16s)
+- **🔄 Reliable Delivery**: Exponential backoff retry mechanism
 - **🔐 Custom Authentication**: Add custom headers for webhook authentication
- **📊 Flexible Delivery**: Choose notification-only or include full data in payload
- **⚙️ Global Configuration**: Set default webhook URL in config.yml for all jobs

 [Read full release notes →](../blog/release-v0.7.6.md)

-### [Crawl4AI v0.7.5 – The Docker Hooks & Security Update](../blog/release-v0.7.5.md)
-*September 29, 2025*
-
-Crawl4AI v0.7.5 introduces the powerful Docker Hooks System for complete pipeline customization, enhanced LLM integration with custom providers, HTTPS preservation for modern web security, and resolves multiple community-reported issues.
-
-Key highlights:
- **🔧 Docker Hooks System**: Custom Python functions at 8 key pipeline points for unprecedented customization
- **🤖 Enhanced LLM Integration**: Custom providers with temperature control and base_url configuration
- **🔒 HTTPS Preservation**: Secure internal link handling for modern web applications
- **🐍 Python 3.10+ Support**: Modern language features and enhanced performance
-
-[Read full release notes →](../blog/release-v0.7.5.md)
-
 ---

 ## Older Releases

 | Version | Date | Highlights |
 |---------|------|------------|
+| [v0.7.5](../blog/release-v0.7.5.md) | September 2025 | Docker Hooks System, enhanced LLM integration, HTTPS preservation |
 | [v0.7.4](../blog/release-v0.7.4.md) | August 2025 | LLM-powered table extraction, performance improvements |
 | [v0.7.3](../blog/release-v0.7.3.md) | July 2025 | Undetected browser, multi-URL config, memory monitoring |
 | [v0.7.1](../blog/release-v0.7.1.md) | June 2025 | Bug fixes and stability improvements |
--- a/docs/md_v2/blog/releases/v0.8.0.md
+++ b/docs/md_v2/blog/releases/v0.8.0.md
@@ -0,0 +1,243 @@
+# Crawl4AI v0.8.0 Release Notes
+
+**Release Date**: January 2026
+**Previous Version**: v0.7.8
+**Status**: Release Candidate
+
+---
+
+## Highlights
+
+- **Critical Security Fixes** for Docker API deployment
+- **11 New Features** including crash recovery, prefetch mode, and proxy improvements
+- **Breaking Changes** - see migration guide below
+
+---
+
+## Breaking Changes
+
+### 1. Docker API: Hooks Disabled by Default
+
+**What changed**: Hooks are now disabled by default on the Docker API.
+
+**Why**: Security fix for Remote Code Execution (RCE) vulnerability.
+
+**Who is affected**: Users of the Docker API who use the `hooks` parameter in `/crawl` requests.
+
+**Migration**:
+```bash
+# To re-enable hooks (only if you trust all API users):
+export CRAWL4AI_HOOKS_ENABLED=true
+```
+
+### 2. Docker API: file:// URLs Blocked
+
+**What changed**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` now reject `file://` URLs.
+
+**Why**: Security fix for Local File Inclusion (LFI) vulnerability.
+
+**Who is affected**: Users who were reading local files via the Docker API.
+
+**Migration**: Use the Python library directly for local file processing:
+```python
+# Instead of API call with file:// URL, use library:
+from crawl4ai import AsyncWebCrawler
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(url="file:///path/to/file.html")
+```
+
+---
+
+## Security Fixes
+
+### Critical: Remote Code Execution via Hooks (CVE Pending)
+
+**Severity**: CRITICAL (CVSS 10.0)
+**Affected**: Docker API deployment (all versions before v0.8.0)
+**Vector**: `POST /crawl` with malicious `hooks` parameter
+
+**Details**: The `__import__` builtin was available in hook code, allowing attackers to import `os`, `subprocess`, etc. and execute arbitrary commands.
+
+**Fix**:
+1. Removed `__import__` from allowed builtins
+2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
+
+### High: Local File Inclusion via file:// URLs (CVE Pending)
+
+**Severity**: HIGH (CVSS 8.6)
+**Affected**: Docker API deployment (all versions before v0.8.0)
+**Vector**: `POST /execute_js` (and other endpoints) with `file:///etc/passwd`
+
+**Details**: API endpoints accepted `file://` URLs, allowing attackers to read arbitrary files from the server.
+
+**Fix**: URL scheme validation now only allows `http://`, `https://`, and `raw:` URLs.
+
+### Credits
+
+Discovered by **Neo by ProjectDiscovery** ([projectdiscovery.io](https://projectdiscovery.io)) - December 2025
+
+---
+
+## New Features
+
+### 1. init_scripts Support for BrowserConfig
+
+Pre-page-load JavaScript injection for stealth evasions.
+
+```python
+config = BrowserConfig(
+    init_scripts=[
+        "Object.defineProperty(navigator, 'webdriver', {get: () => false})"
+    ]
+)
+```
+
+### 2. CDP Connection Improvements
+
+- WebSocket URL support (`ws://`, `wss://`)
+- Proper cleanup with `cdp_cleanup_on_close=True`
+- Browser reuse across multiple connections
+
+### 3. Crash Recovery for Deep Crawl Strategies
+
+All deep crawl strategies (BFS, DFS, Best-First) now support crash recovery:
+
+```python
+from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+strategy = BFSDeepCrawlStrategy(
+    max_depth=3,
+    resume_state=saved_state,  # Resume from checkpoint
+    on_state_change=save_callback  # Persist state in real-time
+)
+```
+
+### 4. PDF and MHTML for raw:/file:// URLs
+
+Generate PDFs and MHTML from cached HTML content.
+
+### 5. Screenshots for raw:/file:// URLs
+
+Render cached HTML and capture screenshots.
+
+### 6. base_url Parameter for CrawlerRunConfig
+
+Proper URL resolution for raw: HTML processing:
+
+```python
+config = CrawlerRunConfig(base_url='https://example.com')
+result = await crawler.arun(url=f'raw:{html}', config=config)
+```
+
+### 7. Prefetch Mode for Two-Phase Deep Crawling
+
+Fast link extraction without full page processing:
+
+```python
+config = CrawlerRunConfig(prefetch=True)
+```
+
+### 8. Proxy Rotation and Configuration
+
+Enhanced proxy rotation with sticky-session support.
+
+### 9. Proxy Support for HTTP Strategy
+
+Non-browser crawler now supports proxies.
+
+### 10. Browser Pipeline for raw:/file:// URLs
+
+New `process_in_browser` parameter for browser operations on local content:
+
+```python
+config = CrawlerRunConfig(
+    process_in_browser=True,  # Force browser processing
+    screenshot=True
+)
+result = await crawler.arun(url='raw:<html>...</html>', config=config)
+```
+
+### 11. Smart TTL Cache for Sitemap URL Seeder
+
+Intelligent cache invalidation for sitemaps:
+
+```python
+config = SeedingConfig(
+    cache_ttl_hours=24,
+    validate_sitemap_lastmod=True
+)
+```
+
+---
+
+## Bug Fixes
+
+### raw: URL Parsing Truncates at # Character
+
+**Problem**: `raw:` content was cut off at the first `#`, so CSS color codes like `#eee` were lost.
+
+**Before**: `raw:body{background:#eee}` → `body{background:`
+**After**: `raw:body{background:#eee}` → `body{background:#eee}`
+
+### Caching System Improvements
+
+Various fixes to cache validation and persistence.
+
+---
+
+## Documentation Updates
+
+- Multi-sample schema generation documentation
+- URL seeder smart TTL cache parameters
+- Security documentation (SECURITY.md)
+
+---
+
+## Upgrade Guide
+
+### From v0.7.x to v0.8.0
+
+1. **Update the package**:
+   ```bash
+   pip install --upgrade crawl4ai
+   ```
+
+2. **Docker API users**:
+   - Hooks are now disabled by default
+   - If you need hooks: `export CRAWL4AI_HOOKS_ENABLED=true`
+   - `file://` URLs no longer work on API (use library directly)
+
+3. **Review security settings**:
+   ```yaml
+   # config.yml - recommended for production
+   security:
+     enabled: true
+     jwt_enabled: true
+   ```
+
+4. **Test your integration** before deploying to production
+
+### Breaking Change Checklist
+
+- [ ] Check if you use `hooks` parameter in API calls
+- [ ] Check if you use `file://` URLs via the API
+- [ ] Update environment variables if needed
+- [ ] Review security configuration
+
+---
+
+## Full Changelog
+
+See [CHANGELOG.md](../CHANGELOG.md) for complete version history.
+
+---
+
+## Contributors
+
+Thanks to all contributors who made this release possible.
+
+Special thanks to **Neo by ProjectDiscovery** for responsible security disclosure.
+
+---
+
+*For questions or issues, please open a [GitHub Issue](https://github.com/unclecode/crawl4ai/issues).*
--- a/docs/md_v2/core/deep-crawling.md
+++ b/docs/md_v2/core/deep-crawling.md
@@ -4,11 +4,13 @@ One of Crawl4AI's most powerful features is its ability to perform **configurabl

 In this tutorial, you'll learn:

-1. How to set up a **Basic Deep Crawler** with BFS strategy  
-2. Understanding the difference between **streamed and non-streamed** output  
-3. Implementing **filters and scorers** to target specific content  
-4. Creating **advanced filtering chains** for sophisticated crawls  
-5. Using **BestFirstCrawling** for intelligent exploration prioritization  
+1. How to set up a **Basic Deep Crawler** with BFS strategy
+2. Understanding the difference between **streamed and non-streamed** output
+3. Implementing **filters and scorers** to target specific content
+4. Creating **advanced filtering chains** for sophisticated crawls
+5. Using **BestFirstCrawling** for intelligent exploration prioritization
+6. **Crash recovery** for long-running production crawls
+7. **Prefetch mode** for fast URL discovery  

 > **Prerequisites**  
 > - You’ve completed or read [AsyncWebCrawler Basics](../core/simple-crawling.md) to understand how to run a simple crawl.  
@@ -485,7 +487,249 @@ This is especially useful for security-conscious crawling or when dealing with s

 ---

-## 10. Summary & Next Steps
+## 10. Crash Recovery for Long-Running Crawls
+
+For production deployments, especially in cloud environments where instances can be terminated unexpectedly, Crawl4AI provides built-in crash recovery support for all deep crawl strategies.
+
+### 10.1 Enabling State Persistence
+
+All deep crawl strategies (BFS, DFS, Best-First) support two optional parameters:
+
+- **`resume_state`**: Pass a previously saved state to resume from a checkpoint
+- **`on_state_change`**: Async callback fired after each URL is processed
+
+```python
+from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+import json
+
+# Callback to save state after each URL (assumes `redis` is an async Redis client)
+async def save_state_to_redis(state: dict):
+    await redis.set("crawl_state", json.dumps(state))
+
+strategy = BFSDeepCrawlStrategy(
+    max_depth=3,
+    on_state_change=save_state_to_redis,  # Called after each URL
+)
+```
+
+### 10.2 State Structure
+
+The state dictionary is JSON-serializable and contains:
+
+```python
+{
+    "strategy_type": "bfs",  # or "dfs", "best_first"
+    "visited": ["url1", "url2", ...],  # Already crawled URLs
+    "pending": [{"url": "...", "parent_url": "..."}],  # Queue/stack
+    "depths": {"url1": 0, "url2": 1},  # Depth tracking
+    "pages_crawled": 42  # Counter
+}
+```
+
+### 10.3 Resuming from a Checkpoint
+
+```python
+import json
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+# Load saved state (e.g., from Redis, database, or file)
+saved_state = json.loads(await redis.get("crawl_state"))
+
+# Resume crawling from where we left off
+strategy = BFSDeepCrawlStrategy(
+    max_depth=3,
+    resume_state=saved_state,  # Continue from checkpoint
+    on_state_change=save_state_to_redis,  # Keep saving progress
+)
+
+config = CrawlerRunConfig(deep_crawl_strategy=strategy)
+
+async with AsyncWebCrawler() as crawler:
+    # Will skip already-visited URLs and continue from pending queue
+    results = await crawler.arun(start_url, config=config)
+```
+
+### 10.4 Manual State Export
+
+You can export the last captured state using `export_state()`. Note that this requires `on_state_change` to be set (state is captured in the callback):
+
+```python
+import json
+
+captured_state = None
+
+async def capture_state(state: dict):
+    global captured_state
+    captured_state = state
+
+strategy = BFSDeepCrawlStrategy(
+    max_depth=2,
+    on_state_change=capture_state,  # Required for state capture
+)
+config = CrawlerRunConfig(deep_crawl_strategy=strategy)
+
+async with AsyncWebCrawler() as crawler:
+    results = await crawler.arun(start_url, config=config)
+
+# Get the last captured state
+state = strategy.export_state()
+if state:
+    # Save to your preferred storage
+    with open("crawl_checkpoint.json", "w") as f:
+        json.dump(state, f)
+```
+
+### 10.5 Complete Example: Redis-Based Recovery
+
+```python
+import asyncio
+import json
+import redis.asyncio as redis
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+REDIS_KEY = "crawl4ai:crawl_state"
+
+async def main():
+    redis_client = redis.Redis(host='localhost', port=6379, db=0)
+
+    # Check for existing state
+    saved_state = None
+    existing = await redis_client.get(REDIS_KEY)
+    if existing:
+        saved_state = json.loads(existing)
+        print(f"Resuming from checkpoint: {saved_state['pages_crawled']} pages already crawled")
+
+    # State persistence callback
+    async def persist_state(state: dict):
+        await redis_client.set(REDIS_KEY, json.dumps(state))
+
+    # Create strategy with recovery support
+    strategy = BFSDeepCrawlStrategy(
+        max_depth=3,
+        max_pages=100,
+        resume_state=saved_state,
+        on_state_change=persist_state,
+    )
+
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=True)
+
+    try:
+        async with AsyncWebCrawler() as crawler:
+            async for result in await crawler.arun("https://example.com", config=config):
+                print(f"Crawled: {result.url}")
+    except Exception as e:
+        print(f"Crawl interrupted: {e}")
+        print("State saved - restart to resume")
+    finally:
+        await redis_client.close()
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+### 10.6 Zero Overhead
+
+When `resume_state=None` and `on_state_change=None` (the defaults), there is no performance impact. State tracking only activates when you enable these features.
+
+---
+
+## 11. Prefetch Mode for Fast URL Discovery
+
+When you need to quickly discover URLs without full page processing, use **prefetch mode**. This is ideal for two-phase crawling where you first map the site, then selectively process specific pages.
+
+### 11.1 Enabling Prefetch Mode
+
+```python
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+config = CrawlerRunConfig(prefetch=True)
+
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun("https://example.com", config=config)
+
+    # Result contains only HTML and links - no markdown, no extraction
+    print(f"Found {len(result.links['internal'])} internal links")
+    print(f"Found {len(result.links['external'])} external links")
+```
+
+### 11.2 What Gets Skipped
+
+Prefetch mode uses a fast path that bypasses heavy processing:
+
+| Processing Step | Normal Mode | Prefetch Mode |
+|----------------|-------------|---------------|
+| Fetch HTML | ✅ | ✅ |
+| Extract links | ✅ | ✅ (fast `quick_extract_links()`) |
+| Generate markdown | ✅ | ❌ Skipped |
+| Content scraping | ✅ | ❌ Skipped |
+| Media extraction | ✅ | ❌ Skipped |
+| LLM extraction | ✅ | ❌ Skipped |
+
+### 11.3 Performance Benefit
+
+- **Normal mode**: Full pipeline (~2-5 seconds per page)
+- **Prefetch mode**: HTML + links only (~200-500ms per page)
+
+Based on these rough timings, prefetch mode is typically **5-10x faster** for URL discovery; actual speedups depend on the site and network.
+
+### 11.4 Two-Phase Crawling Pattern
+
+The most common use case is two-phase crawling:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+async def two_phase_crawl(start_url: str):
+    async with AsyncWebCrawler() as crawler:
+        # ═══════════════════════════════════════════════
+        # Phase 1: Fast discovery (prefetch mode)
+        # ═══════════════════════════════════════════════
+        prefetch_config = CrawlerRunConfig(prefetch=True)
+        discovery = await crawler.arun(start_url, config=prefetch_config)
+
+        all_urls = [link["href"] for link in discovery.links.get("internal", [])]
+        print(f"Discovered {len(all_urls)} URLs")
+
+        # Filter to URLs you care about
+        blog_urls = [url for url in all_urls if "/blog/" in url]
+        print(f"Found {len(blog_urls)} blog posts to process")
+
+        # ═══════════════════════════════════════════════
+        # Phase 2: Full processing on selected URLs only
+        # ═══════════════════════════════════════════════
+        full_config = CrawlerRunConfig(
+            # Your normal extraction settings
+            word_count_threshold=100,
+            remove_overlay_elements=True,
+        )
+
+        results = []
+        for url in blog_urls:
+            result = await crawler.arun(url, config=full_config)
+            if result.success:
+                results.append(result)
+                print(f"Processed: {url}")
+
+        return results
+
+if __name__ == "__main__":
+    results = asyncio.run(two_phase_crawl("https://example.com"))
+    print(f"Fully processed {len(results)} pages")
+```
+
+### 11.5 Use Cases
+
+- **Site mapping**: Quickly discover all URLs before deciding what to process
+- **Link validation**: Check which pages exist without heavy processing
+- **Selective deep crawl**: Prefetch to find URLs, filter by pattern, then full crawl
+- **Crawl planning**: Estimate crawl size before committing resources
+
+---
+
+## 12. Summary & Next Steps

 In this **Deep Crawling with Crawl4AI** tutorial, you learned to:

@@ -495,5 +739,7 @@ In this **Deep Crawling with Crawl4AI** tutorial, you learned to:
 - Use scorers to prioritize the most relevant pages
 - Limit crawls with `max_pages` and `score_threshold` parameters
 - Build a complete advanced crawler with combined techniques
+- **Implement crash recovery** with `resume_state` and `on_state_change` for production deployments
+- **Use prefetch mode** for fast URL discovery and two-phase crawling

 With these tools, you can efficiently extract structured data from websites at scale, focusing precisely on the content you need for your specific use case.
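The filtering step between the two phases in section 11.4 above (the `blog_urls` list comprehension) can be pulled into a small reusable helper. The sketch below is an illustration only and not part of Crawl4AI: `select_urls` and its `limit` parameter are hypothetical names, and the only assumption about the library is the `links` shape used in the tutorial (a dict whose `"internal"` entry is a list of dicts with an `"href"` key). It deduplicates hrefs and keeps those matching a glob pattern.

```python
from fnmatch import fnmatchcase
from typing import Dict, List


def select_urls(links: Dict[str, List[dict]], pattern: str, limit: int = 100) -> List[str]:
    """Return unique internal hrefs matching a glob pattern, in discovery order.

    Hypothetical helper: `links` is assumed to look like the `result.links`
    dict used in the two-phase example, e.g. {"internal": [{"href": "..."}]}.
    """
    seen = set()
    selected: List[str] = []
    for link in links.get("internal", []):
        href = link.get("href")
        if not href or href in seen:
            continue  # skip entries without an href and repeated URLs
        seen.add(href)
        if fnmatchcase(href, pattern):
            selected.append(href)
        if len(selected) >= limit:
            break  # cap the amount of phase-2 work
    return selected


# Stand-in for `discovery.links` from a prefetch run
sample_links = {
    "internal": [
        {"href": "https://example.com/blog/a"},
        {"href": "https://example.com/about"},
        {"href": "https://example.com/blog/a"},  # duplicate
        {"text": "entry without href"},
        {"href": "https://example.com/blog/b"},
    ]
}

blog_urls = select_urls(sample_links, "*/blog/*")
print(blog_urls)
```

Capping the selection with `limit` keeps the expensive second phase bounded even when discovery returns far more links than expected.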
--- a/docs/md_v2/core/self-hosting.md
+++ b/docs/md_v2/core/self-hosting.md
@@ -67,13 +67,13 @@ Pull and run images directly from Docker Hub without building locally.

 #### 1. Pull the Image

-Our latest release is `0.7.6`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
+Our latest release is `0.8.0`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.

-> 💡 **Note**: The `latest` tag points to the stable `0.7.6` version.
+> 💡 **Note**: The `latest` tag points to the stable `0.8.0` version.

 ```bash
 # Pull the latest version
-docker pull unclecode/crawl4ai:0.7.6
+docker pull unclecode/crawl4ai:0.8.0

 # Or pull using the latest tag
 docker pull unclecode/crawl4ai:latest
@@ -145,7 +145,7 @@ docker stop crawl4ai && docker rm crawl4ai
 #### Docker Hub Versioning Explained

 *   **Image Name:** `unclecode/crawl4ai`
-*   **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.6`)
+*   **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.8.0`)
    *   `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
    *   `SUFFIX`: Optional tag for release candidates (``) and revisions (`r1`)
 *   **`latest` Tag:** Points to the most recent stable version
--- a/docs/md_v2/core/url-seeding.md
+++ b/docs/md_v2/core/url-seeding.md
@@ -255,6 +255,8 @@ The `SeedingConfig` object is your control panel. Here's everything you can conf
 | `scoring_method` | str | None | Scoring method (currently "bm25") |
 | `score_threshold` | float | None | Minimum score to include URL |
 | `filter_nonsense_urls` | bool | True | Filter out utility URLs (robots.txt, etc.) |
+| `cache_ttl_hours` | int | 24 | Hours before sitemap cache expires (0 = no TTL) |
+| `validate_sitemap_lastmod` | bool | True | Check sitemap's lastmod and refetch if newer |

 #### Pattern Matching Examples

@@ -968,10 +970,49 @@ config = SeedingConfig(
 The seeder automatically caches results to speed up repeated operations:

 - **Common Crawl cache**: `~/.crawl4ai/seeder_cache/[index]_[domain]_[hash].jsonl`
- **Sitemap cache**: `~/.crawl4ai/seeder_cache/sitemap_[domain]_[hash].jsonl`
+- **Sitemap cache**: `~/.crawl4ai/seeder_cache/sitemap_[domain]_[hash].json`
 - **HEAD data cache**: `~/.cache/url_seeder/head/[hash].json`

-Cache expires after 7 days by default. Use `force=True` to refresh.
+#### Smart TTL Cache for Sitemaps
+
+Sitemap caches now include intelligent validation:
+
+```python
+# Default: 24-hour TTL with lastmod validation
+config = SeedingConfig(
+    source="sitemap",
+    cache_ttl_hours=24,              # Cache expires after 24 hours
+    validate_sitemap_lastmod=True    # Also check if sitemap was updated
+)
+
+# Aggressive caching (1 week, no lastmod check)
+config = SeedingConfig(
+    source="sitemap",
+    cache_ttl_hours=168,             # 7 days
+    validate_sitemap_lastmod=False   # Trust TTL only
+)
+
+# Always validate (no TTL, only lastmod)
+config = SeedingConfig(
+    source="sitemap",
+    cache_ttl_hours=0,               # Disable TTL
+    validate_sitemap_lastmod=True    # Refetch if sitemap has newer lastmod
+)
+
+# Always fresh (bypass cache completely)
+config = SeedingConfig(
+    source="sitemap",
+    force=True                       # Ignore all caching
+)
+```
+
+**Cache validation priority:**
+1. `force=True` → Always refetch
+2. Cache doesn't exist → Fetch fresh
+3. `validate_sitemap_lastmod=True` and sitemap has newer `<lastmod>` → Refetch
+4. `cache_ttl_hours > 0` and cache is older than TTL → Refetch
+5. Cache corrupted → Refetch (automatic recovery)
+6. Otherwise → Use cache

 ### Pattern Matching Strategies

@@ -1060,6 +1101,9 @@ config = SeedingConfig(
 | Rate limit errors | Reduce `hits_per_sec` and `concurrency` |
 | Memory issues with large sites | Use `max_urls` to limit results, reduce `concurrency` |
 | Connection not closed | Use context manager or call `await seeder.close()` |
+| Stale/outdated URLs | Set `cache_ttl_hours=0` or use `force=True` |
+| Cache not updating | Check `validate_sitemap_lastmod=True`, or use `force=True` |
+| Incomplete URL list | Delete cache file and refetch, or use `force=True` |

 ### Performance Benchmarks

@@ -1119,6 +1163,7 @@ config = SeedingConfig(
 3. **Context Manager Support**: Automatic cleanup with `async with` statement
 4. **URL-Based Scoring**: Smart filtering even without head extraction
 5. **Smart URL Filtering**: Automatically excludes utility/nonsense URLs
-6. **Dual Caching**: Separate caches for URL lists and metadata
+6. **Smart TTL Cache**: Sitemap caches with TTL expiry and lastmod validation
+7. **Automatic Cache Recovery**: Corrupted or incomplete caches are automatically refreshed

-Now go forth and seed intelligently! 🌱🚀
+Now go forth and seed intelligently!
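The "Cache validation priority" list added to url-seeding.md above reads naturally as a chain of early returns. The sketch below mirrors that documented order as a pure function; it is not the seeder's actual implementation, and `should_refetch` and all of its parameters except `cache_ttl_hours`, `validate_sitemap_lastmod`, and `force` are hypothetical names chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_refetch(
    *,
    force: bool,
    cache_exists: bool,
    cache_corrupted: bool,
    cache_created_at: Optional[datetime],
    sitemap_lastmod: Optional[datetime],
    cache_ttl_hours: int = 24,
    validate_sitemap_lastmod: bool = True,
    now: Optional[datetime] = None,
) -> bool:
    """Decide whether to refetch a sitemap, following the documented priority."""
    now = now or datetime.now(timezone.utc)
    if force:  # 1. force=True always refetches
        return True
    if not cache_exists or cache_created_at is None:  # 2. nothing cached yet
        return True
    if (
        validate_sitemap_lastmod
        and sitemap_lastmod is not None
        and sitemap_lastmod > cache_created_at
    ):  # 3. sitemap reports a newer <lastmod> than the cache
        return True
    if cache_ttl_hours > 0 and now - cache_created_at > timedelta(hours=cache_ttl_hours):
        return True  # 4. TTL enabled and expired
    if cache_corrupted:  # 5. unreadable cache is refreshed
        return True
    return False  # 6. otherwise use the cache


# Fixed timestamps so the example is deterministic
now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
fresh = now - timedelta(hours=1)
stale = now - timedelta(hours=25)

base = dict(force=False, cache_exists=True, cache_corrupted=False, sitemap_lastmod=None, now=now)
print(should_refetch(cache_created_at=fresh, **base))
print(should_refetch(cache_created_at=stale, **base))
```

Passing `now` explicitly keeps the decision testable without patching the clock.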
--- a/docs/md_v2/extraction/no-llm-strategies.md
+++ b/docs/md_v2/extraction/no-llm-strategies.md
@@ -712,10 +712,106 @@ strategy = JsonCssExtractionStrategy(css_schema)
 3. **Consider Both CSS and XPath**: Try both schema types and choose the one that works best for your specific case.
 4. **Cache Generated Schemas**: Since generation uses LLM, save successful schemas for reuse.
 5. **API Token Security**: Never hardcode API tokens. Use environment variables or secure configuration management.
-6. **Choose Provider Wisely**: 
+6. **Choose Provider Wisely**:
   - Use OpenAI for production-quality schemas
   - Use Ollama for development, testing, or when you need a self-hosted solution

+### Multi-Sample Schema Generation
+
+When scraping multiple pages with varying DOM structures (e.g., product pages where table rows appear in different positions), single-sample schema generation may produce **fragile selectors** like `tr:nth-child(6)` that break on other pages.
+
+**The Problem:**
+```
+Page A: Manufacturer is in row 6  → selector: tr:nth-child(6) td a
+Page B: Manufacturer is in row 5  → selector FAILS
+Page C: Manufacturer is in row 7  → selector FAILS
+```
+
+**The Solution:** Provide multiple HTML samples so the LLM identifies stable patterns that work across all pages.
+
+```python
+from crawl4ai import JsonCssExtractionStrategy, LLMConfig
+
+# Collect HTML samples from different pages
+html_sample_1 = """
+<table class="specs">
+  <tr><td>Brand</td><td>Apple</td></tr>
+  <tr><td>Manufacturer</td><td><a href="/m/apple">Apple Inc</a></td></tr>
+</table>
+"""
+
+html_sample_2 = """
+<table class="specs">
+  <tr><td>Manufacturer</td><td><a href="/m/samsung">Samsung</a></td></tr>
+  <tr><td>Brand</td><td>Galaxy</td></tr>
+</table>
+"""
+
+html_sample_3 = """
+<table class="specs">
+  <tr><td>Model</td><td>Pixel 8</td></tr>
+  <tr><td>Brand</td><td>Google</td></tr>
+  <tr><td>Manufacturer</td><td><a href="/m/google">Google LLC</a></td></tr>
+</table>
+"""
+
+# Combine samples with labels (fences are built in code; a literal fence would end this block)
+fence = "`" * 3
+combined_html = f"""## HTML Sample 1 (Product A):
+{fence}html
+{html_sample_1}
+{fence}
+
+## HTML Sample 2 (Product B):
+{fence}html
+{html_sample_2}
+{fence}
+
+## HTML Sample 3 (Product C):
+{fence}html
+{html_sample_3}
+{fence}
+"""
+
+# Provide instructions for stable selectors
+query = """
+IMPORTANT: I'm providing 3 HTML samples from different product pages.
+The manufacturer field appears in different row positions across pages.
+Generate selectors using stable attributes like href patterns (e.g., a[href*='/m/'])
+instead of fragile positional selectors like nth-child().
+Extract: manufacturer name and link.
+"""
+
+# Generate schema with multi-sample awareness
+schema = JsonCssExtractionStrategy.generate_schema(
+    html=combined_html,
+    query=query,
+    schema_type="CSS",
+    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-token")
+)
+
+# The generated schema will use stable selectors like:
+# a[href*="/m/"] instead of tr:nth-child(6) td a
+print(schema)
+```
+
+**Key Points for Multi-Sample Queries:**
+
+1. **Format samples clearly** - Use markdown headers and code blocks to separate samples
+2. **State the number of samples** - "I'm providing 3 HTML samples..."
+3. **Explain the variation** - "...the manufacturer field appears in different row positions"
+4. **Request stable selectors** - "Use href patterns, data attributes, or class names instead of nth-child"
+
+**Stable vs Fragile Selectors:**
+
+| Fragile (single sample) | Stable (multi-sample) |
+|------------------------|----------------------|
+| `tr:nth-child(6) td a` | `a[href*="/m/"]` |
+| `div:nth-child(3) .price` | `.price, [data-price]` |
+| `ul li:first-child` | `li[data-featured="true"]` |
+
+This approach lets you generate schemas once that work reliably across hundreds of similar pages with varying structures.
+
 ---

 ## 10. Conclusion
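The "Stable vs Fragile Selectors" table in the no-llm-strategies.md change above can be checked against the three sample tables without any LLM call. The sketch below uses only the standard library's `html.parser` as a stand-in for the CSS selector `a[href*="/m/"]`; `ManufacturerLinkParser` and `find_manufacturer` are hypothetical names, and the HTML is the three samples from the guide collapsed onto single lines.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class ManufacturerLinkParser(HTMLParser):
    """Collect (href, text) for <a> tags whose href contains '/m/'."""

    def __init__(self) -> None:
        super().__init__()
        self.matches: List[Tuple[str, str]] = []
        self._current_href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if "/m/" in href:  # attribute-substring match, like a[href*="/m/"]
                self._current_href = href
                self._text = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            self.matches.append((self._current_href, "".join(self._text).strip()))
            self._current_href = None


def find_manufacturer(html: str) -> Optional[Tuple[str, str]]:
    """Return the first matching (href, text) pair, or None if nothing matches."""
    parser = ManufacturerLinkParser()
    parser.feed(html)
    parser.close()
    return parser.matches[0] if parser.matches else None


# The manufacturer row sits at a different position in each sample
samples = [
    '<table class="specs"><tr><td>Brand</td><td>Apple</td></tr>'
    '<tr><td>Manufacturer</td><td><a href="/m/apple">Apple Inc</a></td></tr></table>',
    '<table class="specs"><tr><td>Manufacturer</td><td><a href="/m/samsung">Samsung</a></td></tr>'
    '<tr><td>Brand</td><td>Galaxy</td></tr></table>',
    '<table class="specs"><tr><td>Model</td><td>Pixel 8</td></tr><tr><td>Brand</td><td>Google</td></tr>'
    '<tr><td>Manufacturer</td><td><a href="/m/google">Google LLC</a></td></tr></table>',
]

found = [find_manufacturer(s) for s in samples]
print(found)
```

Because the match keys on the link's `href` rather than its row index, the same rule finds the manufacturer in all three layouts, which is the property the multi-sample query asks the LLM to preserve.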
--- a/docs/migration/v0.8.0-upgrade-guide.md
+++ b/docs/migration/v0.8.0-upgrade-guide.md
@@ -0,0 +1,301 @@
+# Migration Guide: Upgrading to Crawl4AI v0.8.0
+
+This guide helps you upgrade from v0.7.x to v0.8.0, with special attention to breaking changes and security updates.
+
+## Quick Summary
+
+| Change | Impact | Action Required |
+|--------|--------|-----------------|
+| Hooks disabled by default | Docker API users with hooks | Set `CRAWL4AI_HOOKS_ENABLED=true` |
+| file:// URLs blocked | Docker API users reading local files | Use Python library directly |
+| Security fixes | All Docker API users | Update immediately |
+
+---
+
+## Step 1: Update the Package
+
+### PyPI Installation
+
+```bash
+pip install --upgrade crawl4ai
+```
+
+### Docker Installation
+
+```bash
+docker pull unclecode/crawl4ai:latest
+# or
+docker pull unclecode/crawl4ai:0.8.0
+```
+
+### From Source
+
+```bash
+git pull origin main
+pip install -e .
+```
+
+---
+
+## Step 2: Check for Breaking Changes
+
+### Are You Affected?
+
+**You ARE affected if you:**
+- Use the Docker API deployment
+- Use the `hooks` parameter in `/crawl` requests
+- Use `file://` URLs via API endpoints
+
+**You are NOT affected if you:**
+- Only use Crawl4AI as a Python library
+- Don't use hooks in your API calls
+- Don't use `file://` URLs via the API
+
+---
+
+## Step 3: Migrate Hooks Usage
+
+### Before v0.8.0
+
+Hooks worked by default:
+
+```bash
+# This worked without any configuration
+curl -X POST http://localhost:11235/crawl \
+  -H "Content-Type: application/json" \
+  -d '{
+    "urls": ["https://example.com"],
+    "hooks": {
+      "code": {
+        "on_page_context_created": "async def hook(page, context, **kwargs):\n    await context.add_cookies([...])\n    return page"
+      }
+    }
+  }'
+```
+
+### After v0.8.0
+
+You must explicitly enable hooks:
+
+**Option A: Environment Variable (Recommended)**
+```bash
+# Pass the variable into the container (a host-side `export` alone is not inherited by it)
+docker run -d -p 11235:11235 -e CRAWL4AI_HOOKS_ENABLED=true unclecode/crawl4ai:0.8.0
+```
+
+```yaml
+# docker-compose.yml
+services:
+  crawl4ai:
+    image: unclecode/crawl4ai:0.8.0
+    environment:
+      - CRAWL4AI_HOOKS_ENABLED=true
+```
+
+**Option B: For Kubernetes**
+```yaml
+env:
+  - name: CRAWL4AI_HOOKS_ENABLED
+    value: "true"
+```
+
+### Security Warning
+
+Only enable hooks if:
+- You trust all users who can access the API
+- The API is not exposed to the public internet
+- You have other authentication/authorization in place
+
+---
+
+## Step 4: Migrate file:// URL Usage
+
+### Before v0.8.0
+
+```bash
+# This worked via API
+curl -X POST http://localhost:11235/execute_js \
+  -d '{"url": "file:///var/data/page.html", "scripts": ["document.title"]}'
+```
+
+### After v0.8.0
+
+**Option A: Use the Python Library Directly**
+
+```python
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+async def process_local_file():
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            url="file:///var/data/page.html",
+            config=CrawlerRunConfig(js_code=["document.title"])
+        )
+        return result
+```
+
+**Option B: Use raw: Protocol for HTML Content**
+
+If you have the HTML content, you can still use the API:
+
+```bash
+# Read file content and send as raw: (requires jq, which JSON-escapes quotes and newlines in the HTML)
+jq -Rs '{url: ("raw:" + .)}' /var/data/page.html |
+curl -X POST http://localhost:11235/html \
+  -H "Content-Type: application/json" \
+  --data-binary @-
+```
+
+**Option C: Create a Preprocessing Service**
+
+```python
+# preprocessing_service.py (restrict file_path to an allowed directory before deploying)
+from fastapi import FastAPI
+from crawl4ai import AsyncWebCrawler
+
+app = FastAPI()
+
+@app.post("/process-local")
+async def process_local(file_path: str):
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(url=f"file://{file_path}")
+        return result.model_dump()
+```
+
+---
+
+## Step 5: Review Security Configuration
+
+### Recommended Production Settings
+
+```yaml
+# config.yml
+security:
+  enabled: true
+  jwt_enabled: true
+  https_redirect: true  # If behind HTTPS proxy
+  trusted_hosts:
+    - "your-domain.com"
+    - "api.your-domain.com"
+```
+
+### Environment Variables
+
+```bash
+# Required for JWT authentication
+export SECRET_KEY="your-secure-random-key-minimum-32-characters"
+
+# Only if you need hooks
+export CRAWL4AI_HOOKS_ENABLED=true
+```
+
+### Generate a Secure Secret Key
+
+```python
+import secrets
+print(secrets.token_urlsafe(32))
+```
+
+---
+
+## Step 6: Test Your Integration
+
+### Quick Validation Script
+
+```python
+import asyncio
+import aiohttp
+
+async def test_upgrade():
+    base_url = "http://localhost:11235"
+
+    # Test 1: Basic crawl should work
+    async with aiohttp.ClientSession() as session:
+        async with session.post(
+            f"{base_url}/crawl",
+            json={"urls": ["https://example.com"]}
+        ) as resp:
+            assert resp.status == 200, "Basic crawl failed"
+            print("✓ Basic crawl works")
+
+    # Test 2: Hooks should be blocked (unless enabled)
+    async with aiohttp.ClientSession() as session:
+        async with session.post(
+            f"{base_url}/crawl",
+            json={
+                "urls": ["https://example.com"],
+                "hooks": {"code": {"on_page_context_created": "async def hook(page, context, **kwargs): return page"}}
+            }
+        ) as resp:
+            if resp.status == 403:
+                print("✓ Hooks correctly blocked (default)")
+            elif resp.status == 200:
+                print("! Hooks enabled - ensure this is intentional")
+
+    # Test 3: file:// should be blocked
+    async with aiohttp.ClientSession() as session:
+        async with session.post(
+            f"{base_url}/execute_js",
+            json={"url": "file:///etc/passwd", "scripts": ["1"]}
+        ) as resp:
+            assert resp.status == 400, "file:// should be blocked"
+            print("✓ file:// URLs correctly blocked")
+
+asyncio.run(test_upgrade())
+```
+
+---
+
+## Troubleshooting
+
+### "Hooks are disabled" Error
+
+**Symptom**: API returns 403 with "Hooks are disabled"
+
+**Solution**: Set `CRAWL4AI_HOOKS_ENABLED=true` if you need hooks
+
+### "URL must start with http://, https://" Error
+
+**Symptom**: API returns 400 when using `file://` URLs
+
+**Solution**: Use Python library directly or `raw:` protocol
+
+### Authentication Errors After Enabling JWT
+
+**Symptom**: API returns 401 Unauthorized
+
+**Solution**:
+1. Get a token: `POST /token` with your email
+2. Include token in requests: `Authorization: Bearer <token>`
+
+---
+
+## Rollback Plan
+
+If you need to roll back:
+
+```bash
+# PyPI
+pip install crawl4ai==0.7.6
+
+# Docker
+docker pull unclecode/crawl4ai:0.7.6
+```
+
+**Warning**: Rolling back re-exposes the security vulnerabilities. Only do this temporarily while fixing integration issues.
+
+---
+
+## Getting Help
+
+- **GitHub Issues**: [github.com/unclecode/crawl4ai/issues](https://github.com/unclecode/crawl4ai/issues)
+- **Security Issues**: See [SECURITY.md](../../SECURITY.md)
+- **Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)
+
+---
+
+## Changelog Reference
+
+For the complete list of changes, see:
+- [Release Notes v0.8.0](../RELEASE_NOTES_v0.8.0.md)
+- [CHANGELOG.md](../../CHANGELOG.md)
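Option B in Step 4 of the migration guide above sends local HTML through the API by embedding it in a JSON body behind the `raw:` prefix. Real pages contain quotes, backslashes, and newlines, so the body is best produced by a JSON encoder instead of string interpolation. A minimal sketch, assuming only the `{"url": "raw:..."}` payload shape shown in the guide; `build_raw_payload` is a hypothetical helper and the HTTP call itself is left to whichever client you already use.

```python
import json


def build_raw_payload(html: str) -> str:
    """Return a JSON request body whose "url" field carries HTML via the raw: prefix."""
    # json.dumps escapes quotes, backslashes, and newlines, so any HTML is safe to embed
    return json.dumps({"url": "raw:" + html})


# Sample content with characters that would break naive shell interpolation
html = '<html>\n  <body class="main">It\'s "quoted" \\ text</body>\n</html>'

payload = build_raw_payload(html)
decoded = json.loads(payload)  # round-trip to confirm nothing was lost
print(payload)
```

The resulting string can be passed as the request body with a `Content-Type: application/json` header, matching the curl example in the guide.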
--- a/docs/releases_review/demo_v0.8.0.py
+++ b/docs/releases_review/demo_v0.8.0.py
@@ -0,0 +1,633 @@
+#!/usr/bin/env python3
+"""
+Crawl4AI v0.8.0 Release Demo - Feature Verification Tests
+==========================================================
+
+This demo ACTUALLY RUNS and VERIFIES the new features in v0.8.0.
+Each test executes real code and validates the feature is working.
+
+New Features Verified:
+1. Crash Recovery - on_state_change callback for real-time state persistence
+2. Crash Recovery - resume_state for resuming from checkpoint
+3. Crash Recovery - State is JSON serializable
+4. Prefetch Mode - Returns HTML and links only
+5. Prefetch Mode - Skips heavy processing (markdown, extraction)
+6. Prefetch Mode - Two-phase crawl pattern
+7. Security - Hooks disabled by default (Docker API)
+
+Breaking Changes in v0.8.0:
+- Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED=false)
+- file:// URLs blocked on Docker API endpoints
+
+Usage:
+    python docs/releases_review/demo_v0.8.0.py
+"""
+
+import asyncio
+import json
+import sys
+import time
+from typing import Dict, Any, List, Optional
+from dataclasses import dataclass
+
+
+# Test results tracking
+@dataclass
+class TestResult:
+    name: str
+    feature: str
+    passed: bool
+    message: str
+    skipped: bool = False
+
+
+results: list[TestResult] = []
+
+
+def print_header(title: str):
+    print(f"\n{'=' * 70}")
+    print(f"{title}")
+    print(f"{'=' * 70}")
+
+
+def print_test(name: str, feature: str):
+    print(f"\n[TEST] {name} ({feature})")
+    print("-" * 50)
+
+
+def record_result(name: str, feature: str, passed: bool, message: str, skipped: bool = False):
+    results.append(TestResult(name, feature, passed, message, skipped))
+    if skipped:
+        print(f"  SKIPPED: {message}")
+    elif passed:
+        print(f"  PASSED: {message}")
+    else:
+        print(f"  FAILED: {message}")
+
+
+# =============================================================================
+# TEST 1: Crash Recovery - State Capture with on_state_change
+# =============================================================================
+async def test_crash_recovery_state_capture():
+    """
+    Verify on_state_change callback is called after each URL is processed.
+
+    NEW in v0.8.0: Deep crawl strategies support on_state_change callback
+    for real-time state persistence (useful for cloud deployments).
+    """
+    print_test("Crash Recovery - State Capture", "on_state_change")
+
+    try:
+        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+        from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+        captured_states: List[Dict[str, Any]] = []
+
+        async def capture_state(state: Dict[str, Any]):
+            """Callback that fires after each URL is processed."""
+            captured_states.append(state.copy())
+
+        strategy = BFSDeepCrawlStrategy(
+            max_depth=1,
+            max_pages=3,
+            on_state_change=capture_state,
+        )
+
+        config = CrawlerRunConfig(
+            deep_crawl_strategy=strategy,
+            verbose=False,
+        )
+
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            await crawler.arun("https://books.toscrape.com", config=config)
+
+        # Verify states were captured
+        if len(captured_states) == 0:
+            record_result("State Capture", "on_state_change", False,
+                         "No states captured - callback not called")
+            return
+
+        # Verify callback was called for each page
+        pages_crawled = captured_states[-1].get("pages_crawled", 0)
+        if pages_crawled != len(captured_states):
+            record_result("State Capture", "on_state_change", False,
+                         f"Callback count {len(captured_states)} != pages_crawled {pages_crawled}")
+            return
+
+        record_result("State Capture", "on_state_change", True,
+                     f"Callback fired {len(captured_states)} times (once per URL)")
+
+    except Exception as e:
+        record_result("State Capture", "on_state_change", False, f"Exception: {e}")
+
+
+# =============================================================================
+# TEST 2: Crash Recovery - Resume from Checkpoint
+# =============================================================================
+async def test_crash_recovery_resume():
+    """
+    Verify crawl can resume from a saved checkpoint without re-crawling visited URLs.
+
+    NEW in v0.8.0: BFSDeepCrawlStrategy accepts resume_state parameter
+    to continue from a previously saved checkpoint.
+    """
+    print_test("Crash Recovery - Resume from Checkpoint", "resume_state")
+
+    try:
+        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+        from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+        # Phase 1: Start crawl and capture state after 2 pages
+        crash_after = 2
+        captured_states: List[Dict] = []
+        phase1_urls: List[str] = []
+
+        async def capture_until_crash(state: Dict[str, Any]):
+            captured_states.append(state.copy())
+            phase1_urls.clear()
+            phase1_urls.extend(state["visited"])
+            if state["pages_crawled"] >= crash_after:
+                raise Exception("Simulated crash")
+
+        strategy1 = BFSDeepCrawlStrategy(
+            max_depth=1,
+            max_pages=5,
+            on_state_change=capture_until_crash,
+        )
+
+        config1 = CrawlerRunConfig(
+            deep_crawl_strategy=strategy1,
+            verbose=False,
+        )
+
+        # Run until "crash"
+        try:
+            async with AsyncWebCrawler(verbose=False) as crawler:
+                await crawler.arun("https://books.toscrape.com", config=config1)
+        except Exception:
+            pass  # Expected crash
+
+        if not captured_states:
+            record_result("Resume from Checkpoint", "resume_state", False,
+                         "No state captured before crash")
+            return
+
+        saved_state = captured_states[-1]
+        print(f"  Phase 1: Crawled {len(phase1_urls)} URLs before crash")
+
+        # Phase 2: Resume from checkpoint
+        phase2_urls: List[str] = []
+
+        async def track_phase2(state: Dict[str, Any]):
+            new_urls = set(state["visited"]) - set(saved_state["visited"])
+            for url in new_urls:
+                if url not in phase2_urls:
+                    phase2_urls.append(url)
+
+        strategy2 = BFSDeepCrawlStrategy(
+            max_depth=1,
+            max_pages=5,
+            resume_state=saved_state,  # Resume from checkpoint!
+            on_state_change=track_phase2,
+        )
+
+        config2 = CrawlerRunConfig(
+            deep_crawl_strategy=strategy2,
+            verbose=False,
+        )
+
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            await crawler.arun("https://books.toscrape.com", config=config2)
+
+        print(f"  Phase 2: Crawled {len(phase2_urls)} new URLs after resume")
+
+        # Verify no duplicates
+        duplicates = set(phase2_urls) & set(phase1_urls)
+        if duplicates:
+            record_result("Resume from Checkpoint", "resume_state", False,
+                         f"Re-crawled {len(duplicates)} URLs: {list(duplicates)[:2]}")
+            return
+
+        record_result("Resume from Checkpoint", "resume_state", True,
+                     "Resumed successfully, no duplicate crawls")
+
+    except Exception as e:
+        record_result("Resume from Checkpoint", "resume_state", False, f"Exception: {e}")
+
+
+# =============================================================================
+# TEST 3: Crash Recovery - State is JSON Serializable
+# =============================================================================
+async def test_crash_recovery_json_serializable():
+    """
+    Verify the state dictionary can be serialized to JSON (for Redis/DB storage).
+
+    NEW in v0.8.0: State dictionary is designed to be JSON-serializable
+    for easy storage in Redis, databases, or files.
+    """
+    print_test("Crash Recovery - JSON Serializable", "State Structure")
+
+    try:
+        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+        from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+        captured_state: Optional[Dict] = None
+
+        async def capture_state(state: Dict[str, Any]):
+            nonlocal captured_state
+            captured_state = state
+
+        strategy = BFSDeepCrawlStrategy(
+            max_depth=1,
+            max_pages=2,
+            on_state_change=capture_state,
+        )
+
+        config = CrawlerRunConfig(
+            deep_crawl_strategy=strategy,
+            verbose=False,
+        )
+
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            await crawler.arun("https://books.toscrape.com", config=config)
+
+        if not captured_state:
+            record_result("JSON Serializable", "State Structure", False,
+                         "No state captured")
+            return
+
+        # Test JSON serialization round-trip
+        try:
+            json_str = json.dumps(captured_state)
+            restored = json.loads(json_str)
+        except (TypeError, json.JSONDecodeError) as e:
+            record_result("JSON Serializable", "State Structure", False,
+                         f"JSON serialization failed: {e}")
+            return
+
+        # Verify state structure
+        required_fields = ["strategy_type", "visited", "pending", "depths", "pages_crawled"]
+        missing = [f for f in required_fields if f not in restored]
+        if missing:
+            record_result("JSON Serializable", "State Structure", False,
+                         f"Missing fields: {missing}")
+            return
+
+        # Verify types
+        if not isinstance(restored["visited"], list):
+            record_result("JSON Serializable", "State Structure", False,
+                         "visited is not a list")
+            return
+
+        if not isinstance(restored["pages_crawled"], int):
+            record_result("JSON Serializable", "State Structure", False,
+                         "pages_crawled is not an int")
+            return
+
+        record_result("JSON Serializable", "State Structure", True,
+                     f"State serializes to {len(json_str)} bytes, all fields present")
+
+    except Exception as e:
+        record_result("JSON Serializable", "State Structure", False, f"Exception: {e}")
+
+
+# =============================================================================
+# TEST 4: Prefetch Mode - Returns HTML and Links Only
+# =============================================================================
+async def test_prefetch_returns_html_links():
+    """
+    Verify prefetch mode returns HTML and links but skips markdown generation.
+
+    NEW in v0.8.0: CrawlerRunConfig accepts prefetch=True for fast
+    URL discovery without heavy processing.
+    """
+    print_test("Prefetch Mode - HTML and Links", "prefetch=True")
+
+    try:
+        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+        config = CrawlerRunConfig(prefetch=True)
+
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            result = await crawler.arun("https://books.toscrape.com", config=config)
+
+        # Verify HTML is present
+        if not result.html or len(result.html) < 100:
+            record_result("Prefetch HTML/Links", "prefetch=True", False,
+                         "HTML not returned or too short")
+            return
+
+        # Verify links are present
+        if not result.links:
+            record_result("Prefetch HTML/Links", "prefetch=True", False,
+                         "Links not returned")
+            return
+
+        internal_count = len(result.links.get("internal", []))
+        external_count = len(result.links.get("external", []))
+
+        if internal_count == 0:
+            record_result("Prefetch HTML/Links", "prefetch=True", False,
+                         "No internal links extracted")
+            return
+
+        record_result("Prefetch HTML/Links", "prefetch=True", True,
+                     f"HTML: {len(result.html)} chars, Links: {internal_count} internal, {external_count} external")
+
+    except Exception as e:
+        record_result("Prefetch HTML/Links", "prefetch=True", False, f"Exception: {e}")
+
+
+# =============================================================================
+# TEST 5: Prefetch Mode - Skips Heavy Processing
+# =============================================================================
+async def test_prefetch_skips_processing():
+    """
+    Verify prefetch mode skips markdown generation and content extraction.
+
+    NEW in v0.8.0: prefetch=True skips markdown generation, content scraping,
+    media extraction, and LLM extraction for maximum speed.
+    """
+    print_test("Prefetch Mode - Skips Processing", "prefetch=True")
+
+    try:
+        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+        config = CrawlerRunConfig(prefetch=True)
+
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            result = await crawler.arun("https://books.toscrape.com", config=config)
+
+        # Check that heavy processing was skipped
+        checks = []
+
+        # Markdown should be None or empty
+        if result.markdown is None:
+            checks.append("markdown=None")
+        elif hasattr(result.markdown, 'raw_markdown') and not result.markdown.raw_markdown:
+            checks.append("raw_markdown=empty")
+        else:
+            record_result("Prefetch Skips Processing", "prefetch=True", False,
+                         "Markdown was generated (should be skipped)")
+            return
+
+        # cleaned_html should be None
+        if result.cleaned_html is None:
+            checks.append("cleaned_html=None")
+        else:
+            record_result("Prefetch Skips Processing", "prefetch=True", False,
+                         "cleaned_html was generated (should be skipped)")
+            return
+
+        # extracted_content should be None
+        if result.extracted_content is None:
+            checks.append("extracted_content=None")
+
+        record_result("Prefetch Skips Processing", "prefetch=True", True,
+                     f"Heavy processing skipped: {', '.join(checks)}")
+
+    except Exception as e:
+        record_result("Prefetch Skips Processing", "prefetch=True", False, f"Exception: {e}")
+
+
+# =============================================================================
+# TEST 6: Prefetch Mode - Two-Phase Crawl Pattern
+# =============================================================================
+async def test_prefetch_two_phase():
+    """
+    Verify the two-phase crawl pattern: prefetch for discovery, then full processing.
+
+    NEW in v0.8.0: Prefetch mode enables efficient two-phase crawling where
+    you discover URLs quickly, then selectively process important ones.
+    """
+    print_test("Prefetch Mode - Two-Phase Crawl", "Two-Phase Pattern")
+
+    try:
+        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            # Phase 1: Fast discovery with prefetch
+            prefetch_config = CrawlerRunConfig(prefetch=True)
+
+            start = time.time()
+            discovery = await crawler.arun("https://books.toscrape.com", config=prefetch_config)
+            prefetch_time = time.time() - start
+
+            all_urls = [link["href"] for link in discovery.links.get("internal", [])]
+
+            # Filter to specific pages (e.g., book detail pages)
+            book_urls = [
+                url for url in all_urls
+                if "catalogue/" in url and "category/" not in url
+            ][:2]  # Just 2 for demo
+
+            print(f"  Phase 1: Found {len(all_urls)} URLs in {prefetch_time:.2f}s")
+            print(f"  Filtered to {len(book_urls)} book pages for full processing")
+
+            if len(book_urls) == 0:
+                record_result("Two-Phase Crawl", "Two-Phase Pattern", False,
+                             "No book URLs found to process")
+                return
+
+            # Phase 2: Full processing on selected URLs
+            full_config = CrawlerRunConfig()  # Normal mode
+
+            start = time.time()
+            processed = []
+            for url in book_urls:
+                result = await crawler.arun(url, config=full_config)
+                if result.success and result.markdown:
+                    processed.append(result)
+
+            full_time = time.time() - start
+
+            print(f"  Phase 2: Processed {len(processed)} pages in {full_time:.2f}s")
+
+            if len(processed) == 0:
+                record_result("Two-Phase Crawl", "Two-Phase Pattern", False,
+                             "No pages successfully processed in phase 2")
+                return
+
+            # Verify full processing includes markdown
+            if not processed[0].markdown or not processed[0].markdown.raw_markdown:
+                record_result("Two-Phase Crawl", "Two-Phase Pattern", False,
+                             "Full processing did not generate markdown")
+                return
+
+            record_result("Two-Phase Crawl", "Two-Phase Pattern", True,
+                         f"Discovered {len(all_urls)} URLs (prefetch), processed {len(processed)} (full)")
+
+    except Exception as e:
+        record_result("Two-Phase Crawl", "Two-Phase Pattern", False, f"Exception: {e}")
+
+
+# =============================================================================
+# TEST 7: Security - Hooks Disabled by Default
+# =============================================================================
+async def test_security_hooks_disabled():
+    """
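+    Verify CRAWL4AI_HOOKS_ENABLED is not set to 'true' in this process's environment.
+
+    This checks only the local environment variable; it does not query a running
+    Docker API server.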
+
+    NEW in v0.8.0: Docker API hooks are disabled by default to prevent
+    Remote Code Execution. Set CRAWL4AI_HOOKS_ENABLED=true to enable.
+    """
+    print_test("Security - Hooks Disabled", "CRAWL4AI_HOOKS_ENABLED")
+
+    try:
+        import os
+
+        # Check the default environment variable
+        hooks_enabled = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower()
+
+        if hooks_enabled == "true":
+            record_result("Hooks Disabled Default", "Security", True,
+                         "CRAWL4AI_HOOKS_ENABLED is explicitly set to 'true' (user override)",
+                         skipped=True)
+            return
+
+        # Verify default is "false"
+        if hooks_enabled == "false":
+            record_result("Hooks Disabled Default", "Security", True,
+                         "Hooks disabled by default (CRAWL4AI_HOOKS_ENABLED=false)")
+        else:
+            record_result("Hooks Disabled Default", "Security", True,
+                         f"CRAWL4AI_HOOKS_ENABLED='{hooks_enabled}' (not 'true', hooks disabled)")
+
+    except Exception as e:
+        record_result("Hooks Disabled Default", "Security", False, f"Exception: {e}")
+
+
+# =============================================================================
+# TEST 8: Comprehensive Crawl Test
+# =============================================================================
+async def test_comprehensive_crawl():
+    """
+    Run a comprehensive crawl to verify overall stability with new features.
+    """
+    print_test("Comprehensive Crawl Test", "Overall")
+
+    try:
+        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
+
+        async with AsyncWebCrawler(config=BrowserConfig(headless=True), verbose=False) as crawler:
+            result = await crawler.arun(
+                url="https://httpbin.org/html",
+                config=CrawlerRunConfig()
+            )
+
+        checks = []
+
+        if result.success:
+            checks.append("success=True")
+        else:
+            record_result("Comprehensive Crawl", "Overall", False,
+                         f"Crawl failed: {result.error_message}")
+            return
+
+        if result.html and len(result.html) > 100:
+            checks.append(f"html={len(result.html)} chars")
+
+        if result.markdown and result.markdown.raw_markdown:
+            checks.append(f"markdown={len(result.markdown.raw_markdown)} chars")
+
+        if result.links:
+            total_links = len(result.links.get("internal", [])) + len(result.links.get("external", []))
+            checks.append(f"links={total_links}")
+
+        record_result("Comprehensive Crawl", "Overall", True,
+                     f"All checks passed: {', '.join(checks)}")
+
+    except Exception as e:
+        record_result("Comprehensive Crawl", "Overall", False, f"Exception: {e}")
+
+
+# =============================================================================
+# MAIN
+# =============================================================================
+
+def print_summary():
+    """Print test results summary"""
+    print_header("TEST RESULTS SUMMARY")
+
+    passed = sum(1 for r in results if r.passed and not r.skipped)
+    failed = sum(1 for r in results if not r.passed and not r.skipped)
+    skipped = sum(1 for r in results if r.skipped)
+
+    print(f"\nTotal: {len(results)} tests")
+    print(f"  Passed:  {passed}")
+    print(f"  Failed:  {failed}")
+    print(f"  Skipped: {skipped}")
+
+    if failed > 0:
+        print("\nFailed Tests:")
+        for r in results:
+            if not r.passed and not r.skipped:
+                print(f"  - {r.name} ({r.feature}): {r.message}")
+
+    if skipped > 0:
+        print("\nSkipped Tests:")
+        for r in results:
+            if r.skipped:
+                print(f"  - {r.name} ({r.feature}): {r.message}")
+
+    print("\n" + "=" * 70)
+    if failed == 0:
+        print("All tests passed! v0.8.0 features verified.")
+    else:
+        print(f"WARNING: {failed} test(s) failed!")
+    print("=" * 70)
+
+    return failed == 0
+
+
+async def main():
+    """Run all verification tests"""
+    print_header("Crawl4AI v0.8.0 - Feature Verification Tests")
+    print("Running actual tests to verify new features...")
+    print("\nKey Features in v0.8.0:")
+    print("  - Crash Recovery for Deep Crawl (resume_state, on_state_change)")
+    print("  - Prefetch Mode for Fast URL Discovery (prefetch=True)")
+    print("  - Security: Hooks disabled by default on Docker API")
+
+    # Run all tests
+    tests = [
+        test_crash_recovery_state_capture,      # on_state_change
+        test_crash_recovery_resume,             # resume_state
+        test_crash_recovery_json_serializable,  # State structure
+        test_prefetch_returns_html_links,       # prefetch=True basics
+        test_prefetch_skips_processing,         # prefetch skips heavy work
+        test_prefetch_two_phase,                # Two-phase pattern
+        test_security_hooks_disabled,           # Security check
+        test_comprehensive_crawl,               # Overall stability
+    ]
+
+    for test_func in tests:
+        try:
+            await test_func()
+        except Exception as e:
+            print(f"\nTest {test_func.__name__} crashed: {e}")
+            results.append(TestResult(
+                test_func.__name__,
+                "Unknown",
+                False,
+                f"Crashed: {e}"
+            ))
+
+    # Print summary
+    all_passed = print_summary()
+
+    return 0 if all_passed else 1
+
+
+if __name__ == "__main__":
+    try:
+        exit_code = asyncio.run(main())
+        sys.exit(exit_code)
+    except KeyboardInterrupt:
+        print("\n\nTests interrupted by user.")
+        sys.exit(1)
+    except Exception as e:
+        print(f"\n\nTest suite failed: {e}")
+        import traceback
+        traceback.print_exc()
+        sys.exit(1)
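
Review note on Test 7: the test reads only `CRAWL4AI_HOOKS_ENABLED` from the local process, so it never exercises the hook runner. A minimal sketch of the two guards the advisory in this PR describes (hooks off unless the flag is `true`, and no `__import__` among the builtins passed to `exec()`) might look like the following. The names `hooks_enabled`, `SAFE_BUILTINS`, and `compile_hook` are hypothetical and are not taken from `hook_manager.py`; the project's real allow-list is not shown in this diff.

```python
import os

# Hypothetical allow-list for illustration; the real one is not shown in this diff.
# The point is that "__import__" is absent.
SAFE_BUILTINS = {"len": len, "str": str, "int": int, "print": print}


def hooks_enabled(env=None):
    """Return True only when CRAWL4AI_HOOKS_ENABLED equals 'true' (case-insensitive)."""
    env = os.environ if env is None else env
    return env.get("CRAWL4AI_HOOKS_ENABLED", "false").lower() == "true"


def compile_hook(source, env=None):
    """Exec hook source with restricted builtins and return the function named 'hook'."""
    if not hooks_enabled(env):
        raise PermissionError("hooks are disabled (set CRAWL4AI_HOOKS_ENABLED=true)")
    # Supplying our own "__builtins__" stops exec() from injecting the full module.
    namespace = {"__builtins__": SAFE_BUILTINS}
    exec(source, namespace)
    return namespace["hook"]
```

With the flag unset, `compile_hook` refuses to run anything. With it set, a hook that calls `__import__('os')` raises `NameError` when invoked. Note that a trimmed builtins dictionary narrows the obvious route but does not turn `exec()` into a sandbox, which is consistent with hooks also being off by default.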
--- a/docs/security/GHSA-DRAFT-RCE-LFI.md
+++ b/docs/security/GHSA-DRAFT-RCE-LFI.md
@@ -0,0 +1,171 @@
+# GitHub Security Advisory Draft
+
+> **Instructions**: Copy this content to create security advisories at:
+> https://github.com/unclecode/crawl4ai/security/advisories/new
+
+---
+
+## Advisory 1: Remote Code Execution via Hooks Parameter
+
+### Title
+Remote Code Execution in Docker API via Hooks Parameter
+
+### Severity
+Critical
+
+### CVSS Score
+10.0 (CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H)
+
+### CWE
+CWE-94 (Improper Control of Generation of Code ('Code Injection'))
+
+### Package
+crawl4ai (Docker API deployment)
+
+### Affected Versions
+< 0.8.0
+
+### Patched Versions
+0.8.0
+
+### Description
+
+A critical remote code execution vulnerability exists in the Crawl4AI Docker API deployment. The `/crawl` endpoint accepts a `hooks` parameter containing Python code that is executed using `exec()`. The `__import__` builtin was included in the allowed builtins, allowing attackers to import arbitrary modules and execute system commands.
+
+**Attack Vector:**
+```json
+POST /crawl
+{
+  "urls": ["https://example.com"],
+  "hooks": {
+    "code": {
+      "on_page_context_created": "async def hook(page, context, **kwargs):\n    __import__('os').system('malicious_command')\n    return page"
+    }
+  }
+}
+```
+
+### Impact
+
+An unauthenticated attacker can:
+- Execute arbitrary system commands
+- Read/write files on the server
+- Exfiltrate sensitive data (environment variables, API keys)
+- Pivot to internal network services
+- Completely compromise the server
+
+### Mitigation
+
+1. **Upgrade to v0.8.0** (recommended)
+2. If unable to upgrade immediately:
+   - Disable the Docker API
+   - Block the `/crawl` endpoint at the network level
+   - Add authentication to the API
+
+### Fix Details
+
+1. Removed `__import__` from `allowed_builtins` in `hook_manager.py`
+2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
+3. Users must explicitly opt in to enable hooks
+
+### Credits
+
+Discovered by Neo by ProjectDiscovery (https://projectdiscovery.io)
+
+### References
+
+- [Release Notes v0.8.0](https://github.com/unclecode/crawl4ai/blob/main/docs/RELEASE_NOTES_v0.8.0.md)
+- [Migration Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/migration/v0.8.0-upgrade-guide.md)
+
+---
+
+## Advisory 2: Local File Inclusion via file:// URLs
+
+### Title
+Local File Inclusion in Docker API via file:// URLs
+
+### Severity
+High
+
+### CVSS Score
+8.6 (CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:N/A:N)
+
+### CWE
+CWE-22 (Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal'))
+
+### Package
+crawl4ai (Docker API deployment)
+
+### Affected Versions
+< 0.8.0
+
+### Patched Versions
+0.8.0
+
+### Description
+
+A local file inclusion vulnerability exists in the Crawl4AI Docker API. The `/execute_js`, `/screenshot`, `/pdf`, and `/html` endpoints accept `file://` URLs, allowing attackers to read arbitrary files from the server filesystem.
+
+**Attack Vector:**
+```json
+POST /execute_js
+{
+  "url": "file:///etc/passwd",
+  "scripts": ["document.body.innerText"]
+}
+```
+
+### Impact
+
+An unauthenticated attacker can:
+- Read sensitive files (`/etc/passwd`, `/etc/shadow`, application configs)
+- Access environment variables via `/proc/self/environ`
+- Discover internal application structure
+- Potentially read credentials and API keys
+
+### Mitigation
+
+1. **Upgrade to v0.8.0** (recommended)
+2. If unable to upgrade immediately:
+   - Disable the Docker API
+   - Add authentication to the API
+   - Use network-level filtering
+
+### Fix Details
+
+Added URL scheme validation to block:
+- `file://` URLs
+- `javascript:` URLs
+- `data:` URLs
+- Other non-HTTP schemes
+
+Only `http://`, `https://`, and `raw:` URLs are now allowed.
+
+### Credits
+
+Discovered by Neo by ProjectDiscovery (https://projectdiscovery.io)
+
+### References
+
+- [Release Notes v0.8.0](https://github.com/unclecode/crawl4ai/blob/main/docs/RELEASE_NOTES_v0.8.0.md)
+- [Migration Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/migration/v0.8.0-upgrade-guide.md)
+
+---
+
+## Creating the Advisories on GitHub
+
+1. Go to: https://github.com/unclecode/crawl4ai/security/advisories/new
+
+2. Fill in the form for each advisory:
+   - **Ecosystem**: PyPI
+   - **Package name**: crawl4ai
+   - **Affected versions**: < 0.8.0
+   - **Patched versions**: 0.8.0
+   - **Severity**: Critical (for RCE), High (for LFI)
+
+3. After creating, GitHub will:
+   - Assign a GHSA ID
+   - Optionally request a CVE
+   - Notify users who have security alerts enabled
+
+4. Coordinate disclosure timing with the fix release
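
Review note on Advisory 2: the fix details describe an allow-list of `http://`, `https://`, and `raw:` URLs. A minimal sketch of such a check, assuming a simple prefix match, could look like this. The function name `is_allowed_url` and the constant `ALLOWED_PREFIXES` are hypothetical; how the project actually implements the validation is not shown in this diff.

```python
# Prefixes named in the advisory's fix details; everything else is rejected.
ALLOWED_PREFIXES = ("http://", "https://", "raw:")


def is_allowed_url(url):
    """Return True only for http://, https://, and raw: URLs."""
    if not isinstance(url, str):
        return False
    # Normalise case and surrounding whitespace before comparing prefixes.
    return url.strip().lower().startswith(ALLOWED_PREFIXES)
```

Checking against an allow-list rather than a block-list means a scheme nobody thought to enumerate is rejected by default, which matches the advisory's "other non-HTTP schemes" wording.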