Release v0.8.0: Crash Recovery, Prefetch Mode & Security Fixes (#1712)

* Fix: Use correct URL variable for raw HTML extraction (#1116) - Prevents full HTML content from being passed as URL to extraction strategies - Added unit tests to verify raw HTML and regular URL processing Fix: Wrong URL variable used for extraction of raw html * Fix #1181: Preserve whitespace in code blocks during HTML scraping The remove_empty_elements_fast() method was removing whitespace-only span elements inside <pre> and <code> tags, causing import statements like "import torch" to become "importtorch". Now skips elements inside code blocks where whitespace is significant. * Refactor Pydantic model configuration to use ConfigDict for arbitrary types * Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621 * Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638 * fix: ensure BrowserConfig.to_dict serializes proxy_config * feat: make LLM backoff configurable end-to-end - extend LLMConfig with backoff delay/attempt/factor fields and thread them through LLMExtractionStrategy, LLMContentFilter, table extraction, and Docker API handlers - expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff and document them in the md_v2 guides * reproduced AttributeError from #1642 * pass timeout parameter to docker client request * added missing deep crawling objects to init * generalized query in ContentRelevanceFilter to be a str or list * import modules from enhanceable deserialization * parameterized tests * Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268 * refactor: replace PyPDF2 with pypdf across the codebase. ref #1412 * Add browser_context_id and target_id parameters to BrowserConfig Enable Crawl4AI to connect to pre-created CDP browser contexts, which is essential for cloud browser services that pre-create isolated contexts. Changes: - Add browser_context_id and target_id parameters to BrowserConfig - Update from_kwargs() and to_dict() methods - Modify BrowserManager.start() to use existing context when provided - Add _get_page_by_target_id() helper method - Update get_page() to handle pre-existing targets - Add test for browser_context_id functionality This enables cloud services to: 1. Create isolated CDP contexts before Crawl4AI connects 2. Pass context/target IDs to BrowserConfig 3. Have Crawl4AI reuse existing contexts instead of creating new ones * Add cdp_cleanup_on_close flag to prevent memory leaks in cloud/server scenarios * Fix: add cdp_cleanup_on_close to from_kwargs * Fix: find context by target_id for concurrent CDP connections * Fix: use target_id to find correct page in get_page * Fix: use CDP to find context by browserContextId for concurrent sessions * Revert context matching attempts - Playwright cannot see CDP-created contexts * Add create_isolated_context flag for concurrent CDP crawls When True, forces creation of a new browser context instead of reusing the default context. Essential for concurrent crawls on the same browser to prevent navigation conflicts. * Add context caching to create_isolated_context branch Uses contexts_by_config cache (same as non-CDP mode) to reuse contexts for multiple URLs with same config. Still creates new page per crawl for navigation isolation. Benefits batch/deep crawls. * Add init_scripts support to BrowserConfig for pre-page-load JS injection This adds the ability to inject JavaScript that runs before any page loads, useful for stealth evasions (canvas/audio fingerprinting, userAgentData). - Add init_scripts parameter to BrowserConfig (list of JS strings) - Apply init_scripts in setup_context() via context.add_init_script() - Update from_kwargs() and to_dict() for serialization * Fix CDP connection handling: support WS URLs and proper cleanup Changes to browser_manager.py: 1. _verify_cdp_ready(): Support multiple URL formats - WebSocket URLs (ws://, wss://): Skip HTTP verification, Playwright handles directly - HTTP URLs with query params: Properly parse with urlparse to preserve query string - Fixes issue where naive f"{cdp_url}/json/version" broke WS URLs and query params 2. close(): Proper cleanup when cdp_cleanup_on_close=True - Close all sessions (pages) - Close all contexts - Call browser.close() to disconnect (doesn't terminate browser, just releases connection) - Wait 1 second for CDP connection to fully release - Stop Playwright instance to prevent memory leaks This enables: - Connecting to specific browsers via WS URL - Reusing the same browser with multiple sequential connections - No user wait needed between connections (internal 1s delay handles it) Added tests/browser/test_cdp_cleanup_reuse.py with comprehensive tests. * Update gitignore * Some debugging for caching * Add _generate_screenshot_from_html for raw: and file:// URLs Implements the missing method that was being called but never defined. Now raw: and file:// URLs can generate screenshots by: 1. Loading HTML into a browser page via page.set_content() 2. Taking screenshot using existing take_screenshot() method 3. Cleaning up the page afterward This enables cached HTML to be rendered with screenshots in crawl4ai-cloud. * Add PDF and MHTML support for raw: and file:// URLs - Replace _generate_screenshot_from_html with _generate_media_from_html - New method handles screenshot, PDF, and MHTML in one browser session - Update raw: and file:// URL handlers to use new method - Enables cached HTML to generate all media types * Add crash recovery for deep crawl strategies Add optional resume_state and on_state_change parameters to all deep crawl strategies (BFS, DFS, Best-First) for cloud deployment crash recovery. Features: - resume_state: Pass saved state to resume from checkpoint - on_state_change: Async callback fired after each URL for real-time state persistence to external storage (Redis, DB, etc.) - export_state(): Get last captured state manually - Zero overhead when features are disabled (None defaults) State includes visited URLs, pending queue/stack, depths, and pages_crawled count. All state is JSON-serializable. * Fix: HTTP strategy raw: URL parsing truncates at # character The AsyncHTTPCrawlerStrategy.crawl() method used urlparse() to extract content from raw: URLs. This caused HTML with CSS color codes like #eee to be truncated because # is treated as a URL fragment delimiter. Before: raw:body{background:#eee} -> parsed.path = 'body{background:' After: raw:body{background:#eee} -> raw_content = 'body{background:#eee' Fix: Strip the raw: or raw:// prefix directly instead of using urlparse, matching how the browser strategy handles it. * Add base_url parameter to CrawlerRunConfig for raw HTML processing When processing raw: HTML (e.g., from cache), the URL parameter is meaningless for markdown link resolution. This adds a base_url parameter that can be set explicitly to provide proper URL resolution context. Changes: - Add base_url parameter to CrawlerRunConfig.__init__ - Add base_url to CrawlerRunConfig.from_kwargs - Update aprocess_html to use base_url for markdown generation Usage: config = CrawlerRunConfig(base_url='https://example.com') result = await crawler.arun(url='raw:{html}', config=config) * Add prefetch mode for two-phase deep crawling - Add `prefetch` parameter to CrawlerRunConfig - Add `quick_extract_links()` function for fast link extraction - Add short-circuit in aprocess_html() for prefetch mode - Add 42 tests (unit, integration, regression) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Updates on proxy rotation and proxy configuration * Add proxy support to HTTP crawler strategy * Add browser pipeline support for raw:/file:// URLs - Add process_in_browser parameter to CrawlerRunConfig - Route raw:/file:// URLs through _crawl_web() when browser operations needed - Use page.set_content() instead of goto() for local content - Fix cookie handling for non-HTTP URLs in browser_manager - Auto-detect browser requirements: js_code, wait_for, screenshot, etc. - Maintain fast path for raw:/file:// without browser params Fixes #310 * Add smart TTL cache for sitemap URL seeder - Add cache_ttl_hours and validate_sitemap_lastmod params to SeedingConfig - New JSON cache format with metadata (version, created_at, lastmod, url_count) - Cache validation by TTL expiry and sitemap lastmod comparison - Auto-migration from old .jsonl to new .json format - Fixes bug where incomplete cache was used indefinitely * Update URL seeder docs with smart TTL cache parameters - Add cache_ttl_hours and validate_sitemap_lastmod to parameter table - Document smart TTL cache validation with examples - Add cache-related troubleshooting entries - Update key features summary * Add MEMORY.md to gitignore * Docs: Add multi-sample schema generation section Add documentation explaining how to pass multiple HTML samples to generate_schema() for stable selectors that work across pages with varying DOM structures. Includes: - Problem explanation (fragile nth-child selectors) - Solution with code example - Key points for multi-sample queries - Comparison table of fragile vs stable selectors * Fix critical RCE and LFI vulnerabilities in Docker API deployment Security fixes for vulnerabilities reported by ProjectDiscovery: 1. Remote Code Execution via Hooks (CVE pending) - Remove __import__ from allowed_builtins in hook_manager.py - Prevents arbitrary module imports (os, subprocess, etc.) - Hooks now disabled by default via CRAWL4AI_HOOKS_ENABLED env var 2. Local File Inclusion via file:// URLs (CVE pending) - Add URL scheme validation to /execute_js, /screenshot, /pdf, /html - Block file://, javascript:, data: and other dangerous schemes - Only allow http://, https://, and raw: (where appropriate) 3. Security hardening - Add CRAWL4AI_HOOKS_ENABLED=false as default (opt-in for hooks) - Add security warning comments in config.yml - Add validate_url_scheme() helper for consistent validation Testing: - Add unit tests (test_security_fixes.py) - 16 tests - Add integration tests (run_security_tests.py) for live server Affected endpoints: - POST /crawl (hooks disabled by default) - POST /crawl/stream (hooks disabled by default) - POST /execute_js (URL validation added) - POST /screenshot (URL validation added) - POST /pdf (URL validation added) - POST /html (URL validation added) Breaking changes: - Hooks require CRAWL4AI_HOOKS_ENABLED=true to function - file:// URLs no longer work on API endpoints (use library directly) * Enhance authentication flow by implementing JWT token retrieval and adding authorization headers to API requests * Add release notes for v0.7.9, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates * Add release notes for v0.8.0, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates Documentation for v0.8.0 release: - SECURITY.md: Security policy and vulnerability reporting guidelines - RELEASE_NOTES_v0.8.0.md: Comprehensive release notes - migration/v0.8.0-upgrade-guide.md: Step-by-step migration guide - security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts - CHANGELOG.md: Updated with v0.8.0 changes Breaking changes documented: - Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED) - file:// URLs blocked on Docker API endpoints Security fixes credited to Neo by ProjectDiscovery * Add examples for deep crawl crash recovery and prefetch mode in documentation * Release v0.8.0: The v0.8.0 Update - Updated version to 0.8.0 - Added comprehensive demo and release notes - Updated all documentation * Update security researcher acknowledgment with a hyperlink for Neo by ProjectDiscovery * Add async agenerate_schema method for schema generation - Extract prompt building to shared _build_schema_prompt() method - Add agenerate_schema() async version using aperform_completion_with_backoff - Refactor generate_schema() to use shared prompt builder - Fixes Gemini/Vertex AI compatibility in async contexts (FastAPI) * Fix: Enable litellm.drop_params for O-series/GPT-5 model compatibility O-series (o1, o3) and GPT-5 models only support temperature=1. Setting litellm.drop_params=True auto-drops unsupported parameters instead of throwing UnsupportedParamsError. Fixes temperature=0.01 error for these models in LLM extraction. --------- Co-authored-by: rbushria <rbushri@gmail.com> Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com> Co-authored-by: Soham Kukreti <kukretisoham@gmail.com> Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com> Co-authored-by: unclecode <unclecode@kidocode.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 14:19:15 +01:00
parent c85f56b085
commit f6f7f1b551
58 changed files with 11942 additions and 2411 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -267,6 +267,7 @@ continue_config.json
 .private/

 .claude/
+.context/

 CLAUDE_MONITOR.md
 CLAUDE.md
@@ -295,3 +296,4 @@ scripts/
 *.db
 *.rdb
 *.ldb
+MEMORY.md
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,46 @@ All notable changes to Crawl4AI will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.8.0] - 2026-01-12
+
+### Security
+- **🔒 CRITICAL: Remote Code Execution Fix**: Removed `__import__` from hook allowed builtins
+  - Prevents arbitrary module imports in user-provided hook code
+  - Hooks now disabled by default via `CRAWL4AI_HOOKS_ENABLED` environment variable
+  - Credit: Neo by ProjectDiscovery
+- **🔒 HIGH: Local File Inclusion Fix**: Added URL scheme validation to Docker API endpoints
+  - Blocks `file://`, `javascript:`, `data:` URLs on `/execute_js`, `/screenshot`, `/pdf`, `/html`
+  - Only allows `http://`, `https://`, and `raw:` URLs
+  - Credit: Neo by ProjectDiscovery
+
+### Breaking Changes
+- **Docker API: Hooks disabled by default**: Set `CRAWL4AI_HOOKS_ENABLED=true` to enable
+- **Docker API: file:// URLs blocked**: Use Python library directly for local file processing
+
+### Added
+- **🚀 init_scripts for BrowserConfig**: Pre-page-load JavaScript injection for stealth evasions
+- **🔄 CDP Connection Improvements**: WebSocket URL support, proper cleanup, browser reuse
+- **💾 Crash Recovery for Deep Crawl**: `resume_state` and `on_state_change` for BFS/DFS/Best-First strategies
+- **📄 PDF/MHTML for raw:/file:// URLs**: Generate PDFs and MHTML from cached HTML content
+- **📸 Screenshots for raw:/file:// URLs**: Render cached HTML and capture screenshots
+- **🔗 base_url Parameter**: Proper URL resolution for raw: HTML processing
+- **⚡ Prefetch Mode**: Two-phase deep crawling with fast link extraction
+- **🔀 Enhanced Proxy Support**: Improved proxy rotation and sticky sessions
+- **🌐 HTTP Strategy Proxy Support**: Non-browser crawler now supports proxies
+- **🖥️ Browser Pipeline for raw:/file://**: New `process_in_browser` parameter
+- **📋 Smart TTL Cache for Sitemap Seeder**: `cache_ttl_hours` and `validate_sitemap_lastmod` parameters
+- **📚 Security Documentation**: Added SECURITY.md with vulnerability reporting guidelines
+
+### Fixed
+- **raw: URL Parsing**: Fixed truncation at `#` character (CSS color codes like `#eee`)
+- **Caching System**: Various improvements to cache validation and persistence
+
+### Documentation
+- Multi-sample schema generation section
+- URL seeder smart TTL cache parameters
+- v0.8.0 migration guide
+- Security policy and disclosure process
+
 ## [Unreleased]

 ### Added
--- a/2
+++ b/2
@@ -1,7 +1,7 @@
 FROM python:3.12-slim-bookworm AS build

 # C4ai version
-ARG C4AI_VER=0.7.8
+ARG C4AI_VER=0.8.0
 ENV C4AI_VERSION=$C4AI_VER
 LABEL c4ai.version=$C4AI_VER

--- a/README.md
+++ b/README.md
@@ -37,13 +37,13 @@ Limited slots._

 Crawl4AI turns the web into clean, LLM ready Markdown for RAG, agents, and data pipelines. Fast, controllable, battle tested by a 50k+ star community.

-[✨ Check out latest update v0.7.8](#-recent-updates)
+[✨ Check out latest update v0.8.0](#-recent-updates)

-✨ **New in v0.7.8**: Stability & Bug Fix Release! 11 bug fixes addressing Docker API issues (ContentRelevanceFilter, ProxyConfig, cache permissions), LLM extraction improvements (configurable backoff, HTML input format), URL handling fixes, and dependency updates (pypdf, Pydantic v2). [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)
+✨ **New in v0.8.0**: Crash Recovery & Prefetch Mode! Deep crawl crash recovery with `resume_state` and `on_state_change` callbacks for long-running crawls. New `prefetch=True` mode for 5-10x faster URL discovery. Critical security fixes for Docker API (hooks disabled by default, file:// URLs blocked). [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.0.md)

-✨ Recent v0.7.7: Complete Self-Hosting Platform with Real-time Monitoring! Enterprise-grade monitoring dashboard, comprehensive REST API, WebSocket streaming, smart browser pool management, and production-ready observability. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)
+✨ Recent v0.7.8: Stability & Bug Fix Release! 11 bug fixes addressing Docker API issues, LLM extraction improvements, URL handling fixes, and dependency updates. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)

-✨ Previous v0.7.6: Complete Webhook Infrastructure for Docker Job Queue API! Real-time notifications for both `/crawl/job` and `/llm/job` endpoints with exponential backoff retry, custom headers, and flexible delivery modes. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.6.md)
+✨ Previous v0.7.7: Complete Self-Hosting Platform with Real-time Monitoring! Enterprise-grade monitoring dashboard, comprehensive REST API, WebSocket streaming, and smart browser pool management. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)

 <details>
  <summary>🤓 <strong>My Personal Story</strong></summary>
@@ -562,6 +562,45 @@ async def test_news_crawl():

 ## ✨ Recent Updates

+<details open>
+<summary><strong>Version 0.8.0 Release Highlights - Crash Recovery & Prefetch Mode</strong></summary>
+
+This release introduces crash recovery for deep crawls, a new prefetch mode for fast URL discovery, and critical security fixes for Docker deployments.
+
+- **🔄 Deep Crawl Crash Recovery**:
+  - `on_state_change` callback fires after each URL for real-time state persistence
+  - `resume_state` parameter to continue from a saved checkpoint
+  - JSON-serializable state for Redis/database storage
+  - Works with BFS, DFS, and Best-First strategies
+  ```python
+  from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+  strategy = BFSDeepCrawlStrategy(
+      max_depth=3,
+      resume_state=saved_state,  # Continue from checkpoint
+      on_state_change=save_to_redis,  # Called after each URL
+  )
+  ```
+
+- **⚡ Prefetch Mode for Fast URL Discovery**:
+  - `prefetch=True` skips markdown, extraction, and media processing
+  - 5-10x faster than full processing
+  - Perfect for two-phase crawling: discover first, process selectively
+  ```python
+  config = CrawlerRunConfig(prefetch=True)
+  result = await crawler.arun("https://example.com", config=config)
+  # Returns HTML and links only - no markdown generation
+  ```
+
+- **🔒 Security Fixes (Docker API)**:
+  - Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
+  - `file://` URLs blocked on API endpoints to prevent LFI
+  - `__import__` removed from hook execution sandbox
+
+[Full v0.8.0 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.0.md)
+
+</details>
+
 <details>
 <summary><strong>Version 0.7.8 Release Highlights - Stability & Bug Fix Release</strong></summary>

--- a/SECURITY.md
+++ b/SECURITY.md
@@ -0,0 +1,122 @@
+# Security Policy
+
+## Supported Versions
+
+| Version | Supported          |
+| ------- | ------------------ |
+| 0.8.x   | :white_check_mark: |
+| 0.7.x   | :x: (upgrade recommended) |
+| < 0.7   | :x:                |
+
+## Reporting a Vulnerability
+
+We take security vulnerabilities seriously. If you discover a security issue, please report it responsibly.
+
+### How to Report
+
+**DO NOT** open a public GitHub issue for security vulnerabilities.
+
+Instead, please report via one of these methods:
+
+1. **GitHub Security Advisories (Preferred)**
+   - Go to [Security Advisories](https://github.com/unclecode/crawl4ai/security/advisories)
+   - Click "New draft security advisory"
+   - Fill in the details
+
+2. **Email**
+   - Send details to: security@crawl4ai.com
+   - Use subject: `[SECURITY] Brief description`
+   - Include:
+     - Description of the vulnerability
+     - Steps to reproduce
+     - Potential impact
+     - Any suggested fixes
+
+### What to Expect
+
+- **Acknowledgment**: Within 48 hours
+- **Initial Assessment**: Within 7 days
+- **Resolution Timeline**: Depends on severity
+  - Critical: 24-72 hours
+  - High: 7 days
+  - Medium: 30 days
+  - Low: 90 days
+
+### Disclosure Policy
+
+- We follow responsible disclosure practices
+- We will coordinate with you on disclosure timing
+- Credit will be given to reporters (unless anonymity is requested)
+- We may request CVE assignment for significant vulnerabilities
+
+## Security Best Practices for Users
+
+### Docker API Deployment
+
+If you're running the Crawl4AI Docker API in production:
+
+1. **Enable Authentication**
+   ```yaml
+   # config.yml
+   security:
+     enabled: true
+     jwt_enabled: true
+   ```
+   ```bash
+   # Set a strong secret key
+   export SECRET_KEY="your-secure-random-key-here"
+   ```
+
+2. **Hooks are Disabled by Default** (v0.8.0+)
+   - Only enable if you trust all API users
+   - Set `CRAWL4AI_HOOKS_ENABLED=true` only when necessary
+
+3. **Network Security**
+   - Run behind a reverse proxy (nginx, traefik)
+   - Use HTTPS in production
+   - Restrict access to trusted IPs if possible
+
+4. **Container Security**
+   - Run as non-root user (default in our container)
+   - Use read-only filesystem where possible
+   - Limit container resources
+
+### Library Usage
+
+When using Crawl4AI as a Python library:
+
+1. **Validate URLs** before crawling untrusted input
+2. **Sanitize extracted content** before using in other systems
+3. **Be cautious with hooks** - they execute arbitrary code
+
+## Known Security Issues
+
+### Fixed in v0.8.0
+
+| ID | Severity | Description | Fix |
+|----|----------|-------------|-----|
+| CVE-pending-1 | CRITICAL | RCE via hooks `__import__` | Removed from allowed builtins |
+| CVE-pending-2 | HIGH | LFI via `file://` URLs | URL scheme validation added |
+
+See [Security Advisory](https://github.com/unclecode/crawl4ai/security/advisories) for details.
+
+## Security Features
+
+### v0.8.0+
+
+- **URL Scheme Validation**: Blocks `file://`, `javascript:`, `data:` URLs on API
+- **Hooks Disabled by Default**: Opt-in via `CRAWL4AI_HOOKS_ENABLED=true`
+- **Restricted Hook Builtins**: No `__import__`, `eval`, `exec`, `open`
+- **JWT Authentication**: Optional but recommended for production
+- **Rate Limiting**: Configurable request limits
+- **Security Headers**: X-Frame-Options, CSP, HSTS when enabled
+
+## Acknowledgments
+
+We thank the following security researchers for responsibly disclosing vulnerabilities:
+
+- **[Neo by ProjectDiscovery](https://projectdiscovery.io/blog/introducing-neo)** - RCE and LFI vulnerabilities (December 2025)
+
+---
+
+*Last updated: January 2026*
--- a/crawl4ai/version.py
+++ b/crawl4ai/version.py
@@ -1,7 +1,7 @@
 # crawl4ai/__version__.py

 # This is the version that will be used for stable releases
-__version__ = "0.7.8"
+__version__ = "0.8.0"

 # For nightly builds, this gets set during build process
 __nightly_version__ = None
--- a/crawl4ai/async_configs.py
+++ b/crawl4ai/async_configs.py
@@ -373,6 +373,20 @@ class BrowserConfig:
        use_managed_browser (bool): Launch the browser using a managed approach (e.g., via CDP), allowing
                                    advanced manipulation. Default: False.
        cdp_url (str): URL for the Chrome DevTools Protocol (CDP) endpoint. Default: "ws://localhost:9222/devtools/browser/".
+        browser_context_id (str or None): Pre-existing CDP browser context ID to use. When provided along with
+                                          cdp_url, the crawler will reuse this context instead of creating a new one.
+                                          Useful for cloud browser services that pre-create isolated contexts.
+                                          Default: None.
+        target_id (str or None): Pre-existing CDP target ID (page) to use. When provided along with
+                                 browser_context_id, the crawler will reuse this target instead of creating
+                                 a new page. Default: None.
+        cdp_cleanup_on_close (bool): When True and using cdp_url, the close() method will still clean up
+                                     the local Playwright client resources. Useful for cloud/server scenarios
+                                     where you don't own the remote browser but need to prevent memory leaks
+                                     from accumulated Playwright instances. Default: False.
+        create_isolated_context (bool): When True and using cdp_url, forces creation of a new browser context
+                                        instead of reusing the default context. Essential for concurrent crawls
+                                        on the same browser to prevent navigation conflicts. Default: False.
        debugging_port (int): Port for the browser debugging protocol. Default: 9222.
        use_persistent_context (bool): Use a persistent browser context (like a persistent profile).
                                       Automatically sets use_managed_browser=True. Default: False.
@@ -427,6 +441,10 @@ class BrowserConfig:
        browser_mode: str = "dedicated",
        use_managed_browser: bool = False,
        cdp_url: str = None,
+        browser_context_id: str = None,
+        target_id: str = None,
+        cdp_cleanup_on_close: bool = False,
+        create_isolated_context: bool = False,
        use_persistent_context: bool = False,
        user_data_dir: str = None,
        chrome_channel: str = "chromium",
@@ -459,6 +477,7 @@ class BrowserConfig:
        debugging_port: int = 9222,
        host: str = "localhost",
        enable_stealth: bool = False,
+        init_scripts: List[str] = None,
    ):
        
        self.browser_type = browser_type
@@ -466,6 +485,10 @@ class BrowserConfig:
        self.browser_mode = browser_mode
        self.use_managed_browser = use_managed_browser
        self.cdp_url = cdp_url
+        self.browser_context_id = browser_context_id
+        self.target_id = target_id
+        self.cdp_cleanup_on_close = cdp_cleanup_on_close
+        self.create_isolated_context = create_isolated_context
        self.use_persistent_context = use_persistent_context
        self.user_data_dir = user_data_dir
        self.chrome_channel = chrome_channel or self.browser_type or "chromium"
@@ -514,6 +537,7 @@ class BrowserConfig:
        self.debugging_port = debugging_port
        self.host = host
        self.enable_stealth = enable_stealth
+        self.init_scripts = init_scripts if init_scripts is not None else []

        fa_user_agenr_generator = ValidUAGenerator()
        if self.user_agent_mode == "random":
@@ -561,6 +585,10 @@ class BrowserConfig:
            browser_mode=kwargs.get("browser_mode", "dedicated"),
            use_managed_browser=kwargs.get("use_managed_browser", False),
            cdp_url=kwargs.get("cdp_url"),
+            browser_context_id=kwargs.get("browser_context_id"),
+            target_id=kwargs.get("target_id"),
+            cdp_cleanup_on_close=kwargs.get("cdp_cleanup_on_close", False),
+            create_isolated_context=kwargs.get("create_isolated_context", False),
            use_persistent_context=kwargs.get("use_persistent_context", False),
            user_data_dir=kwargs.get("user_data_dir"),
            chrome_channel=kwargs.get("chrome_channel", "chromium"),
@@ -589,6 +617,7 @@ class BrowserConfig:
            debugging_port=kwargs.get("debugging_port", 9222),
            host=kwargs.get("host", "localhost"),
            enable_stealth=kwargs.get("enable_stealth", False),
+            init_scripts=kwargs.get("init_scripts", []),
        )

    def to_dict(self):
@@ -598,6 +627,10 @@ class BrowserConfig:
            "browser_mode": self.browser_mode,
            "use_managed_browser": self.use_managed_browser,
            "cdp_url": self.cdp_url,
+            "browser_context_id": self.browser_context_id,
+            "target_id": self.target_id,
+            "cdp_cleanup_on_close": self.cdp_cleanup_on_close,
+            "create_isolated_context": self.create_isolated_context,
            "use_persistent_context": self.use_persistent_context,
            "user_data_dir": self.user_data_dir,
            "chrome_channel": self.chrome_channel,
@@ -624,6 +657,7 @@ class BrowserConfig:
            "debugging_port": self.debugging_port,
            "host": self.host,
            "enable_stealth": self.enable_stealth,
+            "init_scripts": self.init_scripts,
        }


@@ -999,6 +1033,18 @@ class CrawlerRunConfig():
        proxy_config (ProxyConfig or dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
                                     If None, no additional proxy config. Default: None.

+        # Sticky Proxy Session Parameters
+        proxy_session_id (str or None): When set, maintains the same proxy for all requests sharing this session ID.
+                                        The proxy is acquired on first request and reused for subsequent requests.
+                                        Session expires when explicitly released or crawler context is closed.
+                                        Default: None.
+        proxy_session_ttl (int or None): Time-to-live for sticky session in seconds.
+                                         After TTL expires, a new proxy is acquired on next request.
+                                         Default: None (session lasts until explicitly released or crawler closes).
+        proxy_session_auto_release (bool): If True, automatically release the proxy session after a batch operation.
+                                           Useful for arun_many() to clean up sessions automatically.
+                                           Default: False.
+
        # Browser Location and Identity Parameters
        locale (str or None): Locale to use for the browser context (e.g., "en-US").
                             Default: None.
@@ -1027,6 +1073,15 @@ class CrawlerRunConfig():
        shared_data (dict or None): Shared data to be passed between hooks.
                                     Default: None.

+        # Cache Validation Parameters (Smart Cache)
+        check_cache_freshness (bool): If True, validates cached content freshness using HTTP
+                                      conditional requests (ETag/Last-Modified) and head fingerprinting
+                                      before returning cached results. Avoids full browser crawls when
+                                      content hasn't changed. Only applies when cache_mode allows reads.
+                                      Default: False.
+        cache_validation_timeout (float): Timeout in seconds for cache validation HTTP requests.
+                                          Default: 10.0.
+
        # Page Navigation and Timing Parameters
        wait_until (str): The condition to wait for when navigating, e.g. "domcontentloaded".
                          Default: "domcontentloaded".
@@ -1133,6 +1188,12 @@ class CrawlerRunConfig():
        # Connection Parameters
        stream (bool): If True, enables streaming of crawled URLs as they are processed when used with arun_many.
                      Default: False.
+        process_in_browser (bool): If True, forces raw:/file:// URLs to be processed through the browser
+                                   pipeline (enabling js_code, wait_for, scrolling, etc.). When False (default),
+                                   raw:/file:// URLs use a fast path that returns HTML directly without browser
+                                   interaction. This is automatically enabled when browser-requiring parameters
+                                   are detected (js_code, wait_for, screenshot, pdf, etc.).
+                                   Default: False.

        check_robots_txt (bool): Whether to check robots.txt rules before crawling. Default: False
                                 Default: False.
@@ -1178,6 +1239,10 @@ class CrawlerRunConfig():
        scraping_strategy: ContentScrapingStrategy = None,
        proxy_config: Union[ProxyConfig, dict, None] = None,
        proxy_rotation_strategy: Optional[ProxyRotationStrategy] = None,
+        # Sticky Proxy Session Parameters
+        proxy_session_id: Optional[str] = None,
+        proxy_session_ttl: Optional[int] = None,
+        proxy_session_auto_release: bool = False,
        # Browser Location and Identity Parameters
        locale: Optional[str] = None,
        timezone_id: Optional[str] = None,
@@ -1192,6 +1257,9 @@ class CrawlerRunConfig():
        no_cache_read: bool = False,
        no_cache_write: bool = False,
        shared_data: dict = None,
+        # Cache Validation Parameters (Smart Cache)
+        check_cache_freshness: bool = False,
+        cache_validation_timeout: float = 10.0,
        # Page Navigation and Timing Parameters
        wait_until: str = "domcontentloaded",
        page_timeout: int = PAGE_TIMEOUT,
@@ -1245,7 +1313,10 @@ class CrawlerRunConfig():
        # Connection Parameters
        method: str = "GET",
        stream: bool = False,
+        prefetch: bool = False,  # When True, return only HTML + links (skip heavy processing)
+        process_in_browser: bool = False,  # Force browser processing for raw:/file:// URLs
        url: str = None,
+        base_url: str = None,  # Base URL for markdown link resolution (used with raw: HTML)
        check_robots_txt: bool = False,
        user_agent: str = None,
        user_agent_mode: str = None,
@@ -1264,6 +1335,7 @@ class CrawlerRunConfig():
    ):
        # TODO: Planning to set properties dynamically based on the __init__ signature
        self.url = url
+        self.base_url = base_url  # Base URL for markdown link resolution

        # Content Processing Parameters
        self.word_count_threshold = word_count_threshold
@@ -1289,6 +1361,11 @@ class CrawlerRunConfig():

        self.proxy_rotation_strategy = proxy_rotation_strategy

+        # Sticky Proxy Session Parameters
+        self.proxy_session_id = proxy_session_id
+        self.proxy_session_ttl = proxy_session_ttl
+        self.proxy_session_auto_release = proxy_session_auto_release
+
        # Browser Location and Identity Parameters
        self.locale = locale
        self.timezone_id = timezone_id
@@ -1305,6 +1382,9 @@ class CrawlerRunConfig():
        self.no_cache_read = no_cache_read
        self.no_cache_write = no_cache_write
        self.shared_data = shared_data
+        # Cache Validation (Smart Cache)
+        self.check_cache_freshness = check_cache_freshness
+        self.cache_validation_timeout = cache_validation_timeout

        # Page Navigation and Timing Parameters
        self.wait_until = wait_until
@@ -1371,6 +1451,8 @@ class CrawlerRunConfig():

        # Connection Parameters
        self.stream = stream
+        self.prefetch = prefetch  # Prefetch mode: return only HTML + links
+        self.process_in_browser = process_in_browser  # Force browser processing for raw:/file:// URLs
        self.method = method

        # Robots.txt Handling Parameters
@@ -1568,6 +1650,10 @@ class CrawlerRunConfig():
            scraping_strategy=kwargs.get("scraping_strategy"),
            proxy_config=kwargs.get("proxy_config"),
            proxy_rotation_strategy=kwargs.get("proxy_rotation_strategy"),
+            # Sticky Proxy Session Parameters
+            proxy_session_id=kwargs.get("proxy_session_id"),
+            proxy_session_ttl=kwargs.get("proxy_session_ttl"),
+            proxy_session_auto_release=kwargs.get("proxy_session_auto_release", False),
            # Browser Location and Identity Parameters
            locale=kwargs.get("locale", None),
            timezone_id=kwargs.get("timezone_id", None),
@@ -1643,6 +1729,8 @@ class CrawlerRunConfig():
            # Connection Parameters
            method=kwargs.get("method", "GET"),
            stream=kwargs.get("stream", False),
+            prefetch=kwargs.get("prefetch", False),
+            process_in_browser=kwargs.get("process_in_browser", False),
            check_robots_txt=kwargs.get("check_robots_txt", False),
            user_agent=kwargs.get("user_agent"),
            user_agent_mode=kwargs.get("user_agent_mode"),
@@ -1652,6 +1740,7 @@ class CrawlerRunConfig():
            # Link Extraction Parameters
            link_preview_config=kwargs.get("link_preview_config"),
            url=kwargs.get("url"),
+            base_url=kwargs.get("base_url"),
            # URL Matching Parameters
            url_matcher=kwargs.get("url_matcher"),
            match_mode=kwargs.get("match_mode", MatchMode.OR),
@@ -1691,6 +1780,9 @@ class CrawlerRunConfig():
            "scraping_strategy": self.scraping_strategy,
            "proxy_config": self.proxy_config,
            "proxy_rotation_strategy": self.proxy_rotation_strategy,
+            "proxy_session_id": self.proxy_session_id,
+            "proxy_session_ttl": self.proxy_session_ttl,
+            "proxy_session_auto_release": self.proxy_session_auto_release,
            "locale": self.locale,
            "timezone_id": self.timezone_id,
            "geolocation": self.geolocation,
@@ -1747,6 +1839,8 @@ class CrawlerRunConfig():
            "capture_console_messages": self.capture_console_messages,
            "method": self.method,
            "stream": self.stream,
+            "prefetch": self.prefetch,
+            "process_in_browser": self.process_in_browser,
            "check_robots_txt": self.check_robots_txt,
            "user_agent": self.user_agent,
            "user_agent_mode": self.user_agent_mode,
@@ -1902,6 +1996,8 @@ class SeedingConfig:
        score_threshold: Optional[float] = None,
        scoring_method: str = "bm25",
        filter_nonsense_urls: bool = True,
+        cache_ttl_hours: int = 24,
+        validate_sitemap_lastmod: bool = True,
    ):
        """
        Initialize URL seeding configuration.
@@ -1937,6 +2033,10 @@ class SeedingConfig:
                          Future: "semantic". Default: "bm25"
            filter_nonsense_urls: Filter out utility URLs like robots.txt, sitemap.xml,
                                 ads.txt, favicon.ico, etc. Default: True
+            cache_ttl_hours: Hours before sitemap cache expires. Set to 0 to disable TTL
+                            (only lastmod validation). Default: 24
+            validate_sitemap_lastmod: If True, compares sitemap's <lastmod> with cache
+                                     timestamp and refetches if sitemap is newer. Default: True
        """
        self.source = source
        self.pattern = pattern
@@ -1953,6 +2053,8 @@ class SeedingConfig:
        self.score_threshold = score_threshold
        self.scoring_method = scoring_method
        self.filter_nonsense_urls = filter_nonsense_urls
+        self.cache_ttl_hours = cache_ttl_hours
+        self.validate_sitemap_lastmod = validate_sitemap_lastmod

    # Add to_dict, from_kwargs, and clone methods for consistency
    def to_dict(self) -> Dict[str, Any]:
--- a/crawl4ai/async_crawler_strategy.py
+++ b/crawl4ai/async_crawler_strategy.py
@@ -452,48 +452,48 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        if url.startswith(("http://", "https://", "view-source:")):
            return await self._crawl_web(url, config)

-        elif url.startswith("file://"):
-            # initialize empty lists for console messages
-            captured_console = []
+        elif url.startswith("file://") or url.startswith("raw://") or url.startswith("raw:"):
+            # Check if browser processing is required for file:// or raw: URLs
+            needs_browser = (
+                config.process_in_browser or
+                config.screenshot or
+                config.pdf or
+                config.capture_mhtml or
+                config.js_code or
+                config.wait_for or
+                config.scan_full_page or
+                config.remove_overlay_elements or
+                config.simulate_user or
+                config.magic or
+                config.process_iframes or
+                config.capture_console_messages or
+                config.capture_network_requests
+            )

+            if needs_browser:
+                # Route through _crawl_web() for full browser pipeline
+                # _crawl_web() will detect file:// and raw: URLs and use set_content()
+                return await self._crawl_web(url, config)
+
+            # Fast path: return HTML directly without browser interaction
+            if url.startswith("file://"):
                # Process local file
                local_file_path = url[7:]  # Remove 'file://' prefix
                if not os.path.exists(local_file_path):
                    raise FileNotFoundError(f"Local file not found: {local_file_path}")
                with open(local_file_path, "r", encoding="utf-8") as f:
                    html = f.read()
-            if config.screenshot:
-                screenshot_data = await self._generate_screenshot_from_html(html)
-            if config.capture_console_messages:
-                page, context = await self.browser_manager.get_page(crawlerRunConfig=config)
-                captured_console = await self._capture_console_messages(page, url)
+            else:
+                # Process raw HTML content (raw:// or raw:)
+                html = url[6:] if url.startswith("raw://") else url[4:]

            return AsyncCrawlResponse(
                html=html,
                response_headers=response_headers,
                status_code=status_code,
-                screenshot=screenshot_data,
-                get_delayed_content=None,
-                console_messages=captured_console,
-            )
-
-        ##### 
-        # Since both "raw:" and "raw://" start with "raw:", the first condition is always true for both, so "raw://" will be sliced as "//...", which is incorrect.
-        # Fix: Check for "raw://" first, then "raw:"
-        # Also, the prefix "raw://" is actually 6 characters long, not 7, so it should be sliced accordingly: url[6:]
-        #####
-        elif url.startswith("raw://") or url.startswith("raw:"):
-            # Process raw HTML content
-            # raw_html = url[4:] if url[:4] == "raw:" else url[7:]
-            raw_html = url[6:] if url.startswith("raw://") else url[4:]
-            html = raw_html
-            if config.screenshot:
-                screenshot_data = await self._generate_screenshot_from_html(html)
-            return AsyncCrawlResponse(
-                html=html,
-                response_headers=response_headers,
-                status_code=status_code,
-                screenshot=screenshot_data,
+                screenshot=None,
+                pdf_data=None,
+                mhtml_data=None,
                get_delayed_content=None,
            )
        else:
@@ -666,6 +666,28 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            if not config.js_only:
                await self.execute_hook("before_goto", page, context=context, url=url, config=config)

+                # Check if this is a file:// or raw: URL that needs set_content() instead of goto()
+                is_local_content = url.startswith("file://") or url.startswith("raw://") or url.startswith("raw:")
+
+                if is_local_content:
+                    # Load local content using set_content() instead of network navigation
+                    if url.startswith("file://"):
+                        local_file_path = url[7:]  # Remove 'file://' prefix
+                        if not os.path.exists(local_file_path):
+                            raise FileNotFoundError(f"Local file not found: {local_file_path}")
+                        with open(local_file_path, "r", encoding="utf-8") as f:
+                            html_content = f.read()
+                    else:
+                        # raw:// or raw:
+                        html_content = url[6:] if url.startswith("raw://") else url[4:]
+
+                    await page.set_content(html_content, wait_until=config.wait_until)
+                    response = None
+                    redirected_url = config.base_url or url
+                    status_code = 200
+                    response_headers = {}
+                else:
+                    # Standard web navigation with goto()
                    try:
                        # Generate a unique nonce for this request
                        if config.experimental.get("use_csp_nonce", False):
@@ -695,10 +717,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                        else:
                            raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")

-                await self.execute_hook(
-                    "after_goto", page, context=context, url=url, response=response, config=config
-                )
-
                    # ──────────────────────────────────────────────────────────────
                    # Walk the redirect chain.  Playwright returns only the last
                    # hop, so we trace the `request.redirected_from` links until the
@@ -720,12 +738,10 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):

                        status_code = first_resp.status
                        response_headers = first_resp.headers
-                # if response is None:
-                #     status_code = 200
-                #     response_headers = {}
-                # else:
-                #     status_code = response.status
-                #     response_headers = response.headers
+
+                await self.execute_hook(
+                    "after_goto", page, context=context, url=url, response=response, config=config
+                )

            else:
                status_code = 200
@@ -1525,6 +1541,77 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):

        return captured_console

+    async def _generate_media_from_html(
+        self, html: str, config: CrawlerRunConfig = None
+    ) -> tuple:
+        """
+        Generate media (screenshot, PDF, MHTML) from raw HTML content.
+
+        This method is used for raw: and file:// URLs where we have HTML content
+        but need to render it in a browser to generate media outputs.
+
+        Args:
+            html (str): The raw HTML content to render
+            config (CrawlerRunConfig, optional): Configuration for media options
+
+        Returns:
+            tuple: (screenshot_data, pdf_data, mhtml_data) - any can be None
+        """
+        page = None
+        screenshot_data = None
+        pdf_data = None
+        mhtml_data = None
+
+        try:
+            # Get a browser page
+            config = config or CrawlerRunConfig()
+            page, context = await self.browser_manager.get_page(crawlerRunConfig=config)
+
+            # Load the HTML content into the page
+            await page.set_content(html, wait_until="domcontentloaded")
+
+            # Generate requested media
+            if config.pdf:
+                pdf_data = await self.export_pdf(page)
+
+            if config.capture_mhtml:
+                mhtml_data = await self.capture_mhtml(page)
+
+            if config.screenshot:
+                if config.screenshot_wait_for:
+                    await asyncio.sleep(config.screenshot_wait_for)
+                screenshot_height_threshold = getattr(config, 'screenshot_height_threshold', None)
+                screenshot_data = await self.take_screenshot(
+                    page, screenshot_height_threshold=screenshot_height_threshold
+                )
+
+            return screenshot_data, pdf_data, mhtml_data
+
+        except Exception as e:
+            error_message = f"Failed to generate media from HTML: {str(e)}"
+            self.logger.error(
+                message="HTML media generation failed: {error}",
+                tag="ERROR",
+                params={"error": error_message},
+            )
+            # Return error image for screenshot if it was requested
+            if config and config.screenshot:
+                img = Image.new("RGB", (800, 600), color="black")
+                draw = ImageDraw.Draw(img)
+                font = ImageFont.load_default()
+                draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
+                buffered = BytesIO()
+                img.save(buffered, format="JPEG")
+                screenshot_data = base64.b64encode(buffered.getvalue()).decode("utf-8")
+            return screenshot_data, pdf_data, mhtml_data
+        finally:
+            # Clean up the page
+            if page:
+                try:
+                    await page.close()
+                except Exception:
+                    pass
+
    async def take_screenshot(self, page, **kwargs) -> str:
        """
        Take a screenshot of the current page.
@@ -2293,6 +2380,25 @@ class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
        )


+    def _format_proxy_url(self, proxy_config) -> str:
+        """Format ProxyConfig into aiohttp-compatible proxy URL."""
+        if not proxy_config:
+            return None
+
+        server = proxy_config.server
+        username = getattr(proxy_config, 'username', None)
+        password = getattr(proxy_config, 'password', None)
+
+        if username and password:
+            # Insert credentials into URL: http://user:pass@host:port
+            if '://' in server:
+                protocol, rest = server.split('://', 1)
+                return f"{protocol}://{username}:{password}@{rest}"
+            else:
+                return f"http://{username}:{password}@{server}"
+
+        return server
+
    async def _handle_http(
        self,
        url: str,
@@ -2316,6 +2422,12 @@ class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
                'headers': headers
            }

+            # Add proxy support - use config.proxy_config (set by arun() from rotation strategy or direct config)
+            proxy_url = None
+            if config.proxy_config:
+                proxy_url = self._format_proxy_url(config.proxy_config)
+                request_kwargs['proxy'] = proxy_url
+
            if self.browser_config.method == "POST":
                if self.browser_config.data:
                    request_kwargs['data'] = self.browser_config.data
@@ -2386,7 +2498,10 @@ class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
            if scheme == 'file':
                return await self._handle_file(parsed.path)
            elif scheme == 'raw':
-                return await self._handle_raw(parsed.path)
+                # Don't use parsed.path - urlparse truncates at '#' which is common in CSS
+                # Strip prefix directly: "raw://" (6 chars) or "raw:" (4 chars)
+                raw_content = url[6:] if url.startswith("raw://") else url[4:]
+                return await self._handle_raw(raw_content)
            else:  # http or https
                return await self._handle_http(url, config)
                
--- a/crawl4ai/async_database.py
+++ b/crawl4ai/async_database.py
@@ -1,4 +1,5 @@
 import os
+import time
 from pathlib import Path
 import aiosqlite
 import asyncio
@@ -262,6 +263,11 @@ class AsyncDatabaseManager:
                "screenshot",
                "response_headers",
                "downloaded_files",
+                # Smart cache validation columns (added in 0.8.x)
+                "etag",
+                "last_modified",
+                "head_fingerprint",
+                "cached_at",
            ]

            for column in new_columns:
@@ -275,6 +281,11 @@ class AsyncDatabaseManager:
            await db.execute(
                f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"'
            )
+        elif new_column == "cached_at":
+            # Timestamp column for cache validation
+            await db.execute(
+                f"ALTER TABLE crawled_data ADD COLUMN {new_column} REAL DEFAULT 0"
+            )
        else:
            await db.execute(
                f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
@@ -378,6 +389,92 @@ class AsyncDatabaseManager:
            )
            return None

+    async def aget_cache_metadata(self, url: str) -> Optional[Dict]:
+        """
+        Retrieve only cache validation metadata for a URL (lightweight query).
+
+        Returns dict with: url, etag, last_modified, head_fingerprint, cached_at, response_headers
+        This is used for cache validation without loading full content.
+        """
+        async def _get_metadata(db):
+            async with db.execute(
+                """SELECT url, etag, last_modified, head_fingerprint, cached_at, response_headers
+                   FROM crawled_data WHERE url = ?""",
+                (url,)
+            ) as cursor:
+                row = await cursor.fetchone()
+                if not row:
+                    return None
+
+                columns = [description[0] for description in cursor.description]
+                row_dict = dict(zip(columns, row))
+
+                # Parse response_headers JSON
+                try:
+                    row_dict["response_headers"] = (
+                        json.loads(row_dict["response_headers"])
+                        if row_dict["response_headers"] else {}
+                    )
+                except json.JSONDecodeError:
+                    row_dict["response_headers"] = {}
+
+                return row_dict
+
+        try:
+            return await self.execute_with_retry(_get_metadata)
+        except Exception as e:
+            self.logger.error(
+                message="Error retrieving cache metadata: {error}",
+                tag="ERROR",
+                force_verbose=True,
+                params={"error": str(e)},
+            )
+            return None
+
+    async def aupdate_cache_metadata(
+        self,
+        url: str,
+        etag: Optional[str] = None,
+        last_modified: Optional[str] = None,
+        head_fingerprint: Optional[str] = None,
+    ):
+        """
+        Update only the cache validation metadata for a URL.
+        Used to update etag/last_modified after a successful validation.
+        """
+        async def _update(db):
+            updates = []
+            values = []
+
+            if etag is not None:
+                updates.append("etag = ?")
+                values.append(etag)
+            if last_modified is not None:
+                updates.append("last_modified = ?")
+                values.append(last_modified)
+            if head_fingerprint is not None:
+                updates.append("head_fingerprint = ?")
+                values.append(head_fingerprint)
+
+            if not updates:
+                return
+
+            values.append(url)
+            await db.execute(
+                f"UPDATE crawled_data SET {', '.join(updates)} WHERE url = ?",
+                tuple(values)
+            )
+
+        try:
+            await self.execute_with_retry(_update)
+        except Exception as e:
+            self.logger.error(
+                message="Error updating cache metadata: {error}",
+                tag="ERROR",
+                force_verbose=True,
+                params={"error": str(e)},
+            )
+
    async def acache_url(self, result: CrawlResult):
        """Cache CrawlResult data"""
        # Store content files and get hashes
@@ -425,15 +522,24 @@ class AsyncDatabaseManager:
        for field, (content, content_type) in content_map.items():
            content_hashes[field] = await self._store_content(content, content_type)

+        # Extract cache validation headers from response
+        response_headers = result.response_headers or {}
+        etag = response_headers.get("etag") or response_headers.get("ETag") or ""
+        last_modified = response_headers.get("last-modified") or response_headers.get("Last-Modified") or ""
+        # head_fingerprint is set by caller via result attribute (if available)
+        head_fingerprint = getattr(result, "head_fingerprint", None) or ""
+        cached_at = time.time()
+
        async def _cache(db):
            await db.execute(
                """
                INSERT INTO crawled_data (
                    url, html, cleaned_html, markdown,
                    extracted_content, success, media, links, metadata,
-                    screenshot, response_headers, downloaded_files
+                    screenshot, response_headers, downloaded_files,
+                    etag, last_modified, head_fingerprint, cached_at
                )
-                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                ON CONFLICT(url) DO UPDATE SET
                    html = excluded.html,
                    cleaned_html = excluded.cleaned_html,
@@ -445,7 +551,11 @@ class AsyncDatabaseManager:
                    metadata = excluded.metadata,
                    screenshot = excluded.screenshot,
                    response_headers = excluded.response_headers,
-                    downloaded_files = excluded.downloaded_files
+                    downloaded_files = excluded.downloaded_files,
+                    etag = excluded.etag,
+                    last_modified = excluded.last_modified,
+                    head_fingerprint = excluded.head_fingerprint,
+                    cached_at = excluded.cached_at
            """,
                (
                    result.url,
@@ -460,6 +570,10 @@ class AsyncDatabaseManager:
                    content_hashes["screenshot"],
                    json.dumps(result.response_headers or {}),
                    json.dumps(result.downloaded_files or []),
+                    etag,
+                    last_modified,
+                    head_fingerprint,
+                    cached_at,
                ),
            )

--- a/crawl4ai/async_url_seeder.py
+++ b/crawl4ai/async_url_seeder.py
@@ -24,7 +24,7 @@ import os
 import pathlib
 import re
 import time
-from datetime import timedelta
+from datetime import datetime, timedelta, timezone
 from pathlib import Path
 from typing import Any, Dict, Iterable, List, Optional, Sequence, Union
 from urllib.parse import quote, urljoin
@@ -78,6 +78,103 @@ _link_rx = re.compile(
 # ────────────────────────────────────────────────────────────────────────── helpers


+def _parse_sitemap_lastmod(xml_content: bytes) -> Optional[str]:
+    """Extract the most recent lastmod from sitemap XML."""
+    try:
+        if LXML:
+            root = etree.fromstring(xml_content)
+            # Get all lastmod elements (namespace-agnostic)
+            lastmods = root.xpath("//*[local-name()='lastmod']/text()")
+            if lastmods:
+                # Return the most recent one
+                return max(lastmods)
+    except Exception:
+        pass
+    return None
+
+
+def _is_cache_valid(
+    cache_path: pathlib.Path,
+    ttl_hours: int,
+    validate_lastmod: bool,
+    current_lastmod: Optional[str] = None
+) -> bool:
+    """
+    Check if sitemap cache is still valid.
+
+    Returns False (invalid) if:
+    - File doesn't exist
+    - File is corrupted/unreadable
+    - TTL expired (if ttl_hours > 0)
+    - Sitemap lastmod is newer than cache (if validate_lastmod=True)
+    """
+    if not cache_path.exists():
+        return False
+
+    try:
+        with open(cache_path, "r") as f:
+            data = json.load(f)
+
+        # Check version
+        if data.get("version") != 1:
+            return False
+
+        # Check TTL
+        if ttl_hours > 0:
+            created_at = datetime.fromisoformat(data["created_at"].replace("Z", "+00:00"))
+            age_hours = (datetime.now(timezone.utc) - created_at).total_seconds() / 3600
+            if age_hours > ttl_hours:
+                return False
+
+        # Check lastmod
+        if validate_lastmod and current_lastmod:
+            cached_lastmod = data.get("sitemap_lastmod")
+            if cached_lastmod and current_lastmod > cached_lastmod:
+                return False
+
+        # Check URL count (sanity check - if 0, likely corrupted)
+        if data.get("url_count", 0) == 0:
+            return False
+
+        return True
+
+    except (json.JSONDecodeError, KeyError, ValueError, IOError):
+        # Corrupted cache - return False to trigger refetch
+        return False
+
+
+def _read_cache(cache_path: pathlib.Path) -> List[str]:
+    """Read URLs from cache file. Returns empty list on error."""
+    try:
+        with open(cache_path, "r") as f:
+            data = json.load(f)
+        return data.get("urls", [])
+    except Exception:
+        return []
+
+
+def _write_cache(
+    cache_path: pathlib.Path,
+    urls: List[str],
+    sitemap_url: str,
+    sitemap_lastmod: Optional[str]
+) -> None:
+    """Write URLs to cache with metadata."""
+    data = {
+        "version": 1,
+        "created_at": datetime.now(timezone.utc).isoformat(),
+        "sitemap_lastmod": sitemap_lastmod,
+        "sitemap_url": sitemap_url,
+        "url_count": len(urls),
+        "urls": urls
+    }
+    try:
+        with open(cache_path, "w") as f:
+            json.dump(data, f)
+    except Exception:
+        pass  # Fail silently - cache is optional
+
+
 def _match(url: str, pattern: str) -> bool:
    if fnmatch.fnmatch(url, pattern):
        return True
@@ -295,6 +392,10 @@ class AsyncUrlSeeder:
        score_threshold = config.score_threshold
        scoring_method = config.scoring_method

+        # Store cache config for use in _from_sitemaps
+        self._cache_ttl_hours = getattr(config, 'cache_ttl_hours', 24)
+        self._validate_sitemap_lastmod = getattr(config, 'validate_sitemap_lastmod', True)
+
        # Ensure seeder's logger verbose matches the config's verbose if it's set
        if self.logger and hasattr(self.logger, 'verbose') and config.verbose is not None:
            self.logger.verbose = config.verbose
@@ -764,68 +865,222 @@ class AsyncUrlSeeder:
    # ─────────────────────────────── Sitemaps
    async def _from_sitemaps(self, domain: str, pattern: str, force: bool = False):
        """
-        1. Probe default sitemap locations.
-        2. If none exist, parse robots.txt for alternative sitemap URLs.
-        3. Yield only URLs that match `pattern`.
+        Discover URLs from sitemaps with smart TTL-based caching.
+
+        1. Check cache validity (TTL + lastmod)
+        2. If valid, yield from cache
+        3. If invalid or force=True, fetch fresh and update cache
+        4. FALLBACK: If anything fails, bypass cache and fetch directly
        """
+        # Get config values (passed via self during urls() call)
+        cache_ttl_hours = getattr(self, '_cache_ttl_hours', 24)
+        validate_lastmod = getattr(self, '_validate_sitemap_lastmod', True)

-       # ── cache file (same logic as _from_cc)
+        # Cache file path (new format: .json instead of .jsonl)
        host = re.sub(r'^https?://', '', domain).rstrip('/')
-        host = re.sub('[/?#]+', '_', domain)
+        host_safe = re.sub('[/?#]+', '_', host)
        digest = hashlib.md5(pattern.encode()).hexdigest()[:8]
-        path = self.cache_dir / f"sitemap_{host}_{digest}.jsonl"
+        cache_path = self.cache_dir / f"sitemap_{host_safe}_{digest}.json"

-        if path.exists() and not force:
-            self._log("info", "Loading sitemap URLs for {d} from cache: {p}",
-                      params={"d": host, "p": str(path)}, tag="URL_SEED")
-            async with aiofiles.open(path, "r") as fp:
-                async for line in fp:
-                    url = line.strip()
-                    if _match(url, pattern):
-                        yield url
-            return
+        # Check for old .jsonl format and delete it
+        old_cache_path = self.cache_dir / f"sitemap_{host_safe}_{digest}.jsonl"
+        if old_cache_path.exists():
+            try:
+                old_cache_path.unlink()
+                self._log("info", "Deleted old cache format: {p}",
+                          params={"p": str(old_cache_path)}, tag="URL_SEED")
+            except Exception:
+                pass

-        # 1️⃣ direct sitemap probe
-        # strip any scheme so we can handle https → http fallback
-        host = re.sub(r'^https?://', '', domain).rstrip('/')
+        # Step 1: Find sitemap URL and get lastmod (needed for validation)
+        sitemap_url = None
+        sitemap_lastmod = None
+        sitemap_content = None

-        schemes = ('https', 'http')  # prefer TLS, downgrade if needed
+        schemes = ('https', 'http')
        for scheme in schemes:
            for suffix in ("/sitemap.xml", "/sitemap_index.xml"):
                sm = f"{scheme}://{host}{suffix}"
-                sm = await self._resolve_head(sm)
-                if sm:
-                    self._log("info", "Found sitemap at {url}", params={
-                              "url": sm}, tag="URL_SEED")
-                    async with aiofiles.open(path, "w") as fp:
-                        async for u in self._iter_sitemap(sm):
-                            await fp.write(u + "\n")
+                resolved = await self._resolve_head(sm)
+                if resolved:
+                    sitemap_url = resolved
+                    # Fetch sitemap content to get lastmod
+                    try:
+                        r = await self.client.get(sitemap_url, timeout=15, follow_redirects=True)
+                        if 200 <= r.status_code < 300:
+                            sitemap_content = r.content
+                            sitemap_lastmod = _parse_sitemap_lastmod(sitemap_content)
+                    except Exception:
+                        pass
+                    break
+            if sitemap_url:
+                break
+
+        # Step 2: Check cache validity (skip if force=True)
+        if not force and cache_path.exists():
+            if _is_cache_valid(cache_path, cache_ttl_hours, validate_lastmod, sitemap_lastmod):
+                self._log("info", "Loading sitemap URLs from valid cache: {p}",
+                          params={"p": str(cache_path)}, tag="URL_SEED")
+                cached_urls = _read_cache(cache_path)
+                for url in cached_urls:
+                    if _match(url, pattern):
+                        yield url
+                return
+            else:
+                self._log("info", "Cache invalid/expired, refetching sitemap for {d}",
+                          params={"d": domain}, tag="URL_SEED")
+
+        # Step 3: Fetch fresh URLs
+        discovered_urls = []
+
+        if sitemap_url and sitemap_content:
+            self._log("info", "Found sitemap at {url}", params={"url": sitemap_url}, tag="URL_SEED")
+
+            # Parse sitemap (reuse content we already fetched)
+            async for u in self._iter_sitemap_content(sitemap_url, sitemap_content):
+                discovered_urls.append(u)
                if _match(u, pattern):
                    yield u
-                    return
-
-        # 2️⃣ robots.txt fallback
-        robots = f"https://{domain.rstrip('/')}/robots.txt"
+        elif sitemap_url:
+            # We have a sitemap URL but no content (fetch failed earlier), try again
+            self._log("info", "Found sitemap at {url}", params={"url": sitemap_url}, tag="URL_SEED")
+            async for u in self._iter_sitemap(sitemap_url):
+                discovered_urls.append(u)
+                if _match(u, pattern):
+                    yield u
+        else:
+            # Fallback: robots.txt
+            robots = f"https://{host}/robots.txt"
            try:
                r = await self.client.get(robots, timeout=10, follow_redirects=True)
-            if not 200 <= r.status_code < 300:
-                self._log("warning", "robots.txt unavailable for {d} HTTP{c}", params={
-                          "d": domain, "c": r.status_code}, tag="URL_SEED")
-                return
-            sitemap_lines = [l.split(":", 1)[1].strip(
-            ) for l in r.text.splitlines() if l.lower().startswith("sitemap:")]
-        except Exception as e:
-            self._log("warning", "Failed to fetch robots.txt for {d}: {e}", params={
-                      "d": domain, "e": str(e)}, tag="URL_SEED")
-            return
-
-        if sitemap_lines:
-            async with aiofiles.open(path, "w") as fp:
+                if 200 <= r.status_code < 300:
+                    sitemap_lines = [l.split(":", 1)[1].strip()
+                                     for l in r.text.splitlines()
+                                     if l.lower().startswith("sitemap:")]
                    for sm in sitemap_lines:
                        async for u in self._iter_sitemap(sm):
-                        await fp.write(u + "\n")
+                            discovered_urls.append(u)
                            if _match(u, pattern):
                                yield u
+                else:
+                    self._log("warning", "robots.txt unavailable for {d} HTTP{c}",
+                              params={"d": domain, "c": r.status_code}, tag="URL_SEED")
+                    return
+            except Exception as e:
+                self._log("warning", "Failed to fetch robots.txt for {d}: {e}",
+                          params={"d": domain, "e": str(e)}, tag="URL_SEED")
+                return
+
+        # Step 4: Write to cache (FALLBACK: if write fails, URLs still yielded above)
+        if discovered_urls:
+            _write_cache(cache_path, discovered_urls, sitemap_url or "", sitemap_lastmod)
+            self._log("info", "Cached {count} URLs for {d}",
+                      params={"count": len(discovered_urls), "d": domain}, tag="URL_SEED")
+
+    async def _iter_sitemap_content(self, url: str, content: bytes):
+        """Parse sitemap from already-fetched content."""
+        data = gzip.decompress(content) if url.endswith(".gz") else content
+        base_url = url
+
+        def _normalize_loc(raw: Optional[str]) -> Optional[str]:
+            if not raw:
+                return None
+            normalized = urljoin(base_url, raw.strip())
+            if not normalized:
+                return None
+            return normalized
+
+        # Detect if this is a sitemap index
+        is_sitemap_index = False
+        sub_sitemaps = []
+        regular_urls = []
+
+        if LXML:
+            try:
+                parser = etree.XMLParser(recover=True)
+                root = etree.fromstring(data, parser=parser)
+                sitemap_loc_nodes = root.xpath("//*[local-name()='sitemap']/*[local-name()='loc']")
+                url_loc_nodes = root.xpath("//*[local-name()='url']/*[local-name()='loc']")
+
+                if sitemap_loc_nodes:
+                    is_sitemap_index = True
+                    for sitemap_elem in sitemap_loc_nodes:
+                        loc = _normalize_loc(sitemap_elem.text)
+                        if loc:
+                            sub_sitemaps.append(loc)
+
+                if not is_sitemap_index:
+                    for loc_elem in url_loc_nodes:
+                        loc = _normalize_loc(loc_elem.text)
+                        if loc:
+                            regular_urls.append(loc)
+            except Exception as e:
+                self._log("error", "LXML parsing error for sitemap {url}: {error}",
+                          params={"url": url, "error": str(e)}, tag="URL_SEED")
+                return
+        else:
+            import xml.etree.ElementTree as ET
+            try:
+                root = ET.fromstring(data)
+                for elem in root.iter():
+                    if '}' in elem.tag:
+                        elem.tag = elem.tag.split('}')[1]
+
+                sitemaps = root.findall('.//sitemap')
+                url_entries = root.findall('.//url')
+
+                if sitemaps:
+                    is_sitemap_index = True
+                    for sitemap in sitemaps:
+                        loc_elem = sitemap.find('loc')
+                        loc = _normalize_loc(loc_elem.text if loc_elem is not None else None)
+                        if loc:
+                            sub_sitemaps.append(loc)
+
+                if not is_sitemap_index:
+                    for url_elem in url_entries:
+                        loc_elem = url_elem.find('loc')
+                        loc = _normalize_loc(loc_elem.text if loc_elem is not None else None)
+                        if loc:
+                            regular_urls.append(loc)
+            except Exception as e:
+                self._log("error", "ElementTree parsing error for sitemap {url}: {error}",
+                          params={"url": url, "error": str(e)}, tag="URL_SEED")
+                return
+
+        # Process based on type
+        if is_sitemap_index and sub_sitemaps:
+            self._log("info", "Processing sitemap index with {count} sub-sitemaps",
+                      params={"count": len(sub_sitemaps)}, tag="URL_SEED")
+
+            queue_size = min(50000, len(sub_sitemaps) * 1000)
+            result_queue = asyncio.Queue(maxsize=queue_size)
+            completed_count = 0
+            total_sitemaps = len(sub_sitemaps)
+
+            async def process_subsitemap(sitemap_url: str):
+                try:
+                    async for u in self._iter_sitemap(sitemap_url):
+                        await result_queue.put(u)
+                except Exception as e:
+                    self._log("error", "Error processing sub-sitemap {url}: {error}",
+                              params={"url": sitemap_url, "error": str(e)}, tag="URL_SEED")
+                finally:
+                    await result_queue.put(None)
+
+            tasks = [asyncio.create_task(process_subsitemap(sm)) for sm in sub_sitemaps]
+
+            while completed_count < total_sitemaps:
+                item = await result_queue.get()
+                if item is None:
+                    completed_count += 1
+                else:
+                    yield item
+
+            await asyncio.gather(*tasks, return_exceptions=True)
+        else:
+            for u in regular_urls:
+                yield u

    async def _iter_sitemap(self, url: str):
        try:
--- a/crawl4ai/async_webcrawler.py
+++ b/crawl4ai/async_webcrawler.py
@@ -47,7 +47,9 @@ from .utils import (
    get_error_context,
    RobotsParser,
    preprocess_html_for_schema,
+    compute_head_fingerprint,
 )
+from .cache_validator import CacheValidator, CacheValidationResult


 class AsyncWebCrawler:
@@ -267,6 +269,51 @@ class AsyncWebCrawler:
                if cache_context.should_read():
                    cached_result = await async_db_manager.aget_cached_url(url)

+                # Smart Cache: Validate cache freshness if enabled
+                if cached_result and config.check_cache_freshness:
+                    cache_metadata = await async_db_manager.aget_cache_metadata(url)
+                    if cache_metadata:
+                        async with CacheValidator(timeout=config.cache_validation_timeout) as validator:
+                            validation = await validator.validate(
+                                url=url,
+                                stored_etag=cache_metadata.get("etag"),
+                                stored_last_modified=cache_metadata.get("last_modified"),
+                                stored_head_fingerprint=cache_metadata.get("head_fingerprint"),
+                            )
+
+                        if validation.status == CacheValidationResult.FRESH:
+                            cached_result.cache_status = "hit_validated"
+                            self.logger.info(
+                                message="Cache validated: {reason}",
+                                tag="CACHE",
+                                params={"reason": validation.reason}
+                            )
+                            # Update metadata if we got new values
+                            if validation.new_etag or validation.new_last_modified:
+                                await async_db_manager.aupdate_cache_metadata(
+                                    url=url,
+                                    etag=validation.new_etag,
+                                    last_modified=validation.new_last_modified,
+                                    head_fingerprint=validation.new_head_fingerprint,
+                                )
+                        elif validation.status == CacheValidationResult.ERROR:
+                            cached_result.cache_status = "hit_fallback"
+                            self.logger.warning(
+                                message="Cache validation failed, using cached: {reason}",
+                                tag="CACHE",
+                                params={"reason": validation.reason}
+                            )
+                        else:
+                            # STALE or UNKNOWN - force recrawl
+                            self.logger.info(
+                                message="Cache stale: {reason}",
+                                tag="CACHE",
+                                params={"reason": validation.reason}
+                            )
+                            cached_result = None
+                elif cached_result:
+                    cached_result.cache_status = "hit"
+
                if cached_result:
                    html = sanitize_input_encode(cached_result.html)
                    extracted_content = sanitize_input_encode(
@@ -296,6 +343,24 @@ class AsyncWebCrawler:

                # Update proxy configuration from rotation strategy if available
                if config and config.proxy_rotation_strategy:
+                    # Handle sticky sessions - use same proxy for all requests with same session_id
+                    if config.proxy_session_id:
+                        next_proxy: ProxyConfig = await config.proxy_rotation_strategy.get_proxy_for_session(
+                            config.proxy_session_id,
+                            ttl=config.proxy_session_ttl
+                        )
+                        if next_proxy:
+                            self.logger.info(
+                                message="Using sticky proxy session: {session_id} -> {proxy}",
+                                tag="PROXY",
+                                params={
+                                    "session_id": config.proxy_session_id,
+                                    "proxy": next_proxy.server
+                                }
+                            )
+                            config.proxy_config = next_proxy
+                    else:
+                        # Existing behavior: rotate on each request
                        next_proxy: ProxyConfig = await config.proxy_rotation_strategy.get_next_proxy()
                        if next_proxy:
                            self.logger.info(
@@ -304,7 +369,6 @@ class AsyncWebCrawler:
                                params={"proxy": next_proxy.server}
                            )
                            config.proxy_config = next_proxy
-                        # config = config.clone(proxy_config=next_proxy)

                # Fetch fresh content if needed
                if not cached_result or not html:
@@ -383,6 +447,14 @@ class AsyncWebCrawler:
                    crawl_result.success = bool(html)
                    crawl_result.session_id = getattr(
                        config, "session_id", None)
+                    crawl_result.cache_status = "miss"
+
+                    # Compute head fingerprint for cache validation
+                    if html:
+                        head_end = html.lower().find('</head>')
+                        if head_end != -1:
+                            head_html = html[:head_end + 7]
+                            crawl_result.head_fingerprint = compute_head_fingerprint(head_html)

                    self.logger.url_status(
                        url=cache_context.display_url,
@@ -459,6 +531,27 @@ class AsyncWebCrawler:
        Returns:
            CrawlResult: Processed result containing extracted and formatted content
        """
+        # === PREFETCH MODE SHORT-CIRCUIT ===
+        if getattr(config, 'prefetch', False):
+            from .utils import quick_extract_links
+
+            # Use base_url from config (for raw: URLs), redirected_url, or original url
+            effective_url = getattr(config, 'base_url', None) or kwargs.get('redirected_url') or url
+            links = quick_extract_links(html, effective_url)
+
+            return CrawlResult(
+                url=url,
+                html=html,
+                success=True,
+                links=links,
+                status_code=kwargs.get('status_code'),
+                response_headers=kwargs.get('response_headers'),
+                redirected_url=kwargs.get('redirected_url'),
+                ssl_certificate=kwargs.get('ssl_certificate'),
+                # All other fields default to None
+            )
+        # === END PREFETCH SHORT-CIRCUIT ===
+
        cleaned_html = ""
        try:
            _url = url if not kwargs.get("is_raw_html", False) else "Raw HTML"
@@ -563,7 +656,8 @@ class AsyncWebCrawler:
        markdown_result: MarkdownGenerationResult = (
            markdown_generator.generate_markdown(
                input_html=markdown_input_html,
-                base_url=params.get("redirected_url", url)
+                # Use explicit base_url if provided (for raw: HTML), otherwise redirected_url, then url
+                base_url=params.get("base_url") or params.get("redirected_url") or url
                # html2text_options=kwargs.get('html2text', {})
            )
        )
@@ -756,21 +850,45 @@ class AsyncWebCrawler:
        # Handle stream setting - use first config's stream setting if config is a list
        if isinstance(config, list):
            stream = config[0].stream if config else False
+            primary_config = config[0] if config else None
        else:
            stream = config.stream
+            primary_config = config
+
+        # Helper to release sticky session if auto_release is enabled
+        async def maybe_release_session():
+            if (primary_config and
+                primary_config.proxy_session_id and
+                primary_config.proxy_session_auto_release and
+                primary_config.proxy_rotation_strategy):
+                await primary_config.proxy_rotation_strategy.release_session(
+                    primary_config.proxy_session_id
+                )
+                self.logger.info(
+                    message="Auto-released proxy session: {session_id}",
+                    tag="PROXY",
+                    params={"session_id": primary_config.proxy_session_id}
+                )

        if stream:
-
            async def result_transformer():
+                try:
                    async for task_result in dispatcher.run_urls_stream(
                        crawler=self, urls=urls, config=config
                    ):
                        yield transform_result(task_result)
+                finally:
+                    # Auto-release session after streaming completes
+                    await maybe_release_session()

            return result_transformer()
        else:
+            try:
                _results = await dispatcher.run_urls(crawler=self, urls=urls, config=config)
                return [transform_result(res) for res in _results]
+            finally:
+                # Auto-release session after batch completes
+                await maybe_release_session()

    async def aseed_urls(
        self,
--- a/crawl4ai/browser_manager.py
+++ b/crawl4ai/browser_manager.py
@@ -668,8 +668,38 @@ class BrowserManager:

            self.browser = await self.playwright.chromium.connect_over_cdp(cdp_url)
            contexts = self.browser.contexts
+
+            # If browser_context_id is provided, we're using a pre-created context
+            if self.config.browser_context_id:
+                if self.logger:
+                    self.logger.debug(
+                        f"Using pre-existing browser context: {self.config.browser_context_id}",
+                        tag="BROWSER"
+                    )
+                # When connecting to a pre-created context, it should be in contexts
                if contexts:
                    self.default_context = contexts[0]
+                    if self.logger:
+                        self.logger.debug(
+                            f"Found {len(contexts)} existing context(s), using first one",
+                            tag="BROWSER"
+                        )
+                else:
+                    # Context was created but not yet visible - wait a bit
+                    await asyncio.sleep(0.2)
+                    contexts = self.browser.contexts
+                    if contexts:
+                        self.default_context = contexts[0]
+                    else:
+                        # Still no contexts - this shouldn't happen with pre-created context
+                        if self.logger:
+                            self.logger.warning(
+                                "Pre-created context not found, creating new one",
+                                tag="BROWSER"
+                            )
+                        self.default_context = await self.create_browser_context()
+            elif contexts:
+                self.default_context = contexts[0]
            else:
                self.default_context = await self.create_browser_context()
            await self.setup_context(self.default_context)
@@ -687,13 +717,38 @@ class BrowserManager:
            self.default_context = self.browser

    async def _verify_cdp_ready(self, cdp_url: str) -> bool:
-        """Verify CDP endpoint is ready with exponential backoff"""
+        """Verify CDP endpoint is ready with exponential backoff.
+
+        Supports multiple URL formats:
+        - HTTP URLs: http://localhost:9222
+        - HTTP URLs with query params: http://localhost:9222?browser_id=XXX
+        - WebSocket URLs: ws://localhost:9222/devtools/browser/XXX
+        """
        import aiohttp
-        self.logger.debug(f"Starting CDP verification for {cdp_url}", tag="BROWSER")
+        from urllib.parse import urlparse, urlunparse
+
+        # If WebSocket URL, Playwright handles connection directly - skip HTTP verification
+        if cdp_url.startswith(('ws://', 'wss://')):
+            self.logger.debug(f"WebSocket CDP URL provided, skipping HTTP verification", tag="BROWSER")
+            return True
+
+        # Parse HTTP URL and properly construct /json/version endpoint
+        parsed = urlparse(cdp_url)
+        # Build URL with /json/version path, preserving query params
+        verify_url = urlunparse((
+            parsed.scheme,
+            parsed.netloc,
+            '/json/version',  # Always use this path for verification
+            '',  # params
+            parsed.query,  # preserve query string
+            ''   # fragment
+        ))
+
+        self.logger.debug(f"Starting CDP verification for {verify_url}", tag="BROWSER")
        for attempt in range(5):
            try:
                async with aiohttp.ClientSession() as session:
-                    async with session.get(f"{cdp_url}/json/version", timeout=aiohttp.ClientTimeout(total=2)) as response:
+                    async with session.get(verify_url, timeout=aiohttp.ClientTimeout(total=2)) as response:
                        if response.status == 200:
                            self.logger.debug(f"CDP endpoint ready after {attempt + 1} attempts", tag="BROWSER")
                            return True
@@ -840,15 +895,24 @@ class BrowserManager:
            combined_headers.update(self.config.headers)
            await context.set_extra_http_headers(combined_headers)

-        # Add default cookie
+        # Add default cookie (skip for raw:/file:// URLs which are not valid cookie URLs)
+        cookie_url = None
+        if crawlerRunConfig and crawlerRunConfig.url:
+            url = crawlerRunConfig.url
+            # Only set cookie for http/https URLs
+            if url.startswith(("http://", "https://")):
+                cookie_url = url
+            elif crawlerRunConfig.base_url and crawlerRunConfig.base_url.startswith(("http://", "https://")):
+                # Use base_url as fallback for raw:/file:// URLs
+                cookie_url = crawlerRunConfig.base_url
+
+        if cookie_url:
            await context.add_cookies(
                [
                    {
                        "name": "cookiesEnabled",
                        "value": "true",
-                    "url": crawlerRunConfig.url
-                    if crawlerRunConfig and crawlerRunConfig.url
-                    else "https://crawl4ai.com/",
+                        "url": cookie_url,
                    }
                ]
            )
@@ -862,6 +926,11 @@ class BrowserManager:
            ):
                await context.add_init_script(load_js_script("navigator_overrider"))

+        # Apply custom init_scripts from BrowserConfig (for stealth evasions, etc.)
+        if self.config.init_scripts:
+            for script in self.config.init_scripts:
+                await context.add_init_script(script)
+
    async def create_browser_context(self, crawlerRunConfig: CrawlerRunConfig = None):
        """
        Creates and returns a new browser context with configured settings.
@@ -1042,6 +1111,62 @@ class BrowserManager:
                        params={"error": str(e)}
                    )

+    async def _get_page_by_target_id(self, context: BrowserContext, target_id: str):
+        """
+        Get an existing page by its CDP target ID.
+
+        This is used when connecting to a pre-created browser context with an existing page.
+        Playwright may not immediately see targets created via raw CDP commands, so we
+        use CDP to get all targets and find the matching one.
+
+        Args:
+            context: The browser context to search in
+            target_id: The CDP target ID to find
+
+        Returns:
+            Page object if found, None otherwise
+        """
+        try:
+            # First check if Playwright already sees the page
+            for page in context.pages:
+                # Playwright's internal target ID might match
+                if hasattr(page, '_impl_obj') and hasattr(page._impl_obj, '_target_id'):
+                    if page._impl_obj._target_id == target_id:
+                        return page
+
+            # If not found, try using CDP to get targets
+            if hasattr(self.browser, '_impl_obj') and hasattr(self.browser._impl_obj, '_connection'):
+                cdp_session = await context.new_cdp_session(context.pages[0] if context.pages else None)
+                if cdp_session:
+                    try:
+                        result = await cdp_session.send("Target.getTargets")
+                        targets = result.get("targetInfos", [])
+                        for target in targets:
+                            if target.get("targetId") == target_id:
+                                # Found the target - if it's a page type, we can use it
+                                if target.get("type") == "page":
+                                    # The page exists, let Playwright discover it
+                                    await asyncio.sleep(0.1)
+                                    # Refresh pages list
+                                    if context.pages:
+                                        return context.pages[0]
+                    finally:
+                        await cdp_session.detach()
+
+            # Fallback: if there are any pages now, return the first one
+            if context.pages:
+                return context.pages[0]
+
+            return None
+        except Exception as e:
+            if self.logger:
+                self.logger.warning(
+                    message="Failed to get page by target ID: {error}",
+                    tag="BROWSER",
+                    params={"error": str(e)}
+                )
+            return None
+
    async def get_page(self, crawlerRunConfig: CrawlerRunConfig):
        """
        Get a page for the given session ID, creating a new one if needed.
@@ -1063,7 +1188,25 @@ class BrowserManager:

        # If using a managed browser, just grab the shared default_context
        if self.config.use_managed_browser:
-            if self.config.storage_state:
+            # If create_isolated_context is True, create isolated contexts for concurrent crawls
+            # Uses the same caching mechanism as non-CDP mode: cache context by config signature,
+            # but always create a new page. This prevents navigation conflicts while allowing
+            # context reuse for multiple URLs with the same config (e.g., batch/deep crawls).
+            if self.config.create_isolated_context:
+                config_signature = self._make_config_signature(crawlerRunConfig)
+
+                async with self._contexts_lock:
+                    if config_signature in self.contexts_by_config:
+                        context = self.contexts_by_config[config_signature]
+                    else:
+                        context = await self.create_browser_context(crawlerRunConfig)
+                        await self.setup_context(context, crawlerRunConfig)
+                        self.contexts_by_config[config_signature] = context
+
+                # Always create a new page for each crawl (isolation for navigation)
+                page = await context.new_page()
+                await self._apply_stealth_to_page(page)
+            elif self.config.storage_state:
                context = await self.create_browser_context(crawlerRunConfig)
                ctx = self.default_context        # default context, one window only
                ctx = await clone_runtime_state(context, ctx, crawlerRunConfig, self.config)
@@ -1086,6 +1229,14 @@ class BrowserManager:
                            pages = context.pages
                            if pages:
                                page = pages[0]
+                            elif self.config.browser_context_id and self.config.target_id:
+                                # Pre-existing context/target provided - use CDP to get the page
+                                # This handles the case where Playwright doesn't see the target yet
+                                page = await self._get_page_by_target_id(context, self.config.target_id)
+                                if not page:
+                                    # Fallback: create new page in existing context
+                                    page = await context.new_page()
+                                    await self._apply_stealth_to_page(page)
                            else:
                                page = await context.new_page()
                                await self._apply_stealth_to_page(page)
@@ -1140,6 +1291,42 @@ class BrowserManager:
    async def close(self):
        """Close all browser resources and clean up."""
        if self.config.cdp_url:
+            # When using external CDP, we don't own the browser process.
+            # If cdp_cleanup_on_close is True, properly disconnect from the browser
+            # and clean up Playwright resources. This frees the browser for other clients.
+            if self.config.cdp_cleanup_on_close:
+                # First close all sessions (pages)
+                session_ids = list(self.sessions.keys())
+                for session_id in session_ids:
+                    await self.kill_session(session_id)
+
+                # Close all contexts we created
+                for ctx in self.contexts_by_config.values():
+                    try:
+                        await ctx.close()
+                    except Exception:
+                        pass
+                self.contexts_by_config.clear()
+
+                # Disconnect from browser (doesn't terminate it, just releases connection)
+                if self.browser:
+                    try:
+                        await self.browser.close()
+                    except Exception as e:
+                        if self.logger:
+                            self.logger.debug(
+                                message="Error disconnecting from CDP browser: {error}",
+                                tag="BROWSER",
+                                params={"error": str(e)}
+                            )
+                    self.browser = None
+                    # Allow time for CDP connection to fully release before another client connects
+                    await asyncio.sleep(1.0)
+
+                # Stop Playwright instance to prevent memory leaks
+                if self.playwright:
+                    await self.playwright.stop()
+                    self.playwright = None
            return

        if self.config.sleep_on_close:
--- a/crawl4ai/cache_validator.py
+++ b/crawl4ai/cache_validator.py
@@ -0,0 +1,270 @@
+"""
+Cache validation using HTTP conditional requests and head fingerprinting.
+
+Uses httpx for fast, lightweight HTTP requests (no browser needed).
+This module enables smart cache validation to avoid unnecessary full browser crawls
+when content hasn't changed.
+
+Validation Strategy:
+1. Send HEAD request with If-None-Match / If-Modified-Since headers
+2. If server returns 304 Not Modified → cache is FRESH
+3. If server returns 200 → fetch <head> and compare fingerprint
+4. If fingerprint matches → cache is FRESH (minor changes only)
+5. Otherwise → cache is STALE, need full recrawl
+"""
+
+import httpx
+from dataclasses import dataclass
+from typing import Optional, Tuple
+from enum import Enum
+
+from .utils import compute_head_fingerprint
+
+
+class CacheValidationResult(Enum):
+    """Result of cache validation check."""
+    FRESH = "fresh"       # Content unchanged, use cache
+    STALE = "stale"       # Content changed, need recrawl
+    UNKNOWN = "unknown"   # Couldn't determine, need recrawl
+    ERROR = "error"       # Request failed, use cache as fallback
+
+
+@dataclass
+class ValidationResult:
+    """Detailed result of a cache validation attempt."""
+    status: CacheValidationResult
+    new_etag: Optional[str] = None
+    new_last_modified: Optional[str] = None
+    new_head_fingerprint: Optional[str] = None
+    reason: str = ""
+
+
+class CacheValidator:
+    """
+    Validates cache freshness using lightweight HTTP requests.
+
+    This validator uses httpx to make fast HTTP requests without needing
+    a full browser. It supports two validation methods:
+
+    1. HTTP Conditional Requests (Layer 3):
+       - Uses If-None-Match with stored ETag
+       - Uses If-Modified-Since with stored Last-Modified
+       - Server returns 304 if content unchanged
+
+    2. Head Fingerprinting (Layer 4):
+       - Fetches only the <head> section (~5KB)
+       - Compares fingerprint of key meta tags
+       - Catches changes even without server support for conditional requests
+    """
+
+    def __init__(self, timeout: float = 10.0, user_agent: Optional[str] = None):
+        """
+        Initialize the cache validator.
+
+        Args:
+            timeout: Request timeout in seconds
+            user_agent: Custom User-Agent string (optional)
+        """
+        self.timeout = timeout
+        self.user_agent = user_agent or "Mozilla/5.0 (compatible; Crawl4AI/1.0)"
+        self._client: Optional[httpx.AsyncClient] = None
+
+    async def _get_client(self) -> httpx.AsyncClient:
+        """Get or create the httpx client."""
+        if self._client is None:
+            self._client = httpx.AsyncClient(
+                http2=True,
+                timeout=self.timeout,
+                follow_redirects=True,
+                headers={"User-Agent": self.user_agent}
+            )
+        return self._client
+
+    async def validate(
+        self,
+        url: str,
+        stored_etag: Optional[str] = None,
+        stored_last_modified: Optional[str] = None,
+        stored_head_fingerprint: Optional[str] = None,
+    ) -> ValidationResult:
+        """
+        Validate if cached content is still fresh.
+
+        Args:
+            url: The URL to validate
+            stored_etag: Previously stored ETag header value
+            stored_last_modified: Previously stored Last-Modified header value
+            stored_head_fingerprint: Previously computed head fingerprint
+
+        Returns:
+            ValidationResult with status and any updated metadata
+        """
+        client = await self._get_client()
+
+        # Build conditional request headers
+        headers = {}
+        if stored_etag:
+            headers["If-None-Match"] = stored_etag
+        if stored_last_modified:
+            headers["If-Modified-Since"] = stored_last_modified
+
+        try:
+            # Step 1: Try HEAD request with conditional headers
+            if headers:
+                response = await client.head(url, headers=headers)
+
+                if response.status_code == 304:
+                    return ValidationResult(
+                        status=CacheValidationResult.FRESH,
+                        reason="Server returned 304 Not Modified"
+                    )
+
+                # Got 200, extract new headers for potential update
+                new_etag = response.headers.get("etag")
+                new_last_modified = response.headers.get("last-modified")
+
+                # If we have fingerprint, compare it
+                if stored_head_fingerprint:
+                    head_html, _, _ = await self._fetch_head(url)
+                    if head_html:
+                        new_fingerprint = compute_head_fingerprint(head_html)
+                        if new_fingerprint and new_fingerprint == stored_head_fingerprint:
+                            return ValidationResult(
+                                status=CacheValidationResult.FRESH,
+                                new_etag=new_etag,
+                                new_last_modified=new_last_modified,
+                                new_head_fingerprint=new_fingerprint,
+                                reason="Head fingerprint matches"
+                            )
+                        elif new_fingerprint:
+                            return ValidationResult(
+                                status=CacheValidationResult.STALE,
+                                new_etag=new_etag,
+                                new_last_modified=new_last_modified,
+                                new_head_fingerprint=new_fingerprint,
+                                reason="Head fingerprint changed"
+                            )
+
+                # Headers changed and no fingerprint match
+                return ValidationResult(
+                    status=CacheValidationResult.STALE,
+                    new_etag=new_etag,
+                    new_last_modified=new_last_modified,
+                    reason="Server returned 200, content may have changed"
+                )
+
+            # Step 2: No conditional headers available, try fingerprint only
+            if stored_head_fingerprint:
+                head_html, new_etag, new_last_modified = await self._fetch_head(url)
+
+                if head_html:
+                    new_fingerprint = compute_head_fingerprint(head_html)
+
+                    if new_fingerprint and new_fingerprint == stored_head_fingerprint:
+                        return ValidationResult(
+                            status=CacheValidationResult.FRESH,
+                            new_etag=new_etag,
+                            new_last_modified=new_last_modified,
+                            new_head_fingerprint=new_fingerprint,
+                            reason="Head fingerprint matches"
+                        )
+                    elif new_fingerprint:
+                        return ValidationResult(
+                            status=CacheValidationResult.STALE,
+                            new_etag=new_etag,
+                            new_last_modified=new_last_modified,
+                            new_head_fingerprint=new_fingerprint,
+                            reason="Head fingerprint changed"
+                        )
+
+            # Step 3: No validation data available
+            return ValidationResult(
+                status=CacheValidationResult.UNKNOWN,
+                reason="No validation data available (no etag, last-modified, or fingerprint)"
+            )
+
+        except httpx.TimeoutException:
+            return ValidationResult(
+                status=CacheValidationResult.ERROR,
+                reason="Validation request timed out"
+            )
+        except httpx.RequestError as e:
+            return ValidationResult(
+                status=CacheValidationResult.ERROR,
+                reason=f"Validation request failed: {type(e).__name__}"
+            )
+        except Exception as e:
+            # On unexpected error, prefer using cache over failing
+            return ValidationResult(
+                status=CacheValidationResult.ERROR,
+                reason=f"Validation error: {str(e)}"
+            )
+
+    async def _fetch_head(self, url: str) -> Tuple[Optional[str], Optional[str], Optional[str]]:
+        """
+        Fetch only the <head> section of a page.
+
+        Uses streaming to stop reading after </head> is found,
+        minimizing bandwidth usage.
+
+        Args:
+            url: The URL to fetch
+
+        Returns:
+            Tuple of (head_html, etag, last_modified)
+        """
+        client = await self._get_client()
+
+        try:
+            async with client.stream(
+                "GET",
+                url,
+                headers={"Accept-Encoding": "identity"}  # Disable compression for easier parsing
+            ) as response:
+                etag = response.headers.get("etag")
+                last_modified = response.headers.get("last-modified")
+
+                if response.status_code != 200:
+                    return None, etag, last_modified
+
+                # Read until </head> or max 64KB
+                chunks = []
+                total_bytes = 0
+                max_bytes = 65536
+
+                async for chunk in response.aiter_bytes(4096):
+                    chunks.append(chunk)
+                    total_bytes += len(chunk)
+
+                    content = b''.join(chunks)
+                    # Check for </head> (case insensitive)
+                    if b'</head>' in content.lower() or b'</HEAD>' in content:
+                        break
+                    if total_bytes >= max_bytes:
+                        break
+
+                html = content.decode('utf-8', errors='replace')
+
+                # Extract just the head section
+                head_end = html.lower().find('</head>')
+                if head_end != -1:
+                    html = html[:head_end + 7]
+
+                return html, etag, last_modified
+
+        except Exception:
+            return None, None, None
+
+    async def close(self):
+        """Close the HTTP client and release resources."""
+        if self._client:
+            await self._client.aclose()
+            self._client = None
+
+    async def __aenter__(self):
+        """Async context manager entry."""
+        return self
+
+    async def __aexit__(self, exc_type, exc_val, exc_tb):
+        """Async context manager exit."""
+        await self.close()
--- a/crawl4ai/deep_crawling/bff_strategy.py
+++ b/crawl4ai/deep_crawling/bff_strategy.py
@@ -2,7 +2,7 @@
 import asyncio
 import logging
 from datetime import datetime
-from typing import AsyncGenerator, Optional, Set, Dict, List, Tuple
+from typing import AsyncGenerator, Optional, Set, Dict, List, Tuple, Any, Callable, Awaitable
 from urllib.parse import urlparse

 from ..models import TraversalStats
@@ -41,6 +41,9 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
        include_external: bool = False,
        max_pages: int = infinity,
        logger: Optional[logging.Logger] = None,
+        # Optional resume/callback parameters for crash recovery
+        resume_state: Optional[Dict[str, Any]] = None,
+        on_state_change: Optional[Callable[[Dict[str, Any]], Awaitable[None]]] = None,
    ):
        self.max_depth = max_depth
        self.filter_chain = filter_chain
@@ -57,6 +60,12 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
        self.stats = TraversalStats(start_time=datetime.now())
        self._cancel_event = asyncio.Event()
        self._pages_crawled = 0
+        # Store for use in arun methods
+        self._resume_state = resume_state
+        self._on_state_change = on_state_change
+        self._last_state: Optional[Dict[str, Any]] = None
+        # Shadow list for queue items (only used when on_state_change is set)
+        self._queue_shadow: Optional[List[Tuple[float, int, str, Optional[str]]]] = None

    async def can_process_url(self, url: str, depth: int) -> bool:
        """
@@ -140,11 +149,31 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
        are treated as higher priority. URLs are processed in batches for efficiency.
        """
        queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
-        # Push the initial URL with score 0 and depth 0.
+
+        # Conditional state initialization for resume support
+        if self._resume_state:
+            visited = set(self._resume_state.get("visited", []))
+            depths = dict(self._resume_state.get("depths", {}))
+            self._pages_crawled = self._resume_state.get("pages_crawled", 0)
+            # Restore queue from saved items
+            queue_items = self._resume_state.get("queue_items", [])
+            for item in queue_items:
+                await queue.put((item["score"], item["depth"], item["url"], item["parent_url"]))
+            # Initialize shadow list if callback is set
+            if self._on_state_change:
+                self._queue_shadow = [
+                    (item["score"], item["depth"], item["url"], item["parent_url"])
+                    for item in queue_items
+                ]
+        else:
+            # Original initialization
            initial_score = self.url_scorer.score(start_url) if self.url_scorer else 0
            await queue.put((-initial_score, 0, start_url, None))
            visited: Set[str] = set()
            depths: Dict[str, int] = {start_url: 0}
+            # Initialize shadow list if callback is set
+            if self._on_state_change:
+                self._queue_shadow = [(-initial_score, 0, start_url, None)]

        while not queue.empty() and not self._cancel_event.is_set():
            # Stop if we've reached the max pages limit
@@ -166,6 +195,12 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
                if queue.empty():
                    break
                item = await queue.get()
+                # Remove from shadow list if tracking
+                if self._on_state_change and self._queue_shadow is not None:
+                    try:
+                        self._queue_shadow.remove(item)
+                    except ValueError:
+                        pass  # Item may have been removed already
                score, depth, url, parent_url = item
                if url in visited:
                    continue
@@ -210,7 +245,26 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
                    for new_url, new_parent in new_links:
                        new_depth = depths.get(new_url, depth + 1)
                        new_score = self.url_scorer.score(new_url) if self.url_scorer else 0
-                        await queue.put((-new_score, new_depth, new_url, new_parent))
+                        queue_item = (-new_score, new_depth, new_url, new_parent)
+                        await queue.put(queue_item)
+                        # Add to shadow list if tracking
+                        if self._on_state_change and self._queue_shadow is not None:
+                            self._queue_shadow.append(queue_item)
+
+                    # Capture state after EACH URL processed (if callback set)
+                    if self._on_state_change and self._queue_shadow is not None:
+                        state = {
+                            "strategy_type": "best_first",
+                            "visited": list(visited),
+                            "queue_items": [
+                                {"score": s, "depth": d, "url": u, "parent_url": p}
+                                for s, d, u, p in self._queue_shadow
+                            ],
+                            "depths": depths,
+                            "pages_crawled": self._pages_crawled,
+                        }
+                        self._last_state = state
+                        await self._on_state_change(state)

        # End of crawl.

@@ -269,3 +323,15 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
        """
        self._cancel_event.set()
        self.stats.end_time = datetime.now()
+
+    def export_state(self) -> Optional[Dict[str, Any]]:
+        """
+        Export current crawl state for external persistence.
+
+        Note: This returns the last captured state. For real-time state,
+        use the on_state_change callback.
+
+        Returns:
+            Dict with strategy state, or None if no state captured yet.
+        """
+        return self._last_state
--- a/crawl4ai/deep_crawling/bfs_strategy.py
+++ b/crawl4ai/deep_crawling/bfs_strategy.py
@@ -2,7 +2,7 @@
 import asyncio
 import logging
 from datetime import datetime
-from typing import AsyncGenerator, Optional, Set, Dict, List, Tuple
+from typing import AsyncGenerator, Optional, Set, Dict, List, Tuple, Any, Callable, Awaitable
 from urllib.parse import urlparse

 from ..models import TraversalStats
@@ -31,6 +31,9 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
        score_threshold: float = -infinity,
        max_pages: int = infinity,
        logger: Optional[logging.Logger] = None,
+        # Optional resume/callback parameters for crash recovery
+        resume_state: Optional[Dict[str, Any]] = None,
+        on_state_change: Optional[Callable[[Dict[str, Any]], Awaitable[None]]] = None,
    ):
        self.max_depth = max_depth
        self.filter_chain = filter_chain
@@ -48,6 +51,10 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
        self.stats = TraversalStats(start_time=datetime.now())
        self._cancel_event = asyncio.Event()
        self._pages_crawled = 0
+        # Store for use in arun methods
+        self._resume_state = resume_state
+        self._on_state_change = on_state_change
+        self._last_state: Optional[Dict[str, Any]] = None

    async def can_process_url(self, url: str, depth: int) -> bool:
        """
@@ -155,6 +162,17 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
        Batch (non-streaming) mode:
        Processes one BFS level at a time, then yields all the results.
        """
+        # Conditional state initialization for resume support
+        if self._resume_state:
+            visited = set(self._resume_state.get("visited", []))
+            current_level = [
+                (item["url"], item["parent_url"])
+                for item in self._resume_state.get("pending", [])
+            ]
+            depths = dict(self._resume_state.get("depths", {}))
+            self._pages_crawled = self._resume_state.get("pages_crawled", 0)
+        else:
+            # Original initialization
            visited: Set[str] = set()
            # current_level holds tuples: (url, parent_url)
            current_level: List[Tuple[str, Optional[str]]] = [(start_url, None)]
@@ -175,10 +193,6 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
            batch_config = config.clone(deep_crawl_strategy=None, stream=False)
            batch_results = await crawler.arun_many(urls=urls, config=batch_config)

-            # Update pages crawled counter - count only successful crawls
-            successful_results = [r for r in batch_results if r.success]
-            self._pages_crawled += len(successful_results)
-            
            for result in batch_results:
                url = result.url
                depth = depths.get(url, 0)
@@ -190,9 +204,24 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):

                # Only discover links from successful crawls
                if result.success:
+                    # Increment pages crawled per URL for accurate state tracking
+                    self._pages_crawled += 1
+
                    # Link discovery will handle the max pages limit internally
                    await self.link_discovery(result, url, depth, visited, next_level, depths)

+                    # Capture state after EACH URL processed (if callback set)
+                    if self._on_state_change:
+                        state = {
+                            "strategy_type": "bfs",
+                            "visited": list(visited),
+                            "pending": [{"url": u, "parent_url": p} for u, p in next_level],
+                            "depths": depths,
+                            "pages_crawled": self._pages_crawled,
+                        }
+                        self._last_state = state
+                        await self._on_state_change(state)
+
            current_level = next_level

        return results
@@ -207,6 +236,17 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
        Streaming mode:
        Processes one BFS level at a time and yields results immediately as they arrive.
        """
+        # Conditional state initialization for resume support
+        if self._resume_state:
+            visited = set(self._resume_state.get("visited", []))
+            current_level = [
+                (item["url"], item["parent_url"])
+                for item in self._resume_state.get("pending", [])
+            ]
+            depths = dict(self._resume_state.get("depths", {}))
+            self._pages_crawled = self._resume_state.get("pages_crawled", 0)
+        else:
+            # Original initialization
            visited: Set[str] = set()
            current_level: List[Tuple[str, Optional[str]]] = [(start_url, None)]
            depths: Dict[str, int] = {start_url: 0}
@@ -245,6 +285,18 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
                    # Link discovery will handle the max pages limit internally
                    await self.link_discovery(result, url, depth, visited, next_level, depths)

+                    # Capture state after EACH URL processed (if callback set)
+                    if self._on_state_change:
+                        state = {
+                            "strategy_type": "bfs",
+                            "visited": list(visited),
+                            "pending": [{"url": u, "parent_url": p} for u, p in next_level],
+                            "depths": depths,
+                            "pages_crawled": self._pages_crawled,
+                        }
+                        self._last_state = state
+                        await self._on_state_change(state)
+
            # If we didn't get results back (e.g. due to errors), avoid getting stuck in an infinite loop
            # by considering these URLs as visited but not counting them toward the max_pages limit
            if results_count == 0 and urls:
@@ -258,3 +310,15 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
        """
        self._cancel_event.set()
        self.stats.end_time = datetime.now()
+
+    def export_state(self) -> Optional[Dict[str, Any]]:
+        """
+        Export current crawl state for external persistence.
+
+        Note: This returns the last captured state. For real-time state,
+        use the on_state_change callback.
+
+        Returns:
+            Dict with strategy state, or None if no state captured yet.
+        """
+        return self._last_state
--- a/crawl4ai/deep_crawling/dfs_strategy.py
+++ b/crawl4ai/deep_crawling/dfs_strategy.py
@@ -38,6 +38,19 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
        in control of traversal. Every successful page bumps ``_pages_crawled`` and
        seeds new stack items discovered via :meth:`link_discovery`.
        """
+        # Conditional state initialization for resume support
+        if self._resume_state:
+            visited = set(self._resume_state.get("visited", []))
+            stack = [
+                (item["url"], item["parent_url"], item["depth"])
+                for item in self._resume_state.get("stack", [])
+            ]
+            depths = dict(self._resume_state.get("depths", {}))
+            self._pages_crawled = self._resume_state.get("pages_crawled", 0)
+            self._dfs_seen = set(self._resume_state.get("dfs_seen", []))
+            results: List[CrawlResult] = []
+        else:
+            # Original initialization
            visited: Set[str] = set()
            # Stack items: (url, parent_url, depth)
            stack: List[Tuple[str, Optional[str], int]] = [(start_url, None, 0)]
@@ -79,6 +92,22 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
                    for new_url, new_parent in reversed(new_links):
                        new_depth = depths.get(new_url, depth + 1)
                        stack.append((new_url, new_parent, new_depth))
+
+                    # Capture state after each URL processed (if callback set)
+                    if self._on_state_change:
+                        state = {
+                            "strategy_type": "dfs",
+                            "visited": list(visited),
+                            "stack": [
+                                {"url": u, "parent_url": p, "depth": d}
+                                for u, p, d in stack
+                            ],
+                            "depths": depths,
+                            "pages_crawled": self._pages_crawled,
+                            "dfs_seen": list(self._dfs_seen),
+                        }
+                        self._last_state = state
+                        await self._on_state_change(state)
        return results

    async def _arun_stream(
@@ -94,6 +123,18 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
        yielded before we even look at the next stack entry. Successful crawls
        still feed :meth:`link_discovery`, keeping DFS order intact.
        """
+        # Conditional state initialization for resume support
+        if self._resume_state:
+            visited = set(self._resume_state.get("visited", []))
+            stack = [
+                (item["url"], item["parent_url"], item["depth"])
+                for item in self._resume_state.get("stack", [])
+            ]
+            depths = dict(self._resume_state.get("depths", {}))
+            self._pages_crawled = self._resume_state.get("pages_crawled", 0)
+            self._dfs_seen = set(self._resume_state.get("dfs_seen", []))
+        else:
+            # Original initialization
            visited: Set[str] = set()
            stack: List[Tuple[str, Optional[str], int]] = [(start_url, None, 0)]
            depths: Dict[str, int] = {start_url: 0}
@@ -130,6 +171,22 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
                        new_depth = depths.get(new_url, depth + 1)
                        stack.append((new_url, new_parent, new_depth))

+                    # Capture state after each URL processed (if callback set)
+                    if self._on_state_change:
+                        state = {
+                            "strategy_type": "dfs",
+                            "visited": list(visited),
+                            "stack": [
+                                {"url": u, "parent_url": p, "depth": d}
+                                for u, p, d in stack
+                            ],
+                            "depths": depths,
+                            "pages_crawled": self._pages_crawled,
+                            "dfs_seen": list(self._dfs_seen),
+                        }
+                        self._last_state = state
+                        await self._on_state_change(state)
+
    async def link_discovery(
        self,
        result: CrawlResult,
--- a/crawl4ai/extraction_strategy.py
+++ b/crawl4ai/extraction_strategy.py
@@ -1277,44 +1277,18 @@ class JsonElementExtractionStrategy(ExtractionStrategy):
    }

    @staticmethod
-    def generate_schema(
-        html: str,
-        schema_type: str = "CSS", # or XPATH
-        query: str = None,
-        target_json_example: str = None,
-        llm_config: 'LLMConfig' = create_llm_config(),
-        provider: str = None,
-        api_token: str = None,
-        **kwargs
-    ) -> dict:
+    def _build_schema_prompt(html: str, schema_type: str, query: str = None, target_json_example: str = None) -> str:
        """
-        Generate extraction schema from HTML content and optional query.
-        
-        Args:
-            html (str): The HTML content to analyze
-            query (str, optional): Natural language description of what data to extract
-            provider (str): Legacy Parameter. LLM provider to use 
-            api_token (str): Legacy Parameter. API token for LLM provider
-            llm_config (LLMConfig): LLM configuration object
-            prompt (str, optional): Custom prompt template to use
-            **kwargs: Additional args passed to LLM processor
+        Build the prompt for schema generation. Shared by sync and async methods.

        Returns:
-            dict: Generated schema following the JsonElementExtractionStrategy format
+            str: Combined system and user prompt
        """
        from .prompts import JSON_SCHEMA_BUILDER
-        from .utils import perform_completion_with_backoff
-        for name, message in JsonElementExtractionStrategy._GENERATE_SCHEMA_UNWANTED_PROPS.items():
-            if locals()[name] is not None:
-                raise AttributeError(f"Setting '{name}' is deprecated. {message}")

-        # Use default or custom prompt
        prompt_template = JSON_SCHEMA_BUILDER if schema_type == "CSS" else JSON_SCHEMA_BUILDER_XPATH

-        # Build the prompt
-        system_message = {
-            "role": "system", 
-            "content": f"""You specialize in generating special JSON schemas for web scraping. This schema uses CSS or XPATH selectors to present a repetitive pattern in crawled HTML, such as a product in a product list or a search result item in a list of search results. We use this JSON schema to pass to a language model along with the HTML content to extract structured data from the HTML. The language model uses the JSON schema to extract data from the HTML and retrieve values for fields in the JSON schema, following the schema.
+        system_content = f"""You specialize in generating special JSON schemas for web scraping. This schema uses CSS or XPATH selectors to present a repetitive pattern in crawled HTML, such as a product in a product list or a search result item in a list of search results. We use this JSON schema to pass to a language model along with the HTML content to extract structured data from the HTML. The language model uses the JSON schema to extract data from the HTML and retrieve values for fields in the JSON schema, following the schema.

 Generating this HTML manually is not feasible, so you need to generate the JSON schema using the HTML content. The HTML copied from the crawled website is provided below, which we believe contains the repetitive pattern.

@@ -1335,31 +1309,27 @@ In this scenario, use your best judgment to generate the schema. You need to exa

 # What are the instructions and details for this schema generation?
 {prompt_template}"""
-        }

-        user_message = {
-            "role": "user",
-            "content": f"""
+        user_content = f"""
                HTML to analyze:
                ```html
                {html}
                ```
                """
-        }

        if query:
-            user_message["content"] += f"\n\n## Query or explanation of target/goal data item:\n{query}"
+            user_content += f"\n\n## Query or explanation of target/goal data item:\n{query}"
        if target_json_example:
-            user_message["content"] += f"\n\n## Example of target JSON object:\n```json\n{target_json_example}\n```"
+            user_content += f"\n\n## Example of target JSON object:\n```json\n{target_json_example}\n```"

        if query and not target_json_example:
-            user_message["content"] += """IMPORTANT: To remind you, in this process, we are not providing a rigid example of the adjacent objects we seek. We rely on your understanding of the explanation provided in the above section. Make sure to grasp what we are looking for and, based on that, create the best schema.."""
+            user_content += """IMPORTANT: To remind you, in this process, we are not providing a rigid example of the adjacent objects we seek. We rely on your understanding of the explanation provided in the above section. Make sure to grasp what we are looking for and, based on that, create the best schema.."""
        elif not query and target_json_example:
-            user_message["content"] += """IMPORTANT: Please remember that in this process, we provided a proper example of a target JSON object. Make sure to adhere to the structure and create a schema that exactly fits this example. If you find that some elements on the page do not match completely, vote for the majority."""
+            user_content += """IMPORTANT: Please remember that in this process, we provided a proper example of a target JSON object. Make sure to adhere to the structure and create a schema that exactly fits this example. If you find that some elements on the page do not match completely, vote for the majority."""
        elif not query and not target_json_example:
-            user_message["content"] += """IMPORTANT: Since we neither have a query nor an example, it is crucial to rely solely on the HTML content provided. Leverage your expertise to determine the schema based on the repetitive patterns observed in the content."""
+            user_content += """IMPORTANT: Since we neither have a query nor an example, it is crucial to rely solely on the HTML content provided. Leverage your expertise to determine the schema based on the repetitive patterns observed in the content."""

-        user_message["content"] += """IMPORTANT: 
+        user_content += """IMPORTANT:
        0/ Ensure your schema remains reliable by avoiding selectors that appear to generate dynamically and are not dependable. You want a reliable schema, as it consistently returns the same data even after many page reloads.
        1/ DO NOT USE use base64 kind of classes, they are temporary and not reliable.
        2/ Every selector must refer to only one unique element. You should ensure your selector points to a single element and is unique to the place that contains the information. You have to use available techniques based on CSS or XPATH requested schema to make sure your selector is unique and also not fragile, meaning if we reload the page now or in the future, the selector should remain reliable.
@@ -1368,20 +1338,98 @@ In this scenario, use your best judgment to generate the schema. You need to exa
        Analyze the HTML and generate a JSON schema that follows the specified format. Only output valid JSON schema, nothing else.
        """

+        return "\n\n".join([system_content, user_content])
+
+    @staticmethod
+    def generate_schema(
+        html: str,
+        schema_type: str = "CSS",
+        query: str = None,
+        target_json_example: str = None,
+        llm_config: 'LLMConfig' = create_llm_config(),
+        provider: str = None,
+        api_token: str = None,
+        **kwargs
+    ) -> dict:
+        """
+        Generate extraction schema from HTML content and optional query (sync version).
+
+        Args:
+            html (str): The HTML content to analyze
+            query (str, optional): Natural language description of what data to extract
+            provider (str): Legacy Parameter. LLM provider to use
+            api_token (str): Legacy Parameter. API token for LLM provider
+            llm_config (LLMConfig): LLM configuration object
+            **kwargs: Additional args passed to LLM processor
+
+        Returns:
+            dict: Generated schema following the JsonElementExtractionStrategy format
+        """
+        from .utils import perform_completion_with_backoff
+
+        for name, message in JsonElementExtractionStrategy._GENERATE_SCHEMA_UNWANTED_PROPS.items():
+            if locals()[name] is not None:
+                raise AttributeError(f"Setting '{name}' is deprecated. {message}")
+
+        prompt = JsonElementExtractionStrategy._build_schema_prompt(html, schema_type, query, target_json_example)
+
        try:
-            # Call LLM with backoff handling
            response = perform_completion_with_backoff(
                provider=llm_config.provider,
-                prompt_with_variables="\n\n".join([system_message["content"], user_message["content"]]),
-                json_response = True,                
+                prompt_with_variables=prompt,
+                json_response=True,
                api_token=llm_config.api_token,
                base_url=llm_config.base_url,
                extra_args=kwargs
            )
-            
-            # Extract and return schema
            return json.loads(response.choices[0].message.content)
+        except Exception as e:
+            raise Exception(f"Failed to generate schema: {str(e)}")

+    @staticmethod
+    async def agenerate_schema(
+        html: str,
+        schema_type: str = "CSS",
+        query: str = None,
+        target_json_example: str = None,
+        llm_config: 'LLMConfig' = None,
+        **kwargs
+    ) -> dict:
+        """
+        Generate extraction schema from HTML content (async version).
+
+        Use this method when calling from async contexts (e.g., FastAPI) to avoid
+        issues with certain LLM providers (e.g., Gemini/Vertex AI) that require
+        async execution.
+
+        Args:
+            html (str): The HTML content to analyze
+            schema_type (str): "CSS" or "XPATH"
+            query (str, optional): Natural language description of what data to extract
+            target_json_example (str, optional): Example of desired JSON output
+            llm_config (LLMConfig): LLM configuration object
+            **kwargs: Additional args passed to LLM processor
+
+        Returns:
+            dict: Generated schema following the JsonElementExtractionStrategy format
+        """
+        from .utils import aperform_completion_with_backoff
+
+        if llm_config is None:
+            llm_config = create_llm_config()
+
+        prompt = JsonElementExtractionStrategy._build_schema_prompt(html, schema_type, query, target_json_example)
+
+        try:
+            response = await aperform_completion_with_backoff(
+                provider=llm_config.provider,
+                prompt_with_variables=prompt,
+                json_response=True,
+                api_token=llm_config.api_token,
+                base_url=llm_config.base_url,
+                extra_args=kwargs
+            )
+            return json.loads(response.choices[0].message.content)
        except Exception as e:
            raise Exception(f"Failed to generate schema: {str(e)}")

--- a/crawl4ai/models.py
+++ b/crawl4ai/models.py
@@ -152,6 +152,10 @@ class CrawlResult(BaseModel):
    network_requests: Optional[List[Dict[str, Any]]] = None
    console_messages: Optional[List[Dict[str, Any]]] = None
    tables: List[Dict] = Field(default_factory=list)  # NEW – [{headers,rows,caption,summary}]
+    # Cache validation metadata (Smart Cache)
+    head_fingerprint: Optional[str] = None
+    cached_at: Optional[float] = None
+    cache_status: Optional[str] = None  # "hit", "hit_validated", "hit_fallback", "miss"

    model_config = ConfigDict(arbitrary_types_allowed=True)

--- a/crawl4ai/proxy_strategy.py
+++ b/crawl4ai/proxy_strategy.py
@@ -1,7 +1,9 @@
-from typing import List, Dict, Optional
+from typing import List, Dict, Optional, Tuple
 from abc import ABC, abstractmethod
 from itertools import cycle
 import os
+import asyncio
+import time


 ########### ATTENTION PEOPLE OF EARTH ###########
@@ -131,8 +133,67 @@ class ProxyRotationStrategy(ABC):
        """Add proxy configurations to the strategy"""
        pass

-class RoundRobinProxyStrategy:
-    """Simple round-robin proxy rotation strategy using ProxyConfig objects"""
+    @abstractmethod
+    async def get_proxy_for_session(
+        self,
+        session_id: str,
+        ttl: Optional[int] = None
+    ) -> Optional[ProxyConfig]:
+        """
+        Get or create a sticky proxy for a session.
+
+        If session_id already has an assigned proxy (and hasn't expired), return it.
+        If session_id is new, acquire a new proxy and associate it.
+
+        Args:
+            session_id: Unique session identifier
+            ttl: Optional time-to-live in seconds for this session
+
+        Returns:
+            ProxyConfig for this session
+        """
+        pass
+
+    @abstractmethod
+    async def release_session(self, session_id: str) -> None:
+        """
+        Release a sticky session, making the proxy available for reuse.
+
+        Args:
+            session_id: Session to release
+        """
+        pass
+
+    @abstractmethod
+    def get_session_proxy(self, session_id: str) -> Optional[ProxyConfig]:
+        """
+        Get the proxy for an existing session without creating new one.
+
+        Args:
+            session_id: Session to look up
+
+        Returns:
+            ProxyConfig if session exists and hasn't expired, None otherwise
+        """
+        pass
+
+    @abstractmethod
+    def get_active_sessions(self) -> Dict[str, ProxyConfig]:
+        """
+        Get all active sticky sessions.
+
+        Returns:
+            Dictionary mapping session_id to ProxyConfig
+        """
+        pass
+
+class RoundRobinProxyStrategy(ProxyRotationStrategy):
+    """Simple round-robin proxy rotation strategy using ProxyConfig objects.
+
+    Supports sticky sessions where a session_id can be bound to a specific proxy
+    for the duration of the session. This is useful for deep crawling where
+    you want to maintain the same IP address across multiple requests.
+    """

    def __init__(self, proxies: List[ProxyConfig] = None):
        """
@@ -141,8 +202,12 @@ class RoundRobinProxyStrategy:
        Args:
            proxies: List of ProxyConfig objects
        """
-        self._proxies = []
+        self._proxies: List[ProxyConfig] = []
        self._proxy_cycle = None
+        # Session tracking: maps session_id -> (ProxyConfig, created_at, ttl)
+        self._sessions: Dict[str, Tuple[ProxyConfig, float, Optional[int]]] = {}
+        self._session_lock = asyncio.Lock()
+
        if proxies:
            self.add_proxies(proxies)

@@ -156,3 +221,121 @@ class RoundRobinProxyStrategy:
        if not self._proxy_cycle:
            return None
        return next(self._proxy_cycle)
+
+    async def get_proxy_for_session(
+        self,
+        session_id: str,
+        ttl: Optional[int] = None
+    ) -> Optional[ProxyConfig]:
+        """
+        Get or create a sticky proxy for a session.
+
+        If session_id already has an assigned proxy (and hasn't expired), return it.
+        If session_id is new, acquire a new proxy and associate it.
+
+        Args:
+            session_id: Unique session identifier
+            ttl: Optional time-to-live in seconds for this session
+
+        Returns:
+            ProxyConfig for this session
+        """
+        async with self._session_lock:
+            # Check if session exists and hasn't expired
+            if session_id in self._sessions:
+                proxy, created_at, session_ttl = self._sessions[session_id]
+
+                # Check TTL expiration
+                effective_ttl = ttl if ttl is not None else session_ttl
+                if effective_ttl is not None:
+                    elapsed = time.time() - created_at
+                    if elapsed >= effective_ttl:
+                        # Session expired, remove it and get new proxy
+                        del self._sessions[session_id]
+                    else:
+                        return proxy
+                else:
+                    return proxy
+
+            # Acquire new proxy for this session
+            proxy = await self.get_next_proxy()
+            if proxy:
+                self._sessions[session_id] = (proxy, time.time(), ttl)
+
+            return proxy
+
+    async def release_session(self, session_id: str) -> None:
+        """
+        Release a sticky session, making the proxy available for reuse.
+
+        Args:
+            session_id: Session to release
+        """
+        async with self._session_lock:
+            if session_id in self._sessions:
+                del self._sessions[session_id]
+
+    def get_session_proxy(self, session_id: str) -> Optional[ProxyConfig]:
+        """
+        Get the proxy for an existing session without creating new one.
+
+        Args:
+            session_id: Session to look up
+
+        Returns:
+            ProxyConfig if session exists and hasn't expired, None otherwise
+        """
+        if session_id not in self._sessions:
+            return None
+
+        proxy, created_at, ttl = self._sessions[session_id]
+
+        # Check TTL expiration
+        if ttl is not None:
+            elapsed = time.time() - created_at
+            if elapsed >= ttl:
+                return None
+
+        return proxy
+
+    def get_active_sessions(self) -> Dict[str, ProxyConfig]:
+        """
+        Get all active sticky sessions (excluding expired ones).
+
+        Returns:
+            Dictionary mapping session_id to ProxyConfig
+        """
+        current_time = time.time()
+        active_sessions = {}
+
+        for session_id, (proxy, created_at, ttl) in self._sessions.items():
+            # Skip expired sessions
+            if ttl is not None:
+                elapsed = current_time - created_at
+                if elapsed >= ttl:
+                    continue
+            active_sessions[session_id] = proxy
+
+        return active_sessions
+
+    async def cleanup_expired_sessions(self) -> int:
+        """
+        Remove all expired sessions from tracking.
+
+        Returns:
+            Number of sessions removed
+        """
+        async with self._session_lock:
+            current_time = time.time()
+            expired = []
+
+            for session_id, (proxy, created_at, ttl) in self._sessions.items():
+                if ttl is not None:
+                    elapsed = current_time - created_at
+                    if elapsed >= ttl:
+                        expired.append(session_id)
+
+            for session_id in expired:
+                del self._sessions[session_id]
+
+            return len(expired)
--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
@@ -1775,6 +1775,8 @@ def perform_completion_with_backoff(

    from litellm import completion
    from litellm.exceptions import RateLimitError
+    import litellm
+    litellm.drop_params = True  # Auto-drop unsupported params (e.g., temperature for O-series/GPT-5)

    extra_args = {"temperature": 0.01, "api_key": api_token, "base_url": base_url}
    if json_response:
@@ -1864,7 +1866,9 @@ async def aperform_completion_with_backoff(

    from litellm import acompletion
    from litellm.exceptions import RateLimitError
+    import litellm
    import asyncio
+    litellm.drop_params = True  # Auto-drop unsupported params (e.g., temperature for O-series/GPT-5)

    extra_args = {"temperature": 0.01, "api_key": api_token, "base_url": base_url}
    if json_response:
@@ -2461,6 +2465,54 @@ def normalize_url_tmp(href, base_url):
    return href.strip()


+def quick_extract_links(html: str, base_url: str) -> Dict[str, List[Dict[str, str]]]:
+    """
+    Fast link extraction for prefetch mode.
+    Only extracts <a href> tags - no media, no cleaning, no heavy processing.
+
+    Args:
+        html: Raw HTML string
+        base_url: Base URL for resolving relative links
+
+    Returns:
+        {"internal": [{"href": "...", "text": "..."}], "external": [...]}
+    """
+    from lxml.html import document_fromstring
+
+    try:
+        doc = document_fromstring(html)
+    except Exception:
+        return {"internal": [], "external": []}
+
+    base_domain = get_base_domain(base_url)
+    internal: List[Dict[str, str]] = []
+    external: List[Dict[str, str]] = []
+    seen: Set[str] = set()
+
+    for a in doc.xpath("//a[@href]"):
+        href = a.get("href", "").strip()
+        if not href or href.startswith(("#", "javascript:", "mailto:", "tel:")):
+            continue
+
+        # Normalize URL
+        normalized = normalize_url_for_deep_crawl(href, base_url)
+        if not normalized or normalized in seen:
+            continue
+        seen.add(normalized)
+
+        # Extract text (truncated for memory efficiency)
+        text = (a.text_content() or "").strip()[:200]
+
+        link_data = {"href": normalized, "text": text}
+
+        if is_external_url(normalized, base_domain):
+            external.append(link_data)
+        else:
+            internal.append(link_data)
+
+    return {"internal": internal, "external": external}
+
+
 def get_base_domain(url: str) -> str:
    """
    Extract the base domain from a given URL, handling common edge cases.
@@ -2828,6 +2880,67 @@ def generate_content_hash(content: str) -> str:
    # return hashlib.sha256(content.encode()).hexdigest()


+def compute_head_fingerprint(head_html: str) -> str:
+    """
+    Compute a fingerprint of <head> content for cache validation.
+
+    Focuses on content that typically changes when page updates:
+    - <title>
+    - <meta name="description">
+    - <meta property="og:title|og:description|og:image|og:updated_time">
+    - <meta property="article:modified_time">
+    - <meta name="last-modified">
+
+    Uses xxhash for speed, combines multiple signals into a single hash.
+
+    Args:
+        head_html: The HTML content of the <head> section
+
+    Returns:
+        A hex string fingerprint, or empty string if no signals found
+    """
+    if not head_html:
+        return ""
+
+    head_lower = head_html.lower()
+    signals = []
+
+    # Extract title
+    title_match = re.search(r'<title[^>]*>(.*?)</title>', head_lower, re.DOTALL)
+    if title_match:
+        signals.append(title_match.group(1).strip())
+
+    # Meta tags to extract (name or property attribute, and the value to match)
+    meta_tags = [
+        ("name", "description"),
+        ("name", "last-modified"),
+        ("property", "og:title"),
+        ("property", "og:description"),
+        ("property", "og:image"),
+        ("property", "og:updated_time"),
+        ("property", "article:modified_time"),
+    ]
+
+    for attr_type, attr_value in meta_tags:
+        # Handle both attribute orders: attr="value" content="..." and content="..." attr="value"
+        patterns = [
+            rf'<meta[^>]*{attr_type}=["\']{ re.escape(attr_value)}["\'][^>]*content=["\']([^"\']*)["\']',
+            rf'<meta[^>]*content=["\']([^"\']*)["\'][^>]*{attr_type}=["\']{re.escape(attr_value)}["\']',
+        ]
+        for pattern in patterns:
+            match = re.search(pattern, head_lower)
+            if match:
+                signals.append(match.group(1).strip())
+                break  # Found this tag, move to next
+
+    if not signals:
+        return ""
+
+    # Combine signals and hash
+    combined = '|'.join(signals)
+    return xxhash.xxh64(combined.encode()).hexdigest()
+
+
 def ensure_content_dirs(base_path: str) -> Dict[str, str]:
    """Create content directories if they don't exist"""
    dirs = {
--- a/deploy/docker/README.md
+++ b/deploy/docker/README.md
@@ -59,13 +59,13 @@ Pull and run images directly from Docker Hub without building locally.

 #### 1. Pull the Image

-Our latest stable release is `0.7.7`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
+Our latest stable release is `0.8.0`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.

 ```bash
-# Pull the latest stable version (0.7.7)
-docker pull unclecode/crawl4ai:0.7.7
+# Pull the latest stable version (0.8.0)
+docker pull unclecode/crawl4ai:0.8.0

-# Or use the latest tag (points to 0.7.7)
+# Or use the latest tag (points to 0.8.0)
 docker pull unclecode/crawl4ai:latest
 ```

@@ -100,7 +100,7 @@ EOL
      -p 11235:11235 \
      --name crawl4ai \
      --shm-size=1g \
-      unclecode/crawl4ai:0.7.7
+      unclecode/crawl4ai:0.8.0
    ```

 *   **With LLM support:**
@@ -111,7 +111,7 @@ EOL
      --name crawl4ai \
      --env-file .llm.env \
      --shm-size=1g \
-      unclecode/crawl4ai:0.7.7
+      unclecode/crawl4ai:0.8.0
    ```

 > The server will be available at `http://localhost:11235`. Visit `/playground` to access the interactive testing interface.
@@ -184,7 +184,7 @@ The `docker-compose.yml` file in the project root provides a simplified approach
    ```bash
    # Pulls and runs the release candidate from Docker Hub
    # Automatically selects the correct architecture
-    IMAGE=unclecode/crawl4ai:0.7.7 docker compose up -d
+    IMAGE=unclecode/crawl4ai:0.8.0 docker compose up -d
    ```

 *   **Build and Run Locally:**
--- a/deploy/docker/config.yml
+++ b/deploy/docker/config.yml
@@ -37,6 +37,10 @@ rate_limiting:
  storage_uri: "memory://"  # Use "redis://localhost:6379" for production

 # Security Configuration
+# WARNING: For production deployments, enable security and use proper SECRET_KEY:
+#   - Set jwt_enabled: true for authentication
+#   - Set SECRET_KEY environment variable to a secure random value
+#   - Set CRAWL4AI_HOOKS_ENABLED=true only if you need hooks (RCE risk)
 security:
  enabled: false
  jwt_enabled: false
--- a/deploy/docker/hook_manager.py
+++ b/deploy/docker/hook_manager.py
@@ -117,18 +117,18 @@ class UserHookManager:
        """
        try:
            # Create a safe namespace for the hook
-            # Use a more complete builtins that includes __import__
+            # SECURITY: No __import__ to prevent arbitrary module imports (RCE risk)
            import builtins
            safe_builtins = {}

-            # Add safe built-in functions
+            # Add safe built-in functions (no __import__ for security)
            allowed_builtins = [
                'print', 'len', 'str', 'int', 'float', 'bool',
                'list', 'dict', 'set', 'tuple', 'range', 'enumerate',
                'zip', 'map', 'filter', 'any', 'all', 'sum', 'min', 'max',
                'sorted', 'reversed', 'abs', 'round', 'isinstance', 'type',
                'getattr', 'hasattr', 'setattr', 'callable', 'iter', 'next',
-                '__import__', '__build_class__'  # Required for exec
+                '__build_class__'  # Required for class definitions in exec
            ]
            
            for name in allowed_builtins:
--- a/deploy/docker/server.py
+++ b/deploy/docker/server.py
@@ -79,6 +79,10 @@ __version__ = "0.5.1-d1"
 MAX_PAGES = config["crawler"]["pool"].get("max_pages", 30)
 GLOBAL_SEM = asyncio.Semaphore(MAX_PAGES)

+# ── security feature flags ───────────────────────────────────
+# Hooks are disabled by default for security (RCE risk). Set to "true" to enable.
+HOOKS_ENABLED = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower() == "true"
+
 # ── default browser config helper ─────────────────────────────
 def get_default_browser_config() -> BrowserConfig:
    """Get default BrowserConfig from config.yml."""
@@ -236,6 +240,19 @@ async def add_security_headers(request: Request, call_next):
        resp.headers.update(config["security"]["headers"])
    return resp

+# ───────────────── URL validation helper ─────────────────
+ALLOWED_URL_SCHEMES = ("http://", "https://")
+ALLOWED_URL_SCHEMES_WITH_RAW = ("http://", "https://", "raw:", "raw://")
+
+
+def validate_url_scheme(url: str, allow_raw: bool = False) -> None:
+    """Validate URL scheme to prevent file:// LFI attacks."""
+    allowed = ALLOWED_URL_SCHEMES_WITH_RAW if allow_raw else ALLOWED_URL_SCHEMES
+    if not url.startswith(allowed):
+        schemes = ", ".join(allowed)
+        raise HTTPException(400, f"URL must start with {schemes}")
+
+
 # ───────────────── safe config‑dump helper ─────────────────
 ALLOWED_TYPES = {
    "CrawlerRunConfig": CrawlerRunConfig,
@@ -337,6 +354,7 @@ async def generate_html(
    Crawls the URL, preprocesses the raw HTML for schema extraction, and returns the processed HTML.
    Use when you need sanitized HTML structures for building schemas or further processing.
    """
+    validate_url_scheme(body.url, allow_raw=True)
    from crawler_pool import get_crawler
    cfg = CrawlerRunConfig()
    try:
@@ -368,6 +386,7 @@ async def generate_screenshot(
    Use when you need an image snapshot of the rendered page. Its recommened to provide an output path to save the screenshot.
    Then in result instead of the screenshot you will get a path to the saved file.
    """
+    validate_url_scheme(body.url)
    from crawler_pool import get_crawler
    try:
        cfg = CrawlerRunConfig(screenshot=True, screenshot_wait_for=body.screenshot_wait_for)
@@ -402,6 +421,7 @@ async def generate_pdf(
    Use when you need a printable or archivable snapshot of the page. It is recommended to provide an output path to save the PDF.
    Then in result instead of the PDF you will get a path to the saved file.
    """
+    validate_url_scheme(body.url)
    from crawler_pool import get_crawler
    try:
        cfg = CrawlerRunConfig(pdf=True)
@@ -474,6 +494,7 @@ async def execute_js(
        ```

    """
+    validate_url_scheme(body.url)
    from crawler_pool import get_crawler
    try:
        cfg = CrawlerRunConfig(js_code=body.scripts)
@@ -600,6 +621,8 @@ async def crawl(
    """
    if not crawl_request.urls:
        raise HTTPException(400, "At least one URL required")
+    if crawl_request.hooks and not HOOKS_ENABLED:
+        raise HTTPException(403, "Hooks are disabled. Set CRAWL4AI_HOOKS_ENABLED=true to enable.")
    # Check whether it is a redirection for a streaming request
    crawler_config = CrawlerRunConfig.load(crawl_request.crawler_config)
    if crawler_config.stream:
@@ -635,6 +658,8 @@ async def crawl_stream(
 ):
    if not crawl_request.urls:
        raise HTTPException(400, "At least one URL required")
+    if crawl_request.hooks and not HOOKS_ENABLED:
+        raise HTTPException(403, "Hooks are disabled. Set CRAWL4AI_HOOKS_ENABLED=true to enable.")

    return await stream_process(crawl_request=crawl_request)

--- a/deploy/docker/tests/run_security_tests.py
+++ b/deploy/docker/tests/run_security_tests.py
@@ -0,0 +1,196 @@
+#!/usr/bin/env python3
+"""
+Security Integration Tests for Crawl4AI Docker API.
+Tests that security fixes are working correctly against a running server.
+
+Usage:
+    python run_security_tests.py [base_url]
+
+Example:
+    python run_security_tests.py http://localhost:11235
+"""
+
+import subprocess
+import sys
+import re
+
+# Colors for terminal output
+GREEN = '\033[0;32m'
+RED = '\033[0;31m'
+YELLOW = '\033[1;33m'
+NC = '\033[0m'  # No Color
+
+PASSED = 0
+FAILED = 0
+
+
+def run_curl(args: list) -> str:
+    """Run curl command and return output."""
+    try:
+        result = subprocess.run(
+            ['curl', '-s'] + args,
+            capture_output=True,
+            text=True,
+            timeout=30
+        )
+        return result.stdout + result.stderr
+    except subprocess.TimeoutExpired:
+        return "TIMEOUT"
+    except Exception as e:
+        return str(e)
+
+
+def test_expect(name: str, expect_pattern: str, curl_args: list) -> bool:
+    """Run a test and check if output matches expected pattern."""
+    global PASSED, FAILED
+
+    result = run_curl(curl_args)
+
+    if re.search(expect_pattern, result, re.IGNORECASE):
+        print(f"{GREEN}✓{NC} {name}")
+        PASSED += 1
+        return True
+    else:
+        print(f"{RED}✗{NC} {name}")
+        print(f"  Expected pattern: {expect_pattern}")
+        print(f"  Got: {result[:200]}")
+        FAILED += 1
+        return False
+
+
+def main():
+    global PASSED, FAILED
+
+    base_url = sys.argv[1] if len(sys.argv) > 1 else "http://localhost:11235"
+
+    print("=" * 60)
+    print("Crawl4AI Security Integration Tests")
+    print(f"Target: {base_url}")
+    print("=" * 60)
+    print()
+
+    # Check server availability
+    print("Checking server availability...")
+    result = run_curl(['-o', '/dev/null', '-w', '%{http_code}', f'{base_url}/health'])
+    if '200' not in result:
+        print(f"{RED}ERROR: Server not reachable at {base_url}{NC}")
+        print("Please start the server first.")
+        sys.exit(1)
+    print(f"{GREEN}Server is running{NC}")
+    print()
+
+    # === Part A: Security Tests ===
+    print("=== Part A: Security Tests ===")
+    print("(Vulnerabilities must be BLOCKED)")
+    print()
+
+    test_expect(
+        "A1: Hooks disabled by default (403)",
+        r"403|disabled|Hooks are disabled",
+        ['-X', 'POST', f'{base_url}/crawl',
+         '-H', 'Content-Type: application/json',
+         '-d', '{"urls":["https://example.com"],"hooks":{"code":{"on_page_context_created":"async def hook(page, context, **kwargs): return page"}}}']
+    )
+
+    test_expect(
+        "A2: file:// blocked on /execute_js (400)",
+        r"400|must start with",
+        ['-X', 'POST', f'{base_url}/execute_js',
+         '-H', 'Content-Type: application/json',
+         '-d', '{"url":"file:///etc/passwd","scripts":["1"]}']
+    )
+
+    test_expect(
+        "A3: file:// blocked on /screenshot (400)",
+        r"400|must start with",
+        ['-X', 'POST', f'{base_url}/screenshot',
+         '-H', 'Content-Type: application/json',
+         '-d', '{"url":"file:///etc/passwd"}']
+    )
+
+    test_expect(
+        "A4: file:// blocked on /pdf (400)",
+        r"400|must start with",
+        ['-X', 'POST', f'{base_url}/pdf',
+         '-H', 'Content-Type: application/json',
+         '-d', '{"url":"file:///etc/passwd"}']
+    )
+
+    test_expect(
+        "A5: file:// blocked on /html (400)",
+        r"400|must start with",
+        ['-X', 'POST', f'{base_url}/html',
+         '-H', 'Content-Type: application/json',
+         '-d', '{"url":"file:///etc/passwd"}']
+    )
+
+    print()
+
+    # === Part B: Functionality Tests ===
+    print("=== Part B: Functionality Tests ===")
+    print("(Normal operations must WORK)")
+    print()
+
+    test_expect(
+        "B1: Basic crawl works",
+        r"success.*true|results",
+        ['-X', 'POST', f'{base_url}/crawl',
+         '-H', 'Content-Type: application/json',
+         '-d', '{"urls":["https://example.com"]}']
+    )
+
+    test_expect(
+        "B2: /md works with https://",
+        r"success.*true|markdown",
+        ['-X', 'POST', f'{base_url}/md',
+         '-H', 'Content-Type: application/json',
+         '-d', '{"url":"https://example.com"}']
+    )
+
+    test_expect(
+        "B3: Health endpoint works",
+        r"ok",
+        [f'{base_url}/health']
+    )
+
+    print()
+
+    # === Part C: Edge Cases ===
+    print("=== Part C: Edge Cases ===")
+    print("(Malformed input must be REJECTED)")
+    print()
+
+    test_expect(
+        "C1: javascript: URL rejected (400)",
+        r"400|must start with",
+        ['-X', 'POST', f'{base_url}/execute_js',
+         '-H', 'Content-Type: application/json',
+         '-d', '{"url":"javascript:alert(1)","scripts":["1"]}']
+    )
+
+    test_expect(
+        "C2: data: URL rejected (400)",
+        r"400|must start with",
+        ['-X', 'POST', f'{base_url}/execute_js',
+         '-H', 'Content-Type: application/json',
+         '-d', '{"url":"data:text/html,<h1>test</h1>","scripts":["1"]}']
+    )
+
+    print()
+    print("=" * 60)
+    print("Results")
+    print("=" * 60)
+    print(f"Passed: {GREEN}{PASSED}{NC}")
+    print(f"Failed: {RED}{FAILED}{NC}")
+    print()
+
+    if FAILED > 0:
+        print(f"{RED}SOME TESTS FAILED{NC}")
+        sys.exit(1)
+    else:
+        print(f"{GREEN}ALL TESTS PASSED{NC}")
+        sys.exit(0)
+
+
+if __name__ == '__main__':
+    main()
--- a/deploy/docker/tests/test_security_fixes.py
+++ b/deploy/docker/tests/test_security_fixes.py
@@ -0,0 +1,170 @@
+#!/usr/bin/env python3
+"""
+Unit tests for security fixes.
+These tests verify the security fixes at the code level without needing a running server.
+"""
+
+import sys
+import os
+
+# Add parent directory to path to import modules
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+import unittest
+
+
+class TestURLValidation(unittest.TestCase):
+    """Test URL scheme validation helper."""
+
+    def setUp(self):
+        """Set up test fixtures."""
+        # Import the validation constants and function
+        self.ALLOWED_URL_SCHEMES = ("http://", "https://")
+        self.ALLOWED_URL_SCHEMES_WITH_RAW = ("http://", "https://", "raw:", "raw://")
+
+    def validate_url_scheme(self, url: str, allow_raw: bool = False) -> bool:
+        """Local version of validate_url_scheme for testing."""
+        allowed = self.ALLOWED_URL_SCHEMES_WITH_RAW if allow_raw else self.ALLOWED_URL_SCHEMES
+        return url.startswith(allowed)
+
+    # === SECURITY TESTS: These URLs must be BLOCKED ===
+
+    def test_file_url_blocked(self):
+        """file:// URLs must be blocked (LFI vulnerability)."""
+        self.assertFalse(self.validate_url_scheme("file:///etc/passwd"))
+        self.assertFalse(self.validate_url_scheme("file:///etc/passwd", allow_raw=True))
+
+    def test_file_url_blocked_windows(self):
+        """file:// URLs with Windows paths must be blocked."""
+        self.assertFalse(self.validate_url_scheme("file:///C:/Windows/System32/config/sam"))
+
+    def test_javascript_url_blocked(self):
+        """javascript: URLs must be blocked (XSS)."""
+        self.assertFalse(self.validate_url_scheme("javascript:alert(1)"))
+
+    def test_data_url_blocked(self):
+        """data: URLs must be blocked."""
+        self.assertFalse(self.validate_url_scheme("data:text/html,<script>alert(1)</script>"))
+
+    def test_ftp_url_blocked(self):
+        """ftp: URLs must be blocked."""
+        self.assertFalse(self.validate_url_scheme("ftp://example.com/file"))
+
+    def test_empty_url_blocked(self):
+        """Empty URLs must be blocked."""
+        self.assertFalse(self.validate_url_scheme(""))
+
+    def test_relative_url_blocked(self):
+        """Relative URLs must be blocked."""
+        self.assertFalse(self.validate_url_scheme("/etc/passwd"))
+        self.assertFalse(self.validate_url_scheme("../../../etc/passwd"))
+
+    # === FUNCTIONALITY TESTS: These URLs must be ALLOWED ===
+
+    def test_http_url_allowed(self):
+        """http:// URLs must be allowed."""
+        self.assertTrue(self.validate_url_scheme("http://example.com"))
+        self.assertTrue(self.validate_url_scheme("http://localhost:8080"))
+
+    def test_https_url_allowed(self):
+        """https:// URLs must be allowed."""
+        self.assertTrue(self.validate_url_scheme("https://example.com"))
+        self.assertTrue(self.validate_url_scheme("https://example.com/path?query=1"))
+
+    def test_raw_url_allowed_when_enabled(self):
+        """raw: URLs must be allowed when allow_raw=True."""
+        self.assertTrue(self.validate_url_scheme("raw:<html></html>", allow_raw=True))
+        self.assertTrue(self.validate_url_scheme("raw://<html></html>", allow_raw=True))
+
+    def test_raw_url_blocked_when_disabled(self):
+        """raw: URLs must be blocked when allow_raw=False."""
+        self.assertFalse(self.validate_url_scheme("raw:<html></html>", allow_raw=False))
+        self.assertFalse(self.validate_url_scheme("raw://<html></html>", allow_raw=False))
+
+
+class TestHookBuiltins(unittest.TestCase):
+    """Test that dangerous builtins are removed from hooks."""
+
+    def test_import_not_in_allowed_builtins(self):
+        """__import__ must NOT be in allowed_builtins."""
+        allowed_builtins = [
+            'print', 'len', 'str', 'int', 'float', 'bool',
+            'list', 'dict', 'set', 'tuple', 'range', 'enumerate',
+            'zip', 'map', 'filter', 'any', 'all', 'sum', 'min', 'max',
+            'sorted', 'reversed', 'abs', 'round', 'isinstance', 'type',
+            'getattr', 'hasattr', 'setattr', 'callable', 'iter', 'next',
+            '__build_class__'  # Required for class definitions in exec
+        ]
+
+        self.assertNotIn('__import__', allowed_builtins)
+        self.assertNotIn('eval', allowed_builtins)
+        self.assertNotIn('exec', allowed_builtins)
+        self.assertNotIn('compile', allowed_builtins)
+        self.assertNotIn('open', allowed_builtins)
+
+    def test_build_class_in_allowed_builtins(self):
+        """__build_class__ must be in allowed_builtins (needed for class definitions)."""
+        allowed_builtins = [
+            'print', 'len', 'str', 'int', 'float', 'bool',
+            'list', 'dict', 'set', 'tuple', 'range', 'enumerate',
+            'zip', 'map', 'filter', 'any', 'all', 'sum', 'min', 'max',
+            'sorted', 'reversed', 'abs', 'round', 'isinstance', 'type',
+            'getattr', 'hasattr', 'setattr', 'callable', 'iter', 'next',
+            '__build_class__'
+        ]
+
+        self.assertIn('__build_class__', allowed_builtins)
+
+
+class TestHooksEnabled(unittest.TestCase):
+    """Test HOOKS_ENABLED environment variable logic."""
+
+    def test_hooks_disabled_by_default(self):
+        """Hooks must be disabled by default."""
+        # Simulate the default behavior
+        hooks_enabled = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower() == "true"
+
+        # Clear any existing env var to test default
+        original = os.environ.pop("CRAWL4AI_HOOKS_ENABLED", None)
+        try:
+            hooks_enabled = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower() == "true"
+            self.assertFalse(hooks_enabled)
+        finally:
+            if original is not None:
+                os.environ["CRAWL4AI_HOOKS_ENABLED"] = original
+
+    def test_hooks_enabled_when_true(self):
+        """Hooks must be enabled when CRAWL4AI_HOOKS_ENABLED=true."""
+        original = os.environ.get("CRAWL4AI_HOOKS_ENABLED")
+        try:
+            os.environ["CRAWL4AI_HOOKS_ENABLED"] = "true"
+            hooks_enabled = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower() == "true"
+            self.assertTrue(hooks_enabled)
+        finally:
+            if original is not None:
+                os.environ["CRAWL4AI_HOOKS_ENABLED"] = original
+            else:
+                os.environ.pop("CRAWL4AI_HOOKS_ENABLED", None)
+
+    def test_hooks_disabled_when_false(self):
+        """Hooks must be disabled when CRAWL4AI_HOOKS_ENABLED=false."""
+        original = os.environ.get("CRAWL4AI_HOOKS_ENABLED")
+        try:
+            os.environ["CRAWL4AI_HOOKS_ENABLED"] = "false"
+            hooks_enabled = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower() == "true"
+            self.assertFalse(hooks_enabled)
+        finally:
+            if original is not None:
+                os.environ["CRAWL4AI_HOOKS_ENABLED"] = original
+            else:
+                os.environ.pop("CRAWL4AI_HOOKS_ENABLED", None)
+
+
+if __name__ == '__main__':
+    print("=" * 60)
+    print("Crawl4AI Security Fixes - Unit Tests")
+    print("=" * 60)
+    print()
+
+    # Run tests with verbosity
+    unittest.main(verbosity=2)
--- a/docs/RELEASE_NOTES_v0.8.0.md
+++ b/docs/RELEASE_NOTES_v0.8.0.md
@@ -0,0 +1,243 @@
+# Crawl4AI v0.8.0 Release Notes
+
+**Release Date**: January 2026
+**Previous Version**: v0.7.6
+**Status**: Release Candidate
+
+---
+
+## Highlights
+
+- **Critical Security Fixes** for Docker API deployment
+- **11 New Features** including crash recovery, prefetch mode, and proxy improvements
+- **Breaking Changes** - see migration guide below
+
+---
+
+## Breaking Changes
+
+### 1. Docker API: Hooks Disabled by Default
+
+**What changed**: Hooks are now disabled by default on the Docker API.
+
+**Why**: Security fix for Remote Code Execution (RCE) vulnerability.
+
+**Who is affected**: Users of the Docker API who use the `hooks` parameter in `/crawl` requests.
+
+**Migration**:
+```bash
+# To re-enable hooks (only if you trust all API users):
+export CRAWL4AI_HOOKS_ENABLED=true
+```
+
+### 2. Docker API: file:// URLs Blocked
+
+**What changed**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` now reject `file://` URLs.
+
+**Why**: Security fix for Local File Inclusion (LFI) vulnerability.
+
+**Who is affected**: Users who were reading local files via the Docker API.
+
+**Migration**: Use the Python library directly for local file processing:
+```python
+# Instead of API call with file:// URL, use library:
+from crawl4ai import AsyncWebCrawler
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(url="file:///path/to/file.html")
+```
+
+---
+
+## Security Fixes
+
+### Critical: Remote Code Execution via Hooks (CVE Pending)
+
+**Severity**: CRITICAL (CVSS 10.0)
+**Affected**: Docker API deployment (all versions before v0.8.0)
+**Vector**: `POST /crawl` with malicious `hooks` parameter
+
+**Details**: The `__import__` builtin was available in hook code, allowing attackers to import `os`, `subprocess`, etc. and execute arbitrary commands.
+
+**Fix**:
+1. Removed `__import__` from allowed builtins
+2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
+
+### High: Local File Inclusion via file:// URLs (CVE Pending)
+
+**Severity**: HIGH (CVSS 8.6)
+**Affected**: Docker API deployment (all versions before v0.8.0)
+**Vector**: `POST /execute_js` (and other endpoints) with `file:///etc/passwd`
+
+**Details**: API endpoints accepted `file://` URLs, allowing attackers to read arbitrary files from the server.
+
+**Fix**: URL scheme validation now only allows `http://`, `https://`, and `raw:` URLs.
+
+### Credits
+
+Discovered by **Neo by ProjectDiscovery** ([projectdiscovery.io](https://projectdiscovery.io)) - December 2025
+
+---
+
+## New Features
+
+### 1. init_scripts Support for BrowserConfig
+
+Pre-page-load JavaScript injection for stealth evasions.
+
+```python
+config = BrowserConfig(
+    init_scripts=[
+        "Object.defineProperty(navigator, 'webdriver', {get: () => false})"
+    ]
+)
+```
+
+### 2. CDP Connection Improvements
+
+- WebSocket URL support (`ws://`, `wss://`)
+- Proper cleanup with `cdp_cleanup_on_close=True`
+- Browser reuse across multiple connections
+
+### 3. Crash Recovery for Deep Crawl Strategies
+
+All deep crawl strategies (BFS, DFS, Best-First) now support crash recovery:
+
+```python
+from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+strategy = BFSDeepCrawlStrategy(
+    max_depth=3,
+    resume_state=saved_state,  # Resume from checkpoint
+    on_state_change=save_callback  # Persist state in real-time
+)
+```
+
+### 4. PDF and MHTML for raw:/file:// URLs
+
+Generate PDFs and MHTML from cached HTML content.
+
+### 5. Screenshots for raw:/file:// URLs
+
+Render cached HTML and capture screenshots.
+
+### 6. base_url Parameter for CrawlerRunConfig
+
+Proper URL resolution for raw: HTML processing:
+
+```python
+config = CrawlerRunConfig(base_url='https://example.com')
+result = await crawler.arun(url='raw:{html}', config=config)
+```
+
+### 7. Prefetch Mode for Two-Phase Deep Crawling
+
+Fast link extraction without full page processing:
+
+```python
+config = CrawlerRunConfig(prefetch=True)
+```
+
+### 8. Proxy Rotation and Configuration
+
+Enhanced proxy rotation with sticky sessions support.
+
+### 9. Proxy Support for HTTP Strategy
+
+Non-browser crawler now supports proxies.
+
+### 10. Browser Pipeline for raw:/file:// URLs
+
+New `process_in_browser` parameter for browser operations on local content:
+
+```python
+config = CrawlerRunConfig(
+    process_in_browser=True,  # Force browser processing
+    screenshot=True
+)
+result = await crawler.arun(url='raw:<html>...</html>', config=config)
+```
+
+### 11. Smart TTL Cache for Sitemap URL Seeder
+
+Intelligent cache invalidation for sitemaps:
+
+```python
+config = SeedingConfig(
+    cache_ttl_hours=24,
+    validate_sitemap_lastmod=True
+)
+```
+
+---
+
+## Bug Fixes
+
+### raw: URL Parsing Truncates at # Character
+
+**Problem**: CSS color codes like `#eee` were being truncated.
+
+**Before**: `raw:body{background:#eee}` → `body{background:`
+**After**: `raw:body{background:#eee}` → `body{background:#eee}`
+
+### Caching System Improvements
+
+Various fixes to cache validation and persistence.
+
+---
+
+## Documentation Updates
+
+- Multi-sample schema generation documentation
+- URL seeder smart TTL cache parameters
+- Security documentation (SECURITY.md)
+
+---
+
+## Upgrade Guide
+
+### From v0.7.x to v0.8.0
+
+1. **Update the package**:
+   ```bash
+   pip install --upgrade crawl4ai
+   ```
+
+2. **Docker API users**:
+   - Hooks are now disabled by default
+   - If you need hooks: `export CRAWL4AI_HOOKS_ENABLED=true`
+   - `file://` URLs no longer work on API (use library directly)
+
+3. **Review security settings**:
+   ```yaml
+   # config.yml - recommended for production
+   security:
+     enabled: true
+     jwt_enabled: true
+   ```
+
+4. **Test your integration** before deploying to production
+
+### Breaking Change Checklist
+
+- [ ] Check if you use `hooks` parameter in API calls
+- [ ] Check if you use `file://` URLs via the API
+- [ ] Update environment variables if needed
+- [ ] Review security configuration
+
+---
+
+## Full Changelog
+
+See [CHANGELOG.md](../CHANGELOG.md) for complete version history.
+
+---
+
+## Contributors
+
+Thanks to all contributors who made this release possible.
+
+Special thanks to **Neo by ProjectDiscovery** for responsible security disclosure.
+
+---
+
+*For questions or issues, please open a [GitHub Issue](https://github.com/unclecode/crawl4ai/issues).*
--- a/docs/blog/release-v0.8.0.md
+++ b/docs/blog/release-v0.8.0.md
@@ -0,0 +1,243 @@
+# Crawl4AI v0.8.0 Release Notes
+
+**Release Date**: January 2026
+**Previous Version**: v0.7.6
+**Status**: Release Candidate
+
+---
+
+## Highlights
+
+- **Critical Security Fixes** for Docker API deployment
+- **11 New Features** including crash recovery, prefetch mode, and proxy improvements
+- **Breaking Changes** - see migration guide below
+
+---
+
+## Breaking Changes
+
+### 1. Docker API: Hooks Disabled by Default
+
+**What changed**: Hooks are now disabled by default on the Docker API.
+
+**Why**: Security fix for Remote Code Execution (RCE) vulnerability.
+
+**Who is affected**: Users of the Docker API who use the `hooks` parameter in `/crawl` requests.
+
+**Migration**:
+```bash
+# To re-enable hooks (only if you trust all API users):
+export CRAWL4AI_HOOKS_ENABLED=true
+```
+
+### 2. Docker API: file:// URLs Blocked
+
+**What changed**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` now reject `file://` URLs.
+
+**Why**: Security fix for Local File Inclusion (LFI) vulnerability.
+
+**Who is affected**: Users who were reading local files via the Docker API.
+
+**Migration**: Use the Python library directly for local file processing:
+```python
+# Instead of API call with file:// URL, use library:
+from crawl4ai import AsyncWebCrawler
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(url="file:///path/to/file.html")
+```
+
+---
+
+## Security Fixes
+
+### Critical: Remote Code Execution via Hooks (CVE Pending)
+
+**Severity**: CRITICAL (CVSS 10.0)
+**Affected**: Docker API deployment (all versions before v0.8.0)
+**Vector**: `POST /crawl` with malicious `hooks` parameter
+
+**Details**: The `__import__` builtin was available in hook code, allowing attackers to import `os`, `subprocess`, etc. and execute arbitrary commands.
+
+**Fix**:
+1. Removed `__import__` from allowed builtins
+2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
+
+### High: Local File Inclusion via file:// URLs (CVE Pending)
+
+**Severity**: HIGH (CVSS 8.6)
+**Affected**: Docker API deployment (all versions before v0.8.0)
+**Vector**: `POST /execute_js` (and other endpoints) with `file:///etc/passwd`
+
+**Details**: API endpoints accepted `file://` URLs, allowing attackers to read arbitrary files from the server.
+
+**Fix**: URL scheme validation now only allows `http://`, `https://`, and `raw:` URLs.
+
+### Credits
+
+Discovered by **Neo by ProjectDiscovery** ([projectdiscovery.io](https://projectdiscovery.io)) - December 2025
+
+---
+
+## New Features
+
+### 1. init_scripts Support for BrowserConfig
+
+Pre-page-load JavaScript injection for stealth evasions.
+
+```python
+config = BrowserConfig(
+    init_scripts=[
+        "Object.defineProperty(navigator, 'webdriver', {get: () => false})"
+    ]
+)
+```
+
+### 2. CDP Connection Improvements
+
+- WebSocket URL support (`ws://`, `wss://`)
+- Proper cleanup with `cdp_cleanup_on_close=True`
+- Browser reuse across multiple connections
+
+### 3. Crash Recovery for Deep Crawl Strategies
+
+All deep crawl strategies (BFS, DFS, Best-First) now support crash recovery:
+
+```python
+from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+strategy = BFSDeepCrawlStrategy(
+    max_depth=3,
+    resume_state=saved_state,  # Resume from checkpoint
+    on_state_change=save_callback  # Persist state in real-time
+)
+```
+
+### 4. PDF and MHTML for raw:/file:// URLs
+
+Generate PDFs and MHTML from cached HTML content.
+
+### 5. Screenshots for raw:/file:// URLs
+
+Render cached HTML and capture screenshots.
+
+### 6. base_url Parameter for CrawlerRunConfig
+
+Proper URL resolution for raw: HTML processing:
+
+```python
+config = CrawlerRunConfig(base_url='https://example.com')
+result = await crawler.arun(url='raw:{html}', config=config)
+```
+
+### 7. Prefetch Mode for Two-Phase Deep Crawling
+
+Fast link extraction without full page processing:
+
+```python
+config = CrawlerRunConfig(prefetch=True)
+```
+
+### 8. Proxy Rotation and Configuration
+
+Enhanced proxy rotation with sticky sessions support.
+
+### 9. Proxy Support for HTTP Strategy
+
+Non-browser crawler now supports proxies.
+
+### 10. Browser Pipeline for raw:/file:// URLs
+
+New `process_in_browser` parameter for browser operations on local content:
+
+```python
+config = CrawlerRunConfig(
+    process_in_browser=True,  # Force browser processing
+    screenshot=True
+)
+result = await crawler.arun(url='raw:<html>...</html>', config=config)
+```
+
+### 11. Smart TTL Cache for Sitemap URL Seeder
+
+Intelligent cache invalidation for sitemaps:
+
+```python
+config = SeedingConfig(
+    cache_ttl_hours=24,
+    validate_sitemap_lastmod=True
+)
+```
+
+---
+
+## Bug Fixes
+
+### raw: URL Parsing Truncates at # Character
+
+**Problem**: CSS color codes like `#eee` were being truncated.
+
+**Before**: `raw:body{background:#eee}` → `body{background:`
+**After**: `raw:body{background:#eee}` → `body{background:#eee}`
+
+### Caching System Improvements
+
+Various fixes to cache validation and persistence.
+
+---
+
+## Documentation Updates
+
+- Multi-sample schema generation documentation
+- URL seeder smart TTL cache parameters
+- Security documentation (SECURITY.md)
+
+---
+
+## Upgrade Guide
+
+### From v0.7.x to v0.8.0
+
+1. **Update the package**:
+   ```bash
+   pip install --upgrade crawl4ai
+   ```
+
+2. **Docker API users**:
+   - Hooks are now disabled by default
+   - If you need hooks: `export CRAWL4AI_HOOKS_ENABLED=true`
+   - `file://` URLs no longer work on API (use library directly)
+
+3. **Review security settings**:
+   ```yaml
+   # config.yml - recommended for production
+   security:
+     enabled: true
+     jwt_enabled: true
+   ```
+
+4. **Test your integration** before deploying to production
+
+### Breaking Change Checklist
+
+- [ ] Check if you use `hooks` parameter in API calls
+- [ ] Check if you use `file://` URLs via the API
+- [ ] Update environment variables if needed
+- [ ] Review security configuration
+
+---
+
+## Full Changelog
+
+See [CHANGELOG.md](../CHANGELOG.md) for complete version history.
+
+---
+
+## Contributors
+
+Thanks to all contributors who made this release possible.
+
+Special thanks to **Neo by ProjectDiscovery** for responsible security disclosure.
+
+---
+
+*For questions or issues, please open a [GitHub Issue](https://github.com/unclecode/crawl4ai/issues).*
--- a/docs/examples/deep_crawl_crash_recovery.py
+++ b/docs/examples/deep_crawl_crash_recovery.py
@@ -0,0 +1,297 @@
+#!/usr/bin/env python3
+"""
+Deep Crawl Crash Recovery Example
+
+This example demonstrates how to implement crash recovery for long-running
+deep crawls. The feature is useful for:
+
+- Cloud deployments with spot/preemptible instances
+- Long-running crawls that may be interrupted
+- Distributed crawling with state coordination
+
+Key concepts:
+- `on_state_change`: Callback fired after each URL is processed
+- `resume_state`: Pass saved state to continue from a checkpoint
+- `export_state()`: Get the last captured state manually
+
+Works with all strategies: BFSDeepCrawlStrategy, DFSDeepCrawlStrategy,
+BestFirstCrawlingStrategy
+"""
+
+import asyncio
+import json
+import os
+from pathlib import Path
+from typing import Dict, Any, List
+
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+
+# File to store crawl state (in production, use Redis/database)
+STATE_FILE = Path("crawl_state.json")
+
+
+async def save_state_to_file(state: Dict[str, Any]) -> None:
+    """
+    Callback to save state after each URL is processed.
+
+    In production, you might save to:
+    - Redis: await redis.set("crawl_state", json.dumps(state))
+    - Database: await db.execute("UPDATE crawls SET state = ?", json.dumps(state))
+    - S3: await s3.put_object(Bucket="crawls", Key="state.json", Body=json.dumps(state))
+    """
+    with open(STATE_FILE, "w") as f:
+        json.dump(state, f, indent=2)
+    print(f"  [State saved] Pages: {state['pages_crawled']}, Pending: {len(state['pending'])}")
+
+
+def load_state_from_file() -> Dict[str, Any] | None:
+    """Load previously saved state, if it exists."""
+    if STATE_FILE.exists():
+        with open(STATE_FILE, "r") as f:
+            return json.load(f)
+    return None
+
+
+async def example_basic_state_persistence():
+    """
+    Example 1: Basic state persistence with file storage.
+
+    The on_state_change callback is called after each URL is processed,
+    allowing you to save progress in real-time.
+    """
+    print("\n" + "=" * 60)
+    print("Example 1: Basic State Persistence")
+    print("=" * 60)
+
+    # Clean up any previous state
+    if STATE_FILE.exists():
+        STATE_FILE.unlink()
+
+    strategy = BFSDeepCrawlStrategy(
+        max_depth=2,
+        max_pages=5,
+        on_state_change=save_state_to_file,  # Save after each URL
+    )
+
+    config = CrawlerRunConfig(
+        deep_crawl_strategy=strategy,
+        verbose=False,
+    )
+
+    print("\nStarting crawl with state persistence...")
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        results = await crawler.arun("https://books.toscrape.com", config=config)
+
+    # Show final state
+    if STATE_FILE.exists():
+        with open(STATE_FILE, "r") as f:
+            final_state = json.load(f)
+
+        print(f"\nFinal state saved to {STATE_FILE}:")
+        print(f"  - Strategy: {final_state['strategy_type']}")
+        print(f"  - Pages crawled: {final_state['pages_crawled']}")
+        print(f"  - URLs visited: {len(final_state['visited'])}")
+        print(f"  - URLs pending: {len(final_state['pending'])}")
+
+    print(f"\nCrawled {len(results)} pages total")
+
+
+async def example_crash_and_resume():
+    """
+    Example 2: Simulate a crash and resume from checkpoint.
+
+    This demonstrates the full crash recovery workflow:
+    1. Start crawling with state persistence
+    2. "Crash" after N pages
+    3. Resume from saved state
+    4. Verify no duplicate work
+    """
+    print("\n" + "=" * 60)
+    print("Example 2: Crash and Resume")
+    print("=" * 60)
+
+    # Clean up any previous state
+    if STATE_FILE.exists():
+        STATE_FILE.unlink()
+
+    crash_after = 3
+    crawled_urls_phase1: List[str] = []
+
+    async def save_and_maybe_crash(state: Dict[str, Any]) -> None:
+        """Save state, then simulate crash after N pages."""
+        # Always save state first
+        await save_state_to_file(state)
+        crawled_urls_phase1.clear()
+        crawled_urls_phase1.extend(state["visited"])
+
+        # Simulate crash after reaching threshold
+        if state["pages_crawled"] >= crash_after:
+            raise Exception("Simulated crash! (This is intentional)")
+
+    # Phase 1: Start crawl that will "crash"
+    print(f"\n--- Phase 1: Crawl until 'crash' after {crash_after} pages ---")
+
+    strategy1 = BFSDeepCrawlStrategy(
+        max_depth=2,
+        max_pages=10,
+        on_state_change=save_and_maybe_crash,
+    )
+
+    config = CrawlerRunConfig(
+        deep_crawl_strategy=strategy1,
+        verbose=False,
+    )
+
+    try:
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            await crawler.arun("https://books.toscrape.com", config=config)
+    except Exception as e:
+        print(f"\n  Crash occurred: {e}")
+        print(f"  URLs crawled before crash: {len(crawled_urls_phase1)}")
+
+    # Phase 2: Resume from checkpoint
+    print("\n--- Phase 2: Resume from checkpoint ---")
+
+    saved_state = load_state_from_file()
+    if not saved_state:
+        print("  ERROR: No saved state found!")
+        return
+
+    print(f"  Loaded state: {saved_state['pages_crawled']} pages, {len(saved_state['pending'])} pending")
+
+    crawled_urls_phase2: List[str] = []
+
+    async def track_resumed_crawl(state: Dict[str, Any]) -> None:
+        """Track new URLs crawled in phase 2."""
+        await save_state_to_file(state)
+        new_urls = set(state["visited"]) - set(saved_state["visited"])
+        for url in new_urls:
+            if url not in crawled_urls_phase2:
+                crawled_urls_phase2.append(url)
+
+    strategy2 = BFSDeepCrawlStrategy(
+        max_depth=2,
+        max_pages=10,
+        resume_state=saved_state,  # Resume from checkpoint!
+        on_state_change=track_resumed_crawl,
+    )
+
+    config2 = CrawlerRunConfig(
+        deep_crawl_strategy=strategy2,
+        verbose=False,
+    )
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        results = await crawler.arun("https://books.toscrape.com", config=config2)
+
+    # Verify no duplicates
+    already_crawled = set(saved_state["visited"])
+    duplicates = set(crawled_urls_phase2) & already_crawled
+
+    print(f"\n--- Results ---")
+    print(f"  Phase 1 URLs: {len(crawled_urls_phase1)}")
+    print(f"  Phase 2 new URLs: {len(crawled_urls_phase2)}")
+    print(f"  Duplicate crawls: {len(duplicates)} (should be 0)")
+    print(f"  Total results: {len(results)}")
+
+    if len(duplicates) == 0:
+        print("\n  SUCCESS: No duplicate work after resume!")
+    else:
+        print(f"\n  WARNING: Found duplicates: {duplicates}")
+
+
+async def example_export_state():
+    """
+    Example 3: Manual state export using export_state().
+
+    If you don't need real-time persistence, you can export
+    the state manually after the crawl completes.
+    """
+    print("\n" + "=" * 60)
+    print("Example 3: Manual State Export")
+    print("=" * 60)
+
+    strategy = BFSDeepCrawlStrategy(
+        max_depth=1,
+        max_pages=3,
+        # No callback - state is still tracked internally
+    )
+
+    config = CrawlerRunConfig(
+        deep_crawl_strategy=strategy,
+        verbose=False,
+    )
+
+    print("\nCrawling without callback...")
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        results = await crawler.arun("https://books.toscrape.com", config=config)
+
+    # Export state after crawl completes
+    # Note: This only works if on_state_change was set during crawl
+    # For this example, we'd need to set on_state_change to get state
+    print(f"\nCrawled {len(results)} pages")
+    print("(For manual export, set on_state_change to capture state)")
+
+
+async def example_state_structure():
+    """
+    Example 4: Understanding the state structure.
+
+    Shows the complete state dictionary that gets saved.
+    """
+    print("\n" + "=" * 60)
+    print("Example 4: State Structure")
+    print("=" * 60)
+
+    captured_state = None
+
+    async def capture_state(state: Dict[str, Any]) -> None:
+        nonlocal captured_state
+        captured_state = state
+
+    strategy = BFSDeepCrawlStrategy(
+        max_depth=1,
+        max_pages=2,
+        on_state_change=capture_state,
+    )
+
+    config = CrawlerRunConfig(
+        deep_crawl_strategy=strategy,
+        verbose=False,
+    )
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        await crawler.arun("https://books.toscrape.com", config=config)
+
+    if captured_state:
+        print("\nState structure:")
+        print(json.dumps(captured_state, indent=2, default=str)[:1000] + "...")
+
+        print("\n\nKey fields:")
+        print(f"  strategy_type: '{captured_state['strategy_type']}'")
+        print(f"  visited: List of {len(captured_state['visited'])} URLs")
+        print(f"  pending: List of {len(captured_state['pending'])} queued items")
+        print(f"  depths: Dict mapping URL -> depth level")
+        print(f"  pages_crawled: {captured_state['pages_crawled']}")
+
+
+async def main():
+    """Run all examples."""
+    print("=" * 60)
+    print("Deep Crawl Crash Recovery Examples")
+    print("=" * 60)
+
+    await example_basic_state_persistence()
+    await example_crash_and_resume()
+    await example_state_structure()
+
+    # # Cleanup
+    # if STATE_FILE.exists():
+    #     STATE_FILE.unlink()
+    #     print(f"\n[Cleaned up {STATE_FILE}]")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/docs/examples/prefetch_two_phase_crawl.py
+++ b/docs/examples/prefetch_two_phase_crawl.py
@@ -0,0 +1,279 @@
+#!/usr/bin/env python3
+"""
+Prefetch Mode and Two-Phase Crawling Example
+
+Prefetch mode is a fast path that skips heavy processing and returns
+only HTML + links. This is ideal for:
+
+- Site mapping: Quickly discover all URLs
+- Selective crawling: Find URLs first, then process only what you need
+- Link validation: Check which pages exist without full processing
+- Crawl planning: Estimate size before committing resources
+
+Key concept:
+- `prefetch=True` in CrawlerRunConfig enables fast link-only extraction
+- Skips: markdown generation, content scraping, media extraction, LLM extraction
+- Returns: HTML and links dictionary
+
+Performance benefit: ~5-10x faster than full processing
+"""
+
+import asyncio
+import time
+from typing import List, Dict
+
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+
+async def example_basic_prefetch():
+    """
+    Example 1: Basic prefetch mode.
+
+    Shows how prefetch returns HTML and links without heavy processing.
+    """
+    print("\n" + "=" * 60)
+    print("Example 1: Basic Prefetch Mode")
+    print("=" * 60)
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        # Enable prefetch mode
+        config = CrawlerRunConfig(prefetch=True)
+
+        print("\nFetching with prefetch=True...")
+        result = await crawler.arun("https://books.toscrape.com", config=config)
+
+        print(f"\nResult summary:")
+        print(f"  Success: {result.success}")
+        print(f"  HTML length: {len(result.html) if result.html else 0} chars")
+        print(f"  Internal links: {len(result.links.get('internal', []))}")
+        print(f"  External links: {len(result.links.get('external', []))}")
+
+        # These should be None/empty in prefetch mode
+        print(f"\n  Skipped processing:")
+        print(f"    Markdown: {result.markdown}")
+        print(f"    Cleaned HTML: {result.cleaned_html}")
+        print(f"    Extracted content: {result.extracted_content}")
+
+        # Show some discovered links
+        internal_links = result.links.get("internal", [])
+        if internal_links:
+            print(f"\n  Sample internal links:")
+            for link in internal_links[:5]:
+                print(f"    - {link['href'][:60]}...")
+
+
+async def example_performance_comparison():
+    """
+    Example 2: Compare prefetch vs full processing performance.
+    """
+    print("\n" + "=" * 60)
+    print("Example 2: Performance Comparison")
+    print("=" * 60)
+
+    url = "https://books.toscrape.com"
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        # Warm up - first request is slower due to browser startup
+        await crawler.arun(url, config=CrawlerRunConfig())
+
+        # Prefetch mode timing
+        start = time.time()
+        prefetch_result = await crawler.arun(url, config=CrawlerRunConfig(prefetch=True))
+        prefetch_time = time.time() - start
+
+        # Full processing timing
+        start = time.time()
+        full_result = await crawler.arun(url, config=CrawlerRunConfig())
+        full_time = time.time() - start
+
+        print(f"\nTiming comparison:")
+        print(f"  Prefetch mode: {prefetch_time:.3f}s")
+        print(f"  Full processing: {full_time:.3f}s")
+        print(f"  Speedup: {full_time / prefetch_time:.1f}x faster")
+
+        print(f"\nOutput comparison:")
+        print(f"  Prefetch - Links found: {len(prefetch_result.links.get('internal', []))}")
+        print(f"  Full - Links found: {len(full_result.links.get('internal', []))}")
+        print(f"  Full - Markdown length: {len(full_result.markdown.raw_markdown) if full_result.markdown else 0}")
+
+
+async def example_two_phase_crawl():
+    """
+    Example 3: Two-phase crawling pattern.
+
+    Phase 1: Fast discovery with prefetch
+    Phase 2: Full processing on selected URLs
+    """
+    print("\n" + "=" * 60)
+    print("Example 3: Two-Phase Crawling")
+    print("=" * 60)
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        # ═══════════════════════════════════════════════════════════
+        # Phase 1: Fast URL discovery
+        # ═══════════════════════════════════════════════════════════
+        print("\n--- Phase 1: Fast Discovery ---")
+
+        prefetch_config = CrawlerRunConfig(prefetch=True)
+        start = time.time()
+        discovery = await crawler.arun("https://books.toscrape.com", config=prefetch_config)
+        discovery_time = time.time() - start
+
+        all_urls = [link["href"] for link in discovery.links.get("internal", [])]
+        print(f"  Discovered {len(all_urls)} URLs in {discovery_time:.2f}s")
+
+        # Filter to URLs we care about (e.g., book detail pages)
+        # On books.toscrape.com, book pages contain "catalogue/" but not "category/"
+        book_urls = [
+            url for url in all_urls
+            if "catalogue/" in url and "category/" not in url
+        ][:5]  # Limit to 5 for demo
+
+        print(f"  Filtered to {len(book_urls)} book pages")
+
+        # ═══════════════════════════════════════════════════════════
+        # Phase 2: Full processing on selected URLs
+        # ═══════════════════════════════════════════════════════════
+        print("\n--- Phase 2: Full Processing ---")
+
+        full_config = CrawlerRunConfig(
+            word_count_threshold=10,
+            remove_overlay_elements=True,
+        )
+
+        results = []
+        start = time.time()
+
+        for url in book_urls:
+            result = await crawler.arun(url, config=full_config)
+            if result.success:
+                results.append(result)
+                title = result.url.split("/")[-2].replace("-", " ").title()[:40]
+                md_len = len(result.markdown.raw_markdown) if result.markdown else 0
+                print(f"    Processed: {title}... ({md_len} chars)")
+
+        processing_time = time.time() - start
+        print(f"\n  Processed {len(results)} pages in {processing_time:.2f}s")
+
+        # ═══════════════════════════════════════════════════════════
+        # Summary
+        # ═══════════════════════════════════════════════════════════
+        print(f"\n--- Summary ---")
+        print(f"  Discovery phase: {discovery_time:.2f}s ({len(all_urls)} URLs)")
+        print(f"  Processing phase: {processing_time:.2f}s ({len(results)} pages)")
+        print(f"  Total time: {discovery_time + processing_time:.2f}s")
+        print(f"  URLs skipped: {len(all_urls) - len(book_urls)} (not matching filter)")
+
+
+async def example_prefetch_with_deep_crawl():
+    """
+    Example 4: Combine prefetch with deep crawl strategy.
+
+    Use prefetch mode during deep crawl for maximum speed.
+    """
+    print("\n" + "=" * 60)
+    print("Example 4: Prefetch with Deep Crawl")
+    print("=" * 60)
+
+    from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        # Deep crawl with prefetch - maximum discovery speed
+        config = CrawlerRunConfig(
+            prefetch=True,  # Fast mode
+            deep_crawl_strategy=BFSDeepCrawlStrategy(
+                max_depth=1,
+                max_pages=10,
+            )
+        )
+
+        print("\nDeep crawling with prefetch mode...")
+        start = time.time()
+
+        result_container = await crawler.arun("https://books.toscrape.com", config=config)
+
+        # Handle iterator result from deep crawl
+        if hasattr(result_container, '__iter__'):
+            results = list(result_container)
+        else:
+            results = [result_container]
+
+        elapsed = time.time() - start
+
+        # Collect all discovered links
+        all_internal_links = set()
+        all_external_links = set()
+
+        for result in results:
+            for link in result.links.get("internal", []):
+                all_internal_links.add(link["href"])
+            for link in result.links.get("external", []):
+                all_external_links.add(link["href"])
+
+        print(f"\nResults:")
+        print(f"  Pages crawled: {len(results)}")
+        print(f"  Total internal links discovered: {len(all_internal_links)}")
+        print(f"  Total external links discovered: {len(all_external_links)}")
+        print(f"  Time: {elapsed:.2f}s")
+
+
+async def example_prefetch_with_raw_html():
+    """
+    Example 5: Prefetch with raw HTML input.
+
+    You can also use prefetch mode with raw: URLs for cached content.
+    """
+    print("\n" + "=" * 60)
+    print("Example 5: Prefetch with Raw HTML")
+    print("=" * 60)
+
+    sample_html = """
+    <html>
+        <head><title>Sample Page</title></head>
+        <body>
+            <h1>Hello World</h1>
+            <nav>
+                <a href="/page1">Internal Page 1</a>
+                <a href="/page2">Internal Page 2</a>
+                <a href="https://example.com/external">External Link</a>
+            </nav>
+            <main>
+                <p>This is the main content with <a href="/page3">another link</a>.</p>
+            </main>
+        </body>
+    </html>
+    """
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        config = CrawlerRunConfig(
+            prefetch=True,
+            base_url="https://mysite.com",  # For resolving relative links
+        )
+
+        result = await crawler.arun(f"raw:{sample_html}", config=config)
+
+        print(f"\nExtracted from raw HTML:")
+        print(f"  Internal links: {len(result.links.get('internal', []))}")
+        for link in result.links.get("internal", []):
+            print(f"    - {link['href']} ({link['text']})")
+
+        print(f"\n  External links: {len(result.links.get('external', []))}")
+        for link in result.links.get("external", []):
+            print(f"    - {link['href']} ({link['text']})")
+
+
+async def main():
+    """Run all examples."""
+    print("=" * 60)
+    print("Prefetch Mode and Two-Phase Crawling Examples")
+    print("=" * 60)
+
+    await example_basic_prefetch()
+    await example_performance_comparison()
+    await example_two_phase_crawl()
+    await example_prefetch_with_deep_crawl()
+    await example_prefetch_with_raw_html()
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/docs/md_v2/blog/index.md
+++ b/docs/md_v2/blog/index.md
@@ -20,22 +20,32 @@ Ever wondered why your AI coding assistant struggles with your library despite c

 ## Latest Release

+### [Crawl4AI v0.8.0 – Crash Recovery & Prefetch Mode](../blog/release-v0.8.0.md)
+*January 2026*
+
+Crawl4AI v0.8.0 introduces crash recovery for deep crawls, a new prefetch mode for fast URL discovery, and critical security fixes for Docker deployments.
+
+Key highlights:
+- **🔄 Deep Crawl Crash Recovery**: `on_state_change` callback for real-time state persistence, `resume_state` to continue from checkpoints
+- **⚡ Prefetch Mode**: `prefetch=True` for 5-10x faster URL discovery, perfect for two-phase crawling patterns
+- **🔒 Security Fixes**: Hooks disabled by default, `file://` URLs blocked on Docker API, `__import__` removed from sandbox
+
+[Read full release notes →](../blog/release-v0.8.0.md)
+
+## Recent Releases
+
 ### [Crawl4AI v0.7.8 – Stability & Bug Fix Release](../blog/release-v0.7.8.md)
 *December 2025*

-Crawl4AI v0.7.8 is a focused stability release addressing 11 bugs reported by the community. While there are no new features, these fixes resolve important issues affecting Docker deployments, LLM extraction, URL handling, and dependency compatibility.
+Crawl4AI v0.7.8 is a focused stability release addressing 11 bugs reported by the community. Fixes for Docker deployments, LLM extraction, URL handling, and dependency compatibility.

 Key highlights:
 - **🐳 Docker API Fixes**: ContentRelevanceFilter deserialization, ProxyConfig serialization, cache folder permissions
- **🤖 LLM Improvements**: Configurable rate limiter backoff, HTML input format support, raw HTML URL handling
- **🔗 URL Handling**: Correct relative URL resolution after JavaScript redirects
+- **🤖 LLM Improvements**: Configurable rate limiter backoff, HTML input format support
 - **📦 Dependencies**: Replaced deprecated PyPDF2 with pypdf, Pydantic v2 ConfigDict compatibility
- **🧠 AdaptiveCrawler**: Fixed query expansion to actually use LLM instead of mock data

 [Read full release notes →](../blog/release-v0.7.8.md)

-## Recent Releases
-
 ### [Crawl4AI v0.7.7 – The Self-Hosting & Monitoring Update](../blog/release-v0.7.7.md)
 *November 14, 2025*

@@ -52,36 +62,22 @@ Key highlights:
 ### [Crawl4AI v0.7.6 – The Webhook Infrastructure Update](../blog/release-v0.7.6.md)
 *October 22, 2025*

-Crawl4AI v0.7.6 introduces comprehensive webhook support for the Docker job queue API, bringing real-time notifications to both crawling and LLM extraction workflows. No more polling!
+Crawl4AI v0.7.6 introduces comprehensive webhook support for the Docker job queue API, bringing real-time notifications to both crawling and LLM extraction workflows.

 Key highlights:
 - **🪝 Complete Webhook Support**: Real-time notifications for both `/crawl/job` and `/llm/job` endpoints
- **🔄 Reliable Delivery**: Exponential backoff retry mechanism (5 attempts: 1s → 2s → 4s → 8s → 16s)
+- **🔄 Reliable Delivery**: Exponential backoff retry mechanism
 - **🔐 Custom Authentication**: Add custom headers for webhook authentication
- **📊 Flexible Delivery**: Choose notification-only or include full data in payload
- **⚙️ Global Configuration**: Set default webhook URL in config.yml for all jobs

 [Read full release notes →](../blog/release-v0.7.6.md)

-### [Crawl4AI v0.7.5 – The Docker Hooks & Security Update](../blog/release-v0.7.5.md)
-*September 29, 2025*
-
-Crawl4AI v0.7.5 introduces the powerful Docker Hooks System for complete pipeline customization, enhanced LLM integration with custom providers, HTTPS preservation for modern web security, and resolves multiple community-reported issues.
-
-Key highlights:
- **🔧 Docker Hooks System**: Custom Python functions at 8 key pipeline points for unprecedented customization
- **🤖 Enhanced LLM Integration**: Custom providers with temperature control and base_url configuration
- **🔒 HTTPS Preservation**: Secure internal link handling for modern web applications
- **🐍 Python 3.10+ Support**: Modern language features and enhanced performance
-
-[Read full release notes →](../blog/release-v0.7.5.md)
-
 ---

 ## Older Releases

 | Version | Date | Highlights |
 |---------|------|------------|
+| [v0.7.5](../blog/release-v0.7.5.md) | September 2025 | Docker Hooks System, enhanced LLM integration, HTTPS preservation |
 | [v0.7.4](../blog/release-v0.7.4.md) | August 2025 | LLM-powered table extraction, performance improvements |
 | [v0.7.3](../blog/release-v0.7.3.md) | July 2025 | Undetected browser, multi-URL config, memory monitoring |
 | [v0.7.1](../blog/release-v0.7.1.md) | June 2025 | Bug fixes and stability improvements |
--- a/docs/md_v2/blog/releases/v0.8.0.md
+++ b/docs/md_v2/blog/releases/v0.8.0.md
@@ -0,0 +1,243 @@
+# Crawl4AI v0.8.0 Release Notes
+
+**Release Date**: January 2026
+**Previous Version**: v0.7.6
+**Status**: Release Candidate
+
+---
+
+## Highlights
+
+- **Critical Security Fixes** for Docker API deployment
+- **11 New Features** including crash recovery, prefetch mode, and proxy improvements
+- **Breaking Changes** - see migration guide below
+
+---
+
+## Breaking Changes
+
+### 1. Docker API: Hooks Disabled by Default
+
+**What changed**: Hooks are now disabled by default on the Docker API.
+
+**Why**: Security fix for Remote Code Execution (RCE) vulnerability.
+
+**Who is affected**: Users of the Docker API who use the `hooks` parameter in `/crawl` requests.
+
+**Migration**:
+```bash
+# To re-enable hooks (only if you trust all API users):
+export CRAWL4AI_HOOKS_ENABLED=true
+```
+
+### 2. Docker API: file:// URLs Blocked
+
+**What changed**: The endpoints `/execute_js`, `/screenshot`, `/pdf`, and `/html` now reject `file://` URLs.
+
+**Why**: Security fix for Local File Inclusion (LFI) vulnerability.
+
+**Who is affected**: Users who were reading local files via the Docker API.
+
+**Migration**: Use the Python library directly for local file processing:
+```python
+# Instead of API call with file:// URL, use library:
+from crawl4ai import AsyncWebCrawler
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(url="file:///path/to/file.html")
+```
+
+---
+
+## Security Fixes
+
+### Critical: Remote Code Execution via Hooks (CVE Pending)
+
+**Severity**: CRITICAL (CVSS 10.0)
+**Affected**: Docker API deployment (all versions before v0.8.0)
+**Vector**: `POST /crawl` with malicious `hooks` parameter
+
+**Details**: The `__import__` builtin was available in hook code, allowing attackers to import `os`, `subprocess`, etc. and execute arbitrary commands.
+
+**Fix**:
+1. Removed `__import__` from allowed builtins
+2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
+
+### High: Local File Inclusion via file:// URLs (CVE Pending)
+
+**Severity**: HIGH (CVSS 8.6)
+**Affected**: Docker API deployment (all versions before v0.8.0)
+**Vector**: `POST /execute_js` (and other endpoints) with `file:///etc/passwd`
+
+**Details**: API endpoints accepted `file://` URLs, allowing attackers to read arbitrary files from the server.
+
+**Fix**: URL scheme validation now only allows `http://`, `https://`, and `raw:` URLs.
+
+### Credits
+
+Discovered by **Neo by ProjectDiscovery** ([projectdiscovery.io](https://projectdiscovery.io)) - December 2025
+
+---
+
+## New Features
+
+### 1. init_scripts Support for BrowserConfig
+
+Pre-page-load JavaScript injection for stealth evasions.
+
+```python
+config = BrowserConfig(
+    init_scripts=[
+        "Object.defineProperty(navigator, 'webdriver', {get: () => false})"
+    ]
+)
+```
+
+### 2. CDP Connection Improvements
+
+- WebSocket URL support (`ws://`, `wss://`)
+- Proper cleanup with `cdp_cleanup_on_close=True`
+- Browser reuse across multiple connections
+
+### 3. Crash Recovery for Deep Crawl Strategies
+
+All deep crawl strategies (BFS, DFS, Best-First) now support crash recovery:
+
+```python
+from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+strategy = BFSDeepCrawlStrategy(
+    max_depth=3,
+    resume_state=saved_state,  # Resume from checkpoint
+    on_state_change=save_callback  # Persist state in real-time
+)
+```
+
+### 4. PDF and MHTML for raw:/file:// URLs
+
+Generate PDFs and MHTML from cached HTML content.
+
+### 5. Screenshots for raw:/file:// URLs
+
+Render cached HTML and capture screenshots.
+
+### 6. base_url Parameter for CrawlerRunConfig
+
+Proper URL resolution for raw: HTML processing:
+
+```python
+config = CrawlerRunConfig(base_url='https://example.com')
+result = await crawler.arun(url='raw:{html}', config=config)
+```
+
+### 7. Prefetch Mode for Two-Phase Deep Crawling
+
+Fast link extraction without full page processing:
+
+```python
+config = CrawlerRunConfig(prefetch=True)
+```
+
+### 8. Proxy Rotation and Configuration
+
+Enhanced proxy rotation with sticky sessions support.
+
+### 9. Proxy Support for HTTP Strategy
+
+Non-browser crawler now supports proxies.
+
+### 10. Browser Pipeline for raw:/file:// URLs
+
+New `process_in_browser` parameter for browser operations on local content:
+
+```python
+config = CrawlerRunConfig(
+    process_in_browser=True,  # Force browser processing
+    screenshot=True
+)
+result = await crawler.arun(url='raw:<html>...</html>', config=config)
+```
+
+### 11. Smart TTL Cache for Sitemap URL Seeder
+
+Intelligent cache invalidation for sitemaps:
+
+```python
+config = SeedingConfig(
+    cache_ttl_hours=24,
+    validate_sitemap_lastmod=True
+)
+```
+
+---
+
+## Bug Fixes
+
+### raw: URL Parsing Truncates at # Character
+
+**Problem**: CSS color codes like `#eee` were being truncated.
+
+**Before**: `raw:body{background:#eee}` → `body{background:`
+**After**: `raw:body{background:#eee}` → `body{background:#eee}`
+
+### Caching System Improvements
+
+Various fixes to cache validation and persistence.
+
+---
+
+## Documentation Updates
+
+- Multi-sample schema generation documentation
+- URL seeder smart TTL cache parameters
+- Security documentation (SECURITY.md)
+
+---
+
+## Upgrade Guide
+
+### From v0.7.x to v0.8.0
+
+1. **Update the package**:
+   ```bash
+   pip install --upgrade crawl4ai
+   ```
+
+2. **Docker API users**:
+   - Hooks are now disabled by default
+   - If you need hooks: `export CRAWL4AI_HOOKS_ENABLED=true`
+   - `file://` URLs no longer work on API (use library directly)
+
+3. **Review security settings**:
+   ```yaml
+   # config.yml - recommended for production
+   security:
+     enabled: true
+     jwt_enabled: true
+   ```
+
+4. **Test your integration** before deploying to production
+
+### Breaking Change Checklist
+
+- [ ] Check if you use `hooks` parameter in API calls
+- [ ] Check if you use `file://` URLs via the API
+- [ ] Update environment variables if needed
+- [ ] Review security configuration
+
+---
+
+## Full Changelog
+
+See [CHANGELOG.md](../CHANGELOG.md) for complete version history.
+
+---
+
+## Contributors
+
+Thanks to all contributors who made this release possible.
+
+Special thanks to **Neo by ProjectDiscovery** for responsible security disclosure.
+
+---
+
+*For questions or issues, please open a [GitHub Issue](https://github.com/unclecode/crawl4ai/issues).*
--- a/docs/md_v2/core/deep-crawling.md
+++ b/docs/md_v2/core/deep-crawling.md
@@ -9,6 +9,8 @@ In this tutorial, you'll learn:
 3. Implementing **filters and scorers** to target specific content
 4. Creating **advanced filtering chains** for sophisticated crawls
 5. Using **BestFirstCrawling** for intelligent exploration prioritization
+6. **Crash recovery** for long-running production crawls
+7. **Prefetch mode** for fast URL discovery  

 > **Prerequisites**  
 > - You’ve completed or read [AsyncWebCrawler Basics](../core/simple-crawling.md) to understand how to run a simple crawl.  
@@ -485,7 +487,249 @@ This is especially useful for security-conscious crawling or when dealing with s

 ---

-## 10. Summary & Next Steps
+## 10. Crash Recovery for Long-Running Crawls
+
+For production deployments, especially in cloud environments where instances can be terminated unexpectedly, Crawl4AI provides built-in crash recovery support for all deep crawl strategies.
+
+### 10.1 Enabling State Persistence
+
+All deep crawl strategies (BFS, DFS, Best-First) support two optional parameters:
+
+- **`resume_state`**: Pass a previously saved state to resume from a checkpoint
+- **`on_state_change`**: Async callback fired after each URL is processed
+
+```python
+from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+import json
+
+# Callback to save state after each URL
+async def save_state_to_redis(state: dict):
+    await redis.set("crawl_state", json.dumps(state))
+
+strategy = BFSDeepCrawlStrategy(
+    max_depth=3,
+    on_state_change=save_state_to_redis,  # Called after each URL
+)
+```
+
+### 10.2 State Structure
+
+The state dictionary is JSON-serializable and contains:
+
+```python
+{
+    "strategy_type": "bfs",  # or "dfs", "best_first"
+    "visited": ["url1", "url2", ...],  # Already crawled URLs
+    "pending": [{"url": "...", "parent_url": "..."}],  # Queue/stack
+    "depths": {"url1": 0, "url2": 1},  # Depth tracking
+    "pages_crawled": 42  # Counter
+}
+```
+
+### 10.3 Resuming from a Checkpoint
+
+```python
+import json
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+# Load saved state (e.g., from Redis, database, or file)
+saved_state = json.loads(await redis.get("crawl_state"))
+
+# Resume crawling from where we left off
+strategy = BFSDeepCrawlStrategy(
+    max_depth=3,
+    resume_state=saved_state,  # Continue from checkpoint
+    on_state_change=save_state_to_redis,  # Keep saving progress
+)
+
+config = CrawlerRunConfig(deep_crawl_strategy=strategy)
+
+async with AsyncWebCrawler() as crawler:
+    # Will skip already-visited URLs and continue from pending queue
+    results = await crawler.arun(start_url, config=config)
+```
+
+### 10.4 Manual State Export
+
+You can export the last captured state using `export_state()`. Note that this requires `on_state_change` to be set (state is captured in the callback):
+
+```python
+import json
+
+captured_state = None
+
+async def capture_state(state: dict):
+    global captured_state
+    captured_state = state
+
+strategy = BFSDeepCrawlStrategy(
+    max_depth=2,
+    on_state_change=capture_state,  # Required for state capture
+)
+config = CrawlerRunConfig(deep_crawl_strategy=strategy)
+
+async with AsyncWebCrawler() as crawler:
+    results = await crawler.arun(start_url, config=config)
+
+# Get the last captured state
+state = strategy.export_state()
+if state:
+    # Save to your preferred storage
+    with open("crawl_checkpoint.json", "w") as f:
+        json.dump(state, f)
+```
+
+### 10.5 Complete Example: Redis-Based Recovery
+
+```python
+import asyncio
+import json
+import redis.asyncio as redis
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+REDIS_KEY = "crawl4ai:crawl_state"
+
+async def main():
+    redis_client = redis.Redis(host='localhost', port=6379, db=0)
+
+    # Check for existing state
+    saved_state = None
+    existing = await redis_client.get(REDIS_KEY)
+    if existing:
+        saved_state = json.loads(existing)
+        print(f"Resuming from checkpoint: {saved_state['pages_crawled']} pages already crawled")
+
+    # State persistence callback
+    async def persist_state(state: dict):
+        await redis_client.set(REDIS_KEY, json.dumps(state))
+
+    # Create strategy with recovery support
+    strategy = BFSDeepCrawlStrategy(
+        max_depth=3,
+        max_pages=100,
+        resume_state=saved_state,
+        on_state_change=persist_state,
+    )
+
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=True)
+
+    try:
+        async with AsyncWebCrawler() as crawler:
+            async for result in await crawler.arun("https://example.com", config=config):
+                print(f"Crawled: {result.url}")
+    except Exception as e:
+        print(f"Crawl interrupted: {e}")
+        print("State saved - restart to resume")
+    finally:
+        await redis_client.close()
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+### 10.6 Zero Overhead
+
+When `resume_state=None` and `on_state_change=None` (the defaults), there is no performance impact. State tracking only activates when you enable these features.
+
+---
+
+## 11. Prefetch Mode for Fast URL Discovery
+
+When you need to quickly discover URLs without full page processing, use **prefetch mode**. This is ideal for two-phase crawling where you first map the site, then selectively process specific pages.
+
+### 11.1 Enabling Prefetch Mode
+
+```python
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+config = CrawlerRunConfig(prefetch=True)
+
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun("https://example.com", config=config)
+
+    # Result contains only HTML and links - no markdown, no extraction
+    print(f"Found {len(result.links['internal'])} internal links")
+    print(f"Found {len(result.links['external'])} external links")
+```
+
+### 11.2 What Gets Skipped
+
+Prefetch mode uses a fast path that bypasses heavy processing:
+
+| Processing Step | Normal Mode | Prefetch Mode |
+|----------------|-------------|---------------|
+| Fetch HTML | ✅ | ✅ |
+| Extract links | ✅ | ✅ (fast `quick_extract_links()`) |
+| Generate markdown | ✅ | ❌ Skipped |
+| Content scraping | ✅ | ❌ Skipped |
+| Media extraction | ✅ | ❌ Skipped |
+| LLM extraction | ✅ | ❌ Skipped |
+
+### 11.3 Performance Benefit
+
+- **Normal mode**: Full pipeline (~2-5 seconds per page)
+- **Prefetch mode**: HTML + links only (~200-500ms per page)
+
+This makes prefetch mode **5-10x faster** for URL discovery.
+
+### 11.4 Two-Phase Crawling Pattern
+
+The most common use case is two-phase crawling:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+async def two_phase_crawl(start_url: str):
+    async with AsyncWebCrawler() as crawler:
+        # ═══════════════════════════════════════════════
+        # Phase 1: Fast discovery (prefetch mode)
+        # ═══════════════════════════════════════════════
+        prefetch_config = CrawlerRunConfig(prefetch=True)
+        discovery = await crawler.arun(start_url, config=prefetch_config)
+
+        all_urls = [link["href"] for link in discovery.links.get("internal", [])]
+        print(f"Discovered {len(all_urls)} URLs")
+
+        # Filter to URLs you care about
+        blog_urls = [url for url in all_urls if "/blog/" in url]
+        print(f"Found {len(blog_urls)} blog posts to process")
+
+        # ═══════════════════════════════════════════════
+        # Phase 2: Full processing on selected URLs only
+        # ═══════════════════════════════════════════════
+        full_config = CrawlerRunConfig(
+            # Your normal extraction settings
+            word_count_threshold=100,
+            remove_overlay_elements=True,
+        )
+
+        results = []
+        for url in blog_urls:
+            result = await crawler.arun(url, config=full_config)
+            if result.success:
+                results.append(result)
+                print(f"Processed: {url}")
+
+        return results
+
+if __name__ == "__main__":
+    results = asyncio.run(two_phase_crawl("https://example.com"))
+    print(f"Fully processed {len(results)} pages")
+```
+
+### 11.5 Use Cases
+
+- **Site mapping**: Quickly discover all URLs before deciding what to process
+- **Link validation**: Check which pages exist without heavy processing
+- **Selective deep crawl**: Prefetch to find URLs, filter by pattern, then full crawl
+- **Crawl planning**: Estimate crawl size before committing resources
+
+---
+
+## 12. Summary & Next Steps

 In this **Deep Crawling with Crawl4AI** tutorial, you learned to:

@@ -495,5 +739,7 @@ In this **Deep Crawling with Crawl4AI** tutorial, you learned to:
 - Use scorers to prioritize the most relevant pages
 - Limit crawls with `max_pages` and `score_threshold` parameters
 - Build a complete advanced crawler with combined techniques
+- **Implement crash recovery** with `resume_state` and `on_state_change` for production deployments
+- **Use prefetch mode** for fast URL discovery and two-phase crawling

 With these tools, you can efficiently extract structured data from websites at scale, focusing precisely on the content you need for your specific use case.
--- a/docs/md_v2/core/self-hosting.md
+++ b/docs/md_v2/core/self-hosting.md
@@ -67,13 +67,13 @@ Pull and run images directly from Docker Hub without building locally.

 #### 1. Pull the Image

-Our latest release is `0.7.6`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
+Our latest release is `0.8.0`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.

-> 💡 **Note**: The `latest` tag points to the stable `0.7.6` version.
+> 💡 **Note**: The `latest` tag points to the stable `0.8.0` version.

 ```bash
 # Pull the latest version
-docker pull unclecode/crawl4ai:0.7.6
+docker pull unclecode/crawl4ai:0.8.0

 # Or pull using the latest tag
 docker pull unclecode/crawl4ai:latest
@@ -145,7 +145,7 @@ docker stop crawl4ai && docker rm crawl4ai
 #### Docker Hub Versioning Explained

 *   **Image Name:** `unclecode/crawl4ai`
-*   **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.6`)
+*   **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.8.0`)
    *   `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
    *   `SUFFIX`: Optional tag for release candidates (``) and revisions (`r1`)
 *   **`latest` Tag:** Points to the most recent stable version
--- a/docs/md_v2/core/url-seeding.md
+++ b/docs/md_v2/core/url-seeding.md
@@ -255,6 +255,8 @@ The `SeedingConfig` object is your control panel. Here's everything you can conf
 | `scoring_method` | str | None | Scoring method (currently "bm25") |
 | `score_threshold` | float | None | Minimum score to include URL |
 | `filter_nonsense_urls` | bool | True | Filter out utility URLs (robots.txt, etc.) |
+| `cache_ttl_hours` | int | 24 | Hours before sitemap cache expires (0 = no TTL) |
+| `validate_sitemap_lastmod` | bool | True | Check sitemap's lastmod and refetch if newer |

 #### Pattern Matching Examples

@@ -968,10 +970,49 @@ config = SeedingConfig(
 The seeder automatically caches results to speed up repeated operations:

 - **Common Crawl cache**: `~/.crawl4ai/seeder_cache/[index]_[domain]_[hash].jsonl`
- **Sitemap cache**: `~/.crawl4ai/seeder_cache/sitemap_[domain]_[hash].jsonl`
+- **Sitemap cache**: `~/.crawl4ai/seeder_cache/sitemap_[domain]_[hash].json`
 - **HEAD data cache**: `~/.cache/url_seeder/head/[hash].json`

-Cache expires after 7 days by default. Use `force=True` to refresh.
+#### Smart TTL Cache for Sitemaps
+
+Sitemap caches now include intelligent validation:
+
+```python
+# Default: 24-hour TTL with lastmod validation
+config = SeedingConfig(
+    source="sitemap",
+    cache_ttl_hours=24,              # Cache expires after 24 hours
+    validate_sitemap_lastmod=True    # Also check if sitemap was updated
+)
+
+# Aggressive caching (1 week, no lastmod check)
+config = SeedingConfig(
+    source="sitemap",
+    cache_ttl_hours=168,             # 7 days
+    validate_sitemap_lastmod=False   # Trust TTL only
+)
+
+# Always validate (no TTL, only lastmod)
+config = SeedingConfig(
+    source="sitemap",
+    cache_ttl_hours=0,               # Disable TTL
+    validate_sitemap_lastmod=True    # Refetch if sitemap has newer lastmod
+)
+
+# Always fresh (bypass cache completely)
+config = SeedingConfig(
+    source="sitemap",
+    force=True                       # Ignore all caching
+)
+```
+
+**Cache validation priority:**
+1. `force=True` → Always refetch
+2. Cache doesn't exist → Fetch fresh
+3. `validate_sitemap_lastmod=True` and sitemap has newer `<lastmod>` → Refetch
+4. `cache_ttl_hours > 0` and cache is older than TTL → Refetch
+5. Cache corrupted → Refetch (automatic recovery)
+6. Otherwise → Use cache

 ### Pattern Matching Strategies

@@ -1060,6 +1101,9 @@ config = SeedingConfig(
 | Rate limit errors | Reduce `hits_per_sec` and `concurrency` |
 | Memory issues with large sites | Use `max_urls` to limit results, reduce `concurrency` |
 | Connection not closed | Use context manager or call `await seeder.close()` |
+| Stale/outdated URLs | Set `cache_ttl_hours=0` or use `force=True` |
+| Cache not updating | Check `validate_sitemap_lastmod=True`, or use `force=True` |
+| Incomplete URL list | Delete cache file and refetch, or use `force=True` |

 ### Performance Benchmarks

@@ -1119,6 +1163,7 @@ config = SeedingConfig(
 3. **Context Manager Support**: Automatic cleanup with `async with` statement
 4. **URL-Based Scoring**: Smart filtering even without head extraction
 5. **Smart URL Filtering**: Automatically excludes utility/nonsense URLs
-6. **Dual Caching**: Separate caches for URL lists and metadata
+6. **Smart TTL Cache**: Sitemap caches with TTL expiry and lastmod validation
+7. **Automatic Cache Recovery**: Corrupted or incomplete caches are automatically refreshed

-Now go forth and seed intelligently! 🌱🚀
+Now go forth and seed intelligently!
--- a/docs/md_v2/extraction/no-llm-strategies.md
+++ b/docs/md_v2/extraction/no-llm-strategies.md
@@ -716,6 +716,102 @@ strategy = JsonCssExtractionStrategy(css_schema)
   - Use OpenAI for production-quality schemas
   - Use Ollama for development, testing, or when you need a self-hosted solution

+### Multi-Sample Schema Generation
+
+When scraping multiple pages with varying DOM structures (e.g., product pages where table rows appear in different positions), single-sample schema generation may produce **fragile selectors** like `tr:nth-child(6)` that break on other pages.
+
+**The Problem:**
+```
+Page A: Manufacturer is in row 6  → selector: tr:nth-child(6) td a
+Page B: Manufacturer is in row 5  → selector FAILS
+Page C: Manufacturer is in row 7  → selector FAILS
+```
+
+**The Solution:** Provide multiple HTML samples so the LLM identifies stable patterns that work across all pages.
+
+```python
+from crawl4ai import JsonCssExtractionStrategy, LLMConfig
+
+# Collect HTML samples from different pages
+html_sample_1 = """
+<table class="specs">
+  <tr><td>Brand</td><td>Apple</td></tr>
+  <tr><td>Manufacturer</td><td><a href="/m/apple">Apple Inc</a></td></tr>
+</table>
+"""
+
+html_sample_2 = """
+<table class="specs">
+  <tr><td>Manufacturer</td><td><a href="/m/samsung">Samsung</a></td></tr>
+  <tr><td>Brand</td><td>Galaxy</td></tr>
+</table>
+"""
+
+html_sample_3 = """
+<table class="specs">
+  <tr><td>Model</td><td>Pixel 8</td></tr>
+  <tr><td>Brand</td><td>Google</td></tr>
+  <tr><td>Manufacturer</td><td><a href="/m/google">Google LLC</a></td></tr>
+</table>
+"""
+
+# Combine samples with labels
+combined_html = """
+## HTML Sample 1 (Product A):
+```html
+""" + html_sample_1 + """
+```
+
+## HTML Sample 2 (Product B):
+```html
+""" + html_sample_2 + """
+```
+
+## HTML Sample 3 (Product C):
+```html
+""" + html_sample_3 + """
+```
+"""
+
+# Provide instructions for stable selectors
+query = """
+IMPORTANT: I'm providing 3 HTML samples from different product pages.
+The manufacturer field appears in different row positions across pages.
+Generate selectors using stable attributes like href patterns (e.g., a[href*='/m/'])
+instead of fragile positional selectors like nth-child().
+Extract: manufacturer name and link.
+"""
+
+# Generate schema with multi-sample awareness
+schema = JsonCssExtractionStrategy.generate_schema(
+    html=combined_html,
+    query=query,
+    schema_type="CSS",
+    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-token")
+)
+
+# The generated schema will use stable selectors like:
+# a[href*="/m/"] instead of tr:nth-child(6) td a
+print(schema)
+```
+
+**Key Points for Multi-Sample Queries:**
+
+1. **Format samples clearly** - Use markdown headers and code blocks to separate samples
+2. **State the number of samples** - "I'm providing 3 HTML samples..."
+3. **Explain the variation** - "...the manufacturer field appears in different row positions"
+4. **Request stable selectors** - "Use href patterns, data attributes, or class names instead of nth-child"
+
+**Stable vs Fragile Selectors:**
+
+| Fragile (single sample) | Stable (multi-sample) |
+|------------------------|----------------------|
+| `tr:nth-child(6) td a` | `a[href*="/m/"]` |
+| `div:nth-child(3) .price` | `.price, [data-price]` |
+| `ul li:first-child` | `li[data-featured="true"]` |
+
+This approach lets you generate schemas once that work reliably across hundreds of similar pages with varying structures.
+
 ---

 ## 10. Conclusion
--- a/docs/migration/v0.8.0-upgrade-guide.md
+++ b/docs/migration/v0.8.0-upgrade-guide.md
@@ -0,0 +1,301 @@
+# Migration Guide: Upgrading to Crawl4AI v0.8.0
+
+This guide helps you upgrade from v0.7.x to v0.8.0, with special attention to breaking changes and security updates.
+
+## Quick Summary
+
+| Change | Impact | Action Required |
+|--------|--------|-----------------|
+| Hooks disabled by default | Docker API users with hooks | Set `CRAWL4AI_HOOKS_ENABLED=true` |
+| file:// URLs blocked | Docker API users reading local files | Use Python library directly |
+| Security fixes | All Docker API users | Update immediately |
+
+---
+
+## Step 1: Update the Package
+
+### PyPI Installation
+
+```bash
+pip install --upgrade crawl4ai
+```
+
+### Docker Installation
+
+```bash
+docker pull unclecode/crawl4ai:latest
+# or
+docker pull unclecode/crawl4ai:0.8.0
+```
+
+### From Source
+
+```bash
+git pull origin main
+pip install -e .
+```
+
+---
+
+## Step 2: Check for Breaking Changes
+
+### Are You Affected?
+
+**You ARE affected if you:**
+- Use the Docker API deployment
+- Use the `hooks` parameter in `/crawl` requests
+- Use `file://` URLs via API endpoints
+
+**You are NOT affected if you:**
+- Only use Crawl4AI as a Python library
+- Don't use hooks in your API calls
+- Don't use `file://` URLs via the API
+
+---
+
+## Step 3: Migrate Hooks Usage
+
+### Before v0.8.0
+
+Hooks worked by default:
+
+```bash
+# This worked without any configuration
+curl -X POST http://localhost:11235/crawl \
+  -H "Content-Type: application/json" \
+  -d '{
+    "urls": ["https://example.com"],
+    "hooks": {
+      "code": {
+        "on_page_context_created": "async def hook(page, context, **kwargs):\n    await context.add_cookies([...])\n    return page"
+      }
+    }
+  }'
+```
+
+### After v0.8.0
+
+You must explicitly enable hooks:
+
+**Option A: Environment Variable (Recommended)**
+```bash
+# In your Docker run command or docker-compose.yml
+export CRAWL4AI_HOOKS_ENABLED=true
+```
+
+```yaml
+# docker-compose.yml
+services:
+  crawl4ai:
+    image: unclecode/crawl4ai:0.8.0
+    environment:
+      - CRAWL4AI_HOOKS_ENABLED=true
+```
+
+**Option B: For Kubernetes**
+```yaml
+env:
+  - name: CRAWL4AI_HOOKS_ENABLED
+    value: "true"
+```
+
+### Security Warning
+
+Only enable hooks if:
+- You trust all users who can access the API
+- The API is not exposed to the public internet
+- You have other authentication/authorization in place
+
+---
+
+## Step 4: Migrate file:// URL Usage
+
+### Before v0.8.0
+
+```bash
+# This worked via API
+curl -X POST http://localhost:11235/execute_js \
+  -d '{"url": "file:///var/data/page.html", "scripts": ["document.title"]}'
+```
+
+### After v0.8.0
+
+**Option A: Use the Python Library Directly**
+
+```python
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+async def process_local_file():
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            url="file:///var/data/page.html",
+            config=CrawlerRunConfig(js_code=["document.title"])
+        )
+        return result
+```
+
+**Option B: Use raw: Protocol for HTML Content**
+
+If you have the HTML content, you can still use the API:
+
+```bash
+# Read file content and send as raw:
+HTML_CONTENT=$(cat /var/data/page.html)
+curl -X POST http://localhost:11235/html \
+  -H "Content-Type: application/json" \
+  -d "{\"url\": \"raw:$HTML_CONTENT\"}"
+```
+
+**Option C: Create a Preprocessing Service**
+
+```python
+# preprocessing_service.py
+from fastapi import FastAPI
+from crawl4ai import AsyncWebCrawler
+
+app = FastAPI()
+
+@app.post("/process-local")
+async def process_local(file_path: str):
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(url=f"file://{file_path}")
+        return result.model_dump()
+```
+
+---
+
+## Step 5: Review Security Configuration
+
+### Recommended Production Settings
+
+```yaml
+# config.yml
+security:
+  enabled: true
+  jwt_enabled: true
+  https_redirect: true  # If behind HTTPS proxy
+  trusted_hosts:
+    - "your-domain.com"
+    - "api.your-domain.com"
+```
+
+### Environment Variables
+
+```bash
+# Required for JWT authentication
+export SECRET_KEY="your-secure-random-key-minimum-32-characters"
+
+# Only if you need hooks
+export CRAWL4AI_HOOKS_ENABLED=true
+```
+
+### Generate a Secure Secret Key
+
+```python
+import secrets
+print(secrets.token_urlsafe(32))
+```
+
+---
+
+## Step 6: Test Your Integration
+
+### Quick Validation Script
+
+```python
+import asyncio
+import aiohttp
+
+async def test_upgrade():
+    base_url = "http://localhost:11235"
+
+    # Test 1: Basic crawl should work
+    async with aiohttp.ClientSession() as session:
+        async with session.post(
+            f"{base_url}/crawl",
+            json={"urls": ["https://example.com"]}
+        ) as resp:
+            assert resp.status == 200, "Basic crawl failed"
+            print("✓ Basic crawl works")
+
+    # Test 2: Hooks should be blocked (unless enabled)
+    async with aiohttp.ClientSession() as session:
+        async with session.post(
+            f"{base_url}/crawl",
+            json={
+                "urls": ["https://example.com"],
+                "hooks": {"code": {"on_page_context_created": "async def hook(page, context, **kwargs): return page"}}
+            }
+        ) as resp:
+            if resp.status == 403:
+                print("✓ Hooks correctly blocked (default)")
+            elif resp.status == 200:
+                print("! Hooks enabled - ensure this is intentional")
+
+    # Test 3: file:// should be blocked
+    async with aiohttp.ClientSession() as session:
+        async with session.post(
+            f"{base_url}/execute_js",
+            json={"url": "file:///etc/passwd", "scripts": ["1"]}
+        ) as resp:
+            assert resp.status == 400, "file:// should be blocked"
+            print("✓ file:// URLs correctly blocked")
+
+asyncio.run(test_upgrade())
+```
+
+---
+
+## Troubleshooting
+
+### "Hooks are disabled" Error
+
+**Symptom**: API returns 403 with "Hooks are disabled"
+
+**Solution**: Set `CRAWL4AI_HOOKS_ENABLED=true` if you need hooks
+
+### "URL must start with http://, https://" Error
+
+**Symptom**: API returns 400 when using `file://` URLs
+
+**Solution**: Use Python library directly or `raw:` protocol
+
+### Authentication Errors After Enabling JWT
+
+**Symptom**: API returns 401 Unauthorized
+
+**Solution**:
+1. Get a token: `POST /token` with your email
+2. Include token in requests: `Authorization: Bearer <token>`
+
+---
+
+## Rollback Plan
+
+If you need to rollback:
+
+```bash
+# PyPI
+pip install crawl4ai==0.7.6
+
+# Docker
+docker pull unclecode/crawl4ai:0.7.6
+```
+
+**Warning**: Rolling back re-exposes the security vulnerabilities. Only do this temporarily while fixing integration issues.
+
+---
+
+## Getting Help
+
+- **GitHub Issues**: [github.com/unclecode/crawl4ai/issues](https://github.com/unclecode/crawl4ai/issues)
+- **Security Issues**: See [SECURITY.md](../../SECURITY.md)
+- **Documentation**: [docs.crawl4ai.com](https://docs.crawl4ai.com)
+
+---
+
+## Changelog Reference
+
+For complete list of changes, see:
+- [Release Notes v0.8.0](../RELEASE_NOTES_v0.8.0.md)
+- [CHANGELOG.md](../../CHANGELOG.md)
--- a/docs/releases_review/demo_v0.8.0.py
+++ b/docs/releases_review/demo_v0.8.0.py
@@ -0,0 +1,633 @@
+#!/usr/bin/env python3
+"""
+Crawl4AI v0.8.0 Release Demo - Feature Verification Tests
+==========================================================
+
+This demo ACTUALLY RUNS and VERIFIES the new features in v0.8.0.
+Each test executes real code and validates the feature is working.
+
+New Features Verified:
+1. Crash Recovery - on_state_change callback for real-time state persistence
+2. Crash Recovery - resume_state for resuming from checkpoint
+3. Crash Recovery - State is JSON serializable
+4. Prefetch Mode - Returns HTML and links only
+5. Prefetch Mode - Skips heavy processing (markdown, extraction)
+6. Prefetch Mode - Two-phase crawl pattern
+7. Security - Hooks disabled by default (Docker API)
+
+Breaking Changes in v0.8.0:
+- Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED=false)
+- file:// URLs blocked on Docker API endpoints
+
+Usage:
+    python docs/releases_review/demo_v0.8.0.py
+"""
+
+import asyncio
+import json
+import sys
+import time
+from typing import Dict, Any, List, Optional
+from dataclasses import dataclass
+
+
+# Test results tracking
+@dataclass
+class TestResult:
+    name: str
+    feature: str
+    passed: bool
+    message: str
+    skipped: bool = False
+
+
+results: list[TestResult] = []
+
+
+def print_header(title: str):
+    print(f"\n{'=' * 70}")
+    print(f"{title}")
+    print(f"{'=' * 70}")
+
+
+def print_test(name: str, feature: str):
+    print(f"\n[TEST] {name} ({feature})")
+    print("-" * 50)
+
+
+def record_result(name: str, feature: str, passed: bool, message: str, skipped: bool = False):
+    results.append(TestResult(name, feature, passed, message, skipped))
+    if skipped:
+        print(f"  SKIPPED: {message}")
+    elif passed:
+        print(f"  PASSED: {message}")
+    else:
+        print(f"  FAILED: {message}")
+
+
+# =============================================================================
+# TEST 1: Crash Recovery - State Capture with on_state_change
+# =============================================================================
+async def test_crash_recovery_state_capture():
+    """
+    Verify on_state_change callback is called after each URL is processed.
+
+    NEW in v0.8.0: Deep crawl strategies support on_state_change callback
+    for real-time state persistence (useful for cloud deployments).
+    """
+    print_test("Crash Recovery - State Capture", "on_state_change")
+
+    try:
+        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+        from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+        captured_states: List[Dict[str, Any]] = []
+
+        async def capture_state(state: Dict[str, Any]):
+            """Callback that fires after each URL is processed."""
+            captured_states.append(state.copy())
+
+        strategy = BFSDeepCrawlStrategy(
+            max_depth=1,
+            max_pages=3,
+            on_state_change=capture_state,
+        )
+
+        config = CrawlerRunConfig(
+            deep_crawl_strategy=strategy,
+            verbose=False,
+        )
+
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            await crawler.arun("https://books.toscrape.com", config=config)
+
+        # Verify states were captured
+        if len(captured_states) == 0:
+            record_result("State Capture", "on_state_change", False,
+                         "No states captured - callback not called")
+            return
+
+        # Verify callback was called for each page
+        pages_crawled = captured_states[-1].get("pages_crawled", 0)
+        if pages_crawled != len(captured_states):
+            record_result("State Capture", "on_state_change", False,
+                         f"Callback count {len(captured_states)} != pages_crawled {pages_crawled}")
+            return
+
+        record_result("State Capture", "on_state_change", True,
+                     f"Callback fired {len(captured_states)} times (once per URL)")
+
+    except Exception as e:
+        record_result("State Capture", "on_state_change", False, f"Exception: {e}")
+
+
+# =============================================================================
+# TEST 2: Crash Recovery - Resume from Checkpoint
+# =============================================================================
+async def test_crash_recovery_resume():
+    """
+    Verify crawl can resume from a saved checkpoint without re-crawling visited URLs.
+
+    NEW in v0.8.0: BFSDeepCrawlStrategy accepts resume_state parameter
+    to continue from a previously saved checkpoint.
+    """
+    print_test("Crash Recovery - Resume from Checkpoint", "resume_state")
+
+    try:
+        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+        from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+        # Phase 1: Start crawl and capture state after 2 pages
+        crash_after = 2
+        captured_states: List[Dict] = []
+        phase1_urls: List[str] = []
+
+        async def capture_until_crash(state: Dict[str, Any]):
+            captured_states.append(state.copy())
+            phase1_urls.clear()
+            phase1_urls.extend(state["visited"])
+            if state["pages_crawled"] >= crash_after:
+                raise Exception("Simulated crash")
+
+        strategy1 = BFSDeepCrawlStrategy(
+            max_depth=1,
+            max_pages=5,
+            on_state_change=capture_until_crash,
+        )
+
+        config1 = CrawlerRunConfig(
+            deep_crawl_strategy=strategy1,
+            verbose=False,
+        )
+
+        # Run until "crash"
+        try:
+            async with AsyncWebCrawler(verbose=False) as crawler:
+                await crawler.arun("https://books.toscrape.com", config=config1)
+        except Exception:
+            pass  # Expected crash
+
+        if not captured_states:
+            record_result("Resume from Checkpoint", "resume_state", False,
+                         "No state captured before crash")
+            return
+
+        saved_state = captured_states[-1]
+        print(f"  Phase 1: Crawled {len(phase1_urls)} URLs before crash")
+
+        # Phase 2: Resume from checkpoint
+        phase2_urls: List[str] = []
+
+        async def track_phase2(state: Dict[str, Any]):
+            new_urls = set(state["visited"]) - set(saved_state["visited"])
+            for url in new_urls:
+                if url not in phase2_urls:
+                    phase2_urls.append(url)
+
+        strategy2 = BFSDeepCrawlStrategy(
+            max_depth=1,
+            max_pages=5,
+            resume_state=saved_state,  # Resume from checkpoint!
+            on_state_change=track_phase2,
+        )
+
+        config2 = CrawlerRunConfig(
+            deep_crawl_strategy=strategy2,
+            verbose=False,
+        )
+
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            await crawler.arun("https://books.toscrape.com", config=config2)
+
+        print(f"  Phase 2: Crawled {len(phase2_urls)} new URLs after resume")
+
+        # Verify no duplicates
+        duplicates = set(phase2_urls) & set(phase1_urls)
+        if duplicates:
+            record_result("Resume from Checkpoint", "resume_state", False,
+                         f"Re-crawled {len(duplicates)} URLs: {list(duplicates)[:2]}")
+            return
+
+        record_result("Resume from Checkpoint", "resume_state", True,
+                     f"Resumed successfully, no duplicate crawls")
+
+    except Exception as e:
+        record_result("Resume from Checkpoint", "resume_state", False, f"Exception: {e}")
+
+
+# =============================================================================
+# TEST 3: Crash Recovery - State is JSON Serializable
+# =============================================================================
+async def test_crash_recovery_json_serializable():
+    """
+    Verify the state dictionary can be serialized to JSON (for Redis/DB storage).
+
+    NEW in v0.8.0: State dictionary is designed to be JSON-serializable
+    for easy storage in Redis, databases, or files.
+    """
+    print_test("Crash Recovery - JSON Serializable", "State Structure")
+
+    try:
+        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+        from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+        captured_state: Optional[Dict] = None
+
+        async def capture_state(state: Dict[str, Any]):
+            nonlocal captured_state
+            captured_state = state
+
+        strategy = BFSDeepCrawlStrategy(
+            max_depth=1,
+            max_pages=2,
+            on_state_change=capture_state,
+        )
+
+        config = CrawlerRunConfig(
+            deep_crawl_strategy=strategy,
+            verbose=False,
+        )
+
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            await crawler.arun("https://books.toscrape.com", config=config)
+
+        if not captured_state:
+            record_result("JSON Serializable", "State Structure", False,
+                         "No state captured")
+            return
+
+        # Test JSON serialization round-trip
+        try:
+            json_str = json.dumps(captured_state)
+            restored = json.loads(json_str)
+        except (TypeError, json.JSONDecodeError) as e:
+            record_result("JSON Serializable", "State Structure", False,
+                         f"JSON serialization failed: {e}")
+            return
+
+        # Verify state structure
+        required_fields = ["strategy_type", "visited", "pending", "depths", "pages_crawled"]
+        missing = [f for f in required_fields if f not in restored]
+        if missing:
+            record_result("JSON Serializable", "State Structure", False,
+                         f"Missing fields: {missing}")
+            return
+
+        # Verify types
+        if not isinstance(restored["visited"], list):
+            record_result("JSON Serializable", "State Structure", False,
+                         "visited is not a list")
+            return
+
+        if not isinstance(restored["pages_crawled"], int):
+            record_result("JSON Serializable", "State Structure", False,
+                         "pages_crawled is not an int")
+            return
+
+        record_result("JSON Serializable", "State Structure", True,
+                     f"State serializes to {len(json_str)} bytes, all fields present")
+
+    except Exception as e:
+        record_result("JSON Serializable", "State Structure", False, f"Exception: {e}")
+
+
+# =============================================================================
+# TEST 4: Prefetch Mode - Returns HTML and Links Only
+# =============================================================================
+async def test_prefetch_returns_html_links():
+    """
+    Verify prefetch mode returns HTML and links but skips markdown generation.
+
+    NEW in v0.8.0: CrawlerRunConfig accepts prefetch=True for fast
+    URL discovery without heavy processing.
+    """
+    print_test("Prefetch Mode - HTML and Links", "prefetch=True")
+
+    try:
+        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+        config = CrawlerRunConfig(prefetch=True)
+
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            result = await crawler.arun("https://books.toscrape.com", config=config)
+
+        # Verify HTML is present
+        if not result.html or len(result.html) < 100:
+            record_result("Prefetch HTML/Links", "prefetch=True", False,
+                         "HTML not returned or too short")
+            return
+
+        # Verify links are present
+        if not result.links:
+            record_result("Prefetch HTML/Links", "prefetch=True", False,
+                         "Links not returned")
+            return
+
+        internal_count = len(result.links.get("internal", []))
+        external_count = len(result.links.get("external", []))
+
+        if internal_count == 0:
+            record_result("Prefetch HTML/Links", "prefetch=True", False,
+                         "No internal links extracted")
+            return
+
+        record_result("Prefetch HTML/Links", "prefetch=True", True,
+                     f"HTML: {len(result.html)} chars, Links: {internal_count} internal, {external_count} external")
+
+    except Exception as e:
+        record_result("Prefetch HTML/Links", "prefetch=True", False, f"Exception: {e}")
+
+
+# =============================================================================
+# TEST 5: Prefetch Mode - Skips Heavy Processing
+# =============================================================================
+async def test_prefetch_skips_processing():
+    """
+    Verify prefetch mode skips markdown generation and content extraction.
+
+    NEW in v0.8.0: prefetch=True skips markdown generation, content scraping,
+    media extraction, and LLM extraction for maximum speed.
+    """
+    print_test("Prefetch Mode - Skips Processing", "prefetch=True")
+
+    try:
+        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+        config = CrawlerRunConfig(prefetch=True)
+
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            result = await crawler.arun("https://books.toscrape.com", config=config)
+
+        # Check that heavy processing was skipped
+        checks = []
+
+        # Markdown should be None or empty
+        if result.markdown is None:
+            checks.append("markdown=None")
+        elif hasattr(result.markdown, 'raw_markdown') and result.markdown.raw_markdown is None:
+            checks.append("raw_markdown=None")
+        else:
+            record_result("Prefetch Skips Processing", "prefetch=True", False,
+                         f"Markdown was generated (should be skipped)")
+            return
+
+        # cleaned_html should be None
+        if result.cleaned_html is None:
+            checks.append("cleaned_html=None")
+        else:
+            record_result("Prefetch Skips Processing", "prefetch=True", False,
+                         "cleaned_html was generated (should be skipped)")
+            return
+
+        # extracted_content should be None
+        if result.extracted_content is None:
+            checks.append("extracted_content=None")
+
+        record_result("Prefetch Skips Processing", "prefetch=True", True,
+                     f"Heavy processing skipped: {', '.join(checks)}")
+
+    except Exception as e:
+        record_result("Prefetch Skips Processing", "prefetch=True", False, f"Exception: {e}")
+
+
+# =============================================================================
+# TEST 6: Prefetch Mode - Two-Phase Crawl Pattern
+# =============================================================================
+async def test_prefetch_two_phase():
+    """
+    Verify the two-phase crawl pattern: prefetch for discovery, then full processing.
+
+    NEW in v0.8.0: Prefetch mode enables efficient two-phase crawling where
+    you discover URLs quickly, then selectively process important ones.
+    """
+    print_test("Prefetch Mode - Two-Phase Crawl", "Two-Phase Pattern")
+
+    try:
+        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            # Phase 1: Fast discovery with prefetch
+            prefetch_config = CrawlerRunConfig(prefetch=True)
+
+            start = time.time()
+            discovery = await crawler.arun("https://books.toscrape.com", config=prefetch_config)
+            prefetch_time = time.time() - start
+
+            all_urls = [link["href"] for link in discovery.links.get("internal", [])]
+
+            # Filter to specific pages (e.g., book detail pages)
+            book_urls = [
+                url for url in all_urls
+                if "catalogue/" in url and "category/" not in url
+            ][:2]  # Just 2 for demo
+
+            print(f"  Phase 1: Found {len(all_urls)} URLs in {prefetch_time:.2f}s")
+            print(f"  Filtered to {len(book_urls)} book pages for full processing")
+
+            if len(book_urls) == 0:
+                record_result("Two-Phase Crawl", "Two-Phase Pattern", False,
+                             "No book URLs found to process")
+                return
+
+            # Phase 2: Full processing on selected URLs
+            full_config = CrawlerRunConfig()  # Normal mode
+
+            start = time.time()
+            processed = []
+            for url in book_urls:
+                result = await crawler.arun(url, config=full_config)
+                if result.success and result.markdown:
+                    processed.append(result)
+
+            full_time = time.time() - start
+
+            print(f"  Phase 2: Processed {len(processed)} pages in {full_time:.2f}s")
+
+            if len(processed) == 0:
+                record_result("Two-Phase Crawl", "Two-Phase Pattern", False,
+                             "No pages successfully processed in phase 2")
+                return
+
+            # Verify full processing includes markdown
+            if not processed[0].markdown or not processed[0].markdown.raw_markdown:
+                record_result("Two-Phase Crawl", "Two-Phase Pattern", False,
+                             "Full processing did not generate markdown")
+                return
+
+            record_result("Two-Phase Crawl", "Two-Phase Pattern", True,
+                         f"Discovered {len(all_urls)} URLs (prefetch), processed {len(processed)} (full)")
+
+    except Exception as e:
+        record_result("Two-Phase Crawl", "Two-Phase Pattern", False, f"Exception: {e}")
+
+
+# =============================================================================
+# TEST 7: Security - Hooks Disabled by Default
+# =============================================================================
+async def test_security_hooks_disabled():
+    """
+    Verify hooks are disabled by default in Docker API for security.
+
+    NEW in v0.8.0: Docker API hooks are disabled by default to prevent
+    Remote Code Execution. Set CRAWL4AI_HOOKS_ENABLED=true to enable.
+    """
+    print_test("Security - Hooks Disabled", "CRAWL4AI_HOOKS_ENABLED")
+
+    try:
+        import os
+
+        # Check the default environment variable
+        hooks_enabled = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower()
+
+        if hooks_enabled == "true":
+            record_result("Hooks Disabled Default", "Security", True,
+                         "CRAWL4AI_HOOKS_ENABLED is explicitly set to 'true' (user override)",
+                         skipped=True)
+            return
+
+        # Verify default is "false"
+        if hooks_enabled == "false":
+            record_result("Hooks Disabled Default", "Security", True,
+                         "Hooks disabled by default (CRAWL4AI_HOOKS_ENABLED=false)")
+        else:
+            record_result("Hooks Disabled Default", "Security", True,
+                         f"CRAWL4AI_HOOKS_ENABLED='{hooks_enabled}' (not 'true', hooks disabled)")
+
+    except Exception as e:
+        record_result("Hooks Disabled Default", "Security", False, f"Exception: {e}")
+
+
+# =============================================================================
+# TEST 8: Comprehensive Crawl Test
+# =============================================================================
+async def test_comprehensive_crawl():
+    """
+    Run a comprehensive crawl to verify overall stability with new features.
+    """
+    print_test("Comprehensive Crawl Test", "Overall")
+
+    try:
+        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
+
+        async with AsyncWebCrawler(config=BrowserConfig(headless=True), verbose=False) as crawler:
+            result = await crawler.arun(
+                url="https://httpbin.org/html",
+                config=CrawlerRunConfig()
+            )
+
+        checks = []
+
+        if result.success:
+            checks.append("success=True")
+        else:
+            record_result("Comprehensive Crawl", "Overall", False,
+                         f"Crawl failed: {result.error_message}")
+            return
+
+        if result.html and len(result.html) > 100:
+            checks.append(f"html={len(result.html)} chars")
+
+        if result.markdown and result.markdown.raw_markdown:
+            checks.append(f"markdown={len(result.markdown.raw_markdown)} chars")
+
+        if result.links:
+            total_links = len(result.links.get("internal", [])) + len(result.links.get("external", []))
+            checks.append(f"links={total_links}")
+
+        record_result("Comprehensive Crawl", "Overall", True,
+                     f"All checks passed: {', '.join(checks)}")
+
+    except Exception as e:
+        record_result("Comprehensive Crawl", "Overall", False, f"Exception: {e}")
+
+
+# =============================================================================
+# MAIN
+# =============================================================================
+
+def print_summary():
+    """Print test results summary"""
+    print_header("TEST RESULTS SUMMARY")
+
+    passed = sum(1 for r in results if r.passed and not r.skipped)
+    failed = sum(1 for r in results if not r.passed and not r.skipped)
+    skipped = sum(1 for r in results if r.skipped)
+
+    print(f"\nTotal: {len(results)} tests")
+    print(f"  Passed:  {passed}")
+    print(f"  Failed:  {failed}")
+    print(f"  Skipped: {skipped}")
+
+    if failed > 0:
+        print("\nFailed Tests:")
+        for r in results:
+            if not r.passed and not r.skipped:
+                print(f"  - {r.name} ({r.feature}): {r.message}")
+
+    if skipped > 0:
+        print("\nSkipped Tests:")
+        for r in results:
+            if r.skipped:
+                print(f"  - {r.name} ({r.feature}): {r.message}")
+
+    print("\n" + "=" * 70)
+    if failed == 0:
+        print("All tests passed! v0.8.0 features verified.")
+    else:
+        print(f"WARNING: {failed} test(s) failed!")
+    print("=" * 70)
+
+    return failed == 0
+
+
+async def main():
+    """Run all verification tests"""
+    print_header("Crawl4AI v0.8.0 - Feature Verification Tests")
+    print("Running actual tests to verify new features...")
+    print("\nKey Features in v0.8.0:")
+    print("  - Crash Recovery for Deep Crawl (resume_state, on_state_change)")
+    print("  - Prefetch Mode for Fast URL Discovery (prefetch=True)")
+    print("  - Security: Hooks disabled by default on Docker API")
+
+    # Run all tests
+    tests = [
+        test_crash_recovery_state_capture,      # on_state_change
+        test_crash_recovery_resume,             # resume_state
+        test_crash_recovery_json_serializable,  # State structure
+        test_prefetch_returns_html_links,       # prefetch=True basics
+        test_prefetch_skips_processing,         # prefetch skips heavy work
+        test_prefetch_two_phase,                # Two-phase pattern
+        test_security_hooks_disabled,           # Security check
+        test_comprehensive_crawl,               # Overall stability
+    ]
+
+    for test_func in tests:
+        try:
+            await test_func()
+        except Exception as e:
+            print(f"\nTest {test_func.__name__} crashed: {e}")
+            results.append(TestResult(
+                test_func.__name__,
+                "Unknown",
+                False,
+                f"Crashed: {e}"
+            ))
+
+    # Print summary
+    all_passed = print_summary()
+
+    return 0 if all_passed else 1
+
+
+if __name__ == "__main__":
+    try:
+        exit_code = asyncio.run(main())
+        sys.exit(exit_code)
+    except KeyboardInterrupt:
+        print("\n\nTests interrupted by user.")
+        sys.exit(1)
+    except Exception as e:
+        print(f"\n\nTest suite failed: {e}")
+        import traceback
+        traceback.print_exc()
+        sys.exit(1)
--- a/docs/security/GHSA-DRAFT-RCE-LFI.md
+++ b/docs/security/GHSA-DRAFT-RCE-LFI.md
@@ -0,0 +1,171 @@
+# GitHub Security Advisory Draft
+
+> **Instructions**: Copy this content to create security advisories at:
+> https://github.com/unclecode/crawl4ai/security/advisories/new
+
+---
+
+## Advisory 1: Remote Code Execution via Hooks Parameter
+
+### Title
+Remote Code Execution in Docker API via Hooks Parameter
+
+### Severity
+Critical
+
+### CVSS Score
+10.0 (CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H)
+
+### CWE
+CWE-94 (Improper Control of Generation of Code)
+
+### Package
+crawl4ai (Docker API deployment)
+
+### Affected Versions
+< 0.8.0
+
+### Patched Versions
+0.8.0
+
+### Description
+
+A critical remote code execution vulnerability exists in the Crawl4AI Docker API deployment. The `/crawl` endpoint accepts a `hooks` parameter containing Python code that is executed using `exec()`. The `__import__` builtin was included in the allowed builtins, allowing attackers to import arbitrary modules and execute system commands.
+
+**Attack Vector:**
+```json
+POST /crawl
+{
+  "urls": ["https://example.com"],
+  "hooks": {
+    "code": {
+      "on_page_context_created": "async def hook(page, context, **kwargs):\n    __import__('os').system('malicious_command')\n    return page"
+    }
+  }
+}
+```
+
+### Impact
+
+An unauthenticated attacker can:
+- Execute arbitrary system commands
+- Read/write files on the server
+- Exfiltrate sensitive data (environment variables, API keys)
+- Pivot to internal network services
+- Completely compromise the server
+
+### Mitigation
+
+1. **Upgrade to v0.8.0** (recommended)
+2. If unable to upgrade immediately:
+   - Disable the Docker API
+   - Block `/crawl` endpoint at network level
+   - Add authentication to the API
+
+### Fix Details
+
+1. Removed `__import__` from `allowed_builtins` in `hook_manager.py`
+2. Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
+3. Users must explicitly opt-in to enable hooks
+
+### Credits
+
+Discovered by Neo by ProjectDiscovery (https://projectdiscovery.io)
+
+### References
+
+- [Release Notes v0.8.0](https://github.com/unclecode/crawl4ai/blob/main/docs/RELEASE_NOTES_v0.8.0.md)
+- [Migration Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/migration/v0.8.0-upgrade-guide.md)
+
+---
+
+## Advisory 2: Local File Inclusion via file:// URLs
+
+### Title
+Local File Inclusion in Docker API via file:// URLs
+
+### Severity
+High
+
+### CVSS Score
+8.6 (CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:N/A:N)
+
+### CWE
+CWE-22 (Improper Limitation of a Pathname to a Restricted Directory)
+
+### Package
+crawl4ai (Docker API deployment)
+
+### Affected Versions
+< 0.8.0
+
+### Patched Versions
+0.8.0
+
+### Description
+
+A local file inclusion vulnerability exists in the Crawl4AI Docker API. The `/execute_js`, `/screenshot`, `/pdf`, and `/html` endpoints accept `file://` URLs, allowing attackers to read arbitrary files from the server filesystem.
+
+**Attack Vector:**
+```json
+POST /execute_js
+{
+  "url": "file:///etc/passwd",
+  "scripts": ["document.body.innerText"]
+}
+```
+
+### Impact
+
+An unauthenticated attacker can:
+- Read sensitive files (`/etc/passwd`, `/etc/shadow`, application configs)
+- Access environment variables via `/proc/self/environ`
+- Discover internal application structure
+- Potentially read credentials and API keys
+
+### Mitigation
+
+1. **Upgrade to v0.8.0** (recommended)
+2. If unable to upgrade immediately:
+   - Disable the Docker API
+   - Add authentication to the API
+   - Use network-level filtering
+
+### Fix Details
+
+Added URL scheme validation to block:
+- `file://` URLs
+- `javascript:` URLs
+- `data:` URLs
+- Other non-HTTP schemes
+
+Only `http://`, `https://`, and `raw:` URLs are now allowed.
+
+### Credits
+
+Discovered by Neo by ProjectDiscovery (https://projectdiscovery.io)
+
+### References
+
+- [Release Notes v0.8.0](https://github.com/unclecode/crawl4ai/blob/main/docs/RELEASE_NOTES_v0.8.0.md)
+- [Migration Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/migration/v0.8.0-upgrade-guide.md)
+
+---
+
+## Creating the Advisories on GitHub
+
+1. Go to: https://github.com/unclecode/crawl4ai/security/advisories/new
+
+2. Fill in the form for each advisory:
+   - **Ecosystem**: PyPI
+   - **Package name**: crawl4ai
+   - **Affected versions**: < 0.8.0
+   - **Patched versions**: 0.8.0
+   - **Severity**: Critical (for RCE), High (for LFI)
+
+3. After creating, GitHub will:
+   - Assign a GHSA ID
+   - Optionally request a CVE
+   - Notify users who have security alerts enabled
+
+4. Coordinate disclosure timing with the fix release
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -1,4 +1,4 @@
-site_name: Crawl4AI Documentation (v0.7.x)
+site_name: Crawl4AI Documentation (v0.8.x)
 site_description:  🚀🤖 Crawl4AI, Open-source LLM-Friendly Web Crawler & Scraper
 site_url: https://docs.crawl4ai.com
 repo_url: https://github.com/unclecode/crawl4ai
--- a/tests/browser/test_browser_context_id.py
+++ b/tests/browser/test_browser_context_id.py
@@ -0,0 +1,489 @@
+"""Test for browser_context_id and target_id parameters.
+
+These tests verify that Crawl4AI can connect to and use pre-created
+browser contexts, which is essential for cloud browser services that
+pre-create isolated contexts for each user.
+
+The flow being tested:
+1. Start a browser with CDP
+2. Create a context via raw CDP commands (simulating cloud service)
+3. Create a page/target in that context
+4. Have Crawl4AI connect using browser_context_id and target_id
+5. Verify Crawl4AI uses the existing context/page instead of creating new ones
+"""
+
+import asyncio
+import json
+import os
+import sys
+import websockets
+
+# Add the project root to Python path if running directly
+if __name__ == "__main__":
+    sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
+
+from crawl4ai.browser_manager import BrowserManager, ManagedBrowser
+from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
+from crawl4ai.async_logger import AsyncLogger
+
+# Create a logger for clear terminal output
+logger = AsyncLogger(verbose=True, log_file=None)
+
+
+class CDPContextCreator:
+    """
+    Helper class to create browser contexts via raw CDP commands.
+    This simulates what a cloud browser service would do.
+    """
+
+    def __init__(self, cdp_url: str):
+        self.cdp_url = cdp_url
+        self._message_id = 0
+        self._ws = None
+        self._pending_responses = {}
+        self._receiver_task = None
+
+    async def connect(self):
+        """Establish WebSocket connection to browser."""
+        # Convert HTTP URL to WebSocket URL if needed
+        ws_url = self.cdp_url.replace("http://", "ws://").replace("https://", "wss://")
+        if not ws_url.endswith("/devtools/browser"):
+            # Get the browser websocket URL from /json/version
+            import aiohttp
+            async with aiohttp.ClientSession() as session:
+                async with session.get(f"{self.cdp_url}/json/version") as response:
+                    data = await response.json()
+                    ws_url = data.get("webSocketDebuggerUrl", ws_url)
+
+        self._ws = await websockets.connect(ws_url, max_size=None, ping_interval=None)
+        self._receiver_task = asyncio.create_task(self._receive_messages())
+        logger.info(f"Connected to CDP at {ws_url}", tag="CDP")
+
+    async def disconnect(self):
+        """Close WebSocket connection."""
+        if self._receiver_task:
+            self._receiver_task.cancel()
+            try:
+                await self._receiver_task
+            except asyncio.CancelledError:
+                pass
+        if self._ws:
+            await self._ws.close()
+            self._ws = None
+
+    async def _receive_messages(self):
+        """Background task to receive CDP messages."""
+        try:
+            async for message in self._ws:
+                data = json.loads(message)
+                msg_id = data.get('id')
+                if msg_id is not None and msg_id in self._pending_responses:
+                    self._pending_responses[msg_id].set_result(data)
+        except asyncio.CancelledError:
+            pass
+        except Exception as e:
+            logger.error(f"CDP receiver error: {e}", tag="CDP")
+
+    async def _send_command(self, method: str, params: dict = None) -> dict:
+        """Send CDP command and wait for response."""
+        self._message_id += 1
+        msg_id = self._message_id
+
+        message = {
+            "id": msg_id,
+            "method": method,
+            "params": params or {}
+        }
+
+        future = asyncio.get_event_loop().create_future()
+        self._pending_responses[msg_id] = future
+
+        try:
+            await self._ws.send(json.dumps(message))
+            response = await asyncio.wait_for(future, timeout=30.0)
+
+            if 'error' in response:
+                raise Exception(f"CDP error: {response['error']}")
+
+            return response.get('result', {})
+        finally:
+            self._pending_responses.pop(msg_id, None)
+
+    async def create_context(self) -> dict:
+        """
+        Create an isolated browser context with a blank page.
+
+        Returns:
+            dict with browser_context_id, target_id, and cdp_session_id
+        """
+        await self.connect()
+
+        # 1. Create isolated browser context
+        result = await self._send_command("Target.createBrowserContext", {
+            "disposeOnDetach": False  # Keep context alive
+        })
+        browser_context_id = result["browserContextId"]
+        logger.info(f"Created browser context: {browser_context_id}", tag="CDP")
+
+        # 2. Create a new page (target) in the context
+        result = await self._send_command("Target.createTarget", {
+            "url": "about:blank",
+            "browserContextId": browser_context_id
+        })
+        target_id = result["targetId"]
+        logger.info(f"Created target: {target_id}", tag="CDP")
+
+        # 3. Attach to the target to get a session ID
+        result = await self._send_command("Target.attachToTarget", {
+            "targetId": target_id,
+            "flatten": True
+        })
+        cdp_session_id = result["sessionId"]
+        logger.info(f"Attached to target, sessionId: {cdp_session_id}", tag="CDP")
+
+        return {
+            "browser_context_id": browser_context_id,
+            "target_id": target_id,
+            "cdp_session_id": cdp_session_id
+        }
+
+    async def get_targets(self) -> list:
+        """Get list of all targets in the browser."""
+        result = await self._send_command("Target.getTargets")
+        return result.get("targetInfos", [])
+
+    async def dispose_context(self, browser_context_id: str):
+        """Dispose of a browser context."""
+        try:
+            await self._send_command("Target.disposeBrowserContext", {
+                "browserContextId": browser_context_id
+            })
+            logger.info(f"Disposed browser context: {browser_context_id}", tag="CDP")
+        except Exception as e:
+            logger.warning(f"Error disposing context: {e}", tag="CDP")
+
+
+async def test_browser_context_id_basic():
+    """
+    Test that BrowserConfig accepts browser_context_id and target_id parameters.
+    """
+    logger.info("Testing BrowserConfig browser_context_id parameter", tag="TEST")
+
+    try:
+        # Test that BrowserConfig accepts the new parameters
+        config = BrowserConfig(
+            cdp_url="http://localhost:9222",
+            browser_context_id="test-context-id",
+            target_id="test-target-id",
+            headless=True
+        )
+
+        # Verify parameters are set correctly
+        assert config.browser_context_id == "test-context-id", "browser_context_id not set"
+        assert config.target_id == "test-target-id", "target_id not set"
+
+        # Test from_kwargs
+        config2 = BrowserConfig.from_kwargs({
+            "cdp_url": "http://localhost:9222",
+            "browser_context_id": "test-context-id-2",
+            "target_id": "test-target-id-2"
+        })
+
+        assert config2.browser_context_id == "test-context-id-2", "browser_context_id not set via from_kwargs"
+        assert config2.target_id == "test-target-id-2", "target_id not set via from_kwargs"
+
+        # Test to_dict
+        config_dict = config.to_dict()
+        assert config_dict.get("browser_context_id") == "test-context-id", "browser_context_id not in to_dict"
+        assert config_dict.get("target_id") == "test-target-id", "target_id not in to_dict"
+
+        logger.success("BrowserConfig browser_context_id test passed", tag="TEST")
+        return True
+
+    except Exception as e:
+        logger.error(f"Test failed: {str(e)}", tag="TEST")
+        return False
+
+
+async def test_pre_created_context_usage():
+    """
+    Test that Crawl4AI uses a pre-created browser context instead of creating a new one.
+
+    This simulates the cloud browser service flow:
+    1. Start browser with CDP
+    2. Create context via raw CDP (simulating cloud service)
+    3. Have Crawl4AI connect with browser_context_id
+    4. Verify it uses existing context
+    """
+    logger.info("Testing pre-created context usage", tag="TEST")
+
+    # Start a managed browser first
+    browser_config_initial = BrowserConfig(
+        use_managed_browser=True,
+        headless=True,
+        debugging_port=9226,  # Use unique port
+        verbose=True
+    )
+
+    managed_browser = ManagedBrowser(browser_config=browser_config_initial, logger=logger)
+    cdp_creator = None
+    manager = None
+    context_info = None
+
+    try:
+        # Start the browser
+        cdp_url = await managed_browser.start()
+        logger.info(f"Browser started at {cdp_url}", tag="TEST")
+
+        # Create a context via raw CDP (simulating cloud service)
+        cdp_creator = CDPContextCreator(cdp_url)
+        context_info = await cdp_creator.create_context()
+
+        logger.info(f"Pre-created context: {context_info['browser_context_id']}", tag="TEST")
+        logger.info(f"Pre-created target: {context_info['target_id']}", tag="TEST")
+
+        # Get initial target count
+        targets_before = await cdp_creator.get_targets()
+        initial_target_count = len(targets_before)
+        logger.info(f"Initial target count: {initial_target_count}", tag="TEST")
+
+        # Now create BrowserManager with browser_context_id and target_id
+        browser_config = BrowserConfig(
+            cdp_url=cdp_url,
+            browser_context_id=context_info['browser_context_id'],
+            target_id=context_info['target_id'],
+            headless=True,
+            verbose=True
+        )
+
+        manager = BrowserManager(browser_config=browser_config, logger=logger)
+        await manager.start()
+
+        logger.info("BrowserManager started with pre-created context", tag="TEST")
+
+        # Get a page
+        crawler_config = CrawlerRunConfig()
+        page, context = await manager.get_page(crawler_config)
+
+        # Navigate to a test page
+        await page.goto("https://example.com", wait_until="domcontentloaded")
+        title = await page.title()
+
+        logger.info(f"Page title: {title}", tag="TEST")
+
+        # Get target count after
+        targets_after = await cdp_creator.get_targets()
+        final_target_count = len(targets_after)
+        logger.info(f"Final target count: {final_target_count}", tag="TEST")
+
+        # Verify: target count should not have increased significantly
+        # (allow for 1 extra target for internal use, but not many more)
+        target_diff = final_target_count - initial_target_count
+        logger.info(f"Target count difference: {target_diff}", tag="TEST")
+
+        # Success criteria:
+        # 1. Page navigation worked
+        # 2. Target count didn't explode (reused existing context)
+        success = title == "Example Domain" and target_diff <= 1
+
+        if success:
+            logger.success("Pre-created context usage test passed", tag="TEST")
+        else:
+            logger.error(f"Test failed - Title: {title}, Target diff: {target_diff}", tag="TEST")
+
+        return success
+
+    except Exception as e:
+        logger.error(f"Test failed: {str(e)}", tag="TEST")
+        import traceback
+        traceback.print_exc()
+        return False
+
+    finally:
+        # Cleanup
+        if manager:
+            try:
+                await manager.close()
+            except:
+                pass
+
+        if cdp_creator and context_info:
+            try:
+                await cdp_creator.dispose_context(context_info['browser_context_id'])
+                await cdp_creator.disconnect()
+            except:
+                pass
+
+        if managed_browser:
+            try:
+                await managed_browser.cleanup()
+            except:
+                pass
+
+
+async def test_context_isolation():
+    """
+    Test that using browser_context_id actually provides isolation.
+    Create two contexts and verify they don't share state.
+    """
+    logger.info("Testing context isolation with browser_context_id", tag="TEST")
+
+    browser_config_initial = BrowserConfig(
+        use_managed_browser=True,
+        headless=True,
+        debugging_port=9227,
+        verbose=True
+    )
+
+    managed_browser = ManagedBrowser(browser_config=browser_config_initial, logger=logger)
+    cdp_creator = None
+    manager1 = None
+    manager2 = None
+    context_info_1 = None
+    context_info_2 = None
+
+    try:
+        # Start the browser
+        cdp_url = await managed_browser.start()
+        logger.info(f"Browser started at {cdp_url}", tag="TEST")
+
+        # Create two separate contexts
+        cdp_creator = CDPContextCreator(cdp_url)
+        context_info_1 = await cdp_creator.create_context()
+        logger.info(f"Context 1: {context_info_1['browser_context_id']}", tag="TEST")
+
+        # Need to reconnect for second context (or use same connection)
+        await cdp_creator.disconnect()
+        cdp_creator2 = CDPContextCreator(cdp_url)
+        context_info_2 = await cdp_creator2.create_context()
+        logger.info(f"Context 2: {context_info_2['browser_context_id']}", tag="TEST")
+
+        # Verify contexts are different
+        assert context_info_1['browser_context_id'] != context_info_2['browser_context_id'], \
+            "Contexts should have different IDs"
+
+        # Connect with first context
+        browser_config_1 = BrowserConfig(
+            cdp_url=cdp_url,
+            browser_context_id=context_info_1['browser_context_id'],
+            target_id=context_info_1['target_id'],
+            headless=True
+        )
+
+        manager1 = BrowserManager(browser_config=browser_config_1, logger=logger)
+        await manager1.start()
+
+        # Set a cookie in context 1
+        page1, ctx1 = await manager1.get_page(CrawlerRunConfig())
+        await page1.goto("https://example.com", wait_until="domcontentloaded")
+        await ctx1.add_cookies([{
+            "name": "test_isolation",
+            "value": "context_1_value",
+            "domain": "example.com",
+            "path": "/"
+        }])
+
+        cookies1 = await ctx1.cookies(["https://example.com"])
+        cookie1_value = next((c["value"] for c in cookies1 if c["name"] == "test_isolation"), None)
+        logger.info(f"Cookie in context 1: {cookie1_value}", tag="TEST")
+
+        # Connect with second context
+        browser_config_2 = BrowserConfig(
+            cdp_url=cdp_url,
+            browser_context_id=context_info_2['browser_context_id'],
+            target_id=context_info_2['target_id'],
+            headless=True
+        )
+
+        manager2 = BrowserManager(browser_config=browser_config_2, logger=logger)
+        await manager2.start()
+
+        # Check cookies in context 2 - should not have the cookie from context 1
+        page2, ctx2 = await manager2.get_page(CrawlerRunConfig())
+        await page2.goto("https://example.com", wait_until="domcontentloaded")
+
+        cookies2 = await ctx2.cookies(["https://example.com"])
+        cookie2_value = next((c["value"] for c in cookies2 if c["name"] == "test_isolation"), None)
+        logger.info(f"Cookie in context 2: {cookie2_value}", tag="TEST")
+
+        # Verify isolation
+        isolation_works = cookie1_value == "context_1_value" and cookie2_value is None
+
+        if isolation_works:
+            logger.success("Context isolation test passed", tag="TEST")
+        else:
+            logger.error(f"Isolation failed - Cookie1: {cookie1_value}, Cookie2: {cookie2_value}", tag="TEST")
+
+        return isolation_works
+
+    except Exception as e:
+        logger.error(f"Test failed: {str(e)}", tag="TEST")
+        import traceback
+        traceback.print_exc()
+        return False
+
+    finally:
+        # Cleanup
+        for mgr in [manager1, manager2]:
+            if mgr:
+                try:
+                    await mgr.close()
+                except:
+                    pass
+
+        for ctx_info, creator in [(context_info_1, cdp_creator), (context_info_2, cdp_creator2 if 'cdp_creator2' in dir() else None)]:
+            if ctx_info and creator:
+                try:
+                    await creator.dispose_context(ctx_info['browser_context_id'])
+                    await creator.disconnect()
+                except:
+                    pass
+
+        if managed_browser:
+            try:
+                await managed_browser.cleanup()
+            except:
+                pass
+
+
+async def run_tests():
+    """Run all browser_context_id tests."""
+    results = []
+
+    logger.info("Running browser_context_id tests", tag="SUITE")
+
+    # Basic parameter test
+    results.append(("browser_context_id_basic", await test_browser_context_id_basic()))
+
+    # Pre-created context usage test
+    results.append(("pre_created_context_usage", await test_pre_created_context_usage()))
+
+    # Note: Context isolation test is commented out because isolation is enforced
+    # at the CDP level by the cloud browser service, not at the Playwright level.
+    # When multiple BrowserManagers connect to the same browser, Playwright sees
+    # all contexts. In production, each worker gets exactly one pre-created context.
+    # results.append(("context_isolation", await test_context_isolation()))
+
+    # Print summary
+    total = len(results)
+    passed = sum(1 for _, r in results if r)
+
+    logger.info("=" * 50, tag="SUMMARY")
+    logger.info(f"Test Results: {passed}/{total} passed", tag="SUMMARY")
+    logger.info("=" * 50, tag="SUMMARY")
+
+    for name, result in results:
+        status = "PASSED" if result else "FAILED"
+        logger.info(f"  {name}: {status}", tag="SUMMARY")
+
+    if passed == total:
+        logger.success("All tests passed!", tag="SUMMARY")
+        return True
+    else:
+        logger.error(f"{total - passed} tests failed", tag="SUMMARY")
+        return False
+
+
+if __name__ == "__main__":
+    success = asyncio.run(run_tests())
+    sys.exit(0 if success else 1)
--- a/tests/browser/test_cdp_cleanup_reuse.py
+++ b/tests/browser/test_cdp_cleanup_reuse.py
@@ -0,0 +1,281 @@
+#!/usr/bin/env python3
+"""
+Tests for CDP connection cleanup and browser reuse.
+
+These tests verify that:
+1. WebSocket URLs are properly handled (skip HTTP verification)
+2. cdp_cleanup_on_close properly disconnects without terminating the browser
+3. The same browser can be reused by multiple sequential connections
+
+Requirements:
+- A CDP-compatible browser pool service running (e.g., chromepoold)
+- Service should be accessible at CDP_SERVICE_URL (default: http://localhost:11235)
+
+Usage:
+    pytest tests/browser/test_cdp_cleanup_reuse.py -v
+
+Or run directly:
+    python tests/browser/test_cdp_cleanup_reuse.py
+"""
+
+import asyncio
+import os
+import pytest
+import requests
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
+
+# Configuration
+CDP_SERVICE_URL = os.getenv("CDP_SERVICE_URL", "http://localhost:11235")
+
+
+def is_cdp_service_available():
+    """Check if CDP service is running."""
+    try:
+        resp = requests.get(f"{CDP_SERVICE_URL}/health", timeout=2)
+        return resp.status_code == 200
+    except:
+        return False
+
+
+def create_browser():
+    """Create a browser via CDP service API."""
+    resp = requests.post(
+        f"{CDP_SERVICE_URL}/v1/browsers",
+        json={"headless": True},
+        timeout=10
+    )
+    resp.raise_for_status()
+    return resp.json()
+
+
+def get_browser_info(browser_id):
+    """Get browser info from CDP service."""
+    resp = requests.get(f"{CDP_SERVICE_URL}/v1/browsers", timeout=5)
+    for browser in resp.json():
+        if browser["id"] == browser_id:
+            return browser
+    return None
+
+
+def delete_browser(browser_id):
+    """Delete a browser via CDP service API."""
+    try:
+        requests.delete(f"{CDP_SERVICE_URL}/v1/browsers/{browser_id}", timeout=5)
+    except:
+        pass
+
+
+# Skip all tests if CDP service is not available
+pytestmark = pytest.mark.skipif(
+    not is_cdp_service_available(),
+    reason=f"CDP service not available at {CDP_SERVICE_URL}"
+)
+
+
+class TestCDPWebSocketURL:
+    """Tests for WebSocket URL handling."""
+
+    @pytest.mark.asyncio
+    async def test_websocket_url_skips_http_verification(self):
+        """WebSocket URLs should skip HTTP /json/version verification."""
+        browser = create_browser()
+        try:
+            ws_url = browser["ws_url"]
+            assert ws_url.startswith("ws://") or ws_url.startswith("wss://")
+
+            async with AsyncWebCrawler(
+                config=BrowserConfig(
+                    browser_mode="cdp",
+                    cdp_url=ws_url,
+                    headless=True,
+                    cdp_cleanup_on_close=True,
+                )
+            ) as crawler:
+                result = await crawler.arun(
+                    url="https://example.com",
+                    config=CrawlerRunConfig(verbose=False),
+                )
+                assert result.success
+                assert "Example Domain" in result.metadata.get("title", "")
+        finally:
+            delete_browser(browser["browser_id"])
+
+
+class TestCDPCleanupOnClose:
+    """Tests for cdp_cleanup_on_close behavior."""
+
+    @pytest.mark.asyncio
+    async def test_browser_survives_after_cleanup_close(self):
+        """Browser should remain alive after close with cdp_cleanup_on_close=True."""
+        browser = create_browser()
+        browser_id = browser["browser_id"]
+        ws_url = browser["ws_url"]
+
+        try:
+            # Verify browser exists
+            info_before = get_browser_info(browser_id)
+            assert info_before is not None
+            pid_before = info_before["pid"]
+
+            # Connect, crawl, and close with cleanup
+            async with AsyncWebCrawler(
+                config=BrowserConfig(
+                    browser_mode="cdp",
+                    cdp_url=ws_url,
+                    headless=True,
+                    cdp_cleanup_on_close=True,
+                )
+            ) as crawler:
+                result = await crawler.arun(
+                    url="https://example.com",
+                    config=CrawlerRunConfig(verbose=False),
+                )
+                assert result.success
+
+            # Browser should still exist with same PID
+            info_after = get_browser_info(browser_id)
+            assert info_after is not None, "Browser was terminated but should only disconnect"
+            assert info_after["pid"] == pid_before, "Browser PID changed unexpectedly"
+        finally:
+            delete_browser(browser_id)
+
+
+class TestCDPBrowserReuse:
+    """Tests for reusing the same browser with multiple connections."""
+
+    @pytest.mark.asyncio
+    async def test_sequential_connections_same_browser(self):
+        """Multiple sequential connections to the same browser should work."""
+        browser = create_browser()
+        browser_id = browser["browser_id"]
+        ws_url = browser["ws_url"]
+
+        try:
+            urls = [
+                "https://example.com",
+                "https://httpbin.org/ip",
+                "https://httpbin.org/headers",
+            ]
+
+            for i, url in enumerate(urls, 1):
+                # Each connection uses cdp_cleanup_on_close=True
+                async with AsyncWebCrawler(
+                    config=BrowserConfig(
+                        browser_mode="cdp",
+                        cdp_url=ws_url,
+                        headless=True,
+                        cdp_cleanup_on_close=True,
+                    )
+                ) as crawler:
+                    result = await crawler.arun(
+                        url=url,
+                        config=CrawlerRunConfig(verbose=False),
+                    )
+                    assert result.success, f"Connection {i} failed for {url}"
+
+                # Verify browser is still healthy
+                info = get_browser_info(browser_id)
+                assert info is not None, f"Browser died after connection {i}"
+
+        finally:
+            delete_browser(browser_id)
+
+    @pytest.mark.asyncio
+    async def test_no_user_wait_needed_between_connections(self):
+        """With cdp_cleanup_on_close=True, no user wait should be needed."""
+        browser = create_browser()
+        browser_id = browser["browser_id"]
+        ws_url = browser["ws_url"]
+
+        try:
+            # Rapid-fire connections with NO sleep between them
+            for i in range(3):
+                async with AsyncWebCrawler(
+                    config=BrowserConfig(
+                        browser_mode="cdp",
+                        cdp_url=ws_url,
+                        headless=True,
+                        cdp_cleanup_on_close=True,
+                    )
+                ) as crawler:
+                    result = await crawler.arun(
+                        url="https://example.com",
+                        config=CrawlerRunConfig(verbose=False),
+                    )
+                    assert result.success, f"Rapid connection {i+1} failed"
+                # NO asyncio.sleep() here - internal delay should be sufficient
+        finally:
+            delete_browser(browser_id)
+
+
+class TestCDPBackwardCompatibility:
+    """Tests for backward compatibility with existing CDP usage."""
+
+    @pytest.mark.asyncio
+    async def test_http_url_with_browser_id_works(self):
+        """HTTP URL with browser_id query param should work (backward compatibility)."""
+        browser = create_browser()
+        browser_id = browser["browser_id"]
+        try:
+            # Use HTTP URL with browser_id query parameter
+            http_url = f"{CDP_SERVICE_URL}?browser_id={browser_id}"
+
+            async with AsyncWebCrawler(
+                config=BrowserConfig(
+                    browser_mode="cdp",
+                    cdp_url=http_url,
+                    headless=True,
+                    cdp_cleanup_on_close=True,
+                )
+            ) as crawler:
+                result = await crawler.arun(
+                    url="https://example.com",
+                    config=CrawlerRunConfig(verbose=False),
+                )
+                assert result.success
+        finally:
+            delete_browser(browser_id)
+
+
+# Allow running directly
+if __name__ == "__main__":
+    if not is_cdp_service_available():
+        print(f"CDP service not available at {CDP_SERVICE_URL}")
+        print("Please start a CDP-compatible browser pool service first.")
+        exit(1)
+
+    async def run_tests():
+        print("=" * 60)
+        print("CDP Cleanup and Browser Reuse Tests")
+        print("=" * 60)
+
+        tests = [
+            ("WebSocket URL handling", TestCDPWebSocketURL().test_websocket_url_skips_http_verification),
+            ("Browser survives after cleanup", TestCDPCleanupOnClose().test_browser_survives_after_cleanup_close),
+            ("Sequential connections", TestCDPBrowserReuse().test_sequential_connections_same_browser),
+            ("No user wait needed", TestCDPBrowserReuse().test_no_user_wait_needed_between_connections),
+            ("HTTP URL with browser_id", TestCDPBackwardCompatibility().test_http_url_with_browser_id_works),
+        ]
+
+        results = []
+        for name, test_func in tests:
+            print(f"\n--- {name} ---")
+            try:
+                await test_func()
+                print(f"PASS")
+                results.append((name, True))
+            except Exception as e:
+                print(f"FAIL: {e}")
+                results.append((name, False))
+
+        print("\n" + "=" * 60)
+        print("SUMMARY")
+        print("=" * 60)
+        for name, passed in results:
+            print(f"  {name}: {'PASS' if passed else 'FAIL'}")
+
+        all_passed = all(r[1] for r in results)
+        print(f"\nOverall: {'ALL TESTS PASSED' if all_passed else 'SOME TESTS FAILED'}")
+        return 0 if all_passed else 1
+
+    exit(asyncio.run(run_tests()))
--- a/tests/cache_validation/init.py
+++ b/tests/cache_validation/init.py
@@ -0,0 +1 @@
+# Cache validation test suite
--- a/tests/cache_validation/conftest.py
+++ b/tests/cache_validation/conftest.py
@@ -0,0 +1,40 @@
+"""Pytest fixtures for cache validation tests."""
+
+import pytest
+
+
+def pytest_configure(config):
+    """Register custom markers."""
+    config.addinivalue_line(
+        "markers", "integration: marks tests as integration tests (may require network)"
+    )
+
+
+@pytest.fixture
+def sample_head_html():
+    """Sample HTML head section for testing."""
+    return '''
+    <head>
+        <meta charset="utf-8">
+        <title>Test Page Title</title>
+        <meta name="description" content="This is a test page description">
+        <meta property="og:title" content="OG Test Title">
+        <meta property="og:description" content="OG Description">
+        <meta property="og:image" content="https://example.com/image.jpg">
+        <meta property="article:modified_time" content="2024-12-01T00:00:00Z">
+        <link rel="stylesheet" href="style.css">
+        <script src="app.js"></script>
+    </head>
+    '''
+
+
+@pytest.fixture
+def minimal_head_html():
+    """Minimal head with just a title."""
+    return '<head><title>Minimal</title></head>'
+
+
+@pytest.fixture
+def empty_head_html():
+    """Empty head section."""
+    return '<head></head>'
--- a/tests/cache_validation/test_end_to_end.py
+++ b/tests/cache_validation/test_end_to_end.py
@@ -0,0 +1,449 @@
+"""
+End-to-end tests for Smart Cache validation.
+
+Tests the full flow:
+1. Fresh crawl (browser launch) - SLOW
+2. Cached crawl without validation (check_cache_freshness=False) - FAST
+3. Cached crawl with validation (check_cache_freshness=True) - FAST (304/fingerprint)
+
+Verifies all layers:
+- Database storage of etag, last_modified, head_fingerprint, cached_at
+- Cache validation logic
+- HTTP conditional requests (304 Not Modified)
+- Performance improvements
+"""
+
+import pytest
+import time
+import asyncio
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+from crawl4ai.async_database import async_db_manager
+
+
+class TestEndToEndCacheValidation:
+    """End-to-end tests for the complete cache validation flow."""
+
+    @pytest.mark.asyncio
+    async def test_full_cache_flow_docs_python(self):
+        """
+        Test complete cache flow with docs.python.org:
+        1. Fresh crawl (slow - browser) - using BYPASS to force fresh
+        2. Cache hit without validation (fast)
+        3. Cache hit with validation (fast - 304)
+        """
+        url = "https://docs.python.org/3/"
+
+        browser_config = BrowserConfig(headless=True, verbose=False)
+
+        # ========== CRAWL 1: Fresh crawl (force with WRITE_ONLY to skip cache read) ==========
+        config1 = CrawlerRunConfig(
+            cache_mode=CacheMode.WRITE_ONLY,  # Skip reading, write new data
+            check_cache_freshness=False,
+        )
+
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            start1 = time.perf_counter()
+            result1 = await crawler.arun(url, config=config1)
+            time1 = time.perf_counter() - start1
+
+        assert result1.success, f"First crawl failed: {result1.error_message}"
+        # WRITE_ONLY means we did a fresh crawl and wrote to cache
+        assert result1.cache_status == "miss", f"Expected 'miss', got '{result1.cache_status}'"
+
+        print(f"\n[CRAWL 1] Fresh crawl: {time1:.2f}s (cache_status: {result1.cache_status})")
+
+        # Verify data is stored in database
+        metadata = await async_db_manager.aget_cache_metadata(url)
+        assert metadata is not None, "Metadata should be stored in database"
+        assert metadata.get("etag") or metadata.get("last_modified"), "Should have ETag or Last-Modified"
+        print(f"  - Stored ETag: {metadata.get('etag', 'N/A')[:30]}...")
+        print(f"  - Stored Last-Modified: {metadata.get('last_modified', 'N/A')}")
+        print(f"  - Stored head_fingerprint: {metadata.get('head_fingerprint', 'N/A')}")
+        print(f"  - Stored cached_at: {metadata.get('cached_at', 'N/A')}")
+
+        # ========== CRAWL 2: Cache hit WITHOUT validation ==========
+        config2 = CrawlerRunConfig(
+            cache_mode=CacheMode.ENABLED,
+            check_cache_freshness=False,  # Skip validation - pure cache hit
+        )
+
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            start2 = time.perf_counter()
+            result2 = await crawler.arun(url, config=config2)
+            time2 = time.perf_counter() - start2
+
+        assert result2.success, f"Second crawl failed: {result2.error_message}"
+        assert result2.cache_status == "hit", f"Expected 'hit', got '{result2.cache_status}'"
+
+        print(f"\n[CRAWL 2] Cache hit (no validation): {time2:.2f}s (cache_status: {result2.cache_status})")
+        print(f"  - Speedup: {time1/time2:.1f}x faster than fresh crawl")
+
+        # Should be MUCH faster - no browser, no HTTP request
+        assert time2 < time1 / 2, f"Cache hit should be at least 2x faster (was {time1/time2:.1f}x)"
+
+        # ========== CRAWL 3: Cache hit WITH validation (304) ==========
+        config3 = CrawlerRunConfig(
+            cache_mode=CacheMode.ENABLED,
+            check_cache_freshness=True,  # Validate cache freshness
+        )
+
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            start3 = time.perf_counter()
+            result3 = await crawler.arun(url, config=config3)
+            time3 = time.perf_counter() - start3
+
+        assert result3.success, f"Third crawl failed: {result3.error_message}"
+        # Should be "hit_validated" (304) or "hit_fallback" (error during validation)
+        assert result3.cache_status in ["hit_validated", "hit_fallback"], \
+            f"Expected validated cache hit, got '{result3.cache_status}'"
+
+        print(f"\n[CRAWL 3] Cache hit (with validation): {time3:.2f}s (cache_status: {result3.cache_status})")
+        print(f"  - Speedup: {time1/time3:.1f}x faster than fresh crawl")
+
+        # Should still be fast - just a HEAD request, no browser
+        assert time3 < time1 / 2, f"Validated cache hit should be faster than fresh crawl"
+
+        # ========== SUMMARY ==========
+        print(f"\n{'='*60}")
+        print(f"PERFORMANCE SUMMARY for {url}")
+        print(f"{'='*60}")
+        print(f"  Fresh crawl (browser):        {time1:.2f}s")
+        print(f"  Cache hit (no validation):    {time2:.2f}s ({time1/time2:.1f}x faster)")
+        print(f"  Cache hit (with validation):  {time3:.2f}s ({time1/time3:.1f}x faster)")
+        print(f"{'='*60}")
+
+    @pytest.mark.asyncio
+    async def test_full_cache_flow_crawl4ai_docs(self):
+        """Test with docs.crawl4ai.com."""
+        url = "https://docs.crawl4ai.com/"
+
+        browser_config = BrowserConfig(headless=True, verbose=False)
+
+        # Fresh crawl - use WRITE_ONLY to ensure we get fresh data
+        config1 = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY, check_cache_freshness=False)
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            start1 = time.perf_counter()
+            result1 = await crawler.arun(url, config=config1)
+            time1 = time.perf_counter() - start1
+
+        assert result1.success
+        assert result1.cache_status == "miss"
+        print(f"\n[docs.crawl4ai.com] Fresh: {time1:.2f}s")
+
+        # Cache hit with validation
+        config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            start2 = time.perf_counter()
+            result2 = await crawler.arun(url, config=config2)
+            time2 = time.perf_counter() - start2
+
+        assert result2.success
+        assert result2.cache_status in ["hit_validated", "hit_fallback"]
+        print(f"[docs.crawl4ai.com] Validated: {time2:.2f}s ({time1/time2:.1f}x faster)")
+
+    @pytest.mark.asyncio
+    async def test_verify_database_storage(self):
+        """Verify all validation metadata is properly stored in database."""
+        url = "https://docs.python.org/3/library/asyncio.html"
+
+        browser_config = BrowserConfig(headless=True, verbose=False)
+        config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
+
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            result = await crawler.arun(url, config=config)
+
+        assert result.success
+
+        # Verify all fields in database
+        metadata = await async_db_manager.aget_cache_metadata(url)
+
+        assert metadata is not None, "Metadata must be stored"
+        assert "url" in metadata
+        assert "etag" in metadata
+        assert "last_modified" in metadata
+        assert "head_fingerprint" in metadata
+        assert "cached_at" in metadata
+        assert "response_headers" in metadata
+
+        print(f"\nDatabase storage verification for {url}:")
+        print(f"  - etag: {metadata['etag'][:40] if metadata['etag'] else 'None'}...")
+        print(f"  - last_modified: {metadata['last_modified']}")
+        print(f"  - head_fingerprint: {metadata['head_fingerprint']}")
+        print(f"  - cached_at: {metadata['cached_at']}")
+        print(f"  - response_headers keys: {list(metadata['response_headers'].keys())[:5]}...")
+
+        # At least one validation field should be populated
+        has_validation_data = (
+            metadata["etag"] or
+            metadata["last_modified"] or
+            metadata["head_fingerprint"]
+        )
+        assert has_validation_data, "Should have at least one validation field"
+
+    @pytest.mark.asyncio
+    async def test_head_fingerprint_stored_and_used(self):
+        """Verify head fingerprint is computed, stored, and used for validation."""
+        url = "https://example.com/"
+
+        browser_config = BrowserConfig(headless=True, verbose=False)
+
+        # Fresh crawl
+        config1 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            result1 = await crawler.arun(url, config=config1)
+
+        assert result1.success
+        assert result1.head_fingerprint, "head_fingerprint should be set on CrawlResult"
+
+        # Verify in database
+        metadata = await async_db_manager.aget_cache_metadata(url)
+        assert metadata["head_fingerprint"], "head_fingerprint should be stored in database"
+        assert metadata["head_fingerprint"] == result1.head_fingerprint
+
+        print(f"\nHead fingerprint for {url}:")
+        print(f"  - CrawlResult.head_fingerprint: {result1.head_fingerprint}")
+        print(f"  - Database head_fingerprint: {metadata['head_fingerprint']}")
+
+        # Validate using fingerprint
+        config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            result2 = await crawler.arun(url, config=config2)
+
+        assert result2.success
+        assert result2.cache_status in ["hit_validated", "hit_fallback"]
+        print(f"  - Validation result: {result2.cache_status}")
+
+
+class TestCacheValidationPerformance:
+    """Performance benchmarks for cache validation."""
+
+    @pytest.mark.asyncio
+    async def test_multiple_urls_performance(self):
+        """Test cache performance across multiple URLs."""
+        urls = [
+            "https://docs.python.org/3/",
+            "https://docs.python.org/3/library/asyncio.html",
+            "https://en.wikipedia.org/wiki/Python_(programming_language)",
+        ]
+
+        browser_config = BrowserConfig(headless=True, verbose=False)
+        fresh_times = []
+        cached_times = []
+
+        print(f"\n{'='*70}")
+        print("MULTI-URL PERFORMANCE TEST")
+        print(f"{'='*70}")
+
+        # Fresh crawls - use WRITE_ONLY to force fresh crawl
+        for url in urls:
+            config = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY, check_cache_freshness=False)
+            async with AsyncWebCrawler(config=browser_config) as crawler:
+                start = time.perf_counter()
+                result = await crawler.arun(url, config=config)
+                elapsed = time.perf_counter() - start
+                fresh_times.append(elapsed)
+                print(f"Fresh:  {url[:50]:50} {elapsed:.2f}s ({result.cache_status})")
+
+        # Cached crawls with validation
+        for url in urls:
+            config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
+            async with AsyncWebCrawler(config=browser_config) as crawler:
+                start = time.perf_counter()
+                result = await crawler.arun(url, config=config)
+                elapsed = time.perf_counter() - start
+                cached_times.append(elapsed)
+                print(f"Cached: {url[:50]:50} {elapsed:.2f}s ({result.cache_status})")
+
+        avg_fresh = sum(fresh_times) / len(fresh_times)
+        avg_cached = sum(cached_times) / len(cached_times)
+        total_fresh = sum(fresh_times)
+        total_cached = sum(cached_times)
+
+        print(f"\n{'='*70}")
+        print(f"RESULTS:")
+        print(f"  Total fresh crawl time:  {total_fresh:.2f}s")
+        print(f"  Total cached time:       {total_cached:.2f}s")
+        print(f"  Average speedup:         {avg_fresh/avg_cached:.1f}x")
+        print(f"  Time saved:              {total_fresh - total_cached:.2f}s")
+        print(f"{'='*70}")
+
+        # Cached should be significantly faster
+        assert avg_cached < avg_fresh / 2, "Cached crawls should be at least 2x faster"
+
+    @pytest.mark.asyncio
+    async def test_repeated_access_same_url(self):
+        """Test repeated access to the same URL shows consistent cache hits."""
+        url = "https://docs.python.org/3/"
+        num_accesses = 5
+
+        browser_config = BrowserConfig(headless=True, verbose=False)
+
+        print(f"\n{'='*60}")
+        print(f"REPEATED ACCESS TEST: {url}")
+        print(f"{'='*60}")
+
+        # First access - fresh crawl
+        config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            start = time.perf_counter()
+            result = await crawler.arun(url, config=config)
+            fresh_time = time.perf_counter() - start
+        print(f"Access 1 (fresh):     {fresh_time:.2f}s - {result.cache_status}")
+
+        # Repeated accesses - should all be cache hits
+        cached_times = []
+        for i in range(2, num_accesses + 1):
+            config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
+            async with AsyncWebCrawler(config=browser_config) as crawler:
+                start = time.perf_counter()
+                result = await crawler.arun(url, config=config)
+                elapsed = time.perf_counter() - start
+                cached_times.append(elapsed)
+            print(f"Access {i} (cached):    {elapsed:.2f}s - {result.cache_status}")
+            assert result.cache_status in ["hit", "hit_validated", "hit_fallback"]
+
+        avg_cached = sum(cached_times) / len(cached_times)
+        print(f"\nAverage cached time: {avg_cached:.2f}s")
+        print(f"Speedup over fresh:  {fresh_time/avg_cached:.1f}x")
+
+
+class TestCacheValidationModes:
+    """Test different cache modes and their behavior."""
+
+    @pytest.mark.asyncio
+    async def test_cache_bypass_always_fresh(self):
+        """CacheMode.BYPASS should always do fresh crawl."""
+        # Use a unique URL path to avoid cache from other tests
+        url = "https://example.com/test-bypass"
+
+        browser_config = BrowserConfig(headless=True, verbose=False)
+
+        # First crawl with WRITE_ONLY to populate cache (always fresh)
+        config1 = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY, check_cache_freshness=False)
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            result1 = await crawler.arun(url, config=config1)
+        assert result1.cache_status == "miss"
+
+        # Second crawl with BYPASS - should NOT use cache
+        config2 = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, check_cache_freshness=False)
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            result2 = await crawler.arun(url, config=config2)
+
+        # BYPASS mode means no cache interaction
+        assert result2.cache_status is None or result2.cache_status == "miss"
+        print(f"\nCacheMode.BYPASS result: {result2.cache_status}")
+
+    @pytest.mark.asyncio
+    async def test_validation_disabled_uses_cache_directly(self):
+        """With check_cache_freshness=False, should use cache without HTTP validation."""
+        url = "https://docs.python.org/3/tutorial/"
+
+        browser_config = BrowserConfig(headless=True, verbose=False)
+
+        # Fresh crawl - use WRITE_ONLY to force fresh
+        config1 = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY, check_cache_freshness=False)
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            result1 = await crawler.arun(url, config=config1)
+        assert result1.cache_status == "miss"
+
+        # Cached with validation DISABLED - should be "hit" (not "hit_validated")
+        config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            start = time.perf_counter()
+            result2 = await crawler.arun(url, config=config2)
+            elapsed = time.perf_counter() - start
+
+        assert result2.cache_status == "hit", f"Expected 'hit', got '{result2.cache_status}'"
+        print(f"\nValidation disabled: {elapsed:.3f}s (cache_status: {result2.cache_status})")
+
+        # Should be very fast - no HTTP request at all
+        assert elapsed < 1.0, "Cache hit without validation should be < 1 second"
+
+    @pytest.mark.asyncio
+    async def test_validation_enabled_checks_freshness(self):
+        """With check_cache_freshness=True, should validate before using cache."""
+        url = "https://docs.python.org/3/reference/"
+
+        browser_config = BrowserConfig(headless=True, verbose=False)
+
+        # Fresh crawl
+        config1 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            result1 = await crawler.arun(url, config=config1)
+
+        # Cached with validation ENABLED - should be "hit_validated"
+        config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            start = time.perf_counter()
+            result2 = await crawler.arun(url, config=config2)
+            elapsed = time.perf_counter() - start
+
+        assert result2.cache_status in ["hit_validated", "hit_fallback"]
+        print(f"\nValidation enabled: {elapsed:.3f}s (cache_status: {result2.cache_status})")
+
+
+class TestCacheValidationResponseHeaders:
+    """Test that response headers are properly stored and retrieved."""
+
+    @pytest.mark.asyncio
+    async def test_response_headers_stored(self):
+        """Verify response headers including ETag and Last-Modified are stored."""
+        url = "https://docs.python.org/3/"
+
+        browser_config = BrowserConfig(headless=True, verbose=False)
+        config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
+
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            result = await crawler.arun(url, config=config)
+
+        assert result.success
+        assert result.response_headers is not None
+
+        # Check that cache-relevant headers are captured
+        headers = result.response_headers
+        print(f"\nResponse headers for {url}:")
+
+        # Look for ETag (case-insensitive)
+        etag = headers.get("etag") or headers.get("ETag")
+        print(f"  - ETag: {etag}")
+
+        # Look for Last-Modified
+        last_modified = headers.get("last-modified") or headers.get("Last-Modified")
+        print(f"  - Last-Modified: {last_modified}")
+
+        # Look for Cache-Control
+        cache_control = headers.get("cache-control") or headers.get("Cache-Control")
+        print(f"  - Cache-Control: {cache_control}")
+
+        # At least one should be present for docs.python.org
+        assert etag or last_modified, "Should have ETag or Last-Modified header"
+
+    @pytest.mark.asyncio
+    async def test_headers_used_for_validation(self):
+        """Verify stored headers are used for conditional requests."""
+        url = "https://docs.crawl4ai.com/"
+
+        browser_config = BrowserConfig(headless=True, verbose=False)
+
+        # Fresh crawl to store headers
+        config1 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            result1 = await crawler.arun(url, config=config1)
+
+        # Get stored metadata
+        metadata = await async_db_manager.aget_cache_metadata(url)
+        stored_etag = metadata.get("etag")
+        stored_last_modified = metadata.get("last_modified")
+
+        print(f"\nStored validation data for {url}:")
+        print(f"  - etag: {stored_etag}")
+        print(f"  - last_modified: {stored_last_modified}")
+
+        # Validate - should use stored headers
+        config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            result2 = await crawler.arun(url, config=config2)
+
+        # Should get validated hit (304 response)
+        assert result2.cache_status in ["hit_validated", "hit_fallback"]
+        print(f"  - Validation result: {result2.cache_status}")
--- a/tests/cache_validation/test_head_fingerprint.py
+++ b/tests/cache_validation/test_head_fingerprint.py
@@ -0,0 +1,97 @@
+"""Unit tests for head fingerprinting."""
+
+import pytest
+from crawl4ai.utils import compute_head_fingerprint
+
+
+class TestHeadFingerprint:
+    """Tests for the compute_head_fingerprint function."""
+
+    def test_same_content_same_fingerprint(self):
+        """Identical <head> content produces same fingerprint."""
+        head = "<head><title>Test Page</title></head>"
+        fp1 = compute_head_fingerprint(head)
+        fp2 = compute_head_fingerprint(head)
+        assert fp1 == fp2
+        assert fp1 != ""
+
+    def test_different_title_different_fingerprint(self):
+        """Different title produces different fingerprint."""
+        head1 = "<head><title>Title A</title></head>"
+        head2 = "<head><title>Title B</title></head>"
+        assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
+
+    def test_empty_head_returns_empty_string(self):
+        """Empty or None head should return empty fingerprint."""
+        assert compute_head_fingerprint("") == ""
+        assert compute_head_fingerprint(None) == ""
+
+    def test_head_without_signals_returns_empty(self):
+        """Head without title or key meta tags returns empty."""
+        head = "<head><link rel='stylesheet' href='style.css'></head>"
+        assert compute_head_fingerprint(head) == ""
+
+    def test_extracts_title(self):
+        """Title is extracted and included in fingerprint."""
+        head1 = "<head><title>My Title</title></head>"
+        head2 = "<head><title>My Title</title><link href='x'></head>"
+        # Same title should produce same fingerprint
+        assert compute_head_fingerprint(head1) == compute_head_fingerprint(head2)
+
+    def test_extracts_meta_description(self):
+        """Meta description is extracted."""
+        head1 = '<head><meta name="description" content="Test description"></head>'
+        head2 = '<head><meta name="description" content="Different description"></head>'
+        assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
+
+    def test_extracts_og_tags(self):
+        """Open Graph tags are extracted."""
+        head1 = '<head><meta property="og:title" content="OG Title"></head>'
+        head2 = '<head><meta property="og:title" content="Different OG Title"></head>'
+        assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
+
+    def test_extracts_og_image(self):
+        """og:image is extracted and affects fingerprint."""
+        head1 = '<head><meta property="og:image" content="https://example.com/img1.jpg"></head>'
+        head2 = '<head><meta property="og:image" content="https://example.com/img2.jpg"></head>'
+        assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
+
+    def test_extracts_article_modified_time(self):
+        """article:modified_time is extracted."""
+        head1 = '<head><meta property="article:modified_time" content="2024-01-01T00:00:00Z"></head>'
+        head2 = '<head><meta property="article:modified_time" content="2024-12-01T00:00:00Z"></head>'
+        assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
+
+    def test_case_insensitive(self):
+        """Fingerprinting is case-insensitive for tags."""
+        head1 = "<head><TITLE>Test</TITLE></head>"
+        head2 = "<head><title>test</title></head>"
+        # Both should extract title (case insensitive)
+        fp1 = compute_head_fingerprint(head1)
+        fp2 = compute_head_fingerprint(head2)
+        assert fp1 != ""
+        assert fp2 != ""
+
+    def test_handles_attribute_order(self):
+        """Handles different attribute orders in meta tags."""
+        head1 = '<head><meta name="description" content="Test"></head>'
+        head2 = '<head><meta content="Test" name="description"></head>'
+        assert compute_head_fingerprint(head1) == compute_head_fingerprint(head2)
+
+    def test_real_world_head(self):
+        """Test with a realistic head section."""
+        head = '''
+        <head>
+            <meta charset="utf-8">
+            <title>Python Documentation</title>
+            <meta name="description" content="Official Python documentation">
+            <meta property="og:title" content="Python Docs">
+            <meta property="og:description" content="Learn Python">
+            <meta property="og:image" content="https://python.org/logo.png">
+            <link rel="stylesheet" href="styles.css">
+        </head>
+        '''
+        fp = compute_head_fingerprint(head)
+        assert fp != ""
+        # Should be deterministic
+        assert fp == compute_head_fingerprint(head)
--- a/tests/cache_validation/test_real_domains.py
+++ b/tests/cache_validation/test_real_domains.py
@@ -0,0 +1,354 @@
+"""
+Real-world tests for cache validation using actual HTTP requests.
+No mocks - all tests hit real servers.
+"""
+
+import pytest
+from crawl4ai.cache_validator import CacheValidator, CacheValidationResult
+from crawl4ai.utils import compute_head_fingerprint
+
+
+class TestRealDomainsConditionalSupport:
+    """Test domains that support HTTP conditional requests (ETag/Last-Modified)."""
+
+    @pytest.mark.asyncio
+    async def test_docs_python_org_etag(self):
+        """docs.python.org supports ETag - should return 304."""
+        url = "https://docs.python.org/3/"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            # First fetch to get ETag
+            head_html, etag, last_modified = await validator._fetch_head(url)
+
+            assert head_html is not None, "Should fetch head content"
+            assert etag is not None, "docs.python.org should return ETag"
+
+            # Validate with the ETag we just got
+            result = await validator.validate(url=url, stored_etag=etag)
+
+            assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
+            assert "304" in result.reason
+
+    @pytest.mark.asyncio
+    async def test_docs_crawl4ai_etag(self):
+        """docs.crawl4ai.com supports ETag - should return 304."""
+        url = "https://docs.crawl4ai.com/"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            head_html, etag, last_modified = await validator._fetch_head(url)
+
+            assert etag is not None, "docs.crawl4ai.com should return ETag"
+
+            result = await validator.validate(url=url, stored_etag=etag)
+
+            assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
+
+    @pytest.mark.asyncio
+    async def test_wikipedia_last_modified(self):
+        """Wikipedia supports Last-Modified - should return 304."""
+        url = "https://en.wikipedia.org/wiki/Web_crawler"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            head_html, etag, last_modified = await validator._fetch_head(url)
+
+            assert last_modified is not None, "Wikipedia should return Last-Modified"
+
+            result = await validator.validate(url=url, stored_last_modified=last_modified)
+
+            assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
+
+    @pytest.mark.asyncio
+    async def test_github_pages(self):
+        """GitHub Pages supports conditional requests."""
+        url = "https://pages.github.com/"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            head_html, etag, last_modified = await validator._fetch_head(url)
+
+            # GitHub Pages typically has at least one
+            has_conditional = etag is not None or last_modified is not None
+            assert has_conditional, "GitHub Pages should support conditional requests"
+
+            result = await validator.validate(
+                url=url,
+                stored_etag=etag,
+                stored_last_modified=last_modified,
+            )
+
+            assert result.status == CacheValidationResult.FRESH
+
+    @pytest.mark.asyncio
+    async def test_httpbin_etag(self):
+        """httpbin.org/etag endpoint for testing ETag."""
+        url = "https://httpbin.org/etag/test-etag-value"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            result = await validator.validate(url=url, stored_etag='"test-etag-value"')
+
+            # httpbin should return 304 for matching ETag
+            assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
+
+
+class TestRealDomainsNoConditionalSupport:
+    """Test domains that may NOT support HTTP conditional requests."""
+
+    @pytest.mark.asyncio
+    async def test_dynamic_site_fingerprint_fallback(self):
+        """Test fingerprint-based validation for sites without conditional support."""
+        # Use a site that changes frequently but has stable head
+        url = "https://example.com/"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            # Get head and compute fingerprint
+            head_html, etag, last_modified = await validator._fetch_head(url)
+
+            assert head_html is not None
+            fingerprint = compute_head_fingerprint(head_html)
+
+            # Validate using fingerprint (not etag/last-modified)
+            result = await validator.validate(
+                url=url,
+                stored_head_fingerprint=fingerprint,
+            )
+
+            # Should be FRESH since fingerprint should match
+            assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
+            assert "fingerprint" in result.reason.lower()
+
+    @pytest.mark.asyncio
+    async def test_news_site_changes_frequently(self):
+        """News sites change frequently - test that we can detect changes."""
+        url = "https://www.bbc.com/news"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            head_html, etag, last_modified = await validator._fetch_head(url)
+
+            # BBC News has ETag but it changes with content
+            assert head_html is not None
+
+            # Using a fake old ETag should return STALE (200 with different content)
+            result = await validator.validate(
+                url=url,
+                stored_etag='"fake-old-etag-12345"',
+            )
+
+            # Should be STALE because the ETag doesn't match
+            assert result.status == CacheValidationResult.STALE, f"Expected STALE, got {result.status}: {result.reason}"
+
+
+class TestRealDomainsEdgeCases:
+    """Edge cases with real domains."""
+
+    @pytest.mark.asyncio
+    async def test_nonexistent_domain(self):
+        """Non-existent domain should return ERROR."""
+        url = "https://this-domain-definitely-does-not-exist-xyz123.com/"
+
+        async with CacheValidator(timeout=5.0) as validator:
+            result = await validator.validate(url=url, stored_etag='"test"')
+
+            assert result.status == CacheValidationResult.ERROR
+
+    @pytest.mark.asyncio
+    async def test_timeout_slow_server(self):
+        """Test timeout handling with a slow endpoint."""
+        # httpbin delay endpoint
+        url = "https://httpbin.org/delay/10"
+
+        async with CacheValidator(timeout=2.0) as validator:  # 2 second timeout
+            result = await validator.validate(url=url, stored_etag='"test"')
+
+            # Should timeout and return ERROR
+            assert result.status == CacheValidationResult.ERROR
+            assert "timeout" in result.reason.lower() or "timed out" in result.reason.lower()
+
+    @pytest.mark.asyncio
+    async def test_redirect_handling(self):
+        """Test that redirects are followed."""
+        # httpbin redirect
+        url = "https://httpbin.org/redirect/1"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            head_html, etag, last_modified = await validator._fetch_head(url)
+
+            # Should follow redirect and get content
+            # The final page might not have useful head content, but shouldn't error
+            # This tests that redirects are handled
+
+    @pytest.mark.asyncio
+    async def test_https_only(self):
+        """Test HTTPS connection."""
+        url = "https://www.google.com/"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            head_html, etag, last_modified = await validator._fetch_head(url)
+
+            assert head_html is not None
+            assert "<title" in head_html.lower()
+
+
+class TestRealDomainsHeadFingerprint:
+    """Test head fingerprint extraction with real domains."""
+
+    @pytest.mark.asyncio
+    async def test_python_docs_fingerprint(self):
+        """Python docs has title and meta tags."""
+        url = "https://docs.python.org/3/"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            head_html, _, _ = await validator._fetch_head(url)
+
+            assert head_html is not None
+            fingerprint = compute_head_fingerprint(head_html)
+
+            assert fingerprint != "", "Should extract fingerprint from Python docs"
+
+            # Fingerprint should be consistent
+            fingerprint2 = compute_head_fingerprint(head_html)
+            assert fingerprint == fingerprint2
+
+    @pytest.mark.asyncio
+    async def test_github_fingerprint(self):
+        """GitHub has og: tags."""
+        url = "https://github.com/"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            head_html, _, _ = await validator._fetch_head(url)
+
+            assert head_html is not None
+            assert "og:" in head_html.lower() or "title" in head_html.lower()
+
+            fingerprint = compute_head_fingerprint(head_html)
+            assert fingerprint != ""
+
+    @pytest.mark.asyncio
+    async def test_crawl4ai_docs_fingerprint(self):
+        """Crawl4AI docs should have title and description."""
+        url = "https://docs.crawl4ai.com/"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            head_html, _, _ = await validator._fetch_head(url)
+
+            assert head_html is not None
+            fingerprint = compute_head_fingerprint(head_html)
+
+            assert fingerprint != "", "Should extract fingerprint from Crawl4AI docs"
+
+
+class TestRealDomainsFetchHead:
+    """Test _fetch_head functionality with real domains."""
+
+    @pytest.mark.asyncio
+    async def test_fetch_stops_at_head_close(self):
+        """Verify we stop reading after </head>."""
+        url = "https://docs.python.org/3/"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            head_html, _, _ = await validator._fetch_head(url)
+
+            assert head_html is not None
+            assert "</head>" in head_html.lower()
+            # Should NOT contain body content
+            assert "<body" not in head_html.lower() or head_html.lower().index("</head>") < head_html.lower().find("<body")
+
+    @pytest.mark.asyncio
+    async def test_extracts_both_headers(self):
+        """Test extraction of both ETag and Last-Modified."""
+        url = "https://docs.python.org/3/"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            head_html, etag, last_modified = await validator._fetch_head(url)
+
+            # Python docs should have both
+            assert etag is not None, "Should have ETag"
+            assert last_modified is not None, "Should have Last-Modified"
+
+    @pytest.mark.asyncio
+    async def test_handles_missing_head_tag(self):
+        """Handle pages that might not have proper head structure."""
+        # API endpoint that returns JSON (no HTML head)
+        url = "https://httpbin.org/json"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            head_html, etag, last_modified = await validator._fetch_head(url)
+
+            # Should not crash, may return partial content or None
+            # The important thing is it doesn't error
+
+
+class TestRealDomainsValidationCombinations:
+    """Test various combinations of validation data."""
+
+    @pytest.mark.asyncio
+    async def test_etag_only(self):
+        """Validate with only ETag."""
+        url = "https://docs.python.org/3/"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            _, etag, _ = await validator._fetch_head(url)
+
+            result = await validator.validate(url=url, stored_etag=etag)
+            assert result.status == CacheValidationResult.FRESH
+
+    @pytest.mark.asyncio
+    async def test_last_modified_only(self):
+        """Validate with only Last-Modified."""
+        url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            _, _, last_modified = await validator._fetch_head(url)
+
+            if last_modified:
+                result = await validator.validate(url=url, stored_last_modified=last_modified)
+                assert result.status == CacheValidationResult.FRESH
+
+    @pytest.mark.asyncio
+    async def test_fingerprint_only(self):
+        """Validate with only fingerprint."""
+        url = "https://example.com/"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            head_html, _, _ = await validator._fetch_head(url)
+            fingerprint = compute_head_fingerprint(head_html)
+
+            if fingerprint:
+                result = await validator.validate(url=url, stored_head_fingerprint=fingerprint)
+                assert result.status == CacheValidationResult.FRESH
+
+    @pytest.mark.asyncio
+    async def test_all_validation_data(self):
+        """Validate with all available data."""
+        url = "https://docs.python.org/3/"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            head_html, etag, last_modified = await validator._fetch_head(url)
+            fingerprint = compute_head_fingerprint(head_html)
+
+            result = await validator.validate(
+                url=url,
+                stored_etag=etag,
+                stored_last_modified=last_modified,
+                stored_head_fingerprint=fingerprint,
+            )
+
+            assert result.status == CacheValidationResult.FRESH
+
+    @pytest.mark.asyncio
+    async def test_stale_etag_fresh_fingerprint(self):
+        """When ETag is stale but fingerprint matches, should be FRESH."""
+        url = "https://docs.python.org/3/"
+
+        async with CacheValidator(timeout=15.0) as validator:
+            head_html, _, _ = await validator._fetch_head(url)
+            fingerprint = compute_head_fingerprint(head_html)
+
+            # Use fake ETag but real fingerprint
+            result = await validator.validate(
+                url=url,
+                stored_etag='"fake-stale-etag"',
+                stored_head_fingerprint=fingerprint,
+            )
+
+            # Fingerprint should save us
+            assert result.status == CacheValidationResult.FRESH
+            assert "fingerprint" in result.reason.lower()
--- a/tests/deep_crawling/init.py
+++ b/tests/deep_crawling/init.py
--- a/tests/deep_crawling/test_deep_crawl_resume.py
+++ b/tests/deep_crawling/test_deep_crawl_resume.py
@@ -0,0 +1,773 @@
+"""
+Test Suite: Deep Crawl Resume/Crash Recovery Tests
+
+Tests that verify:
+1. State export produces valid JSON-serializable data
+2. Resume from checkpoint continues without duplicates
+3. Simulated crash at various points recovers correctly
+4. State callback fires at expected intervals
+5. No damage to existing system behavior (regression tests)
+"""
+
+import pytest
+import asyncio
+import json
+from typing import Dict, Any, List
+from unittest.mock import AsyncMock, MagicMock
+
+from crawl4ai.deep_crawling import (
+    BFSDeepCrawlStrategy,
+    DFSDeepCrawlStrategy,
+    BestFirstCrawlingStrategy,
+    FilterChain,
+    URLPatternFilter,
+    DomainFilter,
+)
+from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
+
+
+# ============================================================================
+# Helper Functions for Mock Crawler
+# ============================================================================
+
+def create_mock_config(stream=False):
+    """Create a mock CrawlerRunConfig."""
+    config = MagicMock()
+    config.clone = MagicMock(return_value=config)
+    config.stream = stream
+    return config
+
+
+def create_mock_crawler_with_links(num_links: int = 3, include_keyword: bool = False):
+    """Create mock crawler that returns results with links."""
+    call_count = 0
+
+    async def mock_arun_many(urls, config):
+        nonlocal call_count
+        results = []
+        for url in urls:
+            call_count += 1
+            result = MagicMock()
+            result.url = url
+            result.success = True
+            result.metadata = {}
+
+            # Generate child links
+            links = []
+            for i in range(num_links):
+                link_url = f"{url}/child{call_count}_{i}"
+                if include_keyword:
+                    link_url = f"{url}/important-child{call_count}_{i}"
+                links.append({"href": link_url})
+
+            result.links = {"internal": links, "external": []}
+            results.append(result)
+
+        # For streaming mode, return async generator
+        if config.stream:
+            async def gen():
+                for r in results:
+                    yield r
+            return gen()
+        return results
+
+    crawler = MagicMock()
+    crawler.arun_many = mock_arun_many
+    return crawler
+
+
+def create_mock_crawler_tracking(crawl_order: List[str], return_no_links: bool = False):
+    """Create mock crawler that tracks crawl order."""
+
+    async def mock_arun_many(urls, config):
+        results = []
+        for url in urls:
+            crawl_order.append(url)
+            result = MagicMock()
+            result.url = url
+            result.success = True
+            result.metadata = {}
+            result.links = {"internal": [], "external": []} if return_no_links else {"internal": [{"href": f"{url}/child"}], "external": []}
+            results.append(result)
+
+        # For streaming mode, return async generator
+        if config.stream:
+            async def gen():
+                for r in results:
+                    yield r
+            return gen()
+        return results
+
+    crawler = MagicMock()
+    crawler.arun_many = mock_arun_many
+    return crawler
+
+
+def create_simple_mock_crawler():
+    """Basic mock crawler returning 1 result with 2 child links."""
+    call_count = 0
+
+    async def mock_arun_many(urls, config):
+        nonlocal call_count
+        results = []
+        for url in urls:
+            call_count += 1
+            result = MagicMock()
+            result.url = url
+            result.success = True
+            result.metadata = {}
+            result.links = {
+                "internal": [
+                    {"href": f"{url}/child1"},
+                    {"href": f"{url}/child2"},
+                ],
+                "external": []
+            }
+            results.append(result)
+
+        if config.stream:
+            async def gen():
+                for r in results:
+                    yield r
+            return gen()
+        return results
+
+    crawler = MagicMock()
+    crawler.arun_many = mock_arun_many
+    return crawler
+
+
+def create_mock_crawler_unlimited_links():
+    """Mock crawler that always returns links (for testing limits)."""
+    async def mock_arun_many(urls, config):
+        results = []
+        for url in urls:
+            result = MagicMock()
+            result.url = url
+            result.success = True
+            result.metadata = {}
+            result.links = {
+                "internal": [{"href": f"{url}/link{i}"} for i in range(10)],
+                "external": []
+            }
+            results.append(result)
+
+        if config.stream:
+            async def gen():
+                for r in results:
+                    yield r
+            return gen()
+        return results
+
+    crawler = MagicMock()
+    crawler.arun_many = mock_arun_many
+    return crawler
+
+
+# ============================================================================
+# TEST SUITE 1: Crash Recovery Tests
+# ============================================================================
+
+class TestBFSResume:
+    """BFS strategy resume tests."""
+
+    @pytest.mark.asyncio
+    async def test_state_export_json_serializable(self):
+        """Verify exported state can be JSON serialized."""
+        captured_states: List[Dict] = []
+
+        async def capture_state(state: Dict[str, Any]):
+            # Verify JSON serializable
+            json_str = json.dumps(state)
+            parsed = json.loads(json_str)
+            captured_states.append(parsed)
+
+        strategy = BFSDeepCrawlStrategy(
+            max_depth=2,
+            max_pages=10,
+            on_state_change=capture_state,
+        )
+
+        # Create mock crawler that returns predictable results
+        mock_crawler = create_mock_crawler_with_links(num_links=3)
+        mock_config = create_mock_config()
+
+        results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
+
+        # Verify states were captured
+        assert len(captured_states) > 0
+
+        # Verify state structure
+        for state in captured_states:
+            assert state["strategy_type"] == "bfs"
+            assert "visited" in state
+            assert "pending" in state
+            assert "depths" in state
+            assert "pages_crawled" in state
+            assert isinstance(state["visited"], list)
+            assert isinstance(state["pending"], list)
+            assert isinstance(state["depths"], dict)
+            assert isinstance(state["pages_crawled"], int)
+
+    @pytest.mark.asyncio
+    async def test_resume_continues_from_checkpoint(self):
+        """Verify resume starts from saved state, not beginning."""
+        # Simulate state from previous crawl (visited 5 URLs, 3 pending)
+        saved_state = {
+            "strategy_type": "bfs",
+            "visited": [
+                "https://example.com",
+                "https://example.com/page1",
+                "https://example.com/page2",
+                "https://example.com/page3",
+                "https://example.com/page4",
+            ],
+            "pending": [
+                {"url": "https://example.com/page5", "parent_url": "https://example.com/page2"},
+                {"url": "https://example.com/page6", "parent_url": "https://example.com/page3"},
+                {"url": "https://example.com/page7", "parent_url": "https://example.com/page3"},
+            ],
+            "depths": {
+                "https://example.com": 0,
+                "https://example.com/page1": 1,
+                "https://example.com/page2": 1,
+                "https://example.com/page3": 1,
+                "https://example.com/page4": 1,
+                "https://example.com/page5": 2,
+                "https://example.com/page6": 2,
+                "https://example.com/page7": 2,
+            },
+            "pages_crawled": 5,
+        }
+
+        crawled_urls: List[str] = []
+
+        strategy = BFSDeepCrawlStrategy(
+            max_depth=2,
+            max_pages=20,
+            resume_state=saved_state,
+        )
+
+        # Verify internal state was restored
+        assert strategy._resume_state == saved_state
+
+        mock_crawler = create_mock_crawler_tracking(crawled_urls, return_no_links=True)
+        mock_config = create_mock_config()
+
+        await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
+
+        # Should NOT re-crawl already visited URLs
+        for visited_url in saved_state["visited"]:
+            assert visited_url not in crawled_urls, f"Re-crawled already visited: {visited_url}"
+
+        # Should crawl pending URLs
+        for pending in saved_state["pending"]:
+            assert pending["url"] in crawled_urls, f"Did not crawl pending: {pending['url']}"
+
+    @pytest.mark.asyncio
+    async def test_simulated_crash_mid_crawl(self):
+        """Simulate crash at URL N, verify resume continues from pending URLs."""
+        crash_after = 3
+        states_before_crash: List[Dict] = []
+
+        async def capture_until_crash(state: Dict[str, Any]):
+            states_before_crash.append(state)
+            if state["pages_crawled"] >= crash_after:
+                raise Exception("Simulated crash!")
+
+        strategy1 = BFSDeepCrawlStrategy(
+            max_depth=2,
+            max_pages=10,
+            on_state_change=capture_until_crash,
+        )
+
+        mock_crawler = create_mock_crawler_with_links(num_links=5)
+        mock_config = create_mock_config()
+
+        # First crawl - crashes
+        with pytest.raises(Exception, match="Simulated crash"):
+            await strategy1._arun_batch("https://example.com", mock_crawler, mock_config)
+
+        # Get last state before crash
+        last_state = states_before_crash[-1]
+        assert last_state["pages_crawled"] >= crash_after
+
+        # Calculate which URLs were already crawled vs pending
+        pending_urls = {item["url"] for item in last_state["pending"]}
+        visited_urls = set(last_state["visited"])
+        already_crawled_urls = visited_urls - pending_urls
+
+        # Resume from checkpoint
+        crawled_in_resume: List[str] = []
+
+        strategy2 = BFSDeepCrawlStrategy(
+            max_depth=2,
+            max_pages=10,
+            resume_state=last_state,
+        )
+
+        mock_crawler2 = create_mock_crawler_tracking(crawled_in_resume, return_no_links=True)
+
+        await strategy2._arun_batch("https://example.com", mock_crawler2, mock_config)
+
+        # Verify already-crawled URLs are not re-crawled
+        for crawled_url in already_crawled_urls:
+            assert crawled_url not in crawled_in_resume, f"Re-crawled already visited: {crawled_url}"
+
+        # Verify pending URLs are crawled
+        for pending_url in pending_urls:
+            assert pending_url in crawled_in_resume, f"Did not crawl pending: {pending_url}"
+
+    @pytest.mark.asyncio
+    async def test_callback_fires_per_url(self):
+        """Verify callback fires after each URL for maximum granularity."""
+        callback_count = 0
+        pages_crawled_sequence: List[int] = []
+
+        async def count_callbacks(state: Dict[str, Any]):
+            nonlocal callback_count
+            callback_count += 1
+            pages_crawled_sequence.append(state["pages_crawled"])
+
+        strategy = BFSDeepCrawlStrategy(
+            max_depth=1,
+            max_pages=5,
+            on_state_change=count_callbacks,
+        )
+
+        mock_crawler = create_mock_crawler_with_links(num_links=2)
+        mock_config = create_mock_config()
+
+        await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
+
+        # Callback should fire once per successful URL
+        assert callback_count == strategy._pages_crawled, \
+            f"Callback fired {callback_count} times, expected {strategy._pages_crawled} (per URL)"
+
+        # pages_crawled should increment by 1 each callback
+        for i, count in enumerate(pages_crawled_sequence):
+            assert count == i + 1, f"Expected pages_crawled={i+1} at callback {i}, got {count}"
+
+    @pytest.mark.asyncio
+    async def test_export_state_returns_last_captured(self):
+        """Verify export_state() returns last captured state."""
+        last_state = None
+
+        async def capture(state):
+            nonlocal last_state
+            last_state = state
+
+        strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=5, on_state_change=capture)
+
+        mock_crawler = create_mock_crawler_with_links(num_links=2)
+        mock_config = create_mock_config()
+
+        await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
+
+        exported = strategy.export_state()
+        assert exported == last_state
+
+
+class TestDFSResume:
+    """DFS strategy resume tests."""
+
+    @pytest.mark.asyncio
+    async def test_state_export_includes_stack_and_dfs_seen(self):
+        """Verify DFS state includes stack structure and _dfs_seen."""
+        captured_states: List[Dict] = []
+
+        async def capture_state(state: Dict[str, Any]):
+            captured_states.append(state)
+
+        strategy = DFSDeepCrawlStrategy(
+            max_depth=3,
+            max_pages=10,
+            on_state_change=capture_state,
+        )
+
+        mock_crawler = create_mock_crawler_with_links(num_links=2)
+        mock_config = create_mock_config()
+
+        await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
+
+        assert len(captured_states) > 0
+
+        for state in captured_states:
+            assert state["strategy_type"] == "dfs"
+            assert "stack" in state
+            assert "dfs_seen" in state
+            # Stack items should have depth
+            for item in state["stack"]:
+                assert "url" in item
+                assert "parent_url" in item
+                assert "depth" in item
+
+    @pytest.mark.asyncio
+    async def test_resume_restores_stack_order(self):
+        """Verify DFS stack order is preserved on resume."""
+        saved_state = {
+            "strategy_type": "dfs",
+            "visited": ["https://example.com"],
+            "stack": [
+                {"url": "https://example.com/deep3", "parent_url": "https://example.com/deep2", "depth": 3},
+                {"url": "https://example.com/deep2", "parent_url": "https://example.com/deep1", "depth": 2},
+                {"url": "https://example.com/page1", "parent_url": "https://example.com", "depth": 1},
+            ],
+            "depths": {"https://example.com": 0},
+            "pages_crawled": 1,
+            "dfs_seen": ["https://example.com", "https://example.com/deep3", "https://example.com/deep2", "https://example.com/page1"],
+        }
+
+        crawl_order: List[str] = []
+
+        strategy = DFSDeepCrawlStrategy(
+            max_depth=3,
+            max_pages=10,
+            resume_state=saved_state,
+        )
+
+        mock_crawler = create_mock_crawler_tracking(crawl_order, return_no_links=True)
+        mock_config = create_mock_config()
+
+        await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
+
+        # DFS pops from end of stack, so order should be: page1, deep2, deep3
+        assert crawl_order[0] == "https://example.com/page1"
+        assert crawl_order[1] == "https://example.com/deep2"
+        assert crawl_order[2] == "https://example.com/deep3"
+
+
+class TestBestFirstResume:
+    """Best-First strategy resume tests."""
+
+    @pytest.mark.asyncio
+    async def test_state_export_includes_scored_queue(self):
+        """Verify Best-First state includes queue with scores."""
+        captured_states: List[Dict] = []
+
+        async def capture_state(state: Dict[str, Any]):
+            captured_states.append(state)
+
+        scorer = KeywordRelevanceScorer(keywords=["important"], weight=1.0)
+
+        strategy = BestFirstCrawlingStrategy(
+            max_depth=2,
+            max_pages=10,
+            url_scorer=scorer,
+            on_state_change=capture_state,
+        )
+
+        mock_crawler = create_mock_crawler_with_links(num_links=3, include_keyword=True)
+        mock_config = create_mock_config(stream=True)
+
+        async for _ in strategy._arun_stream("https://example.com", mock_crawler, mock_config):
+            pass
+
+        assert len(captured_states) > 0
+
+        for state in captured_states:
+            assert state["strategy_type"] == "best_first"
+            assert "queue_items" in state
+            for item in state["queue_items"]:
+                assert "score" in item
+                assert "depth" in item
+                assert "url" in item
+                assert "parent_url" in item
+
+    @pytest.mark.asyncio
+    async def test_resume_maintains_priority_order(self):
+        """Verify priority queue order is maintained on resume."""
+        saved_state = {
+            "strategy_type": "best_first",
+            "visited": ["https://example.com"],
+            "queue_items": [
+                {"score": -0.9, "depth": 1, "url": "https://example.com/high-priority", "parent_url": "https://example.com"},
+                {"score": -0.5, "depth": 1, "url": "https://example.com/medium-priority", "parent_url": "https://example.com"},
+                {"score": -0.1, "depth": 1, "url": "https://example.com/low-priority", "parent_url": "https://example.com"},
+            ],
+            "depths": {"https://example.com": 0},
+            "pages_crawled": 1,
+        }
+
+        crawl_order: List[str] = []
+
+        strategy = BestFirstCrawlingStrategy(
+            max_depth=2,
+            max_pages=10,
+            resume_state=saved_state,
+        )
+
+        mock_crawler = create_mock_crawler_tracking(crawl_order, return_no_links=True)
+        mock_config = create_mock_config(stream=True)
+
+        async for _ in strategy._arun_stream("https://example.com", mock_crawler, mock_config):
+            pass
+
+        # Higher negative score = higher priority (min-heap)
+        # So -0.9 should be crawled first
+        assert crawl_order[0] == "https://example.com/high-priority"
+
+
+class TestCrossStrategyResume:
+    """Tests that apply to all strategies."""
+
+    @pytest.mark.asyncio
+    @pytest.mark.parametrize("strategy_class,strategy_type", [
+        (BFSDeepCrawlStrategy, "bfs"),
+        (DFSDeepCrawlStrategy, "dfs"),
+        (BestFirstCrawlingStrategy, "best_first"),
+    ])
+    async def test_no_callback_means_no_overhead(self, strategy_class, strategy_type):
+        """Verify no state tracking when callback is None."""
+        strategy = strategy_class(max_depth=2, max_pages=5)
+
+        # _queue_shadow should be None for Best-First when no callback
+        if strategy_class == BestFirstCrawlingStrategy:
+            assert strategy._queue_shadow is None
+
+        # _last_state should be None initially
+        assert strategy._last_state is None
+
+    @pytest.mark.asyncio
+    @pytest.mark.parametrize("strategy_class", [
+        BFSDeepCrawlStrategy,
+        DFSDeepCrawlStrategy,
+        BestFirstCrawlingStrategy,
+    ])
+    async def test_export_state_returns_last_captured(self, strategy_class):
+        """Verify export_state() returns last captured state."""
+        last_state = None
+
+        async def capture(state):
+            nonlocal last_state
+            last_state = state
+
+        strategy = strategy_class(max_depth=2, max_pages=5, on_state_change=capture)
+
+        mock_crawler = create_mock_crawler_with_links(num_links=2)
+
+        if strategy_class == BestFirstCrawlingStrategy:
+            mock_config = create_mock_config(stream=True)
+            async for _ in strategy._arun_stream("https://example.com", mock_crawler, mock_config):
+                pass
+        else:
+            mock_config = create_mock_config()
+            await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
+
+        exported = strategy.export_state()
+        assert exported == last_state
+
+
+# ============================================================================
+# TEST SUITE 2: Regression Tests (No Damage to Current System)
+# ============================================================================
+
+class TestBFSRegressions:
+    """Ensure BFS works identically when new params not used."""
+
+    @pytest.mark.asyncio
+    async def test_default_params_unchanged(self):
+        """Constructor with only original params works."""
+        strategy = BFSDeepCrawlStrategy(
+            max_depth=2,
+            include_external=False,
+            max_pages=10,
+        )
+
+        assert strategy.max_depth == 2
+        assert strategy.include_external == False
+        assert strategy.max_pages == 10
+        assert strategy._resume_state is None
+        assert strategy._on_state_change is None
+
+    @pytest.mark.asyncio
+    async def test_filter_chain_still_works(self):
+        """FilterChain integration unchanged."""
+        filter_chain = FilterChain([
+            URLPatternFilter(patterns=["*/blog/*"]),
+            DomainFilter(allowed_domains=["example.com"]),
+        ])
+
+        strategy = BFSDeepCrawlStrategy(
+            max_depth=2,
+            filter_chain=filter_chain,
+        )
+
+        # Test filter still applies
+        assert await strategy.can_process_url("https://example.com/blog/post1", 1) == True
+        assert await strategy.can_process_url("https://other.com/blog/post1", 1) == False
+
+    @pytest.mark.asyncio
+    async def test_url_scorer_still_works(self):
+        """URL scoring integration unchanged."""
+        scorer = KeywordRelevanceScorer(keywords=["python", "tutorial"], weight=1.0)
+
+        strategy = BFSDeepCrawlStrategy(
+            max_depth=2,
+            url_scorer=scorer,
+            score_threshold=0.5,
+        )
+
+        assert strategy.url_scorer is not None
+        assert strategy.score_threshold == 0.5
+
+        # Scorer should work
+        score = scorer.score("https://example.com/python-tutorial")
+        assert score > 0
+
+    @pytest.mark.asyncio
+    async def test_batch_mode_returns_list(self):
+        """Batch mode still returns List[CrawlResult]."""
+        strategy = BFSDeepCrawlStrategy(max_depth=1, max_pages=5)
+
+        mock_crawler = create_simple_mock_crawler()
+        mock_config = create_mock_config(stream=False)
+
+        results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
+
+        assert isinstance(results, list)
+        assert len(results) > 0
+
+    @pytest.mark.asyncio
+    async def test_max_pages_limit_respected(self):
+        """max_pages limit still enforced."""
+        strategy = BFSDeepCrawlStrategy(max_depth=10, max_pages=3)
+
+        mock_crawler = create_mock_crawler_unlimited_links()
+        mock_config = create_mock_config()
+
+        results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
+
+        # Should stop at max_pages
+        assert strategy._pages_crawled <= 3
+
+    @pytest.mark.asyncio
+    async def test_max_depth_limit_respected(self):
+        """max_depth limit still enforced."""
+        strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=100)
+
+        mock_crawler = create_mock_crawler_unlimited_links()
+        mock_config = create_mock_config()
+
+        results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
+
+        # All results should have depth <= max_depth
+        for result in results:
+            assert result.metadata.get("depth", 0) <= 2
+
+    @pytest.mark.asyncio
+    async def test_metadata_depth_still_set(self):
+        """Result metadata still includes depth."""
+        strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=5)
+
+        mock_crawler = create_simple_mock_crawler()
+        mock_config = create_mock_config()
+
+        results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
+
+        for result in results:
+            assert "depth" in result.metadata
+            assert isinstance(result.metadata["depth"], int)
+
+    @pytest.mark.asyncio
+    async def test_metadata_parent_url_still_set(self):
+        """Result metadata still includes parent_url."""
+        strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=5)
+
+        mock_crawler = create_simple_mock_crawler()
+        mock_config = create_mock_config()
+
+        results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
+
+        # First result (start URL) should have parent_url = None
+        assert results[0].metadata.get("parent_url") is None
+
+        # Child results should have parent_url set
+        for result in results[1:]:
+            assert "parent_url" in result.metadata
+
+
+class TestDFSRegressions:
+    """Ensure DFS works identically when new params not used."""
+
+    @pytest.mark.asyncio
+    async def test_inherits_bfs_params(self):
+        """DFS still inherits all BFS parameters."""
+        strategy = DFSDeepCrawlStrategy(
+            max_depth=3,
+            include_external=True,
+            max_pages=20,
+            score_threshold=0.5,
+        )
+
+        assert strategy.max_depth == 3
+        assert strategy.include_external == True
+        assert strategy.max_pages == 20
+        assert strategy.score_threshold == 0.5
+
+    @pytest.mark.asyncio
+    async def test_dfs_seen_initialized(self):
+        """DFS _dfs_seen set still initialized."""
+        strategy = DFSDeepCrawlStrategy(max_depth=2)
+
+        assert hasattr(strategy, '_dfs_seen')
+        assert isinstance(strategy._dfs_seen, set)
+
+
+class TestBestFirstRegressions:
+    """Ensure Best-First works identically when new params not used."""
+
+    @pytest.mark.asyncio
+    async def test_default_params_unchanged(self):
+        """Constructor with only original params works."""
+        strategy = BestFirstCrawlingStrategy(
+            max_depth=2,
+            include_external=False,
+            max_pages=10,
+        )
+
+        assert strategy.max_depth == 2
+        assert strategy.include_external == False
+        assert strategy.max_pages == 10
+        assert strategy._resume_state is None
+        assert strategy._on_state_change is None
+        assert strategy._queue_shadow is None  # Not initialized without callback
+
+    @pytest.mark.asyncio
+    async def test_scorer_integration(self):
+        """URL scorer still affects crawl priority."""
+        scorer = KeywordRelevanceScorer(keywords=["important"], weight=1.0)
+
+        strategy = BestFirstCrawlingStrategy(
+            max_depth=2,
+            max_pages=10,
+            url_scorer=scorer,
+        )
+
+        assert strategy.url_scorer is scorer
+
+
+class TestAPICompatibility:
+    """Ensure API/serialization compatibility."""
+
+    def test_strategy_signature_backward_compatible(self):
+        """Old code calling with positional/keyword args still works."""
+        # Positional args (old style)
+        s1 = BFSDeepCrawlStrategy(2)
+        assert s1.max_depth == 2
+
+        # Keyword args (old style)
+        s2 = BFSDeepCrawlStrategy(max_depth=3, max_pages=10)
+        assert s2.max_depth == 3
+
+        # Mixed (old style)
+        s3 = BFSDeepCrawlStrategy(2, FilterChain(), None, False, float('-inf'), 100)
+        assert s3.max_depth == 2
+        assert s3.max_pages == 100
+
+    def test_no_required_new_params(self):
+        """New params are optional, not required."""
+        # Should not raise
+        BFSDeepCrawlStrategy(max_depth=2)
+        DFSDeepCrawlStrategy(max_depth=2)
+        BestFirstCrawlingStrategy(max_depth=2)
--- a/tests/deep_crawling/test_deep_crawl_resume_integration.py
+++ b/tests/deep_crawling/test_deep_crawl_resume_integration.py
@@ -0,0 +1,162 @@
+"""
+Integration Test: Deep Crawl Resume with Real URLs
+
+Tests the crash recovery feature using books.toscrape.com - a site
+designed for scraping practice with a clear hierarchy:
+- Home page → Category pages → Book detail pages
+"""
+
+import pytest
+import asyncio
+import json
+from typing import Dict, Any, List
+
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+
+
+class TestBFSResumeIntegration:
+    """Integration tests for BFS resume with real crawling."""
+
+    @pytest.mark.asyncio
+    async def test_real_crawl_state_capture_and_resume(self):
+        """
+        Test crash recovery with real URLs from books.toscrape.com.
+
+        Flow:
+        1. Start crawl with state callback
+        2. Stop after N pages (simulated crash)
+        3. Resume from saved state
+        4. Verify no duplicate crawls
+        """
+        # Phase 1: Initial crawl that "crashes" after 3 pages
+        crash_after = 3
+        captured_states: List[Dict[str, Any]] = []
+        crawled_urls_phase1: List[str] = []
+
+        async def capture_state_until_crash(state: Dict[str, Any]):
+            captured_states.append(state)
+            crawled_urls_phase1.clear()
+            crawled_urls_phase1.extend(state["visited"])
+
+            if state["pages_crawled"] >= crash_after:
+                raise Exception("Simulated crash!")
+
+        strategy1 = BFSDeepCrawlStrategy(
+            max_depth=2,
+            max_pages=10,
+            on_state_change=capture_state_until_crash,
+        )
+
+        config = CrawlerRunConfig(
+            deep_crawl_strategy=strategy1,
+            stream=False,
+            verbose=False,
+        )
+
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            # First crawl - will crash after 3 pages
+            with pytest.raises(Exception, match="Simulated crash"):
+                await crawler.arun("https://books.toscrape.com", config=config)
+
+        # Verify we captured state before crash
+        assert len(captured_states) > 0, "No states captured before crash"
+        last_state = captured_states[-1]
+
+        print(f"\n=== Phase 1: Crashed after {last_state['pages_crawled']} pages ===")
+        print(f"Visited URLs: {len(last_state['visited'])}")
+        print(f"Pending URLs: {len(last_state['pending'])}")
+
+        # Verify state structure
+        assert last_state["strategy_type"] == "bfs"
+        assert last_state["pages_crawled"] >= crash_after
+        assert len(last_state["visited"]) > 0
+        assert "pending" in last_state
+        assert "depths" in last_state
+
+        # Verify state is JSON serializable (important for Redis/DB storage)
+        json_str = json.dumps(last_state)
+        restored_state = json.loads(json_str)
+        assert restored_state == last_state, "State not JSON round-trip safe"
+
+        # Phase 2: Resume from checkpoint
+        crawled_urls_phase2: List[str] = []
+
+        async def track_resumed_crawl(state: Dict[str, Any]):
+            # Track what's being crawled in phase 2
+            new_visited = set(state["visited"]) - set(last_state["visited"])
+            for url in new_visited:
+                if url not in crawled_urls_phase2:
+                    crawled_urls_phase2.append(url)
+
+        strategy2 = BFSDeepCrawlStrategy(
+            max_depth=2,
+            max_pages=10,
+            resume_state=restored_state,
+            on_state_change=track_resumed_crawl,
+        )
+
+        config2 = CrawlerRunConfig(
+            deep_crawl_strategy=strategy2,
+            stream=False,
+            verbose=False,
+        )
+
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            results = await crawler.arun("https://books.toscrape.com", config=config2)
+
+        print(f"\n=== Phase 2: Resumed crawl ===")
+        print(f"New URLs crawled: {len(crawled_urls_phase2)}")
+        print(f"Final pages_crawled: {strategy2._pages_crawled}")
+
+        # Verify no duplicates - URLs from phase 1 should not be re-crawled
+        already_crawled = set(last_state["visited"]) - {item["url"] for item in last_state["pending"]}
+        duplicates = set(crawled_urls_phase2) & already_crawled
+
+        assert len(duplicates) == 0, f"Duplicate crawls detected: {duplicates}"
+
+        # Verify we made progress (crawled some of the pending URLs)
+        pending_urls = {item["url"] for item in last_state["pending"]}
+        crawled_pending = set(crawled_urls_phase2) & pending_urls
+
+        print(f"Pending URLs crawled in phase 2: {len(crawled_pending)}")
+
+        # Final state should show more pages crawled than before crash
+        final_state = strategy2.export_state()
+        if final_state:
+            assert final_state["pages_crawled"] >= last_state["pages_crawled"], \
+                "Resume did not make progress"
+
+        print("\n=== Integration test PASSED ===")
+
+    @pytest.mark.asyncio
+    async def test_state_export_method(self):
+        """Test that export_state() returns valid state during crawl."""
+        states_from_callback: List[Dict] = []
+
+        async def capture(state):
+            states_from_callback.append(state)
+
+        strategy = BFSDeepCrawlStrategy(
+            max_depth=1,
+            max_pages=3,
+            on_state_change=capture,
+        )
+
+        config = CrawlerRunConfig(
+            deep_crawl_strategy=strategy,
+            stream=False,
+            verbose=False,
+        )
+
+        async with AsyncWebCrawler(verbose=False) as crawler:
+            await crawler.arun("https://books.toscrape.com", config=config)
+
+        # export_state should return the last captured state
+        exported = strategy.export_state()
+
+        assert exported is not None, "export_state() returned None"
+        assert exported == states_from_callback[-1], "export_state() doesn't match last callback"
+
+        print(f"\n=== export_state() test PASSED ===")
+        print(f"Final state: {exported['pages_crawled']} pages, {len(exported['visited'])} visited")
--- a/tests/docker/test_hooks_comprehensive.py
+++ b/tests/docker/test_hooks_comprehensive.py
@@ -7,9 +7,46 @@ adapted for the Docker API with real URLs
 import requests
 import json
 import time
-from typing import Dict, Any
+from typing import Dict, Optional

-API_BASE_URL = "http://localhost:11234"
+API_BASE_URL = "http://localhost:11235"
+
+# Global token storage
+_auth_token: Optional[str] = None
+
+
+def get_auth_token(email: str = "test@gmail.com") -> str:
+    """
+    Get a JWT token from the /token endpoint.
+    The email domain must have valid MX records.
+    """
+    global _auth_token
+
+    if _auth_token:
+        return _auth_token
+
+    print(f"🔐 Requesting JWT token for {email}...")
+    response = requests.post(
+        f"{API_BASE_URL}/token",
+        json={"email": email}
+    )
+
+    if response.status_code == 200:
+        data = response.json()
+        _auth_token = data["access_token"]
+        print(f"✅ Token obtained successfully")
+        return _auth_token
+    else:
+        raise Exception(f"Failed to get token: {response.status_code} - {response.text}")
+
+
+def get_auth_headers() -> Dict[str, str]:
+    """Get headers with JWT Bearer token."""
+    token = get_auth_token()
+    return {
+        "Authorization": f"Bearer {token}",
+        "Content-Type": "application/json"
+    }


 def test_all_hooks_demo():
@@ -165,7 +202,7 @@ async def hook(page, context, html, **kwargs):
    print("\nSending request with all 8 hooks...")
    start_time = time.time()

-    response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+    response = requests.post(f"{API_BASE_URL}/crawl", json=payload, headers=get_auth_headers())
    
    elapsed_time = time.time() - start_time
    print(f"Request completed in {elapsed_time:.2f} seconds")
@@ -278,7 +315,7 @@ async def hook(page, context, url, **kwargs):
    }
    
    print("\nTesting authentication with httpbin endpoints...")
-    response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+    response = requests.post(f"{API_BASE_URL}/crawl", json=payload, headers=get_auth_headers())
    
    if response.status_code == 200:
        data = response.json()
@@ -372,7 +409,7 @@ async def hook(page, context, **kwargs):
    print("\nTesting performance optimization hooks...")
    start_time = time.time()

-    response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+    response = requests.post(f"{API_BASE_URL}/crawl", json=payload, headers=get_auth_headers())
    
    elapsed_time = time.time() - start_time
    print(f"Request completed in {elapsed_time:.2f} seconds")
@@ -462,7 +499,7 @@ async def hook(page, context, **kwargs):
    }
    
    print("\nTesting content extraction hooks...")
-    response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+    response = requests.post(f"{API_BASE_URL}/crawl", json=payload, headers=get_auth_headers())
    
    if response.status_code == 200:
        data = response.json()
@@ -486,6 +523,15 @@ def main():
    print("Based on docs/examples/hooks_example.py")
    print("=" * 70)

+    # Get JWT token first (required when jwt_enabled=true)
+    try:
+        get_auth_token()
+        print("=" * 70)
+    except Exception as e:
+        print(f"❌ Failed to authenticate: {e}")
+        print("Make sure the server is running and jwt_enabled is configured correctly.")
+        return
+
    tests = [
        ("All Hooks Demo", test_all_hooks_demo),
        ("Authentication Flow", test_authentication_flow),
--- a/tests/proxy/test_sticky_sessions.py
+++ b/tests/proxy/test_sticky_sessions.py
@@ -0,0 +1,569 @@
+"""
+Comprehensive test suite for Sticky Proxy Sessions functionality.
+
+Tests cover:
+1. Basic sticky session - same proxy for same session_id
+2. Different sessions get different proxies
+3. Session release
+4. TTL expiration
+5. Thread safety / concurrent access
+6. Integration tests with AsyncWebCrawler
+"""
+
+import asyncio
+import os
+import time
+import pytest
+from unittest.mock import patch
+
+from crawl4ai import AsyncWebCrawler, BrowserConfig
+from crawl4ai.async_configs import CrawlerRunConfig, ProxyConfig
+from crawl4ai.proxy_strategy import RoundRobinProxyStrategy
+from crawl4ai.cache_context import CacheMode
+
+
+class TestRoundRobinProxyStrategySession:
+    """Test suite for RoundRobinProxyStrategy session methods."""
+
+    def setup_method(self):
+        """Setup for each test method."""
+        self.proxies = [
+            ProxyConfig(server=f"http://proxy{i}.test:8080")
+            for i in range(5)
+        ]
+
+    # ==================== BASIC STICKY SESSION TESTS ====================
+
+    @pytest.mark.asyncio
+    async def test_sticky_session_same_proxy(self):
+        """Verify same proxy is returned for same session_id."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        # First call - acquires proxy
+        proxy1 = await strategy.get_proxy_for_session("session-1")
+
+        # Second call - should return same proxy
+        proxy2 = await strategy.get_proxy_for_session("session-1")
+
+        # Third call - should return same proxy
+        proxy3 = await strategy.get_proxy_for_session("session-1")
+
+        assert proxy1 is not None
+        assert proxy1.server == proxy2.server == proxy3.server
+
+    @pytest.mark.asyncio
+    async def test_different_sessions_different_proxies(self):
+        """Verify different session_ids can get different proxies."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        proxy_a = await strategy.get_proxy_for_session("session-a")
+        proxy_b = await strategy.get_proxy_for_session("session-b")
+        proxy_c = await strategy.get_proxy_for_session("session-c")
+
+        # All should be different (round-robin)
+        servers = {proxy_a.server, proxy_b.server, proxy_c.server}
+        assert len(servers) == 3
+
+    @pytest.mark.asyncio
+    async def test_sticky_session_with_regular_rotation(self):
+        """Verify sticky sessions don't interfere with regular rotation."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        # Acquire a sticky session
+        session_proxy = await strategy.get_proxy_for_session("sticky-session")
+
+        # Regular rotation should continue independently
+        regular_proxy1 = await strategy.get_next_proxy()
+        regular_proxy2 = await strategy.get_next_proxy()
+
+        # Sticky session should still return same proxy
+        session_proxy_again = await strategy.get_proxy_for_session("sticky-session")
+
+        assert session_proxy.server == session_proxy_again.server
+        # Regular proxies should rotate
+        assert regular_proxy1.server != regular_proxy2.server
+
+    # ==================== SESSION RELEASE TESTS ====================
+
+    @pytest.mark.asyncio
+    async def test_session_release(self):
+        """Verify session can be released and reacquired."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        # Acquire session
+        proxy1 = await strategy.get_proxy_for_session("session-1")
+        assert strategy.get_session_proxy("session-1") is not None
+
+        # Release session
+        await strategy.release_session("session-1")
+        assert strategy.get_session_proxy("session-1") is None
+
+        # Reacquire - should get a new proxy (next in round-robin)
+        proxy2 = await strategy.get_proxy_for_session("session-1")
+        assert proxy2 is not None
+        # After release, next call gets the next proxy in rotation
+        # (not necessarily the same as before)
+
+    @pytest.mark.asyncio
+    async def test_release_nonexistent_session(self):
+        """Verify releasing non-existent session doesn't raise error."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        # Should not raise
+        await strategy.release_session("nonexistent-session")
+
+    @pytest.mark.asyncio
+    async def test_release_twice(self):
+        """Verify releasing session twice doesn't raise error."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        await strategy.get_proxy_for_session("session-1")
+        await strategy.release_session("session-1")
+        await strategy.release_session("session-1")  # Should not raise
+
+    # ==================== GET SESSION PROXY TESTS ====================
+
+    @pytest.mark.asyncio
+    async def test_get_session_proxy_existing(self):
+        """Verify get_session_proxy returns proxy for existing session."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        acquired = await strategy.get_proxy_for_session("session-1")
+        retrieved = strategy.get_session_proxy("session-1")
+
+        assert retrieved is not None
+        assert acquired.server == retrieved.server
+
+    def test_get_session_proxy_nonexistent(self):
+        """Verify get_session_proxy returns None for non-existent session."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        result = strategy.get_session_proxy("nonexistent-session")
+        assert result is None
+
+    # ==================== TTL EXPIRATION TESTS ====================
+
+    @pytest.mark.asyncio
+    async def test_session_ttl_not_expired(self):
+        """Verify session returns same proxy when TTL not expired."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        # Acquire with 10 second TTL
+        proxy1 = await strategy.get_proxy_for_session("session-1", ttl=10)
+
+        # Immediately request again - should return same proxy
+        proxy2 = await strategy.get_proxy_for_session("session-1", ttl=10)
+
+        assert proxy1.server == proxy2.server
+
+    @pytest.mark.asyncio
+    async def test_session_ttl_expired(self):
+        """Verify new proxy acquired after TTL expires."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        # Acquire with 1 second TTL
+        proxy1 = await strategy.get_proxy_for_session("session-1", ttl=1)
+
+        # Wait for TTL to expire
+        await asyncio.sleep(1.1)
+
+        # Request again - should get new proxy due to expiration
+        proxy2 = await strategy.get_proxy_for_session("session-1", ttl=1)
+
+        # May or may not be same server depending on round-robin state,
+        # but session should have been recreated
+        assert proxy2 is not None
+
+    @pytest.mark.asyncio
+    async def test_get_session_proxy_ttl_expired(self):
+        """Verify get_session_proxy returns None after TTL expires."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        await strategy.get_proxy_for_session("session-1", ttl=1)
+
+        # Wait for expiration
+        await asyncio.sleep(1.1)
+
+        # Should return None for expired session
+        result = strategy.get_session_proxy("session-1")
+        assert result is None
+
+    @pytest.mark.asyncio
+    async def test_cleanup_expired_sessions(self):
+        """Verify cleanup_expired_sessions removes expired sessions."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        # Create sessions with short TTL
+        await strategy.get_proxy_for_session("short-ttl-1", ttl=1)
+        await strategy.get_proxy_for_session("short-ttl-2", ttl=1)
+        # Create session without TTL (should not be cleaned up)
+        await strategy.get_proxy_for_session("no-ttl")
+
+        # Wait for TTL to expire
+        await asyncio.sleep(1.1)
+
+        # Cleanup
+        removed = await strategy.cleanup_expired_sessions()
+
+        assert removed == 2
+        assert strategy.get_session_proxy("short-ttl-1") is None
+        assert strategy.get_session_proxy("short-ttl-2") is None
+        assert strategy.get_session_proxy("no-ttl") is not None
+
+    # ==================== GET ACTIVE SESSIONS TESTS ====================
+
+    @pytest.mark.asyncio
+    async def test_get_active_sessions(self):
+        """Verify get_active_sessions returns all active sessions."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        await strategy.get_proxy_for_session("session-a")
+        await strategy.get_proxy_for_session("session-b")
+        await strategy.get_proxy_for_session("session-c")
+
+        active = strategy.get_active_sessions()
+
+        assert len(active) == 3
+        assert "session-a" in active
+        assert "session-b" in active
+        assert "session-c" in active
+
+    @pytest.mark.asyncio
+    async def test_get_active_sessions_excludes_expired(self):
+        """Verify get_active_sessions excludes expired sessions."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        await strategy.get_proxy_for_session("short-ttl", ttl=1)
+        await strategy.get_proxy_for_session("no-ttl")
+
+        # Before expiration
+        active = strategy.get_active_sessions()
+        assert len(active) == 2
+
+        # Wait for TTL to expire
+        await asyncio.sleep(1.1)
+
+        # After expiration
+        active = strategy.get_active_sessions()
+        assert len(active) == 1
+        assert "no-ttl" in active
+        assert "short-ttl" not in active
+
+    # ==================== THREAD SAFETY TESTS ====================
+
+    @pytest.mark.asyncio
+    async def test_concurrent_session_access(self):
+        """Verify thread-safe access to sessions."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        async def acquire_session(session_id: str):
+            proxy = await strategy.get_proxy_for_session(session_id)
+            await asyncio.sleep(0.01)  # Simulate work
+            return proxy.server
+
+        # Acquire same session from multiple coroutines
+        results = await asyncio.gather(*[
+            acquire_session("shared-session") for _ in range(10)
+        ])
+
+        # All should get same proxy
+        assert len(set(results)) == 1
+
+    @pytest.mark.asyncio
+    async def test_concurrent_different_sessions(self):
+        """Verify concurrent acquisition of different sessions works correctly."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        async def acquire_session(session_id: str):
+            proxy = await strategy.get_proxy_for_session(session_id)
+            await asyncio.sleep(0.01)
+            return (session_id, proxy.server)
+
+        # Acquire different sessions concurrently
+        results = await asyncio.gather(*[
+            acquire_session(f"session-{i}") for i in range(5)
+        ])
+
+        # Each session should have a consistent proxy
+        session_proxies = dict(results)
+        assert len(session_proxies) == 5
+
+        # Verify each session still returns same proxy
+        for session_id, expected_server in session_proxies.items():
+            actual = await strategy.get_proxy_for_session(session_id)
+            assert actual.server == expected_server
+
+    @pytest.mark.asyncio
+    async def test_concurrent_session_acquire_and_release(self):
+        """Verify concurrent acquire and release operations work correctly."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        async def acquire_and_release(session_id: str):
+            proxy = await strategy.get_proxy_for_session(session_id)
+            await asyncio.sleep(0.01)
+            await strategy.release_session(session_id)
+            return proxy.server
+
+        # Run multiple acquire/release cycles concurrently
+        await asyncio.gather(*[
+            acquire_and_release(f"session-{i}") for i in range(10)
+        ])
+
+        # All sessions should be released
+        active = strategy.get_active_sessions()
+        assert len(active) == 0
+
+    # ==================== EMPTY PROXY POOL TESTS ====================
+
+    @pytest.mark.asyncio
+    async def test_empty_proxy_pool_session(self):
+        """Verify behavior with empty proxy pool."""
+        strategy = RoundRobinProxyStrategy()  # No proxies
+
+        result = await strategy.get_proxy_for_session("session-1")
+        assert result is None
+
+    @pytest.mark.asyncio
+    async def test_add_proxies_after_session(self):
+        """Verify adding proxies after session creation works."""
+        strategy = RoundRobinProxyStrategy()
+
+        # No proxies initially
+        result1 = await strategy.get_proxy_for_session("session-1")
+        assert result1 is None
+
+        # Add proxies
+        strategy.add_proxies(self.proxies)
+
+        # Now should work
+        result2 = await strategy.get_proxy_for_session("session-2")
+        assert result2 is not None
+
+
+class TestCrawlerRunConfigSession:
+    """Test CrawlerRunConfig with sticky session parameters."""
+
+    def test_config_has_session_fields(self):
+        """Verify CrawlerRunConfig has sticky session fields."""
+        config = CrawlerRunConfig(
+            proxy_session_id="test-session",
+            proxy_session_ttl=300,
+            proxy_session_auto_release=True
+        )
+
+        assert config.proxy_session_id == "test-session"
+        assert config.proxy_session_ttl == 300
+        assert config.proxy_session_auto_release is True
+
+    def test_config_session_defaults(self):
+        """Verify default values for session fields."""
+        config = CrawlerRunConfig()
+
+        assert config.proxy_session_id is None
+        assert config.proxy_session_ttl is None
+        assert config.proxy_session_auto_release is False
+
+
+class TestCrawlerStickySessionIntegration:
+    """Integration tests for AsyncWebCrawler with sticky sessions."""
+
+    def setup_method(self):
+        """Setup for each test method."""
+        self.proxies = [
+            ProxyConfig(server=f"http://proxy{i}.test:8080")
+            for i in range(3)
+        ]
+        self.test_url = "https://httpbin.org/ip"
+
+    @pytest.mark.asyncio
+    async def test_crawler_sticky_session_without_proxy(self):
+        """Test that crawler works when proxy_session_id set but no strategy."""
+        browser_config = BrowserConfig(headless=True)
+
+        config = CrawlerRunConfig(
+            cache_mode=CacheMode.BYPASS,
+            proxy_session_id="test-session",
+            page_timeout=15000
+        )
+
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            result = await crawler.arun(url=self.test_url, config=config)
+            # Should work without errors (no proxy strategy means no proxy)
+            assert result is not None
+
+    @pytest.mark.asyncio
+    async def test_crawler_sticky_session_basic(self):
+        """Test basic sticky session with crawler."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        config = CrawlerRunConfig(
+            cache_mode=CacheMode.BYPASS,
+            proxy_rotation_strategy=strategy,
+            proxy_session_id="integration-test",
+            page_timeout=10000
+        )
+
+        browser_config = BrowserConfig(headless=True)
+
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            # First request
+            try:
+                result1 = await crawler.arun(url=self.test_url, config=config)
+            except Exception:
+                pass  # Proxy connection may fail, but session should be tracked
+
+            # Verify session was created
+            session_proxy = strategy.get_session_proxy("integration-test")
+            assert session_proxy is not None
+
+            # Cleanup
+            await strategy.release_session("integration-test")
+
+    @pytest.mark.asyncio
+    async def test_crawler_rotating_vs_sticky(self):
+        """Compare rotating behavior vs sticky session behavior."""
+        strategy = RoundRobinProxyStrategy(self.proxies)
+
+        # Config WITHOUT sticky session - should rotate
+        rotating_config = CrawlerRunConfig(
+            cache_mode=CacheMode.BYPASS,
+            proxy_rotation_strategy=strategy,
+            page_timeout=5000
+        )
+
+        # Config WITH sticky session - should use same proxy
+        sticky_config = CrawlerRunConfig(
+            cache_mode=CacheMode.BYPASS,
+            proxy_rotation_strategy=strategy,
+            proxy_session_id="sticky-test",
+            page_timeout=5000
+        )
+
+        browser_config = BrowserConfig(headless=True)
+
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            # Track proxy configs used
+            rotating_proxies = []
+            sticky_proxies = []
+
+            # Try rotating requests (may fail due to test proxies, but config should be set)
+            for _ in range(3):
+                try:
+                    await crawler.arun(url=self.test_url, config=rotating_config)
+                except Exception:
+                    pass
+                rotating_proxies.append(rotating_config.proxy_config.server if rotating_config.proxy_config else None)
+
+            # Try sticky requests
+            for _ in range(3):
+                try:
+                    await crawler.arun(url=self.test_url, config=sticky_config)
+                except Exception:
+                    pass
+                sticky_proxies.append(sticky_config.proxy_config.server if sticky_config.proxy_config else None)
+
+            # Rotating should have different proxies (or cycle through them)
+            # Sticky should have same proxy for all requests
+            if all(sticky_proxies):
+                assert len(set(sticky_proxies)) == 1, "Sticky session should use same proxy"
+
+            await strategy.release_session("sticky-test")
+
+
+class TestStickySessionRealWorld:
+    """Real-world scenario tests for sticky sessions.
+
+    Note: These tests require actual proxy servers to verify IP consistency.
+    They are marked to be skipped if no proxy is configured.
+    """
+
+    @pytest.mark.asyncio
+    @pytest.mark.skipif(
+        not os.environ.get('TEST_PROXY_1'),
+        reason="Requires TEST_PROXY_1 environment variable"
+    )
+    async def test_verify_ip_consistency(self):
+        """Verify that sticky session actually uses same IP.
+
+        This test requires real proxies set in environment variables:
+        TEST_PROXY_1=ip:port:user:pass
+        TEST_PROXY_2=ip:port:user:pass
+        """
+        import re
+
+        # Load proxies from environment
+        proxy_strs = [
+            os.environ.get('TEST_PROXY_1', ''),
+            os.environ.get('TEST_PROXY_2', '')
+        ]
+        proxies = [ProxyConfig.from_string(p) for p in proxy_strs if p]
+
+        if len(proxies) < 2:
+            pytest.skip("Need at least 2 proxies for this test")
+
+        strategy = RoundRobinProxyStrategy(proxies)
+
+        # Config WITH sticky session
+        config = CrawlerRunConfig(
+            cache_mode=CacheMode.BYPASS,
+            proxy_rotation_strategy=strategy,
+            proxy_session_id="ip-verify-session",
+            page_timeout=30000
+        )
+
+        browser_config = BrowserConfig(headless=True)
+
+        async with AsyncWebCrawler(config=browser_config) as crawler:
+            ips = []
+
+            for i in range(3):
+                result = await crawler.arun(
+                    url="https://httpbin.org/ip",
+                    config=config
+                )
+
+                if result and result.success and result.html:
+                    # Extract IP from response
+                    ip_match = re.search(r'"origin":\s*"([^"]+)"', result.html)
+                    if ip_match:
+                        ips.append(ip_match.group(1))
+
+            await strategy.release_session("ip-verify-session")
+
+            # All IPs should be same for sticky session
+            if len(ips) >= 2:
+                assert len(set(ips)) == 1, f"Expected same IP, got: {ips}"
+
+
+# ==================== STANDALONE TEST FUNCTIONS ====================
+
+@pytest.mark.asyncio
+async def test_sticky_session_simple():
+    """Simple test for sticky session functionality."""
+    proxies = [
+        ProxyConfig(server=f"http://proxy{i}.test:8080")
+        for i in range(3)
+    ]
+    strategy = RoundRobinProxyStrategy(proxies)
+
+    # Same session should return same proxy
+    p1 = await strategy.get_proxy_for_session("test")
+    p2 = await strategy.get_proxy_for_session("test")
+    p3 = await strategy.get_proxy_for_session("test")
+
+    assert p1.server == p2.server == p3.server
+    print(f"Sticky session works! All requests use: {p1.server}")
+
+    # Cleanup
+    await strategy.release_session("test")
+
+
+if __name__ == "__main__":
+    print("Running Sticky Session tests...")
+    print("=" * 50)
+
+    asyncio.run(test_sticky_session_simple())
+
+    print("\n" + "=" * 50)
+    print("To run the full pytest suite, use: pytest " + __file__)
+    print("=" * 50)
--- a/tests/test_prefetch_integration.py
+++ b/tests/test_prefetch_integration.py
@@ -0,0 +1,236 @@
+"""Integration tests for prefetch mode with the crawler."""
+
+import pytest
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
+
+# Use crawl4ai docs as test domain
+TEST_DOMAIN = "https://docs.crawl4ai.com"
+
+
+class TestPrefetchModeIntegration:
+    """Integration tests for prefetch mode."""
+
+    @pytest.mark.asyncio
+    async def test_prefetch_returns_html_and_links(self):
+        """Test that prefetch mode returns HTML and links only."""
+        async with AsyncWebCrawler() as crawler:
+            config = CrawlerRunConfig(prefetch=True)
+            result = await crawler.arun(TEST_DOMAIN, config=config)
+
+            # Should have HTML
+            assert result.html is not None
+            assert len(result.html) > 0
+            assert "<html" in result.html.lower() or "<!doctype" in result.html.lower()
+
+            # Should have links
+            assert result.links is not None
+            assert "internal" in result.links
+            assert "external" in result.links
+
+            # Should NOT have processed content
+            assert result.markdown is None or (
+                hasattr(result.markdown, 'raw_markdown') and
+                result.markdown.raw_markdown is None
+            )
+            assert result.cleaned_html is None
+            assert result.extracted_content is None
+
+    @pytest.mark.asyncio
+    async def test_prefetch_preserves_metadata(self):
+        """Test that prefetch mode preserves essential metadata."""
+        async with AsyncWebCrawler() as crawler:
+            config = CrawlerRunConfig(prefetch=True)
+            result = await crawler.arun(TEST_DOMAIN, config=config)
+
+            # Should have success flag
+            assert result.success is True
+
+            # Should have URL
+            assert result.url is not None
+
+            # Status code should be present
+            assert result.status_code is not None or result.status_code == 200
+
+    @pytest.mark.asyncio
+    async def test_prefetch_with_deep_crawl(self):
+        """Test prefetch mode with deep crawl strategy."""
+        from crawl4ai import BFSDeepCrawlStrategy
+
+        async with AsyncWebCrawler() as crawler:
+            config = CrawlerRunConfig(
+                prefetch=True,
+                deep_crawl_strategy=BFSDeepCrawlStrategy(
+                    max_depth=1,
+                    max_pages=3
+                )
+            )
+
+            result_container = await crawler.arun(TEST_DOMAIN, config=config)
+
+            # Handle both list and iterator results
+            if hasattr(result_container, '__aiter__'):
+                results = [r async for r in result_container]
+            else:
+                results = list(result_container) if hasattr(result_container, '__iter__') else [result_container]
+
+            # Each result should have HTML and links
+            for result in results:
+                assert result.html is not None
+                assert result.links is not None
+
+            # Should have crawled at least one page
+            assert len(results) >= 1
+
+    @pytest.mark.asyncio
+    async def test_prefetch_then_process_with_raw(self):
+        """Test the full two-phase workflow: prefetch then process."""
+        async with AsyncWebCrawler() as crawler:
+            # Phase 1: Prefetch
+            prefetch_config = CrawlerRunConfig(prefetch=True)
+            prefetch_result = await crawler.arun(TEST_DOMAIN, config=prefetch_config)
+
+            stored_html = prefetch_result.html
+
+            assert stored_html is not None
+            assert len(stored_html) > 0
+
+            # Phase 2: Process with raw: URL
+            process_config = CrawlerRunConfig(
+                # No prefetch - full processing
+                base_url=TEST_DOMAIN  # Provide base URL for link resolution
+            )
+            processed_result = await crawler.arun(
+                f"raw:{stored_html}",
+                config=process_config
+            )
+
+            # Should now have full processing
+            assert processed_result.html is not None
+            assert processed_result.success is True
+            # Note: cleaned_html and markdown depend on the content
+
+    @pytest.mark.asyncio
+    async def test_prefetch_links_structure(self):
+        """Test that links have the expected structure."""
+        async with AsyncWebCrawler() as crawler:
+            config = CrawlerRunConfig(prefetch=True)
+            result = await crawler.arun(TEST_DOMAIN, config=config)
+
+            assert result.links is not None
+
+            # Check internal links structure
+            if result.links["internal"]:
+                link = result.links["internal"][0]
+                assert "href" in link
+                assert "text" in link
+                assert link["href"].startswith("http")
+
+            # Check external links structure (if any)
+            if result.links["external"]:
+                link = result.links["external"][0]
+                assert "href" in link
+                assert "text" in link
+                assert link["href"].startswith("http")
+
+    @pytest.mark.asyncio
+    async def test_prefetch_config_clone(self):
+        """Test that config.clone() preserves prefetch setting."""
+        config = CrawlerRunConfig(prefetch=True)
+        cloned = config.clone()
+
+        assert cloned.prefetch == True
+
+        # Clone with override
+        cloned_false = config.clone(prefetch=False)
+        assert cloned_false.prefetch == False
+
+    @pytest.mark.asyncio
+    async def test_prefetch_to_dict(self):
+        """Test that to_dict() includes prefetch."""
+        config = CrawlerRunConfig(prefetch=True)
+        config_dict = config.to_dict()
+
+        assert "prefetch" in config_dict
+        assert config_dict["prefetch"] == True
+
+    @pytest.mark.asyncio
+    async def test_prefetch_default_false(self):
+        """Test that prefetch defaults to False."""
+        config = CrawlerRunConfig()
+        assert config.prefetch == False
+
+    @pytest.mark.asyncio
+    async def test_prefetch_explicit_false(self):
+        """Test explicit prefetch=False works like default."""
+        async with AsyncWebCrawler() as crawler:
+            config = CrawlerRunConfig(prefetch=False)
+            result = await crawler.arun(TEST_DOMAIN, config=config)
+
+            # Should have full processing
+            assert result.html is not None
+            # cleaned_html should be populated in normal mode
+            assert result.cleaned_html is not None
+
+
+class TestPrefetchPerformance:
+    """Performance-related tests for prefetch mode."""
+
+    @pytest.mark.asyncio
+    async def test_prefetch_returns_quickly(self):
+        """Test that prefetch mode returns results quickly."""
+        import time
+
+        async with AsyncWebCrawler() as crawler:
+            # Prefetch mode
+            start = time.time()
+            prefetch_config = CrawlerRunConfig(prefetch=True)
+            await crawler.arun(TEST_DOMAIN, config=prefetch_config)
+            prefetch_time = time.time() - start
+
+            # Full mode
+            start = time.time()
+            full_config = CrawlerRunConfig()
+            await crawler.arun(TEST_DOMAIN, config=full_config)
+            full_time = time.time() - start
+
+            # Log times for debugging
+            print(f"\nPrefetch: {prefetch_time:.3f}s, Full: {full_time:.3f}s")
+
+            # Prefetch should not be significantly slower
+            # (may be same or slightly faster depending on content)
+            # This is a soft check - mostly for logging
+
+
+class TestPrefetchWithRawHTML:
+    """Test prefetch mode with raw HTML input."""
+
+    @pytest.mark.asyncio
+    async def test_prefetch_with_raw_html(self):
+        """Test prefetch mode works with raw: URL scheme."""
+        sample_html = """
+        <html>
+            <head><title>Test Page</title></head>
+            <body>
+                <h1>Hello World</h1>
+                <a href="/link1">Link 1</a>
+                <a href="/link2">Link 2</a>
+                <a href="https://external.com/page">External</a>
+            </body>
+        </html>
+        """
+
+        async with AsyncWebCrawler() as crawler:
+            config = CrawlerRunConfig(
+                prefetch=True,
+                base_url="https://example.com"
+            )
+            result = await crawler.arun(f"raw:{sample_html}", config=config)
+
+            assert result.success is True
+            assert result.html is not None
+            assert result.links is not None
+
+            # Should have extracted links
+            assert len(result.links["internal"]) >= 2
+            assert len(result.links["external"]) >= 1
--- a/tests/test_prefetch_mode.py
+++ b/tests/test_prefetch_mode.py
@@ -0,0 +1,275 @@
+"""Unit tests for the quick_extract_links function used in prefetch mode."""
+
+import pytest
+from crawl4ai.utils import quick_extract_links
+
+
+class TestQuickExtractLinks:
+    """Unit tests for the quick_extract_links function."""
+
+    def test_basic_internal_links(self):
+        """Test extraction of internal links."""
+        html = '''
+        <html>
+            <body>
+                <a href="/page1">Page 1</a>
+                <a href="/page2">Page 2</a>
+                <a href="https://example.com/page3">Page 3</a>
+            </body>
+        </html>
+        '''
+        result = quick_extract_links(html, "https://example.com")
+
+        assert len(result["internal"]) == 3
+        assert result["internal"][0]["href"] == "https://example.com/page1"
+        assert result["internal"][0]["text"] == "Page 1"
+
+    def test_external_links(self):
+        """Test extraction and classification of external links."""
+        html = '''
+        <html>
+            <body>
+                <a href="https://other.com/page">External</a>
+                <a href="/internal">Internal</a>
+            </body>
+        </html>
+        '''
+        result = quick_extract_links(html, "https://example.com")
+
+        assert len(result["internal"]) == 1
+        assert len(result["external"]) == 1
+        assert result["external"][0]["href"] == "https://other.com/page"
+
+    def test_ignores_javascript_and_mailto(self):
+        """Test that javascript: and mailto: links are ignored."""
+        html = '''
+        <html>
+            <body>
+                <a href="javascript:void(0)">Click</a>
+                <a href="mailto:test@example.com">Email</a>
+                <a href="tel:+1234567890">Call</a>
+                <a href="/valid">Valid</a>
+            </body>
+        </html>
+        '''
+        result = quick_extract_links(html, "https://example.com")
+
+        assert len(result["internal"]) == 1
+        assert result["internal"][0]["href"] == "https://example.com/valid"
+
+    def test_ignores_anchor_only_links(self):
+        """Test that anchor-only links (#section) are ignored."""
+        html = '''
+        <html>
+            <body>
+                <a href="#section1">Section 1</a>
+                <a href="#section2">Section 2</a>
+                <a href="/page#section">Page with anchor</a>
+            </body>
+        </html>
+        '''
+        result = quick_extract_links(html, "https://example.com")
+
+        # Only the page link should be included, anchor-only links are skipped
+        assert len(result["internal"]) == 1
+        assert "/page" in result["internal"][0]["href"]
+
+    def test_deduplication(self):
+        """Test that duplicate URLs are deduplicated."""
+        html = '''
+        <html>
+            <body>
+                <a href="/page">Link 1</a>
+                <a href="/page">Link 2</a>
+                <a href="/page">Link 3</a>
+            </body>
+        </html>
+        '''
+        result = quick_extract_links(html, "https://example.com")
+
+        assert len(result["internal"]) == 1
+
+    def test_handles_malformed_html(self):
+        """Test graceful handling of malformed HTML."""
+        html = "not valid html at all <><><"
+        result = quick_extract_links(html, "https://example.com")
+
+        # Should not raise, should return empty
+        assert result["internal"] == []
+        assert result["external"] == []
+
+    def test_empty_html(self):
+        """Test handling of empty HTML."""
+        result = quick_extract_links("", "https://example.com")
+        assert result == {"internal": [], "external": []}
+
+    def test_relative_url_resolution(self):
+        """Test that relative URLs are resolved correctly."""
+        html = '''
+        <html>
+            <body>
+                <a href="page1.html">Relative</a>
+                <a href="./page2.html">Dot Relative</a>
+                <a href="../page3.html">Parent Relative</a>
+            </body>
+        </html>
+        '''
+        result = quick_extract_links(html, "https://example.com/docs/")
+
+        assert len(result["internal"]) >= 1
+        # All should be internal and properly resolved
+        for link in result["internal"]:
+            assert link["href"].startswith("https://example.com")
+
+    def test_text_truncation(self):
+        """Test that long link text is truncated to 200 chars."""
+        long_text = "A" * 300
+        html = f'''
+        <html>
+            <body>
+                <a href="/page">{long_text}</a>
+            </body>
+        </html>
+        '''
+        result = quick_extract_links(html, "https://example.com")
+
+        assert len(result["internal"]) == 1
+        assert len(result["internal"][0]["text"]) == 200
+
+    def test_empty_href_ignored(self):
+        """Test that empty href attributes are ignored."""
+        html = '''
+        <html>
+            <body>
+                <a href="">Empty</a>
+                <a href="   ">Whitespace</a>
+                <a href="/valid">Valid</a>
+            </body>
+        </html>
+        '''
+        result = quick_extract_links(html, "https://example.com")
+
+        assert len(result["internal"]) == 1
+        assert result["internal"][0]["href"] == "https://example.com/valid"
+
+    def test_mixed_internal_external(self):
+        """Test correct classification of mixed internal and external links."""
+        html = '''
+        <html>
+            <body>
+                <a href="/internal1">Internal 1</a>
+                <a href="https://example.com/internal2">Internal 2</a>
+                <a href="https://google.com">Google</a>
+                <a href="https://github.com/repo">GitHub</a>
+                <a href="/internal3">Internal 3</a>
+            </body>
+        </html>
+        '''
+        result = quick_extract_links(html, "https://example.com")
+
+        assert len(result["internal"]) == 3
+        assert len(result["external"]) == 2
+
+    def test_subdomain_handling(self):
+        """Test that subdomains are handled correctly."""
+        html = '''
+        <html>
+            <body>
+                <a href="https://docs.example.com/page">Docs subdomain</a>
+                <a href="https://api.example.com/v1">API subdomain</a>
+                <a href="https://example.com/main">Main domain</a>
+            </body>
+        </html>
+        '''
+        result = quick_extract_links(html, "https://example.com")
+
+        # All should be internal (same base domain)
+        total_links = len(result["internal"]) + len(result["external"])
+        assert total_links == 3
+
+
+class TestQuickExtractLinksEdgeCases:
+    """Edge case tests for quick_extract_links."""
+
+    def test_no_links_in_page(self):
+        """Test page with no links."""
+        html = '''
+        <html>
+            <body>
+                <h1>No Links Here</h1>
+                <p>Just some text content.</p>
+            </body>
+        </html>
+        '''
+        result = quick_extract_links(html, "https://example.com")
+
+        assert result["internal"] == []
+        assert result["external"] == []
+
+    def test_links_in_nested_elements(self):
+        """Test links nested in various elements."""
+        html = '''
+        <html>
+            <body>
+                <nav>
+                    <ul>
+                        <li><a href="/home">Home</a></li>
+                        <li><a href="/about">About</a></li>
+                    </ul>
+                </nav>
+                <div class="content">
+                    <p>Check out <a href="/products">our products</a>.</p>
+                </div>
+            </body>
+        </html>
+        '''
+        result = quick_extract_links(html, "https://example.com")
+
+        assert len(result["internal"]) == 3
+
+    def test_link_with_nested_elements(self):
+        """Test links containing nested elements."""
+        html = '''
+        <html>
+            <body>
+                <a href="/page"><span>Nested</span> <strong>Text</strong></a>
+            </body>
+        </html>
+        '''
+        result = quick_extract_links(html, "https://example.com")
+
+        assert len(result["internal"]) == 1
+        assert "Nested" in result["internal"][0]["text"]
+        assert "Text" in result["internal"][0]["text"]
+
+    def test_protocol_relative_urls(self):
+        """Test handling of protocol-relative URLs (//example.com)."""
+        html = '''
+        <html>
+            <body>
+                <a href="//cdn.example.com/asset">CDN Link</a>
+            </body>
+        </html>
+        '''
+        result = quick_extract_links(html, "https://example.com")
+
+        # Should be resolved with https:
+        total = len(result["internal"]) + len(result["external"])
+        assert total >= 1
+
+    def test_whitespace_in_href(self):
+        """Test handling of whitespace around href values."""
+        html = '''
+        <html>
+            <body>
+                <a href="  /page1  ">Padded</a>
+                <a href="
+                    /page2
+                ">Multiline</a>
+            </body>
+        </html>
+        '''
+        result = quick_extract_links(html, "https://example.com")
+
+        # Both should be extracted and normalized
+        assert len(result["internal"]) >= 1
--- a/tests/test_prefetch_regression.py
+++ b/tests/test_prefetch_regression.py
@@ -0,0 +1,232 @@
+"""Regression tests to ensure prefetch mode doesn't break existing functionality."""
+
+import pytest
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+TEST_URL = "https://docs.crawl4ai.com"
+
+
+class TestNoRegressions:
+    """Ensure prefetch mode doesn't break existing functionality."""
+
+    @pytest.mark.asyncio
+    async def test_default_mode_unchanged(self):
+        """Test that default mode (prefetch=False) works exactly as before."""
+        async with AsyncWebCrawler() as crawler:
+            config = CrawlerRunConfig()  # Default config
+            result = await crawler.arun(TEST_URL, config=config)
+
+            # All standard fields should be populated
+            assert result.html is not None
+            assert result.cleaned_html is not None
+            assert result.links is not None
+            assert result.success is True
+
+    @pytest.mark.asyncio
+    async def test_explicit_prefetch_false(self):
+        """Test explicit prefetch=False works like default."""
+        async with AsyncWebCrawler() as crawler:
+            config = CrawlerRunConfig(prefetch=False)
+            result = await crawler.arun(TEST_URL, config=config)
+
+            assert result.cleaned_html is not None
+
+    @pytest.mark.asyncio
+    async def test_config_clone_preserves_prefetch(self):
+        """Test that config.clone() preserves prefetch setting."""
+        config = CrawlerRunConfig(prefetch=True)
+        cloned = config.clone()
+
+        assert cloned.prefetch == True
+
+        # Clone with override
+        cloned_false = config.clone(prefetch=False)
+        assert cloned_false.prefetch == False
+
+    @pytest.mark.asyncio
+    async def test_config_to_dict_includes_prefetch(self):
+        """Test that to_dict() includes prefetch."""
+        config_true = CrawlerRunConfig(prefetch=True)
+        config_false = CrawlerRunConfig(prefetch=False)
+
+        assert config_true.to_dict()["prefetch"] == True
+        assert config_false.to_dict()["prefetch"] == False
+
+    @pytest.mark.asyncio
+    async def test_existing_extraction_still_works(self):
+        """Test that extraction strategies still work in normal mode."""
+        from crawl4ai import JsonCssExtractionStrategy
+
+        schema = {
+            "name": "Links",
+            "baseSelector": "a",
+            "fields": [
+                {"name": "href", "selector": "", "type": "attribute", "attribute": "href"},
+                {"name": "text", "selector": "", "type": "text"}
+            ]
+        }
+
+        async with AsyncWebCrawler() as crawler:
+            config = CrawlerRunConfig(
+                extraction_strategy=JsonCssExtractionStrategy(schema=schema)
+            )
+            result = await crawler.arun(TEST_URL, config=config)
+
+            assert result.extracted_content is not None
+
+    @pytest.mark.asyncio
+    async def test_existing_deep_crawl_still_works(self):
+        """Test that deep crawl without prefetch still does full processing."""
+        from crawl4ai import BFSDeepCrawlStrategy
+
+        async with AsyncWebCrawler() as crawler:
+            config = CrawlerRunConfig(
+                deep_crawl_strategy=BFSDeepCrawlStrategy(
+                    max_depth=1,
+                    max_pages=2
+                )
+                # No prefetch - should do full processing
+            )
+
+            result_container = await crawler.arun(TEST_URL, config=config)
+
+            # Handle both list and iterator results
+            if hasattr(result_container, '__aiter__'):
+                results = [r async for r in result_container]
+            else:
+                results = list(result_container) if hasattr(result_container, '__iter__') else [result_container]
+
+            # Each result should have full processing
+            for result in results:
+                assert result.cleaned_html is not None
+
+            assert len(results) >= 1
+
+    @pytest.mark.asyncio
+    async def test_raw_url_scheme_still_works(self):
+        """Test that raw: URL scheme works for processing stored HTML."""
+        sample_html = """
+        <html>
+            <head><title>Test Page</title></head>
+            <body>
+                <h1>Hello World</h1>
+                <p>This is a test paragraph.</p>
+                <a href="/link1">Link 1</a>
+            </body>
+        </html>
+        """
+
+        async with AsyncWebCrawler() as crawler:
+            config = CrawlerRunConfig()
+            result = await crawler.arun(f"raw:{sample_html}", config=config)
+
+            assert result.success is True
+            assert result.html is not None
+            assert "Hello World" in result.html
+            assert result.cleaned_html is not None
+
+    @pytest.mark.asyncio
+    async def test_screenshot_still_works(self):
+        """Test that screenshot option still works in normal mode."""
+        async with AsyncWebCrawler() as crawler:
+            config = CrawlerRunConfig(screenshot=True)
+            result = await crawler.arun(TEST_URL, config=config)
+
+            assert result.success is True
+            # Screenshot data should be present
+            assert result.screenshot is not None or result.screenshot_data is not None
+
+    @pytest.mark.asyncio
+    async def test_js_execution_still_works(self):
+        """Test that JavaScript execution still works in normal mode."""
+        async with AsyncWebCrawler() as crawler:
+            config = CrawlerRunConfig(
+                js_code="document.querySelector('h1')?.textContent"
+            )
+            result = await crawler.arun(TEST_URL, config=config)
+
+            assert result.success is True
+            assert result.html is not None
+
+
+class TestPrefetchDoesNotAffectOtherModes:
+    """Test that prefetch doesn't interfere with other configurations."""
+
+    @pytest.mark.asyncio
+    async def test_prefetch_with_other_options_ignored(self):
+        """Test that other options are properly ignored in prefetch mode."""
+        async with AsyncWebCrawler() as crawler:
+            config = CrawlerRunConfig(
+                prefetch=True,
+                # These should be ignored in prefetch mode
+                screenshot=True,
+                pdf=True,
+                only_text=True,
+                word_count_threshold=100
+            )
+            result = await crawler.arun(TEST_URL, config=config)
+
+            # Should still return HTML and links
+            assert result.html is not None
+            assert result.links is not None
+
+            # But should NOT have processed content
+            assert result.cleaned_html is None
+            assert result.extracted_content is None
+
+    @pytest.mark.asyncio
+    async def test_stream_mode_still_works(self):
+        """Test that stream mode still works normally."""
+        async with AsyncWebCrawler() as crawler:
+            config = CrawlerRunConfig(stream=True)
+            result = await crawler.arun(TEST_URL, config=config)
+
+            assert result.success is True
+            assert result.html is not None
+
+    @pytest.mark.asyncio
+    async def test_cache_mode_still_works(self):
+        """Test that cache mode still works normally."""
+        from crawl4ai import CacheMode
+
+        async with AsyncWebCrawler() as crawler:
+            # First request - bypass cache
+            config1 = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+            result1 = await crawler.arun(TEST_URL, config=config1)
+            assert result1.success is True
+
+            # Second request - should work
+            config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
+            result2 = await crawler.arun(TEST_URL, config=config2)
+            assert result2.success is True
+
+
+class TestBackwardsCompatibility:
+    """Test backwards compatibility with existing code patterns."""
+
+    @pytest.mark.asyncio
+    async def test_config_without_prefetch_works(self):
+        """Test that configs created without prefetch parameter work."""
+        # Simulating old code that doesn't know about prefetch
+        config = CrawlerRunConfig(
+            word_count_threshold=50,
+            css_selector="body"
+        )
+
+        # Should default to prefetch=False
+        assert config.prefetch == False
+
+        async with AsyncWebCrawler() as crawler:
+            result = await crawler.arun(TEST_URL, config=config)
+            assert result.success is True
+            assert result.cleaned_html is not None
+
+    @pytest.mark.asyncio
+    async def test_from_kwargs_without_prefetch(self):
+        """Test CrawlerRunConfig.from_kwargs works without prefetch."""
+        config = CrawlerRunConfig.from_kwargs({
+            "word_count_threshold": 50,
+            "verbose": False
+        })
+
+        assert config.prefetch == False
--- a/tests/test_raw_html_browser.py
+++ b/tests/test_raw_html_browser.py
@@ -0,0 +1,172 @@
+"""
+Tests for raw:/file:// URL browser pipeline support.
+
+Tests the new feature that allows js_code, wait_for, and other browser operations
+to work with raw: and file:// URLs by routing them through _crawl_web() with
+set_content() instead of goto().
+"""
+
+import pytest
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+
+@pytest.mark.asyncio
+async def test_raw_html_fast_path():
+    """Test that raw: without browser params returns HTML directly (fast path)."""
+    html = "<html><body><div id='test'>Original Content</div></body></html>"
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig()  # No browser params
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    assert "Original Content" in result.html
+    # Fast path should not modify the HTML
+    assert result.html == html
+
+
+@pytest.mark.asyncio
+async def test_js_code_on_raw_html():
+    """Test that js_code executes on raw: HTML and modifies the DOM."""
+    html = "<html><body><div id='test'>Original</div></body></html>"
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(
+            js_code="document.getElementById('test').innerText = 'Modified by JS'"
+        )
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    assert "Modified by JS" in result.html
+    assert "Original" not in result.html or "Modified by JS" in result.html
+
+
+@pytest.mark.asyncio
+async def test_js_code_adds_element_to_raw_html():
+    """Test that js_code can add new elements to raw: HTML."""
+    html = "<html><body><div id='container'></div></body></html>"
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(
+            js_code='document.getElementById("container").innerHTML = "<span id=\'injected\'>Custom Content</span>"'
+        )
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    assert "injected" in result.html
+    assert "Custom Content" in result.html
+
+
+@pytest.mark.asyncio
+async def test_screenshot_on_raw_html():
+    """Test that screenshots work on raw: HTML."""
+    html = "<html><body><h1 style='color:red;font-size:48px;'>Screenshot Test</h1></body></html>"
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(screenshot=True)
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    assert result.screenshot is not None
+    assert len(result.screenshot) > 100  # Should have substantial screenshot data
+
+
+@pytest.mark.asyncio
+async def test_process_in_browser_flag():
+    """Test that process_in_browser=True forces browser path even without other params."""
+    html = "<html><body><div>Test</div></body></html>"
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(process_in_browser=True)
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    # Browser path normalizes HTML, so it may be slightly different
+    assert "Test" in result.html
+
+
+@pytest.mark.asyncio
+async def test_raw_prefix_variations():
+    """Test both raw: and raw:// prefix formats."""
+    html = "<html><body>Content</body></html>"
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(
+            js_code='document.body.innerHTML += "<div id=\'added\'>Added</div>"'
+        )
+
+        # Test raw: prefix
+        result1 = await crawler.arun(f"raw:{html}", config=config)
+        assert result1.success
+        assert "Added" in result1.html
+
+        # Test raw:// prefix
+        result2 = await crawler.arun(f"raw://{html}", config=config)
+        assert result2.success
+        assert "Added" in result2.html
+
+
+@pytest.mark.asyncio
+async def test_wait_for_on_raw_html():
+    """Test that wait_for works with raw: HTML after js_code modifies DOM."""
+    html = "<html><body><div id='container'></div></body></html>"
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(
+            js_code='''
+                setTimeout(() => {
+                    document.getElementById('container').innerHTML = '<div id="delayed">Delayed Content</div>';
+                }, 100);
+            ''',
+            wait_for="#delayed",
+            wait_for_timeout=5000
+        )
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    assert "Delayed Content" in result.html
+
+
+@pytest.mark.asyncio
+async def test_multiple_js_code_scripts():
+    """Test that multiple js_code scripts execute in order."""
+    html = "<html><body><div id='counter'>0</div></body></html>"
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(
+            js_code=[
+                "document.getElementById('counter').innerText = '1'",
+                "document.getElementById('counter').innerText = parseInt(document.getElementById('counter').innerText) + 1",
+                "document.getElementById('counter').innerText = parseInt(document.getElementById('counter').innerText) + 1",
+            ]
+        )
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    assert ">3<" in result.html  # Counter should be 3 after all scripts run
+
+
+if __name__ == "__main__":
+    # Run a quick manual test
+    async def quick_test():
+        html = "<html><body><div id='test'>Original</div></body></html>"
+
+        async with AsyncWebCrawler(verbose=True) as crawler:
+            # Test 1: Fast path
+            print("\n=== Test 1: Fast path (no browser params) ===")
+            result1 = await crawler.arun(f"raw:{html}")
+            print(f"Success: {result1.success}")
+            print(f"HTML contains 'Original': {'Original' in result1.html}")
+
+            # Test 2: js_code modifies DOM
+            print("\n=== Test 2: js_code modifies DOM ===")
+            config = CrawlerRunConfig(
+                js_code="document.getElementById('test').innerText = 'Modified by JS'"
+            )
+            result2 = await crawler.arun(f"raw:{html}", config=config)
+            print(f"Success: {result2.success}")
+            print(f"HTML contains 'Modified by JS': {'Modified by JS' in result2.html}")
+            print(f"HTML snippet: {result2.html[:500]}...")
+
+    asyncio.run(quick_test())
--- a/tests/test_raw_html_edge_cases.py
+++ b/tests/test_raw_html_edge_cases.py
@@ -0,0 +1,563 @@
+"""
+BRUTAL edge case tests for raw:/file:// URL browser pipeline.
+
+These tests try to break the system with tricky inputs, edge cases,
+and compatibility checks to ensure we didn't break existing functionality.
+"""
+
+import pytest
+import asyncio
+import tempfile
+import os
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+
+# ============================================================================
+# EDGE CASE: Hash characters in HTML (previously broke urlparse - Issue #283)
+# ============================================================================
+
+@pytest.mark.asyncio
+async def test_raw_html_with_hash_in_css():
+    """Test that # in CSS colors doesn't break HTML parsing (regression for #283)."""
+    html = """
+    <html>
+    <head>
+        <style>
+            body { background-color: #ff5733; color: #333333; }
+            .highlight { border: 1px solid #000; }
+        </style>
+    </head>
+    <body>
+        <div class="highlight" style="color: #ffffff;">Content with hash colors</div>
+    </body>
+    </html>
+    """
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(js_code="document.body.innerHTML += '<div id=\"added\">Added</div>'")
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    assert "#ff5733" in result.html or "ff5733" in result.html  # Color should be preserved
+    assert "Added" in result.html  # JS executed
+    assert "Content with hash colors" in result.html  # Original content preserved
+
+
+@pytest.mark.asyncio
+async def test_raw_html_with_fragment_links():
+    """Test HTML with # fragment links doesn't break."""
+    html = """
+    <html><body>
+        <a href="#section1">Go to section 1</a>
+        <a href="#section2">Go to section 2</a>
+        <div id="section1">Section 1</div>
+        <div id="section2">Section 2</div>
+    </body></html>
+    """
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(js_code="document.getElementById('section1').innerText = 'Modified Section 1'")
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    assert "Modified Section 1" in result.html
+    assert "#section2" in result.html  # Fragment link preserved
+
+
+# ============================================================================
+# EDGE CASE: Special characters and unicode
+# ============================================================================
+
+@pytest.mark.asyncio
+async def test_raw_html_with_unicode():
+    """Test raw HTML with various unicode characters."""
+    html = """
+    <html><body>
+        <div id="unicode">日本語 中文 한국어 العربية 🎉 💻 🚀</div>
+        <div id="special">&amp; &lt; &gt; &quot; &apos;</div>
+    </body></html>
+    """
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(js_code="document.getElementById('unicode').innerText += ' ✅ Modified'")
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    assert "✅ Modified" in result.html or "Modified" in result.html
+    # Check unicode is preserved
+    assert "日本語" in result.html or "&#" in result.html  # Either preserved or encoded
+
+
+@pytest.mark.asyncio
+async def test_raw_html_with_script_tags():
+    """Test raw HTML with existing script tags doesn't interfere with js_code."""
+    html = """
+    <html><body>
+        <div id="counter">0</div>
+        <script>
+            // This script runs on page load
+            document.getElementById('counter').innerText = '10';
+        </script>
+    </body></html>
+    """
+
+    async with AsyncWebCrawler() as crawler:
+        # Our js_code runs AFTER the page scripts
+        config = CrawlerRunConfig(
+            js_code="document.getElementById('counter').innerText = parseInt(document.getElementById('counter').innerText) + 5"
+        )
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    # The embedded script sets it to 10, then our js_code adds 5
+    assert ">15<" in result.html or "15" in result.html
+
+
+# ============================================================================
+# EDGE CASE: Empty and malformed HTML
+# ============================================================================
+
+@pytest.mark.asyncio
+async def test_raw_html_empty():
+    """Test empty raw HTML."""
+    html = ""
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(js_code="document.body.innerHTML = '<div>Added to empty</div>'")
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    assert "Added to empty" in result.html
+
+
+@pytest.mark.asyncio
+async def test_raw_html_minimal():
+    """Test minimal HTML (just text, no tags)."""
+    html = "Just plain text, no HTML tags"
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(js_code="document.body.innerHTML += '<div id=\"injected\">Injected</div>'")
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    # Browser should wrap it in proper HTML
+    assert "Injected" in result.html
+
+
+@pytest.mark.asyncio
+async def test_raw_html_malformed():
+    """Test malformed HTML with unclosed tags."""
+    html = "<html><body><div><span>Unclosed tags<div>More content"
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(js_code="document.body.innerHTML += '<div id=\"valid\">Valid Added</div>'")
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    assert "Valid Added" in result.html
+    # Browser should have fixed the malformed HTML
+
+
+# ============================================================================
+# EDGE CASE: Very large HTML
+# ============================================================================
+
+@pytest.mark.asyncio
+async def test_raw_html_large():
+    """Test large raw HTML (100KB+)."""
+    # Generate 100KB of HTML
+    items = "".join([f'<div class="item" id="item-{i}">Item {i} content here with some text</div>\n' for i in range(2000)])
+    html = f"<html><body>{items}</body></html>"
+
+    assert len(html) > 100000  # Verify it's actually large
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(
+            js_code="document.getElementById('item-999').innerText = 'MODIFIED ITEM 999'"
+        )
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    assert "MODIFIED ITEM 999" in result.html
+    assert "item-1999" in result.html  # Last item should still exist
+
+
+# ============================================================================
+# EDGE CASE: JavaScript errors and timeouts
+# ============================================================================
+
+@pytest.mark.asyncio
+async def test_raw_html_js_error_doesnt_crash():
+    """Test that JavaScript errors in js_code don't crash the crawl."""
+    html = "<html><body><div id='test'>Original</div></body></html>"
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(
+            js_code=[
+                "nonExistentFunction();",  # This will throw an error
+                "document.getElementById('test').innerText = 'Still works'"  # This should still run
+            ]
+        )
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    # Crawl should succeed even with JS errors
+    assert result.success
+
+
+@pytest.mark.asyncio
+async def test_raw_html_wait_for_timeout():
+    """Test wait_for with element that never appears times out gracefully."""
+    html = "<html><body><div id='test'>Original</div></body></html>"
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(
+            wait_for="#never-exists",
+            wait_for_timeout=1000  # 1 second timeout
+        )
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    # Should timeout but still return the HTML we have
+    # The behavior might be success=False or success=True with partial content
+    # Either way, it shouldn't hang or crash
+    assert result is not None
+
+
+# ============================================================================
+# COMPATIBILITY: Normal HTTP URLs still work
+# ============================================================================
+
+@pytest.mark.asyncio
+async def test_http_urls_still_work():
+    """Ensure we didn't break normal HTTP URL crawling."""
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun("https://example.com")
+
+    assert result.success
+    assert "Example Domain" in result.html
+
+
+@pytest.mark.asyncio
+async def test_http_with_js_code_still_works():
+    """Ensure HTTP URLs with js_code still work."""
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(
+            js_code="document.body.innerHTML += '<div id=\"injected\">Injected via JS</div>'"
+        )
+        result = await crawler.arun("https://example.com", config=config)
+
+    assert result.success
+    assert "Injected via JS" in result.html
+
+
+# ============================================================================
+# COMPATIBILITY: File URLs
+# ============================================================================
+
+@pytest.mark.asyncio
+async def test_file_url_with_js_code():
+    """Test file:// URLs with js_code execution."""
+    # Create a temp file
+    with tempfile.NamedTemporaryFile(mode='w', suffix='.html', delete=False) as f:
+        f.write("<html><body><div id='file-content'>File Content</div></body></html>")
+        temp_path = f.name
+
+    try:
+        async with AsyncWebCrawler() as crawler:
+            config = CrawlerRunConfig(
+                js_code="document.getElementById('file-content').innerText = 'Modified File Content'"
+            )
+            result = await crawler.arun(f"file://{temp_path}", config=config)
+
+        assert result.success
+        assert "Modified File Content" in result.html
+    finally:
+        os.unlink(temp_path)
+
+
+@pytest.mark.asyncio
+async def test_file_url_fast_path():
+    """Test file:// fast path (no browser params)."""
+    with tempfile.NamedTemporaryFile(mode='w', suffix='.html', delete=False) as f:
+        f.write("<html><body>Fast path file content</body></html>")
+        temp_path = f.name
+
+    try:
+        async with AsyncWebCrawler() as crawler:
+            result = await crawler.arun(f"file://{temp_path}")
+
+        assert result.success
+        assert "Fast path file content" in result.html
+    finally:
+        os.unlink(temp_path)
+
+
+# ============================================================================
+# COMPATIBILITY: Extraction strategies with raw HTML
+# ============================================================================
+
+@pytest.mark.asyncio
+async def test_raw_html_with_css_extraction():
+    """Test CSS extraction on raw HTML after js_code modifies it."""
+    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+    html = """
+    <html><body>
+        <div class="products">
+            <div class="product"><span class="name">Original Product</span></div>
+        </div>
+    </body></html>
+    """
+
+    schema = {
+        "name": "Products",
+        "baseSelector": ".product",
+        "fields": [
+            {"name": "name", "selector": ".name", "type": "text"}
+        ]
+    }
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(
+            js_code="""
+                document.querySelector('.products').innerHTML +=
+                '<div class="product"><span class="name">JS Added Product</span></div>';
+            """,
+            extraction_strategy=JsonCssExtractionStrategy(schema)
+        )
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    # Check that extraction found both products
+    import json
+    extracted = json.loads(result.extracted_content)
+    names = [p.get('name', '') for p in extracted]
+    assert any("JS Added Product" in name for name in names)
+
+
+# ============================================================================
+# EDGE CASE: Concurrent raw: requests
+# ============================================================================
+
+@pytest.mark.asyncio
+async def test_concurrent_raw_requests():
+    """Test multiple concurrent raw: requests don't interfere."""
+    htmls = [
+        f"<html><body><div id='test'>Request {i}</div></body></html>"
+        for i in range(5)
+    ]
+
+    async with AsyncWebCrawler() as crawler:
+        configs = [
+            CrawlerRunConfig(
+                js_code=f"document.getElementById('test').innerText += ' Modified {i}'"
+            )
+            for i in range(5)
+        ]
+
+        # Run concurrently
+        tasks = [
+            crawler.arun(f"raw:{html}", config=config)
+            for html, config in zip(htmls, configs)
+        ]
+        results = await asyncio.gather(*tasks)
+
+    for i, result in enumerate(results):
+        assert result.success
+        assert f"Request {i}" in result.html
+        assert f"Modified {i}" in result.html
+
+
+# ============================================================================
+# EDGE CASE: raw: with base_url for link resolution
+# ============================================================================
+
+@pytest.mark.asyncio
+async def test_raw_html_with_base_url():
+    """Test that base_url is used for link resolution in markdown."""
+    html = """
+    <html><body>
+        <a href="/page1">Page 1</a>
+        <a href="/page2">Page 2</a>
+        <img src="/images/logo.png" alt="Logo">
+    </body></html>
+    """
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(
+            base_url="https://example.com",
+            process_in_browser=True  # Force browser to test base_url handling
+        )
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    # Check markdown has absolute URLs
+    if result.markdown:
+        # Links should be absolute
+        md = result.markdown.raw_markdown if hasattr(result.markdown, 'raw_markdown') else str(result.markdown)
+        assert "example.com" in md or "/page1" in md
+
+
+# ============================================================================
+# EDGE CASE: raw: with screenshot of complex page
+# ============================================================================
+
+@pytest.mark.asyncio
+async def test_raw_html_screenshot_complex_page():
+    """Test screenshot of complex raw HTML with CSS and JS modifications."""
+    html = """
+    <html>
+    <head>
+        <style>
+            body { font-family: Arial; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 40px; }
+            .card { background: white; padding: 20px; border-radius: 10px; box-shadow: 0 4px 6px rgba(0,0,0,0.1); }
+            h1 { color: #333; }
+        </style>
+    </head>
+    <body>
+        <div class="card">
+            <h1 id="title">Original Title</h1>
+            <p>This is a test card with styling.</p>
+        </div>
+    </body>
+    </html>
+    """
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(
+            js_code="document.getElementById('title').innerText = 'Modified Title'",
+            screenshot=True
+        )
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    assert result.screenshot is not None
+    assert len(result.screenshot) > 1000  # Should be substantial
+    assert "Modified Title" in result.html
+
+
+# ============================================================================
+# EDGE CASE: JavaScript that tries to navigate away
+# ============================================================================
+
+@pytest.mark.asyncio
+async def test_raw_html_js_navigation_blocked():
+    """Test that JS trying to navigate doesn't break the crawl."""
+    html = """
+    <html><body>
+        <div id="content">Original Content</div>
+        <script>
+            // Try to navigate away (should be blocked or handled)
+            // window.location.href = 'https://example.com';
+        </script>
+    </body></html>
+    """
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(
+            # Try to navigate via js_code
+            js_code=[
+                "document.getElementById('content').innerText = 'Before navigation attempt'",
+                # Actual navigation attempt commented - would cause issues
+                # "window.location.href = 'https://example.com'",
+            ]
+        )
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    assert "Before navigation attempt" in result.html
+
+
+# ============================================================================
+# EDGE CASE: Raw HTML with iframes
+# ============================================================================
+
+@pytest.mark.asyncio
+async def test_raw_html_with_iframes():
+    """Test raw HTML containing iframes."""
+    html = """
+    <html><body>
+        <div id="main">Main content</div>
+        <iframe id="frame1" srcdoc="<html><body><div id='iframe-content'>Iframe Content</div></body></html>"></iframe>
+    </body></html>
+    """
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(
+            js_code="document.getElementById('main').innerText = 'Modified main'",
+            process_iframes=True
+        )
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    assert "Modified main" in result.html
+
+
+# ============================================================================
+# TRICKY: Protocol inside raw content
+# ============================================================================
+
+@pytest.mark.asyncio
+async def test_raw_html_with_urls_inside():
+    """Test raw: with http:// URLs inside the content."""
+    html = """
+    <html><body>
+        <a href="http://example.com">Example</a>
+        <a href="https://google.com">Google</a>
+        <img src="https://placekitten.com/200/300" alt="Cat">
+        <div id="test">Test content with URL: https://test.com</div>
+    </body></html>
+    """
+
+    async with AsyncWebCrawler() as crawler:
+        config = CrawlerRunConfig(
+            js_code="document.getElementById('test').innerText += ' - Modified'"
+        )
+        result = await crawler.arun(f"raw:{html}", config=config)
+
+    assert result.success
+    assert "Modified" in result.html
+    assert "http://example.com" in result.html or "example.com" in result.html
+
+
+# ============================================================================
+# TRICKY: Double raw: prefix
+# ============================================================================
+
+@pytest.mark.asyncio
+async def test_double_raw_prefix():
+    """Test what happens with double raw: prefix (edge case)."""
+    html = "<html><body>Content</body></html>"
+
+    async with AsyncWebCrawler() as crawler:
+        # raw:raw:<html>... - the second raw: becomes part of content
+        result = await crawler.arun(f"raw:raw:{html}")
+
+    # Should either handle gracefully or return "raw:<html>..." as content
+    assert result is not None
+
+
+if __name__ == "__main__":
+    import sys
+
+    async def run_tests():
+        # Run a few key tests manually
+        tests = [
+            ("Hash in CSS", test_raw_html_with_hash_in_css),
+            ("Unicode", test_raw_html_with_unicode),
+            ("Large HTML", test_raw_html_large),
+            ("HTTP still works", test_http_urls_still_work),
+            ("Concurrent requests", test_concurrent_raw_requests),
+            ("Complex screenshot", test_raw_html_screenshot_complex_page),
+        ]
+
+        for name, test_fn in tests:
+            print(f"\n=== Running: {name} ===")
+            try:
+                await test_fn()
+                print(f"✅ {name} PASSED")
+            except Exception as e:
+                print(f"❌ {name} FAILED: {e}")
+                import traceback
+                traceback.print_exc()
+
+    asyncio.run(run_tests())
--- a/uv.lock
+++ b/uv.lock