Release v0.8.0: Crash Recovery, Prefetch Mode & Security Fixes (#1712)
* Fix: Use correct URL variable for raw HTML extraction (#1116) - Prevents full HTML content from being passed as URL to extraction strategies - Added unit tests to verify raw HTML and regular URL processing Fix: Wrong URL variable used for extraction of raw html * Fix #1181: Preserve whitespace in code blocks during HTML scraping The remove_empty_elements_fast() method was removing whitespace-only span elements inside <pre> and <code> tags, causing import statements like "import torch" to become "importtorch". Now skips elements inside code blocks where whitespace is significant. * Refactor Pydantic model configuration to use ConfigDict for arbitrary types * Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621 * Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638 * fix: ensure BrowserConfig.to_dict serializes proxy_config * feat: make LLM backoff configurable end-to-end - extend LLMConfig with backoff delay/attempt/factor fields and thread them through LLMExtractionStrategy, LLMContentFilter, table extraction, and Docker API handlers - expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff and document them in the md_v2 guides * reproduced AttributeError from #1642 * pass timeout parameter to docker client request * added missing deep crawling objects to init * generalized query in ContentRelevanceFilter to be a str or list * import modules from enhanceable deserialization * parameterized tests * Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268 * refactor: replace PyPDF2 with pypdf across the codebase. ref #1412 * Add browser_context_id and target_id parameters to BrowserConfig Enable Crawl4AI to connect to pre-created CDP browser contexts, which is essential for cloud browser services that pre-create isolated contexts. Changes: - Add browser_context_id and target_id parameters to BrowserConfig - Update from_kwargs() and to_dict() methods - Modify BrowserManager.start() to use existing context when provided - Add _get_page_by_target_id() helper method - Update get_page() to handle pre-existing targets - Add test for browser_context_id functionality This enables cloud services to: 1. Create isolated CDP contexts before Crawl4AI connects 2. Pass context/target IDs to BrowserConfig 3. Have Crawl4AI reuse existing contexts instead of creating new ones * Add cdp_cleanup_on_close flag to prevent memory leaks in cloud/server scenarios * Fix: add cdp_cleanup_on_close to from_kwargs * Fix: find context by target_id for concurrent CDP connections * Fix: use target_id to find correct page in get_page * Fix: use CDP to find context by browserContextId for concurrent sessions * Revert context matching attempts - Playwright cannot see CDP-created contexts * Add create_isolated_context flag for concurrent CDP crawls When True, forces creation of a new browser context instead of reusing the default context. Essential for concurrent crawls on the same browser to prevent navigation conflicts. * Add context caching to create_isolated_context branch Uses contexts_by_config cache (same as non-CDP mode) to reuse contexts for multiple URLs with same config. Still creates new page per crawl for navigation isolation. Benefits batch/deep crawls. 
* Add init_scripts support to BrowserConfig for pre-page-load JS injection This adds the ability to inject JavaScript that runs before any page loads, useful for stealth evasions (canvas/audio fingerprinting, userAgentData). - Add init_scripts parameter to BrowserConfig (list of JS strings) - Apply init_scripts in setup_context() via context.add_init_script() - Update from_kwargs() and to_dict() for serialization * Fix CDP connection handling: support WS URLs and proper cleanup Changes to browser_manager.py: 1. _verify_cdp_ready(): Support multiple URL formats - WebSocket URLs (ws://, wss://): Skip HTTP verification, Playwright handles directly - HTTP URLs with query params: Properly parse with urlparse to preserve query string - Fixes issue where naive f"{cdp_url}/json/version" broke WS URLs and query params 2. close(): Proper cleanup when cdp_cleanup_on_close=True - Close all sessions (pages) - Close all contexts - Call browser.close() to disconnect (doesn't terminate browser, just releases connection) - Wait 1 second for CDP connection to fully release - Stop Playwright instance to prevent memory leaks This enables: - Connecting to specific browsers via WS URL - Reusing the same browser with multiple sequential connections - No user wait needed between connections (internal 1s delay handles it) Added tests/browser/test_cdp_cleanup_reuse.py with comprehensive tests. * Update gitignore * Some debugging for caching * Add _generate_screenshot_from_html for raw: and file:// URLs Implements the missing method that was being called but never defined. Now raw: and file:// URLs can generate screenshots by: 1. Loading HTML into a browser page via page.set_content() 2. Taking screenshot using existing take_screenshot() method 3. Cleaning up the page afterward This enables cached HTML to be rendered with screenshots in crawl4ai-cloud. * Add PDF and MHTML support for raw: and file:// URLs - Replace _generate_screenshot_from_html with _generate_media_from_html - New method handles screenshot, PDF, and MHTML in one browser session - Update raw: and file:// URL handlers to use new method - Enables cached HTML to generate all media types * Add crash recovery for deep crawl strategies Add optional resume_state and on_state_change parameters to all deep crawl strategies (BFS, DFS, Best-First) for cloud deployment crash recovery. Features: - resume_state: Pass saved state to resume from checkpoint - on_state_change: Async callback fired after each URL for real-time state persistence to external storage (Redis, DB, etc.) - export_state(): Get last captured state manually - Zero overhead when features are disabled (None defaults) State includes visited URLs, pending queue/stack, depths, and pages_crawled count. All state is JSON-serializable. * Fix: HTTP strategy raw: URL parsing truncates at # character The AsyncHTTPCrawlerStrategy.crawl() method used urlparse() to extract content from raw: URLs. This caused HTML with CSS color codes like #eee to be truncated because # is treated as a URL fragment delimiter. Before: raw:body{background:#eee} -> parsed.path = 'body{background:' After: raw:body{background:#eee} -> raw_content = 'body{background:#eee' Fix: Strip the raw: or raw:// prefix directly instead of using urlparse, matching how the browser strategy handles it. * Add base_url parameter to CrawlerRunConfig for raw HTML processing When processing raw: HTML (e.g., from cache), the URL parameter is meaningless for markdown link resolution. 
This adds a base_url parameter that can be set explicitly to provide proper URL resolution context. Changes: - Add base_url parameter to CrawlerRunConfig.__init__ - Add base_url to CrawlerRunConfig.from_kwargs - Update aprocess_html to use base_url for markdown generation Usage: config = CrawlerRunConfig(base_url='https://example.com') result = await crawler.arun(url='raw:{html}', config=config) * Add prefetch mode for two-phase deep crawling - Add `prefetch` parameter to CrawlerRunConfig - Add `quick_extract_links()` function for fast link extraction - Add short-circuit in aprocess_html() for prefetch mode - Add 42 tests (unit, integration, regression) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Updates on proxy rotation and proxy configuration * Add proxy support to HTTP crawler strategy * Add browser pipeline support for raw:/file:// URLs - Add process_in_browser parameter to CrawlerRunConfig - Route raw:/file:// URLs through _crawl_web() when browser operations needed - Use page.set_content() instead of goto() for local content - Fix cookie handling for non-HTTP URLs in browser_manager - Auto-detect browser requirements: js_code, wait_for, screenshot, etc. - Maintain fast path for raw:/file:// without browser params Fixes #310 * Add smart TTL cache for sitemap URL seeder - Add cache_ttl_hours and validate_sitemap_lastmod params to SeedingConfig - New JSON cache format with metadata (version, created_at, lastmod, url_count) - Cache validation by TTL expiry and sitemap lastmod comparison - Auto-migration from old .jsonl to new .json format - Fixes bug where incomplete cache was used indefinitely * Update URL seeder docs with smart TTL cache parameters - Add cache_ttl_hours and validate_sitemap_lastmod to parameter table - Document smart TTL cache validation with examples - Add cache-related troubleshooting entries - Update key features summary * Add MEMORY.md to gitignore * Docs: Add multi-sample schema generation section Add documentation explaining how to pass multiple HTML samples to generate_schema() for stable selectors that work across pages with varying DOM structures. Includes: - Problem explanation (fragile nth-child selectors) - Solution with code example - Key points for multi-sample queries - Comparison table of fragile vs stable selectors * Fix critical RCE and LFI vulnerabilities in Docker API deployment Security fixes for vulnerabilities reported by ProjectDiscovery: 1. Remote Code Execution via Hooks (CVE pending) - Remove __import__ from allowed_builtins in hook_manager.py - Prevents arbitrary module imports (os, subprocess, etc.) - Hooks now disabled by default via CRAWL4AI_HOOKS_ENABLED env var 2. Local File Inclusion via file:// URLs (CVE pending) - Add URL scheme validation to /execute_js, /screenshot, /pdf, /html - Block file://, javascript:, data: and other dangerous schemes - Only allow http://, https://, and raw: (where appropriate) 3. 
Security hardening - Add CRAWL4AI_HOOKS_ENABLED=false as default (opt-in for hooks) - Add security warning comments in config.yml - Add validate_url_scheme() helper for consistent validation Testing: - Add unit tests (test_security_fixes.py) - 16 tests - Add integration tests (run_security_tests.py) for live server Affected endpoints: - POST /crawl (hooks disabled by default) - POST /crawl/stream (hooks disabled by default) - POST /execute_js (URL validation added) - POST /screenshot (URL validation added) - POST /pdf (URL validation added) - POST /html (URL validation added) Breaking changes: - Hooks require CRAWL4AI_HOOKS_ENABLED=true to function - file:// URLs no longer work on API endpoints (use library directly) * Enhance authentication flow by implementing JWT token retrieval and adding authorization headers to API requests * Add release notes for v0.7.9, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates * Add release notes for v0.8.0, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates Documentation for v0.8.0 release: - SECURITY.md: Security policy and vulnerability reporting guidelines - RELEASE_NOTES_v0.8.0.md: Comprehensive release notes - migration/v0.8.0-upgrade-guide.md: Step-by-step migration guide - security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts - CHANGELOG.md: Updated with v0.8.0 changes Breaking changes documented: - Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED) - file:// URLs blocked on Docker API endpoints Security fixes credited to Neo by ProjectDiscovery * Add examples for deep crawl crash recovery and prefetch mode in documentation * Release v0.8.0: The v0.8.0 Update - Updated version to 0.8.0 - Added comprehensive demo and release notes - Updated all documentation * Update security researcher acknowledgment with a hyperlink for Neo by ProjectDiscovery * Add async agenerate_schema method for schema generation - Extract prompt building to shared _build_schema_prompt() method - Add agenerate_schema() async version using aperform_completion_with_backoff - Refactor generate_schema() to use shared prompt builder - Fixes Gemini/Vertex AI compatibility in async contexts (FastAPI) * Fix: Enable litellm.drop_params for O-series/GPT-5 model compatibility O-series (o1, o3) and GPT-5 models only support temperature=1. Setting litellm.drop_params=True auto-drops unsupported parameters instead of throwing UnsupportedParamsError. Fixes temperature=0.01 error for these models in LLM extraction. --------- Co-authored-by: rbushria <rbushri@gmail.com> Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com> Co-authored-by: Soham Kukreti <kukretisoham@gmail.com> Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com> Co-authored-by: unclecode <unclecode@kidocode.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
489
tests/browser/test_browser_context_id.py
Normal file
489
tests/browser/test_browser_context_id.py
Normal file
@@ -0,0 +1,489 @@
|
||||
"""Test for browser_context_id and target_id parameters.
|
||||
|
||||
These tests verify that Crawl4AI can connect to and use pre-created
|
||||
browser contexts, which is essential for cloud browser services that
|
||||
pre-create isolated contexts for each user.
|
||||
|
||||
The flow being tested:
|
||||
1. Start a browser with CDP
|
||||
2. Create a context via raw CDP commands (simulating cloud service)
|
||||
3. Create a page/target in that context
|
||||
4. Have Crawl4AI connect using browser_context_id and target_id
|
||||
5. Verify Crawl4AI uses the existing context/page instead of creating new ones
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import websockets
|
||||
|
||||
# Add the project root to Python path if running directly
|
||||
if __name__ == "__main__":
|
||||
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
|
||||
|
||||
from crawl4ai.browser_manager import BrowserManager, ManagedBrowser
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
from crawl4ai.async_logger import AsyncLogger
|
||||
|
||||
# Create a logger for clear terminal output
|
||||
logger = AsyncLogger(verbose=True, log_file=None)
|
||||
|
||||
|
||||
class CDPContextCreator:
|
||||
"""
|
||||
Helper class to create browser contexts via raw CDP commands.
|
||||
This simulates what a cloud browser service would do.
|
||||
"""
|
||||
|
||||
def __init__(self, cdp_url: str):
|
||||
self.cdp_url = cdp_url
|
||||
self._message_id = 0
|
||||
self._ws = None
|
||||
self._pending_responses = {}
|
||||
self._receiver_task = None
|
||||
|
||||
async def connect(self):
|
||||
"""Establish WebSocket connection to browser."""
|
||||
# Convert HTTP URL to WebSocket URL if needed
|
||||
ws_url = self.cdp_url.replace("http://", "ws://").replace("https://", "wss://")
|
||||
if not ws_url.endswith("/devtools/browser"):
|
||||
# Get the browser websocket URL from /json/version
|
||||
import aiohttp
|
||||
async with aiohttp.ClientSession() as session:
|
||||
async with session.get(f"{self.cdp_url}/json/version") as response:
|
||||
data = await response.json()
|
||||
ws_url = data.get("webSocketDebuggerUrl", ws_url)
|
||||
|
||||
self._ws = await websockets.connect(ws_url, max_size=None, ping_interval=None)
|
||||
self._receiver_task = asyncio.create_task(self._receive_messages())
|
||||
logger.info(f"Connected to CDP at {ws_url}", tag="CDP")
|
||||
|
||||
async def disconnect(self):
|
||||
"""Close WebSocket connection."""
|
||||
if self._receiver_task:
|
||||
self._receiver_task.cancel()
|
||||
try:
|
||||
await self._receiver_task
|
||||
except asyncio.CancelledError:
|
||||
pass
|
||||
if self._ws:
|
||||
await self._ws.close()
|
||||
self._ws = None
|
||||
|
||||
async def _receive_messages(self):
|
||||
"""Background task to receive CDP messages."""
|
||||
try:
|
||||
async for message in self._ws:
|
||||
data = json.loads(message)
|
||||
msg_id = data.get('id')
|
||||
if msg_id is not None and msg_id in self._pending_responses:
|
||||
self._pending_responses[msg_id].set_result(data)
|
||||
except asyncio.CancelledError:
|
||||
pass
|
||||
except Exception as e:
|
||||
logger.error(f"CDP receiver error: {e}", tag="CDP")
|
||||
|
||||
async def _send_command(self, method: str, params: dict = None) -> dict:
|
||||
"""Send CDP command and wait for response."""
|
||||
self._message_id += 1
|
||||
msg_id = self._message_id
|
||||
|
||||
message = {
|
||||
"id": msg_id,
|
||||
"method": method,
|
||||
"params": params or {}
|
||||
}
|
||||
|
||||
future = asyncio.get_event_loop().create_future()
|
||||
self._pending_responses[msg_id] = future
|
||||
|
||||
try:
|
||||
await self._ws.send(json.dumps(message))
|
||||
response = await asyncio.wait_for(future, timeout=30.0)
|
||||
|
||||
if 'error' in response:
|
||||
raise Exception(f"CDP error: {response['error']}")
|
||||
|
||||
return response.get('result', {})
|
||||
finally:
|
||||
self._pending_responses.pop(msg_id, None)
|
||||
|
||||
async def create_context(self) -> dict:
|
||||
"""
|
||||
Create an isolated browser context with a blank page.
|
||||
|
||||
Returns:
|
||||
dict with browser_context_id, target_id, and cdp_session_id
|
||||
"""
|
||||
await self.connect()
|
||||
|
||||
# 1. Create isolated browser context
|
||||
result = await self._send_command("Target.createBrowserContext", {
|
||||
"disposeOnDetach": False # Keep context alive
|
||||
})
|
||||
browser_context_id = result["browserContextId"]
|
||||
logger.info(f"Created browser context: {browser_context_id}", tag="CDP")
|
||||
|
||||
# 2. Create a new page (target) in the context
|
||||
result = await self._send_command("Target.createTarget", {
|
||||
"url": "about:blank",
|
||||
"browserContextId": browser_context_id
|
||||
})
|
||||
target_id = result["targetId"]
|
||||
logger.info(f"Created target: {target_id}", tag="CDP")
|
||||
|
||||
# 3. Attach to the target to get a session ID
|
||||
result = await self._send_command("Target.attachToTarget", {
|
||||
"targetId": target_id,
|
||||
"flatten": True
|
||||
})
|
||||
cdp_session_id = result["sessionId"]
|
||||
logger.info(f"Attached to target, sessionId: {cdp_session_id}", tag="CDP")
|
||||
|
||||
return {
|
||||
"browser_context_id": browser_context_id,
|
||||
"target_id": target_id,
|
||||
"cdp_session_id": cdp_session_id
|
||||
}
|
||||
|
||||
async def get_targets(self) -> list:
|
||||
"""Get list of all targets in the browser."""
|
||||
result = await self._send_command("Target.getTargets")
|
||||
return result.get("targetInfos", [])
|
||||
|
||||
async def dispose_context(self, browser_context_id: str):
|
||||
"""Dispose of a browser context."""
|
||||
try:
|
||||
await self._send_command("Target.disposeBrowserContext", {
|
||||
"browserContextId": browser_context_id
|
||||
})
|
||||
logger.info(f"Disposed browser context: {browser_context_id}", tag="CDP")
|
||||
except Exception as e:
|
||||
logger.warning(f"Error disposing context: {e}", tag="CDP")
|
||||
|
||||
|
||||
async def test_browser_context_id_basic():
|
||||
"""
|
||||
Test that BrowserConfig accepts browser_context_id and target_id parameters.
|
||||
"""
|
||||
logger.info("Testing BrowserConfig browser_context_id parameter", tag="TEST")
|
||||
|
||||
try:
|
||||
# Test that BrowserConfig accepts the new parameters
|
||||
config = BrowserConfig(
|
||||
cdp_url="http://localhost:9222",
|
||||
browser_context_id="test-context-id",
|
||||
target_id="test-target-id",
|
||||
headless=True
|
||||
)
|
||||
|
||||
# Verify parameters are set correctly
|
||||
assert config.browser_context_id == "test-context-id", "browser_context_id not set"
|
||||
assert config.target_id == "test-target-id", "target_id not set"
|
||||
|
||||
# Test from_kwargs
|
||||
config2 = BrowserConfig.from_kwargs({
|
||||
"cdp_url": "http://localhost:9222",
|
||||
"browser_context_id": "test-context-id-2",
|
||||
"target_id": "test-target-id-2"
|
||||
})
|
||||
|
||||
assert config2.browser_context_id == "test-context-id-2", "browser_context_id not set via from_kwargs"
|
||||
assert config2.target_id == "test-target-id-2", "target_id not set via from_kwargs"
|
||||
|
||||
# Test to_dict
|
||||
config_dict = config.to_dict()
|
||||
assert config_dict.get("browser_context_id") == "test-context-id", "browser_context_id not in to_dict"
|
||||
assert config_dict.get("target_id") == "test-target-id", "target_id not in to_dict"
|
||||
|
||||
logger.success("BrowserConfig browser_context_id test passed", tag="TEST")
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Test failed: {str(e)}", tag="TEST")
|
||||
return False
|
||||
|
||||
|
||||
async def test_pre_created_context_usage():
|
||||
"""
|
||||
Test that Crawl4AI uses a pre-created browser context instead of creating a new one.
|
||||
|
||||
This simulates the cloud browser service flow:
|
||||
1. Start browser with CDP
|
||||
2. Create context via raw CDP (simulating cloud service)
|
||||
3. Have Crawl4AI connect with browser_context_id
|
||||
4. Verify it uses existing context
|
||||
"""
|
||||
logger.info("Testing pre-created context usage", tag="TEST")
|
||||
|
||||
# Start a managed browser first
|
||||
browser_config_initial = BrowserConfig(
|
||||
use_managed_browser=True,
|
||||
headless=True,
|
||||
debugging_port=9226, # Use unique port
|
||||
verbose=True
|
||||
)
|
||||
|
||||
managed_browser = ManagedBrowser(browser_config=browser_config_initial, logger=logger)
|
||||
cdp_creator = None
|
||||
manager = None
|
||||
context_info = None
|
||||
|
||||
try:
|
||||
# Start the browser
|
||||
cdp_url = await managed_browser.start()
|
||||
logger.info(f"Browser started at {cdp_url}", tag="TEST")
|
||||
|
||||
# Create a context via raw CDP (simulating cloud service)
|
||||
cdp_creator = CDPContextCreator(cdp_url)
|
||||
context_info = await cdp_creator.create_context()
|
||||
|
||||
logger.info(f"Pre-created context: {context_info['browser_context_id']}", tag="TEST")
|
||||
logger.info(f"Pre-created target: {context_info['target_id']}", tag="TEST")
|
||||
|
||||
# Get initial target count
|
||||
targets_before = await cdp_creator.get_targets()
|
||||
initial_target_count = len(targets_before)
|
||||
logger.info(f"Initial target count: {initial_target_count}", tag="TEST")
|
||||
|
||||
# Now create BrowserManager with browser_context_id and target_id
|
||||
browser_config = BrowserConfig(
|
||||
cdp_url=cdp_url,
|
||||
browser_context_id=context_info['browser_context_id'],
|
||||
target_id=context_info['target_id'],
|
||||
headless=True,
|
||||
verbose=True
|
||||
)
|
||||
|
||||
manager = BrowserManager(browser_config=browser_config, logger=logger)
|
||||
await manager.start()
|
||||
|
||||
logger.info("BrowserManager started with pre-created context", tag="TEST")
|
||||
|
||||
# Get a page
|
||||
crawler_config = CrawlerRunConfig()
|
||||
page, context = await manager.get_page(crawler_config)
|
||||
|
||||
# Navigate to a test page
|
||||
await page.goto("https://example.com", wait_until="domcontentloaded")
|
||||
title = await page.title()
|
||||
|
||||
logger.info(f"Page title: {title}", tag="TEST")
|
||||
|
||||
# Get target count after
|
||||
targets_after = await cdp_creator.get_targets()
|
||||
final_target_count = len(targets_after)
|
||||
logger.info(f"Final target count: {final_target_count}", tag="TEST")
|
||||
|
||||
# Verify: target count should not have increased significantly
|
||||
# (allow for 1 extra target for internal use, but not many more)
|
||||
target_diff = final_target_count - initial_target_count
|
||||
logger.info(f"Target count difference: {target_diff}", tag="TEST")
|
||||
|
||||
# Success criteria:
|
||||
# 1. Page navigation worked
|
||||
# 2. Target count didn't explode (reused existing context)
|
||||
success = title == "Example Domain" and target_diff <= 1
|
||||
|
||||
if success:
|
||||
logger.success("Pre-created context usage test passed", tag="TEST")
|
||||
else:
|
||||
logger.error(f"Test failed - Title: {title}, Target diff: {target_diff}", tag="TEST")
|
||||
|
||||
return success
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Test failed: {str(e)}", tag="TEST")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
finally:
|
||||
# Cleanup
|
||||
if manager:
|
||||
try:
|
||||
await manager.close()
|
||||
except:
|
||||
pass
|
||||
|
||||
if cdp_creator and context_info:
|
||||
try:
|
||||
await cdp_creator.dispose_context(context_info['browser_context_id'])
|
||||
await cdp_creator.disconnect()
|
||||
except:
|
||||
pass
|
||||
|
||||
if managed_browser:
|
||||
try:
|
||||
await managed_browser.cleanup()
|
||||
except:
|
||||
pass
|
||||
|
||||
|
||||
async def test_context_isolation():
|
||||
"""
|
||||
Test that using browser_context_id actually provides isolation.
|
||||
Create two contexts and verify they don't share state.
|
||||
"""
|
||||
logger.info("Testing context isolation with browser_context_id", tag="TEST")
|
||||
|
||||
browser_config_initial = BrowserConfig(
|
||||
use_managed_browser=True,
|
||||
headless=True,
|
||||
debugging_port=9227,
|
||||
verbose=True
|
||||
)
|
||||
|
||||
managed_browser = ManagedBrowser(browser_config=browser_config_initial, logger=logger)
|
||||
cdp_creator = None
|
||||
manager1 = None
|
||||
manager2 = None
|
||||
context_info_1 = None
|
||||
context_info_2 = None
|
||||
|
||||
try:
|
||||
# Start the browser
|
||||
cdp_url = await managed_browser.start()
|
||||
logger.info(f"Browser started at {cdp_url}", tag="TEST")
|
||||
|
||||
# Create two separate contexts
|
||||
cdp_creator = CDPContextCreator(cdp_url)
|
||||
context_info_1 = await cdp_creator.create_context()
|
||||
logger.info(f"Context 1: {context_info_1['browser_context_id']}", tag="TEST")
|
||||
|
||||
# Need to reconnect for second context (or use same connection)
|
||||
await cdp_creator.disconnect()
|
||||
cdp_creator2 = CDPContextCreator(cdp_url)
|
||||
context_info_2 = await cdp_creator2.create_context()
|
||||
logger.info(f"Context 2: {context_info_2['browser_context_id']}", tag="TEST")
|
||||
|
||||
# Verify contexts are different
|
||||
assert context_info_1['browser_context_id'] != context_info_2['browser_context_id'], \
|
||||
"Contexts should have different IDs"
|
||||
|
||||
# Connect with first context
|
||||
browser_config_1 = BrowserConfig(
|
||||
cdp_url=cdp_url,
|
||||
browser_context_id=context_info_1['browser_context_id'],
|
||||
target_id=context_info_1['target_id'],
|
||||
headless=True
|
||||
)
|
||||
|
||||
manager1 = BrowserManager(browser_config=browser_config_1, logger=logger)
|
||||
await manager1.start()
|
||||
|
||||
# Set a cookie in context 1
|
||||
page1, ctx1 = await manager1.get_page(CrawlerRunConfig())
|
||||
await page1.goto("https://example.com", wait_until="domcontentloaded")
|
||||
await ctx1.add_cookies([{
|
||||
"name": "test_isolation",
|
||||
"value": "context_1_value",
|
||||
"domain": "example.com",
|
||||
"path": "/"
|
||||
}])
|
||||
|
||||
cookies1 = await ctx1.cookies(["https://example.com"])
|
||||
cookie1_value = next((c["value"] for c in cookies1 if c["name"] == "test_isolation"), None)
|
||||
logger.info(f"Cookie in context 1: {cookie1_value}", tag="TEST")
|
||||
|
||||
# Connect with second context
|
||||
browser_config_2 = BrowserConfig(
|
||||
cdp_url=cdp_url,
|
||||
browser_context_id=context_info_2['browser_context_id'],
|
||||
target_id=context_info_2['target_id'],
|
||||
headless=True
|
||||
)
|
||||
|
||||
manager2 = BrowserManager(browser_config=browser_config_2, logger=logger)
|
||||
await manager2.start()
|
||||
|
||||
# Check cookies in context 2 - should not have the cookie from context 1
|
||||
page2, ctx2 = await manager2.get_page(CrawlerRunConfig())
|
||||
await page2.goto("https://example.com", wait_until="domcontentloaded")
|
||||
|
||||
cookies2 = await ctx2.cookies(["https://example.com"])
|
||||
cookie2_value = next((c["value"] for c in cookies2 if c["name"] == "test_isolation"), None)
|
||||
logger.info(f"Cookie in context 2: {cookie2_value}", tag="TEST")
|
||||
|
||||
# Verify isolation
|
||||
isolation_works = cookie1_value == "context_1_value" and cookie2_value is None
|
||||
|
||||
if isolation_works:
|
||||
logger.success("Context isolation test passed", tag="TEST")
|
||||
else:
|
||||
logger.error(f"Isolation failed - Cookie1: {cookie1_value}, Cookie2: {cookie2_value}", tag="TEST")
|
||||
|
||||
return isolation_works
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Test failed: {str(e)}", tag="TEST")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
finally:
|
||||
# Cleanup
|
||||
for mgr in [manager1, manager2]:
|
||||
if mgr:
|
||||
try:
|
||||
await mgr.close()
|
||||
except:
|
||||
pass
|
||||
|
||||
for ctx_info, creator in [(context_info_1, cdp_creator), (context_info_2, cdp_creator2 if 'cdp_creator2' in dir() else None)]:
|
||||
if ctx_info and creator:
|
||||
try:
|
||||
await creator.dispose_context(ctx_info['browser_context_id'])
|
||||
await creator.disconnect()
|
||||
except:
|
||||
pass
|
||||
|
||||
if managed_browser:
|
||||
try:
|
||||
await managed_browser.cleanup()
|
||||
except:
|
||||
pass
|
||||
|
||||
|
||||
async def run_tests():
|
||||
"""Run all browser_context_id tests."""
|
||||
results = []
|
||||
|
||||
logger.info("Running browser_context_id tests", tag="SUITE")
|
||||
|
||||
# Basic parameter test
|
||||
results.append(("browser_context_id_basic", await test_browser_context_id_basic()))
|
||||
|
||||
# Pre-created context usage test
|
||||
results.append(("pre_created_context_usage", await test_pre_created_context_usage()))
|
||||
|
||||
# Note: Context isolation test is commented out because isolation is enforced
|
||||
# at the CDP level by the cloud browser service, not at the Playwright level.
|
||||
# When multiple BrowserManagers connect to the same browser, Playwright sees
|
||||
# all contexts. In production, each worker gets exactly one pre-created context.
|
||||
# results.append(("context_isolation", await test_context_isolation()))
|
||||
|
||||
# Print summary
|
||||
total = len(results)
|
||||
passed = sum(1 for _, r in results if r)
|
||||
|
||||
logger.info("=" * 50, tag="SUMMARY")
|
||||
logger.info(f"Test Results: {passed}/{total} passed", tag="SUMMARY")
|
||||
logger.info("=" * 50, tag="SUMMARY")
|
||||
|
||||
for name, result in results:
|
||||
status = "PASSED" if result else "FAILED"
|
||||
logger.info(f" {name}: {status}", tag="SUMMARY")
|
||||
|
||||
if passed == total:
|
||||
logger.success("All tests passed!", tag="SUMMARY")
|
||||
return True
|
||||
else:
|
||||
logger.error(f"{total - passed} tests failed", tag="SUMMARY")
|
||||
return False
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
success = asyncio.run(run_tests())
|
||||
sys.exit(0 if success else 1)
|
||||
281
tests/browser/test_cdp_cleanup_reuse.py
Normal file
281
tests/browser/test_cdp_cleanup_reuse.py
Normal file
@@ -0,0 +1,281 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Tests for CDP connection cleanup and browser reuse.
|
||||
|
||||
These tests verify that:
|
||||
1. WebSocket URLs are properly handled (skip HTTP verification)
|
||||
2. cdp_cleanup_on_close properly disconnects without terminating the browser
|
||||
3. The same browser can be reused by multiple sequential connections
|
||||
|
||||
Requirements:
|
||||
- A CDP-compatible browser pool service running (e.g., chromepoold)
|
||||
- Service should be accessible at CDP_SERVICE_URL (default: http://localhost:11235)
|
||||
|
||||
Usage:
|
||||
pytest tests/browser/test_cdp_cleanup_reuse.py -v
|
||||
|
||||
Or run directly:
|
||||
python tests/browser/test_cdp_cleanup_reuse.py
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
import pytest
|
||||
import requests
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
|
||||
# Configuration
|
||||
CDP_SERVICE_URL = os.getenv("CDP_SERVICE_URL", "http://localhost:11235")
|
||||
|
||||
|
||||
def is_cdp_service_available():
|
||||
"""Check if CDP service is running."""
|
||||
try:
|
||||
resp = requests.get(f"{CDP_SERVICE_URL}/health", timeout=2)
|
||||
return resp.status_code == 200
|
||||
except:
|
||||
return False
|
||||
|
||||
|
||||
def create_browser():
|
||||
"""Create a browser via CDP service API."""
|
||||
resp = requests.post(
|
||||
f"{CDP_SERVICE_URL}/v1/browsers",
|
||||
json={"headless": True},
|
||||
timeout=10
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()
|
||||
|
||||
|
||||
def get_browser_info(browser_id):
|
||||
"""Get browser info from CDP service."""
|
||||
resp = requests.get(f"{CDP_SERVICE_URL}/v1/browsers", timeout=5)
|
||||
for browser in resp.json():
|
||||
if browser["id"] == browser_id:
|
||||
return browser
|
||||
return None
|
||||
|
||||
|
||||
def delete_browser(browser_id):
|
||||
"""Delete a browser via CDP service API."""
|
||||
try:
|
||||
requests.delete(f"{CDP_SERVICE_URL}/v1/browsers/{browser_id}", timeout=5)
|
||||
except:
|
||||
pass
|
||||
|
||||
|
||||
# Skip all tests if CDP service is not available
|
||||
pytestmark = pytest.mark.skipif(
|
||||
not is_cdp_service_available(),
|
||||
reason=f"CDP service not available at {CDP_SERVICE_URL}"
|
||||
)
|
||||
|
||||
|
||||
class TestCDPWebSocketURL:
|
||||
"""Tests for WebSocket URL handling."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_websocket_url_skips_http_verification(self):
|
||||
"""WebSocket URLs should skip HTTP /json/version verification."""
|
||||
browser = create_browser()
|
||||
try:
|
||||
ws_url = browser["ws_url"]
|
||||
assert ws_url.startswith("ws://") or ws_url.startswith("wss://")
|
||||
|
||||
async with AsyncWebCrawler(
|
||||
config=BrowserConfig(
|
||||
browser_mode="cdp",
|
||||
cdp_url=ws_url,
|
||||
headless=True,
|
||||
cdp_cleanup_on_close=True,
|
||||
)
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=CrawlerRunConfig(verbose=False),
|
||||
)
|
||||
assert result.success
|
||||
assert "Example Domain" in result.metadata.get("title", "")
|
||||
finally:
|
||||
delete_browser(browser["browser_id"])
|
||||
|
||||
|
||||
class TestCDPCleanupOnClose:
|
||||
"""Tests for cdp_cleanup_on_close behavior."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_browser_survives_after_cleanup_close(self):
|
||||
"""Browser should remain alive after close with cdp_cleanup_on_close=True."""
|
||||
browser = create_browser()
|
||||
browser_id = browser["browser_id"]
|
||||
ws_url = browser["ws_url"]
|
||||
|
||||
try:
|
||||
# Verify browser exists
|
||||
info_before = get_browser_info(browser_id)
|
||||
assert info_before is not None
|
||||
pid_before = info_before["pid"]
|
||||
|
||||
# Connect, crawl, and close with cleanup
|
||||
async with AsyncWebCrawler(
|
||||
config=BrowserConfig(
|
||||
browser_mode="cdp",
|
||||
cdp_url=ws_url,
|
||||
headless=True,
|
||||
cdp_cleanup_on_close=True,
|
||||
)
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=CrawlerRunConfig(verbose=False),
|
||||
)
|
||||
assert result.success
|
||||
|
||||
# Browser should still exist with same PID
|
||||
info_after = get_browser_info(browser_id)
|
||||
assert info_after is not None, "Browser was terminated but should only disconnect"
|
||||
assert info_after["pid"] == pid_before, "Browser PID changed unexpectedly"
|
||||
finally:
|
||||
delete_browser(browser_id)
|
||||
|
||||
|
||||
class TestCDPBrowserReuse:
|
||||
"""Tests for reusing the same browser with multiple connections."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_sequential_connections_same_browser(self):
|
||||
"""Multiple sequential connections to the same browser should work."""
|
||||
browser = create_browser()
|
||||
browser_id = browser["browser_id"]
|
||||
ws_url = browser["ws_url"]
|
||||
|
||||
try:
|
||||
urls = [
|
||||
"https://example.com",
|
||||
"https://httpbin.org/ip",
|
||||
"https://httpbin.org/headers",
|
||||
]
|
||||
|
||||
for i, url in enumerate(urls, 1):
|
||||
# Each connection uses cdp_cleanup_on_close=True
|
||||
async with AsyncWebCrawler(
|
||||
config=BrowserConfig(
|
||||
browser_mode="cdp",
|
||||
cdp_url=ws_url,
|
||||
headless=True,
|
||||
cdp_cleanup_on_close=True,
|
||||
)
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
config=CrawlerRunConfig(verbose=False),
|
||||
)
|
||||
assert result.success, f"Connection {i} failed for {url}"
|
||||
|
||||
# Verify browser is still healthy
|
||||
info = get_browser_info(browser_id)
|
||||
assert info is not None, f"Browser died after connection {i}"
|
||||
|
||||
finally:
|
||||
delete_browser(browser_id)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_no_user_wait_needed_between_connections(self):
|
||||
"""With cdp_cleanup_on_close=True, no user wait should be needed."""
|
||||
browser = create_browser()
|
||||
browser_id = browser["browser_id"]
|
||||
ws_url = browser["ws_url"]
|
||||
|
||||
try:
|
||||
# Rapid-fire connections with NO sleep between them
|
||||
for i in range(3):
|
||||
async with AsyncWebCrawler(
|
||||
config=BrowserConfig(
|
||||
browser_mode="cdp",
|
||||
cdp_url=ws_url,
|
||||
headless=True,
|
||||
cdp_cleanup_on_close=True,
|
||||
)
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=CrawlerRunConfig(verbose=False),
|
||||
)
|
||||
assert result.success, f"Rapid connection {i+1} failed"
|
||||
# NO asyncio.sleep() here - internal delay should be sufficient
|
||||
finally:
|
||||
delete_browser(browser_id)
|
||||
|
||||
|
||||
class TestCDPBackwardCompatibility:
|
||||
"""Tests for backward compatibility with existing CDP usage."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_http_url_with_browser_id_works(self):
|
||||
"""HTTP URL with browser_id query param should work (backward compatibility)."""
|
||||
browser = create_browser()
|
||||
browser_id = browser["browser_id"]
|
||||
try:
|
||||
# Use HTTP URL with browser_id query parameter
|
||||
http_url = f"{CDP_SERVICE_URL}?browser_id={browser_id}"
|
||||
|
||||
async with AsyncWebCrawler(
|
||||
config=BrowserConfig(
|
||||
browser_mode="cdp",
|
||||
cdp_url=http_url,
|
||||
headless=True,
|
||||
cdp_cleanup_on_close=True,
|
||||
)
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=CrawlerRunConfig(verbose=False),
|
||||
)
|
||||
assert result.success
|
||||
finally:
|
||||
delete_browser(browser_id)
|
||||
|
||||
|
||||
# Allow running directly
|
||||
if __name__ == "__main__":
|
||||
if not is_cdp_service_available():
|
||||
print(f"CDP service not available at {CDP_SERVICE_URL}")
|
||||
print("Please start a CDP-compatible browser pool service first.")
|
||||
exit(1)
|
||||
|
||||
async def run_tests():
|
||||
print("=" * 60)
|
||||
print("CDP Cleanup and Browser Reuse Tests")
|
||||
print("=" * 60)
|
||||
|
||||
tests = [
|
||||
("WebSocket URL handling", TestCDPWebSocketURL().test_websocket_url_skips_http_verification),
|
||||
("Browser survives after cleanup", TestCDPCleanupOnClose().test_browser_survives_after_cleanup_close),
|
||||
("Sequential connections", TestCDPBrowserReuse().test_sequential_connections_same_browser),
|
||||
("No user wait needed", TestCDPBrowserReuse().test_no_user_wait_needed_between_connections),
|
||||
("HTTP URL with browser_id", TestCDPBackwardCompatibility().test_http_url_with_browser_id_works),
|
||||
]
|
||||
|
||||
results = []
|
||||
for name, test_func in tests:
|
||||
print(f"\n--- {name} ---")
|
||||
try:
|
||||
await test_func()
|
||||
print(f"PASS")
|
||||
results.append((name, True))
|
||||
except Exception as e:
|
||||
print(f"FAIL: {e}")
|
||||
results.append((name, False))
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
print("SUMMARY")
|
||||
print("=" * 60)
|
||||
for name, passed in results:
|
||||
print(f" {name}: {'PASS' if passed else 'FAIL'}")
|
||||
|
||||
all_passed = all(r[1] for r in results)
|
||||
print(f"\nOverall: {'ALL TESTS PASSED' if all_passed else 'SOME TESTS FAILED'}")
|
||||
return 0 if all_passed else 1
|
||||
|
||||
exit(asyncio.run(run_tests()))
|
||||
1
tests/cache_validation/__init__.py
Normal file
1
tests/cache_validation/__init__.py
Normal file
@@ -0,0 +1 @@
|
||||
# Cache validation test suite
|
||||
40
tests/cache_validation/conftest.py
Normal file
40
tests/cache_validation/conftest.py
Normal file
@@ -0,0 +1,40 @@
|
||||
"""Pytest fixtures for cache validation tests."""
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
def pytest_configure(config):
|
||||
"""Register custom markers."""
|
||||
config.addinivalue_line(
|
||||
"markers", "integration: marks tests as integration tests (may require network)"
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sample_head_html():
|
||||
"""Sample HTML head section for testing."""
|
||||
return '''
|
||||
<head>
|
||||
<meta charset="utf-8">
|
||||
<title>Test Page Title</title>
|
||||
<meta name="description" content="This is a test page description">
|
||||
<meta property="og:title" content="OG Test Title">
|
||||
<meta property="og:description" content="OG Description">
|
||||
<meta property="og:image" content="https://example.com/image.jpg">
|
||||
<meta property="article:modified_time" content="2024-12-01T00:00:00Z">
|
||||
<link rel="stylesheet" href="style.css">
|
||||
<script src="app.js"></script>
|
||||
</head>
|
||||
'''
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def minimal_head_html():
|
||||
"""Minimal head with just a title."""
|
||||
return '<head><title>Minimal</title></head>'
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def empty_head_html():
|
||||
"""Empty head section."""
|
||||
return '<head></head>'
|
||||
449
tests/cache_validation/test_end_to_end.py
Normal file
449
tests/cache_validation/test_end_to_end.py
Normal file
@@ -0,0 +1,449 @@
|
||||
"""
|
||||
End-to-end tests for Smart Cache validation.
|
||||
|
||||
Tests the full flow:
|
||||
1. Fresh crawl (browser launch) - SLOW
|
||||
2. Cached crawl without validation (check_cache_freshness=False) - FAST
|
||||
3. Cached crawl with validation (check_cache_freshness=True) - FAST (304/fingerprint)
|
||||
|
||||
Verifies all layers:
|
||||
- Database storage of etag, last_modified, head_fingerprint, cached_at
|
||||
- Cache validation logic
|
||||
- HTTP conditional requests (304 Not Modified)
|
||||
- Performance improvements
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import time
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
from crawl4ai.async_database import async_db_manager
|
||||
|
||||
|
||||
class TestEndToEndCacheValidation:
|
||||
"""End-to-end tests for the complete cache validation flow."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_full_cache_flow_docs_python(self):
|
||||
"""
|
||||
Test complete cache flow with docs.python.org:
|
||||
1. Fresh crawl (slow - browser) - using BYPASS to force fresh
|
||||
2. Cache hit without validation (fast)
|
||||
3. Cache hit with validation (fast - 304)
|
||||
"""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
|
||||
# ========== CRAWL 1: Fresh crawl (force with WRITE_ONLY to skip cache read) ==========
|
||||
config1 = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.WRITE_ONLY, # Skip reading, write new data
|
||||
check_cache_freshness=False,
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start1 = time.perf_counter()
|
||||
result1 = await crawler.arun(url, config=config1)
|
||||
time1 = time.perf_counter() - start1
|
||||
|
||||
assert result1.success, f"First crawl failed: {result1.error_message}"
|
||||
# WRITE_ONLY means we did a fresh crawl and wrote to cache
|
||||
assert result1.cache_status == "miss", f"Expected 'miss', got '{result1.cache_status}'"
|
||||
|
||||
print(f"\n[CRAWL 1] Fresh crawl: {time1:.2f}s (cache_status: {result1.cache_status})")
|
||||
|
||||
# Verify data is stored in database
|
||||
metadata = await async_db_manager.aget_cache_metadata(url)
|
||||
assert metadata is not None, "Metadata should be stored in database"
|
||||
assert metadata.get("etag") or metadata.get("last_modified"), "Should have ETag or Last-Modified"
|
||||
print(f" - Stored ETag: {metadata.get('etag', 'N/A')[:30]}...")
|
||||
print(f" - Stored Last-Modified: {metadata.get('last_modified', 'N/A')}")
|
||||
print(f" - Stored head_fingerprint: {metadata.get('head_fingerprint', 'N/A')}")
|
||||
print(f" - Stored cached_at: {metadata.get('cached_at', 'N/A')}")
|
||||
|
||||
# ========== CRAWL 2: Cache hit WITHOUT validation ==========
|
||||
config2 = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.ENABLED,
|
||||
check_cache_freshness=False, # Skip validation - pure cache hit
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start2 = time.perf_counter()
|
||||
result2 = await crawler.arun(url, config=config2)
|
||||
time2 = time.perf_counter() - start2
|
||||
|
||||
assert result2.success, f"Second crawl failed: {result2.error_message}"
|
||||
assert result2.cache_status == "hit", f"Expected 'hit', got '{result2.cache_status}'"
|
||||
|
||||
print(f"\n[CRAWL 2] Cache hit (no validation): {time2:.2f}s (cache_status: {result2.cache_status})")
|
||||
print(f" - Speedup: {time1/time2:.1f}x faster than fresh crawl")
|
||||
|
||||
# Should be MUCH faster - no browser, no HTTP request
|
||||
assert time2 < time1 / 2, f"Cache hit should be at least 2x faster (was {time1/time2:.1f}x)"
|
||||
|
||||
# ========== CRAWL 3: Cache hit WITH validation (304) ==========
|
||||
config3 = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.ENABLED,
|
||||
check_cache_freshness=True, # Validate cache freshness
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start3 = time.perf_counter()
|
||||
result3 = await crawler.arun(url, config=config3)
|
||||
time3 = time.perf_counter() - start3
|
||||
|
||||
assert result3.success, f"Third crawl failed: {result3.error_message}"
|
||||
# Should be "hit_validated" (304) or "hit_fallback" (error during validation)
|
||||
assert result3.cache_status in ["hit_validated", "hit_fallback"], \
|
||||
f"Expected validated cache hit, got '{result3.cache_status}'"
|
||||
|
||||
print(f"\n[CRAWL 3] Cache hit (with validation): {time3:.2f}s (cache_status: {result3.cache_status})")
|
||||
print(f" - Speedup: {time1/time3:.1f}x faster than fresh crawl")
|
||||
|
||||
# Should still be fast - just a HEAD request, no browser
|
||||
assert time3 < time1 / 2, f"Validated cache hit should be faster than fresh crawl"
|
||||
|
||||
# ========== SUMMARY ==========
|
||||
print(f"\n{'='*60}")
|
||||
print(f"PERFORMANCE SUMMARY for {url}")
|
||||
print(f"{'='*60}")
|
||||
print(f" Fresh crawl (browser): {time1:.2f}s")
|
||||
print(f" Cache hit (no validation): {time2:.2f}s ({time1/time2:.1f}x faster)")
|
||||
print(f" Cache hit (with validation): {time3:.2f}s ({time1/time3:.1f}x faster)")
|
||||
print(f"{'='*60}")
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_full_cache_flow_crawl4ai_docs(self):
|
||||
"""Test with docs.crawl4ai.com."""
|
||||
url = "https://docs.crawl4ai.com/"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
|
||||
# Fresh crawl - use WRITE_ONLY to ensure we get fresh data
|
||||
config1 = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start1 = time.perf_counter()
|
||||
result1 = await crawler.arun(url, config=config1)
|
||||
time1 = time.perf_counter() - start1
|
||||
|
||||
assert result1.success
|
||||
assert result1.cache_status == "miss"
|
||||
print(f"\n[docs.crawl4ai.com] Fresh: {time1:.2f}s")
|
||||
|
||||
# Cache hit with validation
|
||||
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start2 = time.perf_counter()
|
||||
result2 = await crawler.arun(url, config=config2)
|
||||
time2 = time.perf_counter() - start2
|
||||
|
||||
assert result2.success
|
||||
assert result2.cache_status in ["hit_validated", "hit_fallback"]
|
||||
print(f"[docs.crawl4ai.com] Validated: {time2:.2f}s ({time1/time2:.1f}x faster)")
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_verify_database_storage(self):
|
||||
"""Verify all validation metadata is properly stored in database."""
|
||||
url = "https://docs.python.org/3/library/asyncio.html"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url, config=config)
|
||||
|
||||
assert result.success
|
||||
|
||||
# Verify all fields in database
|
||||
metadata = await async_db_manager.aget_cache_metadata(url)
|
||||
|
||||
assert metadata is not None, "Metadata must be stored"
|
||||
assert "url" in metadata
|
||||
assert "etag" in metadata
|
||||
assert "last_modified" in metadata
|
||||
assert "head_fingerprint" in metadata
|
||||
assert "cached_at" in metadata
|
||||
assert "response_headers" in metadata
|
||||
|
||||
print(f"\nDatabase storage verification for {url}:")
|
||||
print(f" - etag: {metadata['etag'][:40] if metadata['etag'] else 'None'}...")
|
||||
print(f" - last_modified: {metadata['last_modified']}")
|
||||
print(f" - head_fingerprint: {metadata['head_fingerprint']}")
|
||||
print(f" - cached_at: {metadata['cached_at']}")
|
||||
print(f" - response_headers keys: {list(metadata['response_headers'].keys())[:5]}...")
|
||||
|
||||
# At least one validation field should be populated
|
||||
has_validation_data = (
|
||||
metadata["etag"] or
|
||||
metadata["last_modified"] or
|
||||
metadata["head_fingerprint"]
|
||||
)
|
||||
assert has_validation_data, "Should have at least one validation field"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_head_fingerprint_stored_and_used(self):
|
||||
"""Verify head fingerprint is computed, stored, and used for validation."""
|
||||
url = "https://example.com/"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
|
||||
# Fresh crawl
|
||||
config1 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result1 = await crawler.arun(url, config=config1)
|
||||
|
||||
assert result1.success
|
||||
assert result1.head_fingerprint, "head_fingerprint should be set on CrawlResult"
|
||||
|
||||
# Verify in database
|
||||
metadata = await async_db_manager.aget_cache_metadata(url)
|
||||
assert metadata["head_fingerprint"], "head_fingerprint should be stored in database"
|
||||
assert metadata["head_fingerprint"] == result1.head_fingerprint
|
||||
|
||||
print(f"\nHead fingerprint for {url}:")
|
||||
print(f" - CrawlResult.head_fingerprint: {result1.head_fingerprint}")
|
||||
print(f" - Database head_fingerprint: {metadata['head_fingerprint']}")
|
||||
|
||||
# Validate using fingerprint
|
||||
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result2 = await crawler.arun(url, config=config2)
|
||||
|
||||
assert result2.success
|
||||
assert result2.cache_status in ["hit_validated", "hit_fallback"]
|
||||
print(f" - Validation result: {result2.cache_status}")
|
||||
|
||||
|
||||
class TestCacheValidationPerformance:
|
||||
"""Performance benchmarks for cache validation."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_multiple_urls_performance(self):
|
||||
"""Test cache performance across multiple URLs."""
|
||||
urls = [
|
||||
"https://docs.python.org/3/",
|
||||
"https://docs.python.org/3/library/asyncio.html",
|
||||
"https://en.wikipedia.org/wiki/Python_(programming_language)",
|
||||
]
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
fresh_times = []
|
||||
cached_times = []
|
||||
|
||||
print(f"\n{'='*70}")
|
||||
print("MULTI-URL PERFORMANCE TEST")
|
||||
print(f"{'='*70}")
|
||||
|
||||
# Fresh crawls - use WRITE_ONLY to force fresh crawl
|
||||
for url in urls:
|
||||
config = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start = time.perf_counter()
|
||||
result = await crawler.arun(url, config=config)
|
||||
elapsed = time.perf_counter() - start
|
||||
fresh_times.append(elapsed)
|
||||
print(f"Fresh: {url[:50]:50} {elapsed:.2f}s ({result.cache_status})")
|
||||
|
||||
# Cached crawls with validation
|
||||
for url in urls:
|
||||
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start = time.perf_counter()
|
||||
result = await crawler.arun(url, config=config)
|
||||
elapsed = time.perf_counter() - start
|
||||
cached_times.append(elapsed)
|
||||
print(f"Cached: {url[:50]:50} {elapsed:.2f}s ({result.cache_status})")
|
||||
|
||||
avg_fresh = sum(fresh_times) / len(fresh_times)
|
||||
avg_cached = sum(cached_times) / len(cached_times)
|
||||
total_fresh = sum(fresh_times)
|
||||
total_cached = sum(cached_times)
|
||||
|
||||
print(f"\n{'='*70}")
|
||||
print(f"RESULTS:")
|
||||
print(f" Total fresh crawl time: {total_fresh:.2f}s")
|
||||
print(f" Total cached time: {total_cached:.2f}s")
|
||||
print(f" Average speedup: {avg_fresh/avg_cached:.1f}x")
|
||||
print(f" Time saved: {total_fresh - total_cached:.2f}s")
|
||||
print(f"{'='*70}")
|
||||
|
||||
# Cached should be significantly faster
|
||||
assert avg_cached < avg_fresh / 2, "Cached crawls should be at least 2x faster"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_repeated_access_same_url(self):
|
||||
"""Test repeated access to the same URL shows consistent cache hits."""
|
||||
url = "https://docs.python.org/3/"
|
||||
num_accesses = 5
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
|
||||
print(f"\n{'='*60}")
|
||||
print(f"REPEATED ACCESS TEST: {url}")
|
||||
print(f"{'='*60}")
|
||||
|
||||
# First access - fresh crawl
|
||||
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start = time.perf_counter()
|
||||
result = await crawler.arun(url, config=config)
|
||||
fresh_time = time.perf_counter() - start
|
||||
print(f"Access 1 (fresh): {fresh_time:.2f}s - {result.cache_status}")
|
||||
|
||||
# Repeated accesses - should all be cache hits
|
||||
cached_times = []
|
||||
for i in range(2, num_accesses + 1):
|
||||
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start = time.perf_counter()
|
||||
result = await crawler.arun(url, config=config)
|
||||
elapsed = time.perf_counter() - start
|
||||
cached_times.append(elapsed)
|
||||
print(f"Access {i} (cached): {elapsed:.2f}s - {result.cache_status}")
|
||||
assert result.cache_status in ["hit", "hit_validated", "hit_fallback"]
|
||||
|
||||
avg_cached = sum(cached_times) / len(cached_times)
|
||||
print(f"\nAverage cached time: {avg_cached:.2f}s")
|
||||
print(f"Speedup over fresh: {fresh_time/avg_cached:.1f}x")
|
||||
|
||||
|
||||
class TestCacheValidationModes:
|
||||
"""Test different cache modes and their behavior."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_cache_bypass_always_fresh(self):
|
||||
"""CacheMode.BYPASS should always do fresh crawl."""
|
||||
# Use a unique URL path to avoid cache from other tests
|
||||
url = "https://example.com/test-bypass"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
|
||||
# First crawl with WRITE_ONLY to populate cache (always fresh)
|
||||
config1 = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result1 = await crawler.arun(url, config=config1)
|
||||
assert result1.cache_status == "miss"
|
||||
|
||||
# Second crawl with BYPASS - should NOT use cache
|
||||
config2 = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result2 = await crawler.arun(url, config=config2)
|
||||
|
||||
# BYPASS mode means no cache interaction
|
||||
assert result2.cache_status is None or result2.cache_status == "miss"
|
||||
print(f"\nCacheMode.BYPASS result: {result2.cache_status}")
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_validation_disabled_uses_cache_directly(self):
|
||||
"""With check_cache_freshness=False, should use cache without HTTP validation."""
|
||||
url = "https://docs.python.org/3/tutorial/"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
|
||||
# Fresh crawl - use WRITE_ONLY to force fresh
|
||||
config1 = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result1 = await crawler.arun(url, config=config1)
|
||||
assert result1.cache_status == "miss"
|
||||
|
||||
# Cached with validation DISABLED - should be "hit" (not "hit_validated")
|
||||
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start = time.perf_counter()
|
||||
result2 = await crawler.arun(url, config=config2)
|
||||
elapsed = time.perf_counter() - start
|
||||
|
||||
assert result2.cache_status == "hit", f"Expected 'hit', got '{result2.cache_status}'"
|
||||
print(f"\nValidation disabled: {elapsed:.3f}s (cache_status: {result2.cache_status})")
|
||||
|
||||
# Should be very fast - no HTTP request at all
|
||||
assert elapsed < 1.0, "Cache hit without validation should be < 1 second"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_validation_enabled_checks_freshness(self):
|
||||
"""With check_cache_freshness=True, should validate before using cache."""
|
||||
url = "https://docs.python.org/3/reference/"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
|
||||
# Fresh crawl
|
||||
config1 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result1 = await crawler.arun(url, config=config1)
|
||||
|
||||
# Cached with validation ENABLED - should be "hit_validated"
|
||||
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start = time.perf_counter()
|
||||
result2 = await crawler.arun(url, config=config2)
|
||||
elapsed = time.perf_counter() - start
|
||||
|
||||
assert result2.cache_status in ["hit_validated", "hit_fallback"]
|
||||
print(f"\nValidation enabled: {elapsed:.3f}s (cache_status: {result2.cache_status})")
|
||||
|
||||
|
||||
class TestCacheValidationResponseHeaders:
|
||||
"""Test that response headers are properly stored and retrieved."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_response_headers_stored(self):
|
||||
"""Verify response headers including ETag and Last-Modified are stored."""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url, config=config)
|
||||
|
||||
assert result.success
|
||||
assert result.response_headers is not None
|
||||
|
||||
# Check that cache-relevant headers are captured
|
||||
headers = result.response_headers
|
||||
print(f"\nResponse headers for {url}:")
|
||||
|
||||
# Look for ETag (case-insensitive)
|
||||
etag = headers.get("etag") or headers.get("ETag")
|
||||
print(f" - ETag: {etag}")
|
||||
|
||||
# Look for Last-Modified
|
||||
last_modified = headers.get("last-modified") or headers.get("Last-Modified")
|
||||
print(f" - Last-Modified: {last_modified}")
|
||||
|
||||
# Look for Cache-Control
|
||||
cache_control = headers.get("cache-control") or headers.get("Cache-Control")
|
||||
print(f" - Cache-Control: {cache_control}")
|
||||
|
||||
# At least one should be present for docs.python.org
|
||||
assert etag or last_modified, "Should have ETag or Last-Modified header"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_headers_used_for_validation(self):
|
||||
"""Verify stored headers are used for conditional requests."""
|
||||
url = "https://docs.crawl4ai.com/"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
|
||||
# Fresh crawl to store headers
|
||||
config1 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result1 = await crawler.arun(url, config=config1)
|
||||
|
||||
# Get stored metadata
|
||||
metadata = await async_db_manager.aget_cache_metadata(url)
|
||||
stored_etag = metadata.get("etag")
|
||||
stored_last_modified = metadata.get("last_modified")
|
||||
|
||||
print(f"\nStored validation data for {url}:")
|
||||
print(f" - etag: {stored_etag}")
|
||||
print(f" - last_modified: {stored_last_modified}")
|
||||
|
||||
# Validate - should use stored headers
|
||||
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result2 = await crawler.arun(url, config=config2)
|
||||
|
||||
# Should get validated hit (304 response)
|
||||
assert result2.cache_status in ["hit_validated", "hit_fallback"]
|
||||
print(f" - Validation result: {result2.cache_status}")
|
||||
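For readers skimming the suite above, the cache-mode behaviour under test reduces to a small usage pattern. The sketch below is illustrative only and not part of the test files; it assumes the same public surface the tests exercise (AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, check_cache_freshness, result.cache_status) and the same status strings the assertions check.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(headless=True, verbose=False)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # First run: populate the cache without any freshness check.
        warm = await crawler.arun(
            "https://docs.python.org/3/",
            config=CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False),
        )
        print(warm.cache_status)  # "miss" on a cold cache

        # Second run: check_cache_freshness=True revalidates the cached copy
        # (conditional request / head fingerprint) before reusing it.
        cached = await crawler.arun(
            "https://docs.python.org/3/",
            config=CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True),
        )
        print(cached.cache_status)  # "hit_validated" (or "hit_fallback")

asyncio.run(main())
```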
tests/cache_validation/test_head_fingerprint.py
@@ -0,0 +1,97 @@
"""Unit tests for head fingerprinting."""
|
||||
|
||||
import pytest
|
||||
from crawl4ai.utils import compute_head_fingerprint
|
||||
|
||||
|
||||
class TestHeadFingerprint:
|
||||
"""Tests for the compute_head_fingerprint function."""
|
||||
|
||||
def test_same_content_same_fingerprint(self):
|
||||
"""Identical <head> content produces same fingerprint."""
|
||||
head = "<head><title>Test Page</title></head>"
|
||||
fp1 = compute_head_fingerprint(head)
|
||||
fp2 = compute_head_fingerprint(head)
|
||||
assert fp1 == fp2
|
||||
assert fp1 != ""
|
||||
|
||||
def test_different_title_different_fingerprint(self):
|
||||
"""Different title produces different fingerprint."""
|
||||
head1 = "<head><title>Title A</title></head>"
|
||||
head2 = "<head><title>Title B</title></head>"
|
||||
assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
|
||||
|
||||
def test_empty_head_returns_empty_string(self):
|
||||
"""Empty or None head should return empty fingerprint."""
|
||||
assert compute_head_fingerprint("") == ""
|
||||
assert compute_head_fingerprint(None) == ""
|
||||
|
||||
def test_head_without_signals_returns_empty(self):
|
||||
"""Head without title or key meta tags returns empty."""
|
||||
head = "<head><link rel='stylesheet' href='style.css'></head>"
|
||||
assert compute_head_fingerprint(head) == ""
|
||||
|
||||
def test_extracts_title(self):
|
||||
"""Title is extracted and included in fingerprint."""
|
||||
head1 = "<head><title>My Title</title></head>"
|
||||
head2 = "<head><title>My Title</title><link href='x'></head>"
|
||||
# Same title should produce same fingerprint
|
||||
assert compute_head_fingerprint(head1) == compute_head_fingerprint(head2)
|
||||
|
||||
def test_extracts_meta_description(self):
|
||||
"""Meta description is extracted."""
|
||||
head1 = '<head><meta name="description" content="Test description"></head>'
|
||||
head2 = '<head><meta name="description" content="Different description"></head>'
|
||||
assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
|
||||
|
||||
def test_extracts_og_tags(self):
|
||||
"""Open Graph tags are extracted."""
|
||||
head1 = '<head><meta property="og:title" content="OG Title"></head>'
|
||||
head2 = '<head><meta property="og:title" content="Different OG Title"></head>'
|
||||
assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
|
||||
|
||||
def test_extracts_og_image(self):
|
||||
"""og:image is extracted and affects fingerprint."""
|
||||
head1 = '<head><meta property="og:image" content="https://example.com/img1.jpg"></head>'
|
||||
head2 = '<head><meta property="og:image" content="https://example.com/img2.jpg"></head>'
|
||||
assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
|
||||
|
||||
def test_extracts_article_modified_time(self):
|
||||
"""article:modified_time is extracted."""
|
||||
head1 = '<head><meta property="article:modified_time" content="2024-01-01T00:00:00Z"></head>'
|
||||
head2 = '<head><meta property="article:modified_time" content="2024-12-01T00:00:00Z"></head>'
|
||||
assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
|
||||
|
||||
def test_case_insensitive(self):
|
||||
"""Fingerprinting is case-insensitive for tags."""
|
||||
head1 = "<head><TITLE>Test</TITLE></head>"
|
||||
head2 = "<head><title>test</title></head>"
|
||||
# Both should extract title (case insensitive)
|
||||
fp1 = compute_head_fingerprint(head1)
|
||||
fp2 = compute_head_fingerprint(head2)
|
||||
assert fp1 != ""
|
||||
assert fp2 != ""
|
||||
|
||||
def test_handles_attribute_order(self):
|
||||
"""Handles different attribute orders in meta tags."""
|
||||
head1 = '<head><meta name="description" content="Test"></head>'
|
||||
head2 = '<head><meta content="Test" name="description"></head>'
|
||||
assert compute_head_fingerprint(head1) == compute_head_fingerprint(head2)
|
||||
|
||||
def test_real_world_head(self):
|
||||
"""Test with a realistic head section."""
|
||||
head = '''
|
||||
<head>
|
||||
<meta charset="utf-8">
|
||||
<title>Python Documentation</title>
|
||||
<meta name="description" content="Official Python documentation">
|
||||
<meta property="og:title" content="Python Docs">
|
||||
<meta property="og:description" content="Learn Python">
|
||||
<meta property="og:image" content="https://python.org/logo.png">
|
||||
<link rel="stylesheet" href="styles.css">
|
||||
</head>
|
||||
'''
|
||||
fp = compute_head_fingerprint(head)
|
||||
assert fp != ""
|
||||
# Should be deterministic
|
||||
assert fp == compute_head_fingerprint(head)
|
||||
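As a quick orientation for the tests above: compute_head_fingerprint condenses the change-relevant <head> signals (title, description, og:* tags, article:modified_time) into a single comparable value and returns an empty string when no signal is present. A minimal sketch of how a stored fingerprint might be compared against a fresh one follows; the stored_fp/current_fp variables are illustrative, not library API.

```python
from crawl4ai.utils import compute_head_fingerprint

# Fingerprint captured when the page was cached (illustrative value).
stored_fp = compute_head_fingerprint("<head><title>My Page</title></head>")

# Fingerprint of the <head> fetched during revalidation.
current_fp = compute_head_fingerprint("<head><title>My Page</title></head>")

if current_fp and current_fp == stored_fp:
    print("head unchanged -> cached copy can be reused")
else:
    print("head changed or no signal -> re-crawl the page")
```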
tests/cache_validation/test_real_domains.py
@@ -0,0 +1,354 @@
"""
|
||||
Real-world tests for cache validation using actual HTTP requests.
|
||||
No mocks - all tests hit real servers.
|
||||
"""
|
||||
|
||||
import pytest
|
||||
from crawl4ai.cache_validator import CacheValidator, CacheValidationResult
|
||||
from crawl4ai.utils import compute_head_fingerprint
|
||||
|
||||
|
||||
class TestRealDomainsConditionalSupport:
|
||||
"""Test domains that support HTTP conditional requests (ETag/Last-Modified)."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_docs_python_org_etag(self):
|
||||
"""docs.python.org supports ETag - should return 304."""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
# First fetch to get ETag
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
assert head_html is not None, "Should fetch head content"
|
||||
assert etag is not None, "docs.python.org should return ETag"
|
||||
|
||||
# Validate with the ETag we just got
|
||||
result = await validator.validate(url=url, stored_etag=etag)
|
||||
|
||||
assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
|
||||
assert "304" in result.reason
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_docs_crawl4ai_etag(self):
|
||||
"""docs.crawl4ai.com supports ETag - should return 304."""
|
||||
url = "https://docs.crawl4ai.com/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
assert etag is not None, "docs.crawl4ai.com should return ETag"
|
||||
|
||||
result = await validator.validate(url=url, stored_etag=etag)
|
||||
|
||||
assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_wikipedia_last_modified(self):
|
||||
"""Wikipedia supports Last-Modified - should return 304."""
|
||||
url = "https://en.wikipedia.org/wiki/Web_crawler"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
assert last_modified is not None, "Wikipedia should return Last-Modified"
|
||||
|
||||
result = await validator.validate(url=url, stored_last_modified=last_modified)
|
||||
|
||||
assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_github_pages(self):
|
||||
"""GitHub Pages supports conditional requests."""
|
||||
url = "https://pages.github.com/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
# GitHub Pages typically has at least one
|
||||
has_conditional = etag is not None or last_modified is not None
|
||||
assert has_conditional, "GitHub Pages should support conditional requests"
|
||||
|
||||
result = await validator.validate(
|
||||
url=url,
|
||||
stored_etag=etag,
|
||||
stored_last_modified=last_modified,
|
||||
)
|
||||
|
||||
assert result.status == CacheValidationResult.FRESH
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_httpbin_etag(self):
|
||||
"""httpbin.org/etag endpoint for testing ETag."""
|
||||
url = "https://httpbin.org/etag/test-etag-value"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
result = await validator.validate(url=url, stored_etag='"test-etag-value"')
|
||||
|
||||
# httpbin should return 304 for matching ETag
|
||||
assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
|
||||
|
||||
|
||||
class TestRealDomainsNoConditionalSupport:
|
||||
"""Test domains that may NOT support HTTP conditional requests."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_dynamic_site_fingerprint_fallback(self):
|
||||
"""Test fingerprint-based validation for sites without conditional support."""
|
||||
# Use a site that changes frequently but has stable head
|
||||
url = "https://example.com/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
# Get head and compute fingerprint
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
assert head_html is not None
|
||||
fingerprint = compute_head_fingerprint(head_html)
|
||||
|
||||
# Validate using fingerprint (not etag/last-modified)
|
||||
result = await validator.validate(
|
||||
url=url,
|
||||
stored_head_fingerprint=fingerprint,
|
||||
)
|
||||
|
||||
# Should be FRESH since fingerprint should match
|
||||
assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
|
||||
assert "fingerprint" in result.reason.lower()
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_news_site_changes_frequently(self):
|
||||
"""News sites change frequently - test that we can detect changes."""
|
||||
url = "https://www.bbc.com/news"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
# BBC News has ETag but it changes with content
|
||||
assert head_html is not None
|
||||
|
||||
# Using a fake old ETag should return STALE (200 with different content)
|
||||
result = await validator.validate(
|
||||
url=url,
|
||||
stored_etag='"fake-old-etag-12345"',
|
||||
)
|
||||
|
||||
# Should be STALE because the ETag doesn't match
|
||||
assert result.status == CacheValidationResult.STALE, f"Expected STALE, got {result.status}: {result.reason}"
|
||||
|
||||
|
||||
class TestRealDomainsEdgeCases:
|
||||
"""Edge cases with real domains."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nonexistent_domain(self):
|
||||
"""Non-existent domain should return ERROR."""
|
||||
url = "https://this-domain-definitely-does-not-exist-xyz123.com/"
|
||||
|
||||
async with CacheValidator(timeout=5.0) as validator:
|
||||
result = await validator.validate(url=url, stored_etag='"test"')
|
||||
|
||||
assert result.status == CacheValidationResult.ERROR
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_timeout_slow_server(self):
|
||||
"""Test timeout handling with a slow endpoint."""
|
||||
# httpbin delay endpoint
|
||||
url = "https://httpbin.org/delay/10"
|
||||
|
||||
async with CacheValidator(timeout=2.0) as validator: # 2 second timeout
|
||||
result = await validator.validate(url=url, stored_etag='"test"')
|
||||
|
||||
# Should timeout and return ERROR
|
||||
assert result.status == CacheValidationResult.ERROR
|
||||
assert "timeout" in result.reason.lower() or "timed out" in result.reason.lower()
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_redirect_handling(self):
|
||||
"""Test that redirects are followed."""
|
||||
# httpbin redirect
|
||||
url = "https://httpbin.org/redirect/1"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
# Should follow redirect and get content
|
||||
# The final page might not have useful head content, but shouldn't error
|
||||
# This tests that redirects are handled
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_https_only(self):
|
||||
"""Test HTTPS connection."""
|
||||
url = "https://www.google.com/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
assert head_html is not None
|
||||
assert "<title" in head_html.lower()
|
||||
|
||||
|
||||
class TestRealDomainsHeadFingerprint:
|
||||
"""Test head fingerprint extraction with real domains."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_python_docs_fingerprint(self):
|
||||
"""Python docs has title and meta tags."""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, _, _ = await validator._fetch_head(url)
|
||||
|
||||
assert head_html is not None
|
||||
fingerprint = compute_head_fingerprint(head_html)
|
||||
|
||||
assert fingerprint != "", "Should extract fingerprint from Python docs"
|
||||
|
||||
# Fingerprint should be consistent
|
||||
fingerprint2 = compute_head_fingerprint(head_html)
|
||||
assert fingerprint == fingerprint2
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_github_fingerprint(self):
|
||||
"""GitHub has og: tags."""
|
||||
url = "https://github.com/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, _, _ = await validator._fetch_head(url)
|
||||
|
||||
assert head_html is not None
|
||||
assert "og:" in head_html.lower() or "title" in head_html.lower()
|
||||
|
||||
fingerprint = compute_head_fingerprint(head_html)
|
||||
assert fingerprint != ""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_crawl4ai_docs_fingerprint(self):
|
||||
"""Crawl4AI docs should have title and description."""
|
||||
url = "https://docs.crawl4ai.com/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, _, _ = await validator._fetch_head(url)
|
||||
|
||||
assert head_html is not None
|
||||
fingerprint = compute_head_fingerprint(head_html)
|
||||
|
||||
assert fingerprint != "", "Should extract fingerprint from Crawl4AI docs"
|
||||
|
||||
|
||||
class TestRealDomainsFetchHead:
|
||||
"""Test _fetch_head functionality with real domains."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_fetch_stops_at_head_close(self):
|
||||
"""Verify we stop reading after </head>."""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, _, _ = await validator._fetch_head(url)
|
||||
|
||||
assert head_html is not None
|
||||
assert "</head>" in head_html.lower()
|
||||
# Should NOT contain body content
|
||||
assert "<body" not in head_html.lower() or head_html.lower().index("</head>") < head_html.lower().find("<body")
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_extracts_both_headers(self):
|
||||
"""Test extraction of both ETag and Last-Modified."""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
# Python docs should have both
|
||||
assert etag is not None, "Should have ETag"
|
||||
assert last_modified is not None, "Should have Last-Modified"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_handles_missing_head_tag(self):
|
||||
"""Handle pages that might not have proper head structure."""
|
||||
# API endpoint that returns JSON (no HTML head)
|
||||
url = "https://httpbin.org/json"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
# Should not crash, may return partial content or None
|
||||
# The important thing is it doesn't error
|
||||
|
||||
|
||||
class TestRealDomainsValidationCombinations:
|
||||
"""Test various combinations of validation data."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_etag_only(self):
|
||||
"""Validate with only ETag."""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
_, etag, _ = await validator._fetch_head(url)
|
||||
|
||||
result = await validator.validate(url=url, stored_etag=etag)
|
||||
assert result.status == CacheValidationResult.FRESH
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_last_modified_only(self):
|
||||
"""Validate with only Last-Modified."""
|
||||
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
_, _, last_modified = await validator._fetch_head(url)
|
||||
|
||||
if last_modified:
|
||||
result = await validator.validate(url=url, stored_last_modified=last_modified)
|
||||
assert result.status == CacheValidationResult.FRESH
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_fingerprint_only(self):
|
||||
"""Validate with only fingerprint."""
|
||||
url = "https://example.com/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, _, _ = await validator._fetch_head(url)
|
||||
fingerprint = compute_head_fingerprint(head_html)
|
||||
|
||||
if fingerprint:
|
||||
result = await validator.validate(url=url, stored_head_fingerprint=fingerprint)
|
||||
assert result.status == CacheValidationResult.FRESH
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_all_validation_data(self):
|
||||
"""Validate with all available data."""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
fingerprint = compute_head_fingerprint(head_html)
|
||||
|
||||
result = await validator.validate(
|
||||
url=url,
|
||||
stored_etag=etag,
|
||||
stored_last_modified=last_modified,
|
||||
stored_head_fingerprint=fingerprint,
|
||||
)
|
||||
|
||||
assert result.status == CacheValidationResult.FRESH
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_stale_etag_fresh_fingerprint(self):
|
||||
"""When ETag is stale but fingerprint matches, should be FRESH."""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, _, _ = await validator._fetch_head(url)
|
||||
fingerprint = compute_head_fingerprint(head_html)
|
||||
|
||||
# Use fake ETag but real fingerprint
|
||||
result = await validator.validate(
|
||||
url=url,
|
||||
stored_etag='"fake-stale-etag"',
|
||||
stored_head_fingerprint=fingerprint,
|
||||
)
|
||||
|
||||
# Fingerprint should save us
|
||||
assert result.status == CacheValidationResult.FRESH
|
||||
assert "fingerprint" in result.reason.lower()
|
||||
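The CacheValidator flow these tests exercise can be summarised in a few lines. This is a sketch under the same assumptions the tests make (validate() accepts stored_etag / stored_last_modified / stored_head_fingerprint and returns a result with .status and .reason); the wrapper function and its arguments are illustrative, not part of the library.

```python
import asyncio
from crawl4ai.cache_validator import CacheValidator, CacheValidationResult

async def cached_copy_still_fresh(url, stored_etag=None, stored_fingerprint=None):
    # stored_etag / stored_fingerprint are whatever was saved alongside the cached page.
    async with CacheValidator(timeout=10.0) as validator:
        result = await validator.validate(
            url=url,
            stored_etag=stored_etag,
            stored_head_fingerprint=stored_fingerprint,
        )
    print(result.status, "-", result.reason)
    # FRESH -> reuse the cache, STALE -> re-crawl, ERROR -> unreachable or timed out
    return result.status == CacheValidationResult.FRESH

asyncio.run(cached_copy_still_fresh("https://docs.python.org/3/", stored_etag='"some-etag"'))
```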
tests/deep_crawling/__init__.py
tests/deep_crawling/test_deep_crawl_resume.py
@@ -0,0 +1,773 @@
|
||||
"""
|
||||
Test Suite: Deep Crawl Resume/Crash Recovery Tests
|
||||
|
||||
Tests that verify:
|
||||
1. State export produces valid JSON-serializable data
|
||||
2. Resume from checkpoint continues without duplicates
|
||||
3. Simulated crash at various points recovers correctly
|
||||
4. State callback fires at expected intervals
|
||||
5. No damage to existing system behavior (regression tests)
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import asyncio
|
||||
import json
|
||||
from typing import Dict, Any, List
|
||||
from unittest.mock import AsyncMock, MagicMock
|
||||
|
||||
from crawl4ai.deep_crawling import (
|
||||
BFSDeepCrawlStrategy,
|
||||
DFSDeepCrawlStrategy,
|
||||
BestFirstCrawlingStrategy,
|
||||
FilterChain,
|
||||
URLPatternFilter,
|
||||
DomainFilter,
|
||||
)
|
||||
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Helper Functions for Mock Crawler
|
||||
# ============================================================================
|
||||
|
||||
def create_mock_config(stream=False):
|
||||
"""Create a mock CrawlerRunConfig."""
|
||||
config = MagicMock()
|
||||
config.clone = MagicMock(return_value=config)
|
||||
config.stream = stream
|
||||
return config
|
||||
|
||||
|
||||
def create_mock_crawler_with_links(num_links: int = 3, include_keyword: bool = False):
|
||||
"""Create mock crawler that returns results with links."""
|
||||
call_count = 0
|
||||
|
||||
async def mock_arun_many(urls, config):
|
||||
nonlocal call_count
|
||||
results = []
|
||||
for url in urls:
|
||||
call_count += 1
|
||||
result = MagicMock()
|
||||
result.url = url
|
||||
result.success = True
|
||||
result.metadata = {}
|
||||
|
||||
# Generate child links
|
||||
links = []
|
||||
for i in range(num_links):
|
||||
link_url = f"{url}/child{call_count}_{i}"
|
||||
if include_keyword:
|
||||
link_url = f"{url}/important-child{call_count}_{i}"
|
||||
links.append({"href": link_url})
|
||||
|
||||
result.links = {"internal": links, "external": []}
|
||||
results.append(result)
|
||||
|
||||
# For streaming mode, return async generator
|
||||
if config.stream:
|
||||
async def gen():
|
||||
for r in results:
|
||||
yield r
|
||||
return gen()
|
||||
return results
|
||||
|
||||
crawler = MagicMock()
|
||||
crawler.arun_many = mock_arun_many
|
||||
return crawler
|
||||
|
||||
|
||||
def create_mock_crawler_tracking(crawl_order: List[str], return_no_links: bool = False):
|
||||
"""Create mock crawler that tracks crawl order."""
|
||||
|
||||
async def mock_arun_many(urls, config):
|
||||
results = []
|
||||
for url in urls:
|
||||
crawl_order.append(url)
|
||||
result = MagicMock()
|
||||
result.url = url
|
||||
result.success = True
|
||||
result.metadata = {}
|
||||
result.links = {"internal": [], "external": []} if return_no_links else {"internal": [{"href": f"{url}/child"}], "external": []}
|
||||
results.append(result)
|
||||
|
||||
# For streaming mode, return async generator
|
||||
if config.stream:
|
||||
async def gen():
|
||||
for r in results:
|
||||
yield r
|
||||
return gen()
|
||||
return results
|
||||
|
||||
crawler = MagicMock()
|
||||
crawler.arun_many = mock_arun_many
|
||||
return crawler
|
||||
|
||||
|
||||
def create_simple_mock_crawler():
|
||||
"""Basic mock crawler returning 1 result with 2 child links."""
|
||||
call_count = 0
|
||||
|
||||
async def mock_arun_many(urls, config):
|
||||
nonlocal call_count
|
||||
results = []
|
||||
for url in urls:
|
||||
call_count += 1
|
||||
result = MagicMock()
|
||||
result.url = url
|
||||
result.success = True
|
||||
result.metadata = {}
|
||||
result.links = {
|
||||
"internal": [
|
||||
{"href": f"{url}/child1"},
|
||||
{"href": f"{url}/child2"},
|
||||
],
|
||||
"external": []
|
||||
}
|
||||
results.append(result)
|
||||
|
||||
if config.stream:
|
||||
async def gen():
|
||||
for r in results:
|
||||
yield r
|
||||
return gen()
|
||||
return results
|
||||
|
||||
crawler = MagicMock()
|
||||
crawler.arun_many = mock_arun_many
|
||||
return crawler
|
||||
|
||||
|
||||
def create_mock_crawler_unlimited_links():
|
||||
"""Mock crawler that always returns links (for testing limits)."""
|
||||
async def mock_arun_many(urls, config):
|
||||
results = []
|
||||
for url in urls:
|
||||
result = MagicMock()
|
||||
result.url = url
|
||||
result.success = True
|
||||
result.metadata = {}
|
||||
result.links = {
|
||||
"internal": [{"href": f"{url}/link{i}"} for i in range(10)],
|
||||
"external": []
|
||||
}
|
||||
results.append(result)
|
||||
|
||||
if config.stream:
|
||||
async def gen():
|
||||
for r in results:
|
||||
yield r
|
||||
return gen()
|
||||
return results
|
||||
|
||||
crawler = MagicMock()
|
||||
crawler.arun_many = mock_arun_many
|
||||
return crawler
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# TEST SUITE 1: Crash Recovery Tests
|
||||
# ============================================================================
|
||||
|
||||
class TestBFSResume:
|
||||
"""BFS strategy resume tests."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_state_export_json_serializable(self):
|
||||
"""Verify exported state can be JSON serialized."""
|
||||
captured_states: List[Dict] = []
|
||||
|
||||
async def capture_state(state: Dict[str, Any]):
|
||||
# Verify JSON serializable
|
||||
json_str = json.dumps(state)
|
||||
parsed = json.loads(json_str)
|
||||
captured_states.append(parsed)
|
||||
|
||||
strategy = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
max_pages=10,
|
||||
on_state_change=capture_state,
|
||||
)
|
||||
|
||||
# Create mock crawler that returns predictable results
|
||||
mock_crawler = create_mock_crawler_with_links(num_links=3)
|
||||
mock_config = create_mock_config()
|
||||
|
||||
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
# Verify states were captured
|
||||
assert len(captured_states) > 0
|
||||
|
||||
# Verify state structure
|
||||
for state in captured_states:
|
||||
assert state["strategy_type"] == "bfs"
|
||||
assert "visited" in state
|
||||
assert "pending" in state
|
||||
assert "depths" in state
|
||||
assert "pages_crawled" in state
|
||||
assert isinstance(state["visited"], list)
|
||||
assert isinstance(state["pending"], list)
|
||||
assert isinstance(state["depths"], dict)
|
||||
assert isinstance(state["pages_crawled"], int)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_resume_continues_from_checkpoint(self):
|
||||
"""Verify resume starts from saved state, not beginning."""
|
||||
# Simulate state from previous crawl (visited 5 URLs, 3 pending)
|
||||
saved_state = {
|
||||
"strategy_type": "bfs",
|
||||
"visited": [
|
||||
"https://example.com",
|
||||
"https://example.com/page1",
|
||||
"https://example.com/page2",
|
||||
"https://example.com/page3",
|
||||
"https://example.com/page4",
|
||||
],
|
||||
"pending": [
|
||||
{"url": "https://example.com/page5", "parent_url": "https://example.com/page2"},
|
||||
{"url": "https://example.com/page6", "parent_url": "https://example.com/page3"},
|
||||
{"url": "https://example.com/page7", "parent_url": "https://example.com/page3"},
|
||||
],
|
||||
"depths": {
|
||||
"https://example.com": 0,
|
||||
"https://example.com/page1": 1,
|
||||
"https://example.com/page2": 1,
|
||||
"https://example.com/page3": 1,
|
||||
"https://example.com/page4": 1,
|
||||
"https://example.com/page5": 2,
|
||||
"https://example.com/page6": 2,
|
||||
"https://example.com/page7": 2,
|
||||
},
|
||||
"pages_crawled": 5,
|
||||
}
|
||||
|
||||
crawled_urls: List[str] = []
|
||||
|
||||
strategy = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
max_pages=20,
|
||||
resume_state=saved_state,
|
||||
)
|
||||
|
||||
# Verify internal state was restored
|
||||
assert strategy._resume_state == saved_state
|
||||
|
||||
mock_crawler = create_mock_crawler_tracking(crawled_urls, return_no_links=True)
|
||||
mock_config = create_mock_config()
|
||||
|
||||
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
# Should NOT re-crawl already visited URLs
|
||||
for visited_url in saved_state["visited"]:
|
||||
assert visited_url not in crawled_urls, f"Re-crawled already visited: {visited_url}"
|
||||
|
||||
# Should crawl pending URLs
|
||||
for pending in saved_state["pending"]:
|
||||
assert pending["url"] in crawled_urls, f"Did not crawl pending: {pending['url']}"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_simulated_crash_mid_crawl(self):
|
||||
"""Simulate crash at URL N, verify resume continues from pending URLs."""
|
||||
crash_after = 3
|
||||
states_before_crash: List[Dict] = []
|
||||
|
||||
async def capture_until_crash(state: Dict[str, Any]):
|
||||
states_before_crash.append(state)
|
||||
if state["pages_crawled"] >= crash_after:
|
||||
raise Exception("Simulated crash!")
|
||||
|
||||
strategy1 = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
max_pages=10,
|
||||
on_state_change=capture_until_crash,
|
||||
)
|
||||
|
||||
mock_crawler = create_mock_crawler_with_links(num_links=5)
|
||||
mock_config = create_mock_config()
|
||||
|
||||
# First crawl - crashes
|
||||
with pytest.raises(Exception, match="Simulated crash"):
|
||||
await strategy1._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
# Get last state before crash
|
||||
last_state = states_before_crash[-1]
|
||||
assert last_state["pages_crawled"] >= crash_after
|
||||
|
||||
# Calculate which URLs were already crawled vs pending
|
||||
pending_urls = {item["url"] for item in last_state["pending"]}
|
||||
visited_urls = set(last_state["visited"])
|
||||
already_crawled_urls = visited_urls - pending_urls
|
||||
|
||||
# Resume from checkpoint
|
||||
crawled_in_resume: List[str] = []
|
||||
|
||||
strategy2 = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
max_pages=10,
|
||||
resume_state=last_state,
|
||||
)
|
||||
|
||||
mock_crawler2 = create_mock_crawler_tracking(crawled_in_resume, return_no_links=True)
|
||||
|
||||
await strategy2._arun_batch("https://example.com", mock_crawler2, mock_config)
|
||||
|
||||
# Verify already-crawled URLs are not re-crawled
|
||||
for crawled_url in already_crawled_urls:
|
||||
assert crawled_url not in crawled_in_resume, f"Re-crawled already visited: {crawled_url}"
|
||||
|
||||
# Verify pending URLs are crawled
|
||||
for pending_url in pending_urls:
|
||||
assert pending_url in crawled_in_resume, f"Did not crawl pending: {pending_url}"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_callback_fires_per_url(self):
|
||||
"""Verify callback fires after each URL for maximum granularity."""
|
||||
callback_count = 0
|
||||
pages_crawled_sequence: List[int] = []
|
||||
|
||||
async def count_callbacks(state: Dict[str, Any]):
|
||||
nonlocal callback_count
|
||||
callback_count += 1
|
||||
pages_crawled_sequence.append(state["pages_crawled"])
|
||||
|
||||
strategy = BFSDeepCrawlStrategy(
|
||||
max_depth=1,
|
||||
max_pages=5,
|
||||
on_state_change=count_callbacks,
|
||||
)
|
||||
|
||||
mock_crawler = create_mock_crawler_with_links(num_links=2)
|
||||
mock_config = create_mock_config()
|
||||
|
||||
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
# Callback should fire once per successful URL
|
||||
assert callback_count == strategy._pages_crawled, \
|
||||
f"Callback fired {callback_count} times, expected {strategy._pages_crawled} (per URL)"
|
||||
|
||||
# pages_crawled should increment by 1 each callback
|
||||
for i, count in enumerate(pages_crawled_sequence):
|
||||
assert count == i + 1, f"Expected pages_crawled={i+1} at callback {i}, got {count}"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_export_state_returns_last_captured(self):
|
||||
"""Verify export_state() returns last captured state."""
|
||||
last_state = None
|
||||
|
||||
async def capture(state):
|
||||
nonlocal last_state
|
||||
last_state = state
|
||||
|
||||
strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=5, on_state_change=capture)
|
||||
|
||||
mock_crawler = create_mock_crawler_with_links(num_links=2)
|
||||
mock_config = create_mock_config()
|
||||
|
||||
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
exported = strategy.export_state()
|
||||
assert exported == last_state
|
||||
|
||||
|
||||
class TestDFSResume:
|
||||
"""DFS strategy resume tests."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_state_export_includes_stack_and_dfs_seen(self):
|
||||
"""Verify DFS state includes stack structure and _dfs_seen."""
|
||||
captured_states: List[Dict] = []
|
||||
|
||||
async def capture_state(state: Dict[str, Any]):
|
||||
captured_states.append(state)
|
||||
|
||||
strategy = DFSDeepCrawlStrategy(
|
||||
max_depth=3,
|
||||
max_pages=10,
|
||||
on_state_change=capture_state,
|
||||
)
|
||||
|
||||
mock_crawler = create_mock_crawler_with_links(num_links=2)
|
||||
mock_config = create_mock_config()
|
||||
|
||||
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
assert len(captured_states) > 0
|
||||
|
||||
for state in captured_states:
|
||||
assert state["strategy_type"] == "dfs"
|
||||
assert "stack" in state
|
||||
assert "dfs_seen" in state
|
||||
# Stack items should have depth
|
||||
for item in state["stack"]:
|
||||
assert "url" in item
|
||||
assert "parent_url" in item
|
||||
assert "depth" in item
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_resume_restores_stack_order(self):
|
||||
"""Verify DFS stack order is preserved on resume."""
|
||||
saved_state = {
|
||||
"strategy_type": "dfs",
|
||||
"visited": ["https://example.com"],
|
||||
"stack": [
|
||||
{"url": "https://example.com/deep3", "parent_url": "https://example.com/deep2", "depth": 3},
|
||||
{"url": "https://example.com/deep2", "parent_url": "https://example.com/deep1", "depth": 2},
|
||||
{"url": "https://example.com/page1", "parent_url": "https://example.com", "depth": 1},
|
||||
],
|
||||
"depths": {"https://example.com": 0},
|
||||
"pages_crawled": 1,
|
||||
"dfs_seen": ["https://example.com", "https://example.com/deep3", "https://example.com/deep2", "https://example.com/page1"],
|
||||
}
|
||||
|
||||
crawl_order: List[str] = []
|
||||
|
||||
strategy = DFSDeepCrawlStrategy(
|
||||
max_depth=3,
|
||||
max_pages=10,
|
||||
resume_state=saved_state,
|
||||
)
|
||||
|
||||
mock_crawler = create_mock_crawler_tracking(crawl_order, return_no_links=True)
|
||||
mock_config = create_mock_config()
|
||||
|
||||
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
# DFS pops from end of stack, so order should be: page1, deep2, deep3
|
||||
assert crawl_order[0] == "https://example.com/page1"
|
||||
assert crawl_order[1] == "https://example.com/deep2"
|
||||
assert crawl_order[2] == "https://example.com/deep3"
|
||||
|
||||
|
||||
class TestBestFirstResume:
|
||||
"""Best-First strategy resume tests."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_state_export_includes_scored_queue(self):
|
||||
"""Verify Best-First state includes queue with scores."""
|
||||
captured_states: List[Dict] = []
|
||||
|
||||
async def capture_state(state: Dict[str, Any]):
|
||||
captured_states.append(state)
|
||||
|
||||
scorer = KeywordRelevanceScorer(keywords=["important"], weight=1.0)
|
||||
|
||||
strategy = BestFirstCrawlingStrategy(
|
||||
max_depth=2,
|
||||
max_pages=10,
|
||||
url_scorer=scorer,
|
||||
on_state_change=capture_state,
|
||||
)
|
||||
|
||||
mock_crawler = create_mock_crawler_with_links(num_links=3, include_keyword=True)
|
||||
mock_config = create_mock_config(stream=True)
|
||||
|
||||
async for _ in strategy._arun_stream("https://example.com", mock_crawler, mock_config):
|
||||
pass
|
||||
|
||||
assert len(captured_states) > 0
|
||||
|
||||
for state in captured_states:
|
||||
assert state["strategy_type"] == "best_first"
|
||||
assert "queue_items" in state
|
||||
for item in state["queue_items"]:
|
||||
assert "score" in item
|
||||
assert "depth" in item
|
||||
assert "url" in item
|
||||
assert "parent_url" in item
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_resume_maintains_priority_order(self):
|
||||
"""Verify priority queue order is maintained on resume."""
|
||||
saved_state = {
|
||||
"strategy_type": "best_first",
|
||||
"visited": ["https://example.com"],
|
||||
"queue_items": [
|
||||
{"score": -0.9, "depth": 1, "url": "https://example.com/high-priority", "parent_url": "https://example.com"},
|
||||
{"score": -0.5, "depth": 1, "url": "https://example.com/medium-priority", "parent_url": "https://example.com"},
|
||||
{"score": -0.1, "depth": 1, "url": "https://example.com/low-priority", "parent_url": "https://example.com"},
|
||||
],
|
||||
"depths": {"https://example.com": 0},
|
||||
"pages_crawled": 1,
|
||||
}
|
||||
|
||||
crawl_order: List[str] = []
|
||||
|
||||
strategy = BestFirstCrawlingStrategy(
|
||||
max_depth=2,
|
||||
max_pages=10,
|
||||
resume_state=saved_state,
|
||||
)
|
||||
|
||||
mock_crawler = create_mock_crawler_tracking(crawl_order, return_no_links=True)
|
||||
mock_config = create_mock_config(stream=True)
|
||||
|
||||
async for _ in strategy._arun_stream("https://example.com", mock_crawler, mock_config):
|
||||
pass
|
||||
|
||||
# Higher negative score = higher priority (min-heap)
|
||||
# So -0.9 should be crawled first
|
||||
assert crawl_order[0] == "https://example.com/high-priority"
|
||||
|
||||
|
||||
class TestCrossStrategyResume:
|
||||
"""Tests that apply to all strategies."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@pytest.mark.parametrize("strategy_class,strategy_type", [
|
||||
(BFSDeepCrawlStrategy, "bfs"),
|
||||
(DFSDeepCrawlStrategy, "dfs"),
|
||||
(BestFirstCrawlingStrategy, "best_first"),
|
||||
])
|
||||
async def test_no_callback_means_no_overhead(self, strategy_class, strategy_type):
|
||||
"""Verify no state tracking when callback is None."""
|
||||
strategy = strategy_class(max_depth=2, max_pages=5)
|
||||
|
||||
# _queue_shadow should be None for Best-First when no callback
|
||||
if strategy_class == BestFirstCrawlingStrategy:
|
||||
assert strategy._queue_shadow is None
|
||||
|
||||
# _last_state should be None initially
|
||||
assert strategy._last_state is None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@pytest.mark.parametrize("strategy_class", [
|
||||
BFSDeepCrawlStrategy,
|
||||
DFSDeepCrawlStrategy,
|
||||
BestFirstCrawlingStrategy,
|
||||
])
|
||||
async def test_export_state_returns_last_captured(self, strategy_class):
|
||||
"""Verify export_state() returns last captured state."""
|
||||
last_state = None
|
||||
|
||||
async def capture(state):
|
||||
nonlocal last_state
|
||||
last_state = state
|
||||
|
||||
strategy = strategy_class(max_depth=2, max_pages=5, on_state_change=capture)
|
||||
|
||||
mock_crawler = create_mock_crawler_with_links(num_links=2)
|
||||
|
||||
if strategy_class == BestFirstCrawlingStrategy:
|
||||
mock_config = create_mock_config(stream=True)
|
||||
async for _ in strategy._arun_stream("https://example.com", mock_crawler, mock_config):
|
||||
pass
|
||||
else:
|
||||
mock_config = create_mock_config()
|
||||
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
exported = strategy.export_state()
|
||||
assert exported == last_state
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# TEST SUITE 2: Regression Tests (No Damage to Current System)
|
||||
# ============================================================================
|
||||
|
||||
class TestBFSRegressions:
|
||||
"""Ensure BFS works identically when new params not used."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_default_params_unchanged(self):
|
||||
"""Constructor with only original params works."""
|
||||
strategy = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
include_external=False,
|
||||
max_pages=10,
|
||||
)
|
||||
|
||||
assert strategy.max_depth == 2
|
||||
assert strategy.include_external == False
|
||||
assert strategy.max_pages == 10
|
||||
assert strategy._resume_state is None
|
||||
assert strategy._on_state_change is None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_filter_chain_still_works(self):
|
||||
"""FilterChain integration unchanged."""
|
||||
filter_chain = FilterChain([
|
||||
URLPatternFilter(patterns=["*/blog/*"]),
|
||||
DomainFilter(allowed_domains=["example.com"]),
|
||||
])
|
||||
|
||||
strategy = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
filter_chain=filter_chain,
|
||||
)
|
||||
|
||||
# Test filter still applies
|
||||
assert await strategy.can_process_url("https://example.com/blog/post1", 1) == True
|
||||
assert await strategy.can_process_url("https://other.com/blog/post1", 1) == False
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_url_scorer_still_works(self):
|
||||
"""URL scoring integration unchanged."""
|
||||
scorer = KeywordRelevanceScorer(keywords=["python", "tutorial"], weight=1.0)
|
||||
|
||||
strategy = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
url_scorer=scorer,
|
||||
score_threshold=0.5,
|
||||
)
|
||||
|
||||
assert strategy.url_scorer is not None
|
||||
assert strategy.score_threshold == 0.5
|
||||
|
||||
# Scorer should work
|
||||
score = scorer.score("https://example.com/python-tutorial")
|
||||
assert score > 0
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_batch_mode_returns_list(self):
|
||||
"""Batch mode still returns List[CrawlResult]."""
|
||||
strategy = BFSDeepCrawlStrategy(max_depth=1, max_pages=5)
|
||||
|
||||
mock_crawler = create_simple_mock_crawler()
|
||||
mock_config = create_mock_config(stream=False)
|
||||
|
||||
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
assert isinstance(results, list)
|
||||
assert len(results) > 0
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_max_pages_limit_respected(self):
|
||||
"""max_pages limit still enforced."""
|
||||
strategy = BFSDeepCrawlStrategy(max_depth=10, max_pages=3)
|
||||
|
||||
mock_crawler = create_mock_crawler_unlimited_links()
|
||||
mock_config = create_mock_config()
|
||||
|
||||
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
# Should stop at max_pages
|
||||
assert strategy._pages_crawled <= 3
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_max_depth_limit_respected(self):
|
||||
"""max_depth limit still enforced."""
|
||||
strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=100)
|
||||
|
||||
mock_crawler = create_mock_crawler_unlimited_links()
|
||||
mock_config = create_mock_config()
|
||||
|
||||
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
# All results should have depth <= max_depth
|
||||
for result in results:
|
||||
assert result.metadata.get("depth", 0) <= 2
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_metadata_depth_still_set(self):
|
||||
"""Result metadata still includes depth."""
|
||||
strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=5)
|
||||
|
||||
mock_crawler = create_simple_mock_crawler()
|
||||
mock_config = create_mock_config()
|
||||
|
||||
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
for result in results:
|
||||
assert "depth" in result.metadata
|
||||
assert isinstance(result.metadata["depth"], int)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_metadata_parent_url_still_set(self):
|
||||
"""Result metadata still includes parent_url."""
|
||||
strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=5)
|
||||
|
||||
mock_crawler = create_simple_mock_crawler()
|
||||
mock_config = create_mock_config()
|
||||
|
||||
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
# First result (start URL) should have parent_url = None
|
||||
assert results[0].metadata.get("parent_url") is None
|
||||
|
||||
# Child results should have parent_url set
|
||||
for result in results[1:]:
|
||||
assert "parent_url" in result.metadata
|
||||
|
||||
|
||||
class TestDFSRegressions:
|
||||
"""Ensure DFS works identically when new params not used."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_inherits_bfs_params(self):
|
||||
"""DFS still inherits all BFS parameters."""
|
||||
strategy = DFSDeepCrawlStrategy(
|
||||
max_depth=3,
|
||||
include_external=True,
|
||||
max_pages=20,
|
||||
score_threshold=0.5,
|
||||
)
|
||||
|
||||
assert strategy.max_depth == 3
|
||||
assert strategy.include_external == True
|
||||
assert strategy.max_pages == 20
|
||||
assert strategy.score_threshold == 0.5
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_dfs_seen_initialized(self):
|
||||
"""DFS _dfs_seen set still initialized."""
|
||||
strategy = DFSDeepCrawlStrategy(max_depth=2)
|
||||
|
||||
assert hasattr(strategy, '_dfs_seen')
|
||||
assert isinstance(strategy._dfs_seen, set)
|
||||
|
||||
|
||||
class TestBestFirstRegressions:
|
||||
"""Ensure Best-First works identically when new params not used."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_default_params_unchanged(self):
|
||||
"""Constructor with only original params works."""
|
||||
strategy = BestFirstCrawlingStrategy(
|
||||
max_depth=2,
|
||||
include_external=False,
|
||||
max_pages=10,
|
||||
)
|
||||
|
||||
assert strategy.max_depth == 2
|
||||
assert strategy.include_external == False
|
||||
assert strategy.max_pages == 10
|
||||
assert strategy._resume_state is None
|
||||
assert strategy._on_state_change is None
|
||||
assert strategy._queue_shadow is None # Not initialized without callback
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_scorer_integration(self):
|
||||
"""URL scorer still affects crawl priority."""
|
||||
scorer = KeywordRelevanceScorer(keywords=["important"], weight=1.0)
|
||||
|
||||
strategy = BestFirstCrawlingStrategy(
|
||||
max_depth=2,
|
||||
max_pages=10,
|
||||
url_scorer=scorer,
|
||||
)
|
||||
|
||||
assert strategy.url_scorer is scorer
|
||||
|
||||
|
||||
class TestAPICompatibility:
|
||||
"""Ensure API/serialization compatibility."""
|
||||
|
||||
def test_strategy_signature_backward_compatible(self):
|
||||
"""Old code calling with positional/keyword args still works."""
|
||||
# Positional args (old style)
|
||||
s1 = BFSDeepCrawlStrategy(2)
|
||||
assert s1.max_depth == 2
|
||||
|
||||
# Keyword args (old style)
|
||||
s2 = BFSDeepCrawlStrategy(max_depth=3, max_pages=10)
|
||||
assert s2.max_depth == 3
|
||||
|
||||
# Mixed (old style)
|
||||
s3 = BFSDeepCrawlStrategy(2, FilterChain(), None, False, float('-inf'), 100)
|
||||
assert s3.max_depth == 2
|
||||
assert s3.max_pages == 100
|
||||
|
||||
def test_no_required_new_params(self):
|
||||
"""New params are optional, not required."""
|
||||
# Should not raise
|
||||
BFSDeepCrawlStrategy(max_depth=2)
|
||||
DFSDeepCrawlStrategy(max_depth=2)
|
||||
BestFirstCrawlingStrategy(max_depth=2)
|
||||
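The unit tests above drive the strategies directly with mocks; in practice the same resume_state / on_state_change knobs are wired to external storage. A minimal sketch follows, assuming a local JSON file as the checkpoint store; the file name and save_state helper are illustrative, not part of the library.

```python
import asyncio
import json
from pathlib import Path

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

STATE_FILE = Path("crawl_state.json")  # illustrative; any store (file, Redis, DB) works

async def save_state(state: dict) -> None:
    # Called after every crawled URL; the state dict is JSON-serializable by design.
    STATE_FILE.write_text(json.dumps(state))

async def main():
    resume_state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else None

    strategy = BFSDeepCrawlStrategy(
        max_depth=2,
        max_pages=20,
        resume_state=resume_state,     # None on a first run, saved checkpoint after a crash
        on_state_change=save_state,    # checkpoint after each URL
    )
    config = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=False)

    async with AsyncWebCrawler() as crawler:
        await crawler.arun("https://books.toscrape.com", config=config)

    # The last checkpoint is also available directly from the strategy.
    print(strategy.export_state())

asyncio.run(main())
```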
tests/deep_crawling/test_deep_crawl_resume_integration.py
@@ -0,0 +1,162 @@
"""
|
||||
Integration Test: Deep Crawl Resume with Real URLs
|
||||
|
||||
Tests the crash recovery feature using books.toscrape.com - a site
|
||||
designed for scraping practice with a clear hierarchy:
|
||||
- Home page → Category pages → Book detail pages
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import asyncio
|
||||
import json
|
||||
from typing import Dict, Any, List
|
||||
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
|
||||
|
||||
|
||||
class TestBFSResumeIntegration:
|
||||
"""Integration tests for BFS resume with real crawling."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_real_crawl_state_capture_and_resume(self):
|
||||
"""
|
||||
Test crash recovery with real URLs from books.toscrape.com.
|
||||
|
||||
Flow:
|
||||
1. Start crawl with state callback
|
||||
2. Stop after N pages (simulated crash)
|
||||
3. Resume from saved state
|
||||
4. Verify no duplicate crawls
|
||||
"""
|
||||
# Phase 1: Initial crawl that "crashes" after 3 pages
|
||||
crash_after = 3
|
||||
captured_states: List[Dict[str, Any]] = []
|
||||
crawled_urls_phase1: List[str] = []
|
||||
|
||||
async def capture_state_until_crash(state: Dict[str, Any]):
|
||||
captured_states.append(state)
|
||||
crawled_urls_phase1.clear()
|
||||
crawled_urls_phase1.extend(state["visited"])
|
||||
|
||||
if state["pages_crawled"] >= crash_after:
|
||||
raise Exception("Simulated crash!")
|
||||
|
||||
strategy1 = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
max_pages=10,
|
||||
on_state_change=capture_state_until_crash,
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
deep_crawl_strategy=strategy1,
|
||||
stream=False,
|
||||
verbose=False,
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(verbose=False) as crawler:
|
||||
# First crawl - will crash after 3 pages
|
||||
with pytest.raises(Exception, match="Simulated crash"):
|
||||
await crawler.arun("https://books.toscrape.com", config=config)
|
||||
|
||||
# Verify we captured state before crash
|
||||
assert len(captured_states) > 0, "No states captured before crash"
|
||||
last_state = captured_states[-1]
|
||||
|
||||
print(f"\n=== Phase 1: Crashed after {last_state['pages_crawled']} pages ===")
|
||||
print(f"Visited URLs: {len(last_state['visited'])}")
|
||||
print(f"Pending URLs: {len(last_state['pending'])}")
|
||||
|
||||
# Verify state structure
|
||||
assert last_state["strategy_type"] == "bfs"
|
||||
assert last_state["pages_crawled"] >= crash_after
|
||||
assert len(last_state["visited"]) > 0
|
||||
assert "pending" in last_state
|
||||
assert "depths" in last_state
|
||||
|
||||
# Verify state is JSON serializable (important for Redis/DB storage)
|
||||
json_str = json.dumps(last_state)
|
||||
restored_state = json.loads(json_str)
|
||||
assert restored_state == last_state, "State not JSON round-trip safe"
|
||||
|
||||
# Phase 2: Resume from checkpoint
|
||||
crawled_urls_phase2: List[str] = []
|
||||
|
||||
async def track_resumed_crawl(state: Dict[str, Any]):
|
||||
# Track what's being crawled in phase 2
|
||||
new_visited = set(state["visited"]) - set(last_state["visited"])
|
||||
for url in new_visited:
|
||||
if url not in crawled_urls_phase2:
|
||||
crawled_urls_phase2.append(url)
|
||||
|
||||
strategy2 = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
max_pages=10,
|
||||
resume_state=restored_state,
|
||||
on_state_change=track_resumed_crawl,
|
||||
)
|
||||
|
||||
config2 = CrawlerRunConfig(
|
||||
deep_crawl_strategy=strategy2,
|
||||
stream=False,
|
||||
verbose=False,
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(verbose=False) as crawler:
|
||||
results = await crawler.arun("https://books.toscrape.com", config=config2)
|
||||
|
||||
print(f"\n=== Phase 2: Resumed crawl ===")
|
||||
print(f"New URLs crawled: {len(crawled_urls_phase2)}")
|
||||
print(f"Final pages_crawled: {strategy2._pages_crawled}")
|
||||
|
||||
# Verify no duplicates - URLs from phase 1 should not be re-crawled
|
||||
already_crawled = set(last_state["visited"]) - {item["url"] for item in last_state["pending"]}
|
||||
duplicates = set(crawled_urls_phase2) & already_crawled
|
||||
|
||||
assert len(duplicates) == 0, f"Duplicate crawls detected: {duplicates}"
|
||||
|
||||
# Verify we made progress (crawled some of the pending URLs)
|
||||
pending_urls = {item["url"] for item in last_state["pending"]}
|
||||
crawled_pending = set(crawled_urls_phase2) & pending_urls
|
||||
|
||||
print(f"Pending URLs crawled in phase 2: {len(crawled_pending)}")
|
||||
|
||||
# Final state should show more pages crawled than before crash
|
||||
final_state = strategy2.export_state()
|
||||
if final_state:
|
||||
assert final_state["pages_crawled"] >= last_state["pages_crawled"], \
|
||||
"Resume did not make progress"
|
||||
|
||||
print("\n=== Integration test PASSED ===")
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_state_export_method(self):
|
||||
"""Test that export_state() returns valid state during crawl."""
|
||||
states_from_callback: List[Dict] = []
|
||||
|
||||
async def capture(state):
|
||||
states_from_callback.append(state)
|
||||
|
||||
strategy = BFSDeepCrawlStrategy(
|
||||
max_depth=1,
|
||||
max_pages=3,
|
||||
on_state_change=capture,
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
deep_crawl_strategy=strategy,
|
||||
stream=False,
|
||||
verbose=False,
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(verbose=False) as crawler:
|
||||
await crawler.arun("https://books.toscrape.com", config=config)
|
||||
|
||||
# export_state should return the last captured state
|
||||
exported = strategy.export_state()
|
||||
|
||||
assert exported is not None, "export_state() returned None"
|
||||
assert exported == states_from_callback[-1], "export_state() doesn't match last callback"
|
||||
|
||||
print(f"\n=== export_state() test PASSED ===")
|
||||
print(f"Final state: {exported['pages_crawled']} pages, {len(exported['visited'])} visited")
@@ -7,9 +7,46 @@ adapted for the Docker API with real URLs

import requests
import json
import time
-from typing import Dict, Any
+from typing import Dict, Optional

-API_BASE_URL = "http://localhost:11234"
+API_BASE_URL = "http://localhost:11235"

# Global token storage
_auth_token: Optional[str] = None


def get_auth_token(email: str = "test@gmail.com") -> str:
    """
    Get a JWT token from the /token endpoint.
    The email domain must have valid MX records.
    """
    global _auth_token

    if _auth_token:
        return _auth_token

    print(f"🔐 Requesting JWT token for {email}...")
    response = requests.post(
        f"{API_BASE_URL}/token",
        json={"email": email}
    )

    if response.status_code == 200:
        data = response.json()
        _auth_token = data["access_token"]
        print(f"✅ Token obtained successfully")
        return _auth_token
    else:
        raise Exception(f"Failed to get token: {response.status_code} - {response.text}")


def get_auth_headers() -> Dict[str, str]:
    """Get headers with JWT Bearer token."""
    token = get_auth_token()
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    }


def test_all_hooks_demo():
|
||||
@@ -164,8 +201,8 @@ async def hook(page, context, html, **kwargs):
|
||||
|
||||
print("\nSending request with all 8 hooks...")
|
||||
start_time = time.time()
|
||||
|
||||
-response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+response = requests.post(f"{API_BASE_URL}/crawl", json=payload, headers=get_auth_headers())
|
||||
|
||||
elapsed_time = time.time() - start_time
|
||||
print(f"Request completed in {elapsed_time:.2f} seconds")
|
||||
@@ -278,7 +315,7 @@ async def hook(page, context, url, **kwargs):
|
||||
}
|
||||
|
||||
print("\nTesting authentication with httpbin endpoints...")
|
||||
-response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+response = requests.post(f"{API_BASE_URL}/crawl", json=payload, headers=get_auth_headers())
|
||||
|
||||
if response.status_code == 200:
|
||||
data = response.json()
|
||||
@@ -371,8 +408,8 @@ async def hook(page, context, **kwargs):
|
||||
|
||||
print("\nTesting performance optimization hooks...")
|
||||
start_time = time.time()
|
||||
|
||||
-response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+response = requests.post(f"{API_BASE_URL}/crawl", json=payload, headers=get_auth_headers())
|
||||
|
||||
elapsed_time = time.time() - start_time
|
||||
print(f"Request completed in {elapsed_time:.2f} seconds")
|
||||
@@ -462,7 +499,7 @@ async def hook(page, context, **kwargs):
|
||||
}
|
||||
|
||||
print("\nTesting content extraction hooks...")
|
||||
-response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+response = requests.post(f"{API_BASE_URL}/crawl", json=payload, headers=get_auth_headers())
|
||||
|
||||
if response.status_code == 200:
|
||||
data = response.json()
|
||||
@@ -485,7 +522,16 @@ def main():
|
||||
print("🔧 Crawl4AI Docker API - Comprehensive Hooks Testing")
|
||||
print("Based on docs/examples/hooks_example.py")
|
||||
print("=" * 70)
|
||||
|
||||
|
||||
# Get JWT token first (required when jwt_enabled=true)
|
||||
try:
|
||||
get_auth_token()
|
||||
print("=" * 70)
|
||||
except Exception as e:
|
||||
print(f"❌ Failed to authenticate: {e}")
|
||||
print("Make sure the server is running and jwt_enabled is configured correctly.")
|
||||
return
|
||||
|
||||
tests = [
|
||||
("All Hooks Demo", test_all_hooks_demo),
|
||||
("Authentication Flow", test_authentication_flow),
|
||||
|
||||
tests/proxy/test_sticky_sessions.py (new file, 569 lines)
@@ -0,0 +1,569 @@
|
||||
"""
|
||||
Comprehensive test suite for Sticky Proxy Sessions functionality.
|
||||
|
||||
Tests cover:
|
||||
1. Basic sticky session - same proxy for same session_id
|
||||
2. Different sessions get different proxies
|
||||
3. Session release
|
||||
4. TTL expiration
|
||||
5. Thread safety / concurrent access
|
||||
6. Integration tests with AsyncWebCrawler
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
import time
|
||||
import pytest
|
||||
from unittest.mock import patch
|
||||
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig
|
||||
from crawl4ai.async_configs import CrawlerRunConfig, ProxyConfig
|
||||
from crawl4ai.proxy_strategy import RoundRobinProxyStrategy
|
||||
from crawl4ai.cache_context import CacheMode
|
||||
|
||||
|
||||
class TestRoundRobinProxyStrategySession:
|
||||
"""Test suite for RoundRobinProxyStrategy session methods."""
|
||||
|
||||
def setup_method(self):
|
||||
"""Setup for each test method."""
|
||||
self.proxies = [
|
||||
ProxyConfig(server=f"http://proxy{i}.test:8080")
|
||||
for i in range(5)
|
||||
]
|
||||
|
||||
# ==================== BASIC STICKY SESSION TESTS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_sticky_session_same_proxy(self):
|
||||
"""Verify same proxy is returned for same session_id."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
# First call - acquires proxy
|
||||
proxy1 = await strategy.get_proxy_for_session("session-1")
|
||||
|
||||
# Second call - should return same proxy
|
||||
proxy2 = await strategy.get_proxy_for_session("session-1")
|
||||
|
||||
# Third call - should return same proxy
|
||||
proxy3 = await strategy.get_proxy_for_session("session-1")
|
||||
|
||||
assert proxy1 is not None
|
||||
assert proxy1.server == proxy2.server == proxy3.server
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_different_sessions_different_proxies(self):
|
||||
"""Verify different session_ids can get different proxies."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
proxy_a = await strategy.get_proxy_for_session("session-a")
|
||||
proxy_b = await strategy.get_proxy_for_session("session-b")
|
||||
proxy_c = await strategy.get_proxy_for_session("session-c")
|
||||
|
||||
# All should be different (round-robin)
|
||||
servers = {proxy_a.server, proxy_b.server, proxy_c.server}
|
||||
assert len(servers) == 3
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_sticky_session_with_regular_rotation(self):
|
||||
"""Verify sticky sessions don't interfere with regular rotation."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
# Acquire a sticky session
|
||||
session_proxy = await strategy.get_proxy_for_session("sticky-session")
|
||||
|
||||
# Regular rotation should continue independently
|
||||
regular_proxy1 = await strategy.get_next_proxy()
|
||||
regular_proxy2 = await strategy.get_next_proxy()
|
||||
|
||||
# Sticky session should still return same proxy
|
||||
session_proxy_again = await strategy.get_proxy_for_session("sticky-session")
|
||||
|
||||
assert session_proxy.server == session_proxy_again.server
|
||||
# Regular proxies should rotate
|
||||
assert regular_proxy1.server != regular_proxy2.server
|
||||
|
||||
# ==================== SESSION RELEASE TESTS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_session_release(self):
|
||||
"""Verify session can be released and reacquired."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
# Acquire session
|
||||
proxy1 = await strategy.get_proxy_for_session("session-1")
|
||||
assert strategy.get_session_proxy("session-1") is not None
|
||||
|
||||
# Release session
|
||||
await strategy.release_session("session-1")
|
||||
assert strategy.get_session_proxy("session-1") is None
|
||||
|
||||
# Reacquire - should get a new proxy (next in round-robin)
|
||||
proxy2 = await strategy.get_proxy_for_session("session-1")
|
||||
assert proxy2 is not None
|
||||
# After release, next call gets the next proxy in rotation
|
||||
# (not necessarily the same as before)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_release_nonexistent_session(self):
|
||||
"""Verify releasing non-existent session doesn't raise error."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
# Should not raise
|
||||
await strategy.release_session("nonexistent-session")
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_release_twice(self):
|
||||
"""Verify releasing session twice doesn't raise error."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
await strategy.get_proxy_for_session("session-1")
|
||||
await strategy.release_session("session-1")
|
||||
await strategy.release_session("session-1") # Should not raise
|
||||
|
||||
# ==================== GET SESSION PROXY TESTS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_get_session_proxy_existing(self):
|
||||
"""Verify get_session_proxy returns proxy for existing session."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
acquired = await strategy.get_proxy_for_session("session-1")
|
||||
retrieved = strategy.get_session_proxy("session-1")
|
||||
|
||||
assert retrieved is not None
|
||||
assert acquired.server == retrieved.server
|
||||
|
||||
def test_get_session_proxy_nonexistent(self):
|
||||
"""Verify get_session_proxy returns None for non-existent session."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
result = strategy.get_session_proxy("nonexistent-session")
|
||||
assert result is None
|
||||
|
||||
# ==================== TTL EXPIRATION TESTS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_session_ttl_not_expired(self):
|
||||
"""Verify session returns same proxy when TTL not expired."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
# Acquire with 10 second TTL
|
||||
proxy1 = await strategy.get_proxy_for_session("session-1", ttl=10)
|
||||
|
||||
# Immediately request again - should return same proxy
|
||||
proxy2 = await strategy.get_proxy_for_session("session-1", ttl=10)
|
||||
|
||||
assert proxy1.server == proxy2.server
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_session_ttl_expired(self):
|
||||
"""Verify new proxy acquired after TTL expires."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
# Acquire with 1 second TTL
|
||||
proxy1 = await strategy.get_proxy_for_session("session-1", ttl=1)
|
||||
|
||||
# Wait for TTL to expire
|
||||
await asyncio.sleep(1.1)
|
||||
|
||||
# Request again - should get new proxy due to expiration
|
||||
proxy2 = await strategy.get_proxy_for_session("session-1", ttl=1)
|
||||
|
||||
# May or may not be same server depending on round-robin state,
|
||||
# but session should have been recreated
|
||||
assert proxy2 is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_get_session_proxy_ttl_expired(self):
|
||||
"""Verify get_session_proxy returns None after TTL expires."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
await strategy.get_proxy_for_session("session-1", ttl=1)
|
||||
|
||||
# Wait for expiration
|
||||
await asyncio.sleep(1.1)
|
||||
|
||||
# Should return None for expired session
|
||||
result = strategy.get_session_proxy("session-1")
|
||||
assert result is None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_cleanup_expired_sessions(self):
|
||||
"""Verify cleanup_expired_sessions removes expired sessions."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
# Create sessions with short TTL
|
||||
await strategy.get_proxy_for_session("short-ttl-1", ttl=1)
|
||||
await strategy.get_proxy_for_session("short-ttl-2", ttl=1)
|
||||
# Create session without TTL (should not be cleaned up)
|
||||
await strategy.get_proxy_for_session("no-ttl")
|
||||
|
||||
# Wait for TTL to expire
|
||||
await asyncio.sleep(1.1)
|
||||
|
||||
# Cleanup
|
||||
removed = await strategy.cleanup_expired_sessions()
|
||||
|
||||
assert removed == 2
|
||||
assert strategy.get_session_proxy("short-ttl-1") is None
|
||||
assert strategy.get_session_proxy("short-ttl-2") is None
|
||||
assert strategy.get_session_proxy("no-ttl") is not None
|
||||
|
||||
# ==================== GET ACTIVE SESSIONS TESTS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_get_active_sessions(self):
|
||||
"""Verify get_active_sessions returns all active sessions."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
await strategy.get_proxy_for_session("session-a")
|
||||
await strategy.get_proxy_for_session("session-b")
|
||||
await strategy.get_proxy_for_session("session-c")
|
||||
|
||||
active = strategy.get_active_sessions()
|
||||
|
||||
assert len(active) == 3
|
||||
assert "session-a" in active
|
||||
assert "session-b" in active
|
||||
assert "session-c" in active
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_get_active_sessions_excludes_expired(self):
|
||||
"""Verify get_active_sessions excludes expired sessions."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
await strategy.get_proxy_for_session("short-ttl", ttl=1)
|
||||
await strategy.get_proxy_for_session("no-ttl")
|
||||
|
||||
# Before expiration
|
||||
active = strategy.get_active_sessions()
|
||||
assert len(active) == 2
|
||||
|
||||
# Wait for TTL to expire
|
||||
await asyncio.sleep(1.1)
|
||||
|
||||
# After expiration
|
||||
active = strategy.get_active_sessions()
|
||||
assert len(active) == 1
|
||||
assert "no-ttl" in active
|
||||
assert "short-ttl" not in active
|
||||
|
||||
# ==================== THREAD SAFETY TESTS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_concurrent_session_access(self):
|
||||
"""Verify thread-safe access to sessions."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
async def acquire_session(session_id: str):
|
||||
proxy = await strategy.get_proxy_for_session(session_id)
|
||||
await asyncio.sleep(0.01) # Simulate work
|
||||
return proxy.server
|
||||
|
||||
# Acquire same session from multiple coroutines
|
||||
results = await asyncio.gather(*[
|
||||
acquire_session("shared-session") for _ in range(10)
|
||||
])
|
||||
|
||||
# All should get same proxy
|
||||
assert len(set(results)) == 1
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_concurrent_different_sessions(self):
|
||||
"""Verify concurrent acquisition of different sessions works correctly."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
async def acquire_session(session_id: str):
|
||||
proxy = await strategy.get_proxy_for_session(session_id)
|
||||
await asyncio.sleep(0.01)
|
||||
return (session_id, proxy.server)
|
||||
|
||||
# Acquire different sessions concurrently
|
||||
results = await asyncio.gather(*[
|
||||
acquire_session(f"session-{i}") for i in range(5)
|
||||
])
|
||||
|
||||
# Each session should have a consistent proxy
|
||||
session_proxies = dict(results)
|
||||
assert len(session_proxies) == 5
|
||||
|
||||
# Verify each session still returns same proxy
|
||||
for session_id, expected_server in session_proxies.items():
|
||||
actual = await strategy.get_proxy_for_session(session_id)
|
||||
assert actual.server == expected_server
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_concurrent_session_acquire_and_release(self):
|
||||
"""Verify concurrent acquire and release operations work correctly."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
async def acquire_and_release(session_id: str):
|
||||
proxy = await strategy.get_proxy_for_session(session_id)
|
||||
await asyncio.sleep(0.01)
|
||||
await strategy.release_session(session_id)
|
||||
return proxy.server
|
||||
|
||||
# Run multiple acquire/release cycles concurrently
|
||||
await asyncio.gather(*[
|
||||
acquire_and_release(f"session-{i}") for i in range(10)
|
||||
])
|
||||
|
||||
# All sessions should be released
|
||||
active = strategy.get_active_sessions()
|
||||
assert len(active) == 0
|
||||
|
||||
# ==================== EMPTY PROXY POOL TESTS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_empty_proxy_pool_session(self):
|
||||
"""Verify behavior with empty proxy pool."""
|
||||
strategy = RoundRobinProxyStrategy() # No proxies
|
||||
|
||||
result = await strategy.get_proxy_for_session("session-1")
|
||||
assert result is None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_add_proxies_after_session(self):
|
||||
"""Verify adding proxies after session creation works."""
|
||||
strategy = RoundRobinProxyStrategy()
|
||||
|
||||
# No proxies initially
|
||||
result1 = await strategy.get_proxy_for_session("session-1")
|
||||
assert result1 is None
|
||||
|
||||
# Add proxies
|
||||
strategy.add_proxies(self.proxies)
|
||||
|
||||
# Now should work
|
||||
result2 = await strategy.get_proxy_for_session("session-2")
|
||||
assert result2 is not None
|
||||
|
||||
|
||||
class TestCrawlerRunConfigSession:
|
||||
"""Test CrawlerRunConfig with sticky session parameters."""
|
||||
|
||||
def test_config_has_session_fields(self):
|
||||
"""Verify CrawlerRunConfig has sticky session fields."""
|
||||
config = CrawlerRunConfig(
|
||||
proxy_session_id="test-session",
|
||||
proxy_session_ttl=300,
|
||||
proxy_session_auto_release=True
|
||||
)
|
||||
|
||||
assert config.proxy_session_id == "test-session"
|
||||
assert config.proxy_session_ttl == 300
|
||||
assert config.proxy_session_auto_release is True
|
||||
|
||||
def test_config_session_defaults(self):
|
||||
"""Verify default values for session fields."""
|
||||
config = CrawlerRunConfig()
|
||||
|
||||
assert config.proxy_session_id is None
|
||||
assert config.proxy_session_ttl is None
|
||||
assert config.proxy_session_auto_release is False
|
||||
|
||||
|
||||
class TestCrawlerStickySessionIntegration:
|
||||
"""Integration tests for AsyncWebCrawler with sticky sessions."""
|
||||
|
||||
def setup_method(self):
|
||||
"""Setup for each test method."""
|
||||
self.proxies = [
|
||||
ProxyConfig(server=f"http://proxy{i}.test:8080")
|
||||
for i in range(3)
|
||||
]
|
||||
self.test_url = "https://httpbin.org/ip"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_crawler_sticky_session_without_proxy(self):
|
||||
"""Test that crawler works when proxy_session_id set but no strategy."""
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
proxy_session_id="test-session",
|
||||
page_timeout=15000
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url=self.test_url, config=config)
|
||||
# Should work without errors (no proxy strategy means no proxy)
|
||||
assert result is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_crawler_sticky_session_basic(self):
|
||||
"""Test basic sticky session with crawler."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
proxy_rotation_strategy=strategy,
|
||||
proxy_session_id="integration-test",
|
||||
page_timeout=10000
|
||||
)
|
||||
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# First request
|
||||
try:
|
||||
result1 = await crawler.arun(url=self.test_url, config=config)
|
||||
except Exception:
|
||||
pass # Proxy connection may fail, but session should be tracked
|
||||
|
||||
# Verify session was created
|
||||
session_proxy = strategy.get_session_proxy("integration-test")
|
||||
assert session_proxy is not None
|
||||
|
||||
# Cleanup
|
||||
await strategy.release_session("integration-test")
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_crawler_rotating_vs_sticky(self):
|
||||
"""Compare rotating behavior vs sticky session behavior."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
# Config WITHOUT sticky session - should rotate
|
||||
rotating_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
proxy_rotation_strategy=strategy,
|
||||
page_timeout=5000
|
||||
)
|
||||
|
||||
# Config WITH sticky session - should use same proxy
|
||||
sticky_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
proxy_rotation_strategy=strategy,
|
||||
proxy_session_id="sticky-test",
|
||||
page_timeout=5000
|
||||
)
|
||||
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# Track proxy configs used
|
||||
rotating_proxies = []
|
||||
sticky_proxies = []
|
||||
|
||||
# Try rotating requests (may fail due to test proxies, but config should be set)
|
||||
for _ in range(3):
|
||||
try:
|
||||
await crawler.arun(url=self.test_url, config=rotating_config)
|
||||
except Exception:
|
||||
pass
|
||||
rotating_proxies.append(rotating_config.proxy_config.server if rotating_config.proxy_config else None)
|
||||
|
||||
# Try sticky requests
|
||||
for _ in range(3):
|
||||
try:
|
||||
await crawler.arun(url=self.test_url, config=sticky_config)
|
||||
except Exception:
|
||||
pass
|
||||
sticky_proxies.append(sticky_config.proxy_config.server if sticky_config.proxy_config else None)
|
||||
|
||||
# Rotating should have different proxies (or cycle through them)
|
||||
# Sticky should have same proxy for all requests
|
||||
if all(sticky_proxies):
|
||||
assert len(set(sticky_proxies)) == 1, "Sticky session should use same proxy"
|
||||
|
||||
await strategy.release_session("sticky-test")
|
||||
|
||||
|
||||
class TestStickySessionRealWorld:
|
||||
"""Real-world scenario tests for sticky sessions.
|
||||
|
||||
Note: These tests require actual proxy servers to verify IP consistency.
|
||||
They are marked to be skipped if no proxy is configured.
|
||||
"""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@pytest.mark.skipif(
|
||||
not os.environ.get('TEST_PROXY_1'),
|
||||
reason="Requires TEST_PROXY_1 environment variable"
|
||||
)
|
||||
async def test_verify_ip_consistency(self):
|
||||
"""Verify that sticky session actually uses same IP.
|
||||
|
||||
This test requires real proxies set in environment variables:
|
||||
TEST_PROXY_1=ip:port:user:pass
|
||||
TEST_PROXY_2=ip:port:user:pass
|
||||
"""
|
||||
import re
|
||||
|
||||
# Load proxies from environment
|
||||
proxy_strs = [
|
||||
os.environ.get('TEST_PROXY_1', ''),
|
||||
os.environ.get('TEST_PROXY_2', '')
|
||||
]
|
||||
proxies = [ProxyConfig.from_string(p) for p in proxy_strs if p]
|
||||
|
||||
if len(proxies) < 2:
|
||||
pytest.skip("Need at least 2 proxies for this test")
|
||||
|
||||
strategy = RoundRobinProxyStrategy(proxies)
|
||||
|
||||
# Config WITH sticky session
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
proxy_rotation_strategy=strategy,
|
||||
proxy_session_id="ip-verify-session",
|
||||
page_timeout=30000
|
||||
)
|
||||
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
ips = []
|
||||
|
||||
for i in range(3):
|
||||
result = await crawler.arun(
|
||||
url="https://httpbin.org/ip",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result and result.success and result.html:
|
||||
# Extract IP from response
|
||||
ip_match = re.search(r'"origin":\s*"([^"]+)"', result.html)
|
||||
if ip_match:
|
||||
ips.append(ip_match.group(1))
|
||||
|
||||
await strategy.release_session("ip-verify-session")
|
||||
|
||||
# All IPs should be same for sticky session
|
||||
if len(ips) >= 2:
|
||||
assert len(set(ips)) == 1, f"Expected same IP, got: {ips}"
|
||||
|
||||
|
||||
# ==================== STANDALONE TEST FUNCTIONS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_sticky_session_simple():
|
||||
"""Simple test for sticky session functionality."""
|
||||
proxies = [
|
||||
ProxyConfig(server=f"http://proxy{i}.test:8080")
|
||||
for i in range(3)
|
||||
]
|
||||
strategy = RoundRobinProxyStrategy(proxies)
|
||||
|
||||
# Same session should return same proxy
|
||||
p1 = await strategy.get_proxy_for_session("test")
|
||||
p2 = await strategy.get_proxy_for_session("test")
|
||||
p3 = await strategy.get_proxy_for_session("test")
|
||||
|
||||
assert p1.server == p2.server == p3.server
|
||||
print(f"Sticky session works! All requests use: {p1.server}")
|
||||
|
||||
# Cleanup
|
||||
await strategy.release_session("test")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
print("Running Sticky Session tests...")
|
||||
print("=" * 50)
|
||||
|
||||
asyncio.run(test_sticky_session_simple())
|
||||
|
||||
print("\n" + "=" * 50)
|
||||
print("To run the full pytest suite, use: pytest " + __file__)
|
||||
print("=" * 50)
|
||||
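Taken together, the suite above pins down the sticky-session surface: get_proxy_for_session() / release_session() on RoundRobinProxyStrategy, plus the proxy_session_id, proxy_session_ttl, and proxy_session_auto_release fields on CrawlerRunConfig. A minimal usage sketch, assuming placeholder proxy servers:

import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig
from crawl4ai.async_configs import CrawlerRunConfig, ProxyConfig
from crawl4ai.proxy_strategy import RoundRobinProxyStrategy


async def main():
    # Placeholder proxies; swap in real servers.
    strategy = RoundRobinProxyStrategy([
        ProxyConfig(server="http://proxy1.example:8080"),
        ProxyConfig(server="http://proxy2.example:8080"),
    ])

    config = CrawlerRunConfig(
        proxy_rotation_strategy=strategy,
        proxy_session_id="user-42",  # every request with this id reuses one proxy
        proxy_session_ttl=300,       # optional: session expires after 5 minutes
    )

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        for url in ["https://example.com/login", "https://example.com/account"]:
            await crawler.arun(url, config=config)

    # Release the binding when the logical session ends
    # (proxy_session_auto_release=True is the config-side alternative).
    await strategy.release_session("user-42")


asyncio.run(main())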
tests/test_prefetch_integration.py (new file, 236 lines)
@@ -0,0 +1,236 @@
|
||||
"""Integration tests for prefetch mode with the crawler."""
|
||||
|
||||
import pytest
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
|
||||
|
||||
# Use crawl4ai docs as test domain
|
||||
TEST_DOMAIN = "https://docs.crawl4ai.com"
|
||||
|
||||
|
||||
class TestPrefetchModeIntegration:
|
||||
"""Integration tests for prefetch mode."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_returns_html_and_links(self):
|
||||
"""Test that prefetch mode returns HTML and links only."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(prefetch=True)
|
||||
result = await crawler.arun(TEST_DOMAIN, config=config)
|
||||
|
||||
# Should have HTML
|
||||
assert result.html is not None
|
||||
assert len(result.html) > 0
|
||||
assert "<html" in result.html.lower() or "<!doctype" in result.html.lower()
|
||||
|
||||
# Should have links
|
||||
assert result.links is not None
|
||||
assert "internal" in result.links
|
||||
assert "external" in result.links
|
||||
|
||||
# Should NOT have processed content
|
||||
assert result.markdown is None or (
|
||||
hasattr(result.markdown, 'raw_markdown') and
|
||||
result.markdown.raw_markdown is None
|
||||
)
|
||||
assert result.cleaned_html is None
|
||||
assert result.extracted_content is None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_preserves_metadata(self):
|
||||
"""Test that prefetch mode preserves essential metadata."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(prefetch=True)
|
||||
result = await crawler.arun(TEST_DOMAIN, config=config)
|
||||
|
||||
# Should have success flag
|
||||
assert result.success is True
|
||||
|
||||
# Should have URL
|
||||
assert result.url is not None
|
||||
|
||||
# Status code should be present
|
||||
assert result.status_code is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_with_deep_crawl(self):
|
||||
"""Test prefetch mode with deep crawl strategy."""
|
||||
from crawl4ai import BFSDeepCrawlStrategy
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
prefetch=True,
|
||||
deep_crawl_strategy=BFSDeepCrawlStrategy(
|
||||
max_depth=1,
|
||||
max_pages=3
|
||||
)
|
||||
)
|
||||
|
||||
result_container = await crawler.arun(TEST_DOMAIN, config=config)
|
||||
|
||||
# Handle both list and iterator results
|
||||
if hasattr(result_container, '__aiter__'):
|
||||
results = [r async for r in result_container]
|
||||
else:
|
||||
results = list(result_container) if hasattr(result_container, '__iter__') else [result_container]
|
||||
|
||||
# Each result should have HTML and links
|
||||
for result in results:
|
||||
assert result.html is not None
|
||||
assert result.links is not None
|
||||
|
||||
# Should have crawled at least one page
|
||||
assert len(results) >= 1
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_then_process_with_raw(self):
|
||||
"""Test the full two-phase workflow: prefetch then process."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Phase 1: Prefetch
|
||||
prefetch_config = CrawlerRunConfig(prefetch=True)
|
||||
prefetch_result = await crawler.arun(TEST_DOMAIN, config=prefetch_config)
|
||||
|
||||
stored_html = prefetch_result.html
|
||||
|
||||
assert stored_html is not None
|
||||
assert len(stored_html) > 0
|
||||
|
||||
# Phase 2: Process with raw: URL
|
||||
process_config = CrawlerRunConfig(
|
||||
# No prefetch - full processing
|
||||
base_url=TEST_DOMAIN # Provide base URL for link resolution
|
||||
)
|
||||
processed_result = await crawler.arun(
|
||||
f"raw:{stored_html}",
|
||||
config=process_config
|
||||
)
|
||||
|
||||
# Should now have full processing
|
||||
assert processed_result.html is not None
|
||||
assert processed_result.success is True
|
||||
# Note: cleaned_html and markdown depend on the content
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_links_structure(self):
|
||||
"""Test that links have the expected structure."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(prefetch=True)
|
||||
result = await crawler.arun(TEST_DOMAIN, config=config)
|
||||
|
||||
assert result.links is not None
|
||||
|
||||
# Check internal links structure
|
||||
if result.links["internal"]:
|
||||
link = result.links["internal"][0]
|
||||
assert "href" in link
|
||||
assert "text" in link
|
||||
assert link["href"].startswith("http")
|
||||
|
||||
# Check external links structure (if any)
|
||||
if result.links["external"]:
|
||||
link = result.links["external"][0]
|
||||
assert "href" in link
|
||||
assert "text" in link
|
||||
assert link["href"].startswith("http")
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_config_clone(self):
|
||||
"""Test that config.clone() preserves prefetch setting."""
|
||||
config = CrawlerRunConfig(prefetch=True)
|
||||
cloned = config.clone()
|
||||
|
||||
assert cloned.prefetch == True
|
||||
|
||||
# Clone with override
|
||||
cloned_false = config.clone(prefetch=False)
|
||||
assert cloned_false.prefetch == False
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_to_dict(self):
|
||||
"""Test that to_dict() includes prefetch."""
|
||||
config = CrawlerRunConfig(prefetch=True)
|
||||
config_dict = config.to_dict()
|
||||
|
||||
assert "prefetch" in config_dict
|
||||
assert config_dict["prefetch"] == True
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_default_false(self):
|
||||
"""Test that prefetch defaults to False."""
|
||||
config = CrawlerRunConfig()
|
||||
assert config.prefetch == False
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_explicit_false(self):
|
||||
"""Test explicit prefetch=False works like default."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(prefetch=False)
|
||||
result = await crawler.arun(TEST_DOMAIN, config=config)
|
||||
|
||||
# Should have full processing
|
||||
assert result.html is not None
|
||||
# cleaned_html should be populated in normal mode
|
||||
assert result.cleaned_html is not None
|
||||
|
||||
|
||||
class TestPrefetchPerformance:
|
||||
"""Performance-related tests for prefetch mode."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_returns_quickly(self):
|
||||
"""Test that prefetch mode returns results quickly."""
|
||||
import time
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Prefetch mode
|
||||
start = time.time()
|
||||
prefetch_config = CrawlerRunConfig(prefetch=True)
|
||||
await crawler.arun(TEST_DOMAIN, config=prefetch_config)
|
||||
prefetch_time = time.time() - start
|
||||
|
||||
# Full mode
|
||||
start = time.time()
|
||||
full_config = CrawlerRunConfig()
|
||||
await crawler.arun(TEST_DOMAIN, config=full_config)
|
||||
full_time = time.time() - start
|
||||
|
||||
# Log times for debugging
|
||||
print(f"\nPrefetch: {prefetch_time:.3f}s, Full: {full_time:.3f}s")
|
||||
|
||||
# Prefetch should not be significantly slower
|
||||
# (may be same or slightly faster depending on content)
|
||||
# This is a soft check - mostly for logging
|
||||
|
||||
|
||||
class TestPrefetchWithRawHTML:
|
||||
"""Test prefetch mode with raw HTML input."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_with_raw_html(self):
|
||||
"""Test prefetch mode works with raw: URL scheme."""
|
||||
sample_html = """
|
||||
<html>
|
||||
<head><title>Test Page</title></head>
|
||||
<body>
|
||||
<h1>Hello World</h1>
|
||||
<a href="/link1">Link 1</a>
|
||||
<a href="/link2">Link 2</a>
|
||||
<a href="https://external.com/page">External</a>
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
prefetch=True,
|
||||
base_url="https://example.com"
|
||||
)
|
||||
result = await crawler.arun(f"raw:{sample_html}", config=config)
|
||||
|
||||
assert result.success is True
|
||||
assert result.html is not None
|
||||
assert result.links is not None
|
||||
|
||||
# Should have extracted links
|
||||
assert len(result.links["internal"]) >= 2
|
||||
assert len(result.links["external"]) >= 1
|
||||
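The test_prefetch_then_process_with_raw case above is the intended two-phase cloud workflow: grab HTML and links cheaply up front, persist them, then run full processing later from the raw: scheme with base_url supplied for link resolution. A condensed sketch, where the in-memory dict stands in for whatever store a deployment actually uses:

from typing import Dict, List

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

html_store: Dict[str, str] = {}  # stand-in for a real cache or database


async def prefetch_then_process(url: str):
    async with AsyncWebCrawler() as crawler:
        # Phase 1: HTML + links only, no markdown/extraction work.
        pre = await crawler.arun(url, config=CrawlerRunConfig(prefetch=True))
        html_store[url] = pre.html
        frontier: List[str] = [link["href"] for link in pre.links["internal"]]

        # Phase 2 (possibly much later): full processing from the stored HTML.
        processed = await crawler.arun(
            f"raw:{html_store[url]}",
            config=CrawlerRunConfig(base_url=url),  # resolve relative links against the origin
        )
        return processed, frontier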
tests/test_prefetch_mode.py (new file, 275 lines)
@@ -0,0 +1,275 @@
|
||||
"""Unit tests for the quick_extract_links function used in prefetch mode."""
|
||||
|
||||
import pytest
|
||||
from crawl4ai.utils import quick_extract_links
|
||||
|
||||
|
||||
class TestQuickExtractLinks:
|
||||
"""Unit tests for the quick_extract_links function."""
|
||||
|
||||
def test_basic_internal_links(self):
|
||||
"""Test extraction of internal links."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="/page1">Page 1</a>
|
||||
<a href="/page2">Page 2</a>
|
||||
<a href="https://example.com/page3">Page 3</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 3
|
||||
assert result["internal"][0]["href"] == "https://example.com/page1"
|
||||
assert result["internal"][0]["text"] == "Page 1"
|
||||
|
||||
def test_external_links(self):
|
||||
"""Test extraction and classification of external links."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="https://other.com/page">External</a>
|
||||
<a href="/internal">Internal</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 1
|
||||
assert len(result["external"]) == 1
|
||||
assert result["external"][0]["href"] == "https://other.com/page"
|
||||
|
||||
def test_ignores_javascript_and_mailto(self):
|
||||
"""Test that javascript: and mailto: links are ignored."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="javascript:void(0)">Click</a>
|
||||
<a href="mailto:test@example.com">Email</a>
|
||||
<a href="tel:+1234567890">Call</a>
|
||||
<a href="/valid">Valid</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 1
|
||||
assert result["internal"][0]["href"] == "https://example.com/valid"
|
||||
|
||||
def test_ignores_anchor_only_links(self):
|
||||
"""Test that anchor-only links (#section) are ignored."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="#section1">Section 1</a>
|
||||
<a href="#section2">Section 2</a>
|
||||
<a href="/page#section">Page with anchor</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
# Only the page link should be included, anchor-only links are skipped
|
||||
assert len(result["internal"]) == 1
|
||||
assert "/page" in result["internal"][0]["href"]
|
||||
|
||||
def test_deduplication(self):
|
||||
"""Test that duplicate URLs are deduplicated."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="/page">Link 1</a>
|
||||
<a href="/page">Link 2</a>
|
||||
<a href="/page">Link 3</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 1
|
||||
|
||||
def test_handles_malformed_html(self):
|
||||
"""Test graceful handling of malformed HTML."""
|
||||
html = "not valid html at all <><><"
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
# Should not raise, should return empty
|
||||
assert result["internal"] == []
|
||||
assert result["external"] == []
|
||||
|
||||
def test_empty_html(self):
|
||||
"""Test handling of empty HTML."""
|
||||
result = quick_extract_links("", "https://example.com")
|
||||
assert result == {"internal": [], "external": []}
|
||||
|
||||
def test_relative_url_resolution(self):
|
||||
"""Test that relative URLs are resolved correctly."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="page1.html">Relative</a>
|
||||
<a href="./page2.html">Dot Relative</a>
|
||||
<a href="../page3.html">Parent Relative</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com/docs/")
|
||||
|
||||
assert len(result["internal"]) >= 1
|
||||
# All should be internal and properly resolved
|
||||
for link in result["internal"]:
|
||||
assert link["href"].startswith("https://example.com")
|
||||
|
||||
def test_text_truncation(self):
|
||||
"""Test that long link text is truncated to 200 chars."""
|
||||
long_text = "A" * 300
|
||||
html = f'''
|
||||
<html>
|
||||
<body>
|
||||
<a href="/page">{long_text}</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 1
|
||||
assert len(result["internal"][0]["text"]) == 200
|
||||
|
||||
def test_empty_href_ignored(self):
|
||||
"""Test that empty href attributes are ignored."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="">Empty</a>
|
||||
<a href=" ">Whitespace</a>
|
||||
<a href="/valid">Valid</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 1
|
||||
assert result["internal"][0]["href"] == "https://example.com/valid"
|
||||
|
||||
def test_mixed_internal_external(self):
|
||||
"""Test correct classification of mixed internal and external links."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="/internal1">Internal 1</a>
|
||||
<a href="https://example.com/internal2">Internal 2</a>
|
||||
<a href="https://google.com">Google</a>
|
||||
<a href="https://github.com/repo">GitHub</a>
|
||||
<a href="/internal3">Internal 3</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 3
|
||||
assert len(result["external"]) == 2
|
||||
|
||||
def test_subdomain_handling(self):
|
||||
"""Test that subdomains are handled correctly."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="https://docs.example.com/page">Docs subdomain</a>
|
||||
<a href="https://api.example.com/v1">API subdomain</a>
|
||||
<a href="https://example.com/main">Main domain</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
# All should be internal (same base domain)
|
||||
total_links = len(result["internal"]) + len(result["external"])
|
||||
assert total_links == 3
|
||||
|
||||
|
||||
class TestQuickExtractLinksEdgeCases:
|
||||
"""Edge case tests for quick_extract_links."""
|
||||
|
||||
def test_no_links_in_page(self):
|
||||
"""Test page with no links."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<h1>No Links Here</h1>
|
||||
<p>Just some text content.</p>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert result["internal"] == []
|
||||
assert result["external"] == []
|
||||
|
||||
def test_links_in_nested_elements(self):
|
||||
"""Test links nested in various elements."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<nav>
|
||||
<ul>
|
||||
<li><a href="/home">Home</a></li>
|
||||
<li><a href="/about">About</a></li>
|
||||
</ul>
|
||||
</nav>
|
||||
<div class="content">
|
||||
<p>Check out <a href="/products">our products</a>.</p>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 3
|
||||
|
||||
def test_link_with_nested_elements(self):
|
||||
"""Test links containing nested elements."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="/page"><span>Nested</span> <strong>Text</strong></a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 1
|
||||
assert "Nested" in result["internal"][0]["text"]
|
||||
assert "Text" in result["internal"][0]["text"]
|
||||
|
||||
def test_protocol_relative_urls(self):
|
||||
"""Test handling of protocol-relative URLs (//example.com)."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="//cdn.example.com/asset">CDN Link</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
# Should be resolved with https:
|
||||
total = len(result["internal"]) + len(result["external"])
|
||||
assert total >= 1
|
||||
|
||||
def test_whitespace_in_href(self):
|
||||
"""Test handling of whitespace around href values."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href=" /page1 ">Padded</a>
|
||||
<a href="
|
||||
/page2
|
||||
">Multiline</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
# Both should be extracted and normalized
|
||||
assert len(result["internal"]) >= 1
|
||||
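For reference, the function under test takes raw HTML plus a base URL and returns resolved, de-duplicated links split into internal and external buckets, so it can also be called directly outside the crawler:

from crawl4ai.utils import quick_extract_links

html = '<a href="/docs">Docs</a> <a href="https://other.example/page">Elsewhere</a>'
links = quick_extract_links(html, "https://example.com")

# links["internal"] -> entries like {"href": "https://example.com/docs", "text": "Docs"}
# links["external"] -> entries like {"href": "https://other.example/page", "text": "Elsewhere"}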
tests/test_prefetch_regression.py (new file, 232 lines)
@@ -0,0 +1,232 @@
|
||||
"""Regression tests to ensure prefetch mode doesn't break existing functionality."""
|
||||
|
||||
import pytest
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
|
||||
TEST_URL = "https://docs.crawl4ai.com"
|
||||
|
||||
|
||||
class TestNoRegressions:
|
||||
"""Ensure prefetch mode doesn't break existing functionality."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_default_mode_unchanged(self):
|
||||
"""Test that default mode (prefetch=False) works exactly as before."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig() # Default config
|
||||
result = await crawler.arun(TEST_URL, config=config)
|
||||
|
||||
# All standard fields should be populated
|
||||
assert result.html is not None
|
||||
assert result.cleaned_html is not None
|
||||
assert result.links is not None
|
||||
assert result.success is True
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_explicit_prefetch_false(self):
|
||||
"""Test explicit prefetch=False works like default."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(prefetch=False)
|
||||
result = await crawler.arun(TEST_URL, config=config)
|
||||
|
||||
assert result.cleaned_html is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_config_clone_preserves_prefetch(self):
|
||||
"""Test that config.clone() preserves prefetch setting."""
|
||||
config = CrawlerRunConfig(prefetch=True)
|
||||
cloned = config.clone()
|
||||
|
||||
assert cloned.prefetch == True
|
||||
|
||||
# Clone with override
|
||||
cloned_false = config.clone(prefetch=False)
|
||||
assert cloned_false.prefetch == False
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_config_to_dict_includes_prefetch(self):
|
||||
"""Test that to_dict() includes prefetch."""
|
||||
config_true = CrawlerRunConfig(prefetch=True)
|
||||
config_false = CrawlerRunConfig(prefetch=False)
|
||||
|
||||
assert config_true.to_dict()["prefetch"] == True
|
||||
assert config_false.to_dict()["prefetch"] == False
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_existing_extraction_still_works(self):
|
||||
"""Test that extraction strategies still work in normal mode."""
|
||||
from crawl4ai import JsonCssExtractionStrategy
|
||||
|
||||
schema = {
|
||||
"name": "Links",
|
||||
"baseSelector": "a",
|
||||
"fields": [
|
||||
{"name": "href", "selector": "", "type": "attribute", "attribute": "href"},
|
||||
{"name": "text", "selector": "", "type": "text"}
|
||||
]
|
||||
}
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
extraction_strategy=JsonCssExtractionStrategy(schema=schema)
|
||||
)
|
||||
result = await crawler.arun(TEST_URL, config=config)
|
||||
|
||||
assert result.extracted_content is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_existing_deep_crawl_still_works(self):
|
||||
"""Test that deep crawl without prefetch still does full processing."""
|
||||
from crawl4ai import BFSDeepCrawlStrategy
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
deep_crawl_strategy=BFSDeepCrawlStrategy(
|
||||
max_depth=1,
|
||||
max_pages=2
|
||||
)
|
||||
# No prefetch - should do full processing
|
||||
)
|
||||
|
||||
result_container = await crawler.arun(TEST_URL, config=config)
|
||||
|
||||
# Handle both list and iterator results
|
||||
if hasattr(result_container, '__aiter__'):
|
||||
results = [r async for r in result_container]
|
||||
else:
|
||||
results = list(result_container) if hasattr(result_container, '__iter__') else [result_container]
|
||||
|
||||
# Each result should have full processing
|
||||
for result in results:
|
||||
assert result.cleaned_html is not None
|
||||
|
||||
assert len(results) >= 1
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_raw_url_scheme_still_works(self):
|
||||
"""Test that raw: URL scheme works for processing stored HTML."""
|
||||
sample_html = """
|
||||
<html>
|
||||
<head><title>Test Page</title></head>
|
||||
<body>
|
||||
<h1>Hello World</h1>
|
||||
<p>This is a test paragraph.</p>
|
||||
<a href="/link1">Link 1</a>
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(f"raw:{sample_html}", config=config)
|
||||
|
||||
assert result.success is True
|
||||
assert result.html is not None
|
||||
assert "Hello World" in result.html
|
||||
assert result.cleaned_html is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_screenshot_still_works(self):
|
||||
"""Test that screenshot option still works in normal mode."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(screenshot=True)
|
||||
result = await crawler.arun(TEST_URL, config=config)
|
||||
|
||||
assert result.success is True
|
||||
# Screenshot data should be present
|
||||
assert result.screenshot is not None or result.screenshot_data is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_js_execution_still_works(self):
|
||||
"""Test that JavaScript execution still works in normal mode."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
js_code="document.querySelector('h1')?.textContent"
|
||||
)
|
||||
result = await crawler.arun(TEST_URL, config=config)
|
||||
|
||||
assert result.success is True
|
||||
assert result.html is not None
|
||||
|
||||
|
||||
class TestPrefetchDoesNotAffectOtherModes:
|
||||
"""Test that prefetch doesn't interfere with other configurations."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_with_other_options_ignored(self):
|
||||
"""Test that other options are properly ignored in prefetch mode."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
prefetch=True,
|
||||
# These should be ignored in prefetch mode
|
||||
screenshot=True,
|
||||
pdf=True,
|
||||
only_text=True,
|
||||
word_count_threshold=100
|
||||
)
|
||||
result = await crawler.arun(TEST_URL, config=config)
|
||||
|
||||
# Should still return HTML and links
|
||||
assert result.html is not None
|
||||
assert result.links is not None
|
||||
|
||||
# But should NOT have processed content
|
||||
assert result.cleaned_html is None
|
||||
assert result.extracted_content is None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_stream_mode_still_works(self):
|
||||
"""Test that stream mode still works normally."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(stream=True)
|
||||
result = await crawler.arun(TEST_URL, config=config)
|
||||
|
||||
assert result.success is True
|
||||
assert result.html is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_cache_mode_still_works(self):
|
||||
"""Test that cache mode still works normally."""
|
||||
from crawl4ai import CacheMode
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# First request - bypass cache
|
||||
config1 = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
|
||||
result1 = await crawler.arun(TEST_URL, config=config1)
|
||||
assert result1.success is True
|
||||
|
||||
# Second request - should work
|
||||
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
|
||||
result2 = await crawler.arun(TEST_URL, config=config2)
|
||||
assert result2.success is True
|
||||
|
||||
|
||||
class TestBackwardsCompatibility:
|
||||
"""Test backwards compatibility with existing code patterns."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_config_without_prefetch_works(self):
|
||||
"""Test that configs created without prefetch parameter work."""
|
||||
# Simulating old code that doesn't know about prefetch
|
||||
config = CrawlerRunConfig(
|
||||
word_count_threshold=50,
|
||||
css_selector="body"
|
||||
)
|
||||
|
||||
# Should default to prefetch=False
|
||||
assert config.prefetch == False
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(TEST_URL, config=config)
|
||||
assert result.success is True
|
||||
assert result.cleaned_html is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_from_kwargs_without_prefetch(self):
|
||||
"""Test CrawlerRunConfig.from_kwargs works without prefetch."""
|
||||
config = CrawlerRunConfig.from_kwargs({
|
||||
"word_count_threshold": 50,
|
||||
"verbose": False
|
||||
})
|
||||
|
||||
assert config.prefetch == False
|
||||
tests/test_raw_html_browser.py (new file, 172 lines)
@@ -0,0 +1,172 @@
|
||||
"""
|
||||
Tests for raw:/file:// URL browser pipeline support.
|
||||
|
||||
Tests the new feature that allows js_code, wait_for, and other browser operations
|
||||
to work with raw: and file:// URLs by routing them through _crawl_web() with
|
||||
set_content() instead of goto().
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_raw_html_fast_path():
|
||||
"""Test that raw: without browser params returns HTML directly (fast path)."""
|
||||
html = "<html><body><div id='test'>Original Content</div></body></html>"
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig() # No browser params
|
||||
result = await crawler.arun(f"raw:{html}", config=config)
|
||||
|
||||
assert result.success
|
||||
assert "Original Content" in result.html
|
||||
# Fast path should not modify the HTML
|
||||
assert result.html == html
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_js_code_on_raw_html():
|
||||
"""Test that js_code executes on raw: HTML and modifies the DOM."""
|
||||
html = "<html><body><div id='test'>Original</div></body></html>"
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
js_code="document.getElementById('test').innerText = 'Modified by JS'"
|
||||
)
|
||||
result = await crawler.arun(f"raw:{html}", config=config)
|
||||
|
||||
assert result.success
|
||||
assert "Modified by JS" in result.html
|
||||
assert "Original" not in result.html or "Modified by JS" in result.html
|
||||
|
||||
|
||||
@pytest.mark.asyncio
async def test_js_code_adds_element_to_raw_html():
    """Test that js_code can add new elements to raw: HTML."""
    html = "<html><body><div id='container'></div></body></html>"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code='document.getElementById("container").innerHTML = "<span id=\'injected\'>Custom Content</span>"'
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "injected" in result.html
        assert "Custom Content" in result.html


@pytest.mark.asyncio
async def test_screenshot_on_raw_html():
    """Test that screenshots work on raw: HTML."""
    html = "<html><body><h1 style='color:red;font-size:48px;'>Screenshot Test</h1></body></html>"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(screenshot=True)
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert result.screenshot is not None
        assert len(result.screenshot) > 100  # Should have substantial screenshot data


@pytest.mark.asyncio
async def test_process_in_browser_flag():
    """Test that process_in_browser=True forces the browser path even without other params."""
    html = "<html><body><div>Test</div></body></html>"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(process_in_browser=True)
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        # Browser path normalizes HTML, so it may be slightly different
        assert "Test" in result.html


@pytest.mark.asyncio
async def test_raw_prefix_variations():
    """Test both raw: and raw:// prefix formats."""
    html = "<html><body>Content</body></html>"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code='document.body.innerHTML += "<div id=\'added\'>Added</div>"'
        )

        # Test raw: prefix
        result1 = await crawler.arun(f"raw:{html}", config=config)
        assert result1.success
        assert "Added" in result1.html

        # Test raw:// prefix
        result2 = await crawler.arun(f"raw://{html}", config=config)
        assert result2.success
        assert "Added" in result2.html


@pytest.mark.asyncio
async def test_wait_for_on_raw_html():
    """Test that wait_for works with raw: HTML after js_code modifies the DOM."""
    html = "<html><body><div id='container'></div></body></html>"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code='''
                setTimeout(() => {
                    document.getElementById('container').innerHTML = '<div id="delayed">Delayed Content</div>';
                }, 100);
            ''',
            wait_for="#delayed",
            wait_for_timeout=5000
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "Delayed Content" in result.html


@pytest.mark.asyncio
async def test_multiple_js_code_scripts():
    """Test that multiple js_code scripts execute in order."""
    html = "<html><body><div id='counter'>0</div></body></html>"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code=[
                "document.getElementById('counter').innerText = '1'",
                "document.getElementById('counter').innerText = parseInt(document.getElementById('counter').innerText) + 1",
                "document.getElementById('counter').innerText = parseInt(document.getElementById('counter').innerText) + 1",
            ]
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert ">3<" in result.html  # Counter should be 3 after all scripts run

if __name__ == "__main__":
    # Run a quick manual test
    async def quick_test():
        html = "<html><body><div id='test'>Original</div></body></html>"

        async with AsyncWebCrawler(verbose=True) as crawler:
            # Test 1: Fast path
            print("\n=== Test 1: Fast path (no browser params) ===")
            result1 = await crawler.arun(f"raw:{html}")
            print(f"Success: {result1.success}")
            print(f"HTML contains 'Original': {'Original' in result1.html}")

            # Test 2: js_code modifies DOM
            print("\n=== Test 2: js_code modifies DOM ===")
            config = CrawlerRunConfig(
                js_code="document.getElementById('test').innerText = 'Modified by JS'"
            )
            result2 = await crawler.arun(f"raw:{html}", config=config)
            print(f"Success: {result2.success}")
            print(f"HTML contains 'Modified by JS': {'Modified by JS' in result2.html}")
            print(f"HTML snippet: {result2.html[:500]}...")

    asyncio.run(quick_test())
tests/test_raw_html_edge_cases.py (new file, 563 lines)
@@ -0,0 +1,563 @@
"""
|
||||
BRUTAL edge case tests for raw:/file:// URL browser pipeline.
|
||||
|
||||
These tests try to break the system with tricky inputs, edge cases,
|
||||
and compatibility checks to ensure we didn't break existing functionality.
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import asyncio
|
||||
import tempfile
|
||||
import os
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# EDGE CASE: Hash characters in HTML (previously broke urlparse - Issue #283)
|
||||
# ============================================================================
|
||||
|
||||
@pytest.mark.asyncio
async def test_raw_html_with_hash_in_css():
    """Test that # in CSS colors doesn't break HTML parsing (regression for #283)."""
    html = """
    <html>
    <head>
        <style>
            body { background-color: #ff5733; color: #333333; }
            .highlight { border: 1px solid #000; }
        </style>
    </head>
    <body>
        <div class="highlight" style="color: #ffffff;">Content with hash colors</div>
    </body>
    </html>
    """

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(js_code="document.body.innerHTML += '<div id=\"added\">Added</div>'")
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "#ff5733" in result.html or "ff5733" in result.html  # Color should be preserved
        assert "Added" in result.html  # JS executed
        assert "Content with hash colors" in result.html  # Original content preserved


@pytest.mark.asyncio
async def test_raw_html_with_fragment_links():
    """Test that HTML with # fragment links doesn't break."""
    html = """
    <html><body>
        <a href="#section1">Go to section 1</a>
        <a href="#section2">Go to section 2</a>
        <div id="section1">Section 1</div>
        <div id="section2">Section 2</div>
    </body></html>
    """

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(js_code="document.getElementById('section1').innerText = 'Modified Section 1'")
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "Modified Section 1" in result.html
        assert "#section2" in result.html  # Fragment link preserved


# ============================================================================
# EDGE CASE: Special characters and unicode
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_with_unicode():
    """Test raw HTML with various unicode characters."""
    html = """
    <html><body>
        <div id="unicode">日本語 中文 한국어 العربية 🎉 💻 🚀</div>
        <div id="special">& < > " '</div>
    </body></html>
    """

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(js_code="document.getElementById('unicode').innerText += ' ✅ Modified'")
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "✅ Modified" in result.html or "Modified" in result.html
        # Check unicode is preserved
        assert "日本語" in result.html or "&#" in result.html  # Either preserved or encoded


@pytest.mark.asyncio
async def test_raw_html_with_script_tags():
    """Test that script tags already in the raw HTML don't interfere with js_code."""
    html = """
    <html><body>
        <div id="counter">0</div>
        <script>
            // This script runs on page load
            document.getElementById('counter').innerText = '10';
        </script>
    </body></html>
    """

    async with AsyncWebCrawler() as crawler:
        # Our js_code runs AFTER the page scripts
        config = CrawlerRunConfig(
            js_code="document.getElementById('counter').innerText = parseInt(document.getElementById('counter').innerText) + 5"
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        # The embedded script sets it to 10, then our js_code adds 5
        assert ">15<" in result.html or "15" in result.html


# ============================================================================
# EDGE CASE: Empty and malformed HTML
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_empty():
    """Test empty raw HTML."""
    html = ""

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(js_code="document.body.innerHTML = '<div>Added to empty</div>'")
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "Added to empty" in result.html


@pytest.mark.asyncio
async def test_raw_html_minimal():
    """Test minimal HTML (just text, no tags)."""
    html = "Just plain text, no HTML tags"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(js_code="document.body.innerHTML += '<div id=\"injected\">Injected</div>'")
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        # Browser should wrap it in proper HTML
        assert "Injected" in result.html


@pytest.mark.asyncio
async def test_raw_html_malformed():
    """Test malformed HTML with unclosed tags."""
    html = "<html><body><div><span>Unclosed tags<div>More content"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(js_code="document.body.innerHTML += '<div id=\"valid\">Valid Added</div>'")
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "Valid Added" in result.html
        # Browser should have fixed the malformed HTML


# ============================================================================
# EDGE CASE: Very large HTML
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_large():
    """Test large raw HTML (100KB+)."""
    # Generate 100KB of HTML
    items = "".join([f'<div class="item" id="item-{i}">Item {i} content here with some text</div>\n' for i in range(2000)])
    html = f"<html><body>{items}</body></html>"

    assert len(html) > 100000  # Verify it's actually large

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code="document.getElementById('item-999').innerText = 'MODIFIED ITEM 999'"
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "MODIFIED ITEM 999" in result.html
        assert "item-1999" in result.html  # Last item should still exist


# ============================================================================
# EDGE CASE: JavaScript errors and timeouts
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_js_error_doesnt_crash():
    """Test that JavaScript errors in js_code don't crash the crawl."""
    html = "<html><body><div id='test'>Original</div></body></html>"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code=[
                "nonExistentFunction();",  # This will throw an error
                "document.getElementById('test').innerText = 'Still works'"  # This should still run
            ]
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        # Crawl should succeed even with JS errors
        assert result.success


@pytest.mark.asyncio
async def test_raw_html_wait_for_timeout():
    """Test that wait_for times out gracefully when the element never appears."""
    html = "<html><body><div id='test'>Original</div></body></html>"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            wait_for="#never-exists",
            wait_for_timeout=1000  # 1 second timeout
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        # Should time out but still return the HTML we have.
        # The behavior might be success=False or success=True with partial content;
        # either way, it shouldn't hang or crash.
        assert result is not None


# ============================================================================
# COMPATIBILITY: Normal HTTP URLs still work
# ============================================================================

@pytest.mark.asyncio
async def test_http_urls_still_work():
    """Ensure we didn't break normal HTTP URL crawling."""
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")

        assert result.success
        assert "Example Domain" in result.html


@pytest.mark.asyncio
async def test_http_with_js_code_still_works():
    """Ensure HTTP URLs with js_code still work."""
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code="document.body.innerHTML += '<div id=\"injected\">Injected via JS</div>'"
        )
        result = await crawler.arun("https://example.com", config=config)

        assert result.success
        assert "Injected via JS" in result.html


# ============================================================================
# COMPATIBILITY: File URLs
# ============================================================================

@pytest.mark.asyncio
async def test_file_url_with_js_code():
    """Test file:// URLs with js_code execution."""
    # Create a temp file
    with tempfile.NamedTemporaryFile(mode='w', suffix='.html', delete=False) as f:
        f.write("<html><body><div id='file-content'>File Content</div></body></html>")
        temp_path = f.name

    try:
        async with AsyncWebCrawler() as crawler:
            config = CrawlerRunConfig(
                js_code="document.getElementById('file-content').innerText = 'Modified File Content'"
            )
            result = await crawler.arun(f"file://{temp_path}", config=config)

            assert result.success
            assert "Modified File Content" in result.html
    finally:
        os.unlink(temp_path)


@pytest.mark.asyncio
async def test_file_url_fast_path():
    """Test file:// fast path (no browser params)."""
    with tempfile.NamedTemporaryFile(mode='w', suffix='.html', delete=False) as f:
        f.write("<html><body>Fast path file content</body></html>")
        temp_path = f.name

    try:
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(f"file://{temp_path}")

            assert result.success
            assert "Fast path file content" in result.html
    finally:
        os.unlink(temp_path)


# ============================================================================
# COMPATIBILITY: Extraction strategies with raw HTML
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_with_css_extraction():
    """Test CSS extraction on raw HTML after js_code modifies it."""
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

    html = """
    <html><body>
        <div class="products">
            <div class="product"><span class="name">Original Product</span></div>
        </div>
    </body></html>
    """

    schema = {
        "name": "Products",
        "baseSelector": ".product",
        "fields": [
            {"name": "name", "selector": ".name", "type": "text"}
        ]
    }

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code="""
                document.querySelector('.products').innerHTML +=
                    '<div class="product"><span class="name">JS Added Product</span></div>';
            """,
            extraction_strategy=JsonCssExtractionStrategy(schema)
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        # Check that extraction found both products
        import json
        extracted = json.loads(result.extracted_content)
        names = [p.get('name', '') for p in extracted]
        assert any("JS Added Product" in name for name in names)


# ============================================================================
# EDGE CASE: Concurrent raw: requests
# ============================================================================

@pytest.mark.asyncio
async def test_concurrent_raw_requests():
    """Test that multiple concurrent raw: requests don't interfere."""
    htmls = [
        f"<html><body><div id='test'>Request {i}</div></body></html>"
        for i in range(5)
    ]

    async with AsyncWebCrawler() as crawler:
        configs = [
            CrawlerRunConfig(
                js_code=f"document.getElementById('test').innerText += ' Modified {i}'"
            )
            for i in range(5)
        ]

        # Run concurrently
        tasks = [
            crawler.arun(f"raw:{html}", config=config)
            for html, config in zip(htmls, configs)
        ]
        results = await asyncio.gather(*tasks)

        for i, result in enumerate(results):
            assert result.success
            assert f"Request {i}" in result.html
            assert f"Modified {i}" in result.html


# ============================================================================
# EDGE CASE: raw: with base_url for link resolution
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_with_base_url():
    """Test that base_url is used for link resolution in markdown."""
    html = """
    <html><body>
        <a href="/page1">Page 1</a>
        <a href="/page2">Page 2</a>
        <img src="/images/logo.png" alt="Logo">
    </body></html>
    """

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            base_url="https://example.com",
            process_in_browser=True  # Force browser to test base_url handling
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        # Check markdown has absolute URLs
        if result.markdown:
            # Links should be absolute
            md = result.markdown.raw_markdown if hasattr(result.markdown, 'raw_markdown') else str(result.markdown)
            assert "example.com" in md or "/page1" in md


# ============================================================================
# EDGE CASE: raw: with screenshot of complex page
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_screenshot_complex_page():
    """Test screenshot of complex raw HTML with CSS and JS modifications."""
    html = """
    <html>
    <head>
        <style>
            body { font-family: Arial; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 40px; }
            .card { background: white; padding: 20px; border-radius: 10px; box-shadow: 0 4px 6px rgba(0,0,0,0.1); }
            h1 { color: #333; }
        </style>
    </head>
    <body>
        <div class="card">
            <h1 id="title">Original Title</h1>
            <p>This is a test card with styling.</p>
        </div>
    </body>
    </html>
    """

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code="document.getElementById('title').innerText = 'Modified Title'",
            screenshot=True
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert result.screenshot is not None
        assert len(result.screenshot) > 1000  # Should be substantial
        assert "Modified Title" in result.html


# ============================================================================
# EDGE CASE: JavaScript that tries to navigate away
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_js_navigation_blocked():
    """Test that JS trying to navigate doesn't break the crawl."""
    html = """
    <html><body>
        <div id="content">Original Content</div>
        <script>
            // Try to navigate away (should be blocked or handled)
            // window.location.href = 'https://example.com';
        </script>
    </body></html>
    """

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            # Try to navigate via js_code
            js_code=[
                "document.getElementById('content').innerText = 'Before navigation attempt'",
                # Actual navigation attempt commented out - it would cause issues
                # "window.location.href = 'https://example.com'",
            ]
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "Before navigation attempt" in result.html


# ============================================================================
# EDGE CASE: Raw HTML with iframes
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_with_iframes():
    """Test raw HTML containing iframes."""
    html = """
    <html><body>
        <div id="main">Main content</div>
        <iframe id="frame1" srcdoc="<html><body><div id='iframe-content'>Iframe Content</div></body></html>"></iframe>
    </body></html>
    """

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code="document.getElementById('main').innerText = 'Modified main'",
            process_iframes=True
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "Modified main" in result.html


# ============================================================================
# TRICKY: Protocol inside raw content
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_with_urls_inside():
    """Test raw: with http:// URLs inside the content."""
    html = """
    <html><body>
        <a href="http://example.com">Example</a>
        <a href="https://google.com">Google</a>
        <img src="https://placekitten.com/200/300" alt="Cat">
        <div id="test">Test content with URL: https://test.com</div>
    </body></html>
    """

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code="document.getElementById('test').innerText += ' - Modified'"
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "Modified" in result.html
        assert "http://example.com" in result.html or "example.com" in result.html


# ============================================================================
# TRICKY: Double raw: prefix
# ============================================================================

@pytest.mark.asyncio
async def test_double_raw_prefix():
    """Test what happens with a double raw: prefix (edge case)."""
    html = "<html><body>Content</body></html>"

    async with AsyncWebCrawler() as crawler:
        # raw:raw:<html>... - the second raw: becomes part of the content
        result = await crawler.arun(f"raw:raw:{html}")

        # Should either handle gracefully or return "raw:<html>..." as content
        assert result is not None

if __name__ == "__main__":
    import sys

    async def run_tests():
        # Run a few key tests manually
        tests = [
            ("Hash in CSS", test_raw_html_with_hash_in_css),
            ("Unicode", test_raw_html_with_unicode),
            ("Large HTML", test_raw_html_large),
            ("HTTP still works", test_http_urls_still_work),
            ("Concurrent requests", test_concurrent_raw_requests),
            ("Complex screenshot", test_raw_html_screenshot_complex_page),
        ]

        for name, test_fn in tests:
            print(f"\n=== Running: {name} ===")
            try:
                await test_fn()
                print(f"✅ {name} PASSED")
            except Exception as e:
                print(f"❌ {name} FAILED: {e}")
                import traceback
                traceback.print_exc()

    asyncio.run(run_tests())
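Both suites double as runnable scripts via their __main__ blocks, but they are ordinary pytest modules. A minimal sketch of driving the new edge-case file through pytest's Python API, assuming pytest and pytest-asyncio are installed in the environment:

# Minimal sketch, not part of this PR: run the new suite via pytest's Python API.
# Assumes pytest and pytest-asyncio are installed.
import sys

import pytest

if __name__ == "__main__":
    # "-v" prints one line per test; pytest.main() returns a standard exit code for CI.
    sys.exit(pytest.main(["-v", "tests/test_raw_html_edge_cases.py"]))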