Release v0.8.0: Crash Recovery, Prefetch Mode & Security Fixes (#1712)

* Fix: Use correct URL variable for raw HTML extraction (#1116)

- Prevents the full HTML content from being passed as the URL to extraction strategies
- Added unit tests covering both raw HTML and regular URL processing

* Fix #1181: Preserve whitespace in code blocks during HTML scraping

  The remove_empty_elements_fast() method was removing whitespace-only
  span elements inside <pre> and <code> tags, causing import statements
  like "import torch" to become "importtorch". Now skips elements inside
  code blocks where whitespace is significant.

* Refactor Pydantic model configuration to use ConfigDict for arbitrary types
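
  For reference, the Pydantic v2 pattern this refactor moves to looks roughly like the
  following (an illustrative model, not one of the library's own classes):

    from pydantic import BaseModel, ConfigDict

    class RawHandle:          # an arbitrary, non-Pydantic type
        pass

    class CrawlArtifacts(BaseModel):
        # ConfigDict replaces the old nested `class Config: arbitrary_types_allowed = True`
        model_config = ConfigDict(arbitrary_types_allowed=True)
        handle: RawHandle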

* Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621

* Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638

* fix: ensure BrowserConfig.to_dict serializes proxy_config

* feat: make LLM backoff configurable end-to-end

- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides
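
  A rough sketch of the intended usage; the exact field names are not shown in this
  summary, so the backoff_* names below are illustrative placeholders for the
  delay/attempt/factor knobs added to LLMConfig:

    from crawl4ai import LLMConfig, LLMExtractionStrategy

    llm_config = LLMConfig(
        provider="openai/gpt-4o-mini",
        backoff_base_delay=2.0,    # placeholder name: initial wait after a rate-limit error
        backoff_max_attempts=5,    # placeholder name: retries before giving up
        backoff_factor=2.0,        # placeholder name: exponential multiplier per retry
    )
    strategy = LLMExtractionStrategy(llm_config=llm_config, instruction="Summarize the page")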

* reproduced AttributeError from #1642

* pass timeout parameter to docker client request

* added missing deep crawling objects to init

* generalized query in ContentRelevanceFilter to be a str or list

* import modules from enhanceable deserialization

* parameterized tests

* Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268

* refactor: replace PyPDF2 with pypdf across the codebase. ref #1412

* Add browser_context_id and target_id parameters to BrowserConfig

Enable Crawl4AI to connect to pre-created CDP browser contexts, which is
essential for cloud browser services that pre-create isolated contexts.

Changes:
- Add browser_context_id and target_id parameters to BrowserConfig
- Update from_kwargs() and to_dict() methods
- Modify BrowserManager.start() to use existing context when provided
- Add _get_page_by_target_id() helper method
- Update get_page() to handle pre-existing targets
- Add test for browser_context_id functionality

This enables cloud services to:
1. Create isolated CDP contexts before Crawl4AI connects
2. Pass context/target IDs to BrowserConfig
3. Have Crawl4AI reuse existing contexts instead of creating new ones
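
A minimal sketch of that flow, assuming the context and target IDs were created out of
band by the cloud service via CDP (endpoint URL is a placeholder):

    from crawl4ai import AsyncWebCrawler, BrowserConfig

    context_id = "..."   # browserContextId pre-created by the cloud service
    target_id = "..."    # targetId of the page created in that context

    browser_config = BrowserConfig(
        cdp_url="ws://browser-pool:9222/devtools/browser/<id>",   # placeholder endpoint
        browser_context_id=context_id,
        target_id=target_id,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:   # inside an async function
        result = await crawler.arun("https://example.com")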

* Add cdp_cleanup_on_close flag to prevent memory leaks in cloud/server scenarios

* Fix: add cdp_cleanup_on_close to from_kwargs

* Fix: find context by target_id for concurrent CDP connections

* Fix: use target_id to find correct page in get_page

* Fix: use CDP to find context by browserContextId for concurrent sessions

* Revert context matching attempts - Playwright cannot see CDP-created contexts

* Add create_isolated_context flag for concurrent CDP crawls

When True, forces creation of a new browser context instead of reusing
the default context. Essential for concurrent crawls on the same browser
to prevent navigation conflicts.

* Add context caching to create_isolated_context branch

Uses contexts_by_config cache (same as non-CDP mode) to reuse contexts
for multiple URLs with same config. Still creates new page per crawl
for navigation isolation. Benefits batch/deep crawls.

* Add init_scripts support to BrowserConfig for pre-page-load JS injection

This adds the ability to inject JavaScript that runs before any page loads,
useful for stealth evasions (canvas/audio fingerprinting, userAgentData).

- Add init_scripts parameter to BrowserConfig (list of JS strings)
- Apply init_scripts in setup_context() via context.add_init_script()
- Update from_kwargs() and to_dict() for serialization
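
Usage is roughly as follows (the JS snippet is just an illustrative stealth-style tweak):

    from crawl4ai import AsyncWebCrawler, BrowserConfig

    hide_webdriver = "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"

    browser_config = BrowserConfig(init_scripts=[hide_webdriver])
    async with AsyncWebCrawler(config=browser_config) as crawler:   # inside an async function
        result = await crawler.arun("https://example.com")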

* Fix CDP connection handling: support WS URLs and proper cleanup

Changes to browser_manager.py:

1. _verify_cdp_ready(): Support multiple URL formats
   - WebSocket URLs (ws://, wss://): Skip HTTP verification, Playwright handles directly
   - HTTP URLs with query params: Properly parse with urlparse to preserve query string
   - Fixes issue where naive f"{cdp_url}/json/version" broke WS URLs and query params

2. close(): Proper cleanup when cdp_cleanup_on_close=True
   - Close all sessions (pages)
   - Close all contexts
   - Call browser.close() to disconnect (doesn't terminate browser, just releases connection)
   - Wait 1 second for CDP connection to fully release
   - Stop Playwright instance to prevent memory leaks

This enables:
- Connecting to specific browsers via WS URL
- Reusing the same browser with multiple sequential connections
- No user wait needed between connections (internal 1s delay handles it)

Added tests/browser/test_cdp_cleanup_reuse.py with comprehensive tests.
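
The reuse pattern exercised by those tests looks roughly like this (the endpoint URL is a
placeholder for your browser pool's WebSocket address):

    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

    ws_url = "ws://localhost:9222/devtools/browser/<id>"   # placeholder CDP endpoint

    for url in ["https://example.com", "https://example.org"]:   # inside an async function
        # Each close releases the CDP connection but leaves the remote browser running.
        async with AsyncWebCrawler(
            config=BrowserConfig(browser_mode="cdp", cdp_url=ws_url, cdp_cleanup_on_close=True)
        ) as crawler:
            result = await crawler.arun(url, config=CrawlerRunConfig())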

* Update gitignore

* Some debugging for caching

* Add _generate_screenshot_from_html for raw: and file:// URLs

Implements the missing method that was being called but never defined.
Now raw: and file:// URLs can generate screenshots by:
1. Loading HTML into a browser page via page.set_content()
2. Taking screenshot using existing take_screenshot() method
3. Cleaning up the page afterward

This enables cached HTML to be rendered with screenshots in crawl4ai-cloud.

* Add PDF and MHTML support for raw: and file:// URLs

- Replace _generate_screenshot_from_html with _generate_media_from_html
- New method handles screenshot, PDF, and MHTML in one browser session
- Update raw: and file:// URL handlers to use new method
- Enables cached HTML to generate all media types
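
A sketch of the intended usage for cached HTML, assuming the usual screenshot/pdf/
capture_mhtml flags on CrawlerRunConfig:

    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

    cached_html = "<html><body><h1>Cached page</h1></body></html>"
    config = CrawlerRunConfig(screenshot=True, pdf=True, capture_mhtml=True)

    async with AsyncWebCrawler() as crawler:   # inside an async function
        result = await crawler.arun(f"raw:{cached_html}", config=config)
        # result.screenshot (base64), result.pdf (bytes), result.mhtml (str)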

* Add crash recovery for deep crawl strategies

Add optional resume_state and on_state_change parameters to all deep
crawl strategies (BFS, DFS, Best-First) for cloud deployment crash
recovery.

Features:
- resume_state: Pass saved state to resume from checkpoint
- on_state_change: Async callback fired after each URL for real-time
  state persistence to external storage (Redis, DB, etc.)
- export_state(): Get last captured state manually
- Zero overhead when features are disabled (None defaults)

State includes visited URLs, pending queue/stack, depths, and
pages_crawled count. All state is JSON-serializable.
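
A minimal sketch of the checkpoint/resume flow with BFS, assuming a simple file-based
store for the state dict (swap in Redis/DB writes for real deployments):

    import json
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
    from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

    def load_state():
        try:
            with open("crawl_state.json") as f:
                return json.load(f)
        except FileNotFoundError:
            return None

    async def save_state(state):
        # Called after each crawled URL with the latest JSON-serializable state.
        with open("crawl_state.json", "w") as f:
            json.dump(state, f)

    strategy = BFSDeepCrawlStrategy(
        max_depth=2,
        resume_state=load_state(),     # None on a fresh run, checkpoint after a crash
        on_state_change=save_state,
    )
    async with AsyncWebCrawler() as crawler:   # inside an async function
        await crawler.arun("https://example.com",
                           config=CrawlerRunConfig(deep_crawl_strategy=strategy))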

* Fix: HTTP strategy raw: URL parsing truncates at # character

The AsyncHTTPCrawlerStrategy.crawl() method used urlparse() to extract
content from raw: URLs. This caused HTML with CSS color codes like #eee
to be truncated because # is treated as a URL fragment delimiter.

Before: raw:body{background:#eee} -> parsed.path = 'body{background:'
After:  raw:body{background:#eee} -> raw_content = 'body{background:#eee}'

Fix: Strip the raw: or raw:// prefix directly instead of using urlparse,
matching how the browser strategy handles it.

* Add base_url parameter to CrawlerRunConfig for raw HTML processing

When processing raw: HTML (e.g., from cache), the URL parameter is meaningless
for markdown link resolution. This adds a base_url parameter that can be set
explicitly to provide proper URL resolution context.

Changes:
- Add base_url parameter to CrawlerRunConfig.__init__
- Add base_url to CrawlerRunConfig.from_kwargs
- Update aprocess_html to use base_url for markdown generation

Usage:
  config = CrawlerRunConfig(base_url='https://example.com')
  result = await crawler.arun(url='raw:{html}', config=config)

* Add prefetch mode for two-phase deep crawling

- Add `prefetch` parameter to CrawlerRunConfig
- Add `quick_extract_links()` function for fast link extraction
- Add short-circuit in aprocess_html() for prefetch mode
- Add 42 tests (unit, integration, regression)
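
A sketch of the two-phase flow, assuming prefetch=True short-circuits processing and only
returns the discovered links:

    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

    async with AsyncWebCrawler() as crawler:   # inside an async function
        # Phase 1: fast link discovery, no markdown/extraction work.
        seed = await crawler.arun("https://example.com",
                                  config=CrawlerRunConfig(prefetch=True))
        targets = [link["href"] for link in seed.links.get("internal", [])]

        # Phase 2: full crawl of the URLs selected from phase 1.
        for url in targets[:10]:
            result = await crawler.arun(url, config=CrawlerRunConfig())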

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Updates on proxy rotation and proxy configuration

* Add proxy support to HTTP crawler strategy

* Add browser pipeline support for raw:/file:// URLs

- Add process_in_browser parameter to CrawlerRunConfig
- Route raw:/file:// URLs through _crawl_web() when browser operations needed
- Use page.set_content() instead of goto() for local content
- Fix cookie handling for non-HTTP URLs in browser_manager
- Auto-detect browser requirements: js_code, wait_for, screenshot, etc.
- Maintain fast path for raw:/file:// without browser params

Fixes #310
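
A sketch of the browser-pipeline path for local content; process_in_browser forces the
browser route even when js_code/wait_for/screenshot would not trigger the auto-detection:

    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

    html = ("<html><body><div id='app'></div>"
            "<script>document.getElementById('app').textContent = 'rendered';</script>"
            "</body></html>")

    config = CrawlerRunConfig(process_in_browser=True, wait_for="css:#app")

    async with AsyncWebCrawler() as crawler:   # inside an async function
        result = await crawler.arun(f"raw:{html}", config=config)
        # 'rendered' now appears in result.html / result.markdown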

* Add smart TTL cache for sitemap URL seeder

- Add cache_ttl_hours and validate_sitemap_lastmod params to SeedingConfig
- New JSON cache format with metadata (version, created_at, lastmod, url_count)
- Cache validation by TTL expiry and sitemap lastmod comparison
- Auto-migration from old .jsonl to new .json format
- Fixes bug where incomplete cache was used indefinitely
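
Configuration is roughly as follows, assuming the usual AsyncUrlSeeder entry point:

    from crawl4ai import AsyncUrlSeeder, SeedingConfig

    config = SeedingConfig(
        source="sitemap",
        cache_ttl_hours=24,               # expire the cached sitemap after 24 hours
        validate_sitemap_lastmod=True,    # also refresh when the sitemap lastmod changes
    )
    async with AsyncUrlSeeder() as seeder:   # inside an async function
        urls = await seeder.urls("example.com", config)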

* Update URL seeder docs with smart TTL cache parameters

- Add cache_ttl_hours and validate_sitemap_lastmod to parameter table
- Document smart TTL cache validation with examples
- Add cache-related troubleshooting entries
- Update key features summary

* Add MEMORY.md to gitignore

* Docs: Add multi-sample schema generation section

Add documentation explaining how to pass multiple HTML samples
to generate_schema() for stable selectors that work across pages
with varying DOM structures.

Includes:
- Problem explanation (fragile nth-child selectors)
- Solution with code example
- Key points for multi-sample queries
- Comparison table of fragile vs stable selectors
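
A sketch of the multi-sample call described in that section, assuming generate_schema
accepts a list of HTML samples as the new docs show (the product_page_* names are
placeholders for HTML strings captured from different pages):

    import os
    from crawl4ai import LLMConfig
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

    samples = [product_page_1, product_page_2, product_page_3]

    schema = JsonCssExtractionStrategy.generate_schema(
        html=samples,                      # multiple samples -> selectors stable across pages
        query="Extract product name and price",
        llm_config=LLMConfig(provider="openai/gpt-4o-mini",
                             api_token=os.getenv("OPENAI_API_KEY")),
    )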

* Fix critical RCE and LFI vulnerabilities in Docker API deployment

Security fixes for vulnerabilities reported by ProjectDiscovery:

1. Remote Code Execution via Hooks (CVE pending)
   - Remove __import__ from allowed_builtins in hook_manager.py
   - Prevents arbitrary module imports (os, subprocess, etc.)
   - Hooks now disabled by default via CRAWL4AI_HOOKS_ENABLED env var

2. Local File Inclusion via file:// URLs (CVE pending)
   - Add URL scheme validation to /execute_js, /screenshot, /pdf, /html
   - Block file://, javascript:, data: and other dangerous schemes
   - Only allow http://, https://, and raw: (where appropriate)

3. Security hardening
   - Add CRAWL4AI_HOOKS_ENABLED=false as default (opt-in for hooks)
   - Add security warning comments in config.yml
   - Add validate_url_scheme() helper for consistent validation

Testing:
   - Add unit tests (test_security_fixes.py) - 16 tests
   - Add integration tests (run_security_tests.py) for live server

Affected endpoints:
   - POST /crawl (hooks disabled by default)
   - POST /crawl/stream (hooks disabled by default)
   - POST /execute_js (URL validation added)
   - POST /screenshot (URL validation added)
   - POST /pdf (URL validation added)
   - POST /html (URL validation added)

Breaking changes:
   - Hooks require CRAWL4AI_HOOKS_ENABLED=true to function
   - file:// URLs no longer work on API endpoints (use library directly)
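
The scheme check is conceptually along these lines (an illustrative sketch, not the
server's exact validate_url_scheme() implementation); hooks additionally require
CRAWL4AI_HOOKS_ENABLED=true in the environment:

    from urllib.parse import urlparse

    ALLOWED_SCHEMES = {"http", "https", "raw"}   # raw: only where appropriate

    def validate_url_scheme(url: str) -> None:
        # Rejects file://, javascript:, data: and anything else not explicitly allowed.
        scheme = urlparse(url).scheme.lower()
        if scheme not in ALLOWED_SCHEMES:
            raise ValueError(f"URL scheme '{scheme or 'none'}' is not allowed on this endpoint")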

* Enhance authentication flow by implementing JWT token retrieval and adding authorization headers to API requests

* Add release notes for v0.7.9, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates

* Add release notes for v0.8.0, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates

Documentation for v0.8.0 release:

- SECURITY.md: Security policy and vulnerability reporting guidelines
- RELEASE_NOTES_v0.8.0.md: Comprehensive release notes
- migration/v0.8.0-upgrade-guide.md: Step-by-step migration guide
- security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts
- CHANGELOG.md: Updated with v0.8.0 changes

Breaking changes documented:
- Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED)
- file:// URLs blocked on Docker API endpoints

Security fixes credited to Neo by ProjectDiscovery

* Add examples for deep crawl crash recovery and prefetch mode in documentation

* Release v0.8.0

- Updated version to 0.8.0
- Added comprehensive demo and release notes
- Updated all documentation

* Update security researcher acknowledgment with a hyperlink for Neo by ProjectDiscovery

* Add async agenerate_schema method for schema generation

- Extract prompt building to shared _build_schema_prompt() method
- Add agenerate_schema() async version using aperform_completion_with_backoff
- Refactor generate_schema() to use shared prompt builder
- Fixes Gemini/Vertex AI compatibility in async contexts (FastAPI)

* Fix: Enable litellm.drop_params for O-series/GPT-5 model compatibility

O-series (o1, o3) and GPT-5 models only support temperature=1.
Setting litellm.drop_params=True auto-drops unsupported parameters
instead of throwing UnsupportedParamsError.

Fixes temperature=0.01 error for these models in LLM extraction.
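
The underlying litellm switch, shown here only to illustrate the behavior (the library now
sets it internally during LLM extraction):

    import litellm

    # With drop_params on, unsupported arguments such as temperature=0.01 are
    # silently dropped for o1/o3/gpt-5 calls instead of raising UnsupportedParamsError.
    litellm.drop_params = True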

---------

Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: unclecode <unclecode@kidocode.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Commit f6f7f1b551 (parent c85f56b085), authored by Nasrin, committed via GitHub on 2026-01-17 14:19:15 +01:00.
58 changed files with 11,942 additions and 2,411 deletions.


@@ -0,0 +1,489 @@
"""Test for browser_context_id and target_id parameters.
These tests verify that Crawl4AI can connect to and use pre-created
browser contexts, which is essential for cloud browser services that
pre-create isolated contexts for each user.
The flow being tested:
1. Start a browser with CDP
2. Create a context via raw CDP commands (simulating cloud service)
3. Create a page/target in that context
4. Have Crawl4AI connect using browser_context_id and target_id
5. Verify Crawl4AI uses the existing context/page instead of creating new ones
"""
import asyncio
import json
import os
import sys
import websockets
# Add the project root to Python path if running directly
if __name__ == "__main__":
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
from crawl4ai.browser_manager import BrowserManager, ManagedBrowser
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.async_logger import AsyncLogger
# Create a logger for clear terminal output
logger = AsyncLogger(verbose=True, log_file=None)
class CDPContextCreator:
"""
Helper class to create browser contexts via raw CDP commands.
This simulates what a cloud browser service would do.
"""
def __init__(self, cdp_url: str):
self.cdp_url = cdp_url
self._message_id = 0
self._ws = None
self._pending_responses = {}
self._receiver_task = None
async def connect(self):
"""Establish WebSocket connection to browser."""
# Convert HTTP URL to WebSocket URL if needed
ws_url = self.cdp_url.replace("http://", "ws://").replace("https://", "wss://")
if not ws_url.endswith("/devtools/browser"):
# Get the browser websocket URL from /json/version
import aiohttp
async with aiohttp.ClientSession() as session:
async with session.get(f"{self.cdp_url}/json/version") as response:
data = await response.json()
ws_url = data.get("webSocketDebuggerUrl", ws_url)
self._ws = await websockets.connect(ws_url, max_size=None, ping_interval=None)
self._receiver_task = asyncio.create_task(self._receive_messages())
logger.info(f"Connected to CDP at {ws_url}", tag="CDP")
async def disconnect(self):
"""Close WebSocket connection."""
if self._receiver_task:
self._receiver_task.cancel()
try:
await self._receiver_task
except asyncio.CancelledError:
pass
if self._ws:
await self._ws.close()
self._ws = None
async def _receive_messages(self):
"""Background task to receive CDP messages."""
try:
async for message in self._ws:
data = json.loads(message)
msg_id = data.get('id')
if msg_id is not None and msg_id in self._pending_responses:
self._pending_responses[msg_id].set_result(data)
except asyncio.CancelledError:
pass
except Exception as e:
logger.error(f"CDP receiver error: {e}", tag="CDP")
async def _send_command(self, method: str, params: dict = None) -> dict:
"""Send CDP command and wait for response."""
self._message_id += 1
msg_id = self._message_id
message = {
"id": msg_id,
"method": method,
"params": params or {}
}
future = asyncio.get_event_loop().create_future()
self._pending_responses[msg_id] = future
try:
await self._ws.send(json.dumps(message))
response = await asyncio.wait_for(future, timeout=30.0)
if 'error' in response:
raise Exception(f"CDP error: {response['error']}")
return response.get('result', {})
finally:
self._pending_responses.pop(msg_id, None)
async def create_context(self) -> dict:
"""
Create an isolated browser context with a blank page.
Returns:
dict with browser_context_id, target_id, and cdp_session_id
"""
await self.connect()
# 1. Create isolated browser context
result = await self._send_command("Target.createBrowserContext", {
"disposeOnDetach": False # Keep context alive
})
browser_context_id = result["browserContextId"]
logger.info(f"Created browser context: {browser_context_id}", tag="CDP")
# 2. Create a new page (target) in the context
result = await self._send_command("Target.createTarget", {
"url": "about:blank",
"browserContextId": browser_context_id
})
target_id = result["targetId"]
logger.info(f"Created target: {target_id}", tag="CDP")
# 3. Attach to the target to get a session ID
result = await self._send_command("Target.attachToTarget", {
"targetId": target_id,
"flatten": True
})
cdp_session_id = result["sessionId"]
logger.info(f"Attached to target, sessionId: {cdp_session_id}", tag="CDP")
return {
"browser_context_id": browser_context_id,
"target_id": target_id,
"cdp_session_id": cdp_session_id
}
async def get_targets(self) -> list:
"""Get list of all targets in the browser."""
result = await self._send_command("Target.getTargets")
return result.get("targetInfos", [])
async def dispose_context(self, browser_context_id: str):
"""Dispose of a browser context."""
try:
await self._send_command("Target.disposeBrowserContext", {
"browserContextId": browser_context_id
})
logger.info(f"Disposed browser context: {browser_context_id}", tag="CDP")
except Exception as e:
logger.warning(f"Error disposing context: {e}", tag="CDP")
async def test_browser_context_id_basic():
"""
Test that BrowserConfig accepts browser_context_id and target_id parameters.
"""
logger.info("Testing BrowserConfig browser_context_id parameter", tag="TEST")
try:
# Test that BrowserConfig accepts the new parameters
config = BrowserConfig(
cdp_url="http://localhost:9222",
browser_context_id="test-context-id",
target_id="test-target-id",
headless=True
)
# Verify parameters are set correctly
assert config.browser_context_id == "test-context-id", "browser_context_id not set"
assert config.target_id == "test-target-id", "target_id not set"
# Test from_kwargs
config2 = BrowserConfig.from_kwargs({
"cdp_url": "http://localhost:9222",
"browser_context_id": "test-context-id-2",
"target_id": "test-target-id-2"
})
assert config2.browser_context_id == "test-context-id-2", "browser_context_id not set via from_kwargs"
assert config2.target_id == "test-target-id-2", "target_id not set via from_kwargs"
# Test to_dict
config_dict = config.to_dict()
assert config_dict.get("browser_context_id") == "test-context-id", "browser_context_id not in to_dict"
assert config_dict.get("target_id") == "test-target-id", "target_id not in to_dict"
logger.success("BrowserConfig browser_context_id test passed", tag="TEST")
return True
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
return False
async def test_pre_created_context_usage():
"""
Test that Crawl4AI uses a pre-created browser context instead of creating a new one.
This simulates the cloud browser service flow:
1. Start browser with CDP
2. Create context via raw CDP (simulating cloud service)
3. Have Crawl4AI connect with browser_context_id
4. Verify it uses existing context
"""
logger.info("Testing pre-created context usage", tag="TEST")
# Start a managed browser first
browser_config_initial = BrowserConfig(
use_managed_browser=True,
headless=True,
debugging_port=9226, # Use unique port
verbose=True
)
managed_browser = ManagedBrowser(browser_config=browser_config_initial, logger=logger)
cdp_creator = None
manager = None
context_info = None
try:
# Start the browser
cdp_url = await managed_browser.start()
logger.info(f"Browser started at {cdp_url}", tag="TEST")
# Create a context via raw CDP (simulating cloud service)
cdp_creator = CDPContextCreator(cdp_url)
context_info = await cdp_creator.create_context()
logger.info(f"Pre-created context: {context_info['browser_context_id']}", tag="TEST")
logger.info(f"Pre-created target: {context_info['target_id']}", tag="TEST")
# Get initial target count
targets_before = await cdp_creator.get_targets()
initial_target_count = len(targets_before)
logger.info(f"Initial target count: {initial_target_count}", tag="TEST")
# Now create BrowserManager with browser_context_id and target_id
browser_config = BrowserConfig(
cdp_url=cdp_url,
browser_context_id=context_info['browser_context_id'],
target_id=context_info['target_id'],
headless=True,
verbose=True
)
manager = BrowserManager(browser_config=browser_config, logger=logger)
await manager.start()
logger.info("BrowserManager started with pre-created context", tag="TEST")
# Get a page
crawler_config = CrawlerRunConfig()
page, context = await manager.get_page(crawler_config)
# Navigate to a test page
await page.goto("https://example.com", wait_until="domcontentloaded")
title = await page.title()
logger.info(f"Page title: {title}", tag="TEST")
# Get target count after
targets_after = await cdp_creator.get_targets()
final_target_count = len(targets_after)
logger.info(f"Final target count: {final_target_count}", tag="TEST")
# Verify: target count should not have increased significantly
# (allow for 1 extra target for internal use, but not many more)
target_diff = final_target_count - initial_target_count
logger.info(f"Target count difference: {target_diff}", tag="TEST")
# Success criteria:
# 1. Page navigation worked
# 2. Target count didn't explode (reused existing context)
success = title == "Example Domain" and target_diff <= 1
if success:
logger.success("Pre-created context usage test passed", tag="TEST")
else:
logger.error(f"Test failed - Title: {title}, Target diff: {target_diff}", tag="TEST")
return success
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
import traceback
traceback.print_exc()
return False
finally:
# Cleanup
if manager:
try:
await manager.close()
except:
pass
if cdp_creator and context_info:
try:
await cdp_creator.dispose_context(context_info['browser_context_id'])
await cdp_creator.disconnect()
except:
pass
if managed_browser:
try:
await managed_browser.cleanup()
except:
pass
async def test_context_isolation():
"""
Test that using browser_context_id actually provides isolation.
Create two contexts and verify they don't share state.
"""
logger.info("Testing context isolation with browser_context_id", tag="TEST")
browser_config_initial = BrowserConfig(
use_managed_browser=True,
headless=True,
debugging_port=9227,
verbose=True
)
managed_browser = ManagedBrowser(browser_config=browser_config_initial, logger=logger)
cdp_creator = None
manager1 = None
manager2 = None
context_info_1 = None
context_info_2 = None
try:
# Start the browser
cdp_url = await managed_browser.start()
logger.info(f"Browser started at {cdp_url}", tag="TEST")
# Create two separate contexts
cdp_creator = CDPContextCreator(cdp_url)
context_info_1 = await cdp_creator.create_context()
logger.info(f"Context 1: {context_info_1['browser_context_id']}", tag="TEST")
# Need to reconnect for second context (or use same connection)
await cdp_creator.disconnect()
cdp_creator2 = CDPContextCreator(cdp_url)
context_info_2 = await cdp_creator2.create_context()
logger.info(f"Context 2: {context_info_2['browser_context_id']}", tag="TEST")
# Verify contexts are different
assert context_info_1['browser_context_id'] != context_info_2['browser_context_id'], \
"Contexts should have different IDs"
# Connect with first context
browser_config_1 = BrowserConfig(
cdp_url=cdp_url,
browser_context_id=context_info_1['browser_context_id'],
target_id=context_info_1['target_id'],
headless=True
)
manager1 = BrowserManager(browser_config=browser_config_1, logger=logger)
await manager1.start()
# Set a cookie in context 1
page1, ctx1 = await manager1.get_page(CrawlerRunConfig())
await page1.goto("https://example.com", wait_until="domcontentloaded")
await ctx1.add_cookies([{
"name": "test_isolation",
"value": "context_1_value",
"domain": "example.com",
"path": "/"
}])
cookies1 = await ctx1.cookies(["https://example.com"])
cookie1_value = next((c["value"] for c in cookies1 if c["name"] == "test_isolation"), None)
logger.info(f"Cookie in context 1: {cookie1_value}", tag="TEST")
# Connect with second context
browser_config_2 = BrowserConfig(
cdp_url=cdp_url,
browser_context_id=context_info_2['browser_context_id'],
target_id=context_info_2['target_id'],
headless=True
)
manager2 = BrowserManager(browser_config=browser_config_2, logger=logger)
await manager2.start()
# Check cookies in context 2 - should not have the cookie from context 1
page2, ctx2 = await manager2.get_page(CrawlerRunConfig())
await page2.goto("https://example.com", wait_until="domcontentloaded")
cookies2 = await ctx2.cookies(["https://example.com"])
cookie2_value = next((c["value"] for c in cookies2 if c["name"] == "test_isolation"), None)
logger.info(f"Cookie in context 2: {cookie2_value}", tag="TEST")
# Verify isolation
isolation_works = cookie1_value == "context_1_value" and cookie2_value is None
if isolation_works:
logger.success("Context isolation test passed", tag="TEST")
else:
logger.error(f"Isolation failed - Cookie1: {cookie1_value}, Cookie2: {cookie2_value}", tag="TEST")
return isolation_works
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
import traceback
traceback.print_exc()
return False
finally:
# Cleanup
for mgr in [manager1, manager2]:
if mgr:
try:
await mgr.close()
except:
pass
for ctx_info, creator in [(context_info_1, cdp_creator), (context_info_2, cdp_creator2 if 'cdp_creator2' in dir() else None)]:
if ctx_info and creator:
try:
await creator.dispose_context(ctx_info['browser_context_id'])
await creator.disconnect()
except:
pass
if managed_browser:
try:
await managed_browser.cleanup()
except:
pass
async def run_tests():
"""Run all browser_context_id tests."""
results = []
logger.info("Running browser_context_id tests", tag="SUITE")
# Basic parameter test
results.append(("browser_context_id_basic", await test_browser_context_id_basic()))
# Pre-created context usage test
results.append(("pre_created_context_usage", await test_pre_created_context_usage()))
# Note: Context isolation test is commented out because isolation is enforced
# at the CDP level by the cloud browser service, not at the Playwright level.
# When multiple BrowserManagers connect to the same browser, Playwright sees
# all contexts. In production, each worker gets exactly one pre-created context.
# results.append(("context_isolation", await test_context_isolation()))
# Print summary
total = len(results)
passed = sum(1 for _, r in results if r)
logger.info("=" * 50, tag="SUMMARY")
logger.info(f"Test Results: {passed}/{total} passed", tag="SUMMARY")
logger.info("=" * 50, tag="SUMMARY")
for name, result in results:
status = "PASSED" if result else "FAILED"
logger.info(f" {name}: {status}", tag="SUMMARY")
if passed == total:
logger.success("All tests passed!", tag="SUMMARY")
return True
else:
logger.error(f"{total - passed} tests failed", tag="SUMMARY")
return False
if __name__ == "__main__":
success = asyncio.run(run_tests())
sys.exit(0 if success else 1)


@@ -0,0 +1,281 @@
#!/usr/bin/env python3
"""
Tests for CDP connection cleanup and browser reuse.

These tests verify that:
1. WebSocket URLs are properly handled (skip HTTP verification)
2. cdp_cleanup_on_close properly disconnects without terminating the browser
3. The same browser can be reused by multiple sequential connections

Requirements:
- A CDP-compatible browser pool service running (e.g., chromepoold)
- Service should be accessible at CDP_SERVICE_URL (default: http://localhost:11235)

Usage:
    pytest tests/browser/test_cdp_cleanup_reuse.py -v

Or run directly:
    python tests/browser/test_cdp_cleanup_reuse.py
"""
import asyncio
import os

import pytest
import requests

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

# Configuration
CDP_SERVICE_URL = os.getenv("CDP_SERVICE_URL", "http://localhost:11235")


def is_cdp_service_available():
    """Check if CDP service is running."""
    try:
        resp = requests.get(f"{CDP_SERVICE_URL}/health", timeout=2)
        return resp.status_code == 200
    except:
        return False


def create_browser():
    """Create a browser via CDP service API."""
    resp = requests.post(
        f"{CDP_SERVICE_URL}/v1/browsers",
        json={"headless": True},
        timeout=10
    )
    resp.raise_for_status()
    return resp.json()


def get_browser_info(browser_id):
    """Get browser info from CDP service."""
    resp = requests.get(f"{CDP_SERVICE_URL}/v1/browsers", timeout=5)
    for browser in resp.json():
        if browser["id"] == browser_id:
            return browser
    return None


def delete_browser(browser_id):
    """Delete a browser via CDP service API."""
    try:
        requests.delete(f"{CDP_SERVICE_URL}/v1/browsers/{browser_id}", timeout=5)
    except:
        pass


# Skip all tests if CDP service is not available
pytestmark = pytest.mark.skipif(
    not is_cdp_service_available(),
    reason=f"CDP service not available at {CDP_SERVICE_URL}"
)


class TestCDPWebSocketURL:
    """Tests for WebSocket URL handling."""

    @pytest.mark.asyncio
    async def test_websocket_url_skips_http_verification(self):
        """WebSocket URLs should skip HTTP /json/version verification."""
        browser = create_browser()
        try:
            ws_url = browser["ws_url"]
            assert ws_url.startswith("ws://") or ws_url.startswith("wss://")
            async with AsyncWebCrawler(
                config=BrowserConfig(
                    browser_mode="cdp",
                    cdp_url=ws_url,
                    headless=True,
                    cdp_cleanup_on_close=True,
                )
            ) as crawler:
                result = await crawler.arun(
                    url="https://example.com",
                    config=CrawlerRunConfig(verbose=False),
                )
                assert result.success
                assert "Example Domain" in result.metadata.get("title", "")
        finally:
            delete_browser(browser["browser_id"])


class TestCDPCleanupOnClose:
    """Tests for cdp_cleanup_on_close behavior."""

    @pytest.mark.asyncio
    async def test_browser_survives_after_cleanup_close(self):
        """Browser should remain alive after close with cdp_cleanup_on_close=True."""
        browser = create_browser()
        browser_id = browser["browser_id"]
        ws_url = browser["ws_url"]
        try:
            # Verify browser exists
            info_before = get_browser_info(browser_id)
            assert info_before is not None
            pid_before = info_before["pid"]
            # Connect, crawl, and close with cleanup
            async with AsyncWebCrawler(
                config=BrowserConfig(
                    browser_mode="cdp",
                    cdp_url=ws_url,
                    headless=True,
                    cdp_cleanup_on_close=True,
                )
            ) as crawler:
                result = await crawler.arun(
                    url="https://example.com",
                    config=CrawlerRunConfig(verbose=False),
                )
                assert result.success
            # Browser should still exist with same PID
            info_after = get_browser_info(browser_id)
            assert info_after is not None, "Browser was terminated but should only disconnect"
            assert info_after["pid"] == pid_before, "Browser PID changed unexpectedly"
        finally:
            delete_browser(browser_id)


class TestCDPBrowserReuse:
    """Tests for reusing the same browser with multiple connections."""

    @pytest.mark.asyncio
    async def test_sequential_connections_same_browser(self):
        """Multiple sequential connections to the same browser should work."""
        browser = create_browser()
        browser_id = browser["browser_id"]
        ws_url = browser["ws_url"]
        try:
            urls = [
                "https://example.com",
                "https://httpbin.org/ip",
                "https://httpbin.org/headers",
            ]
            for i, url in enumerate(urls, 1):
                # Each connection uses cdp_cleanup_on_close=True
                async with AsyncWebCrawler(
                    config=BrowserConfig(
                        browser_mode="cdp",
                        cdp_url=ws_url,
                        headless=True,
                        cdp_cleanup_on_close=True,
                    )
                ) as crawler:
                    result = await crawler.arun(
                        url=url,
                        config=CrawlerRunConfig(verbose=False),
                    )
                    assert result.success, f"Connection {i} failed for {url}"
                # Verify browser is still healthy
                info = get_browser_info(browser_id)
                assert info is not None, f"Browser died after connection {i}"
        finally:
            delete_browser(browser_id)

    @pytest.mark.asyncio
    async def test_no_user_wait_needed_between_connections(self):
        """With cdp_cleanup_on_close=True, no user wait should be needed."""
        browser = create_browser()
        browser_id = browser["browser_id"]
        ws_url = browser["ws_url"]
        try:
            # Rapid-fire connections with NO sleep between them
            for i in range(3):
                async with AsyncWebCrawler(
                    config=BrowserConfig(
                        browser_mode="cdp",
                        cdp_url=ws_url,
                        headless=True,
                        cdp_cleanup_on_close=True,
                    )
                ) as crawler:
                    result = await crawler.arun(
                        url="https://example.com",
                        config=CrawlerRunConfig(verbose=False),
                    )
                    assert result.success, f"Rapid connection {i+1} failed"
                # NO asyncio.sleep() here - internal delay should be sufficient
        finally:
            delete_browser(browser_id)


class TestCDPBackwardCompatibility:
    """Tests for backward compatibility with existing CDP usage."""

    @pytest.mark.asyncio
    async def test_http_url_with_browser_id_works(self):
        """HTTP URL with browser_id query param should work (backward compatibility)."""
        browser = create_browser()
        browser_id = browser["browser_id"]
        try:
            # Use HTTP URL with browser_id query parameter
            http_url = f"{CDP_SERVICE_URL}?browser_id={browser_id}"
            async with AsyncWebCrawler(
                config=BrowserConfig(
                    browser_mode="cdp",
                    cdp_url=http_url,
                    headless=True,
                    cdp_cleanup_on_close=True,
                )
            ) as crawler:
                result = await crawler.arun(
                    url="https://example.com",
                    config=CrawlerRunConfig(verbose=False),
                )
                assert result.success
        finally:
            delete_browser(browser_id)


# Allow running directly
if __name__ == "__main__":
    if not is_cdp_service_available():
        print(f"CDP service not available at {CDP_SERVICE_URL}")
        print("Please start a CDP-compatible browser pool service first.")
        exit(1)

    async def run_tests():
        print("=" * 60)
        print("CDP Cleanup and Browser Reuse Tests")
        print("=" * 60)
        tests = [
            ("WebSocket URL handling", TestCDPWebSocketURL().test_websocket_url_skips_http_verification),
            ("Browser survives after cleanup", TestCDPCleanupOnClose().test_browser_survives_after_cleanup_close),
            ("Sequential connections", TestCDPBrowserReuse().test_sequential_connections_same_browser),
            ("No user wait needed", TestCDPBrowserReuse().test_no_user_wait_needed_between_connections),
            ("HTTP URL with browser_id", TestCDPBackwardCompatibility().test_http_url_with_browser_id_works),
        ]
        results = []
        for name, test_func in tests:
            print(f"\n--- {name} ---")
            try:
                await test_func()
                print("PASS")
                results.append((name, True))
            except Exception as e:
                print(f"FAIL: {e}")
                results.append((name, False))
        print("\n" + "=" * 60)
        print("SUMMARY")
        print("=" * 60)
        for name, passed in results:
            print(f"  {name}: {'PASS' if passed else 'FAIL'}")
        all_passed = all(r[1] for r in results)
        print(f"\nOverall: {'ALL TESTS PASSED' if all_passed else 'SOME TESTS FAILED'}")
        return 0 if all_passed else 1

    exit(asyncio.run(run_tests()))


@@ -0,0 +1 @@
# Cache validation test suite


@@ -0,0 +1,40 @@
"""Pytest fixtures for cache validation tests."""
import pytest
def pytest_configure(config):
"""Register custom markers."""
config.addinivalue_line(
"markers", "integration: marks tests as integration tests (may require network)"
)
@pytest.fixture
def sample_head_html():
"""Sample HTML head section for testing."""
return '''
<head>
<meta charset="utf-8">
<title>Test Page Title</title>
<meta name="description" content="This is a test page description">
<meta property="og:title" content="OG Test Title">
<meta property="og:description" content="OG Description">
<meta property="og:image" content="https://example.com/image.jpg">
<meta property="article:modified_time" content="2024-12-01T00:00:00Z">
<link rel="stylesheet" href="style.css">
<script src="app.js"></script>
</head>
'''
@pytest.fixture
def minimal_head_html():
"""Minimal head with just a title."""
return '<head><title>Minimal</title></head>'
@pytest.fixture
def empty_head_html():
"""Empty head section."""
return '<head></head>'


@@ -0,0 +1,449 @@
"""
End-to-end tests for Smart Cache validation.
Tests the full flow:
1. Fresh crawl (browser launch) - SLOW
2. Cached crawl without validation (check_cache_freshness=False) - FAST
3. Cached crawl with validation (check_cache_freshness=True) - FAST (304/fingerprint)
Verifies all layers:
- Database storage of etag, last_modified, head_fingerprint, cached_at
- Cache validation logic
- HTTP conditional requests (304 Not Modified)
- Performance improvements
"""
import pytest
import time
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.async_database import async_db_manager
class TestEndToEndCacheValidation:
"""End-to-end tests for the complete cache validation flow."""
@pytest.mark.asyncio
async def test_full_cache_flow_docs_python(self):
"""
Test complete cache flow with docs.python.org:
1. Fresh crawl (slow - browser) - using BYPASS to force fresh
2. Cache hit without validation (fast)
3. Cache hit with validation (fast - 304)
"""
url = "https://docs.python.org/3/"
browser_config = BrowserConfig(headless=True, verbose=False)
# ========== CRAWL 1: Fresh crawl (force with WRITE_ONLY to skip cache read) ==========
config1 = CrawlerRunConfig(
cache_mode=CacheMode.WRITE_ONLY, # Skip reading, write new data
check_cache_freshness=False,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
start1 = time.perf_counter()
result1 = await crawler.arun(url, config=config1)
time1 = time.perf_counter() - start1
assert result1.success, f"First crawl failed: {result1.error_message}"
# WRITE_ONLY means we did a fresh crawl and wrote to cache
assert result1.cache_status == "miss", f"Expected 'miss', got '{result1.cache_status}'"
print(f"\n[CRAWL 1] Fresh crawl: {time1:.2f}s (cache_status: {result1.cache_status})")
# Verify data is stored in database
metadata = await async_db_manager.aget_cache_metadata(url)
assert metadata is not None, "Metadata should be stored in database"
assert metadata.get("etag") or metadata.get("last_modified"), "Should have ETag or Last-Modified"
print(f" - Stored ETag: {metadata.get('etag', 'N/A')[:30]}...")
print(f" - Stored Last-Modified: {metadata.get('last_modified', 'N/A')}")
print(f" - Stored head_fingerprint: {metadata.get('head_fingerprint', 'N/A')}")
print(f" - Stored cached_at: {metadata.get('cached_at', 'N/A')}")
# ========== CRAWL 2: Cache hit WITHOUT validation ==========
config2 = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED,
check_cache_freshness=False, # Skip validation - pure cache hit
)
async with AsyncWebCrawler(config=browser_config) as crawler:
start2 = time.perf_counter()
result2 = await crawler.arun(url, config=config2)
time2 = time.perf_counter() - start2
assert result2.success, f"Second crawl failed: {result2.error_message}"
assert result2.cache_status == "hit", f"Expected 'hit', got '{result2.cache_status}'"
print(f"\n[CRAWL 2] Cache hit (no validation): {time2:.2f}s (cache_status: {result2.cache_status})")
print(f" - Speedup: {time1/time2:.1f}x faster than fresh crawl")
# Should be MUCH faster - no browser, no HTTP request
assert time2 < time1 / 2, f"Cache hit should be at least 2x faster (was {time1/time2:.1f}x)"
# ========== CRAWL 3: Cache hit WITH validation (304) ==========
config3 = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED,
check_cache_freshness=True, # Validate cache freshness
)
async with AsyncWebCrawler(config=browser_config) as crawler:
start3 = time.perf_counter()
result3 = await crawler.arun(url, config=config3)
time3 = time.perf_counter() - start3
assert result3.success, f"Third crawl failed: {result3.error_message}"
# Should be "hit_validated" (304) or "hit_fallback" (error during validation)
assert result3.cache_status in ["hit_validated", "hit_fallback"], \
f"Expected validated cache hit, got '{result3.cache_status}'"
print(f"\n[CRAWL 3] Cache hit (with validation): {time3:.2f}s (cache_status: {result3.cache_status})")
print(f" - Speedup: {time1/time3:.1f}x faster than fresh crawl")
# Should still be fast - just a HEAD request, no browser
assert time3 < time1 / 2, f"Validated cache hit should be faster than fresh crawl"
# ========== SUMMARY ==========
print(f"\n{'='*60}")
print(f"PERFORMANCE SUMMARY for {url}")
print(f"{'='*60}")
print(f" Fresh crawl (browser): {time1:.2f}s")
print(f" Cache hit (no validation): {time2:.2f}s ({time1/time2:.1f}x faster)")
print(f" Cache hit (with validation): {time3:.2f}s ({time1/time3:.1f}x faster)")
print(f"{'='*60}")
@pytest.mark.asyncio
async def test_full_cache_flow_crawl4ai_docs(self):
"""Test with docs.crawl4ai.com."""
url = "https://docs.crawl4ai.com/"
browser_config = BrowserConfig(headless=True, verbose=False)
# Fresh crawl - use WRITE_ONLY to ensure we get fresh data
config1 = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY, check_cache_freshness=False)
async with AsyncWebCrawler(config=browser_config) as crawler:
start1 = time.perf_counter()
result1 = await crawler.arun(url, config=config1)
time1 = time.perf_counter() - start1
assert result1.success
assert result1.cache_status == "miss"
print(f"\n[docs.crawl4ai.com] Fresh: {time1:.2f}s")
# Cache hit with validation
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
start2 = time.perf_counter()
result2 = await crawler.arun(url, config=config2)
time2 = time.perf_counter() - start2
assert result2.success
assert result2.cache_status in ["hit_validated", "hit_fallback"]
print(f"[docs.crawl4ai.com] Validated: {time2:.2f}s ({time1/time2:.1f}x faster)")
@pytest.mark.asyncio
async def test_verify_database_storage(self):
"""Verify all validation metadata is properly stored in database."""
url = "https://docs.python.org/3/library/asyncio.html"
browser_config = BrowserConfig(headless=True, verbose=False)
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url, config=config)
assert result.success
# Verify all fields in database
metadata = await async_db_manager.aget_cache_metadata(url)
assert metadata is not None, "Metadata must be stored"
assert "url" in metadata
assert "etag" in metadata
assert "last_modified" in metadata
assert "head_fingerprint" in metadata
assert "cached_at" in metadata
assert "response_headers" in metadata
print(f"\nDatabase storage verification for {url}:")
print(f" - etag: {metadata['etag'][:40] if metadata['etag'] else 'None'}...")
print(f" - last_modified: {metadata['last_modified']}")
print(f" - head_fingerprint: {metadata['head_fingerprint']}")
print(f" - cached_at: {metadata['cached_at']}")
print(f" - response_headers keys: {list(metadata['response_headers'].keys())[:5]}...")
# At least one validation field should be populated
has_validation_data = (
metadata["etag"] or
metadata["last_modified"] or
metadata["head_fingerprint"]
)
assert has_validation_data, "Should have at least one validation field"
@pytest.mark.asyncio
async def test_head_fingerprint_stored_and_used(self):
"""Verify head fingerprint is computed, stored, and used for validation."""
url = "https://example.com/"
browser_config = BrowserConfig(headless=True, verbose=False)
# Fresh crawl
config1 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
async with AsyncWebCrawler(config=browser_config) as crawler:
result1 = await crawler.arun(url, config=config1)
assert result1.success
assert result1.head_fingerprint, "head_fingerprint should be set on CrawlResult"
# Verify in database
metadata = await async_db_manager.aget_cache_metadata(url)
assert metadata["head_fingerprint"], "head_fingerprint should be stored in database"
assert metadata["head_fingerprint"] == result1.head_fingerprint
print(f"\nHead fingerprint for {url}:")
print(f" - CrawlResult.head_fingerprint: {result1.head_fingerprint}")
print(f" - Database head_fingerprint: {metadata['head_fingerprint']}")
# Validate using fingerprint
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
result2 = await crawler.arun(url, config=config2)
assert result2.success
assert result2.cache_status in ["hit_validated", "hit_fallback"]
print(f" - Validation result: {result2.cache_status}")
class TestCacheValidationPerformance:
"""Performance benchmarks for cache validation."""
@pytest.mark.asyncio
async def test_multiple_urls_performance(self):
"""Test cache performance across multiple URLs."""
urls = [
"https://docs.python.org/3/",
"https://docs.python.org/3/library/asyncio.html",
"https://en.wikipedia.org/wiki/Python_(programming_language)",
]
browser_config = BrowserConfig(headless=True, verbose=False)
fresh_times = []
cached_times = []
print(f"\n{'='*70}")
print("MULTI-URL PERFORMANCE TEST")
print(f"{'='*70}")
# Fresh crawls - use WRITE_ONLY to force fresh crawl
for url in urls:
config = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY, check_cache_freshness=False)
async with AsyncWebCrawler(config=browser_config) as crawler:
start = time.perf_counter()
result = await crawler.arun(url, config=config)
elapsed = time.perf_counter() - start
fresh_times.append(elapsed)
print(f"Fresh: {url[:50]:50} {elapsed:.2f}s ({result.cache_status})")
# Cached crawls with validation
for url in urls:
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
start = time.perf_counter()
result = await crawler.arun(url, config=config)
elapsed = time.perf_counter() - start
cached_times.append(elapsed)
print(f"Cached: {url[:50]:50} {elapsed:.2f}s ({result.cache_status})")
avg_fresh = sum(fresh_times) / len(fresh_times)
avg_cached = sum(cached_times) / len(cached_times)
total_fresh = sum(fresh_times)
total_cached = sum(cached_times)
print(f"\n{'='*70}")
print(f"RESULTS:")
print(f" Total fresh crawl time: {total_fresh:.2f}s")
print(f" Total cached time: {total_cached:.2f}s")
print(f" Average speedup: {avg_fresh/avg_cached:.1f}x")
print(f" Time saved: {total_fresh - total_cached:.2f}s")
print(f"{'='*70}")
# Cached should be significantly faster
assert avg_cached < avg_fresh / 2, "Cached crawls should be at least 2x faster"
@pytest.mark.asyncio
async def test_repeated_access_same_url(self):
"""Test repeated access to the same URL shows consistent cache hits."""
url = "https://docs.python.org/3/"
num_accesses = 5
browser_config = BrowserConfig(headless=True, verbose=False)
print(f"\n{'='*60}")
print(f"REPEATED ACCESS TEST: {url}")
print(f"{'='*60}")
# First access - fresh crawl
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
async with AsyncWebCrawler(config=browser_config) as crawler:
start = time.perf_counter()
result = await crawler.arun(url, config=config)
fresh_time = time.perf_counter() - start
print(f"Access 1 (fresh): {fresh_time:.2f}s - {result.cache_status}")
# Repeated accesses - should all be cache hits
cached_times = []
for i in range(2, num_accesses + 1):
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
start = time.perf_counter()
result = await crawler.arun(url, config=config)
elapsed = time.perf_counter() - start
cached_times.append(elapsed)
print(f"Access {i} (cached): {elapsed:.2f}s - {result.cache_status}")
assert result.cache_status in ["hit", "hit_validated", "hit_fallback"]
avg_cached = sum(cached_times) / len(cached_times)
print(f"\nAverage cached time: {avg_cached:.2f}s")
print(f"Speedup over fresh: {fresh_time/avg_cached:.1f}x")
class TestCacheValidationModes:
"""Test different cache modes and their behavior."""
@pytest.mark.asyncio
async def test_cache_bypass_always_fresh(self):
"""CacheMode.BYPASS should always do fresh crawl."""
# Use a unique URL path to avoid cache from other tests
url = "https://example.com/test-bypass"
browser_config = BrowserConfig(headless=True, verbose=False)
# First crawl with WRITE_ONLY to populate cache (always fresh)
config1 = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY, check_cache_freshness=False)
async with AsyncWebCrawler(config=browser_config) as crawler:
result1 = await crawler.arun(url, config=config1)
assert result1.cache_status == "miss"
# Second crawl with BYPASS - should NOT use cache
config2 = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, check_cache_freshness=False)
async with AsyncWebCrawler(config=browser_config) as crawler:
result2 = await crawler.arun(url, config=config2)
# BYPASS mode means no cache interaction
assert result2.cache_status is None or result2.cache_status == "miss"
print(f"\nCacheMode.BYPASS result: {result2.cache_status}")
@pytest.mark.asyncio
async def test_validation_disabled_uses_cache_directly(self):
"""With check_cache_freshness=False, should use cache without HTTP validation."""
url = "https://docs.python.org/3/tutorial/"
browser_config = BrowserConfig(headless=True, verbose=False)
# Fresh crawl - use WRITE_ONLY to force fresh
config1 = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY, check_cache_freshness=False)
async with AsyncWebCrawler(config=browser_config) as crawler:
result1 = await crawler.arun(url, config=config1)
assert result1.cache_status == "miss"
# Cached with validation DISABLED - should be "hit" (not "hit_validated")
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
async with AsyncWebCrawler(config=browser_config) as crawler:
start = time.perf_counter()
result2 = await crawler.arun(url, config=config2)
elapsed = time.perf_counter() - start
assert result2.cache_status == "hit", f"Expected 'hit', got '{result2.cache_status}'"
print(f"\nValidation disabled: {elapsed:.3f}s (cache_status: {result2.cache_status})")
# Should be very fast - no HTTP request at all
assert elapsed < 1.0, "Cache hit without validation should be < 1 second"
@pytest.mark.asyncio
async def test_validation_enabled_checks_freshness(self):
"""With check_cache_freshness=True, should validate before using cache."""
url = "https://docs.python.org/3/reference/"
browser_config = BrowserConfig(headless=True, verbose=False)
# Fresh crawl
config1 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
async with AsyncWebCrawler(config=browser_config) as crawler:
result1 = await crawler.arun(url, config=config1)
# Cached with validation ENABLED - should be "hit_validated"
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
start = time.perf_counter()
result2 = await crawler.arun(url, config=config2)
elapsed = time.perf_counter() - start
assert result2.cache_status in ["hit_validated", "hit_fallback"]
print(f"\nValidation enabled: {elapsed:.3f}s (cache_status: {result2.cache_status})")
class TestCacheValidationResponseHeaders:
"""Test that response headers are properly stored and retrieved."""
@pytest.mark.asyncio
async def test_response_headers_stored(self):
"""Verify response headers including ETag and Last-Modified are stored."""
url = "https://docs.python.org/3/"
browser_config = BrowserConfig(headless=True, verbose=False)
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url, config=config)
assert result.success
assert result.response_headers is not None
# Check that cache-relevant headers are captured
headers = result.response_headers
print(f"\nResponse headers for {url}:")
# Look for ETag (case-insensitive)
etag = headers.get("etag") or headers.get("ETag")
print(f" - ETag: {etag}")
# Look for Last-Modified
last_modified = headers.get("last-modified") or headers.get("Last-Modified")
print(f" - Last-Modified: {last_modified}")
# Look for Cache-Control
cache_control = headers.get("cache-control") or headers.get("Cache-Control")
print(f" - Cache-Control: {cache_control}")
# At least one should be present for docs.python.org
assert etag or last_modified, "Should have ETag or Last-Modified header"
@pytest.mark.asyncio
async def test_headers_used_for_validation(self):
"""Verify stored headers are used for conditional requests."""
url = "https://docs.crawl4ai.com/"
browser_config = BrowserConfig(headless=True, verbose=False)
# Fresh crawl to store headers
config1 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
async with AsyncWebCrawler(config=browser_config) as crawler:
result1 = await crawler.arun(url, config=config1)
# Get stored metadata
metadata = await async_db_manager.aget_cache_metadata(url)
stored_etag = metadata.get("etag")
stored_last_modified = metadata.get("last_modified")
print(f"\nStored validation data for {url}:")
print(f" - etag: {stored_etag}")
print(f" - last_modified: {stored_last_modified}")
# Validate - should use stored headers
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
result2 = await crawler.arun(url, config=config2)
# Should get validated hit (304 response)
assert result2.cache_status in ["hit_validated", "hit_fallback"]
print(f" - Validation result: {result2.cache_status}")


@@ -0,0 +1,97 @@
"""Unit tests for head fingerprinting."""
import pytest
from crawl4ai.utils import compute_head_fingerprint
class TestHeadFingerprint:
"""Tests for the compute_head_fingerprint function."""
def test_same_content_same_fingerprint(self):
"""Identical <head> content produces same fingerprint."""
head = "<head><title>Test Page</title></head>"
fp1 = compute_head_fingerprint(head)
fp2 = compute_head_fingerprint(head)
assert fp1 == fp2
assert fp1 != ""
def test_different_title_different_fingerprint(self):
"""Different title produces different fingerprint."""
head1 = "<head><title>Title A</title></head>"
head2 = "<head><title>Title B</title></head>"
assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
def test_empty_head_returns_empty_string(self):
"""Empty or None head should return empty fingerprint."""
assert compute_head_fingerprint("") == ""
assert compute_head_fingerprint(None) == ""
def test_head_without_signals_returns_empty(self):
"""Head without title or key meta tags returns empty."""
head = "<head><link rel='stylesheet' href='style.css'></head>"
assert compute_head_fingerprint(head) == ""
def test_extracts_title(self):
"""Title is extracted and included in fingerprint."""
head1 = "<head><title>My Title</title></head>"
head2 = "<head><title>My Title</title><link href='x'></head>"
# Same title should produce same fingerprint
assert compute_head_fingerprint(head1) == compute_head_fingerprint(head2)
def test_extracts_meta_description(self):
"""Meta description is extracted."""
head1 = '<head><meta name="description" content="Test description"></head>'
head2 = '<head><meta name="description" content="Different description"></head>'
assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
def test_extracts_og_tags(self):
"""Open Graph tags are extracted."""
head1 = '<head><meta property="og:title" content="OG Title"></head>'
head2 = '<head><meta property="og:title" content="Different OG Title"></head>'
assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
def test_extracts_og_image(self):
"""og:image is extracted and affects fingerprint."""
head1 = '<head><meta property="og:image" content="https://example.com/img1.jpg"></head>'
head2 = '<head><meta property="og:image" content="https://example.com/img2.jpg"></head>'
assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
def test_extracts_article_modified_time(self):
"""article:modified_time is extracted."""
head1 = '<head><meta property="article:modified_time" content="2024-01-01T00:00:00Z"></head>'
head2 = '<head><meta property="article:modified_time" content="2024-12-01T00:00:00Z"></head>'
assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
def test_case_insensitive(self):
"""Fingerprinting is case-insensitive for tags."""
head1 = "<head><TITLE>Test</TITLE></head>"
head2 = "<head><title>test</title></head>"
# Both should extract title (case insensitive)
fp1 = compute_head_fingerprint(head1)
fp2 = compute_head_fingerprint(head2)
assert fp1 != ""
assert fp2 != ""
def test_handles_attribute_order(self):
"""Handles different attribute orders in meta tags."""
head1 = '<head><meta name="description" content="Test"></head>'
head2 = '<head><meta content="Test" name="description"></head>'
assert compute_head_fingerprint(head1) == compute_head_fingerprint(head2)
def test_real_world_head(self):
"""Test with a realistic head section."""
head = '''
<head>
<meta charset="utf-8">
<title>Python Documentation</title>
<meta name="description" content="Official Python documentation">
<meta property="og:title" content="Python Docs">
<meta property="og:description" content="Learn Python">
<meta property="og:image" content="https://python.org/logo.png">
<link rel="stylesheet" href="styles.css">
</head>
'''
fp = compute_head_fingerprint(head)
assert fp != ""
# Should be deterministic
assert fp == compute_head_fingerprint(head)
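
For context, here is a minimal sketch (not part of the test file above) of how the fingerprint can back cache invalidation when a server offers neither ETag nor Last-Modified. head_changed is a hypothetical helper; the stored head HTML is assumed to come from a previous crawl.

from crawl4ai.utils import compute_head_fingerprint

def head_changed(stored_head: str, fresh_head: str) -> bool:
    """Return True when the freshly fetched <head> fingerprint differs from the stored one."""
    old_fp = compute_head_fingerprint(stored_head)
    new_fp = compute_head_fingerprint(fresh_head)
    # Empty fingerprints mean the head carried no usable signals (no title or key meta tags),
    # so treat that as "unknown" rather than "changed".
    if not old_fp or not new_fp:
        return False
    return old_fp != new_fp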

View File

@@ -0,0 +1,354 @@
"""
Real-world tests for cache validation using actual HTTP requests.
No mocks - all tests hit real servers.
"""
import pytest
from crawl4ai.cache_validator import CacheValidator, CacheValidationResult
from crawl4ai.utils import compute_head_fingerprint
class TestRealDomainsConditionalSupport:
"""Test domains that support HTTP conditional requests (ETag/Last-Modified)."""
@pytest.mark.asyncio
async def test_docs_python_org_etag(self):
"""docs.python.org supports ETag - should return 304."""
url = "https://docs.python.org/3/"
async with CacheValidator(timeout=15.0) as validator:
# First fetch to get ETag
head_html, etag, last_modified = await validator._fetch_head(url)
assert head_html is not None, "Should fetch head content"
assert etag is not None, "docs.python.org should return ETag"
# Validate with the ETag we just got
result = await validator.validate(url=url, stored_etag=etag)
assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
assert "304" in result.reason
@pytest.mark.asyncio
async def test_docs_crawl4ai_etag(self):
"""docs.crawl4ai.com supports ETag - should return 304."""
url = "https://docs.crawl4ai.com/"
async with CacheValidator(timeout=15.0) as validator:
head_html, etag, last_modified = await validator._fetch_head(url)
assert etag is not None, "docs.crawl4ai.com should return ETag"
result = await validator.validate(url=url, stored_etag=etag)
assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
@pytest.mark.asyncio
async def test_wikipedia_last_modified(self):
"""Wikipedia supports Last-Modified - should return 304."""
url = "https://en.wikipedia.org/wiki/Web_crawler"
async with CacheValidator(timeout=15.0) as validator:
head_html, etag, last_modified = await validator._fetch_head(url)
assert last_modified is not None, "Wikipedia should return Last-Modified"
result = await validator.validate(url=url, stored_last_modified=last_modified)
assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
@pytest.mark.asyncio
async def test_github_pages(self):
"""GitHub Pages supports conditional requests."""
url = "https://pages.github.com/"
async with CacheValidator(timeout=15.0) as validator:
head_html, etag, last_modified = await validator._fetch_head(url)
# GitHub Pages typically has at least one
has_conditional = etag is not None or last_modified is not None
assert has_conditional, "GitHub Pages should support conditional requests"
result = await validator.validate(
url=url,
stored_etag=etag,
stored_last_modified=last_modified,
)
assert result.status == CacheValidationResult.FRESH
@pytest.mark.asyncio
async def test_httpbin_etag(self):
"""httpbin.org/etag endpoint for testing ETag."""
url = "https://httpbin.org/etag/test-etag-value"
async with CacheValidator(timeout=15.0) as validator:
result = await validator.validate(url=url, stored_etag='"test-etag-value"')
# httpbin should return 304 for matching ETag
assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
class TestRealDomainsNoConditionalSupport:
"""Test domains that may NOT support HTTP conditional requests."""
@pytest.mark.asyncio
async def test_dynamic_site_fingerprint_fallback(self):
"""Test fingerprint-based validation for sites without conditional support."""
# Use a site that changes frequently but has stable head
url = "https://example.com/"
async with CacheValidator(timeout=15.0) as validator:
# Get head and compute fingerprint
head_html, etag, last_modified = await validator._fetch_head(url)
assert head_html is not None
fingerprint = compute_head_fingerprint(head_html)
# Validate using fingerprint (not etag/last-modified)
result = await validator.validate(
url=url,
stored_head_fingerprint=fingerprint,
)
# Should be FRESH since fingerprint should match
assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
assert "fingerprint" in result.reason.lower()
@pytest.mark.asyncio
async def test_news_site_changes_frequently(self):
"""News sites change frequently - test that we can detect changes."""
url = "https://www.bbc.com/news"
async with CacheValidator(timeout=15.0) as validator:
head_html, etag, last_modified = await validator._fetch_head(url)
# BBC News has ETag but it changes with content
assert head_html is not None
# Using a fake old ETag should return STALE (200 with different content)
result = await validator.validate(
url=url,
stored_etag='"fake-old-etag-12345"',
)
# Should be STALE because the ETag doesn't match
assert result.status == CacheValidationResult.STALE, f"Expected STALE, got {result.status}: {result.reason}"
class TestRealDomainsEdgeCases:
"""Edge cases with real domains."""
@pytest.mark.asyncio
async def test_nonexistent_domain(self):
"""Non-existent domain should return ERROR."""
url = "https://this-domain-definitely-does-not-exist-xyz123.com/"
async with CacheValidator(timeout=5.0) as validator:
result = await validator.validate(url=url, stored_etag='"test"')
assert result.status == CacheValidationResult.ERROR
@pytest.mark.asyncio
async def test_timeout_slow_server(self):
"""Test timeout handling with a slow endpoint."""
# httpbin delay endpoint
url = "https://httpbin.org/delay/10"
async with CacheValidator(timeout=2.0) as validator: # 2 second timeout
result = await validator.validate(url=url, stored_etag='"test"')
# Should timeout and return ERROR
assert result.status == CacheValidationResult.ERROR
assert "timeout" in result.reason.lower() or "timed out" in result.reason.lower()
@pytest.mark.asyncio
async def test_redirect_handling(self):
"""Test that redirects are followed."""
# httpbin redirect
url = "https://httpbin.org/redirect/1"
async with CacheValidator(timeout=15.0) as validator:
head_html, etag, last_modified = await validator._fetch_head(url)
# Should follow redirect and get content
# The final page might not have useful head content, but shouldn't error
# This tests that redirects are handled
@pytest.mark.asyncio
async def test_https_only(self):
"""Test HTTPS connection."""
url = "https://www.google.com/"
async with CacheValidator(timeout=15.0) as validator:
head_html, etag, last_modified = await validator._fetch_head(url)
assert head_html is not None
assert "<title" in head_html.lower()
class TestRealDomainsHeadFingerprint:
"""Test head fingerprint extraction with real domains."""
@pytest.mark.asyncio
async def test_python_docs_fingerprint(self):
"""Python docs has title and meta tags."""
url = "https://docs.python.org/3/"
async with CacheValidator(timeout=15.0) as validator:
head_html, _, _ = await validator._fetch_head(url)
assert head_html is not None
fingerprint = compute_head_fingerprint(head_html)
assert fingerprint != "", "Should extract fingerprint from Python docs"
# Fingerprint should be consistent
fingerprint2 = compute_head_fingerprint(head_html)
assert fingerprint == fingerprint2
@pytest.mark.asyncio
async def test_github_fingerprint(self):
"""GitHub has og: tags."""
url = "https://github.com/"
async with CacheValidator(timeout=15.0) as validator:
head_html, _, _ = await validator._fetch_head(url)
assert head_html is not None
assert "og:" in head_html.lower() or "title" in head_html.lower()
fingerprint = compute_head_fingerprint(head_html)
assert fingerprint != ""
@pytest.mark.asyncio
async def test_crawl4ai_docs_fingerprint(self):
"""Crawl4AI docs should have title and description."""
url = "https://docs.crawl4ai.com/"
async with CacheValidator(timeout=15.0) as validator:
head_html, _, _ = await validator._fetch_head(url)
assert head_html is not None
fingerprint = compute_head_fingerprint(head_html)
assert fingerprint != "", "Should extract fingerprint from Crawl4AI docs"
class TestRealDomainsFetchHead:
"""Test _fetch_head functionality with real domains."""
@pytest.mark.asyncio
async def test_fetch_stops_at_head_close(self):
"""Verify we stop reading after </head>."""
url = "https://docs.python.org/3/"
async with CacheValidator(timeout=15.0) as validator:
head_html, _, _ = await validator._fetch_head(url)
assert head_html is not None
assert "</head>" in head_html.lower()
# Should NOT contain body content
assert "<body" not in head_html.lower() or head_html.lower().index("</head>") < head_html.lower().find("<body")
@pytest.mark.asyncio
async def test_extracts_both_headers(self):
"""Test extraction of both ETag and Last-Modified."""
url = "https://docs.python.org/3/"
async with CacheValidator(timeout=15.0) as validator:
head_html, etag, last_modified = await validator._fetch_head(url)
# Python docs should have both
assert etag is not None, "Should have ETag"
assert last_modified is not None, "Should have Last-Modified"
@pytest.mark.asyncio
async def test_handles_missing_head_tag(self):
"""Handle pages that might not have proper head structure."""
# API endpoint that returns JSON (no HTML head)
url = "https://httpbin.org/json"
async with CacheValidator(timeout=15.0) as validator:
head_html, etag, last_modified = await validator._fetch_head(url)
# Should not crash, may return partial content or None
# The important thing is it doesn't error
class TestRealDomainsValidationCombinations:
"""Test various combinations of validation data."""
@pytest.mark.asyncio
async def test_etag_only(self):
"""Validate with only ETag."""
url = "https://docs.python.org/3/"
async with CacheValidator(timeout=15.0) as validator:
_, etag, _ = await validator._fetch_head(url)
result = await validator.validate(url=url, stored_etag=etag)
assert result.status == CacheValidationResult.FRESH
@pytest.mark.asyncio
async def test_last_modified_only(self):
"""Validate with only Last-Modified."""
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
async with CacheValidator(timeout=15.0) as validator:
_, _, last_modified = await validator._fetch_head(url)
if last_modified:
result = await validator.validate(url=url, stored_last_modified=last_modified)
assert result.status == CacheValidationResult.FRESH
@pytest.mark.asyncio
async def test_fingerprint_only(self):
"""Validate with only fingerprint."""
url = "https://example.com/"
async with CacheValidator(timeout=15.0) as validator:
head_html, _, _ = await validator._fetch_head(url)
fingerprint = compute_head_fingerprint(head_html)
if fingerprint:
result = await validator.validate(url=url, stored_head_fingerprint=fingerprint)
assert result.status == CacheValidationResult.FRESH
@pytest.mark.asyncio
async def test_all_validation_data(self):
"""Validate with all available data."""
url = "https://docs.python.org/3/"
async with CacheValidator(timeout=15.0) as validator:
head_html, etag, last_modified = await validator._fetch_head(url)
fingerprint = compute_head_fingerprint(head_html)
result = await validator.validate(
url=url,
stored_etag=etag,
stored_last_modified=last_modified,
stored_head_fingerprint=fingerprint,
)
assert result.status == CacheValidationResult.FRESH
@pytest.mark.asyncio
async def test_stale_etag_fresh_fingerprint(self):
"""When ETag is stale but fingerprint matches, should be FRESH."""
url = "https://docs.python.org/3/"
async with CacheValidator(timeout=15.0) as validator:
head_html, _, _ = await validator._fetch_head(url)
fingerprint = compute_head_fingerprint(head_html)
# Use fake ETag but real fingerprint
result = await validator.validate(
url=url,
stored_etag='"fake-stale-etag"',
stored_head_fingerprint=fingerprint,
)
# Fingerprint should save us
assert result.status == CacheValidationResult.FRESH
assert "fingerprint" in result.reason.lower()

View File

@@ -0,0 +1,773 @@
"""
Test Suite: Deep Crawl Resume/Crash Recovery Tests
Tests that verify:
1. State export produces valid JSON-serializable data
2. Resume from checkpoint continues without duplicates
3. Simulated crash at various points recovers correctly
4. State callback fires at expected intervals
5. No damage to existing system behavior (regression tests)
"""
import pytest
import asyncio
import json
from typing import Dict, Any, List
from unittest.mock import AsyncMock, MagicMock
from crawl4ai.deep_crawling import (
BFSDeepCrawlStrategy,
DFSDeepCrawlStrategy,
BestFirstCrawlingStrategy,
FilterChain,
URLPatternFilter,
DomainFilter,
)
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
# ============================================================================
# Helper Functions for Mock Crawler
# ============================================================================
def create_mock_config(stream=False):
"""Create a mock CrawlerRunConfig."""
config = MagicMock()
config.clone = MagicMock(return_value=config)
config.stream = stream
return config
def create_mock_crawler_with_links(num_links: int = 3, include_keyword: bool = False):
"""Create mock crawler that returns results with links."""
call_count = 0
async def mock_arun_many(urls, config):
nonlocal call_count
results = []
for url in urls:
call_count += 1
result = MagicMock()
result.url = url
result.success = True
result.metadata = {}
# Generate child links
links = []
for i in range(num_links):
link_url = f"{url}/child{call_count}_{i}"
if include_keyword:
link_url = f"{url}/important-child{call_count}_{i}"
links.append({"href": link_url})
result.links = {"internal": links, "external": []}
results.append(result)
# For streaming mode, return async generator
if config.stream:
async def gen():
for r in results:
yield r
return gen()
return results
crawler = MagicMock()
crawler.arun_many = mock_arun_many
return crawler
def create_mock_crawler_tracking(crawl_order: List[str], return_no_links: bool = False):
"""Create mock crawler that tracks crawl order."""
async def mock_arun_many(urls, config):
results = []
for url in urls:
crawl_order.append(url)
result = MagicMock()
result.url = url
result.success = True
result.metadata = {}
result.links = {"internal": [], "external": []} if return_no_links else {"internal": [{"href": f"{url}/child"}], "external": []}
results.append(result)
# For streaming mode, return async generator
if config.stream:
async def gen():
for r in results:
yield r
return gen()
return results
crawler = MagicMock()
crawler.arun_many = mock_arun_many
return crawler
def create_simple_mock_crawler():
"""Basic mock crawler returning 1 result with 2 child links."""
call_count = 0
async def mock_arun_many(urls, config):
nonlocal call_count
results = []
for url in urls:
call_count += 1
result = MagicMock()
result.url = url
result.success = True
result.metadata = {}
result.links = {
"internal": [
{"href": f"{url}/child1"},
{"href": f"{url}/child2"},
],
"external": []
}
results.append(result)
if config.stream:
async def gen():
for r in results:
yield r
return gen()
return results
crawler = MagicMock()
crawler.arun_many = mock_arun_many
return crawler
def create_mock_crawler_unlimited_links():
"""Mock crawler that always returns links (for testing limits)."""
async def mock_arun_many(urls, config):
results = []
for url in urls:
result = MagicMock()
result.url = url
result.success = True
result.metadata = {}
result.links = {
"internal": [{"href": f"{url}/link{i}"} for i in range(10)],
"external": []
}
results.append(result)
if config.stream:
async def gen():
for r in results:
yield r
return gen()
return results
crawler = MagicMock()
crawler.arun_many = mock_arun_many
return crawler
# ============================================================================
# TEST SUITE 1: Crash Recovery Tests
# ============================================================================
class TestBFSResume:
"""BFS strategy resume tests."""
@pytest.mark.asyncio
async def test_state_export_json_serializable(self):
"""Verify exported state can be JSON serialized."""
captured_states: List[Dict] = []
async def capture_state(state: Dict[str, Any]):
# Verify JSON serializable
json_str = json.dumps(state)
parsed = json.loads(json_str)
captured_states.append(parsed)
strategy = BFSDeepCrawlStrategy(
max_depth=2,
max_pages=10,
on_state_change=capture_state,
)
# Create mock crawler that returns predictable results
mock_crawler = create_mock_crawler_with_links(num_links=3)
mock_config = create_mock_config()
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
# Verify states were captured
assert len(captured_states) > 0
# Verify state structure
for state in captured_states:
assert state["strategy_type"] == "bfs"
assert "visited" in state
assert "pending" in state
assert "depths" in state
assert "pages_crawled" in state
assert isinstance(state["visited"], list)
assert isinstance(state["pending"], list)
assert isinstance(state["depths"], dict)
assert isinstance(state["pages_crawled"], int)
@pytest.mark.asyncio
async def test_resume_continues_from_checkpoint(self):
"""Verify resume starts from saved state, not beginning."""
# Simulate state from previous crawl (visited 5 URLs, 3 pending)
saved_state = {
"strategy_type": "bfs",
"visited": [
"https://example.com",
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3",
"https://example.com/page4",
],
"pending": [
{"url": "https://example.com/page5", "parent_url": "https://example.com/page2"},
{"url": "https://example.com/page6", "parent_url": "https://example.com/page3"},
{"url": "https://example.com/page7", "parent_url": "https://example.com/page3"},
],
"depths": {
"https://example.com": 0,
"https://example.com/page1": 1,
"https://example.com/page2": 1,
"https://example.com/page3": 1,
"https://example.com/page4": 1,
"https://example.com/page5": 2,
"https://example.com/page6": 2,
"https://example.com/page7": 2,
},
"pages_crawled": 5,
}
crawled_urls: List[str] = []
strategy = BFSDeepCrawlStrategy(
max_depth=2,
max_pages=20,
resume_state=saved_state,
)
# Verify internal state was restored
assert strategy._resume_state == saved_state
mock_crawler = create_mock_crawler_tracking(crawled_urls, return_no_links=True)
mock_config = create_mock_config()
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
# Should NOT re-crawl already visited URLs
for visited_url in saved_state["visited"]:
assert visited_url not in crawled_urls, f"Re-crawled already visited: {visited_url}"
# Should crawl pending URLs
for pending in saved_state["pending"]:
assert pending["url"] in crawled_urls, f"Did not crawl pending: {pending['url']}"
@pytest.mark.asyncio
async def test_simulated_crash_mid_crawl(self):
"""Simulate crash at URL N, verify resume continues from pending URLs."""
crash_after = 3
states_before_crash: List[Dict] = []
async def capture_until_crash(state: Dict[str, Any]):
states_before_crash.append(state)
if state["pages_crawled"] >= crash_after:
raise Exception("Simulated crash!")
strategy1 = BFSDeepCrawlStrategy(
max_depth=2,
max_pages=10,
on_state_change=capture_until_crash,
)
mock_crawler = create_mock_crawler_with_links(num_links=5)
mock_config = create_mock_config()
# First crawl - crashes
with pytest.raises(Exception, match="Simulated crash"):
await strategy1._arun_batch("https://example.com", mock_crawler, mock_config)
# Get last state before crash
last_state = states_before_crash[-1]
assert last_state["pages_crawled"] >= crash_after
# Calculate which URLs were already crawled vs pending
pending_urls = {item["url"] for item in last_state["pending"]}
visited_urls = set(last_state["visited"])
already_crawled_urls = visited_urls - pending_urls
# Resume from checkpoint
crawled_in_resume: List[str] = []
strategy2 = BFSDeepCrawlStrategy(
max_depth=2,
max_pages=10,
resume_state=last_state,
)
mock_crawler2 = create_mock_crawler_tracking(crawled_in_resume, return_no_links=True)
await strategy2._arun_batch("https://example.com", mock_crawler2, mock_config)
# Verify already-crawled URLs are not re-crawled
for crawled_url in already_crawled_urls:
assert crawled_url not in crawled_in_resume, f"Re-crawled already visited: {crawled_url}"
# Verify pending URLs are crawled
for pending_url in pending_urls:
assert pending_url in crawled_in_resume, f"Did not crawl pending: {pending_url}"
@pytest.mark.asyncio
async def test_callback_fires_per_url(self):
"""Verify callback fires after each URL for maximum granularity."""
callback_count = 0
pages_crawled_sequence: List[int] = []
async def count_callbacks(state: Dict[str, Any]):
nonlocal callback_count
callback_count += 1
pages_crawled_sequence.append(state["pages_crawled"])
strategy = BFSDeepCrawlStrategy(
max_depth=1,
max_pages=5,
on_state_change=count_callbacks,
)
mock_crawler = create_mock_crawler_with_links(num_links=2)
mock_config = create_mock_config()
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
# Callback should fire once per successful URL
assert callback_count == strategy._pages_crawled, \
f"Callback fired {callback_count} times, expected {strategy._pages_crawled} (per URL)"
# pages_crawled should increment by 1 each callback
for i, count in enumerate(pages_crawled_sequence):
assert count == i + 1, f"Expected pages_crawled={i+1} at callback {i}, got {count}"
@pytest.mark.asyncio
async def test_export_state_returns_last_captured(self):
"""Verify export_state() returns last captured state."""
last_state = None
async def capture(state):
nonlocal last_state
last_state = state
strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=5, on_state_change=capture)
mock_crawler = create_mock_crawler_with_links(num_links=2)
mock_config = create_mock_config()
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
exported = strategy.export_state()
assert exported == last_state
class TestDFSResume:
"""DFS strategy resume tests."""
@pytest.mark.asyncio
async def test_state_export_includes_stack_and_dfs_seen(self):
"""Verify DFS state includes stack structure and _dfs_seen."""
captured_states: List[Dict] = []
async def capture_state(state: Dict[str, Any]):
captured_states.append(state)
strategy = DFSDeepCrawlStrategy(
max_depth=3,
max_pages=10,
on_state_change=capture_state,
)
mock_crawler = create_mock_crawler_with_links(num_links=2)
mock_config = create_mock_config()
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
assert len(captured_states) > 0
for state in captured_states:
assert state["strategy_type"] == "dfs"
assert "stack" in state
assert "dfs_seen" in state
# Stack items should have depth
for item in state["stack"]:
assert "url" in item
assert "parent_url" in item
assert "depth" in item
@pytest.mark.asyncio
async def test_resume_restores_stack_order(self):
"""Verify DFS stack order is preserved on resume."""
saved_state = {
"strategy_type": "dfs",
"visited": ["https://example.com"],
"stack": [
{"url": "https://example.com/deep3", "parent_url": "https://example.com/deep2", "depth": 3},
{"url": "https://example.com/deep2", "parent_url": "https://example.com/deep1", "depth": 2},
{"url": "https://example.com/page1", "parent_url": "https://example.com", "depth": 1},
],
"depths": {"https://example.com": 0},
"pages_crawled": 1,
"dfs_seen": ["https://example.com", "https://example.com/deep3", "https://example.com/deep2", "https://example.com/page1"],
}
crawl_order: List[str] = []
strategy = DFSDeepCrawlStrategy(
max_depth=3,
max_pages=10,
resume_state=saved_state,
)
mock_crawler = create_mock_crawler_tracking(crawl_order, return_no_links=True)
mock_config = create_mock_config()
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
# DFS pops from end of stack, so order should be: page1, deep2, deep3
assert crawl_order[0] == "https://example.com/page1"
assert crawl_order[1] == "https://example.com/deep2"
assert crawl_order[2] == "https://example.com/deep3"
class TestBestFirstResume:
"""Best-First strategy resume tests."""
@pytest.mark.asyncio
async def test_state_export_includes_scored_queue(self):
"""Verify Best-First state includes queue with scores."""
captured_states: List[Dict] = []
async def capture_state(state: Dict[str, Any]):
captured_states.append(state)
scorer = KeywordRelevanceScorer(keywords=["important"], weight=1.0)
strategy = BestFirstCrawlingStrategy(
max_depth=2,
max_pages=10,
url_scorer=scorer,
on_state_change=capture_state,
)
mock_crawler = create_mock_crawler_with_links(num_links=3, include_keyword=True)
mock_config = create_mock_config(stream=True)
async for _ in strategy._arun_stream("https://example.com", mock_crawler, mock_config):
pass
assert len(captured_states) > 0
for state in captured_states:
assert state["strategy_type"] == "best_first"
assert "queue_items" in state
for item in state["queue_items"]:
assert "score" in item
assert "depth" in item
assert "url" in item
assert "parent_url" in item
@pytest.mark.asyncio
async def test_resume_maintains_priority_order(self):
"""Verify priority queue order is maintained on resume."""
saved_state = {
"strategy_type": "best_first",
"visited": ["https://example.com"],
"queue_items": [
{"score": -0.9, "depth": 1, "url": "https://example.com/high-priority", "parent_url": "https://example.com"},
{"score": -0.5, "depth": 1, "url": "https://example.com/medium-priority", "parent_url": "https://example.com"},
{"score": -0.1, "depth": 1, "url": "https://example.com/low-priority", "parent_url": "https://example.com"},
],
"depths": {"https://example.com": 0},
"pages_crawled": 1,
}
crawl_order: List[str] = []
strategy = BestFirstCrawlingStrategy(
max_depth=2,
max_pages=10,
resume_state=saved_state,
)
mock_crawler = create_mock_crawler_tracking(crawl_order, return_no_links=True)
mock_config = create_mock_config(stream=True)
async for _ in strategy._arun_stream("https://example.com", mock_crawler, mock_config):
pass
# More negative score = higher priority in the min-heap,
# so -0.9 should be crawled first
assert crawl_order[0] == "https://example.com/high-priority"
class TestCrossStrategyResume:
"""Tests that apply to all strategies."""
@pytest.mark.asyncio
@pytest.mark.parametrize("strategy_class,strategy_type", [
(BFSDeepCrawlStrategy, "bfs"),
(DFSDeepCrawlStrategy, "dfs"),
(BestFirstCrawlingStrategy, "best_first"),
])
async def test_no_callback_means_no_overhead(self, strategy_class, strategy_type):
"""Verify no state tracking when callback is None."""
strategy = strategy_class(max_depth=2, max_pages=5)
# _queue_shadow should be None for Best-First when no callback
if strategy_class == BestFirstCrawlingStrategy:
assert strategy._queue_shadow is None
# _last_state should be None initially
assert strategy._last_state is None
@pytest.mark.asyncio
@pytest.mark.parametrize("strategy_class", [
BFSDeepCrawlStrategy,
DFSDeepCrawlStrategy,
BestFirstCrawlingStrategy,
])
async def test_export_state_returns_last_captured(self, strategy_class):
"""Verify export_state() returns last captured state."""
last_state = None
async def capture(state):
nonlocal last_state
last_state = state
strategy = strategy_class(max_depth=2, max_pages=5, on_state_change=capture)
mock_crawler = create_mock_crawler_with_links(num_links=2)
if strategy_class == BestFirstCrawlingStrategy:
mock_config = create_mock_config(stream=True)
async for _ in strategy._arun_stream("https://example.com", mock_crawler, mock_config):
pass
else:
mock_config = create_mock_config()
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
exported = strategy.export_state()
assert exported == last_state
# ============================================================================
# TEST SUITE 2: Regression Tests (No Damage to Current System)
# ============================================================================
class TestBFSRegressions:
"""Ensure BFS works identically when new params not used."""
@pytest.mark.asyncio
async def test_default_params_unchanged(self):
"""Constructor with only original params works."""
strategy = BFSDeepCrawlStrategy(
max_depth=2,
include_external=False,
max_pages=10,
)
assert strategy.max_depth == 2
assert strategy.include_external == False
assert strategy.max_pages == 10
assert strategy._resume_state is None
assert strategy._on_state_change is None
@pytest.mark.asyncio
async def test_filter_chain_still_works(self):
"""FilterChain integration unchanged."""
filter_chain = FilterChain([
URLPatternFilter(patterns=["*/blog/*"]),
DomainFilter(allowed_domains=["example.com"]),
])
strategy = BFSDeepCrawlStrategy(
max_depth=2,
filter_chain=filter_chain,
)
# Test filter still applies
assert await strategy.can_process_url("https://example.com/blog/post1", 1) == True
assert await strategy.can_process_url("https://other.com/blog/post1", 1) == False
@pytest.mark.asyncio
async def test_url_scorer_still_works(self):
"""URL scoring integration unchanged."""
scorer = KeywordRelevanceScorer(keywords=["python", "tutorial"], weight=1.0)
strategy = BFSDeepCrawlStrategy(
max_depth=2,
url_scorer=scorer,
score_threshold=0.5,
)
assert strategy.url_scorer is not None
assert strategy.score_threshold == 0.5
# Scorer should work
score = scorer.score("https://example.com/python-tutorial")
assert score > 0
@pytest.mark.asyncio
async def test_batch_mode_returns_list(self):
"""Batch mode still returns List[CrawlResult]."""
strategy = BFSDeepCrawlStrategy(max_depth=1, max_pages=5)
mock_crawler = create_simple_mock_crawler()
mock_config = create_mock_config(stream=False)
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
assert isinstance(results, list)
assert len(results) > 0
@pytest.mark.asyncio
async def test_max_pages_limit_respected(self):
"""max_pages limit still enforced."""
strategy = BFSDeepCrawlStrategy(max_depth=10, max_pages=3)
mock_crawler = create_mock_crawler_unlimited_links()
mock_config = create_mock_config()
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
# Should stop at max_pages
assert strategy._pages_crawled <= 3
@pytest.mark.asyncio
async def test_max_depth_limit_respected(self):
"""max_depth limit still enforced."""
strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=100)
mock_crawler = create_mock_crawler_unlimited_links()
mock_config = create_mock_config()
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
# All results should have depth <= max_depth
for result in results:
assert result.metadata.get("depth", 0) <= 2
@pytest.mark.asyncio
async def test_metadata_depth_still_set(self):
"""Result metadata still includes depth."""
strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=5)
mock_crawler = create_simple_mock_crawler()
mock_config = create_mock_config()
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
for result in results:
assert "depth" in result.metadata
assert isinstance(result.metadata["depth"], int)
@pytest.mark.asyncio
async def test_metadata_parent_url_still_set(self):
"""Result metadata still includes parent_url."""
strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=5)
mock_crawler = create_simple_mock_crawler()
mock_config = create_mock_config()
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
# First result (start URL) should have parent_url = None
assert results[0].metadata.get("parent_url") is None
# Child results should have parent_url set
for result in results[1:]:
assert "parent_url" in result.metadata
class TestDFSRegressions:
"""Ensure DFS works identically when new params not used."""
@pytest.mark.asyncio
async def test_inherits_bfs_params(self):
"""DFS still inherits all BFS parameters."""
strategy = DFSDeepCrawlStrategy(
max_depth=3,
include_external=True,
max_pages=20,
score_threshold=0.5,
)
assert strategy.max_depth == 3
assert strategy.include_external == True
assert strategy.max_pages == 20
assert strategy.score_threshold == 0.5
@pytest.mark.asyncio
async def test_dfs_seen_initialized(self):
"""DFS _dfs_seen set still initialized."""
strategy = DFSDeepCrawlStrategy(max_depth=2)
assert hasattr(strategy, '_dfs_seen')
assert isinstance(strategy._dfs_seen, set)
class TestBestFirstRegressions:
"""Ensure Best-First works identically when new params not used."""
@pytest.mark.asyncio
async def test_default_params_unchanged(self):
"""Constructor with only original params works."""
strategy = BestFirstCrawlingStrategy(
max_depth=2,
include_external=False,
max_pages=10,
)
assert strategy.max_depth == 2
assert strategy.include_external == False
assert strategy.max_pages == 10
assert strategy._resume_state is None
assert strategy._on_state_change is None
assert strategy._queue_shadow is None # Not initialized without callback
@pytest.mark.asyncio
async def test_scorer_integration(self):
"""URL scorer still affects crawl priority."""
scorer = KeywordRelevanceScorer(keywords=["important"], weight=1.0)
strategy = BestFirstCrawlingStrategy(
max_depth=2,
max_pages=10,
url_scorer=scorer,
)
assert strategy.url_scorer is scorer
class TestAPICompatibility:
"""Ensure API/serialization compatibility."""
def test_strategy_signature_backward_compatible(self):
"""Old code calling with positional/keyword args still works."""
# Positional args (old style)
s1 = BFSDeepCrawlStrategy(2)
assert s1.max_depth == 2
# Keyword args (old style)
s2 = BFSDeepCrawlStrategy(max_depth=3, max_pages=10)
assert s2.max_depth == 3
# Mixed (old style)
s3 = BFSDeepCrawlStrategy(2, FilterChain(), None, False, float('-inf'), 100)
assert s3.max_depth == 2
assert s3.max_pages == 100
def test_no_required_new_params(self):
"""New params are optional, not required."""
# Should not raise
BFSDeepCrawlStrategy(max_depth=2)
DFSDeepCrawlStrategy(max_depth=2)
BestFirstCrawlingStrategy(max_depth=2)
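
To make the intended workflow concrete, here is a minimal checkpoint/resume sketch. Assumptions: state is persisted to a local JSON file (crawl_state.json is a hypothetical path); any store that survives the process, such as Redis or a database row, works the same way.

import asyncio
import json
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

STATE_FILE = "crawl_state.json"  # hypothetical checkpoint path

async def save_state(state):
    # Fired after every successfully crawled URL; the state dict is JSON-serializable.
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

async def crawl_with_recovery(start_url):
    resume_state = None
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            resume_state = json.load(f)  # continue where the crashed run stopped
    strategy = BFSDeepCrawlStrategy(
        max_depth=2,
        max_pages=50,
        resume_state=resume_state,
        on_state_change=save_state,
    )
    config = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=False)
    async with AsyncWebCrawler() as crawler:
        return await crawler.arun(start_url, config=config)

# asyncio.run(crawl_with_recovery("https://books.toscrape.com"))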

View File

@@ -0,0 +1,162 @@
"""
Integration Test: Deep Crawl Resume with Real URLs
Tests the crash recovery feature using books.toscrape.com - a site
designed for scraping practice with a clear hierarchy:
- Home page → Category pages → Book detail pages
"""
import pytest
import asyncio
import json
from typing import Dict, Any, List
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
class TestBFSResumeIntegration:
"""Integration tests for BFS resume with real crawling."""
@pytest.mark.asyncio
async def test_real_crawl_state_capture_and_resume(self):
"""
Test crash recovery with real URLs from books.toscrape.com.
Flow:
1. Start crawl with state callback
2. Stop after N pages (simulated crash)
3. Resume from saved state
4. Verify no duplicate crawls
"""
# Phase 1: Initial crawl that "crashes" after 3 pages
crash_after = 3
captured_states: List[Dict[str, Any]] = []
crawled_urls_phase1: List[str] = []
async def capture_state_until_crash(state: Dict[str, Any]):
captured_states.append(state)
crawled_urls_phase1.clear()
crawled_urls_phase1.extend(state["visited"])
if state["pages_crawled"] >= crash_after:
raise Exception("Simulated crash!")
strategy1 = BFSDeepCrawlStrategy(
max_depth=2,
max_pages=10,
on_state_change=capture_state_until_crash,
)
config = CrawlerRunConfig(
deep_crawl_strategy=strategy1,
stream=False,
verbose=False,
)
async with AsyncWebCrawler(verbose=False) as crawler:
# First crawl - will crash after 3 pages
with pytest.raises(Exception, match="Simulated crash"):
await crawler.arun("https://books.toscrape.com", config=config)
# Verify we captured state before crash
assert len(captured_states) > 0, "No states captured before crash"
last_state = captured_states[-1]
print(f"\n=== Phase 1: Crashed after {last_state['pages_crawled']} pages ===")
print(f"Visited URLs: {len(last_state['visited'])}")
print(f"Pending URLs: {len(last_state['pending'])}")
# Verify state structure
assert last_state["strategy_type"] == "bfs"
assert last_state["pages_crawled"] >= crash_after
assert len(last_state["visited"]) > 0
assert "pending" in last_state
assert "depths" in last_state
# Verify state is JSON serializable (important for Redis/DB storage)
json_str = json.dumps(last_state)
restored_state = json.loads(json_str)
assert restored_state == last_state, "State not JSON round-trip safe"
# Phase 2: Resume from checkpoint
crawled_urls_phase2: List[str] = []
async def track_resumed_crawl(state: Dict[str, Any]):
# Track what's being crawled in phase 2
new_visited = set(state["visited"]) - set(last_state["visited"])
for url in new_visited:
if url not in crawled_urls_phase2:
crawled_urls_phase2.append(url)
strategy2 = BFSDeepCrawlStrategy(
max_depth=2,
max_pages=10,
resume_state=restored_state,
on_state_change=track_resumed_crawl,
)
config2 = CrawlerRunConfig(
deep_crawl_strategy=strategy2,
stream=False,
verbose=False,
)
async with AsyncWebCrawler(verbose=False) as crawler:
results = await crawler.arun("https://books.toscrape.com", config=config2)
print(f"\n=== Phase 2: Resumed crawl ===")
print(f"New URLs crawled: {len(crawled_urls_phase2)}")
print(f"Final pages_crawled: {strategy2._pages_crawled}")
# Verify no duplicates - URLs from phase 1 should not be re-crawled
already_crawled = set(last_state["visited"]) - {item["url"] for item in last_state["pending"]}
duplicates = set(crawled_urls_phase2) & already_crawled
assert len(duplicates) == 0, f"Duplicate crawls detected: {duplicates}"
# Verify we made progress (crawled some of the pending URLs)
pending_urls = {item["url"] for item in last_state["pending"]}
crawled_pending = set(crawled_urls_phase2) & pending_urls
print(f"Pending URLs crawled in phase 2: {len(crawled_pending)}")
# Final state should show more pages crawled than before crash
final_state = strategy2.export_state()
if final_state:
assert final_state["pages_crawled"] >= last_state["pages_crawled"], \
"Resume did not make progress"
print("\n=== Integration test PASSED ===")
@pytest.mark.asyncio
async def test_state_export_method(self):
"""Test that export_state() returns valid state during crawl."""
states_from_callback: List[Dict] = []
async def capture(state):
states_from_callback.append(state)
strategy = BFSDeepCrawlStrategy(
max_depth=1,
max_pages=3,
on_state_change=capture,
)
config = CrawlerRunConfig(
deep_crawl_strategy=strategy,
stream=False,
verbose=False,
)
async with AsyncWebCrawler(verbose=False) as crawler:
await crawler.arun("https://books.toscrape.com", config=config)
# export_state should return the last captured state
exported = strategy.export_state()
assert exported is not None, "export_state() returned None"
assert exported == states_from_callback[-1], "export_state() doesn't match last callback"
print(f"\n=== export_state() test PASSED ===")
print(f"Final state: {exported['pages_crawled']} pages, {len(exported['visited'])} visited")

View File

@@ -7,9 +7,46 @@ adapted for the Docker API with real URLs
import requests
import json
import time
-from typing import Dict, Any
+from typing import Dict, Optional
-API_BASE_URL = "http://localhost:11234"
+API_BASE_URL = "http://localhost:11235"
# Global token storage
_auth_token: Optional[str] = None
def get_auth_token(email: str = "test@gmail.com") -> str:
"""
Get a JWT token from the /token endpoint.
The email domain must have valid MX records.
"""
global _auth_token
if _auth_token:
return _auth_token
print(f"🔐 Requesting JWT token for {email}...")
response = requests.post(
f"{API_BASE_URL}/token",
json={"email": email}
)
if response.status_code == 200:
data = response.json()
_auth_token = data["access_token"]
print(f"✅ Token obtained successfully")
return _auth_token
else:
raise Exception(f"Failed to get token: {response.status_code} - {response.text}")
def get_auth_headers() -> Dict[str, str]:
"""Get headers with JWT Bearer token."""
token = get_auth_token()
return {
"Authorization": f"Bearer {token}",
"Content-Type": "application/json"
}
def test_all_hooks_demo():
@@ -164,8 +201,8 @@ async def hook(page, context, html, **kwargs):
print("\nSending request with all 8 hooks...")
start_time = time.time()
-response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+response = requests.post(f"{API_BASE_URL}/crawl", json=payload, headers=get_auth_headers())
elapsed_time = time.time() - start_time
print(f"Request completed in {elapsed_time:.2f} seconds")
@@ -278,7 +315,7 @@ async def hook(page, context, url, **kwargs):
}
print("\nTesting authentication with httpbin endpoints...")
-response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+response = requests.post(f"{API_BASE_URL}/crawl", json=payload, headers=get_auth_headers())
if response.status_code == 200:
data = response.json()
@@ -371,8 +408,8 @@ async def hook(page, context, **kwargs):
print("\nTesting performance optimization hooks...")
start_time = time.time()
-response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+response = requests.post(f"{API_BASE_URL}/crawl", json=payload, headers=get_auth_headers())
elapsed_time = time.time() - start_time
print(f"Request completed in {elapsed_time:.2f} seconds")
@@ -462,7 +499,7 @@ async def hook(page, context, **kwargs):
}
print("\nTesting content extraction hooks...")
-response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+response = requests.post(f"{API_BASE_URL}/crawl", json=payload, headers=get_auth_headers())
if response.status_code == 200:
data = response.json()
@@ -485,7 +522,16 @@ def main():
print("🔧 Crawl4AI Docker API - Comprehensive Hooks Testing")
print("Based on docs/examples/hooks_example.py")
print("=" * 70)
# Get JWT token first (required when jwt_enabled=true)
try:
get_auth_token()
print("=" * 70)
except Exception as e:
print(f"❌ Failed to authenticate: {e}")
print("Make sure the server is running and jwt_enabled is configured correctly.")
return
tests = [
("All Hooks Demo", test_all_hooks_demo),
("Authentication Flow", test_authentication_flow),

View File

@@ -0,0 +1,569 @@
"""
Comprehensive test suite for Sticky Proxy Sessions functionality.
Tests cover:
1. Basic sticky session - same proxy for same session_id
2. Different sessions get different proxies
3. Session release
4. TTL expiration
5. Thread safety / concurrent access
6. Integration tests with AsyncWebCrawler
"""
import asyncio
import os
import time
import pytest
from unittest.mock import patch
from crawl4ai import AsyncWebCrawler, BrowserConfig
from crawl4ai.async_configs import CrawlerRunConfig, ProxyConfig
from crawl4ai.proxy_strategy import RoundRobinProxyStrategy
from crawl4ai.cache_context import CacheMode
class TestRoundRobinProxyStrategySession:
"""Test suite for RoundRobinProxyStrategy session methods."""
def setup_method(self):
"""Setup for each test method."""
self.proxies = [
ProxyConfig(server=f"http://proxy{i}.test:8080")
for i in range(5)
]
# ==================== BASIC STICKY SESSION TESTS ====================
@pytest.mark.asyncio
async def test_sticky_session_same_proxy(self):
"""Verify same proxy is returned for same session_id."""
strategy = RoundRobinProxyStrategy(self.proxies)
# First call - acquires proxy
proxy1 = await strategy.get_proxy_for_session("session-1")
# Second call - should return same proxy
proxy2 = await strategy.get_proxy_for_session("session-1")
# Third call - should return same proxy
proxy3 = await strategy.get_proxy_for_session("session-1")
assert proxy1 is not None
assert proxy1.server == proxy2.server == proxy3.server
@pytest.mark.asyncio
async def test_different_sessions_different_proxies(self):
"""Verify different session_ids can get different proxies."""
strategy = RoundRobinProxyStrategy(self.proxies)
proxy_a = await strategy.get_proxy_for_session("session-a")
proxy_b = await strategy.get_proxy_for_session("session-b")
proxy_c = await strategy.get_proxy_for_session("session-c")
# All should be different (round-robin)
servers = {proxy_a.server, proxy_b.server, proxy_c.server}
assert len(servers) == 3
@pytest.mark.asyncio
async def test_sticky_session_with_regular_rotation(self):
"""Verify sticky sessions don't interfere with regular rotation."""
strategy = RoundRobinProxyStrategy(self.proxies)
# Acquire a sticky session
session_proxy = await strategy.get_proxy_for_session("sticky-session")
# Regular rotation should continue independently
regular_proxy1 = await strategy.get_next_proxy()
regular_proxy2 = await strategy.get_next_proxy()
# Sticky session should still return same proxy
session_proxy_again = await strategy.get_proxy_for_session("sticky-session")
assert session_proxy.server == session_proxy_again.server
# Regular proxies should rotate
assert regular_proxy1.server != regular_proxy2.server
# ==================== SESSION RELEASE TESTS ====================
@pytest.mark.asyncio
async def test_session_release(self):
"""Verify session can be released and reacquired."""
strategy = RoundRobinProxyStrategy(self.proxies)
# Acquire session
proxy1 = await strategy.get_proxy_for_session("session-1")
assert strategy.get_session_proxy("session-1") is not None
# Release session
await strategy.release_session("session-1")
assert strategy.get_session_proxy("session-1") is None
# Reacquire - should get a new proxy (next in round-robin)
proxy2 = await strategy.get_proxy_for_session("session-1")
assert proxy2 is not None
# After release, next call gets the next proxy in rotation
# (not necessarily the same as before)
@pytest.mark.asyncio
async def test_release_nonexistent_session(self):
"""Verify releasing non-existent session doesn't raise error."""
strategy = RoundRobinProxyStrategy(self.proxies)
# Should not raise
await strategy.release_session("nonexistent-session")
@pytest.mark.asyncio
async def test_release_twice(self):
"""Verify releasing session twice doesn't raise error."""
strategy = RoundRobinProxyStrategy(self.proxies)
await strategy.get_proxy_for_session("session-1")
await strategy.release_session("session-1")
await strategy.release_session("session-1") # Should not raise
# ==================== GET SESSION PROXY TESTS ====================
@pytest.mark.asyncio
async def test_get_session_proxy_existing(self):
"""Verify get_session_proxy returns proxy for existing session."""
strategy = RoundRobinProxyStrategy(self.proxies)
acquired = await strategy.get_proxy_for_session("session-1")
retrieved = strategy.get_session_proxy("session-1")
assert retrieved is not None
assert acquired.server == retrieved.server
def test_get_session_proxy_nonexistent(self):
"""Verify get_session_proxy returns None for non-existent session."""
strategy = RoundRobinProxyStrategy(self.proxies)
result = strategy.get_session_proxy("nonexistent-session")
assert result is None
# ==================== TTL EXPIRATION TESTS ====================
@pytest.mark.asyncio
async def test_session_ttl_not_expired(self):
"""Verify session returns same proxy when TTL not expired."""
strategy = RoundRobinProxyStrategy(self.proxies)
# Acquire with 10 second TTL
proxy1 = await strategy.get_proxy_for_session("session-1", ttl=10)
# Immediately request again - should return same proxy
proxy2 = await strategy.get_proxy_for_session("session-1", ttl=10)
assert proxy1.server == proxy2.server
@pytest.mark.asyncio
async def test_session_ttl_expired(self):
"""Verify new proxy acquired after TTL expires."""
strategy = RoundRobinProxyStrategy(self.proxies)
# Acquire with 1 second TTL
proxy1 = await strategy.get_proxy_for_session("session-1", ttl=1)
# Wait for TTL to expire
await asyncio.sleep(1.1)
# Request again - should get new proxy due to expiration
proxy2 = await strategy.get_proxy_for_session("session-1", ttl=1)
# May or may not be same server depending on round-robin state,
# but session should have been recreated
assert proxy2 is not None
@pytest.mark.asyncio
async def test_get_session_proxy_ttl_expired(self):
"""Verify get_session_proxy returns None after TTL expires."""
strategy = RoundRobinProxyStrategy(self.proxies)
await strategy.get_proxy_for_session("session-1", ttl=1)
# Wait for expiration
await asyncio.sleep(1.1)
# Should return None for expired session
result = strategy.get_session_proxy("session-1")
assert result is None
@pytest.mark.asyncio
async def test_cleanup_expired_sessions(self):
"""Verify cleanup_expired_sessions removes expired sessions."""
strategy = RoundRobinProxyStrategy(self.proxies)
# Create sessions with short TTL
await strategy.get_proxy_for_session("short-ttl-1", ttl=1)
await strategy.get_proxy_for_session("short-ttl-2", ttl=1)
# Create session without TTL (should not be cleaned up)
await strategy.get_proxy_for_session("no-ttl")
# Wait for TTL to expire
await asyncio.sleep(1.1)
# Cleanup
removed = await strategy.cleanup_expired_sessions()
assert removed == 2
assert strategy.get_session_proxy("short-ttl-1") is None
assert strategy.get_session_proxy("short-ttl-2") is None
assert strategy.get_session_proxy("no-ttl") is not None
# ==================== GET ACTIVE SESSIONS TESTS ====================
@pytest.mark.asyncio
async def test_get_active_sessions(self):
"""Verify get_active_sessions returns all active sessions."""
strategy = RoundRobinProxyStrategy(self.proxies)
await strategy.get_proxy_for_session("session-a")
await strategy.get_proxy_for_session("session-b")
await strategy.get_proxy_for_session("session-c")
active = strategy.get_active_sessions()
assert len(active) == 3
assert "session-a" in active
assert "session-b" in active
assert "session-c" in active
@pytest.mark.asyncio
async def test_get_active_sessions_excludes_expired(self):
"""Verify get_active_sessions excludes expired sessions."""
strategy = RoundRobinProxyStrategy(self.proxies)
await strategy.get_proxy_for_session("short-ttl", ttl=1)
await strategy.get_proxy_for_session("no-ttl")
# Before expiration
active = strategy.get_active_sessions()
assert len(active) == 2
# Wait for TTL to expire
await asyncio.sleep(1.1)
# After expiration
active = strategy.get_active_sessions()
assert len(active) == 1
assert "no-ttl" in active
assert "short-ttl" not in active
# ==================== THREAD SAFETY TESTS ====================
@pytest.mark.asyncio
async def test_concurrent_session_access(self):
"""Verify thread-safe access to sessions."""
strategy = RoundRobinProxyStrategy(self.proxies)
async def acquire_session(session_id: str):
proxy = await strategy.get_proxy_for_session(session_id)
await asyncio.sleep(0.01) # Simulate work
return proxy.server
# Acquire same session from multiple coroutines
results = await asyncio.gather(*[
acquire_session("shared-session") for _ in range(10)
])
# All should get same proxy
assert len(set(results)) == 1
@pytest.mark.asyncio
async def test_concurrent_different_sessions(self):
"""Verify concurrent acquisition of different sessions works correctly."""
strategy = RoundRobinProxyStrategy(self.proxies)
async def acquire_session(session_id: str):
proxy = await strategy.get_proxy_for_session(session_id)
await asyncio.sleep(0.01)
return (session_id, proxy.server)
# Acquire different sessions concurrently
results = await asyncio.gather(*[
acquire_session(f"session-{i}") for i in range(5)
])
# Each session should have a consistent proxy
session_proxies = dict(results)
assert len(session_proxies) == 5
# Verify each session still returns same proxy
for session_id, expected_server in session_proxies.items():
actual = await strategy.get_proxy_for_session(session_id)
assert actual.server == expected_server
@pytest.mark.asyncio
async def test_concurrent_session_acquire_and_release(self):
"""Verify concurrent acquire and release operations work correctly."""
strategy = RoundRobinProxyStrategy(self.proxies)
async def acquire_and_release(session_id: str):
proxy = await strategy.get_proxy_for_session(session_id)
await asyncio.sleep(0.01)
await strategy.release_session(session_id)
return proxy.server
# Run multiple acquire/release cycles concurrently
await asyncio.gather(*[
acquire_and_release(f"session-{i}") for i in range(10)
])
# All sessions should be released
active = strategy.get_active_sessions()
assert len(active) == 0
# ==================== EMPTY PROXY POOL TESTS ====================
@pytest.mark.asyncio
async def test_empty_proxy_pool_session(self):
"""Verify behavior with empty proxy pool."""
strategy = RoundRobinProxyStrategy() # No proxies
result = await strategy.get_proxy_for_session("session-1")
assert result is None
@pytest.mark.asyncio
async def test_add_proxies_after_session(self):
"""Verify adding proxies after session creation works."""
strategy = RoundRobinProxyStrategy()
# No proxies initially
result1 = await strategy.get_proxy_for_session("session-1")
assert result1 is None
# Add proxies
strategy.add_proxies(self.proxies)
# Now should work
result2 = await strategy.get_proxy_for_session("session-2")
assert result2 is not None
class TestCrawlerRunConfigSession:
"""Test CrawlerRunConfig with sticky session parameters."""
def test_config_has_session_fields(self):
"""Verify CrawlerRunConfig has sticky session fields."""
config = CrawlerRunConfig(
proxy_session_id="test-session",
proxy_session_ttl=300,
proxy_session_auto_release=True
)
assert config.proxy_session_id == "test-session"
assert config.proxy_session_ttl == 300
assert config.proxy_session_auto_release is True
def test_config_session_defaults(self):
"""Verify default values for session fields."""
config = CrawlerRunConfig()
assert config.proxy_session_id is None
assert config.proxy_session_ttl is None
assert config.proxy_session_auto_release is False
class TestCrawlerStickySessionIntegration:
"""Integration tests for AsyncWebCrawler with sticky sessions."""
def setup_method(self):
"""Setup for each test method."""
self.proxies = [
ProxyConfig(server=f"http://proxy{i}.test:8080")
for i in range(3)
]
self.test_url = "https://httpbin.org/ip"
@pytest.mark.asyncio
async def test_crawler_sticky_session_without_proxy(self):
"""Test that crawler works when proxy_session_id set but no strategy."""
browser_config = BrowserConfig(headless=True)
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
proxy_session_id="test-session",
page_timeout=15000
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url=self.test_url, config=config)
# Should work without errors (no proxy strategy means no proxy)
assert result is not None
@pytest.mark.asyncio
async def test_crawler_sticky_session_basic(self):
"""Test basic sticky session with crawler."""
strategy = RoundRobinProxyStrategy(self.proxies)
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
proxy_rotation_strategy=strategy,
proxy_session_id="integration-test",
page_timeout=10000
)
browser_config = BrowserConfig(headless=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
# First request
try:
result1 = await crawler.arun(url=self.test_url, config=config)
except Exception:
pass # Proxy connection may fail, but session should be tracked
# Verify session was created
session_proxy = strategy.get_session_proxy("integration-test")
assert session_proxy is not None
# Cleanup
await strategy.release_session("integration-test")
@pytest.mark.asyncio
async def test_crawler_rotating_vs_sticky(self):
"""Compare rotating behavior vs sticky session behavior."""
strategy = RoundRobinProxyStrategy(self.proxies)
# Config WITHOUT sticky session - should rotate
rotating_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
proxy_rotation_strategy=strategy,
page_timeout=5000
)
# Config WITH sticky session - should use same proxy
sticky_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
proxy_rotation_strategy=strategy,
proxy_session_id="sticky-test",
page_timeout=5000
)
browser_config = BrowserConfig(headless=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
# Track proxy configs used
rotating_proxies = []
sticky_proxies = []
# Try rotating requests (may fail due to test proxies, but config should be set)
for _ in range(3):
try:
await crawler.arun(url=self.test_url, config=rotating_config)
except Exception:
pass
rotating_proxies.append(rotating_config.proxy_config.server if rotating_config.proxy_config else None)
# Try sticky requests
for _ in range(3):
try:
await crawler.arun(url=self.test_url, config=sticky_config)
except Exception:
pass
sticky_proxies.append(sticky_config.proxy_config.server if sticky_config.proxy_config else None)
# Rotating should have different proxies (or cycle through them)
# Sticky should have same proxy for all requests
if all(sticky_proxies):
assert len(set(sticky_proxies)) == 1, "Sticky session should use same proxy"
await strategy.release_session("sticky-test")
class TestStickySessionRealWorld:
"""Real-world scenario tests for sticky sessions.
Note: These tests require actual proxy servers to verify IP consistency.
They are marked to be skipped if no proxy is configured.
"""
@pytest.mark.asyncio
@pytest.mark.skipif(
not os.environ.get('TEST_PROXY_1'),
reason="Requires TEST_PROXY_1 environment variable"
)
async def test_verify_ip_consistency(self):
"""Verify that sticky session actually uses same IP.
This test requires real proxies set in environment variables:
TEST_PROXY_1=ip:port:user:pass
TEST_PROXY_2=ip:port:user:pass
"""
import re
# Load proxies from environment
proxy_strs = [
os.environ.get('TEST_PROXY_1', ''),
os.environ.get('TEST_PROXY_2', '')
]
proxies = [ProxyConfig.from_string(p) for p in proxy_strs if p]
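# ProxyConfig.from_string parses the "ip:port:user:pass" format described in the docstring above.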
if len(proxies) < 2:
pytest.skip("Need at least 2 proxies for this test")
strategy = RoundRobinProxyStrategy(proxies)
# Config WITH sticky session
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
proxy_rotation_strategy=strategy,
proxy_session_id="ip-verify-session",
page_timeout=30000
)
browser_config = BrowserConfig(headless=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
ips = []
for i in range(3):
result = await crawler.arun(
url="https://httpbin.org/ip",
config=config
)
if result and result.success and result.html:
# Extract IP from response
ip_match = re.search(r'"origin":\s*"([^"]+)"', result.html)
if ip_match:
ips.append(ip_match.group(1))
await strategy.release_session("ip-verify-session")
# All IPs should be same for sticky session
if len(ips) >= 2:
assert len(set(ips)) == 1, f"Expected same IP, got: {ips}"
# ==================== STANDALONE TEST FUNCTIONS ====================
@pytest.mark.asyncio
async def test_sticky_session_simple():
"""Simple test for sticky session functionality."""
proxies = [
ProxyConfig(server=f"http://proxy{i}.test:8080")
for i in range(3)
]
strategy = RoundRobinProxyStrategy(proxies)
# Same session should return same proxy
p1 = await strategy.get_proxy_for_session("test")
p2 = await strategy.get_proxy_for_session("test")
p3 = await strategy.get_proxy_for_session("test")
assert p1.server == p2.server == p3.server
print(f"Sticky session works! All requests use: {p1.server}")
# Cleanup
await strategy.release_session("test")
if __name__ == "__main__":
print("Running Sticky Session tests...")
print("=" * 50)
asyncio.run(test_sticky_session_simple())
print("\n" + "=" * 50)
print("To run the full pytest suite, use: pytest " + __file__)
print("=" * 50)


@@ -0,0 +1,236 @@
"""Integration tests for prefetch mode with the crawler."""
import pytest
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
# Use crawl4ai docs as test domain
TEST_DOMAIN = "https://docs.crawl4ai.com"
class TestPrefetchModeIntegration:
"""Integration tests for prefetch mode."""
@pytest.mark.asyncio
async def test_prefetch_returns_html_and_links(self):
"""Test that prefetch mode returns HTML and links only."""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(prefetch=True)
result = await crawler.arun(TEST_DOMAIN, config=config)
# Should have HTML
assert result.html is not None
assert len(result.html) > 0
assert "<html" in result.html.lower() or "<!doctype" in result.html.lower()
# Should have links
assert result.links is not None
assert "internal" in result.links
assert "external" in result.links
# Should NOT have processed content
assert result.markdown is None or (
hasattr(result.markdown, 'raw_markdown') and
result.markdown.raw_markdown is None
)
assert result.cleaned_html is None
assert result.extracted_content is None
@pytest.mark.asyncio
async def test_prefetch_preserves_metadata(self):
"""Test that prefetch mode preserves essential metadata."""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(prefetch=True)
result = await crawler.arun(TEST_DOMAIN, config=config)
# Should have success flag
assert result.success is True
# Should have URL
assert result.url is not None
# Status code should be present
assert result.status_code is not None
@pytest.mark.asyncio
async def test_prefetch_with_deep_crawl(self):
"""Test prefetch mode with deep crawl strategy."""
from crawl4ai import BFSDeepCrawlStrategy
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
prefetch=True,
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=1,
max_pages=3
)
)
result_container = await crawler.arun(TEST_DOMAIN, config=config)
# Handle both list and iterator results
if hasattr(result_container, '__aiter__'):
results = [r async for r in result_container]
else:
results = list(result_container) if hasattr(result_container, '__iter__') else [result_container]
# Each result should have HTML and links
for result in results:
assert result.html is not None
assert result.links is not None
# Should have crawled at least one page
assert len(results) >= 1
@pytest.mark.asyncio
async def test_prefetch_then_process_with_raw(self):
"""Test the full two-phase workflow: prefetch then process."""
async with AsyncWebCrawler() as crawler:
# Phase 1: Prefetch
prefetch_config = CrawlerRunConfig(prefetch=True)
prefetch_result = await crawler.arun(TEST_DOMAIN, config=prefetch_config)
stored_html = prefetch_result.html
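# In a real two-phase pipeline this HTML would typically be persisted (disk, DB, queue)
# between phases; here we simply keep it in memory.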
assert stored_html is not None
assert len(stored_html) > 0
# Phase 2: Process with raw: URL
process_config = CrawlerRunConfig(
# No prefetch - full processing
base_url=TEST_DOMAIN # Provide base URL for link resolution
)
processed_result = await crawler.arun(
f"raw:{stored_html}",
config=process_config
)
# Should now have full processing
assert processed_result.html is not None
assert processed_result.success is True
# Note: cleaned_html and markdown depend on the content
@pytest.mark.asyncio
async def test_prefetch_links_structure(self):
"""Test that links have the expected structure."""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(prefetch=True)
result = await crawler.arun(TEST_DOMAIN, config=config)
assert result.links is not None
# Check internal links structure
if result.links["internal"]:
link = result.links["internal"][0]
assert "href" in link
assert "text" in link
assert link["href"].startswith("http")
# Check external links structure (if any)
if result.links["external"]:
link = result.links["external"][0]
assert "href" in link
assert "text" in link
assert link["href"].startswith("http")
@pytest.mark.asyncio
async def test_prefetch_config_clone(self):
"""Test that config.clone() preserves prefetch setting."""
config = CrawlerRunConfig(prefetch=True)
cloned = config.clone()
assert cloned.prefetch == True
# Clone with override
cloned_false = config.clone(prefetch=False)
assert cloned_false.prefetch == False
@pytest.mark.asyncio
async def test_prefetch_to_dict(self):
"""Test that to_dict() includes prefetch."""
config = CrawlerRunConfig(prefetch=True)
config_dict = config.to_dict()
assert "prefetch" in config_dict
assert config_dict["prefetch"] == True
@pytest.mark.asyncio
async def test_prefetch_default_false(self):
"""Test that prefetch defaults to False."""
config = CrawlerRunConfig()
assert config.prefetch == False
@pytest.mark.asyncio
async def test_prefetch_explicit_false(self):
"""Test explicit prefetch=False works like default."""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(prefetch=False)
result = await crawler.arun(TEST_DOMAIN, config=config)
# Should have full processing
assert result.html is not None
# cleaned_html should be populated in normal mode
assert result.cleaned_html is not None
class TestPrefetchPerformance:
"""Performance-related tests for prefetch mode."""
@pytest.mark.asyncio
async def test_prefetch_returns_quickly(self):
"""Test that prefetch mode returns results quickly."""
import time
async with AsyncWebCrawler() as crawler:
# Prefetch mode
start = time.time()
prefetch_config = CrawlerRunConfig(prefetch=True)
await crawler.arun(TEST_DOMAIN, config=prefetch_config)
prefetch_time = time.time() - start
# Full mode
start = time.time()
full_config = CrawlerRunConfig()
await crawler.arun(TEST_DOMAIN, config=full_config)
full_time = time.time() - start
# Log times for debugging
print(f"\nPrefetch: {prefetch_time:.3f}s, Full: {full_time:.3f}s")
# Prefetch should not be significantly slower
# (may be same or slightly faster depending on content)
# This is a soft check - mostly for logging
class TestPrefetchWithRawHTML:
"""Test prefetch mode with raw HTML input."""
@pytest.mark.asyncio
async def test_prefetch_with_raw_html(self):
"""Test prefetch mode works with raw: URL scheme."""
sample_html = """
<html>
<head><title>Test Page</title></head>
<body>
<h1>Hello World</h1>
<a href="/link1">Link 1</a>
<a href="/link2">Link 2</a>
<a href="https://external.com/page">External</a>
</body>
</html>
"""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
prefetch=True,
base_url="https://example.com"
)
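# base_url lets the relative hrefs in the raw snippet resolve to absolute URLs and be classified as internal links.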
result = await crawler.arun(f"raw:{sample_html}", config=config)
assert result.success is True
assert result.html is not None
assert result.links is not None
# Should have extracted links
assert len(result.links["internal"]) >= 2
assert len(result.links["external"]) >= 1

tests/test_prefetch_mode.py

@@ -0,0 +1,275 @@
"""Unit tests for the quick_extract_links function used in prefetch mode."""
import pytest
from crawl4ai.utils import quick_extract_links
class TestQuickExtractLinks:
"""Unit tests for the quick_extract_links function."""
def test_basic_internal_links(self):
"""Test extraction of internal links."""
html = '''
<html>
<body>
<a href="/page1">Page 1</a>
<a href="/page2">Page 2</a>
<a href="https://example.com/page3">Page 3</a>
</body>
</html>
'''
result = quick_extract_links(html, "https://example.com")
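# Relative hrefs should be resolved against the base URL and grouped under "internal".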
assert len(result["internal"]) == 3
assert result["internal"][0]["href"] == "https://example.com/page1"
assert result["internal"][0]["text"] == "Page 1"
def test_external_links(self):
"""Test extraction and classification of external links."""
html = '''
<html>
<body>
<a href="https://other.com/page">External</a>
<a href="/internal">Internal</a>
</body>
</html>
'''
result = quick_extract_links(html, "https://example.com")
assert len(result["internal"]) == 1
assert len(result["external"]) == 1
assert result["external"][0]["href"] == "https://other.com/page"
def test_ignores_javascript_and_mailto(self):
"""Test that javascript: and mailto: links are ignored."""
html = '''
<html>
<body>
<a href="javascript:void(0)">Click</a>
<a href="mailto:test@example.com">Email</a>
<a href="tel:+1234567890">Call</a>
<a href="/valid">Valid</a>
</body>
</html>
'''
result = quick_extract_links(html, "https://example.com")
assert len(result["internal"]) == 1
assert result["internal"][0]["href"] == "https://example.com/valid"
def test_ignores_anchor_only_links(self):
"""Test that anchor-only links (#section) are ignored."""
html = '''
<html>
<body>
<a href="#section1">Section 1</a>
<a href="#section2">Section 2</a>
<a href="/page#section">Page with anchor</a>
</body>
</html>
'''
result = quick_extract_links(html, "https://example.com")
# Only the page link should be included, anchor-only links are skipped
assert len(result["internal"]) == 1
assert "/page" in result["internal"][0]["href"]
def test_deduplication(self):
"""Test that duplicate URLs are deduplicated."""
html = '''
<html>
<body>
<a href="/page">Link 1</a>
<a href="/page">Link 2</a>
<a href="/page">Link 3</a>
</body>
</html>
'''
result = quick_extract_links(html, "https://example.com")
assert len(result["internal"]) == 1
def test_handles_malformed_html(self):
"""Test graceful handling of malformed HTML."""
html = "not valid html at all <><><"
result = quick_extract_links(html, "https://example.com")
# Should not raise, should return empty
assert result["internal"] == []
assert result["external"] == []
def test_empty_html(self):
"""Test handling of empty HTML."""
result = quick_extract_links("", "https://example.com")
assert result == {"internal": [], "external": []}
def test_relative_url_resolution(self):
"""Test that relative URLs are resolved correctly."""
html = '''
<html>
<body>
<a href="page1.html">Relative</a>
<a href="./page2.html">Dot Relative</a>
<a href="../page3.html">Parent Relative</a>
</body>
</html>
'''
result = quick_extract_links(html, "https://example.com/docs/")
assert len(result["internal"]) >= 1
# All should be internal and properly resolved
for link in result["internal"]:
assert link["href"].startswith("https://example.com")
def test_text_truncation(self):
"""Test that long link text is truncated to 200 chars."""
long_text = "A" * 300
html = f'''
<html>
<body>
<a href="/page">{long_text}</a>
</body>
</html>
'''
result = quick_extract_links(html, "https://example.com")
assert len(result["internal"]) == 1
assert len(result["internal"][0]["text"]) == 200
def test_empty_href_ignored(self):
"""Test that empty href attributes are ignored."""
html = '''
<html>
<body>
<a href="">Empty</a>
<a href=" ">Whitespace</a>
<a href="/valid">Valid</a>
</body>
</html>
'''
result = quick_extract_links(html, "https://example.com")
assert len(result["internal"]) == 1
assert result["internal"][0]["href"] == "https://example.com/valid"
def test_mixed_internal_external(self):
"""Test correct classification of mixed internal and external links."""
html = '''
<html>
<body>
<a href="/internal1">Internal 1</a>
<a href="https://example.com/internal2">Internal 2</a>
<a href="https://google.com">Google</a>
<a href="https://github.com/repo">GitHub</a>
<a href="/internal3">Internal 3</a>
</body>
</html>
'''
result = quick_extract_links(html, "https://example.com")
assert len(result["internal"]) == 3
assert len(result["external"]) == 2
def test_subdomain_handling(self):
"""Test that subdomains are handled correctly."""
html = '''
<html>
<body>
<a href="https://docs.example.com/page">Docs subdomain</a>
<a href="https://api.example.com/v1">API subdomain</a>
<a href="https://example.com/main">Main domain</a>
</body>
</html>
'''
result = quick_extract_links(html, "https://example.com")
# Subdomain links may be classified as internal or external depending on base-domain matching;
# just confirm all three links were extracted
total_links = len(result["internal"]) + len(result["external"])
assert total_links == 3
class TestQuickExtractLinksEdgeCases:
"""Edge case tests for quick_extract_links."""
def test_no_links_in_page(self):
"""Test page with no links."""
html = '''
<html>
<body>
<h1>No Links Here</h1>
<p>Just some text content.</p>
</body>
</html>
'''
result = quick_extract_links(html, "https://example.com")
assert result["internal"] == []
assert result["external"] == []
def test_links_in_nested_elements(self):
"""Test links nested in various elements."""
html = '''
<html>
<body>
<nav>
<ul>
<li><a href="/home">Home</a></li>
<li><a href="/about">About</a></li>
</ul>
</nav>
<div class="content">
<p>Check out <a href="/products">our products</a>.</p>
</div>
</body>
</html>
'''
result = quick_extract_links(html, "https://example.com")
assert len(result["internal"]) == 3
def test_link_with_nested_elements(self):
"""Test links containing nested elements."""
html = '''
<html>
<body>
<a href="/page"><span>Nested</span> <strong>Text</strong></a>
</body>
</html>
'''
result = quick_extract_links(html, "https://example.com")
assert len(result["internal"]) == 1
assert "Nested" in result["internal"][0]["text"]
assert "Text" in result["internal"][0]["text"]
def test_protocol_relative_urls(self):
"""Test handling of protocol-relative URLs (//example.com)."""
html = '''
<html>
<body>
<a href="//cdn.example.com/asset">CDN Link</a>
</body>
</html>
'''
result = quick_extract_links(html, "https://example.com")
# Should be resolved with https:
total = len(result["internal"]) + len(result["external"])
assert total >= 1
def test_whitespace_in_href(self):
"""Test handling of whitespace around href values."""
html = '''
<html>
<body>
<a href=" /page1 ">Padded</a>
<a href="
/page2
">Multiline</a>
</body>
</html>
'''
result = quick_extract_links(html, "https://example.com")
# Both should be extracted and normalized
assert len(result["internal"]) >= 1


@@ -0,0 +1,232 @@
"""Regression tests to ensure prefetch mode doesn't break existing functionality."""
import pytest
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
TEST_URL = "https://docs.crawl4ai.com"
class TestNoRegressions:
"""Ensure prefetch mode doesn't break existing functionality."""
@pytest.mark.asyncio
async def test_default_mode_unchanged(self):
"""Test that default mode (prefetch=False) works exactly as before."""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig() # Default config
result = await crawler.arun(TEST_URL, config=config)
# All standard fields should be populated
assert result.html is not None
assert result.cleaned_html is not None
assert result.links is not None
assert result.success is True
@pytest.mark.asyncio
async def test_explicit_prefetch_false(self):
"""Test explicit prefetch=False works like default."""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(prefetch=False)
result = await crawler.arun(TEST_URL, config=config)
assert result.cleaned_html is not None
@pytest.mark.asyncio
async def test_config_clone_preserves_prefetch(self):
"""Test that config.clone() preserves prefetch setting."""
config = CrawlerRunConfig(prefetch=True)
cloned = config.clone()
assert cloned.prefetch == True
# Clone with override
cloned_false = config.clone(prefetch=False)
assert cloned_false.prefetch == False
@pytest.mark.asyncio
async def test_config_to_dict_includes_prefetch(self):
"""Test that to_dict() includes prefetch."""
config_true = CrawlerRunConfig(prefetch=True)
config_false = CrawlerRunConfig(prefetch=False)
assert config_true.to_dict()["prefetch"] == True
assert config_false.to_dict()["prefetch"] == False
@pytest.mark.asyncio
async def test_existing_extraction_still_works(self):
"""Test that extraction strategies still work in normal mode."""
from crawl4ai import JsonCssExtractionStrategy
schema = {
"name": "Links",
"baseSelector": "a",
"fields": [
{"name": "href", "selector": "", "type": "attribute", "attribute": "href"},
{"name": "text", "selector": "", "type": "text"}
]
}
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
extraction_strategy=JsonCssExtractionStrategy(schema=schema)
)
result = await crawler.arun(TEST_URL, config=config)
assert result.extracted_content is not None
@pytest.mark.asyncio
async def test_existing_deep_crawl_still_works(self):
"""Test that deep crawl without prefetch still does full processing."""
from crawl4ai import BFSDeepCrawlStrategy
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=1,
max_pages=2
)
# No prefetch - should do full processing
)
result_container = await crawler.arun(TEST_URL, config=config)
# Handle both list and iterator results
if hasattr(result_container, '__aiter__'):
results = [r async for r in result_container]
else:
results = list(result_container) if hasattr(result_container, '__iter__') else [result_container]
# Each result should have full processing
for result in results:
assert result.cleaned_html is not None
assert len(results) >= 1
@pytest.mark.asyncio
async def test_raw_url_scheme_still_works(self):
"""Test that raw: URL scheme works for processing stored HTML."""
sample_html = """
<html>
<head><title>Test Page</title></head>
<body>
<h1>Hello World</h1>
<p>This is a test paragraph.</p>
<a href="/link1">Link 1</a>
</body>
</html>
"""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig()
result = await crawler.arun(f"raw:{sample_html}", config=config)
assert result.success is True
assert result.html is not None
assert "Hello World" in result.html
assert result.cleaned_html is not None
@pytest.mark.asyncio
async def test_screenshot_still_works(self):
"""Test that screenshot option still works in normal mode."""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(screenshot=True)
result = await crawler.arun(TEST_URL, config=config)
assert result.success is True
# Screenshot data should be present
assert result.screenshot is not None or result.screenshot_data is not None
@pytest.mark.asyncio
async def test_js_execution_still_works(self):
"""Test that JavaScript execution still works in normal mode."""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
js_code="document.querySelector('h1')?.textContent"
)
result = await crawler.arun(TEST_URL, config=config)
assert result.success is True
assert result.html is not None
class TestPrefetchDoesNotAffectOtherModes:
"""Test that prefetch doesn't interfere with other configurations."""
@pytest.mark.asyncio
async def test_prefetch_with_other_options_ignored(self):
"""Test that other options are properly ignored in prefetch mode."""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
prefetch=True,
# These should be ignored in prefetch mode
screenshot=True,
pdf=True,
only_text=True,
word_count_threshold=100
)
result = await crawler.arun(TEST_URL, config=config)
# Should still return HTML and links
assert result.html is not None
assert result.links is not None
# But should NOT have processed content
assert result.cleaned_html is None
assert result.extracted_content is None
@pytest.mark.asyncio
async def test_stream_mode_still_works(self):
"""Test that stream mode still works normally."""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(stream=True)
result = await crawler.arun(TEST_URL, config=config)
assert result.success is True
assert result.html is not None
@pytest.mark.asyncio
async def test_cache_mode_still_works(self):
"""Test that cache mode still works normally."""
from crawl4ai import CacheMode
async with AsyncWebCrawler() as crawler:
# First request - bypass cache
config1 = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
result1 = await crawler.arun(TEST_URL, config=config1)
assert result1.success is True
# Second request - should work
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
result2 = await crawler.arun(TEST_URL, config=config2)
assert result2.success is True
class TestBackwardsCompatibility:
"""Test backwards compatibility with existing code patterns."""
@pytest.mark.asyncio
async def test_config_without_prefetch_works(self):
"""Test that configs created without prefetch parameter work."""
# Simulating old code that doesn't know about prefetch
config = CrawlerRunConfig(
word_count_threshold=50,
css_selector="body"
)
# Should default to prefetch=False
assert config.prefetch == False
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(TEST_URL, config=config)
assert result.success is True
assert result.cleaned_html is not None
@pytest.mark.asyncio
async def test_from_kwargs_without_prefetch(self):
"""Test CrawlerRunConfig.from_kwargs works without prefetch."""
config = CrawlerRunConfig.from_kwargs({
"word_count_threshold": 50,
"verbose": False
})
assert config.prefetch == False


@@ -0,0 +1,172 @@
"""
Tests for raw:/file:// URL browser pipeline support.
Tests the new feature that allows js_code, wait_for, and other browser operations
to work with raw: and file:// URLs by routing them through _crawl_web() with
set_content() instead of goto().
"""
import pytest
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
@pytest.mark.asyncio
async def test_raw_html_fast_path():
"""Test that raw: without browser params returns HTML directly (fast path)."""
html = "<html><body><div id='test'>Original Content</div></body></html>"
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig() # No browser params
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
assert "Original Content" in result.html
# Fast path should not modify the HTML
assert result.html == html
@pytest.mark.asyncio
async def test_js_code_on_raw_html():
"""Test that js_code executes on raw: HTML and modifies the DOM."""
html = "<html><body><div id='test'>Original</div></body></html>"
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
js_code="document.getElementById('test').innerText = 'Modified by JS'"
)
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
assert "Modified by JS" in result.html
assert "Original" not in result.html or "Modified by JS" in result.html
@pytest.mark.asyncio
async def test_js_code_adds_element_to_raw_html():
"""Test that js_code can add new elements to raw: HTML."""
html = "<html><body><div id='container'></div></body></html>"
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
js_code='document.getElementById("container").innerHTML = "<span id=\'injected\'>Custom Content</span>"'
)
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
assert "injected" in result.html
assert "Custom Content" in result.html
@pytest.mark.asyncio
async def test_screenshot_on_raw_html():
"""Test that screenshots work on raw: HTML."""
html = "<html><body><h1 style='color:red;font-size:48px;'>Screenshot Test</h1></body></html>"
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(screenshot=True)
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
assert result.screenshot is not None
assert len(result.screenshot) > 100 # Should have substantial screenshot data
@pytest.mark.asyncio
async def test_process_in_browser_flag():
"""Test that process_in_browser=True forces browser path even without other params."""
html = "<html><body><div>Test</div></body></html>"
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(process_in_browser=True)
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
# Browser path normalizes HTML, so it may be slightly different
assert "Test" in result.html
@pytest.mark.asyncio
async def test_raw_prefix_variations():
"""Test both raw: and raw:// prefix formats."""
html = "<html><body>Content</body></html>"
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
js_code='document.body.innerHTML += "<div id=\'added\'>Added</div>"'
)
# Test raw: prefix
result1 = await crawler.arun(f"raw:{html}", config=config)
assert result1.success
assert "Added" in result1.html
# Test raw:// prefix
result2 = await crawler.arun(f"raw://{html}", config=config)
assert result2.success
assert "Added" in result2.html
@pytest.mark.asyncio
async def test_wait_for_on_raw_html():
"""Test that wait_for works with raw: HTML after js_code modifies DOM."""
html = "<html><body><div id='container'></div></body></html>"
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
js_code='''
setTimeout(() => {
document.getElementById('container').innerHTML = '<div id="delayed">Delayed Content</div>';
}, 100);
''',
wait_for="#delayed",
wait_for_timeout=5000
)
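# wait_for should poll for "#delayed" until the setTimeout callback injects it (or the 5s timeout elapses).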
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
assert "Delayed Content" in result.html
@pytest.mark.asyncio
async def test_multiple_js_code_scripts():
"""Test that multiple js_code scripts execute in order."""
html = "<html><body><div id='counter'>0</div></body></html>"
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
js_code=[
"document.getElementById('counter').innerText = '1'",
"document.getElementById('counter').innerText = parseInt(document.getElementById('counter').innerText) + 1",
"document.getElementById('counter').innerText = parseInt(document.getElementById('counter').innerText) + 1",
]
)
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
assert ">3<" in result.html # Counter should be 3 after all scripts run
if __name__ == "__main__":
# Run a quick manual test
async def quick_test():
html = "<html><body><div id='test'>Original</div></body></html>"
async with AsyncWebCrawler(verbose=True) as crawler:
# Test 1: Fast path
print("\n=== Test 1: Fast path (no browser params) ===")
result1 = await crawler.arun(f"raw:{html}")
print(f"Success: {result1.success}")
print(f"HTML contains 'Original': {'Original' in result1.html}")
# Test 2: js_code modifies DOM
print("\n=== Test 2: js_code modifies DOM ===")
config = CrawlerRunConfig(
js_code="document.getElementById('test').innerText = 'Modified by JS'"
)
result2 = await crawler.arun(f"raw:{html}", config=config)
print(f"Success: {result2.success}")
print(f"HTML contains 'Modified by JS': {'Modified by JS' in result2.html}")
print(f"HTML snippet: {result2.html[:500]}...")
asyncio.run(quick_test())


@@ -0,0 +1,563 @@
"""
BRUTAL edge case tests for raw:/file:// URL browser pipeline.
These tests try to break the system with tricky inputs, edge cases,
and compatibility checks to ensure we didn't break existing functionality.
"""
import pytest
import asyncio
import tempfile
import os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
# ============================================================================
# EDGE CASE: Hash characters in HTML (previously broke urlparse - Issue #283)
# ============================================================================
@pytest.mark.asyncio
async def test_raw_html_with_hash_in_css():
"""Test that # in CSS colors doesn't break HTML parsing (regression for #283)."""
html = """
<html>
<head>
<style>
body { background-color: #ff5733; color: #333333; }
.highlight { border: 1px solid #000; }
</style>
</head>
<body>
<div class="highlight" style="color: #ffffff;">Content with hash colors</div>
</body>
</html>
"""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(js_code="document.body.innerHTML += '<div id=\"added\">Added</div>'")
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
assert "#ff5733" in result.html or "ff5733" in result.html # Color should be preserved
assert "Added" in result.html # JS executed
assert "Content with hash colors" in result.html # Original content preserved
@pytest.mark.asyncio
async def test_raw_html_with_fragment_links():
"""Test HTML with # fragment links doesn't break."""
html = """
<html><body>
<a href="#section1">Go to section 1</a>
<a href="#section2">Go to section 2</a>
<div id="section1">Section 1</div>
<div id="section2">Section 2</div>
</body></html>
"""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(js_code="document.getElementById('section1').innerText = 'Modified Section 1'")
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
assert "Modified Section 1" in result.html
assert "#section2" in result.html # Fragment link preserved
# ============================================================================
# EDGE CASE: Special characters and unicode
# ============================================================================
@pytest.mark.asyncio
async def test_raw_html_with_unicode():
"""Test raw HTML with various unicode characters."""
html = """
<html><body>
<div id="unicode">日本語 中文 한국어 العربية 🎉 💻 🚀</div>
<div id="special">&amp; &lt; &gt; &quot; &apos;</div>
</body></html>
"""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(js_code="document.getElementById('unicode').innerText += ' ✅ Modified'")
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
assert "✅ Modified" in result.html or "Modified" in result.html
# Check unicode is preserved
assert "日本語" in result.html or "&#" in result.html # Either preserved or encoded
@pytest.mark.asyncio
async def test_raw_html_with_script_tags():
"""Test raw HTML with existing script tags doesn't interfere with js_code."""
html = """
<html><body>
<div id="counter">0</div>
<script>
// This script runs on page load
document.getElementById('counter').innerText = '10';
</script>
</body></html>
"""
async with AsyncWebCrawler() as crawler:
# Our js_code runs AFTER the page scripts
config = CrawlerRunConfig(
js_code="document.getElementById('counter').innerText = parseInt(document.getElementById('counter').innerText) + 5"
)
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
# The embedded script sets it to 10, then our js_code adds 5
assert ">15<" in result.html or "15" in result.html
# ============================================================================
# EDGE CASE: Empty and malformed HTML
# ============================================================================
@pytest.mark.asyncio
async def test_raw_html_empty():
"""Test empty raw HTML."""
html = ""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(js_code="document.body.innerHTML = '<div>Added to empty</div>'")
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
assert "Added to empty" in result.html
@pytest.mark.asyncio
async def test_raw_html_minimal():
"""Test minimal HTML (just text, no tags)."""
html = "Just plain text, no HTML tags"
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(js_code="document.body.innerHTML += '<div id=\"injected\">Injected</div>'")
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
# Browser should wrap it in proper HTML
assert "Injected" in result.html
@pytest.mark.asyncio
async def test_raw_html_malformed():
"""Test malformed HTML with unclosed tags."""
html = "<html><body><div><span>Unclosed tags<div>More content"
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(js_code="document.body.innerHTML += '<div id=\"valid\">Valid Added</div>'")
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
assert "Valid Added" in result.html
# Browser should have fixed the malformed HTML
# ============================================================================
# EDGE CASE: Very large HTML
# ============================================================================
@pytest.mark.asyncio
async def test_raw_html_large():
"""Test large raw HTML (100KB+)."""
# Generate 100KB of HTML
items = "".join([f'<div class="item" id="item-{i}">Item {i} content here with some text</div>\n' for i in range(2000)])
html = f"<html><body>{items}</body></html>"
assert len(html) > 100000 # Verify it's actually large
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
js_code="document.getElementById('item-999').innerText = 'MODIFIED ITEM 999'"
)
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
assert "MODIFIED ITEM 999" in result.html
assert "item-1999" in result.html # Last item should still exist
# ============================================================================
# EDGE CASE: JavaScript errors and timeouts
# ============================================================================
@pytest.mark.asyncio
async def test_raw_html_js_error_doesnt_crash():
"""Test that JavaScript errors in js_code don't crash the crawl."""
html = "<html><body><div id='test'>Original</div></body></html>"
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
js_code=[
"nonExistentFunction();", # This will throw an error
"document.getElementById('test').innerText = 'Still works'" # This should still run
]
)
result = await crawler.arun(f"raw:{html}", config=config)
# Crawl should succeed even with JS errors
assert result.success
@pytest.mark.asyncio
async def test_raw_html_wait_for_timeout():
"""Test wait_for with element that never appears times out gracefully."""
html = "<html><body><div id='test'>Original</div></body></html>"
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
wait_for="#never-exists",
wait_for_timeout=1000 # 1 second timeout
)
result = await crawler.arun(f"raw:{html}", config=config)
# Should timeout but still return the HTML we have
# The behavior might be success=False or success=True with partial content
# Either way, it shouldn't hang or crash
assert result is not None
# ============================================================================
# COMPATIBILITY: Normal HTTP URLs still work
# ============================================================================
@pytest.mark.asyncio
async def test_http_urls_still_work():
"""Ensure we didn't break normal HTTP URL crawling."""
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com")
assert result.success
assert "Example Domain" in result.html
@pytest.mark.asyncio
async def test_http_with_js_code_still_works():
"""Ensure HTTP URLs with js_code still work."""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
js_code="document.body.innerHTML += '<div id=\"injected\">Injected via JS</div>'"
)
result = await crawler.arun("https://example.com", config=config)
assert result.success
assert "Injected via JS" in result.html
# ============================================================================
# COMPATIBILITY: File URLs
# ============================================================================
@pytest.mark.asyncio
async def test_file_url_with_js_code():
"""Test file:// URLs with js_code execution."""
# Create a temp file
with tempfile.NamedTemporaryFile(mode='w', suffix='.html', delete=False) as f:
f.write("<html><body><div id='file-content'>File Content</div></body></html>")
temp_path = f.name
try:
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
js_code="document.getElementById('file-content').innerText = 'Modified File Content'"
)
result = await crawler.arun(f"file://{temp_path}", config=config)
assert result.success
assert "Modified File Content" in result.html
finally:
os.unlink(temp_path)
@pytest.mark.asyncio
async def test_file_url_fast_path():
"""Test file:// fast path (no browser params)."""
with tempfile.NamedTemporaryFile(mode='w', suffix='.html', delete=False) as f:
f.write("<html><body>Fast path file content</body></html>")
temp_path = f.name
try:
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(f"file://{temp_path}")
assert result.success
assert "Fast path file content" in result.html
finally:
os.unlink(temp_path)
# ============================================================================
# COMPATIBILITY: Extraction strategies with raw HTML
# ============================================================================
@pytest.mark.asyncio
async def test_raw_html_with_css_extraction():
"""Test CSS extraction on raw HTML after js_code modifies it."""
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
html = """
<html><body>
<div class="products">
<div class="product"><span class="name">Original Product</span></div>
</div>
</body></html>
"""
schema = {
"name": "Products",
"baseSelector": ".product",
"fields": [
{"name": "name", "selector": ".name", "type": "text"}
]
}
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
js_code="""
document.querySelector('.products').innerHTML +=
'<div class="product"><span class="name">JS Added Product</span></div>';
""",
extraction_strategy=JsonCssExtractionStrategy(schema)
)
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
# Check that extraction found both products
import json
extracted = json.loads(result.extracted_content)
names = [p.get('name', '') for p in extracted]
assert any("JS Added Product" in name for name in names)
# ============================================================================
# EDGE CASE: Concurrent raw: requests
# ============================================================================
@pytest.mark.asyncio
async def test_concurrent_raw_requests():
"""Test multiple concurrent raw: requests don't interfere."""
htmls = [
f"<html><body><div id='test'>Request {i}</div></body></html>"
for i in range(5)
]
async with AsyncWebCrawler() as crawler:
configs = [
CrawlerRunConfig(
js_code=f"document.getElementById('test').innerText += ' Modified {i}'"
)
for i in range(5)
]
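# Each page gets a distinct marker string, so any cross-page interference would surface in the per-result asserts below.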
# Run concurrently
tasks = [
crawler.arun(f"raw:{html}", config=config)
for html, config in zip(htmls, configs)
]
results = await asyncio.gather(*tasks)
for i, result in enumerate(results):
assert result.success
assert f"Request {i}" in result.html
assert f"Modified {i}" in result.html
# ============================================================================
# EDGE CASE: raw: with base_url for link resolution
# ============================================================================
@pytest.mark.asyncio
async def test_raw_html_with_base_url():
"""Test that base_url is used for link resolution in markdown."""
html = """
<html><body>
<a href="/page1">Page 1</a>
<a href="/page2">Page 2</a>
<img src="/images/logo.png" alt="Logo">
</body></html>
"""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
base_url="https://example.com",
process_in_browser=True # Force browser to test base_url handling
)
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
# Check markdown has absolute URLs
if result.markdown:
# Links should be absolute
md = result.markdown.raw_markdown if hasattr(result.markdown, 'raw_markdown') else str(result.markdown)
assert "example.com" in md or "/page1" in md
# ============================================================================
# EDGE CASE: raw: with screenshot of complex page
# ============================================================================
@pytest.mark.asyncio
async def test_raw_html_screenshot_complex_page():
"""Test screenshot of complex raw HTML with CSS and JS modifications."""
html = """
<html>
<head>
<style>
body { font-family: Arial; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 40px; }
.card { background: white; padding: 20px; border-radius: 10px; box-shadow: 0 4px 6px rgba(0,0,0,0.1); }
h1 { color: #333; }
</style>
</head>
<body>
<div class="card">
<h1 id="title">Original Title</h1>
<p>This is a test card with styling.</p>
</div>
</body>
</html>
"""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
js_code="document.getElementById('title').innerText = 'Modified Title'",
screenshot=True
)
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
assert result.screenshot is not None
assert len(result.screenshot) > 1000 # Should be substantial
assert "Modified Title" in result.html
# ============================================================================
# EDGE CASE: JavaScript that tries to navigate away
# ============================================================================
@pytest.mark.asyncio
async def test_raw_html_js_navigation_blocked():
"""Test that JS trying to navigate doesn't break the crawl."""
html = """
<html><body>
<div id="content">Original Content</div>
<script>
// Try to navigate away (should be blocked or handled)
// window.location.href = 'https://example.com';
</script>
</body></html>
"""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
# Try to navigate via js_code
js_code=[
"document.getElementById('content').innerText = 'Before navigation attempt'",
# Actual navigation attempt commented - would cause issues
# "window.location.href = 'https://example.com'",
]
)
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
assert "Before navigation attempt" in result.html
# ============================================================================
# EDGE CASE: Raw HTML with iframes
# ============================================================================
@pytest.mark.asyncio
async def test_raw_html_with_iframes():
"""Test raw HTML containing iframes."""
html = """
<html><body>
<div id="main">Main content</div>
<iframe id="frame1" srcdoc="<html><body><div id='iframe-content'>Iframe Content</div></body></html>"></iframe>
</body></html>
"""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
js_code="document.getElementById('main').innerText = 'Modified main'",
process_iframes=True
)
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
assert "Modified main" in result.html
# ============================================================================
# TRICKY: Protocol inside raw content
# ============================================================================
@pytest.mark.asyncio
async def test_raw_html_with_urls_inside():
"""Test raw: with http:// URLs inside the content."""
html = """
<html><body>
<a href="http://example.com">Example</a>
<a href="https://google.com">Google</a>
<img src="https://placekitten.com/200/300" alt="Cat">
<div id="test">Test content with URL: https://test.com</div>
</body></html>
"""
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
js_code="document.getElementById('test').innerText += ' - Modified'"
)
result = await crawler.arun(f"raw:{html}", config=config)
assert result.success
assert "Modified" in result.html
assert "http://example.com" in result.html or "example.com" in result.html
# ============================================================================
# TRICKY: Double raw: prefix
# ============================================================================
@pytest.mark.asyncio
async def test_double_raw_prefix():
"""Test what happens with double raw: prefix (edge case)."""
html = "<html><body>Content</body></html>"
async with AsyncWebCrawler() as crawler:
# raw:raw:<html>... - the second raw: becomes part of content
result = await crawler.arun(f"raw:raw:{html}")
# Should either handle gracefully or return "raw:<html>..." as content
assert result is not None
if __name__ == "__main__":
import sys
async def run_tests():
# Run a few key tests manually
tests = [
("Hash in CSS", test_raw_html_with_hash_in_css),
("Unicode", test_raw_html_with_unicode),
("Large HTML", test_raw_html_large),
("HTTP still works", test_http_urls_still_work),
("Concurrent requests", test_concurrent_raw_requests),
("Complex screenshot", test_raw_html_screenshot_complex_page),
]
for name, test_fn in tests:
print(f"\n=== Running: {name} ===")
try:
await test_fn()
print(f"{name} PASSED")
except Exception as e:
print(f"{name} FAILED: {e}")
import traceback
traceback.print_exc()
asyncio.run(run_tests())