Release v0.8.0: Crash Recovery, Prefetch Mode & Security Fixes (#1712)
* Fix: Use correct URL variable for raw HTML extraction (#1116) - Prevents full HTML content from being passed as URL to extraction strategies - Added unit tests to verify raw HTML and regular URL processing Fix: Wrong URL variable used for extraction of raw html * Fix #1181: Preserve whitespace in code blocks during HTML scraping The remove_empty_elements_fast() method was removing whitespace-only span elements inside <pre> and <code> tags, causing import statements like "import torch" to become "importtorch". Now skips elements inside code blocks where whitespace is significant. * Refactor Pydantic model configuration to use ConfigDict for arbitrary types * Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621 * Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638 * fix: ensure BrowserConfig.to_dict serializes proxy_config * feat: make LLM backoff configurable end-to-end - extend LLMConfig with backoff delay/attempt/factor fields and thread them through LLMExtractionStrategy, LLMContentFilter, table extraction, and Docker API handlers - expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff and document them in the md_v2 guides * reproduced AttributeError from #1642 * pass timeout parameter to docker client request * added missing deep crawling objects to init * generalized query in ContentRelevanceFilter to be a str or list * import modules from enhanceable deserialization * parameterized tests * Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268 * refactor: replace PyPDF2 with pypdf across the codebase. ref #1412 * Add browser_context_id and target_id parameters to BrowserConfig Enable Crawl4AI to connect to pre-created CDP browser contexts, which is essential for cloud browser services that pre-create isolated contexts. Changes: - Add browser_context_id and target_id parameters to BrowserConfig - Update from_kwargs() and to_dict() methods - Modify BrowserManager.start() to use existing context when provided - Add _get_page_by_target_id() helper method - Update get_page() to handle pre-existing targets - Add test for browser_context_id functionality This enables cloud services to: 1. Create isolated CDP contexts before Crawl4AI connects 2. Pass context/target IDs to BrowserConfig 3. Have Crawl4AI reuse existing contexts instead of creating new ones * Add cdp_cleanup_on_close flag to prevent memory leaks in cloud/server scenarios * Fix: add cdp_cleanup_on_close to from_kwargs * Fix: find context by target_id for concurrent CDP connections * Fix: use target_id to find correct page in get_page * Fix: use CDP to find context by browserContextId for concurrent sessions * Revert context matching attempts - Playwright cannot see CDP-created contexts * Add create_isolated_context flag for concurrent CDP crawls When True, forces creation of a new browser context instead of reusing the default context. Essential for concurrent crawls on the same browser to prevent navigation conflicts. * Add context caching to create_isolated_context branch Uses contexts_by_config cache (same as non-CDP mode) to reuse contexts for multiple URLs with same config. Still creates new page per crawl for navigation isolation. Benefits batch/deep crawls. 
* Add init_scripts support to BrowserConfig for pre-page-load JS injection This adds the ability to inject JavaScript that runs before any page loads, useful for stealth evasions (canvas/audio fingerprinting, userAgentData). - Add init_scripts parameter to BrowserConfig (list of JS strings) - Apply init_scripts in setup_context() via context.add_init_script() - Update from_kwargs() and to_dict() for serialization * Fix CDP connection handling: support WS URLs and proper cleanup Changes to browser_manager.py: 1. _verify_cdp_ready(): Support multiple URL formats - WebSocket URLs (ws://, wss://): Skip HTTP verification, Playwright handles directly - HTTP URLs with query params: Properly parse with urlparse to preserve query string - Fixes issue where naive f"{cdp_url}/json/version" broke WS URLs and query params 2. close(): Proper cleanup when cdp_cleanup_on_close=True - Close all sessions (pages) - Close all contexts - Call browser.close() to disconnect (doesn't terminate browser, just releases connection) - Wait 1 second for CDP connection to fully release - Stop Playwright instance to prevent memory leaks This enables: - Connecting to specific browsers via WS URL - Reusing the same browser with multiple sequential connections - No user wait needed between connections (internal 1s delay handles it) Added tests/browser/test_cdp_cleanup_reuse.py with comprehensive tests. * Update gitignore * Some debugging for caching * Add _generate_screenshot_from_html for raw: and file:// URLs Implements the missing method that was being called but never defined. Now raw: and file:// URLs can generate screenshots by: 1. Loading HTML into a browser page via page.set_content() 2. Taking screenshot using existing take_screenshot() method 3. Cleaning up the page afterward This enables cached HTML to be rendered with screenshots in crawl4ai-cloud. * Add PDF and MHTML support for raw: and file:// URLs - Replace _generate_screenshot_from_html with _generate_media_from_html - New method handles screenshot, PDF, and MHTML in one browser session - Update raw: and file:// URL handlers to use new method - Enables cached HTML to generate all media types * Add crash recovery for deep crawl strategies Add optional resume_state and on_state_change parameters to all deep crawl strategies (BFS, DFS, Best-First) for cloud deployment crash recovery. Features: - resume_state: Pass saved state to resume from checkpoint - on_state_change: Async callback fired after each URL for real-time state persistence to external storage (Redis, DB, etc.) - export_state(): Get last captured state manually - Zero overhead when features are disabled (None defaults) State includes visited URLs, pending queue/stack, depths, and pages_crawled count. All state is JSON-serializable. * Fix: HTTP strategy raw: URL parsing truncates at # character The AsyncHTTPCrawlerStrategy.crawl() method used urlparse() to extract content from raw: URLs. This caused HTML with CSS color codes like #eee to be truncated because # is treated as a URL fragment delimiter. Before: raw:body{background:#eee} -> parsed.path = 'body{background:' After: raw:body{background:#eee} -> raw_content = 'body{background:#eee' Fix: Strip the raw: or raw:// prefix directly instead of using urlparse, matching how the browser strategy handles it. * Add base_url parameter to CrawlerRunConfig for raw HTML processing When processing raw: HTML (e.g., from cache), the URL parameter is meaningless for markdown link resolution. 
This adds a base_url parameter that can be set explicitly to provide proper URL resolution context. Changes: - Add base_url parameter to CrawlerRunConfig.__init__ - Add base_url to CrawlerRunConfig.from_kwargs - Update aprocess_html to use base_url for markdown generation Usage: config = CrawlerRunConfig(base_url='https://example.com') result = await crawler.arun(url='raw:{html}', config=config) * Add prefetch mode for two-phase deep crawling - Add `prefetch` parameter to CrawlerRunConfig - Add `quick_extract_links()` function for fast link extraction - Add short-circuit in aprocess_html() for prefetch mode - Add 42 tests (unit, integration, regression) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Updates on proxy rotation and proxy configuration * Add proxy support to HTTP crawler strategy * Add browser pipeline support for raw:/file:// URLs - Add process_in_browser parameter to CrawlerRunConfig - Route raw:/file:// URLs through _crawl_web() when browser operations needed - Use page.set_content() instead of goto() for local content - Fix cookie handling for non-HTTP URLs in browser_manager - Auto-detect browser requirements: js_code, wait_for, screenshot, etc. - Maintain fast path for raw:/file:// without browser params Fixes #310 * Add smart TTL cache for sitemap URL seeder - Add cache_ttl_hours and validate_sitemap_lastmod params to SeedingConfig - New JSON cache format with metadata (version, created_at, lastmod, url_count) - Cache validation by TTL expiry and sitemap lastmod comparison - Auto-migration from old .jsonl to new .json format - Fixes bug where incomplete cache was used indefinitely * Update URL seeder docs with smart TTL cache parameters - Add cache_ttl_hours and validate_sitemap_lastmod to parameter table - Document smart TTL cache validation with examples - Add cache-related troubleshooting entries - Update key features summary * Add MEMORY.md to gitignore * Docs: Add multi-sample schema generation section Add documentation explaining how to pass multiple HTML samples to generate_schema() for stable selectors that work across pages with varying DOM structures. Includes: - Problem explanation (fragile nth-child selectors) - Solution with code example - Key points for multi-sample queries - Comparison table of fragile vs stable selectors * Fix critical RCE and LFI vulnerabilities in Docker API deployment Security fixes for vulnerabilities reported by ProjectDiscovery: 1. Remote Code Execution via Hooks (CVE pending) - Remove __import__ from allowed_builtins in hook_manager.py - Prevents arbitrary module imports (os, subprocess, etc.) - Hooks now disabled by default via CRAWL4AI_HOOKS_ENABLED env var 2. Local File Inclusion via file:// URLs (CVE pending) - Add URL scheme validation to /execute_js, /screenshot, /pdf, /html - Block file://, javascript:, data: and other dangerous schemes - Only allow http://, https://, and raw: (where appropriate) 3. 
Security hardening - Add CRAWL4AI_HOOKS_ENABLED=false as default (opt-in for hooks) - Add security warning comments in config.yml - Add validate_url_scheme() helper for consistent validation Testing: - Add unit tests (test_security_fixes.py) - 16 tests - Add integration tests (run_security_tests.py) for live server Affected endpoints: - POST /crawl (hooks disabled by default) - POST /crawl/stream (hooks disabled by default) - POST /execute_js (URL validation added) - POST /screenshot (URL validation added) - POST /pdf (URL validation added) - POST /html (URL validation added) Breaking changes: - Hooks require CRAWL4AI_HOOKS_ENABLED=true to function - file:// URLs no longer work on API endpoints (use library directly) * Enhance authentication flow by implementing JWT token retrieval and adding authorization headers to API requests * Add release notes for v0.7.9, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates * Add release notes for v0.8.0, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates Documentation for v0.8.0 release: - SECURITY.md: Security policy and vulnerability reporting guidelines - RELEASE_NOTES_v0.8.0.md: Comprehensive release notes - migration/v0.8.0-upgrade-guide.md: Step-by-step migration guide - security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts - CHANGELOG.md: Updated with v0.8.0 changes Breaking changes documented: - Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED) - file:// URLs blocked on Docker API endpoints Security fixes credited to Neo by ProjectDiscovery * Add examples for deep crawl crash recovery and prefetch mode in documentation * Release v0.8.0: The v0.8.0 Update - Updated version to 0.8.0 - Added comprehensive demo and release notes - Updated all documentation * Update security researcher acknowledgment with a hyperlink for Neo by ProjectDiscovery * Add async agenerate_schema method for schema generation - Extract prompt building to shared _build_schema_prompt() method - Add agenerate_schema() async version using aperform_completion_with_backoff - Refactor generate_schema() to use shared prompt builder - Fixes Gemini/Vertex AI compatibility in async contexts (FastAPI) * Fix: Enable litellm.drop_params for O-series/GPT-5 model compatibility O-series (o1, o3) and GPT-5 models only support temperature=1. Setting litellm.drop_params=True auto-drops unsupported parameters instead of throwing UnsupportedParamsError. Fixes temperature=0.01 error for these models in LLM extraction. --------- Co-authored-by: rbushria <rbushri@gmail.com> Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com> Co-authored-by: Soham Kukreti <kukretisoham@gmail.com> Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com> Co-authored-by: unclecode <unclecode@kidocode.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
489
tests/browser/test_browser_context_id.py
Normal file
489
tests/browser/test_browser_context_id.py
Normal file
@@ -0,0 +1,489 @@
|
||||
"""Test for browser_context_id and target_id parameters.
|
||||
|
||||
These tests verify that Crawl4AI can connect to and use pre-created
|
||||
browser contexts, which is essential for cloud browser services that
|
||||
pre-create isolated contexts for each user.
|
||||
|
||||
The flow being tested:
|
||||
1. Start a browser with CDP
|
||||
2. Create a context via raw CDP commands (simulating cloud service)
|
||||
3. Create a page/target in that context
|
||||
4. Have Crawl4AI connect using browser_context_id and target_id
|
||||
5. Verify Crawl4AI uses the existing context/page instead of creating new ones
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import websockets
|
||||
|
||||
# Add the project root to Python path if running directly
|
||||
if __name__ == "__main__":
|
||||
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
|
||||
|
||||
from crawl4ai.browser_manager import BrowserManager, ManagedBrowser
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
from crawl4ai.async_logger import AsyncLogger
|
||||
|
||||
# Create a logger for clear terminal output
|
||||
logger = AsyncLogger(verbose=True, log_file=None)
|
||||
|
||||
|
||||
class CDPContextCreator:
|
||||
"""
|
||||
Helper class to create browser contexts via raw CDP commands.
|
||||
This simulates what a cloud browser service would do.
|
||||
"""
|
||||
|
||||
def __init__(self, cdp_url: str):
|
||||
self.cdp_url = cdp_url
|
||||
self._message_id = 0
|
||||
self._ws = None
|
||||
self._pending_responses = {}
|
||||
self._receiver_task = None
|
||||
|
||||
async def connect(self):
|
||||
"""Establish WebSocket connection to browser."""
|
||||
# Convert HTTP URL to WebSocket URL if needed
|
||||
ws_url = self.cdp_url.replace("http://", "ws://").replace("https://", "wss://")
|
||||
if not ws_url.endswith("/devtools/browser"):
|
||||
# Get the browser websocket URL from /json/version
|
||||
import aiohttp
|
||||
async with aiohttp.ClientSession() as session:
|
||||
async with session.get(f"{self.cdp_url}/json/version") as response:
|
||||
data = await response.json()
|
||||
ws_url = data.get("webSocketDebuggerUrl", ws_url)
|
||||
|
||||
self._ws = await websockets.connect(ws_url, max_size=None, ping_interval=None)
|
||||
self._receiver_task = asyncio.create_task(self._receive_messages())
|
||||
logger.info(f"Connected to CDP at {ws_url}", tag="CDP")
|
||||
|
||||
async def disconnect(self):
|
||||
"""Close WebSocket connection."""
|
||||
if self._receiver_task:
|
||||
self._receiver_task.cancel()
|
||||
try:
|
||||
await self._receiver_task
|
||||
except asyncio.CancelledError:
|
||||
pass
|
||||
if self._ws:
|
||||
await self._ws.close()
|
||||
self._ws = None
|
||||
|
||||
async def _receive_messages(self):
|
||||
"""Background task to receive CDP messages."""
|
||||
try:
|
||||
async for message in self._ws:
|
||||
data = json.loads(message)
|
||||
msg_id = data.get('id')
|
||||
if msg_id is not None and msg_id in self._pending_responses:
|
||||
self._pending_responses[msg_id].set_result(data)
|
||||
except asyncio.CancelledError:
|
||||
pass
|
||||
except Exception as e:
|
||||
logger.error(f"CDP receiver error: {e}", tag="CDP")
|
||||
|
||||
async def _send_command(self, method: str, params: dict = None) -> dict:
|
||||
"""Send CDP command and wait for response."""
|
||||
self._message_id += 1
|
||||
msg_id = self._message_id
|
||||
|
||||
message = {
|
||||
"id": msg_id,
|
||||
"method": method,
|
||||
"params": params or {}
|
||||
}
|
||||
|
||||
future = asyncio.get_event_loop().create_future()
|
||||
self._pending_responses[msg_id] = future
|
||||
|
||||
try:
|
||||
await self._ws.send(json.dumps(message))
|
||||
response = await asyncio.wait_for(future, timeout=30.0)
|
||||
|
||||
if 'error' in response:
|
||||
raise Exception(f"CDP error: {response['error']}")
|
||||
|
||||
return response.get('result', {})
|
||||
finally:
|
||||
self._pending_responses.pop(msg_id, None)
|
||||
|
||||
async def create_context(self) -> dict:
|
||||
"""
|
||||
Create an isolated browser context with a blank page.
|
||||
|
||||
Returns:
|
||||
dict with browser_context_id, target_id, and cdp_session_id
|
||||
"""
|
||||
await self.connect()
|
||||
|
||||
# 1. Create isolated browser context
|
||||
result = await self._send_command("Target.createBrowserContext", {
|
||||
"disposeOnDetach": False # Keep context alive
|
||||
})
|
||||
browser_context_id = result["browserContextId"]
|
||||
logger.info(f"Created browser context: {browser_context_id}", tag="CDP")
|
||||
|
||||
# 2. Create a new page (target) in the context
|
||||
result = await self._send_command("Target.createTarget", {
|
||||
"url": "about:blank",
|
||||
"browserContextId": browser_context_id
|
||||
})
|
||||
target_id = result["targetId"]
|
||||
logger.info(f"Created target: {target_id}", tag="CDP")
|
||||
|
||||
# 3. Attach to the target to get a session ID
|
||||
result = await self._send_command("Target.attachToTarget", {
|
||||
"targetId": target_id,
|
||||
"flatten": True
|
||||
})
|
||||
cdp_session_id = result["sessionId"]
|
||||
logger.info(f"Attached to target, sessionId: {cdp_session_id}", tag="CDP")
|
||||
|
||||
return {
|
||||
"browser_context_id": browser_context_id,
|
||||
"target_id": target_id,
|
||||
"cdp_session_id": cdp_session_id
|
||||
}
|
||||
|
||||
async def get_targets(self) -> list:
|
||||
"""Get list of all targets in the browser."""
|
||||
result = await self._send_command("Target.getTargets")
|
||||
return result.get("targetInfos", [])
|
||||
|
||||
async def dispose_context(self, browser_context_id: str):
|
||||
"""Dispose of a browser context."""
|
||||
try:
|
||||
await self._send_command("Target.disposeBrowserContext", {
|
||||
"browserContextId": browser_context_id
|
||||
})
|
||||
logger.info(f"Disposed browser context: {browser_context_id}", tag="CDP")
|
||||
except Exception as e:
|
||||
logger.warning(f"Error disposing context: {e}", tag="CDP")
|
||||
|
||||
|
||||
async def test_browser_context_id_basic():
|
||||
"""
|
||||
Test that BrowserConfig accepts browser_context_id and target_id parameters.
|
||||
"""
|
||||
logger.info("Testing BrowserConfig browser_context_id parameter", tag="TEST")
|
||||
|
||||
try:
|
||||
# Test that BrowserConfig accepts the new parameters
|
||||
config = BrowserConfig(
|
||||
cdp_url="http://localhost:9222",
|
||||
browser_context_id="test-context-id",
|
||||
target_id="test-target-id",
|
||||
headless=True
|
||||
)
|
||||
|
||||
# Verify parameters are set correctly
|
||||
assert config.browser_context_id == "test-context-id", "browser_context_id not set"
|
||||
assert config.target_id == "test-target-id", "target_id not set"
|
||||
|
||||
# Test from_kwargs
|
||||
config2 = BrowserConfig.from_kwargs({
|
||||
"cdp_url": "http://localhost:9222",
|
||||
"browser_context_id": "test-context-id-2",
|
||||
"target_id": "test-target-id-2"
|
||||
})
|
||||
|
||||
assert config2.browser_context_id == "test-context-id-2", "browser_context_id not set via from_kwargs"
|
||||
assert config2.target_id == "test-target-id-2", "target_id not set via from_kwargs"
|
||||
|
||||
# Test to_dict
|
||||
config_dict = config.to_dict()
|
||||
assert config_dict.get("browser_context_id") == "test-context-id", "browser_context_id not in to_dict"
|
||||
assert config_dict.get("target_id") == "test-target-id", "target_id not in to_dict"
|
||||
|
||||
logger.success("BrowserConfig browser_context_id test passed", tag="TEST")
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Test failed: {str(e)}", tag="TEST")
|
||||
return False
|
||||
|
||||
|
||||
async def test_pre_created_context_usage():
|
||||
"""
|
||||
Test that Crawl4AI uses a pre-created browser context instead of creating a new one.
|
||||
|
||||
This simulates the cloud browser service flow:
|
||||
1. Start browser with CDP
|
||||
2. Create context via raw CDP (simulating cloud service)
|
||||
3. Have Crawl4AI connect with browser_context_id
|
||||
4. Verify it uses existing context
|
||||
"""
|
||||
logger.info("Testing pre-created context usage", tag="TEST")
|
||||
|
||||
# Start a managed browser first
|
||||
browser_config_initial = BrowserConfig(
|
||||
use_managed_browser=True,
|
||||
headless=True,
|
||||
debugging_port=9226, # Use unique port
|
||||
verbose=True
|
||||
)
|
||||
|
||||
managed_browser = ManagedBrowser(browser_config=browser_config_initial, logger=logger)
|
||||
cdp_creator = None
|
||||
manager = None
|
||||
context_info = None
|
||||
|
||||
try:
|
||||
# Start the browser
|
||||
cdp_url = await managed_browser.start()
|
||||
logger.info(f"Browser started at {cdp_url}", tag="TEST")
|
||||
|
||||
# Create a context via raw CDP (simulating cloud service)
|
||||
cdp_creator = CDPContextCreator(cdp_url)
|
||||
context_info = await cdp_creator.create_context()
|
||||
|
||||
logger.info(f"Pre-created context: {context_info['browser_context_id']}", tag="TEST")
|
||||
logger.info(f"Pre-created target: {context_info['target_id']}", tag="TEST")
|
||||
|
||||
# Get initial target count
|
||||
targets_before = await cdp_creator.get_targets()
|
||||
initial_target_count = len(targets_before)
|
||||
logger.info(f"Initial target count: {initial_target_count}", tag="TEST")
|
||||
|
||||
# Now create BrowserManager with browser_context_id and target_id
|
||||
browser_config = BrowserConfig(
|
||||
cdp_url=cdp_url,
|
||||
browser_context_id=context_info['browser_context_id'],
|
||||
target_id=context_info['target_id'],
|
||||
headless=True,
|
||||
verbose=True
|
||||
)
|
||||
|
||||
manager = BrowserManager(browser_config=browser_config, logger=logger)
|
||||
await manager.start()
|
||||
|
||||
logger.info("BrowserManager started with pre-created context", tag="TEST")
|
||||
|
||||
# Get a page
|
||||
crawler_config = CrawlerRunConfig()
|
||||
page, context = await manager.get_page(crawler_config)
|
||||
|
||||
# Navigate to a test page
|
||||
await page.goto("https://example.com", wait_until="domcontentloaded")
|
||||
title = await page.title()
|
||||
|
||||
logger.info(f"Page title: {title}", tag="TEST")
|
||||
|
||||
# Get target count after
|
||||
targets_after = await cdp_creator.get_targets()
|
||||
final_target_count = len(targets_after)
|
||||
logger.info(f"Final target count: {final_target_count}", tag="TEST")
|
||||
|
||||
# Verify: target count should not have increased significantly
|
||||
# (allow for 1 extra target for internal use, but not many more)
|
||||
target_diff = final_target_count - initial_target_count
|
||||
logger.info(f"Target count difference: {target_diff}", tag="TEST")
|
||||
|
||||
# Success criteria:
|
||||
# 1. Page navigation worked
|
||||
# 2. Target count didn't explode (reused existing context)
|
||||
success = title == "Example Domain" and target_diff <= 1
|
||||
|
||||
if success:
|
||||
logger.success("Pre-created context usage test passed", tag="TEST")
|
||||
else:
|
||||
logger.error(f"Test failed - Title: {title}, Target diff: {target_diff}", tag="TEST")
|
||||
|
||||
return success
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Test failed: {str(e)}", tag="TEST")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
finally:
|
||||
# Cleanup
|
||||
if manager:
|
||||
try:
|
||||
await manager.close()
|
||||
except:
|
||||
pass
|
||||
|
||||
if cdp_creator and context_info:
|
||||
try:
|
||||
await cdp_creator.dispose_context(context_info['browser_context_id'])
|
||||
await cdp_creator.disconnect()
|
||||
except:
|
||||
pass
|
||||
|
||||
if managed_browser:
|
||||
try:
|
||||
await managed_browser.cleanup()
|
||||
except:
|
||||
pass
|
||||
|
||||
|
||||
async def test_context_isolation():
|
||||
"""
|
||||
Test that using browser_context_id actually provides isolation.
|
||||
Create two contexts and verify they don't share state.
|
||||
"""
|
||||
logger.info("Testing context isolation with browser_context_id", tag="TEST")
|
||||
|
||||
browser_config_initial = BrowserConfig(
|
||||
use_managed_browser=True,
|
||||
headless=True,
|
||||
debugging_port=9227,
|
||||
verbose=True
|
||||
)
|
||||
|
||||
managed_browser = ManagedBrowser(browser_config=browser_config_initial, logger=logger)
|
||||
cdp_creator = None
|
||||
manager1 = None
|
||||
manager2 = None
|
||||
context_info_1 = None
|
||||
context_info_2 = None
|
||||
|
||||
try:
|
||||
# Start the browser
|
||||
cdp_url = await managed_browser.start()
|
||||
logger.info(f"Browser started at {cdp_url}", tag="TEST")
|
||||
|
||||
# Create two separate contexts
|
||||
cdp_creator = CDPContextCreator(cdp_url)
|
||||
context_info_1 = await cdp_creator.create_context()
|
||||
logger.info(f"Context 1: {context_info_1['browser_context_id']}", tag="TEST")
|
||||
|
||||
# Need to reconnect for second context (or use same connection)
|
||||
await cdp_creator.disconnect()
|
||||
cdp_creator2 = CDPContextCreator(cdp_url)
|
||||
context_info_2 = await cdp_creator2.create_context()
|
||||
logger.info(f"Context 2: {context_info_2['browser_context_id']}", tag="TEST")
|
||||
|
||||
# Verify contexts are different
|
||||
assert context_info_1['browser_context_id'] != context_info_2['browser_context_id'], \
|
||||
"Contexts should have different IDs"
|
||||
|
||||
# Connect with first context
|
||||
browser_config_1 = BrowserConfig(
|
||||
cdp_url=cdp_url,
|
||||
browser_context_id=context_info_1['browser_context_id'],
|
||||
target_id=context_info_1['target_id'],
|
||||
headless=True
|
||||
)
|
||||
|
||||
manager1 = BrowserManager(browser_config=browser_config_1, logger=logger)
|
||||
await manager1.start()
|
||||
|
||||
# Set a cookie in context 1
|
||||
page1, ctx1 = await manager1.get_page(CrawlerRunConfig())
|
||||
await page1.goto("https://example.com", wait_until="domcontentloaded")
|
||||
await ctx1.add_cookies([{
|
||||
"name": "test_isolation",
|
||||
"value": "context_1_value",
|
||||
"domain": "example.com",
|
||||
"path": "/"
|
||||
}])
|
||||
|
||||
cookies1 = await ctx1.cookies(["https://example.com"])
|
||||
cookie1_value = next((c["value"] for c in cookies1 if c["name"] == "test_isolation"), None)
|
||||
logger.info(f"Cookie in context 1: {cookie1_value}", tag="TEST")
|
||||
|
||||
# Connect with second context
|
||||
browser_config_2 = BrowserConfig(
|
||||
cdp_url=cdp_url,
|
||||
browser_context_id=context_info_2['browser_context_id'],
|
||||
target_id=context_info_2['target_id'],
|
||||
headless=True
|
||||
)
|
||||
|
||||
manager2 = BrowserManager(browser_config=browser_config_2, logger=logger)
|
||||
await manager2.start()
|
||||
|
||||
# Check cookies in context 2 - should not have the cookie from context 1
|
||||
page2, ctx2 = await manager2.get_page(CrawlerRunConfig())
|
||||
await page2.goto("https://example.com", wait_until="domcontentloaded")
|
||||
|
||||
cookies2 = await ctx2.cookies(["https://example.com"])
|
||||
cookie2_value = next((c["value"] for c in cookies2 if c["name"] == "test_isolation"), None)
|
||||
logger.info(f"Cookie in context 2: {cookie2_value}", tag="TEST")
|
||||
|
||||
# Verify isolation
|
||||
isolation_works = cookie1_value == "context_1_value" and cookie2_value is None
|
||||
|
||||
if isolation_works:
|
||||
logger.success("Context isolation test passed", tag="TEST")
|
||||
else:
|
||||
logger.error(f"Isolation failed - Cookie1: {cookie1_value}, Cookie2: {cookie2_value}", tag="TEST")
|
||||
|
||||
return isolation_works
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Test failed: {str(e)}", tag="TEST")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
finally:
|
||||
# Cleanup
|
||||
for mgr in [manager1, manager2]:
|
||||
if mgr:
|
||||
try:
|
||||
await mgr.close()
|
||||
except:
|
||||
pass
|
||||
|
||||
for ctx_info, creator in [(context_info_1, cdp_creator), (context_info_2, cdp_creator2 if 'cdp_creator2' in dir() else None)]:
|
||||
if ctx_info and creator:
|
||||
try:
|
||||
await creator.dispose_context(ctx_info['browser_context_id'])
|
||||
await creator.disconnect()
|
||||
except:
|
||||
pass
|
||||
|
||||
if managed_browser:
|
||||
try:
|
||||
await managed_browser.cleanup()
|
||||
except:
|
||||
pass
|
||||
|
||||
|
||||
async def run_tests():
|
||||
"""Run all browser_context_id tests."""
|
||||
results = []
|
||||
|
||||
logger.info("Running browser_context_id tests", tag="SUITE")
|
||||
|
||||
# Basic parameter test
|
||||
results.append(("browser_context_id_basic", await test_browser_context_id_basic()))
|
||||
|
||||
# Pre-created context usage test
|
||||
results.append(("pre_created_context_usage", await test_pre_created_context_usage()))
|
||||
|
||||
# Note: Context isolation test is commented out because isolation is enforced
|
||||
# at the CDP level by the cloud browser service, not at the Playwright level.
|
||||
# When multiple BrowserManagers connect to the same browser, Playwright sees
|
||||
# all contexts. In production, each worker gets exactly one pre-created context.
|
||||
# results.append(("context_isolation", await test_context_isolation()))
|
||||
|
||||
# Print summary
|
||||
total = len(results)
|
||||
passed = sum(1 for _, r in results if r)
|
||||
|
||||
logger.info("=" * 50, tag="SUMMARY")
|
||||
logger.info(f"Test Results: {passed}/{total} passed", tag="SUMMARY")
|
||||
logger.info("=" * 50, tag="SUMMARY")
|
||||
|
||||
for name, result in results:
|
||||
status = "PASSED" if result else "FAILED"
|
||||
logger.info(f" {name}: {status}", tag="SUMMARY")
|
||||
|
||||
if passed == total:
|
||||
logger.success("All tests passed!", tag="SUMMARY")
|
||||
return True
|
||||
else:
|
||||
logger.error(f"{total - passed} tests failed", tag="SUMMARY")
|
||||
return False
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
success = asyncio.run(run_tests())
|
||||
sys.exit(0 if success else 1)
|
||||
281
tests/browser/test_cdp_cleanup_reuse.py
Normal file
281
tests/browser/test_cdp_cleanup_reuse.py
Normal file
@@ -0,0 +1,281 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Tests for CDP connection cleanup and browser reuse.
|
||||
|
||||
These tests verify that:
|
||||
1. WebSocket URLs are properly handled (skip HTTP verification)
|
||||
2. cdp_cleanup_on_close properly disconnects without terminating the browser
|
||||
3. The same browser can be reused by multiple sequential connections
|
||||
|
||||
Requirements:
|
||||
- A CDP-compatible browser pool service running (e.g., chromepoold)
|
||||
- Service should be accessible at CDP_SERVICE_URL (default: http://localhost:11235)
|
||||
|
||||
Usage:
|
||||
pytest tests/browser/test_cdp_cleanup_reuse.py -v
|
||||
|
||||
Or run directly:
|
||||
python tests/browser/test_cdp_cleanup_reuse.py
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
import pytest
|
||||
import requests
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
|
||||
# Configuration
|
||||
CDP_SERVICE_URL = os.getenv("CDP_SERVICE_URL", "http://localhost:11235")
|
||||
|
||||
|
||||
def is_cdp_service_available():
|
||||
"""Check if CDP service is running."""
|
||||
try:
|
||||
resp = requests.get(f"{CDP_SERVICE_URL}/health", timeout=2)
|
||||
return resp.status_code == 200
|
||||
except:
|
||||
return False
|
||||
|
||||
|
||||
def create_browser():
|
||||
"""Create a browser via CDP service API."""
|
||||
resp = requests.post(
|
||||
f"{CDP_SERVICE_URL}/v1/browsers",
|
||||
json={"headless": True},
|
||||
timeout=10
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()
|
||||
|
||||
|
||||
def get_browser_info(browser_id):
|
||||
"""Get browser info from CDP service."""
|
||||
resp = requests.get(f"{CDP_SERVICE_URL}/v1/browsers", timeout=5)
|
||||
for browser in resp.json():
|
||||
if browser["id"] == browser_id:
|
||||
return browser
|
||||
return None
|
||||
|
||||
|
||||
def delete_browser(browser_id):
|
||||
"""Delete a browser via CDP service API."""
|
||||
try:
|
||||
requests.delete(f"{CDP_SERVICE_URL}/v1/browsers/{browser_id}", timeout=5)
|
||||
except:
|
||||
pass
|
||||
|
||||
|
||||
# Skip all tests if CDP service is not available
|
||||
pytestmark = pytest.mark.skipif(
|
||||
not is_cdp_service_available(),
|
||||
reason=f"CDP service not available at {CDP_SERVICE_URL}"
|
||||
)
|
||||
|
||||
|
||||
class TestCDPWebSocketURL:
|
||||
"""Tests for WebSocket URL handling."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_websocket_url_skips_http_verification(self):
|
||||
"""WebSocket URLs should skip HTTP /json/version verification."""
|
||||
browser = create_browser()
|
||||
try:
|
||||
ws_url = browser["ws_url"]
|
||||
assert ws_url.startswith("ws://") or ws_url.startswith("wss://")
|
||||
|
||||
async with AsyncWebCrawler(
|
||||
config=BrowserConfig(
|
||||
browser_mode="cdp",
|
||||
cdp_url=ws_url,
|
||||
headless=True,
|
||||
cdp_cleanup_on_close=True,
|
||||
)
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=CrawlerRunConfig(verbose=False),
|
||||
)
|
||||
assert result.success
|
||||
assert "Example Domain" in result.metadata.get("title", "")
|
||||
finally:
|
||||
delete_browser(browser["browser_id"])
|
||||
|
||||
|
||||
class TestCDPCleanupOnClose:
|
||||
"""Tests for cdp_cleanup_on_close behavior."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_browser_survives_after_cleanup_close(self):
|
||||
"""Browser should remain alive after close with cdp_cleanup_on_close=True."""
|
||||
browser = create_browser()
|
||||
browser_id = browser["browser_id"]
|
||||
ws_url = browser["ws_url"]
|
||||
|
||||
try:
|
||||
# Verify browser exists
|
||||
info_before = get_browser_info(browser_id)
|
||||
assert info_before is not None
|
||||
pid_before = info_before["pid"]
|
||||
|
||||
# Connect, crawl, and close with cleanup
|
||||
async with AsyncWebCrawler(
|
||||
config=BrowserConfig(
|
||||
browser_mode="cdp",
|
||||
cdp_url=ws_url,
|
||||
headless=True,
|
||||
cdp_cleanup_on_close=True,
|
||||
)
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=CrawlerRunConfig(verbose=False),
|
||||
)
|
||||
assert result.success
|
||||
|
||||
# Browser should still exist with same PID
|
||||
info_after = get_browser_info(browser_id)
|
||||
assert info_after is not None, "Browser was terminated but should only disconnect"
|
||||
assert info_after["pid"] == pid_before, "Browser PID changed unexpectedly"
|
||||
finally:
|
||||
delete_browser(browser_id)
|
||||
|
||||
|
||||
class TestCDPBrowserReuse:
|
||||
"""Tests for reusing the same browser with multiple connections."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_sequential_connections_same_browser(self):
|
||||
"""Multiple sequential connections to the same browser should work."""
|
||||
browser = create_browser()
|
||||
browser_id = browser["browser_id"]
|
||||
ws_url = browser["ws_url"]
|
||||
|
||||
try:
|
||||
urls = [
|
||||
"https://example.com",
|
||||
"https://httpbin.org/ip",
|
||||
"https://httpbin.org/headers",
|
||||
]
|
||||
|
||||
for i, url in enumerate(urls, 1):
|
||||
# Each connection uses cdp_cleanup_on_close=True
|
||||
async with AsyncWebCrawler(
|
||||
config=BrowserConfig(
|
||||
browser_mode="cdp",
|
||||
cdp_url=ws_url,
|
||||
headless=True,
|
||||
cdp_cleanup_on_close=True,
|
||||
)
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
config=CrawlerRunConfig(verbose=False),
|
||||
)
|
||||
assert result.success, f"Connection {i} failed for {url}"
|
||||
|
||||
# Verify browser is still healthy
|
||||
info = get_browser_info(browser_id)
|
||||
assert info is not None, f"Browser died after connection {i}"
|
||||
|
||||
finally:
|
||||
delete_browser(browser_id)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_no_user_wait_needed_between_connections(self):
|
||||
"""With cdp_cleanup_on_close=True, no user wait should be needed."""
|
||||
browser = create_browser()
|
||||
browser_id = browser["browser_id"]
|
||||
ws_url = browser["ws_url"]
|
||||
|
||||
try:
|
||||
# Rapid-fire connections with NO sleep between them
|
||||
for i in range(3):
|
||||
async with AsyncWebCrawler(
|
||||
config=BrowserConfig(
|
||||
browser_mode="cdp",
|
||||
cdp_url=ws_url,
|
||||
headless=True,
|
||||
cdp_cleanup_on_close=True,
|
||||
)
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=CrawlerRunConfig(verbose=False),
|
||||
)
|
||||
assert result.success, f"Rapid connection {i+1} failed"
|
||||
# NO asyncio.sleep() here - internal delay should be sufficient
|
||||
finally:
|
||||
delete_browser(browser_id)
|
||||
|
||||
|
||||
class TestCDPBackwardCompatibility:
|
||||
"""Tests for backward compatibility with existing CDP usage."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_http_url_with_browser_id_works(self):
|
||||
"""HTTP URL with browser_id query param should work (backward compatibility)."""
|
||||
browser = create_browser()
|
||||
browser_id = browser["browser_id"]
|
||||
try:
|
||||
# Use HTTP URL with browser_id query parameter
|
||||
http_url = f"{CDP_SERVICE_URL}?browser_id={browser_id}"
|
||||
|
||||
async with AsyncWebCrawler(
|
||||
config=BrowserConfig(
|
||||
browser_mode="cdp",
|
||||
cdp_url=http_url,
|
||||
headless=True,
|
||||
cdp_cleanup_on_close=True,
|
||||
)
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=CrawlerRunConfig(verbose=False),
|
||||
)
|
||||
assert result.success
|
||||
finally:
|
||||
delete_browser(browser_id)
|
||||
|
||||
|
||||
# Allow running directly
|
||||
if __name__ == "__main__":
|
||||
if not is_cdp_service_available():
|
||||
print(f"CDP service not available at {CDP_SERVICE_URL}")
|
||||
print("Please start a CDP-compatible browser pool service first.")
|
||||
exit(1)
|
||||
|
||||
async def run_tests():
|
||||
print("=" * 60)
|
||||
print("CDP Cleanup and Browser Reuse Tests")
|
||||
print("=" * 60)
|
||||
|
||||
tests = [
|
||||
("WebSocket URL handling", TestCDPWebSocketURL().test_websocket_url_skips_http_verification),
|
||||
("Browser survives after cleanup", TestCDPCleanupOnClose().test_browser_survives_after_cleanup_close),
|
||||
("Sequential connections", TestCDPBrowserReuse().test_sequential_connections_same_browser),
|
||||
("No user wait needed", TestCDPBrowserReuse().test_no_user_wait_needed_between_connections),
|
||||
("HTTP URL with browser_id", TestCDPBackwardCompatibility().test_http_url_with_browser_id_works),
|
||||
]
|
||||
|
||||
results = []
|
||||
for name, test_func in tests:
|
||||
print(f"\n--- {name} ---")
|
||||
try:
|
||||
await test_func()
|
||||
print(f"PASS")
|
||||
results.append((name, True))
|
||||
except Exception as e:
|
||||
print(f"FAIL: {e}")
|
||||
results.append((name, False))
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
print("SUMMARY")
|
||||
print("=" * 60)
|
||||
for name, passed in results:
|
||||
print(f" {name}: {'PASS' if passed else 'FAIL'}")
|
||||
|
||||
all_passed = all(r[1] for r in results)
|
||||
print(f"\nOverall: {'ALL TESTS PASSED' if all_passed else 'SOME TESTS FAILED'}")
|
||||
return 0 if all_passed else 1
|
||||
|
||||
exit(asyncio.run(run_tests()))
|
||||
1
tests/cache_validation/__init__.py
Normal file
1
tests/cache_validation/__init__.py
Normal file
@@ -0,0 +1 @@
|
||||
# Cache validation test suite
|
||||
40
tests/cache_validation/conftest.py
Normal file
40
tests/cache_validation/conftest.py
Normal file
@@ -0,0 +1,40 @@
|
||||
"""Pytest fixtures for cache validation tests."""
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
def pytest_configure(config):
|
||||
"""Register custom markers."""
|
||||
config.addinivalue_line(
|
||||
"markers", "integration: marks tests as integration tests (may require network)"
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sample_head_html():
|
||||
"""Sample HTML head section for testing."""
|
||||
return '''
|
||||
<head>
|
||||
<meta charset="utf-8">
|
||||
<title>Test Page Title</title>
|
||||
<meta name="description" content="This is a test page description">
|
||||
<meta property="og:title" content="OG Test Title">
|
||||
<meta property="og:description" content="OG Description">
|
||||
<meta property="og:image" content="https://example.com/image.jpg">
|
||||
<meta property="article:modified_time" content="2024-12-01T00:00:00Z">
|
||||
<link rel="stylesheet" href="style.css">
|
||||
<script src="app.js"></script>
|
||||
</head>
|
||||
'''
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def minimal_head_html():
|
||||
"""Minimal head with just a title."""
|
||||
return '<head><title>Minimal</title></head>'
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def empty_head_html():
|
||||
"""Empty head section."""
|
||||
return '<head></head>'
|
||||
449
tests/cache_validation/test_end_to_end.py
Normal file
449
tests/cache_validation/test_end_to_end.py
Normal file
@@ -0,0 +1,449 @@
|
||||
"""
|
||||
End-to-end tests for Smart Cache validation.
|
||||
|
||||
Tests the full flow:
|
||||
1. Fresh crawl (browser launch) - SLOW
|
||||
2. Cached crawl without validation (check_cache_freshness=False) - FAST
|
||||
3. Cached crawl with validation (check_cache_freshness=True) - FAST (304/fingerprint)
|
||||
|
||||
Verifies all layers:
|
||||
- Database storage of etag, last_modified, head_fingerprint, cached_at
|
||||
- Cache validation logic
|
||||
- HTTP conditional requests (304 Not Modified)
|
||||
- Performance improvements
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import time
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
from crawl4ai.async_database import async_db_manager
|
||||
|
||||
|
||||
class TestEndToEndCacheValidation:
|
||||
"""End-to-end tests for the complete cache validation flow."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_full_cache_flow_docs_python(self):
|
||||
"""
|
||||
Test complete cache flow with docs.python.org:
|
||||
1. Fresh crawl (slow - browser) - using BYPASS to force fresh
|
||||
2. Cache hit without validation (fast)
|
||||
3. Cache hit with validation (fast - 304)
|
||||
"""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
|
||||
# ========== CRAWL 1: Fresh crawl (force with WRITE_ONLY to skip cache read) ==========
|
||||
config1 = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.WRITE_ONLY, # Skip reading, write new data
|
||||
check_cache_freshness=False,
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start1 = time.perf_counter()
|
||||
result1 = await crawler.arun(url, config=config1)
|
||||
time1 = time.perf_counter() - start1
|
||||
|
||||
assert result1.success, f"First crawl failed: {result1.error_message}"
|
||||
# WRITE_ONLY means we did a fresh crawl and wrote to cache
|
||||
assert result1.cache_status == "miss", f"Expected 'miss', got '{result1.cache_status}'"
|
||||
|
||||
print(f"\n[CRAWL 1] Fresh crawl: {time1:.2f}s (cache_status: {result1.cache_status})")
|
||||
|
||||
# Verify data is stored in database
|
||||
metadata = await async_db_manager.aget_cache_metadata(url)
|
||||
assert metadata is not None, "Metadata should be stored in database"
|
||||
assert metadata.get("etag") or metadata.get("last_modified"), "Should have ETag or Last-Modified"
|
||||
print(f" - Stored ETag: {metadata.get('etag', 'N/A')[:30]}...")
|
||||
print(f" - Stored Last-Modified: {metadata.get('last_modified', 'N/A')}")
|
||||
print(f" - Stored head_fingerprint: {metadata.get('head_fingerprint', 'N/A')}")
|
||||
print(f" - Stored cached_at: {metadata.get('cached_at', 'N/A')}")
|
||||
|
||||
# ========== CRAWL 2: Cache hit WITHOUT validation ==========
|
||||
config2 = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.ENABLED,
|
||||
check_cache_freshness=False, # Skip validation - pure cache hit
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start2 = time.perf_counter()
|
||||
result2 = await crawler.arun(url, config=config2)
|
||||
time2 = time.perf_counter() - start2
|
||||
|
||||
assert result2.success, f"Second crawl failed: {result2.error_message}"
|
||||
assert result2.cache_status == "hit", f"Expected 'hit', got '{result2.cache_status}'"
|
||||
|
||||
print(f"\n[CRAWL 2] Cache hit (no validation): {time2:.2f}s (cache_status: {result2.cache_status})")
|
||||
print(f" - Speedup: {time1/time2:.1f}x faster than fresh crawl")
|
||||
|
||||
# Should be MUCH faster - no browser, no HTTP request
|
||||
assert time2 < time1 / 2, f"Cache hit should be at least 2x faster (was {time1/time2:.1f}x)"
|
||||
|
||||
# ========== CRAWL 3: Cache hit WITH validation (304) ==========
|
||||
config3 = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.ENABLED,
|
||||
check_cache_freshness=True, # Validate cache freshness
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start3 = time.perf_counter()
|
||||
result3 = await crawler.arun(url, config=config3)
|
||||
time3 = time.perf_counter() - start3
|
||||
|
||||
assert result3.success, f"Third crawl failed: {result3.error_message}"
|
||||
# Should be "hit_validated" (304) or "hit_fallback" (error during validation)
|
||||
assert result3.cache_status in ["hit_validated", "hit_fallback"], \
|
||||
f"Expected validated cache hit, got '{result3.cache_status}'"
|
||||
|
||||
print(f"\n[CRAWL 3] Cache hit (with validation): {time3:.2f}s (cache_status: {result3.cache_status})")
|
||||
print(f" - Speedup: {time1/time3:.1f}x faster than fresh crawl")
|
||||
|
||||
# Should still be fast - just a HEAD request, no browser
|
||||
assert time3 < time1 / 2, f"Validated cache hit should be faster than fresh crawl"
|
||||
|
||||
# ========== SUMMARY ==========
|
||||
print(f"\n{'='*60}")
|
||||
print(f"PERFORMANCE SUMMARY for {url}")
|
||||
print(f"{'='*60}")
|
||||
print(f" Fresh crawl (browser): {time1:.2f}s")
|
||||
print(f" Cache hit (no validation): {time2:.2f}s ({time1/time2:.1f}x faster)")
|
||||
print(f" Cache hit (with validation): {time3:.2f}s ({time1/time3:.1f}x faster)")
|
||||
print(f"{'='*60}")
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_full_cache_flow_crawl4ai_docs(self):
|
||||
"""Test with docs.crawl4ai.com."""
|
||||
url = "https://docs.crawl4ai.com/"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
|
||||
# Fresh crawl - use WRITE_ONLY to ensure we get fresh data
|
||||
config1 = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start1 = time.perf_counter()
|
||||
result1 = await crawler.arun(url, config=config1)
|
||||
time1 = time.perf_counter() - start1
|
||||
|
||||
assert result1.success
|
||||
assert result1.cache_status == "miss"
|
||||
print(f"\n[docs.crawl4ai.com] Fresh: {time1:.2f}s")
|
||||
|
||||
# Cache hit with validation
|
||||
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start2 = time.perf_counter()
|
||||
result2 = await crawler.arun(url, config=config2)
|
||||
time2 = time.perf_counter() - start2
|
||||
|
||||
assert result2.success
|
||||
assert result2.cache_status in ["hit_validated", "hit_fallback"]
|
||||
print(f"[docs.crawl4ai.com] Validated: {time2:.2f}s ({time1/time2:.1f}x faster)")
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_verify_database_storage(self):
|
||||
"""Verify all validation metadata is properly stored in database."""
|
||||
url = "https://docs.python.org/3/library/asyncio.html"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url, config=config)
|
||||
|
||||
assert result.success
|
||||
|
||||
# Verify all fields in database
|
||||
metadata = await async_db_manager.aget_cache_metadata(url)
|
||||
|
||||
assert metadata is not None, "Metadata must be stored"
|
||||
assert "url" in metadata
|
||||
assert "etag" in metadata
|
||||
assert "last_modified" in metadata
|
||||
assert "head_fingerprint" in metadata
|
||||
assert "cached_at" in metadata
|
||||
assert "response_headers" in metadata
|
||||
|
||||
print(f"\nDatabase storage verification for {url}:")
|
||||
print(f" - etag: {metadata['etag'][:40] if metadata['etag'] else 'None'}...")
|
||||
print(f" - last_modified: {metadata['last_modified']}")
|
||||
print(f" - head_fingerprint: {metadata['head_fingerprint']}")
|
||||
print(f" - cached_at: {metadata['cached_at']}")
|
||||
print(f" - response_headers keys: {list(metadata['response_headers'].keys())[:5]}...")
|
||||
|
||||
# At least one validation field should be populated
|
||||
has_validation_data = (
|
||||
metadata["etag"] or
|
||||
metadata["last_modified"] or
|
||||
metadata["head_fingerprint"]
|
||||
)
|
||||
assert has_validation_data, "Should have at least one validation field"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_head_fingerprint_stored_and_used(self):
|
||||
"""Verify head fingerprint is computed, stored, and used for validation."""
|
||||
url = "https://example.com/"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
|
||||
# Fresh crawl
|
||||
config1 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result1 = await crawler.arun(url, config=config1)
|
||||
|
||||
assert result1.success
|
||||
assert result1.head_fingerprint, "head_fingerprint should be set on CrawlResult"
|
||||
|
||||
# Verify in database
|
||||
metadata = await async_db_manager.aget_cache_metadata(url)
|
||||
assert metadata["head_fingerprint"], "head_fingerprint should be stored in database"
|
||||
assert metadata["head_fingerprint"] == result1.head_fingerprint
|
||||
|
||||
print(f"\nHead fingerprint for {url}:")
|
||||
print(f" - CrawlResult.head_fingerprint: {result1.head_fingerprint}")
|
||||
print(f" - Database head_fingerprint: {metadata['head_fingerprint']}")
|
||||
|
||||
# Validate using fingerprint
|
||||
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result2 = await crawler.arun(url, config=config2)
|
||||
|
||||
assert result2.success
|
||||
assert result2.cache_status in ["hit_validated", "hit_fallback"]
|
||||
print(f" - Validation result: {result2.cache_status}")
|
||||
|
||||
|
||||
class TestCacheValidationPerformance:
|
||||
"""Performance benchmarks for cache validation."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_multiple_urls_performance(self):
|
||||
"""Test cache performance across multiple URLs."""
|
||||
urls = [
|
||||
"https://docs.python.org/3/",
|
||||
"https://docs.python.org/3/library/asyncio.html",
|
||||
"https://en.wikipedia.org/wiki/Python_(programming_language)",
|
||||
]
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
fresh_times = []
|
||||
cached_times = []
|
||||
|
||||
print(f"\n{'='*70}")
|
||||
print("MULTI-URL PERFORMANCE TEST")
|
||||
print(f"{'='*70}")
|
||||
|
||||
# Fresh crawls - use WRITE_ONLY to force fresh crawl
|
||||
for url in urls:
|
||||
config = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start = time.perf_counter()
|
||||
result = await crawler.arun(url, config=config)
|
||||
elapsed = time.perf_counter() - start
|
||||
fresh_times.append(elapsed)
|
||||
print(f"Fresh: {url[:50]:50} {elapsed:.2f}s ({result.cache_status})")
|
||||
|
||||
# Cached crawls with validation
|
||||
for url in urls:
|
||||
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start = time.perf_counter()
|
||||
result = await crawler.arun(url, config=config)
|
||||
elapsed = time.perf_counter() - start
|
||||
cached_times.append(elapsed)
|
||||
print(f"Cached: {url[:50]:50} {elapsed:.2f}s ({result.cache_status})")
|
||||
|
||||
avg_fresh = sum(fresh_times) / len(fresh_times)
|
||||
avg_cached = sum(cached_times) / len(cached_times)
|
||||
total_fresh = sum(fresh_times)
|
||||
total_cached = sum(cached_times)
|
||||
|
||||
print(f"\n{'='*70}")
|
||||
print(f"RESULTS:")
|
||||
print(f" Total fresh crawl time: {total_fresh:.2f}s")
|
||||
print(f" Total cached time: {total_cached:.2f}s")
|
||||
print(f" Average speedup: {avg_fresh/avg_cached:.1f}x")
|
||||
print(f" Time saved: {total_fresh - total_cached:.2f}s")
|
||||
print(f"{'='*70}")
|
||||
|
||||
# Cached should be significantly faster
|
||||
assert avg_cached < avg_fresh / 2, "Cached crawls should be at least 2x faster"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_repeated_access_same_url(self):
|
||||
"""Test repeated access to the same URL shows consistent cache hits."""
|
||||
url = "https://docs.python.org/3/"
|
||||
num_accesses = 5
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
|
||||
print(f"\n{'='*60}")
|
||||
print(f"REPEATED ACCESS TEST: {url}")
|
||||
print(f"{'='*60}")
|
||||
|
||||
# First access - fresh crawl
|
||||
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start = time.perf_counter()
|
||||
result = await crawler.arun(url, config=config)
|
||||
fresh_time = time.perf_counter() - start
|
||||
print(f"Access 1 (fresh): {fresh_time:.2f}s - {result.cache_status}")
|
||||
|
||||
# Repeated accesses - should all be cache hits
|
||||
cached_times = []
|
||||
for i in range(2, num_accesses + 1):
|
||||
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start = time.perf_counter()
|
||||
result = await crawler.arun(url, config=config)
|
||||
elapsed = time.perf_counter() - start
|
||||
cached_times.append(elapsed)
|
||||
print(f"Access {i} (cached): {elapsed:.2f}s - {result.cache_status}")
|
||||
assert result.cache_status in ["hit", "hit_validated", "hit_fallback"]
|
||||
|
||||
avg_cached = sum(cached_times) / len(cached_times)
|
||||
print(f"\nAverage cached time: {avg_cached:.2f}s")
|
||||
print(f"Speedup over fresh: {fresh_time/avg_cached:.1f}x")
|
||||
|
||||
|
||||
class TestCacheValidationModes:
|
||||
"""Test different cache modes and their behavior."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_cache_bypass_always_fresh(self):
|
||||
"""CacheMode.BYPASS should always do fresh crawl."""
|
||||
# Use a unique URL path to avoid cache from other tests
|
||||
url = "https://example.com/test-bypass"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
|
||||
# First crawl with WRITE_ONLY to populate cache (always fresh)
|
||||
config1 = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result1 = await crawler.arun(url, config=config1)
|
||||
assert result1.cache_status == "miss"
|
||||
|
||||
# Second crawl with BYPASS - should NOT use cache
|
||||
config2 = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result2 = await crawler.arun(url, config=config2)
|
||||
|
||||
# BYPASS mode means no cache interaction
|
||||
assert result2.cache_status is None or result2.cache_status == "miss"
|
||||
print(f"\nCacheMode.BYPASS result: {result2.cache_status}")
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_validation_disabled_uses_cache_directly(self):
|
||||
"""With check_cache_freshness=False, should use cache without HTTP validation."""
|
||||
url = "https://docs.python.org/3/tutorial/"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
|
||||
# Fresh crawl - use WRITE_ONLY to force fresh
|
||||
config1 = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result1 = await crawler.arun(url, config=config1)
|
||||
assert result1.cache_status == "miss"
|
||||
|
||||
# Cached with validation DISABLED - should be "hit" (not "hit_validated")
|
||||
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start = time.perf_counter()
|
||||
result2 = await crawler.arun(url, config=config2)
|
||||
elapsed = time.perf_counter() - start
|
||||
|
||||
assert result2.cache_status == "hit", f"Expected 'hit', got '{result2.cache_status}'"
|
||||
print(f"\nValidation disabled: {elapsed:.3f}s (cache_status: {result2.cache_status})")
|
||||
|
||||
# Should be very fast - no HTTP request at all
|
||||
assert elapsed < 1.0, "Cache hit without validation should be < 1 second"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_validation_enabled_checks_freshness(self):
|
||||
"""With check_cache_freshness=True, should validate before using cache."""
|
||||
url = "https://docs.python.org/3/reference/"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
|
||||
# Fresh crawl
|
||||
config1 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result1 = await crawler.arun(url, config=config1)
|
||||
|
||||
# Cached with validation ENABLED - should be "hit_validated"
|
||||
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start = time.perf_counter()
|
||||
result2 = await crawler.arun(url, config=config2)
|
||||
elapsed = time.perf_counter() - start
|
||||
|
||||
assert result2.cache_status in ["hit_validated", "hit_fallback"]
|
||||
print(f"\nValidation enabled: {elapsed:.3f}s (cache_status: {result2.cache_status})")
|
||||
|
||||
|
||||
class TestCacheValidationResponseHeaders:
|
||||
"""Test that response headers are properly stored and retrieved."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_response_headers_stored(self):
|
||||
"""Verify response headers including ETag and Last-Modified are stored."""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url, config=config)
|
||||
|
||||
assert result.success
|
||||
assert result.response_headers is not None
|
||||
|
||||
# Check that cache-relevant headers are captured
|
||||
headers = result.response_headers
|
||||
print(f"\nResponse headers for {url}:")
|
||||
|
||||
# Look for ETag (case-insensitive)
|
||||
etag = headers.get("etag") or headers.get("ETag")
|
||||
print(f" - ETag: {etag}")
|
||||
|
||||
# Look for Last-Modified
|
||||
last_modified = headers.get("last-modified") or headers.get("Last-Modified")
|
||||
print(f" - Last-Modified: {last_modified}")
|
||||
|
||||
# Look for Cache-Control
|
||||
cache_control = headers.get("cache-control") or headers.get("Cache-Control")
|
||||
print(f" - Cache-Control: {cache_control}")
|
||||
|
||||
# At least one should be present for docs.python.org
|
||||
assert etag or last_modified, "Should have ETag or Last-Modified header"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_headers_used_for_validation(self):
|
||||
"""Verify stored headers are used for conditional requests."""
|
||||
url = "https://docs.crawl4ai.com/"
|
||||
|
||||
browser_config = BrowserConfig(headless=True, verbose=False)
|
||||
|
||||
# Fresh crawl to store headers
|
||||
config1 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result1 = await crawler.arun(url, config=config1)
|
||||
|
||||
# Get stored metadata
|
||||
metadata = await async_db_manager.aget_cache_metadata(url)
|
||||
stored_etag = metadata.get("etag")
|
||||
stored_last_modified = metadata.get("last_modified")
|
||||
|
||||
print(f"\nStored validation data for {url}:")
|
||||
print(f" - etag: {stored_etag}")
|
||||
print(f" - last_modified: {stored_last_modified}")
|
||||
|
||||
# Validate - should use stored headers
|
||||
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result2 = await crawler.arun(url, config=config2)
|
||||
|
||||
# Should get validated hit (304 response)
|
||||
assert result2.cache_status in ["hit_validated", "hit_fallback"]
|
||||
print(f" - Validation result: {result2.cache_status}")
|
||||
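For readers skimming the suite above, the cache-mode behaviour under test reduces to a small usage pattern. The sketch below is illustrative only and not part of the test files; it assumes the same public surface the tests exercise (AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, check_cache_freshness, result.cache_status) and the same status strings the assertions check.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(headless=True, verbose=False)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # First run: populate the cache without any freshness check.
        warm = await crawler.arun(
            "https://docs.python.org/3/",
            config=CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=False),
        )
        print(warm.cache_status)  # "miss" on a cold cache

        # Second run: check_cache_freshness=True revalidates the cached copy
        # (conditional request / head fingerprint) before reusing it.
        cached = await crawler.arun(
            "https://docs.python.org/3/",
            config=CrawlerRunConfig(cache_mode=CacheMode.ENABLED, check_cache_freshness=True),
        )
        print(cached.cache_status)  # "hit_validated" (or "hit_fallback")

asyncio.run(main())
```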
tests/cache_validation/test_head_fingerprint.py
@@ -0,0 +1,97 @@
"""Unit tests for head fingerprinting."""
|
||||
|
||||
import pytest
|
||||
from crawl4ai.utils import compute_head_fingerprint
|
||||
|
||||
|
||||
class TestHeadFingerprint:
|
||||
"""Tests for the compute_head_fingerprint function."""
|
||||
|
||||
def test_same_content_same_fingerprint(self):
|
||||
"""Identical <head> content produces same fingerprint."""
|
||||
head = "<head><title>Test Page</title></head>"
|
||||
fp1 = compute_head_fingerprint(head)
|
||||
fp2 = compute_head_fingerprint(head)
|
||||
assert fp1 == fp2
|
||||
assert fp1 != ""
|
||||
|
||||
def test_different_title_different_fingerprint(self):
|
||||
"""Different title produces different fingerprint."""
|
||||
head1 = "<head><title>Title A</title></head>"
|
||||
head2 = "<head><title>Title B</title></head>"
|
||||
assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
|
||||
|
||||
def test_empty_head_returns_empty_string(self):
|
||||
"""Empty or None head should return empty fingerprint."""
|
||||
assert compute_head_fingerprint("") == ""
|
||||
assert compute_head_fingerprint(None) == ""
|
||||
|
||||
def test_head_without_signals_returns_empty(self):
|
||||
"""Head without title or key meta tags returns empty."""
|
||||
head = "<head><link rel='stylesheet' href='style.css'></head>"
|
||||
assert compute_head_fingerprint(head) == ""
|
||||
|
||||
def test_extracts_title(self):
|
||||
"""Title is extracted and included in fingerprint."""
|
||||
head1 = "<head><title>My Title</title></head>"
|
||||
head2 = "<head><title>My Title</title><link href='x'></head>"
|
||||
# Same title should produce same fingerprint
|
||||
assert compute_head_fingerprint(head1) == compute_head_fingerprint(head2)
|
||||
|
||||
def test_extracts_meta_description(self):
|
||||
"""Meta description is extracted."""
|
||||
head1 = '<head><meta name="description" content="Test description"></head>'
|
||||
head2 = '<head><meta name="description" content="Different description"></head>'
|
||||
assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
|
||||
|
||||
def test_extracts_og_tags(self):
|
||||
"""Open Graph tags are extracted."""
|
||||
head1 = '<head><meta property="og:title" content="OG Title"></head>'
|
||||
head2 = '<head><meta property="og:title" content="Different OG Title"></head>'
|
||||
assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
|
||||
|
||||
def test_extracts_og_image(self):
|
||||
"""og:image is extracted and affects fingerprint."""
|
||||
head1 = '<head><meta property="og:image" content="https://example.com/img1.jpg"></head>'
|
||||
head2 = '<head><meta property="og:image" content="https://example.com/img2.jpg"></head>'
|
||||
assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
|
||||
|
||||
def test_extracts_article_modified_time(self):
|
||||
"""article:modified_time is extracted."""
|
||||
head1 = '<head><meta property="article:modified_time" content="2024-01-01T00:00:00Z"></head>'
|
||||
head2 = '<head><meta property="article:modified_time" content="2024-12-01T00:00:00Z"></head>'
|
||||
assert compute_head_fingerprint(head1) != compute_head_fingerprint(head2)
|
||||
|
||||
def test_case_insensitive(self):
|
||||
"""Fingerprinting is case-insensitive for tags."""
|
||||
head1 = "<head><TITLE>Test</TITLE></head>"
|
||||
head2 = "<head><title>test</title></head>"
|
||||
# Both should extract title (case insensitive)
|
||||
fp1 = compute_head_fingerprint(head1)
|
||||
fp2 = compute_head_fingerprint(head2)
|
||||
assert fp1 != ""
|
||||
assert fp2 != ""
|
||||
|
||||
def test_handles_attribute_order(self):
|
||||
"""Handles different attribute orders in meta tags."""
|
||||
head1 = '<head><meta name="description" content="Test"></head>'
|
||||
head2 = '<head><meta content="Test" name="description"></head>'
|
||||
assert compute_head_fingerprint(head1) == compute_head_fingerprint(head2)
|
||||
|
||||
def test_real_world_head(self):
|
||||
"""Test with a realistic head section."""
|
||||
head = '''
|
||||
<head>
|
||||
<meta charset="utf-8">
|
||||
<title>Python Documentation</title>
|
||||
<meta name="description" content="Official Python documentation">
|
||||
<meta property="og:title" content="Python Docs">
|
||||
<meta property="og:description" content="Learn Python">
|
||||
<meta property="og:image" content="https://python.org/logo.png">
|
||||
<link rel="stylesheet" href="styles.css">
|
||||
</head>
|
||||
'''
|
||||
fp = compute_head_fingerprint(head)
|
||||
assert fp != ""
|
||||
# Should be deterministic
|
||||
assert fp == compute_head_fingerprint(head)
|
||||
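As a quick orientation for the tests above: compute_head_fingerprint condenses the change-relevant <head> signals (title, description, og:* tags, article:modified_time) into a single comparable value and returns an empty string when no signal is present. A minimal sketch of how a stored fingerprint might be compared against a fresh one follows; the stored_fp/current_fp variables are illustrative, not library API.

```python
from crawl4ai.utils import compute_head_fingerprint

# Fingerprint captured when the page was cached (illustrative value).
stored_fp = compute_head_fingerprint("<head><title>My Page</title></head>")

# Fingerprint of the <head> fetched during revalidation.
current_fp = compute_head_fingerprint("<head><title>My Page</title></head>")

if current_fp and current_fp == stored_fp:
    print("head unchanged -> cached copy can be reused")
else:
    print("head changed or no signal -> re-crawl the page")
```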
tests/cache_validation/test_real_domains.py
@@ -0,0 +1,354 @@
"""
|
||||
Real-world tests for cache validation using actual HTTP requests.
|
||||
No mocks - all tests hit real servers.
|
||||
"""
|
||||
|
||||
import pytest
|
||||
from crawl4ai.cache_validator import CacheValidator, CacheValidationResult
|
||||
from crawl4ai.utils import compute_head_fingerprint
|
||||
|
||||
|
||||
class TestRealDomainsConditionalSupport:
|
||||
"""Test domains that support HTTP conditional requests (ETag/Last-Modified)."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_docs_python_org_etag(self):
|
||||
"""docs.python.org supports ETag - should return 304."""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
# First fetch to get ETag
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
assert head_html is not None, "Should fetch head content"
|
||||
assert etag is not None, "docs.python.org should return ETag"
|
||||
|
||||
# Validate with the ETag we just got
|
||||
result = await validator.validate(url=url, stored_etag=etag)
|
||||
|
||||
assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
|
||||
assert "304" in result.reason
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_docs_crawl4ai_etag(self):
|
||||
"""docs.crawl4ai.com supports ETag - should return 304."""
|
||||
url = "https://docs.crawl4ai.com/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
assert etag is not None, "docs.crawl4ai.com should return ETag"
|
||||
|
||||
result = await validator.validate(url=url, stored_etag=etag)
|
||||
|
||||
assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_wikipedia_last_modified(self):
|
||||
"""Wikipedia supports Last-Modified - should return 304."""
|
||||
url = "https://en.wikipedia.org/wiki/Web_crawler"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
assert last_modified is not None, "Wikipedia should return Last-Modified"
|
||||
|
||||
result = await validator.validate(url=url, stored_last_modified=last_modified)
|
||||
|
||||
assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_github_pages(self):
|
||||
"""GitHub Pages supports conditional requests."""
|
||||
url = "https://pages.github.com/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
# GitHub Pages typically has at least one
|
||||
has_conditional = etag is not None or last_modified is not None
|
||||
assert has_conditional, "GitHub Pages should support conditional requests"
|
||||
|
||||
result = await validator.validate(
|
||||
url=url,
|
||||
stored_etag=etag,
|
||||
stored_last_modified=last_modified,
|
||||
)
|
||||
|
||||
assert result.status == CacheValidationResult.FRESH
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_httpbin_etag(self):
|
||||
"""httpbin.org/etag endpoint for testing ETag."""
|
||||
url = "https://httpbin.org/etag/test-etag-value"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
result = await validator.validate(url=url, stored_etag='"test-etag-value"')
|
||||
|
||||
# httpbin should return 304 for matching ETag
|
||||
assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
|
||||
|
||||
|
||||
class TestRealDomainsNoConditionalSupport:
|
||||
"""Test domains that may NOT support HTTP conditional requests."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_dynamic_site_fingerprint_fallback(self):
|
||||
"""Test fingerprint-based validation for sites without conditional support."""
|
||||
# Use a site that changes frequently but has stable head
|
||||
url = "https://example.com/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
# Get head and compute fingerprint
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
assert head_html is not None
|
||||
fingerprint = compute_head_fingerprint(head_html)
|
||||
|
||||
# Validate using fingerprint (not etag/last-modified)
|
||||
result = await validator.validate(
|
||||
url=url,
|
||||
stored_head_fingerprint=fingerprint,
|
||||
)
|
||||
|
||||
# Should be FRESH since fingerprint should match
|
||||
assert result.status == CacheValidationResult.FRESH, f"Expected FRESH, got {result.status}: {result.reason}"
|
||||
assert "fingerprint" in result.reason.lower()
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_news_site_changes_frequently(self):
|
||||
"""News sites change frequently - test that we can detect changes."""
|
||||
url = "https://www.bbc.com/news"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
# BBC News has ETag but it changes with content
|
||||
assert head_html is not None
|
||||
|
||||
# Using a fake old ETag should return STALE (200 with different content)
|
||||
result = await validator.validate(
|
||||
url=url,
|
||||
stored_etag='"fake-old-etag-12345"',
|
||||
)
|
||||
|
||||
# Should be STALE because the ETag doesn't match
|
||||
assert result.status == CacheValidationResult.STALE, f"Expected STALE, got {result.status}: {result.reason}"
|
||||
|
||||
|
||||
class TestRealDomainsEdgeCases:
|
||||
"""Edge cases with real domains."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nonexistent_domain(self):
|
||||
"""Non-existent domain should return ERROR."""
|
||||
url = "https://this-domain-definitely-does-not-exist-xyz123.com/"
|
||||
|
||||
async with CacheValidator(timeout=5.0) as validator:
|
||||
result = await validator.validate(url=url, stored_etag='"test"')
|
||||
|
||||
assert result.status == CacheValidationResult.ERROR
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_timeout_slow_server(self):
|
||||
"""Test timeout handling with a slow endpoint."""
|
||||
# httpbin delay endpoint
|
||||
url = "https://httpbin.org/delay/10"
|
||||
|
||||
async with CacheValidator(timeout=2.0) as validator: # 2 second timeout
|
||||
result = await validator.validate(url=url, stored_etag='"test"')
|
||||
|
||||
# Should timeout and return ERROR
|
||||
assert result.status == CacheValidationResult.ERROR
|
||||
assert "timeout" in result.reason.lower() or "timed out" in result.reason.lower()
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_redirect_handling(self):
|
||||
"""Test that redirects are followed."""
|
||||
# httpbin redirect
|
||||
url = "https://httpbin.org/redirect/1"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
# Should follow redirect and get content
|
||||
# The final page might not have useful head content, but shouldn't error
|
||||
# This tests that redirects are handled
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_https_only(self):
|
||||
"""Test HTTPS connection."""
|
||||
url = "https://www.google.com/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
assert head_html is not None
|
||||
assert "<title" in head_html.lower()
|
||||
|
||||
|
||||
class TestRealDomainsHeadFingerprint:
|
||||
"""Test head fingerprint extraction with real domains."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_python_docs_fingerprint(self):
|
||||
"""Python docs has title and meta tags."""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, _, _ = await validator._fetch_head(url)
|
||||
|
||||
assert head_html is not None
|
||||
fingerprint = compute_head_fingerprint(head_html)
|
||||
|
||||
assert fingerprint != "", "Should extract fingerprint from Python docs"
|
||||
|
||||
# Fingerprint should be consistent
|
||||
fingerprint2 = compute_head_fingerprint(head_html)
|
||||
assert fingerprint == fingerprint2
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_github_fingerprint(self):
|
||||
"""GitHub has og: tags."""
|
||||
url = "https://github.com/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, _, _ = await validator._fetch_head(url)
|
||||
|
||||
assert head_html is not None
|
||||
assert "og:" in head_html.lower() or "title" in head_html.lower()
|
||||
|
||||
fingerprint = compute_head_fingerprint(head_html)
|
||||
assert fingerprint != ""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_crawl4ai_docs_fingerprint(self):
|
||||
"""Crawl4AI docs should have title and description."""
|
||||
url = "https://docs.crawl4ai.com/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, _, _ = await validator._fetch_head(url)
|
||||
|
||||
assert head_html is not None
|
||||
fingerprint = compute_head_fingerprint(head_html)
|
||||
|
||||
assert fingerprint != "", "Should extract fingerprint from Crawl4AI docs"
|
||||
|
||||
|
||||
class TestRealDomainsFetchHead:
|
||||
"""Test _fetch_head functionality with real domains."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_fetch_stops_at_head_close(self):
|
||||
"""Verify we stop reading after </head>."""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, _, _ = await validator._fetch_head(url)
|
||||
|
||||
assert head_html is not None
|
||||
assert "</head>" in head_html.lower()
|
||||
# Should NOT contain body content
|
||||
assert "<body" not in head_html.lower() or head_html.lower().index("</head>") < head_html.lower().find("<body")
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_extracts_both_headers(self):
|
||||
"""Test extraction of both ETag and Last-Modified."""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
# Python docs should have both
|
||||
assert etag is not None, "Should have ETag"
|
||||
assert last_modified is not None, "Should have Last-Modified"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_handles_missing_head_tag(self):
|
||||
"""Handle pages that might not have proper head structure."""
|
||||
# API endpoint that returns JSON (no HTML head)
|
||||
url = "https://httpbin.org/json"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
|
||||
# Should not crash, may return partial content or None
|
||||
# The important thing is it doesn't error
|
||||
|
||||
|
||||
class TestRealDomainsValidationCombinations:
|
||||
"""Test various combinations of validation data."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_etag_only(self):
|
||||
"""Validate with only ETag."""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
_, etag, _ = await validator._fetch_head(url)
|
||||
|
||||
result = await validator.validate(url=url, stored_etag=etag)
|
||||
assert result.status == CacheValidationResult.FRESH
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_last_modified_only(self):
|
||||
"""Validate with only Last-Modified."""
|
||||
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
_, _, last_modified = await validator._fetch_head(url)
|
||||
|
||||
if last_modified:
|
||||
result = await validator.validate(url=url, stored_last_modified=last_modified)
|
||||
assert result.status == CacheValidationResult.FRESH
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_fingerprint_only(self):
|
||||
"""Validate with only fingerprint."""
|
||||
url = "https://example.com/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, _, _ = await validator._fetch_head(url)
|
||||
fingerprint = compute_head_fingerprint(head_html)
|
||||
|
||||
if fingerprint:
|
||||
result = await validator.validate(url=url, stored_head_fingerprint=fingerprint)
|
||||
assert result.status == CacheValidationResult.FRESH
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_all_validation_data(self):
|
||||
"""Validate with all available data."""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, etag, last_modified = await validator._fetch_head(url)
|
||||
fingerprint = compute_head_fingerprint(head_html)
|
||||
|
||||
result = await validator.validate(
|
||||
url=url,
|
||||
stored_etag=etag,
|
||||
stored_last_modified=last_modified,
|
||||
stored_head_fingerprint=fingerprint,
|
||||
)
|
||||
|
||||
assert result.status == CacheValidationResult.FRESH
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_stale_etag_fresh_fingerprint(self):
|
||||
"""When ETag is stale but fingerprint matches, should be FRESH."""
|
||||
url = "https://docs.python.org/3/"
|
||||
|
||||
async with CacheValidator(timeout=15.0) as validator:
|
||||
head_html, _, _ = await validator._fetch_head(url)
|
||||
fingerprint = compute_head_fingerprint(head_html)
|
||||
|
||||
# Use fake ETag but real fingerprint
|
||||
result = await validator.validate(
|
||||
url=url,
|
||||
stored_etag='"fake-stale-etag"',
|
||||
stored_head_fingerprint=fingerprint,
|
||||
)
|
||||
|
||||
# Fingerprint should save us
|
||||
assert result.status == CacheValidationResult.FRESH
|
||||
assert "fingerprint" in result.reason.lower()
|
||||
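The CacheValidator flow these tests exercise can be summarised in a few lines. This is a sketch under the same assumptions the tests make (validate() accepts stored_etag / stored_last_modified / stored_head_fingerprint and returns a result with .status and .reason); the wrapper function and its arguments are illustrative, not part of the library.

```python
import asyncio
from crawl4ai.cache_validator import CacheValidator, CacheValidationResult

async def cached_copy_still_fresh(url, stored_etag=None, stored_fingerprint=None):
    # stored_etag / stored_fingerprint are whatever was saved alongside the cached page.
    async with CacheValidator(timeout=10.0) as validator:
        result = await validator.validate(
            url=url,
            stored_etag=stored_etag,
            stored_head_fingerprint=stored_fingerprint,
        )
    print(result.status, "-", result.reason)
    # FRESH -> reuse the cache, STALE -> re-crawl, ERROR -> unreachable or timed out
    return result.status == CacheValidationResult.FRESH

asyncio.run(cached_copy_still_fresh("https://docs.python.org/3/", stored_etag='"some-etag"'))
```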
tests/deep_crawling/__init__.py
tests/deep_crawling/test_deep_crawl_resume.py
@@ -0,0 +1,773 @@
|
||||
"""
|
||||
Test Suite: Deep Crawl Resume/Crash Recovery Tests
|
||||
|
||||
Tests that verify:
|
||||
1. State export produces valid JSON-serializable data
|
||||
2. Resume from checkpoint continues without duplicates
|
||||
3. Simulated crash at various points recovers correctly
|
||||
4. State callback fires at expected intervals
|
||||
5. No damage to existing system behavior (regression tests)
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import asyncio
|
||||
import json
|
||||
from typing import Dict, Any, List
|
||||
from unittest.mock import AsyncMock, MagicMock
|
||||
|
||||
from crawl4ai.deep_crawling import (
|
||||
BFSDeepCrawlStrategy,
|
||||
DFSDeepCrawlStrategy,
|
||||
BestFirstCrawlingStrategy,
|
||||
FilterChain,
|
||||
URLPatternFilter,
|
||||
DomainFilter,
|
||||
)
|
||||
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Helper Functions for Mock Crawler
|
||||
# ============================================================================
|
||||
|
||||
def create_mock_config(stream=False):
|
||||
"""Create a mock CrawlerRunConfig."""
|
||||
config = MagicMock()
|
||||
config.clone = MagicMock(return_value=config)
|
||||
config.stream = stream
|
||||
return config
|
||||
|
||||
|
||||
def create_mock_crawler_with_links(num_links: int = 3, include_keyword: bool = False):
|
||||
"""Create mock crawler that returns results with links."""
|
||||
call_count = 0
|
||||
|
||||
async def mock_arun_many(urls, config):
|
||||
nonlocal call_count
|
||||
results = []
|
||||
for url in urls:
|
||||
call_count += 1
|
||||
result = MagicMock()
|
||||
result.url = url
|
||||
result.success = True
|
||||
result.metadata = {}
|
||||
|
||||
# Generate child links
|
||||
links = []
|
||||
for i in range(num_links):
|
||||
link_url = f"{url}/child{call_count}_{i}"
|
||||
if include_keyword:
|
||||
link_url = f"{url}/important-child{call_count}_{i}"
|
||||
links.append({"href": link_url})
|
||||
|
||||
result.links = {"internal": links, "external": []}
|
||||
results.append(result)
|
||||
|
||||
# For streaming mode, return async generator
|
||||
if config.stream:
|
||||
async def gen():
|
||||
for r in results:
|
||||
yield r
|
||||
return gen()
|
||||
return results
|
||||
|
||||
crawler = MagicMock()
|
||||
crawler.arun_many = mock_arun_many
|
||||
return crawler
|
||||
|
||||
|
||||
def create_mock_crawler_tracking(crawl_order: List[str], return_no_links: bool = False):
|
||||
"""Create mock crawler that tracks crawl order."""
|
||||
|
||||
async def mock_arun_many(urls, config):
|
||||
results = []
|
||||
for url in urls:
|
||||
crawl_order.append(url)
|
||||
result = MagicMock()
|
||||
result.url = url
|
||||
result.success = True
|
||||
result.metadata = {}
|
||||
result.links = {"internal": [], "external": []} if return_no_links else {"internal": [{"href": f"{url}/child"}], "external": []}
|
||||
results.append(result)
|
||||
|
||||
# For streaming mode, return async generator
|
||||
if config.stream:
|
||||
async def gen():
|
||||
for r in results:
|
||||
yield r
|
||||
return gen()
|
||||
return results
|
||||
|
||||
crawler = MagicMock()
|
||||
crawler.arun_many = mock_arun_many
|
||||
return crawler
|
||||
|
||||
|
||||
def create_simple_mock_crawler():
|
||||
"""Basic mock crawler returning 1 result with 2 child links."""
|
||||
call_count = 0
|
||||
|
||||
async def mock_arun_many(urls, config):
|
||||
nonlocal call_count
|
||||
results = []
|
||||
for url in urls:
|
||||
call_count += 1
|
||||
result = MagicMock()
|
||||
result.url = url
|
||||
result.success = True
|
||||
result.metadata = {}
|
||||
result.links = {
|
||||
"internal": [
|
||||
{"href": f"{url}/child1"},
|
||||
{"href": f"{url}/child2"},
|
||||
],
|
||||
"external": []
|
||||
}
|
||||
results.append(result)
|
||||
|
||||
if config.stream:
|
||||
async def gen():
|
||||
for r in results:
|
||||
yield r
|
||||
return gen()
|
||||
return results
|
||||
|
||||
crawler = MagicMock()
|
||||
crawler.arun_many = mock_arun_many
|
||||
return crawler
|
||||
|
||||
|
||||
def create_mock_crawler_unlimited_links():
|
||||
"""Mock crawler that always returns links (for testing limits)."""
|
||||
async def mock_arun_many(urls, config):
|
||||
results = []
|
||||
for url in urls:
|
||||
result = MagicMock()
|
||||
result.url = url
|
||||
result.success = True
|
||||
result.metadata = {}
|
||||
result.links = {
|
||||
"internal": [{"href": f"{url}/link{i}"} for i in range(10)],
|
||||
"external": []
|
||||
}
|
||||
results.append(result)
|
||||
|
||||
if config.stream:
|
||||
async def gen():
|
||||
for r in results:
|
||||
yield r
|
||||
return gen()
|
||||
return results
|
||||
|
||||
crawler = MagicMock()
|
||||
crawler.arun_many = mock_arun_many
|
||||
return crawler
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# TEST SUITE 1: Crash Recovery Tests
|
||||
# ============================================================================
|
||||
|
||||
class TestBFSResume:
|
||||
"""BFS strategy resume tests."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_state_export_json_serializable(self):
|
||||
"""Verify exported state can be JSON serialized."""
|
||||
captured_states: List[Dict] = []
|
||||
|
||||
async def capture_state(state: Dict[str, Any]):
|
||||
# Verify JSON serializable
|
||||
json_str = json.dumps(state)
|
||||
parsed = json.loads(json_str)
|
||||
captured_states.append(parsed)
|
||||
|
||||
strategy = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
max_pages=10,
|
||||
on_state_change=capture_state,
|
||||
)
|
||||
|
||||
# Create mock crawler that returns predictable results
|
||||
mock_crawler = create_mock_crawler_with_links(num_links=3)
|
||||
mock_config = create_mock_config()
|
||||
|
||||
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
# Verify states were captured
|
||||
assert len(captured_states) > 0
|
||||
|
||||
# Verify state structure
|
||||
for state in captured_states:
|
||||
assert state["strategy_type"] == "bfs"
|
||||
assert "visited" in state
|
||||
assert "pending" in state
|
||||
assert "depths" in state
|
||||
assert "pages_crawled" in state
|
||||
assert isinstance(state["visited"], list)
|
||||
assert isinstance(state["pending"], list)
|
||||
assert isinstance(state["depths"], dict)
|
||||
assert isinstance(state["pages_crawled"], int)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_resume_continues_from_checkpoint(self):
|
||||
"""Verify resume starts from saved state, not beginning."""
|
||||
# Simulate state from previous crawl (visited 5 URLs, 3 pending)
|
||||
saved_state = {
|
||||
"strategy_type": "bfs",
|
||||
"visited": [
|
||||
"https://example.com",
|
||||
"https://example.com/page1",
|
||||
"https://example.com/page2",
|
||||
"https://example.com/page3",
|
||||
"https://example.com/page4",
|
||||
],
|
||||
"pending": [
|
||||
{"url": "https://example.com/page5", "parent_url": "https://example.com/page2"},
|
||||
{"url": "https://example.com/page6", "parent_url": "https://example.com/page3"},
|
||||
{"url": "https://example.com/page7", "parent_url": "https://example.com/page3"},
|
||||
],
|
||||
"depths": {
|
||||
"https://example.com": 0,
|
||||
"https://example.com/page1": 1,
|
||||
"https://example.com/page2": 1,
|
||||
"https://example.com/page3": 1,
|
||||
"https://example.com/page4": 1,
|
||||
"https://example.com/page5": 2,
|
||||
"https://example.com/page6": 2,
|
||||
"https://example.com/page7": 2,
|
||||
},
|
||||
"pages_crawled": 5,
|
||||
}
|
||||
|
||||
crawled_urls: List[str] = []
|
||||
|
||||
strategy = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
max_pages=20,
|
||||
resume_state=saved_state,
|
||||
)
|
||||
|
||||
# Verify internal state was restored
|
||||
assert strategy._resume_state == saved_state
|
||||
|
||||
mock_crawler = create_mock_crawler_tracking(crawled_urls, return_no_links=True)
|
||||
mock_config = create_mock_config()
|
||||
|
||||
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
# Should NOT re-crawl already visited URLs
|
||||
for visited_url in saved_state["visited"]:
|
||||
assert visited_url not in crawled_urls, f"Re-crawled already visited: {visited_url}"
|
||||
|
||||
# Should crawl pending URLs
|
||||
for pending in saved_state["pending"]:
|
||||
assert pending["url"] in crawled_urls, f"Did not crawl pending: {pending['url']}"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_simulated_crash_mid_crawl(self):
|
||||
"""Simulate crash at URL N, verify resume continues from pending URLs."""
|
||||
crash_after = 3
|
||||
states_before_crash: List[Dict] = []
|
||||
|
||||
async def capture_until_crash(state: Dict[str, Any]):
|
||||
states_before_crash.append(state)
|
||||
if state["pages_crawled"] >= crash_after:
|
||||
raise Exception("Simulated crash!")
|
||||
|
||||
strategy1 = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
max_pages=10,
|
||||
on_state_change=capture_until_crash,
|
||||
)
|
||||
|
||||
mock_crawler = create_mock_crawler_with_links(num_links=5)
|
||||
mock_config = create_mock_config()
|
||||
|
||||
# First crawl - crashes
|
||||
with pytest.raises(Exception, match="Simulated crash"):
|
||||
await strategy1._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
# Get last state before crash
|
||||
last_state = states_before_crash[-1]
|
||||
assert last_state["pages_crawled"] >= crash_after
|
||||
|
||||
# Calculate which URLs were already crawled vs pending
|
||||
pending_urls = {item["url"] for item in last_state["pending"]}
|
||||
visited_urls = set(last_state["visited"])
|
||||
already_crawled_urls = visited_urls - pending_urls
|
||||
|
||||
# Resume from checkpoint
|
||||
crawled_in_resume: List[str] = []
|
||||
|
||||
strategy2 = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
max_pages=10,
|
||||
resume_state=last_state,
|
||||
)
|
||||
|
||||
mock_crawler2 = create_mock_crawler_tracking(crawled_in_resume, return_no_links=True)
|
||||
|
||||
await strategy2._arun_batch("https://example.com", mock_crawler2, mock_config)
|
||||
|
||||
# Verify already-crawled URLs are not re-crawled
|
||||
for crawled_url in already_crawled_urls:
|
||||
assert crawled_url not in crawled_in_resume, f"Re-crawled already visited: {crawled_url}"
|
||||
|
||||
# Verify pending URLs are crawled
|
||||
for pending_url in pending_urls:
|
||||
assert pending_url in crawled_in_resume, f"Did not crawl pending: {pending_url}"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_callback_fires_per_url(self):
|
||||
"""Verify callback fires after each URL for maximum granularity."""
|
||||
callback_count = 0
|
||||
pages_crawled_sequence: List[int] = []
|
||||
|
||||
async def count_callbacks(state: Dict[str, Any]):
|
||||
nonlocal callback_count
|
||||
callback_count += 1
|
||||
pages_crawled_sequence.append(state["pages_crawled"])
|
||||
|
||||
strategy = BFSDeepCrawlStrategy(
|
||||
max_depth=1,
|
||||
max_pages=5,
|
||||
on_state_change=count_callbacks,
|
||||
)
|
||||
|
||||
mock_crawler = create_mock_crawler_with_links(num_links=2)
|
||||
mock_config = create_mock_config()
|
||||
|
||||
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
# Callback should fire once per successful URL
|
||||
assert callback_count == strategy._pages_crawled, \
|
||||
f"Callback fired {callback_count} times, expected {strategy._pages_crawled} (per URL)"
|
||||
|
||||
# pages_crawled should increment by 1 each callback
|
||||
for i, count in enumerate(pages_crawled_sequence):
|
||||
assert count == i + 1, f"Expected pages_crawled={i+1} at callback {i}, got {count}"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_export_state_returns_last_captured(self):
|
||||
"""Verify export_state() returns last captured state."""
|
||||
last_state = None
|
||||
|
||||
async def capture(state):
|
||||
nonlocal last_state
|
||||
last_state = state
|
||||
|
||||
strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=5, on_state_change=capture)
|
||||
|
||||
mock_crawler = create_mock_crawler_with_links(num_links=2)
|
||||
mock_config = create_mock_config()
|
||||
|
||||
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
exported = strategy.export_state()
|
||||
assert exported == last_state
|
||||
|
||||
|
||||
class TestDFSResume:
|
||||
"""DFS strategy resume tests."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_state_export_includes_stack_and_dfs_seen(self):
|
||||
"""Verify DFS state includes stack structure and _dfs_seen."""
|
||||
captured_states: List[Dict] = []
|
||||
|
||||
async def capture_state(state: Dict[str, Any]):
|
||||
captured_states.append(state)
|
||||
|
||||
strategy = DFSDeepCrawlStrategy(
|
||||
max_depth=3,
|
||||
max_pages=10,
|
||||
on_state_change=capture_state,
|
||||
)
|
||||
|
||||
mock_crawler = create_mock_crawler_with_links(num_links=2)
|
||||
mock_config = create_mock_config()
|
||||
|
||||
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
assert len(captured_states) > 0
|
||||
|
||||
for state in captured_states:
|
||||
assert state["strategy_type"] == "dfs"
|
||||
assert "stack" in state
|
||||
assert "dfs_seen" in state
|
||||
# Stack items should have depth
|
||||
for item in state["stack"]:
|
||||
assert "url" in item
|
||||
assert "parent_url" in item
|
||||
assert "depth" in item
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_resume_restores_stack_order(self):
|
||||
"""Verify DFS stack order is preserved on resume."""
|
||||
saved_state = {
|
||||
"strategy_type": "dfs",
|
||||
"visited": ["https://example.com"],
|
||||
"stack": [
|
||||
{"url": "https://example.com/deep3", "parent_url": "https://example.com/deep2", "depth": 3},
|
||||
{"url": "https://example.com/deep2", "parent_url": "https://example.com/deep1", "depth": 2},
|
||||
{"url": "https://example.com/page1", "parent_url": "https://example.com", "depth": 1},
|
||||
],
|
||||
"depths": {"https://example.com": 0},
|
||||
"pages_crawled": 1,
|
||||
"dfs_seen": ["https://example.com", "https://example.com/deep3", "https://example.com/deep2", "https://example.com/page1"],
|
||||
}
|
||||
|
||||
crawl_order: List[str] = []
|
||||
|
||||
strategy = DFSDeepCrawlStrategy(
|
||||
max_depth=3,
|
||||
max_pages=10,
|
||||
resume_state=saved_state,
|
||||
)
|
||||
|
||||
mock_crawler = create_mock_crawler_tracking(crawl_order, return_no_links=True)
|
||||
mock_config = create_mock_config()
|
||||
|
||||
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
# DFS pops from end of stack, so order should be: page1, deep2, deep3
|
||||
assert crawl_order[0] == "https://example.com/page1"
|
||||
assert crawl_order[1] == "https://example.com/deep2"
|
||||
assert crawl_order[2] == "https://example.com/deep3"
|
||||
|
||||
|
||||
class TestBestFirstResume:
|
||||
"""Best-First strategy resume tests."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_state_export_includes_scored_queue(self):
|
||||
"""Verify Best-First state includes queue with scores."""
|
||||
captured_states: List[Dict] = []
|
||||
|
||||
async def capture_state(state: Dict[str, Any]):
|
||||
captured_states.append(state)
|
||||
|
||||
scorer = KeywordRelevanceScorer(keywords=["important"], weight=1.0)
|
||||
|
||||
strategy = BestFirstCrawlingStrategy(
|
||||
max_depth=2,
|
||||
max_pages=10,
|
||||
url_scorer=scorer,
|
||||
on_state_change=capture_state,
|
||||
)
|
||||
|
||||
mock_crawler = create_mock_crawler_with_links(num_links=3, include_keyword=True)
|
||||
mock_config = create_mock_config(stream=True)
|
||||
|
||||
async for _ in strategy._arun_stream("https://example.com", mock_crawler, mock_config):
|
||||
pass
|
||||
|
||||
assert len(captured_states) > 0
|
||||
|
||||
for state in captured_states:
|
||||
assert state["strategy_type"] == "best_first"
|
||||
assert "queue_items" in state
|
||||
for item in state["queue_items"]:
|
||||
assert "score" in item
|
||||
assert "depth" in item
|
||||
assert "url" in item
|
||||
assert "parent_url" in item
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_resume_maintains_priority_order(self):
|
||||
"""Verify priority queue order is maintained on resume."""
|
||||
saved_state = {
|
||||
"strategy_type": "best_first",
|
||||
"visited": ["https://example.com"],
|
||||
"queue_items": [
|
||||
{"score": -0.9, "depth": 1, "url": "https://example.com/high-priority", "parent_url": "https://example.com"},
|
||||
{"score": -0.5, "depth": 1, "url": "https://example.com/medium-priority", "parent_url": "https://example.com"},
|
||||
{"score": -0.1, "depth": 1, "url": "https://example.com/low-priority", "parent_url": "https://example.com"},
|
||||
],
|
||||
"depths": {"https://example.com": 0},
|
||||
"pages_crawled": 1,
|
||||
}
|
||||
|
||||
crawl_order: List[str] = []
|
||||
|
||||
strategy = BestFirstCrawlingStrategy(
|
||||
max_depth=2,
|
||||
max_pages=10,
|
||||
resume_state=saved_state,
|
||||
)
|
||||
|
||||
mock_crawler = create_mock_crawler_tracking(crawl_order, return_no_links=True)
|
||||
mock_config = create_mock_config(stream=True)
|
||||
|
||||
async for _ in strategy._arun_stream("https://example.com", mock_crawler, mock_config):
|
||||
pass
|
||||
|
||||
# Higher negative score = higher priority (min-heap)
|
||||
# So -0.9 should be crawled first
|
||||
assert crawl_order[0] == "https://example.com/high-priority"
|
||||
|
||||
|
||||
class TestCrossStrategyResume:
|
||||
"""Tests that apply to all strategies."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@pytest.mark.parametrize("strategy_class,strategy_type", [
|
||||
(BFSDeepCrawlStrategy, "bfs"),
|
||||
(DFSDeepCrawlStrategy, "dfs"),
|
||||
(BestFirstCrawlingStrategy, "best_first"),
|
||||
])
|
||||
async def test_no_callback_means_no_overhead(self, strategy_class, strategy_type):
|
||||
"""Verify no state tracking when callback is None."""
|
||||
strategy = strategy_class(max_depth=2, max_pages=5)
|
||||
|
||||
# _queue_shadow should be None for Best-First when no callback
|
||||
if strategy_class == BestFirstCrawlingStrategy:
|
||||
assert strategy._queue_shadow is None
|
||||
|
||||
# _last_state should be None initially
|
||||
assert strategy._last_state is None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@pytest.mark.parametrize("strategy_class", [
|
||||
BFSDeepCrawlStrategy,
|
||||
DFSDeepCrawlStrategy,
|
||||
BestFirstCrawlingStrategy,
|
||||
])
|
||||
async def test_export_state_returns_last_captured(self, strategy_class):
|
||||
"""Verify export_state() returns last captured state."""
|
||||
last_state = None
|
||||
|
||||
async def capture(state):
|
||||
nonlocal last_state
|
||||
last_state = state
|
||||
|
||||
strategy = strategy_class(max_depth=2, max_pages=5, on_state_change=capture)
|
||||
|
||||
mock_crawler = create_mock_crawler_with_links(num_links=2)
|
||||
|
||||
if strategy_class == BestFirstCrawlingStrategy:
|
||||
mock_config = create_mock_config(stream=True)
|
||||
async for _ in strategy._arun_stream("https://example.com", mock_crawler, mock_config):
|
||||
pass
|
||||
else:
|
||||
mock_config = create_mock_config()
|
||||
await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
exported = strategy.export_state()
|
||||
assert exported == last_state
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# TEST SUITE 2: Regression Tests (No Damage to Current System)
|
||||
# ============================================================================
|
||||
|
||||
class TestBFSRegressions:
|
||||
"""Ensure BFS works identically when new params not used."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_default_params_unchanged(self):
|
||||
"""Constructor with only original params works."""
|
||||
strategy = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
include_external=False,
|
||||
max_pages=10,
|
||||
)
|
||||
|
||||
assert strategy.max_depth == 2
|
||||
assert strategy.include_external == False
|
||||
assert strategy.max_pages == 10
|
||||
assert strategy._resume_state is None
|
||||
assert strategy._on_state_change is None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_filter_chain_still_works(self):
|
||||
"""FilterChain integration unchanged."""
|
||||
filter_chain = FilterChain([
|
||||
URLPatternFilter(patterns=["*/blog/*"]),
|
||||
DomainFilter(allowed_domains=["example.com"]),
|
||||
])
|
||||
|
||||
strategy = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
filter_chain=filter_chain,
|
||||
)
|
||||
|
||||
# Test filter still applies
|
||||
assert await strategy.can_process_url("https://example.com/blog/post1", 1) == True
|
||||
assert await strategy.can_process_url("https://other.com/blog/post1", 1) == False
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_url_scorer_still_works(self):
|
||||
"""URL scoring integration unchanged."""
|
||||
scorer = KeywordRelevanceScorer(keywords=["python", "tutorial"], weight=1.0)
|
||||
|
||||
strategy = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
url_scorer=scorer,
|
||||
score_threshold=0.5,
|
||||
)
|
||||
|
||||
assert strategy.url_scorer is not None
|
||||
assert strategy.score_threshold == 0.5
|
||||
|
||||
# Scorer should work
|
||||
score = scorer.score("https://example.com/python-tutorial")
|
||||
assert score > 0
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_batch_mode_returns_list(self):
|
||||
"""Batch mode still returns List[CrawlResult]."""
|
||||
strategy = BFSDeepCrawlStrategy(max_depth=1, max_pages=5)
|
||||
|
||||
mock_crawler = create_simple_mock_crawler()
|
||||
mock_config = create_mock_config(stream=False)
|
||||
|
||||
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
assert isinstance(results, list)
|
||||
assert len(results) > 0
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_max_pages_limit_respected(self):
|
||||
"""max_pages limit still enforced."""
|
||||
strategy = BFSDeepCrawlStrategy(max_depth=10, max_pages=3)
|
||||
|
||||
mock_crawler = create_mock_crawler_unlimited_links()
|
||||
mock_config = create_mock_config()
|
||||
|
||||
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
# Should stop at max_pages
|
||||
assert strategy._pages_crawled <= 3
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_max_depth_limit_respected(self):
|
||||
"""max_depth limit still enforced."""
|
||||
strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=100)
|
||||
|
||||
mock_crawler = create_mock_crawler_unlimited_links()
|
||||
mock_config = create_mock_config()
|
||||
|
||||
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
# All results should have depth <= max_depth
|
||||
for result in results:
|
||||
assert result.metadata.get("depth", 0) <= 2
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_metadata_depth_still_set(self):
|
||||
"""Result metadata still includes depth."""
|
||||
strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=5)
|
||||
|
||||
mock_crawler = create_simple_mock_crawler()
|
||||
mock_config = create_mock_config()
|
||||
|
||||
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
for result in results:
|
||||
assert "depth" in result.metadata
|
||||
assert isinstance(result.metadata["depth"], int)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_metadata_parent_url_still_set(self):
|
||||
"""Result metadata still includes parent_url."""
|
||||
strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=5)
|
||||
|
||||
mock_crawler = create_simple_mock_crawler()
|
||||
mock_config = create_mock_config()
|
||||
|
||||
results = await strategy._arun_batch("https://example.com", mock_crawler, mock_config)
|
||||
|
||||
# First result (start URL) should have parent_url = None
|
||||
assert results[0].metadata.get("parent_url") is None
|
||||
|
||||
# Child results should have parent_url set
|
||||
for result in results[1:]:
|
||||
assert "parent_url" in result.metadata
|
||||
|
||||
|
||||
class TestDFSRegressions:
|
||||
"""Ensure DFS works identically when new params not used."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_inherits_bfs_params(self):
|
||||
"""DFS still inherits all BFS parameters."""
|
||||
strategy = DFSDeepCrawlStrategy(
|
||||
max_depth=3,
|
||||
include_external=True,
|
||||
max_pages=20,
|
||||
score_threshold=0.5,
|
||||
)
|
||||
|
||||
assert strategy.max_depth == 3
|
||||
assert strategy.include_external == True
|
||||
assert strategy.max_pages == 20
|
||||
assert strategy.score_threshold == 0.5
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_dfs_seen_initialized(self):
|
||||
"""DFS _dfs_seen set still initialized."""
|
||||
strategy = DFSDeepCrawlStrategy(max_depth=2)
|
||||
|
||||
assert hasattr(strategy, '_dfs_seen')
|
||||
assert isinstance(strategy._dfs_seen, set)
|
||||
|
||||
|
||||
class TestBestFirstRegressions:
|
||||
"""Ensure Best-First works identically when new params not used."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_default_params_unchanged(self):
|
||||
"""Constructor with only original params works."""
|
||||
strategy = BestFirstCrawlingStrategy(
|
||||
max_depth=2,
|
||||
include_external=False,
|
||||
max_pages=10,
|
||||
)
|
||||
|
||||
assert strategy.max_depth == 2
|
||||
assert strategy.include_external == False
|
||||
assert strategy.max_pages == 10
|
||||
assert strategy._resume_state is None
|
||||
assert strategy._on_state_change is None
|
||||
assert strategy._queue_shadow is None # Not initialized without callback
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_scorer_integration(self):
|
||||
"""URL scorer still affects crawl priority."""
|
||||
scorer = KeywordRelevanceScorer(keywords=["important"], weight=1.0)
|
||||
|
||||
strategy = BestFirstCrawlingStrategy(
|
||||
max_depth=2,
|
||||
max_pages=10,
|
||||
url_scorer=scorer,
|
||||
)
|
||||
|
||||
assert strategy.url_scorer is scorer
|
||||
|
||||
|
||||
class TestAPICompatibility:
|
||||
"""Ensure API/serialization compatibility."""
|
||||
|
||||
def test_strategy_signature_backward_compatible(self):
|
||||
"""Old code calling with positional/keyword args still works."""
|
||||
# Positional args (old style)
|
||||
s1 = BFSDeepCrawlStrategy(2)
|
||||
assert s1.max_depth == 2
|
||||
|
||||
# Keyword args (old style)
|
||||
s2 = BFSDeepCrawlStrategy(max_depth=3, max_pages=10)
|
||||
assert s2.max_depth == 3
|
||||
|
||||
# Mixed (old style)
|
||||
s3 = BFSDeepCrawlStrategy(2, FilterChain(), None, False, float('-inf'), 100)
|
||||
assert s3.max_depth == 2
|
||||
assert s3.max_pages == 100
|
||||
|
||||
def test_no_required_new_params(self):
|
||||
"""New params are optional, not required."""
|
||||
# Should not raise
|
||||
BFSDeepCrawlStrategy(max_depth=2)
|
||||
DFSDeepCrawlStrategy(max_depth=2)
|
||||
BestFirstCrawlingStrategy(max_depth=2)
|
||||
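The unit tests above drive the strategies directly with mocks; in practice the same resume_state / on_state_change knobs are wired to external storage. A minimal sketch follows, assuming a local JSON file as the checkpoint store; the file name and save_state helper are illustrative, not part of the library.

```python
import asyncio
import json
from pathlib import Path

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

STATE_FILE = Path("crawl_state.json")  # illustrative; any store (file, Redis, DB) works

async def save_state(state: dict) -> None:
    # Called after every crawled URL; the state dict is JSON-serializable by design.
    STATE_FILE.write_text(json.dumps(state))

async def main():
    resume_state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else None

    strategy = BFSDeepCrawlStrategy(
        max_depth=2,
        max_pages=20,
        resume_state=resume_state,     # None on a first run, saved checkpoint after a crash
        on_state_change=save_state,    # checkpoint after each URL
    )
    config = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=False)

    async with AsyncWebCrawler() as crawler:
        await crawler.arun("https://books.toscrape.com", config=config)

    # The last checkpoint is also available directly from the strategy.
    print(strategy.export_state())

asyncio.run(main())
```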
tests/deep_crawling/test_deep_crawl_resume_integration.py
@@ -0,0 +1,162 @@
"""
|
||||
Integration Test: Deep Crawl Resume with Real URLs
|
||||
|
||||
Tests the crash recovery feature using books.toscrape.com - a site
|
||||
designed for scraping practice with a clear hierarchy:
|
||||
- Home page → Category pages → Book detail pages
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import asyncio
|
||||
import json
|
||||
from typing import Dict, Any, List
|
||||
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
|
||||
|
||||
|
||||
class TestBFSResumeIntegration:
|
||||
"""Integration tests for BFS resume with real crawling."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_real_crawl_state_capture_and_resume(self):
|
||||
"""
|
||||
Test crash recovery with real URLs from books.toscrape.com.
|
||||
|
||||
Flow:
|
||||
1. Start crawl with state callback
|
||||
2. Stop after N pages (simulated crash)
|
||||
3. Resume from saved state
|
||||
4. Verify no duplicate crawls
|
||||
"""
|
||||
# Phase 1: Initial crawl that "crashes" after 3 pages
|
||||
crash_after = 3
|
||||
captured_states: List[Dict[str, Any]] = []
|
||||
crawled_urls_phase1: List[str] = []
|
||||
|
||||
async def capture_state_until_crash(state: Dict[str, Any]):
|
||||
captured_states.append(state)
|
||||
crawled_urls_phase1.clear()
|
||||
crawled_urls_phase1.extend(state["visited"])
|
||||
|
||||
if state["pages_crawled"] >= crash_after:
|
||||
raise Exception("Simulated crash!")
|
||||
|
||||
strategy1 = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
max_pages=10,
|
||||
on_state_change=capture_state_until_crash,
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
deep_crawl_strategy=strategy1,
|
||||
stream=False,
|
||||
verbose=False,
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(verbose=False) as crawler:
|
||||
# First crawl - will crash after 3 pages
|
||||
with pytest.raises(Exception, match="Simulated crash"):
|
||||
await crawler.arun("https://books.toscrape.com", config=config)
|
||||
|
||||
# Verify we captured state before crash
|
||||
assert len(captured_states) > 0, "No states captured before crash"
|
||||
last_state = captured_states[-1]
|
||||
|
||||
print(f"\n=== Phase 1: Crashed after {last_state['pages_crawled']} pages ===")
|
||||
print(f"Visited URLs: {len(last_state['visited'])}")
|
||||
print(f"Pending URLs: {len(last_state['pending'])}")
|
||||
|
||||
# Verify state structure
|
||||
assert last_state["strategy_type"] == "bfs"
|
||||
assert last_state["pages_crawled"] >= crash_after
|
||||
assert len(last_state["visited"]) > 0
|
||||
assert "pending" in last_state
|
||||
assert "depths" in last_state
|
||||
|
||||
# Verify state is JSON serializable (important for Redis/DB storage)
|
||||
json_str = json.dumps(last_state)
|
||||
restored_state = json.loads(json_str)
|
||||
assert restored_state == last_state, "State not JSON round-trip safe"
|
||||
|
||||
# Phase 2: Resume from checkpoint
|
||||
crawled_urls_phase2: List[str] = []
|
||||
|
||||
async def track_resumed_crawl(state: Dict[str, Any]):
|
||||
# Track what's being crawled in phase 2
|
||||
new_visited = set(state["visited"]) - set(last_state["visited"])
|
||||
for url in new_visited:
|
||||
if url not in crawled_urls_phase2:
|
||||
crawled_urls_phase2.append(url)
|
||||
|
||||
strategy2 = BFSDeepCrawlStrategy(
|
||||
max_depth=2,
|
||||
max_pages=10,
|
||||
resume_state=restored_state,
|
||||
on_state_change=track_resumed_crawl,
|
||||
)
|
||||
|
||||
config2 = CrawlerRunConfig(
|
||||
deep_crawl_strategy=strategy2,
|
||||
stream=False,
|
||||
verbose=False,
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(verbose=False) as crawler:
|
||||
results = await crawler.arun("https://books.toscrape.com", config=config2)
|
||||
|
||||
print(f"\n=== Phase 2: Resumed crawl ===")
|
||||
print(f"New URLs crawled: {len(crawled_urls_phase2)}")
|
||||
print(f"Final pages_crawled: {strategy2._pages_crawled}")
|
||||
|
||||
# Verify no duplicates - URLs from phase 1 should not be re-crawled
|
||||
already_crawled = set(last_state["visited"]) - {item["url"] for item in last_state["pending"]}
|
||||
duplicates = set(crawled_urls_phase2) & already_crawled
|
||||
|
||||
assert len(duplicates) == 0, f"Duplicate crawls detected: {duplicates}"
|
||||
|
||||
# Verify we made progress (crawled some of the pending URLs)
|
||||
pending_urls = {item["url"] for item in last_state["pending"]}
|
||||
crawled_pending = set(crawled_urls_phase2) & pending_urls
|
||||
|
||||
print(f"Pending URLs crawled in phase 2: {len(crawled_pending)}")
|
||||
|
||||
# Final state should show more pages crawled than before crash
|
||||
final_state = strategy2.export_state()
|
||||
if final_state:
|
||||
assert final_state["pages_crawled"] >= last_state["pages_crawled"], \
|
||||
"Resume did not make progress"
|
||||
|
||||
print("\n=== Integration test PASSED ===")
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_state_export_method(self):
|
||||
"""Test that export_state() returns valid state during crawl."""
|
||||
states_from_callback: List[Dict] = []
|
||||
|
||||
async def capture(state):
|
||||
states_from_callback.append(state)
|
||||
|
||||
strategy = BFSDeepCrawlStrategy(
|
||||
max_depth=1,
|
||||
max_pages=3,
|
||||
on_state_change=capture,
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
deep_crawl_strategy=strategy,
|
||||
stream=False,
|
||||
verbose=False,
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(verbose=False) as crawler:
|
||||
await crawler.arun("https://books.toscrape.com", config=config)
|
||||
|
||||
# export_state should return the last captured state
|
||||
exported = strategy.export_state()
|
||||
|
||||
assert exported is not None, "export_state() returned None"
|
||||
assert exported == states_from_callback[-1], "export_state() doesn't match last callback"
|
||||
|
||||
print(f"\n=== export_state() test PASSED ===")
|
||||
print(f"Final state: {exported['pages_crawled']} pages, {len(exported['visited'])} visited")
@@ -7,9 +7,46 @@ adapted for the Docker API with real URLs

import requests
import json
import time
-from typing import Dict, Any
+from typing import Dict, Optional

-API_BASE_URL = "http://localhost:11234"
+API_BASE_URL = "http://localhost:11235"

# Global token storage
_auth_token: Optional[str] = None


def get_auth_token(email: str = "test@gmail.com") -> str:
    """
    Get a JWT token from the /token endpoint.
    The email domain must have valid MX records.
    """
    global _auth_token

    if _auth_token:
        return _auth_token

    print(f"🔐 Requesting JWT token for {email}...")
    response = requests.post(
        f"{API_BASE_URL}/token",
        json={"email": email}
    )

    if response.status_code == 200:
        data = response.json()
        _auth_token = data["access_token"]
        print(f"✅ Token obtained successfully")
        return _auth_token
    else:
        raise Exception(f"Failed to get token: {response.status_code} - {response.text}")


def get_auth_headers() -> Dict[str, str]:
    """Get headers with JWT Bearer token."""
    token = get_auth_token()
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    }


def test_all_hooks_demo():
|
||||
@@ -164,8 +201,8 @@ async def hook(page, context, html, **kwargs):
|
||||
|
||||
print("\nSending request with all 8 hooks...")
|
||||
start_time = time.time()
|
||||
|
||||
-response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+response = requests.post(f"{API_BASE_URL}/crawl", json=payload, headers=get_auth_headers())
|
||||
|
||||
elapsed_time = time.time() - start_time
|
||||
print(f"Request completed in {elapsed_time:.2f} seconds")
|
||||
@@ -278,7 +315,7 @@ async def hook(page, context, url, **kwargs):
|
||||
}
|
||||
|
||||
print("\nTesting authentication with httpbin endpoints...")
|
||||
-response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+response = requests.post(f"{API_BASE_URL}/crawl", json=payload, headers=get_auth_headers())
|
||||
|
||||
if response.status_code == 200:
|
||||
data = response.json()
|
||||
@@ -371,8 +408,8 @@ async def hook(page, context, **kwargs):
|
||||
|
||||
print("\nTesting performance optimization hooks...")
|
||||
start_time = time.time()
|
||||
|
||||
-response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+response = requests.post(f"{API_BASE_URL}/crawl", json=payload, headers=get_auth_headers())
|
||||
|
||||
elapsed_time = time.time() - start_time
|
||||
print(f"Request completed in {elapsed_time:.2f} seconds")
|
||||
@@ -462,7 +499,7 @@ async def hook(page, context, **kwargs):
|
||||
}
|
||||
|
||||
print("\nTesting content extraction hooks...")
|
||||
-response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+response = requests.post(f"{API_BASE_URL}/crawl", json=payload, headers=get_auth_headers())
|
||||
|
||||
if response.status_code == 200:
|
||||
data = response.json()
|
||||
@@ -485,7 +522,16 @@ def main():
|
||||
print("🔧 Crawl4AI Docker API - Comprehensive Hooks Testing")
|
||||
print("Based on docs/examples/hooks_example.py")
|
||||
print("=" * 70)
|
||||
|
||||
|
||||
# Get JWT token first (required when jwt_enabled=true)
|
||||
try:
|
||||
get_auth_token()
|
||||
print("=" * 70)
|
||||
except Exception as e:
|
||||
print(f"❌ Failed to authenticate: {e}")
|
||||
print("Make sure the server is running and jwt_enabled is configured correctly.")
|
||||
return
|
||||
|
||||
tests = [
|
||||
("All Hooks Demo", test_all_hooks_demo),
|
||||
("Authentication Flow", test_authentication_flow),
|
||||
|
||||
tests/proxy/test_sticky_sessions.py (new file, 569 lines)
@@ -0,0 +1,569 @@
|
||||
"""
|
||||
Comprehensive test suite for Sticky Proxy Sessions functionality.
|
||||
|
||||
Tests cover:
|
||||
1. Basic sticky session - same proxy for same session_id
|
||||
2. Different sessions get different proxies
|
||||
3. Session release
|
||||
4. TTL expiration
|
||||
5. Thread safety / concurrent access
|
||||
6. Integration tests with AsyncWebCrawler
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
import time
|
||||
import pytest
|
||||
from unittest.mock import patch
|
||||
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig
|
||||
from crawl4ai.async_configs import CrawlerRunConfig, ProxyConfig
|
||||
from crawl4ai.proxy_strategy import RoundRobinProxyStrategy
|
||||
from crawl4ai.cache_context import CacheMode
|
||||
|
||||
|
||||
class TestRoundRobinProxyStrategySession:
|
||||
"""Test suite for RoundRobinProxyStrategy session methods."""
|
||||
|
||||
def setup_method(self):
|
||||
"""Setup for each test method."""
|
||||
self.proxies = [
|
||||
ProxyConfig(server=f"http://proxy{i}.test:8080")
|
||||
for i in range(5)
|
||||
]
|
||||
|
||||
# ==================== BASIC STICKY SESSION TESTS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_sticky_session_same_proxy(self):
|
||||
"""Verify same proxy is returned for same session_id."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
# First call - acquires proxy
|
||||
proxy1 = await strategy.get_proxy_for_session("session-1")
|
||||
|
||||
# Second call - should return same proxy
|
||||
proxy2 = await strategy.get_proxy_for_session("session-1")
|
||||
|
||||
# Third call - should return same proxy
|
||||
proxy3 = await strategy.get_proxy_for_session("session-1")
|
||||
|
||||
assert proxy1 is not None
|
||||
assert proxy1.server == proxy2.server == proxy3.server
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_different_sessions_different_proxies(self):
|
||||
"""Verify different session_ids can get different proxies."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
proxy_a = await strategy.get_proxy_for_session("session-a")
|
||||
proxy_b = await strategy.get_proxy_for_session("session-b")
|
||||
proxy_c = await strategy.get_proxy_for_session("session-c")
|
||||
|
||||
# All should be different (round-robin)
|
||||
servers = {proxy_a.server, proxy_b.server, proxy_c.server}
|
||||
assert len(servers) == 3
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_sticky_session_with_regular_rotation(self):
|
||||
"""Verify sticky sessions don't interfere with regular rotation."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
# Acquire a sticky session
|
||||
session_proxy = await strategy.get_proxy_for_session("sticky-session")
|
||||
|
||||
# Regular rotation should continue independently
|
||||
regular_proxy1 = await strategy.get_next_proxy()
|
||||
regular_proxy2 = await strategy.get_next_proxy()
|
||||
|
||||
# Sticky session should still return same proxy
|
||||
session_proxy_again = await strategy.get_proxy_for_session("sticky-session")
|
||||
|
||||
assert session_proxy.server == session_proxy_again.server
|
||||
# Regular proxies should rotate
|
||||
assert regular_proxy1.server != regular_proxy2.server
|
||||
|
||||
# ==================== SESSION RELEASE TESTS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_session_release(self):
|
||||
"""Verify session can be released and reacquired."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
# Acquire session
|
||||
proxy1 = await strategy.get_proxy_for_session("session-1")
|
||||
assert strategy.get_session_proxy("session-1") is not None
|
||||
|
||||
# Release session
|
||||
await strategy.release_session("session-1")
|
||||
assert strategy.get_session_proxy("session-1") is None
|
||||
|
||||
# Reacquire - should get a new proxy (next in round-robin)
|
||||
proxy2 = await strategy.get_proxy_for_session("session-1")
|
||||
assert proxy2 is not None
|
||||
# After release, next call gets the next proxy in rotation
|
||||
# (not necessarily the same as before)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_release_nonexistent_session(self):
|
||||
"""Verify releasing non-existent session doesn't raise error."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
# Should not raise
|
||||
await strategy.release_session("nonexistent-session")
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_release_twice(self):
|
||||
"""Verify releasing session twice doesn't raise error."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
await strategy.get_proxy_for_session("session-1")
|
||||
await strategy.release_session("session-1")
|
||||
await strategy.release_session("session-1") # Should not raise
|
||||
|
||||
# ==================== GET SESSION PROXY TESTS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_get_session_proxy_existing(self):
|
||||
"""Verify get_session_proxy returns proxy for existing session."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
acquired = await strategy.get_proxy_for_session("session-1")
|
||||
retrieved = strategy.get_session_proxy("session-1")
|
||||
|
||||
assert retrieved is not None
|
||||
assert acquired.server == retrieved.server
|
||||
|
||||
def test_get_session_proxy_nonexistent(self):
|
||||
"""Verify get_session_proxy returns None for non-existent session."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
result = strategy.get_session_proxy("nonexistent-session")
|
||||
assert result is None
|
||||
|
||||
# ==================== TTL EXPIRATION TESTS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_session_ttl_not_expired(self):
|
||||
"""Verify session returns same proxy when TTL not expired."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
# Acquire with 10 second TTL
|
||||
proxy1 = await strategy.get_proxy_for_session("session-1", ttl=10)
|
||||
|
||||
# Immediately request again - should return same proxy
|
||||
proxy2 = await strategy.get_proxy_for_session("session-1", ttl=10)
|
||||
|
||||
assert proxy1.server == proxy2.server
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_session_ttl_expired(self):
|
||||
"""Verify new proxy acquired after TTL expires."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
# Acquire with 1 second TTL
|
||||
proxy1 = await strategy.get_proxy_for_session("session-1", ttl=1)
|
||||
|
||||
# Wait for TTL to expire
|
||||
await asyncio.sleep(1.1)
|
||||
|
||||
# Request again - should get new proxy due to expiration
|
||||
proxy2 = await strategy.get_proxy_for_session("session-1", ttl=1)
|
||||
|
||||
# May or may not be same server depending on round-robin state,
|
||||
# but session should have been recreated
|
||||
assert proxy2 is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_get_session_proxy_ttl_expired(self):
|
||||
"""Verify get_session_proxy returns None after TTL expires."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
await strategy.get_proxy_for_session("session-1", ttl=1)
|
||||
|
||||
# Wait for expiration
|
||||
await asyncio.sleep(1.1)
|
||||
|
||||
# Should return None for expired session
|
||||
result = strategy.get_session_proxy("session-1")
|
||||
assert result is None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_cleanup_expired_sessions(self):
|
||||
"""Verify cleanup_expired_sessions removes expired sessions."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
# Create sessions with short TTL
|
||||
await strategy.get_proxy_for_session("short-ttl-1", ttl=1)
|
||||
await strategy.get_proxy_for_session("short-ttl-2", ttl=1)
|
||||
# Create session without TTL (should not be cleaned up)
|
||||
await strategy.get_proxy_for_session("no-ttl")
|
||||
|
||||
# Wait for TTL to expire
|
||||
await asyncio.sleep(1.1)
|
||||
|
||||
# Cleanup
|
||||
removed = await strategy.cleanup_expired_sessions()
|
||||
|
||||
assert removed == 2
|
||||
assert strategy.get_session_proxy("short-ttl-1") is None
|
||||
assert strategy.get_session_proxy("short-ttl-2") is None
|
||||
assert strategy.get_session_proxy("no-ttl") is not None
|
||||
|
||||
# ==================== GET ACTIVE SESSIONS TESTS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_get_active_sessions(self):
|
||||
"""Verify get_active_sessions returns all active sessions."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
await strategy.get_proxy_for_session("session-a")
|
||||
await strategy.get_proxy_for_session("session-b")
|
||||
await strategy.get_proxy_for_session("session-c")
|
||||
|
||||
active = strategy.get_active_sessions()
|
||||
|
||||
assert len(active) == 3
|
||||
assert "session-a" in active
|
||||
assert "session-b" in active
|
||||
assert "session-c" in active
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_get_active_sessions_excludes_expired(self):
|
||||
"""Verify get_active_sessions excludes expired sessions."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
await strategy.get_proxy_for_session("short-ttl", ttl=1)
|
||||
await strategy.get_proxy_for_session("no-ttl")
|
||||
|
||||
# Before expiration
|
||||
active = strategy.get_active_sessions()
|
||||
assert len(active) == 2
|
||||
|
||||
# Wait for TTL to expire
|
||||
await asyncio.sleep(1.1)
|
||||
|
||||
# After expiration
|
||||
active = strategy.get_active_sessions()
|
||||
assert len(active) == 1
|
||||
assert "no-ttl" in active
|
||||
assert "short-ttl" not in active
|
||||
|
||||
# ==================== THREAD SAFETY TESTS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_concurrent_session_access(self):
|
||||
"""Verify thread-safe access to sessions."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
async def acquire_session(session_id: str):
|
||||
proxy = await strategy.get_proxy_for_session(session_id)
|
||||
await asyncio.sleep(0.01) # Simulate work
|
||||
return proxy.server
|
||||
|
||||
# Acquire same session from multiple coroutines
|
||||
results = await asyncio.gather(*[
|
||||
acquire_session("shared-session") for _ in range(10)
|
||||
])
|
||||
|
||||
# All should get same proxy
|
||||
assert len(set(results)) == 1
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_concurrent_different_sessions(self):
|
||||
"""Verify concurrent acquisition of different sessions works correctly."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
async def acquire_session(session_id: str):
|
||||
proxy = await strategy.get_proxy_for_session(session_id)
|
||||
await asyncio.sleep(0.01)
|
||||
return (session_id, proxy.server)
|
||||
|
||||
# Acquire different sessions concurrently
|
||||
results = await asyncio.gather(*[
|
||||
acquire_session(f"session-{i}") for i in range(5)
|
||||
])
|
||||
|
||||
# Each session should have a consistent proxy
|
||||
session_proxies = dict(results)
|
||||
assert len(session_proxies) == 5
|
||||
|
||||
# Verify each session still returns same proxy
|
||||
for session_id, expected_server in session_proxies.items():
|
||||
actual = await strategy.get_proxy_for_session(session_id)
|
||||
assert actual.server == expected_server
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_concurrent_session_acquire_and_release(self):
|
||||
"""Verify concurrent acquire and release operations work correctly."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
async def acquire_and_release(session_id: str):
|
||||
proxy = await strategy.get_proxy_for_session(session_id)
|
||||
await asyncio.sleep(0.01)
|
||||
await strategy.release_session(session_id)
|
||||
return proxy.server
|
||||
|
||||
# Run multiple acquire/release cycles concurrently
|
||||
await asyncio.gather(*[
|
||||
acquire_and_release(f"session-{i}") for i in range(10)
|
||||
])
|
||||
|
||||
# All sessions should be released
|
||||
active = strategy.get_active_sessions()
|
||||
assert len(active) == 0
|
||||
|
||||
# ==================== EMPTY PROXY POOL TESTS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_empty_proxy_pool_session(self):
|
||||
"""Verify behavior with empty proxy pool."""
|
||||
strategy = RoundRobinProxyStrategy() # No proxies
|
||||
|
||||
result = await strategy.get_proxy_for_session("session-1")
|
||||
assert result is None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_add_proxies_after_session(self):
|
||||
"""Verify adding proxies after session creation works."""
|
||||
strategy = RoundRobinProxyStrategy()
|
||||
|
||||
# No proxies initially
|
||||
result1 = await strategy.get_proxy_for_session("session-1")
|
||||
assert result1 is None
|
||||
|
||||
# Add proxies
|
||||
strategy.add_proxies(self.proxies)
|
||||
|
||||
# Now should work
|
||||
result2 = await strategy.get_proxy_for_session("session-2")
|
||||
assert result2 is not None
|
||||
|
||||
|
||||
class TestCrawlerRunConfigSession:
|
||||
"""Test CrawlerRunConfig with sticky session parameters."""
|
||||
|
||||
def test_config_has_session_fields(self):
|
||||
"""Verify CrawlerRunConfig has sticky session fields."""
|
||||
config = CrawlerRunConfig(
|
||||
proxy_session_id="test-session",
|
||||
proxy_session_ttl=300,
|
||||
proxy_session_auto_release=True
|
||||
)
|
||||
|
||||
assert config.proxy_session_id == "test-session"
|
||||
assert config.proxy_session_ttl == 300
|
||||
assert config.proxy_session_auto_release is True
|
||||
|
||||
def test_config_session_defaults(self):
|
||||
"""Verify default values for session fields."""
|
||||
config = CrawlerRunConfig()
|
||||
|
||||
assert config.proxy_session_id is None
|
||||
assert config.proxy_session_ttl is None
|
||||
assert config.proxy_session_auto_release is False
|
||||
|
||||
|
||||
class TestCrawlerStickySessionIntegration:
|
||||
"""Integration tests for AsyncWebCrawler with sticky sessions."""
|
||||
|
||||
def setup_method(self):
|
||||
"""Setup for each test method."""
|
||||
self.proxies = [
|
||||
ProxyConfig(server=f"http://proxy{i}.test:8080")
|
||||
for i in range(3)
|
||||
]
|
||||
self.test_url = "https://httpbin.org/ip"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_crawler_sticky_session_without_proxy(self):
|
||||
"""Test that crawler works when proxy_session_id set but no strategy."""
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
proxy_session_id="test-session",
|
||||
page_timeout=15000
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url=self.test_url, config=config)
|
||||
# Should work without errors (no proxy strategy means no proxy)
|
||||
assert result is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_crawler_sticky_session_basic(self):
|
||||
"""Test basic sticky session with crawler."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
proxy_rotation_strategy=strategy,
|
||||
proxy_session_id="integration-test",
|
||||
page_timeout=10000
|
||||
)
|
||||
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# First request
|
||||
try:
|
||||
result1 = await crawler.arun(url=self.test_url, config=config)
|
||||
except Exception:
|
||||
pass # Proxy connection may fail, but session should be tracked
|
||||
|
||||
# Verify session was created
|
||||
session_proxy = strategy.get_session_proxy("integration-test")
|
||||
assert session_proxy is not None
|
||||
|
||||
# Cleanup
|
||||
await strategy.release_session("integration-test")
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_crawler_rotating_vs_sticky(self):
|
||||
"""Compare rotating behavior vs sticky session behavior."""
|
||||
strategy = RoundRobinProxyStrategy(self.proxies)
|
||||
|
||||
# Config WITHOUT sticky session - should rotate
|
||||
rotating_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
proxy_rotation_strategy=strategy,
|
||||
page_timeout=5000
|
||||
)
|
||||
|
||||
# Config WITH sticky session - should use same proxy
|
||||
sticky_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
proxy_rotation_strategy=strategy,
|
||||
proxy_session_id="sticky-test",
|
||||
page_timeout=5000
|
||||
)
|
||||
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# Track proxy configs used
|
||||
rotating_proxies = []
|
||||
sticky_proxies = []
|
||||
|
||||
# Try rotating requests (may fail due to test proxies, but config should be set)
|
||||
for _ in range(3):
|
||||
try:
|
||||
await crawler.arun(url=self.test_url, config=rotating_config)
|
||||
except Exception:
|
||||
pass
|
||||
rotating_proxies.append(rotating_config.proxy_config.server if rotating_config.proxy_config else None)
|
||||
|
||||
# Try sticky requests
|
||||
for _ in range(3):
|
||||
try:
|
||||
await crawler.arun(url=self.test_url, config=sticky_config)
|
||||
except Exception:
|
||||
pass
|
||||
sticky_proxies.append(sticky_config.proxy_config.server if sticky_config.proxy_config else None)
|
||||
|
||||
# Rotating should have different proxies (or cycle through them)
|
||||
# Sticky should have same proxy for all requests
|
||||
if all(sticky_proxies):
|
||||
assert len(set(sticky_proxies)) == 1, "Sticky session should use same proxy"
|
||||
|
||||
await strategy.release_session("sticky-test")
|
||||
|
||||
|
||||
class TestStickySessionRealWorld:
|
||||
"""Real-world scenario tests for sticky sessions.
|
||||
|
||||
Note: These tests require actual proxy servers to verify IP consistency.
|
||||
They are marked to be skipped if no proxy is configured.
|
||||
"""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@pytest.mark.skipif(
|
||||
not os.environ.get('TEST_PROXY_1'),
|
||||
reason="Requires TEST_PROXY_1 environment variable"
|
||||
)
|
||||
async def test_verify_ip_consistency(self):
|
||||
"""Verify that sticky session actually uses same IP.
|
||||
|
||||
This test requires real proxies set in environment variables:
|
||||
TEST_PROXY_1=ip:port:user:pass
|
||||
TEST_PROXY_2=ip:port:user:pass
|
||||
"""
|
||||
import re
|
||||
|
||||
# Load proxies from environment
|
||||
proxy_strs = [
|
||||
os.environ.get('TEST_PROXY_1', ''),
|
||||
os.environ.get('TEST_PROXY_2', '')
|
||||
]
|
||||
proxies = [ProxyConfig.from_string(p) for p in proxy_strs if p]
|
||||
|
||||
if len(proxies) < 2:
|
||||
pytest.skip("Need at least 2 proxies for this test")
|
||||
|
||||
strategy = RoundRobinProxyStrategy(proxies)
|
||||
|
||||
# Config WITH sticky session
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
proxy_rotation_strategy=strategy,
|
||||
proxy_session_id="ip-verify-session",
|
||||
page_timeout=30000
|
||||
)
|
||||
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
ips = []
|
||||
|
||||
for i in range(3):
|
||||
result = await crawler.arun(
|
||||
url="https://httpbin.org/ip",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result and result.success and result.html:
|
||||
# Extract IP from response
|
||||
ip_match = re.search(r'"origin":\s*"([^"]+)"', result.html)
|
||||
if ip_match:
|
||||
ips.append(ip_match.group(1))
|
||||
|
||||
await strategy.release_session("ip-verify-session")
|
||||
|
||||
# All IPs should be same for sticky session
|
||||
if len(ips) >= 2:
|
||||
assert len(set(ips)) == 1, f"Expected same IP, got: {ips}"
|
||||
|
||||
|
||||
# ==================== STANDALONE TEST FUNCTIONS ====================
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_sticky_session_simple():
|
||||
"""Simple test for sticky session functionality."""
|
||||
proxies = [
|
||||
ProxyConfig(server=f"http://proxy{i}.test:8080")
|
||||
for i in range(3)
|
||||
]
|
||||
strategy = RoundRobinProxyStrategy(proxies)
|
||||
|
||||
# Same session should return same proxy
|
||||
p1 = await strategy.get_proxy_for_session("test")
|
||||
p2 = await strategy.get_proxy_for_session("test")
|
||||
p3 = await strategy.get_proxy_for_session("test")
|
||||
|
||||
assert p1.server == p2.server == p3.server
|
||||
print(f"Sticky session works! All requests use: {p1.server}")
|
||||
|
||||
# Cleanup
|
||||
await strategy.release_session("test")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
print("Running Sticky Session tests...")
|
||||
print("=" * 50)
|
||||
|
||||
asyncio.run(test_sticky_session_simple())
|
||||
|
||||
print("\n" + "=" * 50)
|
||||
print("To run the full pytest suite, use: pytest " + __file__)
|
||||
print("=" * 50)
|
||||
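Taken together, the suite above pins down the sticky-session surface: get_proxy_for_session() / release_session() on RoundRobinProxyStrategy, plus the proxy_session_id, proxy_session_ttl, and proxy_session_auto_release fields on CrawlerRunConfig. A minimal usage sketch, assuming placeholder proxy servers:

import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig
from crawl4ai.async_configs import CrawlerRunConfig, ProxyConfig
from crawl4ai.proxy_strategy import RoundRobinProxyStrategy


async def main():
    # Placeholder proxies; swap in real servers.
    strategy = RoundRobinProxyStrategy([
        ProxyConfig(server="http://proxy1.example:8080"),
        ProxyConfig(server="http://proxy2.example:8080"),
    ])

    config = CrawlerRunConfig(
        proxy_rotation_strategy=strategy,
        proxy_session_id="user-42",  # every request with this id reuses one proxy
        proxy_session_ttl=300,       # optional: session expires after 5 minutes
    )

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        for url in ["https://example.com/login", "https://example.com/account"]:
            await crawler.arun(url, config=config)

    # Release the binding when the logical session ends
    # (proxy_session_auto_release=True is the config-side alternative).
    await strategy.release_session("user-42")


asyncio.run(main())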
tests/test_prefetch_integration.py (new file, 236 lines)
@@ -0,0 +1,236 @@
|
||||
"""Integration tests for prefetch mode with the crawler."""
|
||||
|
||||
import pytest
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
|
||||
|
||||
# Use crawl4ai docs as test domain
|
||||
TEST_DOMAIN = "https://docs.crawl4ai.com"
|
||||
|
||||
|
||||
class TestPrefetchModeIntegration:
|
||||
"""Integration tests for prefetch mode."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_returns_html_and_links(self):
|
||||
"""Test that prefetch mode returns HTML and links only."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(prefetch=True)
|
||||
result = await crawler.arun(TEST_DOMAIN, config=config)
|
||||
|
||||
# Should have HTML
|
||||
assert result.html is not None
|
||||
assert len(result.html) > 0
|
||||
assert "<html" in result.html.lower() or "<!doctype" in result.html.lower()
|
||||
|
||||
# Should have links
|
||||
assert result.links is not None
|
||||
assert "internal" in result.links
|
||||
assert "external" in result.links
|
||||
|
||||
# Should NOT have processed content
|
||||
assert result.markdown is None or (
|
||||
hasattr(result.markdown, 'raw_markdown') and
|
||||
result.markdown.raw_markdown is None
|
||||
)
|
||||
assert result.cleaned_html is None
|
||||
assert result.extracted_content is None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_preserves_metadata(self):
|
||||
"""Test that prefetch mode preserves essential metadata."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(prefetch=True)
|
||||
result = await crawler.arun(TEST_DOMAIN, config=config)
|
||||
|
||||
# Should have success flag
|
||||
assert result.success is True
|
||||
|
||||
# Should have URL
|
||||
assert result.url is not None
|
||||
|
||||
# Status code should be present
|
||||
assert result.status_code is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_with_deep_crawl(self):
|
||||
"""Test prefetch mode with deep crawl strategy."""
|
||||
from crawl4ai import BFSDeepCrawlStrategy
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
prefetch=True,
|
||||
deep_crawl_strategy=BFSDeepCrawlStrategy(
|
||||
max_depth=1,
|
||||
max_pages=3
|
||||
)
|
||||
)
|
||||
|
||||
result_container = await crawler.arun(TEST_DOMAIN, config=config)
|
||||
|
||||
# Handle both list and iterator results
|
||||
if hasattr(result_container, '__aiter__'):
|
||||
results = [r async for r in result_container]
|
||||
else:
|
||||
results = list(result_container) if hasattr(result_container, '__iter__') else [result_container]
|
||||
|
||||
# Each result should have HTML and links
|
||||
for result in results:
|
||||
assert result.html is not None
|
||||
assert result.links is not None
|
||||
|
||||
# Should have crawled at least one page
|
||||
assert len(results) >= 1
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_then_process_with_raw(self):
|
||||
"""Test the full two-phase workflow: prefetch then process."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Phase 1: Prefetch
|
||||
prefetch_config = CrawlerRunConfig(prefetch=True)
|
||||
prefetch_result = await crawler.arun(TEST_DOMAIN, config=prefetch_config)
|
||||
|
||||
stored_html = prefetch_result.html
|
||||
|
||||
assert stored_html is not None
|
||||
assert len(stored_html) > 0
|
||||
|
||||
# Phase 2: Process with raw: URL
|
||||
process_config = CrawlerRunConfig(
|
||||
# No prefetch - full processing
|
||||
base_url=TEST_DOMAIN # Provide base URL for link resolution
|
||||
)
|
||||
processed_result = await crawler.arun(
|
||||
f"raw:{stored_html}",
|
||||
config=process_config
|
||||
)
|
||||
|
||||
# Should now have full processing
|
||||
assert processed_result.html is not None
|
||||
assert processed_result.success is True
|
||||
# Note: cleaned_html and markdown depend on the content
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_links_structure(self):
|
||||
"""Test that links have the expected structure."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(prefetch=True)
|
||||
result = await crawler.arun(TEST_DOMAIN, config=config)
|
||||
|
||||
assert result.links is not None
|
||||
|
||||
# Check internal links structure
|
||||
if result.links["internal"]:
|
||||
link = result.links["internal"][0]
|
||||
assert "href" in link
|
||||
assert "text" in link
|
||||
assert link["href"].startswith("http")
|
||||
|
||||
# Check external links structure (if any)
|
||||
if result.links["external"]:
|
||||
link = result.links["external"][0]
|
||||
assert "href" in link
|
||||
assert "text" in link
|
||||
assert link["href"].startswith("http")
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_config_clone(self):
|
||||
"""Test that config.clone() preserves prefetch setting."""
|
||||
config = CrawlerRunConfig(prefetch=True)
|
||||
cloned = config.clone()
|
||||
|
||||
assert cloned.prefetch == True
|
||||
|
||||
# Clone with override
|
||||
cloned_false = config.clone(prefetch=False)
|
||||
assert cloned_false.prefetch == False
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_to_dict(self):
|
||||
"""Test that to_dict() includes prefetch."""
|
||||
config = CrawlerRunConfig(prefetch=True)
|
||||
config_dict = config.to_dict()
|
||||
|
||||
assert "prefetch" in config_dict
|
||||
assert config_dict["prefetch"] == True
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_default_false(self):
|
||||
"""Test that prefetch defaults to False."""
|
||||
config = CrawlerRunConfig()
|
||||
assert config.prefetch == False
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_explicit_false(self):
|
||||
"""Test explicit prefetch=False works like default."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(prefetch=False)
|
||||
result = await crawler.arun(TEST_DOMAIN, config=config)
|
||||
|
||||
# Should have full processing
|
||||
assert result.html is not None
|
||||
# cleaned_html should be populated in normal mode
|
||||
assert result.cleaned_html is not None
|
||||
|
||||
|
||||
class TestPrefetchPerformance:
|
||||
"""Performance-related tests for prefetch mode."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_returns_quickly(self):
|
||||
"""Test that prefetch mode returns results quickly."""
|
||||
import time
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Prefetch mode
|
||||
start = time.time()
|
||||
prefetch_config = CrawlerRunConfig(prefetch=True)
|
||||
await crawler.arun(TEST_DOMAIN, config=prefetch_config)
|
||||
prefetch_time = time.time() - start
|
||||
|
||||
# Full mode
|
||||
start = time.time()
|
||||
full_config = CrawlerRunConfig()
|
||||
await crawler.arun(TEST_DOMAIN, config=full_config)
|
||||
full_time = time.time() - start
|
||||
|
||||
# Log times for debugging
|
||||
print(f"\nPrefetch: {prefetch_time:.3f}s, Full: {full_time:.3f}s")
|
||||
|
||||
# Prefetch should not be significantly slower
|
||||
# (may be same or slightly faster depending on content)
|
||||
# This is a soft check - mostly for logging
|
||||
|
||||
|
||||
class TestPrefetchWithRawHTML:
|
||||
"""Test prefetch mode with raw HTML input."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_with_raw_html(self):
|
||||
"""Test prefetch mode works with raw: URL scheme."""
|
||||
sample_html = """
|
||||
<html>
|
||||
<head><title>Test Page</title></head>
|
||||
<body>
|
||||
<h1>Hello World</h1>
|
||||
<a href="/link1">Link 1</a>
|
||||
<a href="/link2">Link 2</a>
|
||||
<a href="https://external.com/page">External</a>
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
prefetch=True,
|
||||
base_url="https://example.com"
|
||||
)
|
||||
result = await crawler.arun(f"raw:{sample_html}", config=config)
|
||||
|
||||
assert result.success is True
|
||||
assert result.html is not None
|
||||
assert result.links is not None
|
||||
|
||||
# Should have extracted links
|
||||
assert len(result.links["internal"]) >= 2
|
||||
assert len(result.links["external"]) >= 1
|
||||
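The test_prefetch_then_process_with_raw case above is the intended two-phase cloud workflow: grab HTML and links cheaply up front, persist them, then run full processing later from the raw: scheme with base_url supplied for link resolution. A condensed sketch, where the in-memory dict stands in for whatever store a deployment actually uses:

from typing import Dict, List

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

html_store: Dict[str, str] = {}  # stand-in for a real cache or database


async def prefetch_then_process(url: str):
    async with AsyncWebCrawler() as crawler:
        # Phase 1: HTML + links only, no markdown/extraction work.
        pre = await crawler.arun(url, config=CrawlerRunConfig(prefetch=True))
        html_store[url] = pre.html
        frontier: List[str] = [link["href"] for link in pre.links["internal"]]

        # Phase 2 (possibly much later): full processing from the stored HTML.
        processed = await crawler.arun(
            f"raw:{html_store[url]}",
            config=CrawlerRunConfig(base_url=url),  # resolve relative links against the origin
        )
        return processed, frontier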
tests/test_prefetch_mode.py (new file, 275 lines)
@@ -0,0 +1,275 @@
|
||||
"""Unit tests for the quick_extract_links function used in prefetch mode."""
|
||||
|
||||
import pytest
|
||||
from crawl4ai.utils import quick_extract_links
|
||||
|
||||
|
||||
class TestQuickExtractLinks:
|
||||
"""Unit tests for the quick_extract_links function."""
|
||||
|
||||
def test_basic_internal_links(self):
|
||||
"""Test extraction of internal links."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="/page1">Page 1</a>
|
||||
<a href="/page2">Page 2</a>
|
||||
<a href="https://example.com/page3">Page 3</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 3
|
||||
assert result["internal"][0]["href"] == "https://example.com/page1"
|
||||
assert result["internal"][0]["text"] == "Page 1"
|
||||
|
||||
def test_external_links(self):
|
||||
"""Test extraction and classification of external links."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="https://other.com/page">External</a>
|
||||
<a href="/internal">Internal</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 1
|
||||
assert len(result["external"]) == 1
|
||||
assert result["external"][0]["href"] == "https://other.com/page"
|
||||
|
||||
def test_ignores_javascript_and_mailto(self):
|
||||
"""Test that javascript: and mailto: links are ignored."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="javascript:void(0)">Click</a>
|
||||
<a href="mailto:test@example.com">Email</a>
|
||||
<a href="tel:+1234567890">Call</a>
|
||||
<a href="/valid">Valid</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 1
|
||||
assert result["internal"][0]["href"] == "https://example.com/valid"
|
||||
|
||||
def test_ignores_anchor_only_links(self):
|
||||
"""Test that anchor-only links (#section) are ignored."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="#section1">Section 1</a>
|
||||
<a href="#section2">Section 2</a>
|
||||
<a href="/page#section">Page with anchor</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
# Only the page link should be included, anchor-only links are skipped
|
||||
assert len(result["internal"]) == 1
|
||||
assert "/page" in result["internal"][0]["href"]
|
||||
|
||||
def test_deduplication(self):
|
||||
"""Test that duplicate URLs are deduplicated."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="/page">Link 1</a>
|
||||
<a href="/page">Link 2</a>
|
||||
<a href="/page">Link 3</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 1
|
||||
|
||||
def test_handles_malformed_html(self):
|
||||
"""Test graceful handling of malformed HTML."""
|
||||
html = "not valid html at all <><><"
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
# Should not raise, should return empty
|
||||
assert result["internal"] == []
|
||||
assert result["external"] == []
|
||||
|
||||
def test_empty_html(self):
|
||||
"""Test handling of empty HTML."""
|
||||
result = quick_extract_links("", "https://example.com")
|
||||
assert result == {"internal": [], "external": []}
|
||||
|
||||
def test_relative_url_resolution(self):
|
||||
"""Test that relative URLs are resolved correctly."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="page1.html">Relative</a>
|
||||
<a href="./page2.html">Dot Relative</a>
|
||||
<a href="../page3.html">Parent Relative</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com/docs/")
|
||||
|
||||
assert len(result["internal"]) >= 1
|
||||
# All should be internal and properly resolved
|
||||
for link in result["internal"]:
|
||||
assert link["href"].startswith("https://example.com")
|
||||
|
||||
def test_text_truncation(self):
|
||||
"""Test that long link text is truncated to 200 chars."""
|
||||
long_text = "A" * 300
|
||||
html = f'''
|
||||
<html>
|
||||
<body>
|
||||
<a href="/page">{long_text}</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 1
|
||||
assert len(result["internal"][0]["text"]) == 200
|
||||
|
||||
def test_empty_href_ignored(self):
|
||||
"""Test that empty href attributes are ignored."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="">Empty</a>
|
||||
<a href=" ">Whitespace</a>
|
||||
<a href="/valid">Valid</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 1
|
||||
assert result["internal"][0]["href"] == "https://example.com/valid"
|
||||
|
||||
def test_mixed_internal_external(self):
|
||||
"""Test correct classification of mixed internal and external links."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="/internal1">Internal 1</a>
|
||||
<a href="https://example.com/internal2">Internal 2</a>
|
||||
<a href="https://google.com">Google</a>
|
||||
<a href="https://github.com/repo">GitHub</a>
|
||||
<a href="/internal3">Internal 3</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 3
|
||||
assert len(result["external"]) == 2
|
||||
|
||||
def test_subdomain_handling(self):
|
||||
"""Test that subdomains are handled correctly."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="https://docs.example.com/page">Docs subdomain</a>
|
||||
<a href="https://api.example.com/v1">API subdomain</a>
|
||||
<a href="https://example.com/main">Main domain</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
# All should be internal (same base domain)
|
||||
total_links = len(result["internal"]) + len(result["external"])
|
||||
assert total_links == 3
|
||||
|
||||
|
||||
class TestQuickExtractLinksEdgeCases:
|
||||
"""Edge case tests for quick_extract_links."""
|
||||
|
||||
def test_no_links_in_page(self):
|
||||
"""Test page with no links."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<h1>No Links Here</h1>
|
||||
<p>Just some text content.</p>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert result["internal"] == []
|
||||
assert result["external"] == []
|
||||
|
||||
def test_links_in_nested_elements(self):
|
||||
"""Test links nested in various elements."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<nav>
|
||||
<ul>
|
||||
<li><a href="/home">Home</a></li>
|
||||
<li><a href="/about">About</a></li>
|
||||
</ul>
|
||||
</nav>
|
||||
<div class="content">
|
||||
<p>Check out <a href="/products">our products</a>.</p>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 3
|
||||
|
||||
def test_link_with_nested_elements(self):
|
||||
"""Test links containing nested elements."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="/page"><span>Nested</span> <strong>Text</strong></a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
assert len(result["internal"]) == 1
|
||||
assert "Nested" in result["internal"][0]["text"]
|
||||
assert "Text" in result["internal"][0]["text"]
|
||||
|
||||
def test_protocol_relative_urls(self):
|
||||
"""Test handling of protocol-relative URLs (//example.com)."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href="//cdn.example.com/asset">CDN Link</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
# Should be resolved with https:
|
||||
total = len(result["internal"]) + len(result["external"])
|
||||
assert total >= 1
|
||||
|
||||
def test_whitespace_in_href(self):
|
||||
"""Test handling of whitespace around href values."""
|
||||
html = '''
|
||||
<html>
|
||||
<body>
|
||||
<a href=" /page1 ">Padded</a>
|
||||
<a href="
|
||||
/page2
|
||||
">Multiline</a>
|
||||
</body>
|
||||
</html>
|
||||
'''
|
||||
result = quick_extract_links(html, "https://example.com")
|
||||
|
||||
# Both should be extracted and normalized
|
||||
assert len(result["internal"]) >= 1
|
||||
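For reference, the function under test takes raw HTML plus a base URL and returns resolved, de-duplicated links split into internal and external buckets, so it can also be called directly outside the crawler:

from crawl4ai.utils import quick_extract_links

html = '<a href="/docs">Docs</a> <a href="https://other.example/page">Elsewhere</a>'
links = quick_extract_links(html, "https://example.com")

# links["internal"] -> entries like {"href": "https://example.com/docs", "text": "Docs"}
# links["external"] -> entries like {"href": "https://other.example/page", "text": "Elsewhere"}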
tests/test_prefetch_regression.py (new file, 232 lines)
@@ -0,0 +1,232 @@
|
||||
"""Regression tests to ensure prefetch mode doesn't break existing functionality."""
|
||||
|
||||
import pytest
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
|
||||
TEST_URL = "https://docs.crawl4ai.com"
|
||||
|
||||
|
||||
class TestNoRegressions:
|
||||
"""Ensure prefetch mode doesn't break existing functionality."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_default_mode_unchanged(self):
|
||||
"""Test that default mode (prefetch=False) works exactly as before."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig() # Default config
|
||||
result = await crawler.arun(TEST_URL, config=config)
|
||||
|
||||
# All standard fields should be populated
|
||||
assert result.html is not None
|
||||
assert result.cleaned_html is not None
|
||||
assert result.links is not None
|
||||
assert result.success is True
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_explicit_prefetch_false(self):
|
||||
"""Test explicit prefetch=False works like default."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(prefetch=False)
|
||||
result = await crawler.arun(TEST_URL, config=config)
|
||||
|
||||
assert result.cleaned_html is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_config_clone_preserves_prefetch(self):
|
||||
"""Test that config.clone() preserves prefetch setting."""
|
||||
config = CrawlerRunConfig(prefetch=True)
|
||||
cloned = config.clone()
|
||||
|
||||
assert cloned.prefetch == True
|
||||
|
||||
# Clone with override
|
||||
cloned_false = config.clone(prefetch=False)
|
||||
assert cloned_false.prefetch == False
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_config_to_dict_includes_prefetch(self):
|
||||
"""Test that to_dict() includes prefetch."""
|
||||
config_true = CrawlerRunConfig(prefetch=True)
|
||||
config_false = CrawlerRunConfig(prefetch=False)
|
||||
|
||||
assert config_true.to_dict()["prefetch"] == True
|
||||
assert config_false.to_dict()["prefetch"] == False
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_existing_extraction_still_works(self):
|
||||
"""Test that extraction strategies still work in normal mode."""
|
||||
from crawl4ai import JsonCssExtractionStrategy
|
||||
|
||||
schema = {
|
||||
"name": "Links",
|
||||
"baseSelector": "a",
|
||||
"fields": [
|
||||
{"name": "href", "selector": "", "type": "attribute", "attribute": "href"},
|
||||
{"name": "text", "selector": "", "type": "text"}
|
||||
]
|
||||
}
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
extraction_strategy=JsonCssExtractionStrategy(schema=schema)
|
||||
)
|
||||
result = await crawler.arun(TEST_URL, config=config)
|
||||
|
||||
assert result.extracted_content is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_existing_deep_crawl_still_works(self):
|
||||
"""Test that deep crawl without prefetch still does full processing."""
|
||||
from crawl4ai import BFSDeepCrawlStrategy
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
deep_crawl_strategy=BFSDeepCrawlStrategy(
|
||||
max_depth=1,
|
||||
max_pages=2
|
||||
)
|
||||
# No prefetch - should do full processing
|
||||
)
|
||||
|
||||
result_container = await crawler.arun(TEST_URL, config=config)
|
||||
|
||||
# Handle both list and iterator results
|
||||
if hasattr(result_container, '__aiter__'):
|
||||
results = [r async for r in result_container]
|
||||
else:
|
||||
results = list(result_container) if hasattr(result_container, '__iter__') else [result_container]
|
||||
|
||||
# Each result should have full processing
|
||||
for result in results:
|
||||
assert result.cleaned_html is not None
|
||||
|
||||
assert len(results) >= 1
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_raw_url_scheme_still_works(self):
|
||||
"""Test that raw: URL scheme works for processing stored HTML."""
|
||||
sample_html = """
|
||||
<html>
|
||||
<head><title>Test Page</title></head>
|
||||
<body>
|
||||
<h1>Hello World</h1>
|
||||
<p>This is a test paragraph.</p>
|
||||
<a href="/link1">Link 1</a>
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(f"raw:{sample_html}", config=config)
|
||||
|
||||
assert result.success is True
|
||||
assert result.html is not None
|
||||
assert "Hello World" in result.html
|
||||
assert result.cleaned_html is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_screenshot_still_works(self):
|
||||
"""Test that screenshot option still works in normal mode."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(screenshot=True)
|
||||
result = await crawler.arun(TEST_URL, config=config)
|
||||
|
||||
assert result.success is True
|
||||
# Screenshot data should be present
|
||||
assert result.screenshot is not None or result.screenshot_data is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_js_execution_still_works(self):
|
||||
"""Test that JavaScript execution still works in normal mode."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
js_code="document.querySelector('h1')?.textContent"
|
||||
)
|
||||
result = await crawler.arun(TEST_URL, config=config)
|
||||
|
||||
assert result.success is True
|
||||
assert result.html is not None
|
||||
|
||||
|
||||
class TestPrefetchDoesNotAffectOtherModes:
|
||||
"""Test that prefetch doesn't interfere with other configurations."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_prefetch_with_other_options_ignored(self):
|
||||
"""Test that other options are properly ignored in prefetch mode."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
prefetch=True,
|
||||
# These should be ignored in prefetch mode
|
||||
screenshot=True,
|
||||
pdf=True,
|
||||
only_text=True,
|
||||
word_count_threshold=100
|
||||
)
|
||||
result = await crawler.arun(TEST_URL, config=config)
|
||||
|
||||
# Should still return HTML and links
|
||||
assert result.html is not None
|
||||
assert result.links is not None
|
||||
|
||||
# But should NOT have processed content
|
||||
assert result.cleaned_html is None
|
||||
assert result.extracted_content is None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_stream_mode_still_works(self):
|
||||
"""Test that stream mode still works normally."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(stream=True)
|
||||
result = await crawler.arun(TEST_URL, config=config)
|
||||
|
||||
assert result.success is True
|
||||
assert result.html is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_cache_mode_still_works(self):
|
||||
"""Test that cache mode still works normally."""
|
||||
from crawl4ai import CacheMode
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# First request - bypass cache
|
||||
config1 = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
|
||||
result1 = await crawler.arun(TEST_URL, config=config1)
|
||||
assert result1.success is True
|
||||
|
||||
# Second request - should work
|
||||
config2 = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
|
||||
result2 = await crawler.arun(TEST_URL, config=config2)
|
||||
assert result2.success is True
|
||||
|
||||
|
||||
class TestBackwardsCompatibility:
|
||||
"""Test backwards compatibility with existing code patterns."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_config_without_prefetch_works(self):
|
||||
"""Test that configs created without prefetch parameter work."""
|
||||
# Simulating old code that doesn't know about prefetch
|
||||
config = CrawlerRunConfig(
|
||||
word_count_threshold=50,
|
||||
css_selector="body"
|
||||
)
|
||||
|
||||
# Should default to prefetch=False
|
||||
assert config.prefetch == False
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(TEST_URL, config=config)
|
||||
assert result.success is True
|
||||
assert result.cleaned_html is not None
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_from_kwargs_without_prefetch(self):
|
||||
"""Test CrawlerRunConfig.from_kwargs works without prefetch."""
|
||||
config = CrawlerRunConfig.from_kwargs({
|
||||
"word_count_threshold": 50,
|
||||
"verbose": False
|
||||
})
|
||||
|
||||
assert config.prefetch == False
|
||||
tests/test_raw_html_browser.py (new file, 172 lines)
@@ -0,0 +1,172 @@
|
||||
"""
|
||||
Tests for raw:/file:// URL browser pipeline support.
|
||||
|
||||
Tests the new feature that allows js_code, wait_for, and other browser operations
|
||||
to work with raw: and file:// URLs by routing them through _crawl_web() with
|
||||
set_content() instead of goto().
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_raw_html_fast_path():
|
||||
"""Test that raw: without browser params returns HTML directly (fast path)."""
|
||||
html = "<html><body><div id='test'>Original Content</div></body></html>"
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig() # No browser params
|
||||
result = await crawler.arun(f"raw:{html}", config=config)
|
||||
|
||||
assert result.success
|
||||
assert "Original Content" in result.html
|
||||
# Fast path should not modify the HTML
|
||||
assert result.html == html
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_js_code_on_raw_html():
|
||||
"""Test that js_code executes on raw: HTML and modifies the DOM."""
|
||||
html = "<html><body><div id='test'>Original</div></body></html>"
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
js_code="document.getElementById('test').innerText = 'Modified by JS'"
|
||||
)
|
||||
result = await crawler.arun(f"raw:{html}", config=config)
|
||||
|
||||
assert result.success
|
||||
assert "Modified by JS" in result.html
|
||||
assert "Original" not in result.html or "Modified by JS" in result.html
|
||||
|
||||
|
||||
@pytest.mark.asyncio
async def test_js_code_adds_element_to_raw_html():
    """Test that js_code can add new elements to raw: HTML."""
    html = "<html><body><div id='container'></div></body></html>"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code='document.getElementById("container").innerHTML = "<span id=\'injected\'>Custom Content</span>"'
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "injected" in result.html
        assert "Custom Content" in result.html


@pytest.mark.asyncio
async def test_screenshot_on_raw_html():
    """Test that screenshots work on raw: HTML."""
    html = "<html><body><h1 style='color:red;font-size:48px;'>Screenshot Test</h1></body></html>"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(screenshot=True)
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert result.screenshot is not None
        assert len(result.screenshot) > 100  # Should have substantial screenshot data


@pytest.mark.asyncio
async def test_process_in_browser_flag():
    """Test that process_in_browser=True forces the browser path even without other params."""
    html = "<html><body><div>Test</div></body></html>"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(process_in_browser=True)
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        # Browser path normalizes HTML, so it may be slightly different
        assert "Test" in result.html


@pytest.mark.asyncio
async def test_raw_prefix_variations():
    """Test both raw: and raw:// prefix formats."""
    html = "<html><body>Content</body></html>"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code='document.body.innerHTML += "<div id=\'added\'>Added</div>"'
        )

        # Test raw: prefix
        result1 = await crawler.arun(f"raw:{html}", config=config)
        assert result1.success
        assert "Added" in result1.html

        # Test raw:// prefix
        result2 = await crawler.arun(f"raw://{html}", config=config)
        assert result2.success
        assert "Added" in result2.html


@pytest.mark.asyncio
async def test_wait_for_on_raw_html():
    """Test that wait_for works with raw: HTML after js_code modifies the DOM."""
    html = "<html><body><div id='container'></div></body></html>"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code='''
                setTimeout(() => {
                    document.getElementById('container').innerHTML = '<div id="delayed">Delayed Content</div>';
                }, 100);
            ''',
            wait_for="#delayed",
            wait_for_timeout=5000
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "Delayed Content" in result.html


@pytest.mark.asyncio
async def test_multiple_js_code_scripts():
    """Test that multiple js_code scripts execute in order."""
    html = "<html><body><div id='counter'>0</div></body></html>"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code=[
                "document.getElementById('counter').innerText = '1'",
                "document.getElementById('counter').innerText = parseInt(document.getElementById('counter').innerText) + 1",
                "document.getElementById('counter').innerText = parseInt(document.getElementById('counter').innerText) + 1",
            ]
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert ">3<" in result.html  # Counter should be 3 after all scripts run

if __name__ == "__main__":
    # Run a quick manual test
    async def quick_test():
        html = "<html><body><div id='test'>Original</div></body></html>"

        async with AsyncWebCrawler(verbose=True) as crawler:
            # Test 1: Fast path
            print("\n=== Test 1: Fast path (no browser params) ===")
            result1 = await crawler.arun(f"raw:{html}")
            print(f"Success: {result1.success}")
            print(f"HTML contains 'Original': {'Original' in result1.html}")

            # Test 2: js_code modifies DOM
            print("\n=== Test 2: js_code modifies DOM ===")
            config = CrawlerRunConfig(
                js_code="document.getElementById('test').innerText = 'Modified by JS'"
            )
            result2 = await crawler.arun(f"raw:{html}", config=config)
            print(f"Success: {result2.success}")
            print(f"HTML contains 'Modified by JS': {'Modified by JS' in result2.html}")
            print(f"HTML snippet: {result2.html[:500]}...")

    asyncio.run(quick_test())
tests/test_raw_html_edge_cases.py (new file, 563 lines)
@@ -0,0 +1,563 @@
"""
|
||||
BRUTAL edge case tests for raw:/file:// URL browser pipeline.
|
||||
|
||||
These tests try to break the system with tricky inputs, edge cases,
|
||||
and compatibility checks to ensure we didn't break existing functionality.
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import asyncio
|
||||
import tempfile
|
||||
import os
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# EDGE CASE: Hash characters in HTML (previously broke urlparse - Issue #283)
|
||||
# ============================================================================
|
||||
|
||||
@pytest.mark.asyncio
async def test_raw_html_with_hash_in_css():
    """Test that # in CSS colors doesn't break HTML parsing (regression for #283)."""
    html = """
    <html>
    <head>
        <style>
            body { background-color: #ff5733; color: #333333; }
            .highlight { border: 1px solid #000; }
        </style>
    </head>
    <body>
        <div class="highlight" style="color: #ffffff;">Content with hash colors</div>
    </body>
    </html>
    """

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(js_code="document.body.innerHTML += '<div id=\"added\">Added</div>'")
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "#ff5733" in result.html or "ff5733" in result.html  # Color should be preserved
        assert "Added" in result.html  # JS executed
        assert "Content with hash colors" in result.html  # Original content preserved


@pytest.mark.asyncio
async def test_raw_html_with_fragment_links():
    """Test that HTML with # fragment links doesn't break."""
    html = """
    <html><body>
        <a href="#section1">Go to section 1</a>
        <a href="#section2">Go to section 2</a>
        <div id="section1">Section 1</div>
        <div id="section2">Section 2</div>
    </body></html>
    """

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(js_code="document.getElementById('section1').innerText = 'Modified Section 1'")
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "Modified Section 1" in result.html
        assert "#section2" in result.html  # Fragment link preserved


# ============================================================================
# EDGE CASE: Special characters and unicode
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_with_unicode():
    """Test raw HTML with various unicode characters."""
    html = """
    <html><body>
        <div id="unicode">日本語 中文 한국어 العربية 🎉 💻 🚀</div>
        <div id="special">& < > " '</div>
    </body></html>
    """

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(js_code="document.getElementById('unicode').innerText += ' ✅ Modified'")
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "✅ Modified" in result.html or "Modified" in result.html
        # Check unicode is preserved
        assert "日本語" in result.html or "&#" in result.html  # Either preserved or encoded


@pytest.mark.asyncio
async def test_raw_html_with_script_tags():
    """Test that script tags already in the raw HTML don't interfere with js_code."""
    html = """
    <html><body>
        <div id="counter">0</div>
        <script>
            // This script runs on page load
            document.getElementById('counter').innerText = '10';
        </script>
    </body></html>
    """

    async with AsyncWebCrawler() as crawler:
        # Our js_code runs AFTER the page scripts
        config = CrawlerRunConfig(
            js_code="document.getElementById('counter').innerText = parseInt(document.getElementById('counter').innerText) + 5"
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        # The embedded script sets it to 10, then our js_code adds 5
        assert ">15<" in result.html or "15" in result.html


# ============================================================================
# EDGE CASE: Empty and malformed HTML
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_empty():
    """Test empty raw HTML."""
    html = ""

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(js_code="document.body.innerHTML = '<div>Added to empty</div>'")
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "Added to empty" in result.html


@pytest.mark.asyncio
async def test_raw_html_minimal():
    """Test minimal HTML (just text, no tags)."""
    html = "Just plain text, no HTML tags"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(js_code="document.body.innerHTML += '<div id=\"injected\">Injected</div>'")
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        # Browser should wrap it in proper HTML
        assert "Injected" in result.html


@pytest.mark.asyncio
async def test_raw_html_malformed():
    """Test malformed HTML with unclosed tags."""
    html = "<html><body><div><span>Unclosed tags<div>More content"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(js_code="document.body.innerHTML += '<div id=\"valid\">Valid Added</div>'")
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "Valid Added" in result.html
        # Browser should have fixed the malformed HTML


# ============================================================================
# EDGE CASE: Very large HTML
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_large():
    """Test large raw HTML (100KB+)."""
    # Generate 100KB of HTML
    items = "".join([f'<div class="item" id="item-{i}">Item {i} content here with some text</div>\n' for i in range(2000)])
    html = f"<html><body>{items}</body></html>"

    assert len(html) > 100000  # Verify it's actually large

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code="document.getElementById('item-999').innerText = 'MODIFIED ITEM 999'"
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "MODIFIED ITEM 999" in result.html
        assert "item-1999" in result.html  # Last item should still exist


# ============================================================================
# EDGE CASE: JavaScript errors and timeouts
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_js_error_doesnt_crash():
    """Test that JavaScript errors in js_code don't crash the crawl."""
    html = "<html><body><div id='test'>Original</div></body></html>"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code=[
                "nonExistentFunction();",  # This will throw an error
                "document.getElementById('test').innerText = 'Still works'"  # This should still run
            ]
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        # Crawl should succeed even with JS errors
        assert result.success


@pytest.mark.asyncio
async def test_raw_html_wait_for_timeout():
    """Test that wait_for times out gracefully when the element never appears."""
    html = "<html><body><div id='test'>Original</div></body></html>"

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            wait_for="#never-exists",
            wait_for_timeout=1000  # 1 second timeout
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        # Should time out but still return the HTML we have.
        # The behavior might be success=False or success=True with partial content;
        # either way, it shouldn't hang or crash.
        assert result is not None


# ============================================================================
# COMPATIBILITY: Normal HTTP URLs still work
# ============================================================================

@pytest.mark.asyncio
async def test_http_urls_still_work():
    """Ensure we didn't break normal HTTP URL crawling."""
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")

        assert result.success
        assert "Example Domain" in result.html


@pytest.mark.asyncio
async def test_http_with_js_code_still_works():
    """Ensure HTTP URLs with js_code still work."""
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code="document.body.innerHTML += '<div id=\"injected\">Injected via JS</div>'"
        )
        result = await crawler.arun("https://example.com", config=config)

        assert result.success
        assert "Injected via JS" in result.html


# ============================================================================
# COMPATIBILITY: File URLs
# ============================================================================

@pytest.mark.asyncio
async def test_file_url_with_js_code():
    """Test file:// URLs with js_code execution."""
    # Create a temp file
    with tempfile.NamedTemporaryFile(mode='w', suffix='.html', delete=False) as f:
        f.write("<html><body><div id='file-content'>File Content</div></body></html>")
        temp_path = f.name

    try:
        async with AsyncWebCrawler() as crawler:
            config = CrawlerRunConfig(
                js_code="document.getElementById('file-content').innerText = 'Modified File Content'"
            )
            result = await crawler.arun(f"file://{temp_path}", config=config)

            assert result.success
            assert "Modified File Content" in result.html
    finally:
        os.unlink(temp_path)


@pytest.mark.asyncio
async def test_file_url_fast_path():
    """Test file:// fast path (no browser params)."""
    with tempfile.NamedTemporaryFile(mode='w', suffix='.html', delete=False) as f:
        f.write("<html><body>Fast path file content</body></html>")
        temp_path = f.name

    try:
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(f"file://{temp_path}")

            assert result.success
            assert "Fast path file content" in result.html
    finally:
        os.unlink(temp_path)


# ============================================================================
# COMPATIBILITY: Extraction strategies with raw HTML
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_with_css_extraction():
    """Test CSS extraction on raw HTML after js_code modifies it."""
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

    html = """
    <html><body>
        <div class="products">
            <div class="product"><span class="name">Original Product</span></div>
        </div>
    </body></html>
    """

    schema = {
        "name": "Products",
        "baseSelector": ".product",
        "fields": [
            {"name": "name", "selector": ".name", "type": "text"}
        ]
    }

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code="""
                document.querySelector('.products').innerHTML +=
                    '<div class="product"><span class="name">JS Added Product</span></div>';
            """,
            extraction_strategy=JsonCssExtractionStrategy(schema)
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        # Check that extraction found both products
        import json
        extracted = json.loads(result.extracted_content)
        names = [p.get('name', '') for p in extracted]
        assert any("JS Added Product" in name for name in names)


# ============================================================================
# EDGE CASE: Concurrent raw: requests
# ============================================================================

@pytest.mark.asyncio
async def test_concurrent_raw_requests():
    """Test that multiple concurrent raw: requests don't interfere."""
    htmls = [
        f"<html><body><div id='test'>Request {i}</div></body></html>"
        for i in range(5)
    ]

    async with AsyncWebCrawler() as crawler:
        configs = [
            CrawlerRunConfig(
                js_code=f"document.getElementById('test').innerText += ' Modified {i}'"
            )
            for i in range(5)
        ]

        # Run concurrently
        tasks = [
            crawler.arun(f"raw:{html}", config=config)
            for html, config in zip(htmls, configs)
        ]
        results = await asyncio.gather(*tasks)

        for i, result in enumerate(results):
            assert result.success
            assert f"Request {i}" in result.html
            assert f"Modified {i}" in result.html


# ============================================================================
# EDGE CASE: raw: with base_url for link resolution
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_with_base_url():
    """Test that base_url is used for link resolution in markdown."""
    html = """
    <html><body>
        <a href="/page1">Page 1</a>
        <a href="/page2">Page 2</a>
        <img src="/images/logo.png" alt="Logo">
    </body></html>
    """

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            base_url="https://example.com",
            process_in_browser=True  # Force browser to test base_url handling
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        # Check markdown has absolute URLs
        if result.markdown:
            # Links should be absolute
            md = result.markdown.raw_markdown if hasattr(result.markdown, 'raw_markdown') else str(result.markdown)
            assert "example.com" in md or "/page1" in md


# ============================================================================
# EDGE CASE: raw: with screenshot of complex page
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_screenshot_complex_page():
    """Test screenshot of complex raw HTML with CSS and JS modifications."""
    html = """
    <html>
    <head>
        <style>
            body { font-family: Arial; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 40px; }
            .card { background: white; padding: 20px; border-radius: 10px; box-shadow: 0 4px 6px rgba(0,0,0,0.1); }
            h1 { color: #333; }
        </style>
    </head>
    <body>
        <div class="card">
            <h1 id="title">Original Title</h1>
            <p>This is a test card with styling.</p>
        </div>
    </body>
    </html>
    """

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code="document.getElementById('title').innerText = 'Modified Title'",
            screenshot=True
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert result.screenshot is not None
        assert len(result.screenshot) > 1000  # Should be substantial
        assert "Modified Title" in result.html


# ============================================================================
# EDGE CASE: JavaScript that tries to navigate away
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_js_navigation_blocked():
    """Test that JS trying to navigate doesn't break the crawl."""
    html = """
    <html><body>
        <div id="content">Original Content</div>
        <script>
            // Try to navigate away (should be blocked or handled)
            // window.location.href = 'https://example.com';
        </script>
    </body></html>
    """

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            # Try to navigate via js_code
            js_code=[
                "document.getElementById('content').innerText = 'Before navigation attempt'",
                # Actual navigation attempt commented out - it would cause issues
                # "window.location.href = 'https://example.com'",
            ]
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "Before navigation attempt" in result.html


# ============================================================================
# EDGE CASE: Raw HTML with iframes
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_with_iframes():
    """Test raw HTML containing iframes."""
    html = """
    <html><body>
        <div id="main">Main content</div>
        <iframe id="frame1" srcdoc="<html><body><div id='iframe-content'>Iframe Content</div></body></html>"></iframe>
    </body></html>
    """

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code="document.getElementById('main').innerText = 'Modified main'",
            process_iframes=True
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "Modified main" in result.html


# ============================================================================
# TRICKY: Protocol inside raw content
# ============================================================================

@pytest.mark.asyncio
async def test_raw_html_with_urls_inside():
    """Test raw: with http:// URLs inside the content."""
    html = """
    <html><body>
        <a href="http://example.com">Example</a>
        <a href="https://google.com">Google</a>
        <img src="https://placekitten.com/200/300" alt="Cat">
        <div id="test">Test content with URL: https://test.com</div>
    </body></html>
    """

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            js_code="document.getElementById('test').innerText += ' - Modified'"
        )
        result = await crawler.arun(f"raw:{html}", config=config)

        assert result.success
        assert "Modified" in result.html
        assert "http://example.com" in result.html or "example.com" in result.html


# ============================================================================
# TRICKY: Double raw: prefix
# ============================================================================

@pytest.mark.asyncio
async def test_double_raw_prefix():
    """Test what happens with a double raw: prefix (edge case)."""
    html = "<html><body>Content</body></html>"

    async with AsyncWebCrawler() as crawler:
        # raw:raw:<html>... - the second raw: becomes part of the content
        result = await crawler.arun(f"raw:raw:{html}")

        # Should either handle gracefully or return "raw:<html>..." as content
        assert result is not None

if __name__ == "__main__":
    import sys

    async def run_tests():
        # Run a few key tests manually
        tests = [
            ("Hash in CSS", test_raw_html_with_hash_in_css),
            ("Unicode", test_raw_html_with_unicode),
            ("Large HTML", test_raw_html_large),
            ("HTTP still works", test_http_urls_still_work),
            ("Concurrent requests", test_concurrent_raw_requests),
            ("Complex screenshot", test_raw_html_screenshot_complex_page),
        ]

        for name, test_fn in tests:
            print(f"\n=== Running: {name} ===")
            try:
                await test_fn()
                print(f"✅ {name} PASSED")
            except Exception as e:
                print(f"❌ {name} FAILED: {e}")
                import traceback
                traceback.print_exc()

    asyncio.run(run_tests())
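Both suites double as runnable scripts via their __main__ blocks, but they are ordinary pytest modules. A minimal sketch of driving the new edge-case file through pytest's Python API, assuming pytest and pytest-asyncio are installed in the environment:

# Minimal sketch, not part of this PR: run the new suite via pytest's Python API.
# Assumes pytest and pytest-asyncio are installed.
import sys

import pytest

if __name__ == "__main__":
    # "-v" prints one line per test; pytest.main() returns a standard exit code for CI.
    sys.exit(pytest.main(["-v", "tests/test_raw_html_edge_cases.py"]))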