# CDP Browser Concurrency Fixes and Improvements

## Overview

This document describes the changes made to fix concurrency issues with CDP (Chrome DevTools Protocol) browsers when using `arun_many`, and to improve overall browser management.

## Problems Addressed
- **Race Conditions in Page Creation**: When using managed CDP browsers with concurrent `arun_many` calls, the code attempted to reuse existing pages from `context.pages`, leading to race conditions and "Target page/context closed" errors.
- **Proxy Configuration Issues**: Proxy credentials were incorrectly embedded in the `--proxy-server` URL, which doesn't work properly with CDP browsers.
- **Insufficient Startup Checks**: Browser process startup checks were minimal and didn't catch early failures effectively.
- **Unclear Logging**: Logging messages lacked structure and context, making debugging difficult.
- **Duplicate Browser Arguments**: Browser launch arguments could contain duplicates despite deduplication attempts.
## Solutions Implemented

### 1. Always Create New Pages in Managed Browser Mode

**File:** `crawl4ai/browser_manager.py` (lines 1106-1113)

**Change:** Modified the `get_page()` method to always create new pages instead of attempting to reuse existing ones for managed browsers without `storage_state`.
**Before:**

```python
context = self.default_context
pages = context.pages
page = next((p for p in pages if p.url == crawlerRunConfig.url), None)
if not page:
    if pages:
        page = pages[0]
    else:
        # Create new page only if none exist
        async with self._page_lock:
            page = await context.new_page()
```
**After:**

```python
context = self.default_context
# Always create new pages instead of reusing existing ones
# This prevents race conditions in concurrent scenarios (arun_many with CDP)
# Serialize page creation to avoid 'Target page/context closed' errors
async with self._page_lock:
    page = await context.new_page()
    await self._apply_stealth_to_page(page)
```
**Benefits:**

- Eliminates race conditions when multiple tasks call `arun_many` concurrently
- Each request gets a fresh, independent page
- The page lock serializes creation to prevent TOCTOU (time-of-check to time-of-use) issues
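The new-page-per-request pattern can be illustrated in isolation. Below is a minimal sketch using an `asyncio.Lock`; `FakeContext` is a hypothetical stand-in for a Playwright `BrowserContext`, not part of crawl4ai:

```python
import asyncio

class FakeContext:
    """Hypothetical stand-in for a Playwright BrowserContext."""
    def __init__(self):
        self.pages = []

    async def new_page(self):
        await asyncio.sleep(0)  # simulate the CDP round-trip
        page = f"page-{len(self.pages)}"
        self.pages.append(page)
        return page

async def get_page(context, lock):
    # Serialize creation: every concurrent caller gets its own fresh
    # page instead of racing over entries in context.pages.
    async with lock:
        return await context.new_page()

async def main():
    context, lock = FakeContext(), asyncio.Lock()
    return await asyncio.gather(*(get_page(context, lock) for _ in range(5)))

pages = asyncio.run(main())
print(pages)  # five distinct pages, one per caller
```

Because creation is serialized under the lock, no two callers ever observe the context in the same intermediate state, which is exactly what the TOCTOU fix relies on.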
### 2. Fixed Proxy Flag Formatting

**File:** `crawl4ai/browser_manager.py` (lines 103-109)

**Change:** Removed credentials from the proxy URL, as they should be handled via separate authentication mechanisms in CDP.
**Before:**

```python
elif config.proxy_config:
    creds = ""
    if config.proxy_config.username and config.proxy_config.password:
        creds = f"{config.proxy_config.username}:{config.proxy_config.password}@"
    flags.append(f"--proxy-server={creds}{config.proxy_config.server}")
```
**After:**

```python
elif config.proxy_config:
    # Note: For CDP/managed browsers, proxy credentials should be handled
    # via authentication, not in the URL. Only pass the server address.
    flags.append(f"--proxy-server={config.proxy_config.server}")
```
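One way to pair the server-only flag with out-of-band credentials is sketched below. The helper names are illustrative (not crawl4ai code); the credentials dict mirrors the shape Playwright accepts for its context-level `proxy` option, which answers the proxy's authentication challenge:

```python
def build_proxy_flag(server: str) -> str:
    # Only the address goes on the command line; CDP browsers do not
    # honor inline user:pass@ credentials in --proxy-server.
    return f"--proxy-server={server}"

def build_proxy_settings(server, username=None, password=None):
    # Credentials travel separately from the launch flag, e.g. via
    # Playwright's context-level `proxy` option.
    settings = {"server": server}
    if username and password:
        settings["username"] = username
        settings["password"] = password
    return settings

flag = build_proxy_flag("http://proxy.example.com:8080")
settings = build_proxy_settings("http://proxy.example.com:8080", "user", "secret")
print(flag)      # → --proxy-server=http://proxy.example.com:8080
print(settings)
```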
### 3. Enhanced Startup Checks

**File:** `crawl4ai/browser_manager.py` (lines 298-336)

**Changes:**

- Multiple check intervals (0.1s, 0.2s, 0.3s) to catch early failures
- Capture and log stdout/stderr on failure (limited to 200 chars)
- Raise `RuntimeError` with detailed diagnostics on startup failure
- Log the process PID on successful startup in verbose mode
**Benefits:**

- Catches browser crashes during startup
- Provides detailed diagnostic information for debugging
- Fails fast with clear error messages
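The staged startup check can be sketched as follows. `check_startup` is a hypothetical simplification of the approach described above (poll at increasing intervals, surface captured output on early exit), demonstrated with short-lived Python subprocesses rather than a browser:

```python
import subprocess
import sys
import time

def check_startup(process, intervals=(0.1, 0.2, 0.3)):
    # Poll the freshly launched process at increasing intervals so an
    # immediate crash is caught before we try to connect over CDP.
    for interval in intervals:
        time.sleep(interval)
        if process.poll() is not None:  # process exited during startup
            out, err = process.communicate()
            snippet = (err or out or b"")[:200]  # cap diagnostics at 200 chars
            raise RuntimeError(
                f"Browser exited early (code {process.returncode}): {snippet!r}"
            )
    return process.pid  # still alive after all checks

# A process that dies immediately is caught with a clear error...
crashing = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.exit(3)"],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE,
)
caught = False
try:
    check_startup(crashing)
except RuntimeError as exc:
    caught = True
    print("caught:", exc)

# ...while a process that survives the check window passes.
surviving = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(2)"],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE,
)
pid = check_startup(surviving)
surviving.kill()
print("started, pid", pid)
```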
### 4. Improved Logging

**File:** `crawl4ai/browser_manager.py` (lines 218-291)

**Changes:**

- Structured logging with proper parameter substitution
- Log browser type, port, and headless status at launch
- Format and log the full command with proper shell escaping
- Better error messages with context
- Consistent use of the logger with null checks

**Example:**

```python
if self.logger and self.browser_config.verbose:
    self.logger.debug(
        "Launching browser: {browser_type} | Port: {port} | Headless: {headless}",
        tag="BROWSER",
        params={
            "browser_type": self.browser_type,
            "port": self.debugging_port,
            "headless": self.headless
        }
    )
```
### 5. Deduplicate Browser Launch Arguments

**File:** `crawl4ai/browser_manager.py` (lines 424-425)

**Change:** Added explicit deduplication after merging all flags.

```python
# merge common launch flags
flags.extend(self.build_browser_flags(self.browser_config))
# Deduplicate flags - use dict.fromkeys to preserve order while removing duplicates
flags = list(dict.fromkeys(flags))
```
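The effect of the `dict.fromkeys` idiom — first occurrence wins and insertion order is preserved, unlike `set()`, which would scramble the argument order — shown on a sample flag list (the flag values here are illustrative):

```python
flags = [
    "--headless=new",
    "--disable-gpu",
    "--remote-debugging-port=9222",
    "--disable-gpu",    # duplicate contributed by a merged flag source
    "--headless=new",   # duplicate
]
# dict keys are unique and (since Python 3.7) preserve insertion order,
# so this drops repeats without reordering the surviving arguments.
deduped = list(dict.fromkeys(flags))
print(deduped)
# → ['--headless=new', '--disable-gpu', '--remote-debugging-port=9222']
```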
### 6. Import Refactoring

**Files:** `crawl4ai/browser_manager.py`, `crawl4ai/browser_profiler.py`, `tests/browser/test_cdp_concurrency.py`

**Changes:** Organized all imports according to PEP 8:

- Standard library imports (alphabetized)
- Third-party imports (alphabetized)
- Local imports (alphabetized)

**Benefits:**

- Improved code readability
- Easier to spot missing or unused imports
- Consistent style across the codebase
## Testing

### New Test Suite

**File:** `tests/browser/test_cdp_concurrency.py`

Comprehensive test suite with 8 tests covering:

- **Basic Concurrent `arun_many`**: Validates that multiple URLs can be crawled concurrently
- **Sequential `arun_many` Calls**: Ensures multiple sequential batches work correctly
- **Stress Test**: Multiple concurrent `arun_many` calls to test page lock effectiveness
- **Page Isolation**: Verifies pages are truly independent
- **Different Configurations**: Tests with varying viewport sizes and configs
- **Error Handling**: Ensures errors in one request don't affect others
- **Large Batches**: Scalability test with 10+ URLs
- **Smoke Test Script**: Standalone script for quick validation
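The stress-test idea — several `arun_many`-style batches running concurrently against one lock-guarded browser — can be modeled without a real browser. Everything here (`FakeBrowser`, the simplified `arun_many`) is a hypothetical stand-in for what the actual suite exercises:

```python
import asyncio

class FakeBrowser:
    """Hypothetical stand-in for the managed browser."""
    def __init__(self):
        self.lock = asyncio.Lock()
        self.created = 0

    async def new_page(self):
        async with self.lock:       # mirrors the page lock from fix 1
            await asyncio.sleep(0)
            self.created += 1
            return self.created

async def arun_many(browser, urls):
    # Each URL gets its own fresh page; nothing is reused across tasks.
    return await asyncio.gather(*(browser.new_page() for _ in urls))

async def main():
    browser = FakeBrowser()
    urls = [f"https://example.com/{i}" for i in range(4)]
    batches = [arun_many(browser, urls) for _ in range(3)]  # 3 concurrent calls
    results = await asyncio.gather(*batches)
    return browser.created, results

created, results = asyncio.run(main())
print(created)  # 12: three batches of four URLs, no page shared
```

If pages were reused across batches instead, `created` would be smaller than the URL count and batches would interfere — the failure mode the stress test guards against.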
### Running Tests

With pytest (if available):

```shell
cd /path/to/crawl4ai
pytest tests/browser/test_cdp_concurrency.py -v
```

Standalone smoke test:

```shell
cd /path/to/crawl4ai
python3 tests/browser/smoke_test_cdp.py
```
## Migration Guide

### For Users

No breaking changes. Existing code will continue to work, but with better reliability in concurrent scenarios.

### For Contributors

When working with managed browsers:

- Always use the page lock when creating pages in shared contexts
- Prefer creating new pages over reusing existing ones for concurrent operations
- Use structured logging with parameter substitution
- Follow PEP 8 import organization
## Performance Impact

- **Positive**: Eliminates race conditions and crashes in concurrent scenarios
- **Neutral**: Page creation overhead is negligible compared to page navigation
- **Consideration**: More pages may be created, but they are properly closed after use

## Backward Compatibility

All changes are backward compatible. Session-based page reuse still works as before when a `session_id` is provided.
## Related Issues

- Fixes race conditions in concurrent `arun_many` calls with CDP browsers
- Addresses "Target page/context closed" errors
- Improves browser startup reliability

## Future Improvements

Consider:

- Configurable page pooling with proper lifecycle management
- More granular locks for different contexts
- Metrics for page creation/reuse patterns
- Connection pooling for CDP connections