# CDP Browser Concurrency Fixes and Improvements

## Overview

This document describes the changes made to fix concurrency issues with CDP (Chrome DevTools Protocol) browsers when using `arun_many` and improve overall browser management.
## Problems Addressed

1. **Race Conditions in Page Creation**: When using managed CDP browsers with concurrent `arun_many` calls, the code attempted to reuse existing pages from `context.pages`, leading to race conditions and "Target page/context closed" errors.

2. **Proxy Configuration Issues**: Proxy credentials were incorrectly embedded in the `--proxy-server` URL, which doesn't work properly with CDP browsers.

3. **Insufficient Startup Checks**: Browser process startup checks were minimal and didn't catch early failures effectively.

4. **Unclear Logging**: Logging messages lacked structure and context, making debugging difficult.

5. **Duplicate Browser Arguments**: Browser launch arguments could contain duplicates despite deduplication attempts.
## Solutions Implemented

### 1. Always Create New Pages in Managed Browser Mode

**File**: `crawl4ai/browser_manager.py` (lines 1106-1113)

**Change**: Modified `get_page()` method to always create new pages instead of attempting to reuse existing ones for managed browsers without `storage_state`.

**Before**:

```python
context = self.default_context
pages = context.pages
page = next((p for p in pages if p.url == crawlerRunConfig.url), None)
if not page:
    if pages:
        page = pages[0]
    else:
        # Create new page only if none exist
        async with self._page_lock:
            page = await context.new_page()
```
**After**:
```python
context = self.default_context
# Always create new pages instead of reusing existing ones
# This prevents race conditions in concurrent scenarios (arun_many with CDP)
# Serialize page creation to avoid 'Target page/context closed' errors
async with self._page_lock:
    page = await context.new_page()
    await self._apply_stealth_to_page(page)
```
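For context, this is the concurrent usage pattern the change targets. A minimal sketch, assuming an externally launched browser is listening on a local CDP port and that `BrowserConfig` exposes a `cdp_url` option (endpoint and URLs are placeholders):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig


async def main():
    # Placeholder CDP endpoint of an externally launched browser
    browser_config = BrowserConfig(cdp_url="http://localhost:9222")
    urls = ["https://example.com", "https://example.org", "https://example.net"]

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Each URL now gets its own freshly created page, guarded by the page lock
        results = await crawler.arun_many(urls, config=CrawlerRunConfig())
        for result in results:
            print(result.url, result.success)


asyncio.run(main())
```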
**Benefits**:

- Eliminates race conditions when multiple tasks call `arun_many` concurrently
- Each request gets a fresh, independent page
- Page lock serializes creation to prevent TOCTOU (time-of-check to time-of-use) issues
### 2. Fixed Proxy Flag Formatting

**File**: `crawl4ai/browser_manager.py` (lines 103-109)

**Change**: Removed credentials from the proxy URL; for CDP browsers they should be supplied through a separate authentication mechanism rather than embedded in the `--proxy-server` value.

**Before**:

```python
elif config.proxy_config:
    creds = ""
    if config.proxy_config.username and config.proxy_config.password:
        creds = f"{config.proxy_config.username}:{config.proxy_config.password}@"
    flags.append(f"--proxy-server={creds}{config.proxy_config.server}")
```

**After**:

```python
elif config.proxy_config:
    # Note: For CDP/managed browsers, proxy credentials should be handled
    # via authentication, not in the URL. Only pass the server address.
    flags.append(f"--proxy-server={config.proxy_config.server}")
```
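When Playwright itself launches the browser (rather than attaching to an externally started CDP instance), one way to supply credentials separately is Playwright's `proxy` launch option. A minimal sketch with placeholder server and credentials, not part of this change:

```python
from playwright.async_api import async_playwright


async def launch_with_authenticated_proxy():
    async with async_playwright() as p:
        # Credentials go through Playwright's proxy option, not the --proxy-server flag
        browser = await p.chromium.launch(
            proxy={
                "server": "http://proxy.example.com:8080",
                "username": "proxy_user",
                "password": "proxy_pass",
            }
        )
        return browser
```

For a browser launched outside of Playwright and attached over CDP, proxy authentication has to be handled by whatever starts the browser or by answering the proxy authentication challenge at request time.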
### 3. Enhanced Startup Checks

**File**: `crawl4ai/browser_manager.py` (lines 298-336)

**Changes**:

- Multiple check intervals (0.1s, 0.2s, 0.3s) to catch early failures
- Capture and log stdout/stderr on failure (limited to 200 chars)
- Raise `RuntimeError` with detailed diagnostics on startup failure
- Log process PID on successful startup in verbose mode

**Benefits**:

- Catches browser crashes during startup
- Provides detailed diagnostic information for debugging
- Fails fast with clear error messages
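
A simplified sketch of the startup-check pattern described above (illustrative only; the command, timings, and messages stand in for the real implementation):

```python
import asyncio
import subprocess


async def start_browser_checked(cmd: list[str], verbose: bool = False) -> subprocess.Popen:
    # Launch the browser process and watch for early exits
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    # Multiple short check intervals catch crashes right after startup
    for delay in (0.1, 0.2, 0.3):
        await asyncio.sleep(delay)
        if process.poll() is not None:
            stdout, stderr = process.communicate()
            raise RuntimeError(
                f"Browser exited during startup (code {process.returncode}). "
                f"stdout: {stdout[:200]!r} stderr: {stderr[:200]!r}"
            )

    if verbose:
        print(f"Browser started (PID {process.pid})")
    return process
```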
### 4. Improved Logging

**File**: `crawl4ai/browser_manager.py` (lines 218-291)

**Changes**:

- Structured logging with proper parameter substitution
- Log browser type, port, and headless status at launch
- Format and log full command with proper shell escaping
- Better error messages with context
- Consistent use of logger with null checks

**Example**:

```python
if self.logger and self.browser_config.verbose:
    self.logger.debug(
        "Launching browser: {browser_type} | Port: {port} | Headless: {headless}",
        tag="BROWSER",
        params={
            "browser_type": self.browser_type,
            "port": self.debugging_port,
            "headless": self.headless
        }
    )
```
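A small illustrative example of shell-escaping a launch command with the standard library's `shlex.join` (the actual implementation may format the command differently):

```python
import shlex

cmd = ["chromium", "--remote-debugging-port=9222", "--user-data-dir=/tmp/profile dir"]
# shlex.join quotes arguments that contain spaces or shell metacharacters
print(shlex.join(cmd))
# chromium --remote-debugging-port=9222 '--user-data-dir=/tmp/profile dir'
```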
### 5. Deduplicate Browser Launch Arguments

**File**: `crawl4ai/browser_manager.py` (lines 424-425)

**Change**: Added explicit deduplication after merging all flags.

```python
# merge common launch flags
flags.extend(self.build_browser_flags(self.browser_config))
# Deduplicate flags - use dict.fromkeys to preserve order while removing duplicates
flags = list(dict.fromkeys(flags))
```
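For illustration, `dict.fromkeys` keeps the first occurrence of each flag and preserves insertion order:

```python
flags = ["--headless=new", "--no-sandbox", "--headless=new", "--disable-gpu"]
print(list(dict.fromkeys(flags)))
# ['--headless=new', '--no-sandbox', '--disable-gpu']
```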
### 6. Import Refactoring

**Files**: `crawl4ai/browser_manager.py`, `crawl4ai/browser_profiler.py`, `tests/browser/test_cdp_concurrency.py`

**Changes**: Organized all imports according to PEP 8:

1. Standard library imports (alphabetized)
2. Third-party imports (alphabetized)
3. Local imports (alphabetized)
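
For example, an import block in this ordering looks roughly like the following (module names are illustrative, not the exact imports in those files):

```python
# Standard library (alphabetized)
import asyncio
import os
import subprocess

# Third-party (alphabetized)
from playwright.async_api import async_playwright

# Local (alphabetized)
from .async_configs import BrowserConfig
from .async_logger import AsyncLogger
```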
**Benefits**:

- Improved code readability
- Easier to spot missing or unused imports
- Consistent style across the codebase
## Testing

### New Test Suite

**File**: `tests/browser/test_cdp_concurrency.py`

Comprehensive test suite with 8 tests covering:

1. **Basic Concurrent arun_many**: Validates multiple URLs can be crawled concurrently
2. **Sequential arun_many Calls**: Ensures multiple sequential batches work correctly
3. **Stress Test**: Multiple concurrent `arun_many` calls to test page lock effectiveness
4. **Page Isolation**: Verifies pages are truly independent
5. **Different Configurations**: Tests with varying viewport sizes and configs
6. **Error Handling**: Ensures errors in one request don't affect others
7. **Large Batches**: Scalability test with 10+ URLs
8. **Smoke Test Script**: Standalone script for quick validation
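
A simplified sketch of what the basic concurrency test looks like (assuming `pytest-asyncio`; URLs, counts, and assertions are illustrative, not the exact test code):

```python
import pytest

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig


@pytest.mark.asyncio
async def test_basic_concurrent_arun_many():
    urls = [f"https://example.com/?page={i}" for i in range(5)]
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        # arun_many crawls the URLs concurrently; each one gets its own page
        results = await crawler.arun_many(urls, config=CrawlerRunConfig())
        assert all(result.success for result in results)
```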
### Running Tests

**With pytest** (if available):

```bash
cd /path/to/crawl4ai
pytest tests/browser/test_cdp_concurrency.py -v
```

**Standalone smoke test**:

```bash
cd /path/to/crawl4ai
python3 tests/browser/smoke_test_cdp.py
```
## Migration Guide

### For Users

No breaking changes. Existing code will continue to work, but with better reliability in concurrent scenarios.

### For Contributors

When working with managed browsers:

1. Always use the page lock when creating pages in shared contexts
2. Prefer creating new pages over reusing existing ones for concurrent operations
3. Use structured logging with parameter substitution
4. Follow PEP 8 import organization
## Performance Impact

- **Positive**: Eliminates race conditions and crashes in concurrent scenarios
- **Neutral**: Page creation overhead is negligible compared to page navigation
- **Consideration**: More pages may be created, but they are properly closed after use

## Backward Compatibility

All changes are backward compatible. Session-based page reuse still works as before when `session_id` is provided.

## Related Issues

- Fixes race conditions in concurrent `arun_many` calls with CDP browsers
- Addresses "Target page/context closed" errors
- Improves browser startup reliability

## Future Improvements

Consider:

1. Configurable page pooling with proper lifecycle management
2. More granular locks for different contexts
3. Metrics for page creation/reuse patterns
4. Connection pooling for CDP connections