Files
crawl4ai/CHANGES_CDP_CONCURRENCY.md
copilot-swe-agent[bot] c1c5dfc49b Add smoke test and comprehensive documentation
- Created standalone smoke test script for quick validation
- Added detailed CHANGES_CDP_CONCURRENCY.md documentation
- Documented all fixes, testing approach, and migration guide
- Smoke test can run without pytest for easy verification

Co-authored-by: Ahmed-Tawfik94 <106467151+Ahmed-Tawfik94@users.noreply.github.com>
2025-11-06 08:20:39 +00:00

7.1 KiB

CDP Browser Concurrency Fixes and Improvements

Overview

This document describes the changes made to fix concurrency issues with CDP (Chrome DevTools Protocol) browsers when using arun_many and improve overall browser management.

Problems Addressed

  1. Race Conditions in Page Creation: When using managed CDP browsers with concurrent arun_many calls, the code attempted to reuse existing pages from context.pages, leading to race conditions and "Target page/context closed" errors.

  2. Proxy Configuration Issues: Proxy credentials were incorrectly embedded in the --proxy-server URL, which doesn't work properly with CDP browsers.

  3. Insufficient Startup Checks: Browser process startup checks were minimal and didn't catch early failures effectively.

  4. Unclear Logging: Logging messages lacked structure and context, making debugging difficult.

  5. Duplicate Browser Arguments: Browser launch arguments could contain duplicates despite deduplication attempts.

Solutions Implemented

1. Always Create New Pages in Managed Browser Mode

File: crawl4ai/browser_manager.py (lines 1106-1113)

Change: Modified get_page() method to always create new pages instead of attempting to reuse existing ones for managed browsers without storage_state.

Before:

context = self.default_context
pages = context.pages
page = next((p for p in pages if p.url == crawlerRunConfig.url), None)
if not page:
    if pages:
        page = pages[0]
    else:
        # Create new page only if none exist
        async with self._page_lock:
            page = await context.new_page()

After:

context = self.default_context
# Always create new pages instead of reusing existing ones
# This prevents race conditions in concurrent scenarios (arun_many with CDP)
# Serialize page creation to avoid 'Target page/context closed' errors
async with self._page_lock:
    page = await context.new_page()
await self._apply_stealth_to_page(page)

Benefits:

  • Eliminates race conditions when multiple tasks call arun_many concurrently
  • Each request gets a fresh, independent page
  • Page lock serializes creation to prevent TOCTOU (Time-of-check to time-of-use) issues

2. Fixed Proxy Flag Formatting

File: crawl4ai/browser_manager.py (lines 103-109)

Change: Removed credentials from proxy URL as they should be handled via separate authentication mechanisms in CDP.

Before:

elif config.proxy_config:
    creds = ""
    if config.proxy_config.username and config.proxy_config.password:
        creds = f"{config.proxy_config.username}:{config.proxy_config.password}@"
    flags.append(f"--proxy-server={creds}{config.proxy_config.server}")

After:

elif config.proxy_config:
    # Note: For CDP/managed browsers, proxy credentials should be handled
    # via authentication, not in the URL. Only pass the server address.
    flags.append(f"--proxy-server={config.proxy_config.server}")

3. Enhanced Startup Checks

File: crawl4ai/browser_manager.py (lines 298-336)

Changes:

  • Multiple check intervals (0.1s, 0.2s, 0.3s) to catch early failures
  • Capture and log stdout/stderr on failure (limited to 200 chars)
  • Raise RuntimeError with detailed diagnostics on startup failure
  • Log process PID on successful startup in verbose mode

Benefits:

  • Catches browser crashes during startup
  • Provides detailed diagnostic information for debugging
  • Fails fast with clear error messages

4. Improved Logging

File: crawl4ai/browser_manager.py (lines 218-291)

Changes:

  • Structured logging with proper parameter substitution
  • Log browser type, port, and headless status at launch
  • Format and log full command with proper shell escaping
  • Better error messages with context
  • Consistent use of logger with null checks

Example:

if self.logger and self.browser_config.verbose:
    self.logger.debug(
        "Launching browser: {browser_type} | Port: {port} | Headless: {headless}",
        tag="BROWSER",
        params={
            "browser_type": self.browser_type,
            "port": self.debugging_port,
            "headless": self.headless
        }
    )

5. Deduplicate Browser Launch Arguments

File: crawl4ai/browser_manager.py (lines 424-425)

Change: Added explicit deduplication after merging all flags.

# merge common launch flags
flags.extend(self.build_browser_flags(self.browser_config))
# Deduplicate flags - use dict.fromkeys to preserve order while removing duplicates
flags = list(dict.fromkeys(flags))

6. Import Refactoring

Files: crawl4ai/browser_manager.py, crawl4ai/browser_profiler.py, tests/browser/test_cdp_concurrency.py

Changes: Organized all imports according to PEP 8:

  1. Standard library imports (alphabetized)
  2. Third-party imports (alphabetized)
  3. Local imports (alphabetized)

Benefits:

  • Improved code readability
  • Easier to spot missing or unused imports
  • Consistent style across the codebase

Testing

New Test Suite

File: tests/browser/test_cdp_concurrency.py

Comprehensive test suite with 8 tests covering:

  1. Basic Concurrent arun_many: Validates multiple URLs can be crawled concurrently
  2. Sequential arun_many Calls: Ensures multiple sequential batches work correctly
  3. Stress Test: Multiple concurrent arun_many calls to test page lock effectiveness
  4. Page Isolation: Verifies pages are truly independent
  5. Different Configurations: Tests with varying viewport sizes and configs
  6. Error Handling: Ensures errors in one request don't affect others
  7. Large Batches: Scalability test with 10+ URLs
  8. Smoke Test Script: Standalone script for quick validation

Running Tests

With pytest (if available):

cd /path/to/crawl4ai
pytest tests/browser/test_cdp_concurrency.py -v

Standalone smoke test:

cd /path/to/crawl4ai
python3 tests/browser/smoke_test_cdp.py

Migration Guide

For Users

No breaking changes. Existing code will continue to work, but with better reliability in concurrent scenarios.

For Contributors

When working with managed browsers:

  1. Always use the page lock when creating pages in shared contexts
  2. Prefer creating new pages over reusing existing ones for concurrent operations
  3. Use structured logging with parameter substitution
  4. Follow PEP 8 import organization

Performance Impact

  • Positive: Eliminates race conditions and crashes in concurrent scenarios
  • Neutral: Page creation overhead is negligible compared to page navigation
  • Consideration: More pages may be created, but they are properly closed after use

Backward Compatibility

All changes are backward compatible. Session-based page reuse still works as before when session_id is provided.

  • Fixes race conditions in concurrent arun_many calls with CDP browsers
  • Addresses "Target page/context closed" errors
  • Improves browser startup reliability

Future Improvements

Consider:

  1. Configurable page pooling with proper lifecycle management
  2. More granular locks for different contexts
  3. Metrics for page creation/reuse patterns
  4. Connection pooling for CDP connections