crawl4ai/docs/md_v2/advanced/cdp-browser-crawling.md

# CDP Browser Crawling

> **New in v0.7.6**: Efficient concurrent crawling with managed CDP (Chrome DevTools Protocol) browsers. Connect to a running browser instance and perform multiple crawls without spawning new windows.

## 1. Overview

When working with CDP browsers, you can connect to an existing browser instance instead of launching a new one for each crawl. This is particularly useful for:

- **Development**: Keep your browser open with DevTools for debugging
- **Persistent Sessions**: Maintain authentication across multiple crawls
- **Resource Efficiency**: Reuse a single browser instance for multiple operations
- **Concurrent Crawling**: Run multiple crawls simultaneously with proper isolation

**Key Benefits:**

- ✅ Single browser window with multiple tabs (no window clutter)
- ✅ Shared state (cookies, localStorage) across crawls
- ✅ Concurrent safety with automatic page isolation
- ✅ Automatic cleanup to prevent memory leaks
- ✅ Works seamlessly with `arun_many()` for parallel crawling

---

## 2. Quick Start

### 2.1 Starting a CDP Browser

Use the Crawl4AI CLI to start a managed CDP browser:

```bash
# Start CDP browser on default port (9222)
crwl cdp

# Start on custom port
crwl cdp -d 9223

# Start in headless mode
crwl cdp --headless
```

The browser will stay running until you press 'q' or close the terminal.

### 2.2 Basic CDP Connection

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # Configure CDP connection
    browser_cfg = BrowserConfig(
        browser_type="chromium",
        cdp_url="http://localhost:9222",
        verbose=True
    )

    # Crawl a single URL
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig()
        )
        print(f"Success: {result.success}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(main())
```

---

## 3. Concurrent Crawling with arun_many()

The real power of CDP crawling shines with `arun_many()`. The browser manager automatically handles:

- **Page Isolation**: Each crawl gets its own tab
- **Context Sharing**: All tabs share cookies and localStorage
- **Concurrent Safety**: Proper locking prevents race conditions
- **Auto Cleanup**: Tabs are closed after crawling (except sessions)

### 3.1 Basic Concurrent Crawling

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def crawl_multiple_urls():
    # URLs to crawl
    urls = [
        "https://example.com",
        "https://httpbin.org/html",
        "https://www.python.org",
    ]

    # Configure CDP browser
    browser_cfg = BrowserConfig(
        browser_type="chromium",
        cdp_url="http://localhost:9222",
        verbose=False
    )

    # Configure crawler (bypass cache for fresh data)
    crawler_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )

    # Crawl all URLs concurrently
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=crawler_cfg
        )

        # Process results
        for result in results:
            print(f"\nURL: {result.url}")
            if result.success:
                print(f"✓ Success | Content length: {len(result.markdown)}")
            else:
                print(f"✗ Failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(crawl_multiple_urls())
```

### 3.2 With Session Management

Use sessions to maintain authentication and state across individual crawls:

```python
async def crawl_with_sessions():
    browser_cfg = BrowserConfig(
        browser_type="chromium",
        cdp_url="http://localhost:9222"
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        # First crawl: Login page
        login_result = await crawler.arun(
            url="https://example.com/login",
            config=CrawlerRunConfig(
                session_id="my-session",  # Session persists
                js_code="document.querySelector('#login').click();"
            )
        )

        # Second crawl: Reuse authenticated session
        dashboard_result = await crawler.arun(
            url="https://example.com/dashboard",
            config=CrawlerRunConfig(
                session_id="my-session"  # Same session, cookies preserved
            )
        )
```

---

## 4. How It Works

### 4.1 Browser Context Reuse

When using CDP browsers, Crawl4AI:

1. **Connects** to the existing browser via CDP URL
2. **Reuses** the default browser context (single window)
3. **Creates** new pages (tabs) for each crawl
4. **Locks** page creation to prevent concurrent races
5. **Cleans up** pages after crawling (unless it's a session)

```python
# Internal behavior (simplified)
if self.config.use_managed_browser:
    context = self.default_context  # Shared context

    # Thread-safe page creation
    async with self._page_lock:
        page = await context.new_page()  # New tab per crawl

    # After crawl completes
    if not config.session_id:
        await page.close()  # Auto cleanup
```

### 4.2 Page Lifecycle

```mermaid
graph TD
    A[Start Crawl] --> B{Has session_id?}
    B -->|Yes| C[Reuse existing page]
    B -->|No| D[Create new page/tab]
    D --> E[Navigate & Extract]
    C --> E
    E --> F{Is session?}
    F -->|Yes| G[Keep page open]
    F -->|No| H[Close page]
    H --> I[End]
    G --> I
```

### 4.3 State Sharing

All pages in the same context share:

- 🍪 **Cookies**: Authentication tokens, preferences
- 💾 **localStorage**: Client-side data storage
- 🔐 **sessionStorage**: Per-tab session data
- 🌐 **Network cache**: Shared HTTP cache

This makes it perfect for crawling authenticated sites or maintaining state across multiple pages.

---

## 5. Configuration Options

### 5.1 BrowserConfig for CDP

```python
browser_cfg = BrowserConfig(
    browser_type="chromium",           # Must be "chromium" for CDP
    cdp_url="http://localhost:9222",   # CDP endpoint URL
    verbose=True,                      # Log browser operations

    # Optional: Override headers for all requests
    headers={
        "Accept-Language": "en-US,en;q=0.9",
    },

    # Optional: Set user agent
    user_agent="Mozilla/5.0 ...",

    # Optional: Enable stealth mode (requires dedicated browser)
    # enable_stealth=False,  # Not compatible with CDP
)
```

### 5.2 CrawlerRunConfig Options

```python
crawler_cfg = CrawlerRunConfig(
    # Session management
    session_id="my-session",           # Persist page across calls

    # Caching
    cache_mode=CacheMode.BYPASS,       # Fresh data every time

    # Browser location (affects timezone, locale)
    locale="en-US",
    timezone_id="America/New_York",
    geolocation={
        "latitude": 40.7128,
        "longitude": -74.0060
    },

    # Proxy (per-crawl override)
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "user",
        "password": "pass"
    }
)
```

---

## 6. Advanced Patterns

### 6.1 Streaming Results

Process URLs as they complete instead of waiting for all:

```python
async def stream_crawl_results():
    browser_cfg = BrowserConfig(
        browser_type="chromium",
        cdp_url="http://localhost:9222"
    )

    urls = ["https://example.com" for _ in range(100)]

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        # Stream results as they complete
        async for result in crawler.arun_many(
            urls=urls,
            config=CrawlerRunConfig(stream=True)
        ):
            if result.success:
                print(f"✓ {result.url}: {len(result.markdown)} chars")
                # Process immediately instead of waiting for all
                await save_to_database(result)
```

### 6.2 Custom Concurrency Control

```python
from crawl4ai import CrawlerRunConfig

# Limit concurrent crawls to 3
crawler_cfg = CrawlerRunConfig(
    semaphore_count=3,      # Max 3 concurrent requests
    mean_delay=0.5,         # Average 0.5s delay between requests
    max_range=1.0,          # +/- 1s random delay
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(urls, config=crawler_cfg)
```

### 6.3 Multi-Config Crawling

Different configurations for different URL groups:

```python
from crawl4ai import CrawlerRunConfig

# Fast crawl for static pages
fast_config = CrawlerRunConfig(
    wait_until="domcontentloaded",
    page_timeout=30000
)

# Slow crawl for dynamic pages
slow_config = CrawlerRunConfig(
    wait_until="networkidle",
    page_timeout=60000,
    js_code="window.scrollTo(0, document.body.scrollHeight);"
)

configs = [fast_config, slow_config, fast_config]
urls = ["https://static.com", "https://dynamic.com", "https://static2.com"]

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(urls, configs=configs)
```

---

## 7. Best Practices

### 7.1 Resource Management

✅ **DO:**
```python
# Use context manager for automatic cleanup
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(urls)
# Browser connection closed automatically
```

❌ **DON'T:**
```python
# Manual management risks resource leaks
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()
results = await crawler.arun_many(urls)
# Forgot to call crawler.close()!
```

### 7.2 Session Management

✅ **DO:**
```python
# Use sessions for related crawls
config = CrawlerRunConfig(session_id="user-flow")
await crawler.arun(login_url, config=config)
await crawler.arun(dashboard_url, config=config)
await crawler.kill_session("user-flow")  # Clean up when done
```

❌ **DON'T:**
```python
# Creating new session IDs unnecessarily
for i in range(100):
    config = CrawlerRunConfig(session_id=f"session-{i}")
    await crawler.arun(url, config=config)
# 100 unclosed sessions accumulate!
```

### 7.3 Error Handling

```python
async def robust_crawl(urls):
    browser_cfg = BrowserConfig(
        browser_type="chromium",
        cdp_url="http://localhost:9222"
    )

    try:
        async with AsyncWebCrawler(config=browser_cfg) as crawler:
            results = await crawler.arun_many(urls)

            # Separate successes and failures
            successes = [r for r in results if r.success]
            failures = [r for r in results if not r.success]

            print(f"✓ {len(successes)} succeeded")
            print(f"✗ {len(failures)} failed")

            # Retry failures with different config
            if failures:
                retry_urls = [r.url for r in failures]
                retry_config = CrawlerRunConfig(
                    page_timeout=120000,  # Longer timeout
                    wait_until="networkidle"
                )
                retry_results = await crawler.arun_many(
                    retry_urls,
                    config=retry_config
                )

            return successes + retry_results

    except Exception as e:
        print(f"Fatal error: {e}")
        return []
```

---

## 8. Troubleshooting

### 8.1 Connection Issues

**Problem**: `Cannot connect to CDP browser`

```python
# Check CDP browser is running
$ lsof -i :9222
# Should show: Chromium  PID USER   FD   TYPE  ...

# Or start it if not running
$ crwl cdp
```

**Problem**: `ERR_ABORTED` errors in concurrent crawls

✅ **Fixed in v0.7.6**: This issue has been resolved. Pages are now properly isolated with locking.

### 8.2 Performance Issues

**Problem**: Too many open tabs

```python
# Ensure you're not using session_id for everything
config = CrawlerRunConfig()  # No session_id
await crawler.arun_many(urls, config=config)
# Pages auto-close after crawling
```

**Problem**: Memory leaks

```python
# Always use context manager
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    # Crawling code here
    pass
# Automatic cleanup on exit
```

### 8.3 State Issues

**Problem**: Cookies not persisting

```python
# Use the same context (automatic with CDP)
browser_cfg = BrowserConfig(cdp_url="http://localhost:9222")
# All crawls share cookies automatically
```

**Problem**: Need isolated state

```python
# Use different CDP endpoints or non-CDP browsers
browser_cfg_1 = BrowserConfig(cdp_url="http://localhost:9222")
browser_cfg_2 = BrowserConfig(cdp_url="http://localhost:9223")
# Completely isolated browsers
```

---

## 9. Comparison: CDP vs Regular Browsers

| Feature | CDP Browser | Regular Browser |
|---------|-------------|-----------------|
| **Window Management** | ✅ Single window, multiple tabs | ❌ New window per context |
| **Startup Time** | ✅ Instant (already running) | ⏱️ ~2-3s per launch |
| **State Sharing** | ✅ Shared cookies/localStorage | ⚠️ Isolated by default |
| **Concurrent Safety** | ✅ Automatic locking | ✅ Separate processes |
| **Memory Usage** | ✅ Lower (shared browser) | ⚠️ Higher (multiple processes) |
| **Session Persistence** | ✅ Native support | ✅ Via session_id |
| **Stealth Mode** | ❌ Not compatible | ✅ Full support |
| **Best For** | Development, authenticated crawls | Production, isolated crawls |

---

## 10. Real-World Examples

### 10.1 E-commerce Product Scraping

```python
async def scrape_products():
    browser_cfg = BrowserConfig(
        browser_type="chromium",
        cdp_url="http://localhost:9222"
    )

    # Get product URLs from category page
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        category_result = await crawler.arun(
            url="https://shop.example.com/category",
            config=CrawlerRunConfig(
                css_selector=".product-link"
            )
        )

        # Extract product URLs
        product_urls = extract_urls(category_result.links)

        # Crawl all products concurrently
        product_results = await crawler.arun_many(
            urls=product_urls,
            config=CrawlerRunConfig(
                css_selector=".product-details",
                semaphore_count=5  # Polite crawling
            )
        )

        return [extract_product_data(r) for r in product_results]
```

### 10.2 News Article Monitoring

```python
import asyncio
from datetime import datetime

async def monitor_news_sites():
    browser_cfg = BrowserConfig(
        browser_type="chromium",
        cdp_url="http://localhost:9222"
    )

    news_sites = [
        "https://news.site1.com",
        "https://news.site2.com",
        "https://news.site3.com"
    ]

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        while True:
            print(f"\n[{datetime.now()}] Checking for updates...")

            results = await crawler.arun_many(
                urls=news_sites,
                config=CrawlerRunConfig(
                    cache_mode=CacheMode.BYPASS,  # Always fresh
                    css_selector=".article-headline"
                )
            )

            for result in results:
                if result.success:
                    headlines = extract_headlines(result)
                    for headline in headlines:
                        if is_new(headline):
                            notify_user(headline)

            # Check every 5 minutes
            await asyncio.sleep(300)
```

---

## 11. Summary

CDP browser crawling offers:

- 🚀 **Performance**: Faster startup, lower resource usage
- 🔄 **State Management**: Shared cookies and authentication
- 🎯 **Concurrent Safety**: Automatic page isolation and cleanup
- 💻 **Developer Friendly**: Visual debugging with DevTools

**When to use CDP:**
- Development and debugging
- Authenticated crawling (login required)
- Sequential crawls needing state
- Resource-constrained environments

**When to use regular browsers:**
- Production deployments
- Maximum isolation required
- Stealth mode needed
- Distributed/cloud crawling

For most use cases, **CDP browsers provide the best balance** of performance, convenience, and safety.