Fix memory leaks and race conditions when using arun_many() with managed CDP browsers. Each crawl now gets proper page isolation with automatic cleanup while maintaining shared browser context. Key fixes: - Close non-session pages after crawling to prevent tab accumulation - Add thread-safe page creation with locks to avoid concurrent access - Improve page lifecycle management for managed vs non-managed browsers - Keep session pages alive for authentication persistence - Prevent TOCTOU (time-of-check-time-of-use) race conditions This ensures stable parallel crawling without memory growth or browser instability.
16 KiB
CDP Browser Crawling
New in v0.7.6: Efficient concurrent crawling with managed CDP (Chrome DevTools Protocol) browsers. Connect to a running browser instance and perform multiple crawls without spawning new windows.
1. Overview
When working with CDP browsers, you can connect to an existing browser instance instead of launching a new one for each crawl. This is particularly useful for:
- Development: Keep your browser open with DevTools for debugging
- Persistent Sessions: Maintain authentication across multiple crawls
- Resource Efficiency: Reuse a single browser instance for multiple operations
- Concurrent Crawling: Run multiple crawls simultaneously with proper isolation
Key Benefits:
- ✅ Single browser window with multiple tabs (no window clutter)
- ✅ Shared state (cookies, localStorage) across crawls
- ✅ Concurrent safety with automatic page isolation
- ✅ Automatic cleanup to prevent memory leaks
- ✅ Works seamlessly with
arun_many()for parallel crawling
2. Quick Start
2.1 Starting a CDP Browser
Use the Crawl4AI CLI to start a managed CDP browser:
# Start CDP browser on default port (9222)
crwl cdp
# Start on custom port
crwl cdp -d 9223
# Start in headless mode
crwl cdp --headless
The browser will stay running until you press 'q' or close the terminal.
2.2 Basic CDP Connection
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def main():
# Configure CDP connection
browser_cfg = BrowserConfig(
browser_type="chromium",
cdp_url="http://localhost:9222",
verbose=True
)
# Crawl a single URL
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun(
url="https://example.com",
config=CrawlerRunConfig()
)
print(f"Success: {result.success}")
print(f"Content length: {len(result.markdown)}")
if __name__ == "__main__":
asyncio.run(main())
3. Concurrent Crawling with arun_many()
The real power of CDP crawling shines with arun_many(). The browser manager automatically handles:
- Page Isolation: Each crawl gets its own tab
- Context Sharing: All tabs share cookies and localStorage
- Concurrent Safety: Proper locking prevents race conditions
- Auto Cleanup: Tabs are closed after crawling (except sessions)
3.1 Basic Concurrent Crawling
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def crawl_multiple_urls():
# URLs to crawl
urls = [
"https://example.com",
"https://httpbin.org/html",
"https://www.python.org",
]
# Configure CDP browser
browser_cfg = BrowserConfig(
browser_type="chromium",
cdp_url="http://localhost:9222",
verbose=False
)
# Configure crawler (bypass cache for fresh data)
crawler_cfg = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS
)
# Crawl all URLs concurrently
async with AsyncWebCrawler(config=browser_cfg) as crawler:
results = await crawler.arun_many(
urls=urls,
config=crawler_cfg
)
# Process results
for result in results:
print(f"\nURL: {result.url}")
if result.success:
print(f"✓ Success | Content length: {len(result.markdown)}")
else:
print(f"✗ Failed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(crawl_multiple_urls())
3.2 With Session Management
Use sessions to maintain authentication and state across individual crawls:
async def crawl_with_sessions():
browser_cfg = BrowserConfig(
browser_type="chromium",
cdp_url="http://localhost:9222"
)
async with AsyncWebCrawler(config=browser_cfg) as crawler:
# First crawl: Login page
login_result = await crawler.arun(
url="https://example.com/login",
config=CrawlerRunConfig(
session_id="my-session", # Session persists
js_code="document.querySelector('#login').click();"
)
)
# Second crawl: Reuse authenticated session
dashboard_result = await crawler.arun(
url="https://example.com/dashboard",
config=CrawlerRunConfig(
session_id="my-session" # Same session, cookies preserved
)
)
4. How It Works
4.1 Browser Context Reuse
When using CDP browsers, Crawl4AI:
- Connects to the existing browser via CDP URL
- Reuses the default browser context (single window)
- Creates new pages (tabs) for each crawl
- Locks page creation to prevent concurrent races
- Cleans up pages after crawling (unless it's a session)
# Internal behavior (simplified)
if self.config.use_managed_browser:
context = self.default_context # Shared context
# Thread-safe page creation
async with self._page_lock:
page = await context.new_page() # New tab per crawl
# After crawl completes
if not config.session_id:
await page.close() # Auto cleanup
4.2 Page Lifecycle
graph TD
A[Start Crawl] --> B{Has session_id?}
B -->|Yes| C[Reuse existing page]
B -->|No| D[Create new page/tab]
D --> E[Navigate & Extract]
C --> E
E --> F{Is session?}
F -->|Yes| G[Keep page open]
F -->|No| H[Close page]
H --> I[End]
G --> I
4.3 State Sharing
All pages in the same context share:
- 🍪 Cookies: Authentication tokens, preferences
- 💾 localStorage: Client-side data storage
- 🔐 sessionStorage: Per-tab session data
- 🌐 Network cache: Shared HTTP cache
This makes it perfect for crawling authenticated sites or maintaining state across multiple pages.
5. Configuration Options
5.1 BrowserConfig for CDP
browser_cfg = BrowserConfig(
browser_type="chromium", # Must be "chromium" for CDP
cdp_url="http://localhost:9222", # CDP endpoint URL
verbose=True, # Log browser operations
# Optional: Override headers for all requests
headers={
"Accept-Language": "en-US,en;q=0.9",
},
# Optional: Set user agent
user_agent="Mozilla/5.0 ...",
# Optional: Enable stealth mode (requires dedicated browser)
# enable_stealth=False, # Not compatible with CDP
)
5.2 CrawlerRunConfig Options
crawler_cfg = CrawlerRunConfig(
# Session management
session_id="my-session", # Persist page across calls
# Caching
cache_mode=CacheMode.BYPASS, # Fresh data every time
# Browser location (affects timezone, locale)
locale="en-US",
timezone_id="America/New_York",
geolocation={
"latitude": 40.7128,
"longitude": -74.0060
},
# Proxy (per-crawl override)
proxy_config={
"server": "http://proxy.example.com:8080",
"username": "user",
"password": "pass"
}
)
6. Advanced Patterns
6.1 Streaming Results
Process URLs as they complete instead of waiting for all:
async def stream_crawl_results():
browser_cfg = BrowserConfig(
browser_type="chromium",
cdp_url="http://localhost:9222"
)
urls = ["https://example.com" for _ in range(100)]
async with AsyncWebCrawler(config=browser_cfg) as crawler:
# Stream results as they complete
async for result in crawler.arun_many(
urls=urls,
config=CrawlerRunConfig(stream=True)
):
if result.success:
print(f"✓ {result.url}: {len(result.markdown)} chars")
# Process immediately instead of waiting for all
await save_to_database(result)
6.2 Custom Concurrency Control
from crawl4ai import CrawlerRunConfig
# Limit concurrent crawls to 3
crawler_cfg = CrawlerRunConfig(
semaphore_count=3, # Max 3 concurrent requests
mean_delay=0.5, # Average 0.5s delay between requests
max_range=1.0, # +/- 1s random delay
)
async with AsyncWebCrawler(config=browser_cfg) as crawler:
results = await crawler.arun_many(urls, config=crawler_cfg)
6.3 Multi-Config Crawling
Different configurations for different URL groups:
from crawl4ai import CrawlerRunConfig
# Fast crawl for static pages
fast_config = CrawlerRunConfig(
wait_until="domcontentloaded",
page_timeout=30000
)
# Slow crawl for dynamic pages
slow_config = CrawlerRunConfig(
wait_until="networkidle",
page_timeout=60000,
js_code="window.scrollTo(0, document.body.scrollHeight);"
)
configs = [fast_config, slow_config, fast_config]
urls = ["https://static.com", "https://dynamic.com", "https://static2.com"]
async with AsyncWebCrawler(config=browser_cfg) as crawler:
results = await crawler.arun_many(urls, configs=configs)
7. Best Practices
7.1 Resource Management
✅ DO:
# Use context manager for automatic cleanup
async with AsyncWebCrawler(config=browser_cfg) as crawler:
results = await crawler.arun_many(urls)
# Browser connection closed automatically
❌ DON'T:
# Manual management risks resource leaks
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()
results = await crawler.arun_many(urls)
# Forgot to call crawler.close()!
7.2 Session Management
✅ DO:
# Use sessions for related crawls
config = CrawlerRunConfig(session_id="user-flow")
await crawler.arun(login_url, config=config)
await crawler.arun(dashboard_url, config=config)
await crawler.kill_session("user-flow") # Clean up when done
❌ DON'T:
# Creating new session IDs unnecessarily
for i in range(100):
config = CrawlerRunConfig(session_id=f"session-{i}")
await crawler.arun(url, config=config)
# 100 unclosed sessions accumulate!
7.3 Error Handling
async def robust_crawl(urls):
browser_cfg = BrowserConfig(
browser_type="chromium",
cdp_url="http://localhost:9222"
)
try:
async with AsyncWebCrawler(config=browser_cfg) as crawler:
results = await crawler.arun_many(urls)
# Separate successes and failures
successes = [r for r in results if r.success]
failures = [r for r in results if not r.success]
print(f"✓ {len(successes)} succeeded")
print(f"✗ {len(failures)} failed")
# Retry failures with different config
if failures:
retry_urls = [r.url for r in failures]
retry_config = CrawlerRunConfig(
page_timeout=120000, # Longer timeout
wait_until="networkidle"
)
retry_results = await crawler.arun_many(
retry_urls,
config=retry_config
)
return successes + retry_results
except Exception as e:
print(f"Fatal error: {e}")
return []
8. Troubleshooting
8.1 Connection Issues
Problem: Cannot connect to CDP browser
# Check CDP browser is running
$ lsof -i :9222
# Should show: Chromium PID USER FD TYPE ...
# Or start it if not running
$ crwl cdp
Problem: ERR_ABORTED errors in concurrent crawls
✅ Fixed in v0.7.6: This issue has been resolved. Pages are now properly isolated with locking.
8.2 Performance Issues
Problem: Too many open tabs
# Ensure you're not using session_id for everything
config = CrawlerRunConfig() # No session_id
await crawler.arun_many(urls, config=config)
# Pages auto-close after crawling
Problem: Memory leaks
# Always use context manager
async with AsyncWebCrawler(config=browser_cfg) as crawler:
# Crawling code here
pass
# Automatic cleanup on exit
8.3 State Issues
Problem: Cookies not persisting
# Use the same context (automatic with CDP)
browser_cfg = BrowserConfig(cdp_url="http://localhost:9222")
# All crawls share cookies automatically
Problem: Need isolated state
# Use different CDP endpoints or non-CDP browsers
browser_cfg_1 = BrowserConfig(cdp_url="http://localhost:9222")
browser_cfg_2 = BrowserConfig(cdp_url="http://localhost:9223")
# Completely isolated browsers
9. Comparison: CDP vs Regular Browsers
| Feature | CDP Browser | Regular Browser |
|---|---|---|
| Window Management | ✅ Single window, multiple tabs | ❌ New window per context |
| Startup Time | ✅ Instant (already running) | ⏱️ ~2-3s per launch |
| State Sharing | ✅ Shared cookies/localStorage | ⚠️ Isolated by default |
| Concurrent Safety | ✅ Automatic locking | ✅ Separate processes |
| Memory Usage | ✅ Lower (shared browser) | ⚠️ Higher (multiple processes) |
| Session Persistence | ✅ Native support | ✅ Via session_id |
| Stealth Mode | ❌ Not compatible | ✅ Full support |
| Best For | Development, authenticated crawls | Production, isolated crawls |
10. Real-World Examples
10.1 E-commerce Product Scraping
async def scrape_products():
browser_cfg = BrowserConfig(
browser_type="chromium",
cdp_url="http://localhost:9222"
)
# Get product URLs from category page
async with AsyncWebCrawler(config=browser_cfg) as crawler:
category_result = await crawler.arun(
url="https://shop.example.com/category",
config=CrawlerRunConfig(
css_selector=".product-link"
)
)
# Extract product URLs
product_urls = extract_urls(category_result.links)
# Crawl all products concurrently
product_results = await crawler.arun_many(
urls=product_urls,
config=CrawlerRunConfig(
css_selector=".product-details",
semaphore_count=5 # Polite crawling
)
)
return [extract_product_data(r) for r in product_results]
10.2 News Article Monitoring
import asyncio
from datetime import datetime
async def monitor_news_sites():
browser_cfg = BrowserConfig(
browser_type="chromium",
cdp_url="http://localhost:9222"
)
news_sites = [
"https://news.site1.com",
"https://news.site2.com",
"https://news.site3.com"
]
async with AsyncWebCrawler(config=browser_cfg) as crawler:
while True:
print(f"\n[{datetime.now()}] Checking for updates...")
results = await crawler.arun_many(
urls=news_sites,
config=CrawlerRunConfig(
cache_mode=CacheMode.BYPASS, # Always fresh
css_selector=".article-headline"
)
)
for result in results:
if result.success:
headlines = extract_headlines(result)
for headline in headlines:
if is_new(headline):
notify_user(headline)
# Check every 5 minutes
await asyncio.sleep(300)
11. Summary
CDP browser crawling offers:
- 🚀 Performance: Faster startup, lower resource usage
- 🔄 State Management: Shared cookies and authentication
- 🎯 Concurrent Safety: Automatic page isolation and cleanup
- 💻 Developer Friendly: Visual debugging with DevTools
When to use CDP:
- Development and debugging
- Authenticated crawling (login required)
- Sequential crawls needing state
- Resource-constrained environments
When to use regular browsers:
- Production deployments
- Maximum isolation required
- Stealth mode needed
- Distributed/cloud crawling
For most use cases, CDP browsers provide the best balance of performance, convenience, and safety.