Implement initial MVP for Docker-based browser management in Crawl4ai, enabling
remote browser execution in containerized environments.
Key Changes:
- Add browser_farm module with Docker support components:
* BrowserFarmService: Manages browser endpoints
* DockerBrowser: Handles Docker browser communication
* Basic health check implementation
* Dockerfile with optimized Chrome/Playwright setup:
- Based on python:3.10-slim for minimal size
- Includes all required system dependencies
- Auto-installs crawl4ai and sets up Playwright
- Configures Chrome with remote debugging
- Uses socat for port forwarding (9223)
- Update core components:
* Rename use_managed_browser to use_remote_browser for clarity
* Modify BrowserManager to support Docker mode
* Add Docker configuration in BrowserConfig
* Update context handling for remote browsers
- Add example:
* hello_world_docker.py demonstrating Docker browser usage
Technical Details:
- Docker container exposes port 9223 (mapped to host:9333)
- Uses CDP (Chrome DevTools Protocol) for remote connection
- Maintains compatibility with existing managed browser features
- Simplified endpoint management for MVP phase
- Optimized Docker setup:
* Minimal dependencies installation
* Proper Chrome flags for containerized environment
* Headless mode with GPU disabled
* Security considerations (no-sandbox mode)
Testing:
- Extensive Docker configuration testing and optimization
- Verified with hello_world_docker.py example
- Confirmed remote browser connection and crawling functionality
- Tested basic health checks
This is the first step towards a scalable browser farm solution, setting up
the foundation for future enhancements like resource monitoring, multiple
browser instances, and container lifecycle management.
26 KiB
Unclecode:
I plan to create a Docker container that runs multiple instances of the Chromium browser. We will connect these browsers to my library remotely using Playwright for crawling. I will share part of the code that manages the Playwright, which supports connecting to a remote browser. We pass a user data directory to persist data. My ultimate goal is to have a Docker container with many browsers, allowing me to spawn multiple instances in the cloud for users to crawl at scale.
When we launch this Docker, it initially does not run any browsers. Then we will create a new module in library in charge if managing this Docker, and that will create the browsers within that Docker, enabling us to set up multiple browsers with different configurations, settings, and user data directories, or even without a user data directory. We will communicate with these browsers accordingly.
To achieve this, we need to create a class responsible for monitoring and managing the Docker. This class will create the Docker, turn it on, or build it for the first time. Later, we will convert this Docker into an image containing all the required libraries, preferably based on Ubuntu for ease of use. The Python wrapper we create will include functions to manage these multiple Docker instances.
To simplify the initial approach, we can start by imagining a single Docker container. We will create this Docker, keep it running, and then develop a Python class capable of creating a browser. We will use Crawl4ai to connect to that browser and crawl a simple website. If this is successful, we can gradually expand from there.
Before implementing any code, I would like you to rewrite what I just explained to ensure we have a common understanding. Please digest this information and present it in a professional manner, reflecting my needs and suggesting how we should proceed without jumping into coding just yet.
It's really important to me that you act as a professional software engineer in this area, understand my needs, and suggest the best way to handle them.
AI
Core Objective:
- Create a scalable browser farm for distributed web crawling using Docker containers
- Enable remote connection to these browsers via Playwright
- Support multiple concurrent browser instances with different configurations
Architecture Components:
- Docker Container Layer:
- Base container running Ubuntu
- No pre-running browsers at startup
- Contains all necessary dependencies for Chromium/Playwright
- Ability to spawn multiple browser instances on demand
- Browser Management Layer:
- Python class to handle Docker lifecycle (create/start/stop/rebuild)
- Functions to spawn/manage browser instances within the container
- Support for different browser configs (with/without user data dirs)
- Connection management via Playwright's remote capabilities
- Integration with Crawl4ai:
- Connect library to remote browsers
- Handle crawling operations through remote browser instances
- Manage browser sessions and persistence
Suggested Development Phases:
- Single Container MVP:
- Create basic Docker setup
- Implement Python management class
- Test single browser spawning/connection
- Validate with simple crawl operation
- Multi-Browser Extension:
- Add multi-browser support within container
- Implement browser pool management
- Add configuration profiles
- Scale Out:
- Multiple container support
- Load balancing
- Cloud deployment considerations
Key Technical Considerations:
- Container resource management
- Browser process isolation
- Connection pooling
- Error recovery/resilience
- Resource cleanup
Would you like me to elaborate on any of these aspects before we move to implementation?
Unclecode
Ok, before we start ask me if any follow up questions?
AI
Sure, I have a few crucial questions to clarify the architecture:
-
For browser persistence, do you plan to map user data directories from the host to the container, or keep them inside the container? This affects container lifecycle management.
-
When spawning browsers, would you prefer:
- Pre-defined number at container startup
- Dynamic creation on-demand
- Or both options?
-
For browser connection management:
- Do you need connection pooling?
- Should failed browsers auto-restart?
- What's your strategy for connection timeouts/retries?
-
For monitoring/health checks:
- What metrics do you need?
- How do you want to handle browser crashes?
- Do you need logs from individual browser instances?
-
For security:
- Will browsers need different isolation levels?
- Do you need network segregation between instances?
- Any specific security requirements for the user data?
These answers will help shape the implementation details. Which should we tackle first?
Unclecode
For the first question, we should support both options. This means that when you create a new browser, we can define whether it will map to the user data directories or be created inside Docker. We can create a Docker container and, by default, map a folder from the home directory of Crawl4ai to use as the default location. When we create or pass any user data directory, we will use that folder. Therefore, when users want to create a data directory, they must save it there and then pass it.
For the second question, we need dynamic creation. When we create, we can request how many we need. One very important aspect is that this browser management layer must understand resources in a Docker environment, including available memory, CPU, and base resources, and then make decisions. If we need to create a new Docker, it should handle that as well. It's also beneficial to have a predefined number as another option for setup and creation. The key point is that our browser management leaders in this system must understand available resources and act dynamically.
We must remember that when we have multiple browsers, each browser allows us to create multiple contexts, such as pages or browser tabs, which can operate in parallel, with each one being a separate process. Essentially, we are creating a collection of multi-process applications, so it is crucial to consider the resources. This system must never crash. We should always aim to use close to 70-80% of available resources and not exceed that, while being smart about distributing all incoming requests.
We can think of this browser management layer as a proxy server, like Nginx. We are building a wrapper that handles many crawling requests and distributes them among multiple Dockers running separately, each with multiple browser instances. However, the developer experience should not be affected by this complexity. Developers should only interact with one browser. In my library, it is possible to pass a session ID, which translates to one page in a browser. When there is a session ID, it means we always use the same page repeatedly to maintain the status of the previous crawling system. This orchestration should ensure simplicity for the developer.
We have this concept of connection pooling because we create all these browsers and make them available for immediate use. Otherwise, every time we want to create a browser, we have to wait for the object to be created. So, we definitely need this connection pooling. However, as I mentioned, developers shouldn't need to be aware of this; they should just use the library as they currently do. If a browser fails, we must have an auto-restart feature. I assume we have a layer that controls health, performance, and efficiency. This layer decides to increase or decrease the number of available running processes or manage memory. It should have the authority to restart, add, reduce, or make any similar decisions. This relates to timeouts and retries. In my current library, when users crawl, they can pass many parameters. I will share those parameters with you and provide a sample of the code to ensure you understand what the library does before we implement anything.
For number four, monitoring and health checks, a few things come to mind, but I would like you to add your ideas as well. One important aspect is that if a browser crashes, we need to bring it back into the game if available resources allow. We must log and easily identify the reason, especially when debugging what happened. Every single page in this structure must have a unique locator address. This address includes the Docker, the browser, and finally the page. The combination of this unique address allows us to connect the developer's request to that particular page within this structure. When developers share a session ID, they want to ensure they access the same page during crawling sessions repeatedly. This is crucial. As I mentioned, developers don’t need to know anything about these structures; for them, it’s just a simple interface. We must monitor and regularly check that all running pages and browsers are available. Additionally, when we create a browser, we create multiple pages. This means that every browser has a certain number of pages already open and ready for use. The goal is to maintain a pool of connections to all these pages to enable instant crawling. It's important to note that we don’t just create browsers; we open pages in those browsers and then make them available in our connection tool.
For no 5, for now we skip it.
Let's revise your understanding so far. Remember we start small, then get some real result, and iteratively we add. I want to hv a minimal level that allows my user to start using Crawl4ai in this way so they no concern about browser and use library as they use in their own machine, this helps them go production easier, then itertively receiving reproted bugs, fisxng them we add all features.
AI
Let me summarize the core architecture and initial MVP based on your detailed explanation:
Core Architecture:
- Browser Management Layer acts as a proxy/orchestrator
- Each Docker can host multiple browsers
- Each browser can have multiple pages/contexts
- Resources are monitored and managed at Docker/browser/page level
- Connection pooling to pre-initialize pages for instant use
- Unique addressing: Docker->Browser->Page
- Support for both mapped and container-internal user data directories
MVP Focus:
-
Single Docker Container:
- Dynamic browser creation
- Pre-initialized pages per browser
- Basic resource monitoring
- Connection pooling
- Session persistence via page reuse
-
Connection System:
- Unique locator system (Docker/Browser/Page)
- Session ID mapping to specific pages
- Transparent to end users of Crawl4ai
-
Resource Management:
- Basic monitoring of memory/CPU
- Keep resource usage under 70-80%
- Auto-restart for failed browsers
- Simple health checks
Critical Questions before starting:
- What's the minimum number of pre-initialized pages per browser?
- What metrics define a "healthy" browser/page?
- Should the first version support mapped user directories, or start with in-container only?
- What's the minimum acceptable resource headroom before scaling?
Would you like to clarify any of these points before we start with the initial implementation?
Unclecode
To answer your question, remember that when we finish this, our CLI will have a command line to activate this cluster for end users (crawl4ai-deploy). When they do that, they can pass parameters or attach a YAML file containing all the necessary parameters. For many of them, we can support a fixed number. For example, when you set pre-initialized pages, we can also have dynamic options. Dynamic means we look at the resources, consider a buffer, and then decide how many pages to create based on the available resources. That's how I like to think about it.
For number two, you can suggest a good starting metric to determine if we have a healthy page. The most important thing is that the page object is available and usable. I think something like that would work.
Regarding how to pass directories, For the first version, I'm concerned that when users use the system, they may already have a user directory they want to utilize. For example, they might open Chrome browsers remotely, log in to some websites, and want to use that data. So, I think it’s better to map the folder, by the way after installing library ~/.crawl4ai is the home directory; otherwise, we have to transfer this data, and that’s what I believe we should do.
I don't understand the last question about acceptable resource headroom.
AI
For the resource headroom question, I was asking about when we should trigger scaling decisions - like when should we stop creating new pages/browsers based on resource usage. But I see from your earlier points that we should aim to stay under 70-80% resource utilization.
Let me clarify a few more specifics before implementation:
- CLI Configuration:
# Example crawl4ai-deploy.yaml
browsers:
init_mode: "dynamic" # or "fixed"
fixed_count: 3 # if fixed mode
pages_per_browser:
mode: "dynamic" # or "fixed"
fixed_count: 5 # if fixed mode
user_data_dir: "~/.crawl4ai/browser_profiles"
resources:
max_cpu_percent: 80
max_memory_percent: 80
headroom_buffer: 20 # percent to keep free
health_check:
interval: 30 # seconds
metrics:
- page_response_time
- browser_memory
- connection_status
- For health checks, I propose:
- Page object responds to basic Playwright commands
- Browser process is running
- Memory usage per browser instance
- Connection can be established
- Basic navigation works
Do these configurations look right? Should we adjust anything before starting the implementation?
Also, for the MVP, would you prefer we start with:
- Docker setup + basic browser spawning
- Or the Python management layer first?
Unclecode
Sounds good, about how to start, My brain works better when I receive immediate feedback rather than working for a long time and then testing it. A good way for me to start is to create this management layer that sets up the Docker, then creates the browser and page, or reads the YAML file to do that, and then uses Crawl4ai to perform simple crawling. I can skip the Docker creation part, but I know it's not very difficult. Alternatively, I could create a Docker in the command line and then use it, but I prefer to see this part. I hope to complete this creation process, make it available, connect it to my current library, and then run it. So, yeah.
AI
Got it - let's start with immediate feedback loop. Here's our first steps:
- Create basic Browser Manager class that:
- Creates/controls Docker container
- Creates browsers with pages
- Provides connection info for Crawl4ai
-
Simple YAML config reader
-
Quick test script using Crawl4ai to verify it works
Before coding, need to confirm:
- Do you want to start with a hardcoded Docker setup first (just Python side), or include Docker file creation too?
- For testing - what's a simple site we should use?
- Should I see your current Playwright connection code first to ensure compatibility?
This way we get something running end-to-end quickly, then iterate on each piece.
Unclecode
Ok check: @plans/docstring.md @crawl4ai/async_crawler_strategy.py @docs/example/hello_world.py
We need a service class in charge of creation/management/communication of this Docker network (cluster node or whatever you call it), we launch it using cli, it basically will be a running service and monitoring the cluster (aka redis server or mongodb server). It keep checking, managing crashes and more... and communicate with labrary.
Then BrowserManager, when see use_docker is set ON, it uses a new class (let's call it DockerBroswer) and that call the running service to get a host/port to the selected or available browser, and create a RemoteConnector pointing to that address, from this point forward, rest is like we are using RemoteConnector (I am thinking to change the name to RemoteBrowser) :
AsyncWebCrawler -> AsyncPlaywrightCrawlerStrategy -> BrowserManager -> DockerBroswer (communicate with running service to get host/port) -> RemoteConnector
Is that clear? And please be my guest and suggest better names for these new classes
AI
Ah, now I understand better! Let me propose a clearer architecture:
# 1. Service Layer (Running as daemon)
class BrowserFarmService: # or CrawlClusterService?
"""Manages the Docker browser farm, running as a service"""
def __init__(self, config_path: str = "~/.crawl4ai/farm.yaml"):
self.containers = {} # track running containers
self.browser_pool = {} # available browsers
self.resource_monitor = ResourceMonitor()
async def start(self):
"""Start the service, init Docker setup"""
pass
async def get_available_browser(self) -> BrowserEndpoint:
"""Returns available browser endpoint or creates new one"""
pass
async def health_check(self):
"""Monitor containers/browsers health"""
pass
# 2. Client Layer (Used by BrowserManager)
class DockerBrowser: # or RemoteBrowser?
"""Client to communicate with BrowserFarmService"""
def __init__(self, service_url: str = "http://localhost:11235"):
self.service = BrowserFarmServiceClient(service_url)
async def get_browser(self) -> tuple[str, int]:
"""Get host/port for an available browser"""
endpoint = await self.service.get_browser()
return endpoint.host, endpoint.port
# 3. Connect via RemoteConnector (renamed to RemoteBrowser)
class RemoteBrowser: # formerly RemoteConnector
"""Connects to remote browser instances"""
pass
Better names could be:
- BrowserFarmService -> CrawlClusterService
- DockerBrowser -> RemoteBrowserClient
- RemoteConnector -> RemoteConnector
Does this align better with your vision? What parts need refinement?
Unclecode
From this point, I want to continue the conversation in another chat session. I need you to write a detailed outline of what we are supposed to do, including the major plan you suggested, all the components, and the part we are currently working on and how we want to achieve it. These details are important. I will then move to the end of the chat session, including the steps you just mentioned, and continue from there. We need to start with understanding and defining the problem. Imagine the new chat session; the large language model in that session doesn't know anything. We have to explain everything we are going to do, including the important files from the codebase, what they are, why we are doing this, and what we are trying to achieve. After that, we will discuss the design and the structures we are trying to apply. Then, we will focus on the MVP, the first stage we are currently working on. Make sure this outline is detailed and covers everything.
AI
Here's a comprehensive outline for the new session:
Overview: Crawl4ai Docker Browser Management System
1. Understanding Current System
- Crawl4ai is an async web crawler using Playwright
- Key components:
AsyncWebCrawler: Main interface for usersAsyncPlaywrightCrawlerStrategy: Manages browser automationBrowserManager: Handles browser lifecycleManagedBrowser: Connects to browser instances
- Current architecture supports local browser management
- Uses user data directories for session persistence
2. Problem Statement
- Users need production-ready browser management
- Requirements:
- Scale browser instances across containers
- Resource monitoring and management
- Session persistence and recovery
- Simple developer experience
- Connection pooling
- Health monitoring
- Must maintain current API compatibility
3. Proposed Architecture
3.1 Components
-
CrawlClusterService(Daemon)- Manages Docker containers
- Monitors resources
- Handles browser lifecycle
- Provides browser endpoints
-
RemoteBrowserClient- Communicates with CrawlClusterService
- Gets available browser endpoints
- Handles connection management
-
RemoteConnector(formerly ManagedBrowser)- Connects to remote browser instances
- Manages browser contexts and pages
3.2 Flow
AsyncWebCrawler -> AsyncPlaywrightCrawlerStrategy -> BrowserManager
-> RemoteBrowserClient -> CrawlClusterService -> RemoteConnector
4. MVP Implementation (Current Focus)
4.1 Phase 1 - Basic Docker Setup
- Docker container with Chrome
- Exposed debugging port
- Basic health check endpoint
- Single browser instance
4.2 Phase 2 - Service Layer (Current Step)
- Simple
BrowserFarmService - Returns hardcoded endpoint
- Basic health monitoring
- Docker browser client integration
4.3 Phase 3 - Integration
- Update BrowserConfig for Docker mode
- Modify BrowserManager for remote connections
- Test with hello world example
5. Future Enhancements
5.1 Service Features
- Multiple browser instances
- Resource monitoring
- Auto-scaling
- Connection pooling
- Pre-initialized pages
5.2 Management Features
- Browser health checks
- Crash recovery
- Session persistence
- Resource optimization
5.3 Developer Tools
- CLI for cluster management
- Monitoring dashboard
- Configuration management
- Debugging tools
6. Current Implementation Status
- Docker container working
- Basic endpoint structure defined
- Starting service layer implementation
- Need to:
- Complete BrowserFarmService
- Implement DockerBrowser
- Update BrowserManager
- Test integration
7. Key Files
crawl4ai/
├── browser_farm/
│ ├── service.py # BrowserFarmService
│ ├── docker_browser.py # RemoteBrowserClient
│ └── models.py # Shared models
├── async_configs.py # BrowserConfig updates
└── async_crawler_strategy.py # BrowserManager changes
Is this outline detailed enough for starting the new session?
Unclecode
Sounds good, yes, correct. Remember my philosophy, build as little as possible but not incomplete then test it. So let's do this:
1/ Manually we create a docker, make it ready to have a browser. Already Done:
FROM python:3.10-slim
# System dependencies for Playwright/Chromium
RUN apt-get update && apt-get install -y \
wget \
gnupg \
libglib2.0-0 \
libnss3 \
libnspr4 \
libatk1.0-0 \
libatk-bridge2.0-0 \
libcups2 \
libdrm2 \
libdbus-1-3 \
libxcb1 \
libxkbcommon0 \
libx11-6 \
libx11-xcb1 \
libxcb-dri3-0 \
libxcomposite1 \
libxdamage1 \
libxext6 \
libxfixes3 \
libxrandr2 \
libgbm1 \
libpango-1.0-0 \
libcairo2 \
libasound2 \
socat \
&& rm -rf /var/lib/apt/lists/*
# Install crawl4ai and setup
RUN pip install crawl4ai
RUN crawl4ai-setup
# Add startup script
RUN echo '#!/bin/bash\n\
/root/.cache/ms-playwright/chromium-1148/chrome-linux/chrome \
--remote-debugging-port=9222 \
--no-sandbox \
--headless=new \
--disable-gpu &\n\
sleep 2\n\
socat TCP-LISTEN:9223,fork,reuseaddr TCP:127.0.0.1:9222\n' > /start.sh && \
chmod +x /start.sh
EXPOSE 9223
CMD ["/start.sh"]
I tested like this:
docker build -t crawl4ai-browser . docker run -d -p 9333:9223 crawl4ai-browser curl http://localhost:9333/json/version
2/ We create BrowserFarmService, but for now just back the host:port we created 3/ We focus on BrowserManager, and create DockerBrowser, to connect dots 4/ We run the hello word but on Docker
So we ignore: 1/ pre-filled pages 2/ running service
Ok, continue from 2 on ward
AI
Got it - let's implement this step by step.
- Simple BrowserFarmService that returns our hardcoded browser:
# crawl4ai/browser_farm/service.py
from dataclasses import dataclass
@dataclass
class BrowserEndpoint:
host: str
port: int
class BrowserFarmService:
"""MVP: Returns hardcoded browser endpoint"""
def __init__(self):
# For MVP, hardcode the values
self._browser = BrowserEndpoint(
host="localhost",
port=9333 # Mapped from Docker's 9223
)
async def get_available_browser(self) -> BrowserEndpoint:
"""Returns our single browser endpoint"""
return self._browser
async def health_check(self) -> bool:
"""Basic health check - just verify endpoint responds"""
import aiohttp
try:
async with aiohttp.ClientSession() as session:
url = f"http://{self._browser.host}:{self._browser.port}/json/version"
async with session.get(url) as response:
return response.status == 200
except:
return False
- DockerBrowser to connect BrowserManager with the service:
# crawl4ai/browser_farm/docker_browser.py
from .service import BrowserFarmService
class DockerBrowser:
"""Client to get browser endpoints from BrowserFarmService"""
def __init__(self):
self.service = BrowserFarmService()
async def get_browser_endpoint(self) -> tuple[str, int]:
"""Get host/port for the browser"""
endpoint = await self.service.get_available_browser()
if not await self.service.health_check():
raise ConnectionError("Browser endpoint not healthy")
return endpoint.host, endpoint.port
Update BrowserConfig and BrowserManager:
# crawl4ai/async_configs.py
class BrowserConfig:
def __init__(self,
# ... existing params ...
use_docker: bool = False,
**kwargs):
self.use_docker = use_docker
# ... rest of init
# In BrowserManager.__init__
def __init__(self, browser_config: BrowserConfig, logger=None):
# ... existing init code ...
# Initialize DockerBrowser if needed
self.docker_browser = None
if self.config.use_docker:
from .browser_farm.docker_browser import DockerBrowser
self.docker_browser = DockerBrowser()
# In BrowserManager.start
async def start(self):
if self.playwright is None:
from playwright.async_api import async_playwright
self.playwright = await async_playwright().start()
if self.config.use_docker:
host, port = await self.docker_browser.get_browser_endpoint()
cdp_url = f"http://{host}:{port}"
self.browser = await self.playwright.chromium.connect_over_cdp(cdp_url)
# ... rest of the connection code same as use_remote_browser
Want me to show the hello world example with these changes?