Files

UncleCode 7aaaaae461 feat(browser-farm): Add Docker browser support for remote crawling

Implement initial MVP for Docker-based browser management in Crawl4ai, enabling
remote browser execution in containerized environments.

Key Changes:
- Add browser_farm module with Docker support components:
  * BrowserFarmService: Manages browser endpoints
  * DockerBrowser: Handles Docker browser communication
  * Basic health check implementation
  * Dockerfile with optimized Chrome/Playwright setup:
    - Based on python:3.10-slim for minimal size
    - Includes all required system dependencies
    - Auto-installs crawl4ai and sets up Playwright
    - Configures Chrome with remote debugging
    - Uses socat for port forwarding (9223)

- Update core components:
  * Rename use_managed_browser to use_remote_browser for clarity
  * Modify BrowserManager to support Docker mode
  * Add Docker configuration in BrowserConfig
  * Update context handling for remote browsers

- Add example:
  * hello_world_docker.py demonstrating Docker browser usage

Technical Details:
- Docker container exposes port 9223 (mapped to host:9333)
- Uses CDP (Chrome DevTools Protocol) for remote connection
- Maintains compatibility with existing managed browser features
- Simplified endpoint management for MVP phase
- Optimized Docker setup:
  * Minimal dependencies installation
  * Proper Chrome flags for containerized environment
  * Headless mode with GPU disabled
  * Security considerations (no-sandbox mode)

Testing:
- Extensive Docker configuration testing and optimization
- Verified with hello_world_docker.py example
- Confirmed remote browser connection and crawling functionality
- Tested basic health checks

This is the first step towards a scalable browser farm solution, setting up
the foundation for future enhancements like resource monitoring, multiple
browser instances, and container lifecycle management.

2025-01-02 18:41:36 +08:00

26 KiB

Raw Permalink Blame History

Unclecode:

I plan to create a Docker container that runs multiple instances of the Chromium browser. We will connect these browsers to my library remotely using Playwright for crawling. I will share part of the code that manages the Playwright, which supports connecting to a remote browser. We pass a user data directory to persist data. My ultimate goal is to have a Docker container with many browsers, allowing me to spawn multiple instances in the cloud for users to crawl at scale.

When we launch this Docker, it initially does not run any browsers. Then we will create a new module in library in charge if managing this Docker, and that will create the browsers within that Docker, enabling us to set up multiple browsers with different configurations, settings, and user data directories, or even without a user data directory. We will communicate with these browsers accordingly.

To achieve this, we need to create a class responsible for monitoring and managing the Docker. This class will create the Docker, turn it on, or build it for the first time. Later, we will convert this Docker into an image containing all the required libraries, preferably based on Ubuntu for ease of use. The Python wrapper we create will include functions to manage these multiple Docker instances.

To simplify the initial approach, we can start by imagining a single Docker container. We will create this Docker, keep it running, and then develop a Python class capable of creating a browser. We will use Crawl4ai to connect to that browser and crawl a simple website. If this is successful, we can gradually expand from there.

Before implementing any code, I would like you to rewrite what I just explained to ensure we have a common understanding. Please digest this information and present it in a professional manner, reflecting my needs and suggesting how we should proceed without jumping into coding just yet.

It's really important to me that you act as a professional software engineer in this area, understand my needs, and suggest the best way to handle them.

AI

Core Objective:

Create a scalable browser farm for distributed web crawling using Docker containers
Enable remote connection to these browsers via Playwright
Support multiple concurrent browser instances with different configurations

Architecture Components:

Docker Container Layer:

Base container running Ubuntu
No pre-running browsers at startup
Contains all necessary dependencies for Chromium/Playwright
Ability to spawn multiple browser instances on demand

Browser Management Layer:

Python class to handle Docker lifecycle (create/start/stop/rebuild)
Functions to spawn/manage browser instances within the container
Support for different browser configs (with/without user data dirs)
Connection management via Playwright's remote capabilities

Integration with Crawl4ai:

Connect library to remote browsers
Handle crawling operations through remote browser instances
Manage browser sessions and persistence

Suggested Development Phases:

Single Container MVP:

Create basic Docker setup
Implement Python management class
Test single browser spawning/connection
Validate with simple crawl operation

Multi-Browser Extension:

Add multi-browser support within container
Implement browser pool management
Add configuration profiles

Scale Out:

Multiple container support
Load balancing
Cloud deployment considerations

Key Technical Considerations:

Container resource management
Browser process isolation
Connection pooling
Error recovery/resilience
Resource cleanup

Would you like me to elaborate on any of these aspects before we move to implementation?

Unclecode

Ok, before we start ask me if any follow up questions?

AI

Sure, I have a few crucial questions to clarify the architecture:

For browser persistence, do you plan to map user data directories from the host to the container, or keep them inside the container? This affects container lifecycle management.
When spawning browsers, would you prefer:
- Pre-defined number at container startup
- Dynamic creation on-demand
- Or both options?
For browser connection management:
- Do you need connection pooling?
- Should failed browsers auto-restart?
- What's your strategy for connection timeouts/retries?
For monitoring/health checks:
- What metrics do you need?
- How do you want to handle browser crashes?
- Do you need logs from individual browser instances?
For security:
- Will browsers need different isolation levels?
- Do you need network segregation between instances?
- Any specific security requirements for the user data?

These answers will help shape the implementation details. Which should we tackle first?

Unclecode

For the first question, we should support both options. This means that when you create a new browser, we can define whether it will map to the user data directories or be created inside Docker. We can create a Docker container and, by default, map a folder from the home directory of Crawl4ai to use as the default location. When we create or pass any user data directory, we will use that folder. Therefore, when users want to create a data directory, they must save it there and then pass it.

For the second question, we need dynamic creation. When we create, we can request how many we need. One very important aspect is that this browser management layer must understand resources in a Docker environment, including available memory, CPU, and base resources, and then make decisions. If we need to create a new Docker, it should handle that as well. It's also beneficial to have a predefined number as another option for setup and creation. The key point is that our browser management leaders in this system must understand available resources and act dynamically.

We must remember that when we have multiple browsers, each browser allows us to create multiple contexts, such as pages or browser tabs, which can operate in parallel, with each one being a separate process. Essentially, we are creating a collection of multi-process applications, so it is crucial to consider the resources. This system must never crash. We should always aim to use close to 70-80% of available resources and not exceed that, while being smart about distributing all incoming requests.

We can think of this browser management layer as a proxy server, like Nginx. We are building a wrapper that handles many crawling requests and distributes them among multiple Dockers running separately, each with multiple browser instances. However, the developer experience should not be affected by this complexity. Developers should only interact with one browser. In my library, it is possible to pass a session ID, which translates to one page in a browser. When there is a session ID, it means we always use the same page repeatedly to maintain the status of the previous crawling system. This orchestration should ensure simplicity for the developer.

We have this concept of connection pooling because we create all these browsers and make them available for immediate use. Otherwise, every time we want to create a browser, we have to wait for the object to be created. So, we definitely need this connection pooling. However, as I mentioned, developers shouldn't need to be aware of this; they should just use the library as they currently do. If a browser fails, we must have an auto-restart feature. I assume we have a layer that controls health, performance, and efficiency. This layer decides to increase or decrease the number of available running processes or manage memory. It should have the authority to restart, add, reduce, or make any similar decisions. This relates to timeouts and retries. In my current library, when users crawl, they can pass many parameters. I will share those parameters with you and provide a sample of the code to ensure you understand what the library does before we implement anything.

For number four, monitoring and health checks, a few things come to mind, but I would like you to add your ideas as well. One important aspect is that if a browser crashes, we need to bring it back into the game if available resources allow. We must log and easily identify the reason, especially when debugging what happened. Every single page in this structure must have a unique locator address. This address includes the Docker, the browser, and finally the page. The combination of this unique address allows us to connect the developer's request to that particular page within this structure. When developers share a session ID, they want to ensure they access the same page during crawling sessions repeatedly. This is crucial. As I mentioned, developers don’t need to know anything about these structures; for them, it’s just a simple interface. We must monitor and regularly check that all running pages and browsers are available. Additionally, when we create a browser, we create multiple pages. This means that every browser has a certain number of pages already open and ready for use. The goal is to maintain a pool of connections to all these pages to enable instant crawling. It's important to note that we don’t just create browsers; we open pages in those browsers and then make them available in our connection tool.

For no 5, for now we skip it.

Let's revise your understanding so far. Remember we start small, then get some real result, and iteratively we add. I want to hv a minimal level that allows my user to start using Crawl4ai in this way so they no concern about browser and use library as they use in their own machine, this helps them go production easier, then itertively receiving reproted bugs, fisxng them we add all features.

AI

Let me summarize the core architecture and initial MVP based on your detailed explanation:

Core Architecture:

Browser Management Layer acts as a proxy/orchestrator
Each Docker can host multiple browsers
Each browser can have multiple pages/contexts
Resources are monitored and managed at Docker/browser/page level
Connection pooling to pre-initialize pages for instant use
Unique addressing: Docker->Browser->Page
Support for both mapped and container-internal user data directories

MVP Focus:

Single Docker Container:
- Dynamic browser creation
- Pre-initialized pages per browser
- Basic resource monitoring
- Connection pooling
- Session persistence via page reuse
Connection System:
- Unique locator system (Docker/Browser/Page)
- Session ID mapping to specific pages
- Transparent to end users of Crawl4ai
Resource Management:
- Basic monitoring of memory/CPU
- Keep resource usage under 70-80%
- Auto-restart for failed browsers
- Simple health checks

Critical Questions before starting:

What's the minimum number of pre-initialized pages per browser?
What metrics define a "healthy" browser/page?
Should the first version support mapped user directories, or start with in-container only?
What's the minimum acceptable resource headroom before scaling?

Would you like to clarify any of these points before we start with the initial implementation?

Unclecode

To answer your question, remember that when we finish this, our CLI will have a command line to activate this cluster for end users (crawl4ai-deploy). When they do that, they can pass parameters or attach a YAML file containing all the necessary parameters. For many of them, we can support a fixed number. For example, when you set pre-initialized pages, we can also have dynamic options. Dynamic means we look at the resources, consider a buffer, and then decide how many pages to create based on the available resources. That's how I like to think about it.

For number two, you can suggest a good starting metric to determine if we have a healthy page. The most important thing is that the page object is available and usable. I think something like that would work.

Regarding how to pass directories, For the first version, I'm concerned that when users use the system, they may already have a user directory they want to utilize. For example, they might open Chrome browsers remotely, log in to some websites, and want to use that data. So, I think it’s better to map the folder, by the way after installing library ~/.crawl4ai is the home directory; otherwise, we have to transfer this data, and that’s what I believe we should do.

I don't understand the last question about acceptable resource headroom.

AI

For the resource headroom question, I was asking about when we should trigger scaling decisions - like when should we stop creating new pages/browsers based on resource usage. But I see from your earlier points that we should aim to stay under 70-80% resource utilization.

Let me clarify a few more specifics before implementation:

CLI Configuration:

# Example crawl4ai-deploy.yaml
browsers:
  init_mode: "dynamic"  # or "fixed"
  fixed_count: 3        # if fixed mode
  pages_per_browser:
    mode: "dynamic"     # or "fixed"
    fixed_count: 5      # if fixed mode
  user_data_dir: "~/.crawl4ai/browser_profiles"

resources:
  max_cpu_percent: 80
  max_memory_percent: 80
  headroom_buffer: 20   # percent to keep free

health_check:
  interval: 30  # seconds
  metrics:
    - page_response_time
    - browser_memory
    - connection_status

For health checks, I propose:

Page object responds to basic Playwright commands
Browser process is running
Memory usage per browser instance
Connection can be established
Basic navigation works

Do these configurations look right? Should we adjust anything before starting the implementation?

Also, for the MVP, would you prefer we start with:

Docker setup + basic browser spawning
Or the Python management layer first?

Unclecode

Sounds good, about how to start, My brain works better when I receive immediate feedback rather than working for a long time and then testing it. A good way for me to start is to create this management layer that sets up the Docker, then creates the browser and page, or reads the YAML file to do that, and then uses Crawl4ai to perform simple crawling. I can skip the Docker creation part, but I know it's not very difficult. Alternatively, I could create a Docker in the command line and then use it, but I prefer to see this part. I hope to complete this creation process, make it available, connect it to my current library, and then run it. So, yeah.

AI

Got it - let's start with immediate feedback loop. Here's our first steps:

Create basic Browser Manager class that:

Creates/controls Docker container
Creates browsers with pages
Provides connection info for Crawl4ai

Simple YAML config reader
Quick test script using Crawl4ai to verify it works

Before coding, need to confirm:

Do you want to start with a hardcoded Docker setup first (just Python side), or include Docker file creation too?
For testing - what's a simple site we should use?
Should I see your current Playwright connection code first to ensure compatibility?

This way we get something running end-to-end quickly, then iterate on each piece.

Unclecode

Ok check: @plans/docstring.md @crawl4ai/async_crawler_strategy.py @docs/example/hello_world.py

We need a service class in charge of creation/management/communication of this Docker network (cluster node or whatever you call it), we launch it using cli, it basically will be a running service and monitoring the cluster (aka redis server or mongodb server). It keep checking, managing crashes and more... and communicate with labrary.

Then BrowserManager, when see use_docker is set ON, it uses a new class (let's call it DockerBroswer) and that call the running service to get a host/port to the selected or available browser, and create a RemoteConnector pointing to that address, from this point forward, rest is like we are using RemoteConnector (I am thinking to change the name to RemoteBrowser) :

AsyncWebCrawler -> AsyncPlaywrightCrawlerStrategy -> BrowserManager -> DockerBroswer (communicate with running service to get host/port) -> RemoteConnector

Is that clear? And please be my guest and suggest better names for these new classes

AI

Ah, now I understand better! Let me propose a clearer architecture:

# 1. Service Layer (Running as daemon)
class BrowserFarmService:  # or CrawlClusterService?
    """Manages the Docker browser farm, running as a service"""
    def __init__(self, config_path: str = "~/.crawl4ai/farm.yaml"):
        self.containers = {}  # track running containers
        self.browser_pool = {}  # available browsers
        self.resource_monitor = ResourceMonitor()
    
    async def start(self):
        """Start the service, init Docker setup"""
        pass

    async def get_available_browser(self) -> BrowserEndpoint:
        """Returns available browser endpoint or creates new one"""
        pass

    async def health_check(self):
        """Monitor containers/browsers health"""
        pass

# 2. Client Layer (Used by BrowserManager)
class DockerBrowser:  # or RemoteBrowser?
    """Client to communicate with BrowserFarmService"""
    def __init__(self, service_url: str = "http://localhost:11235"):
        self.service = BrowserFarmServiceClient(service_url)
    
    async def get_browser(self) -> tuple[str, int]:
        """Get host/port for an available browser"""
        endpoint = await self.service.get_browser()
        return endpoint.host, endpoint.port

# 3. Connect via RemoteConnector (renamed to RemoteBrowser)
class RemoteBrowser:  # formerly RemoteConnector
    """Connects to remote browser instances"""
    pass

Better names could be:

BrowserFarmService -> CrawlClusterService
DockerBrowser -> RemoteBrowserClient
RemoteConnector -> RemoteConnector

Does this align better with your vision? What parts need refinement?

Unclecode

From this point, I want to continue the conversation in another chat session. I need you to write a detailed outline of what we are supposed to do, including the major plan you suggested, all the components, and the part we are currently working on and how we want to achieve it. These details are important. I will then move to the end of the chat session, including the steps you just mentioned, and continue from there. We need to start with understanding and defining the problem. Imagine the new chat session; the large language model in that session doesn't know anything. We have to explain everything we are going to do, including the important files from the codebase, what they are, why we are doing this, and what we are trying to achieve. After that, we will discuss the design and the structures we are trying to apply. Then, we will focus on the MVP, the first stage we are currently working on. Make sure this outline is detailed and covers everything.

AI

Here's a comprehensive outline for the new session:

Overview: Crawl4ai Docker Browser Management System

1. Understanding Current System

Crawl4ai is an async web crawler using Playwright
Key components:
- AsyncWebCrawler: Main interface for users
- AsyncPlaywrightCrawlerStrategy: Manages browser automation
- BrowserManager: Handles browser lifecycle
- ManagedBrowser: Connects to browser instances
Current architecture supports local browser management
Uses user data directories for session persistence

2. Problem Statement

Users need production-ready browser management
Requirements:
- Scale browser instances across containers
- Resource monitoring and management
- Session persistence and recovery
- Simple developer experience
- Connection pooling
- Health monitoring
Must maintain current API compatibility

3. Proposed Architecture

3.1 Components

CrawlClusterService (Daemon)
- Manages Docker containers
- Monitors resources
- Handles browser lifecycle
- Provides browser endpoints
RemoteBrowserClient
- Communicates with CrawlClusterService
- Gets available browser endpoints
- Handles connection management
RemoteConnector (formerly ManagedBrowser)
- Connects to remote browser instances
- Manages browser contexts and pages

3.2 Flow

AsyncWebCrawler -> AsyncPlaywrightCrawlerStrategy -> BrowserManager 
-> RemoteBrowserClient -> CrawlClusterService -> RemoteConnector

4. MVP Implementation (Current Focus)

4.1 Phase 1 - Basic Docker Setup

Docker container with Chrome
Exposed debugging port
Basic health check endpoint
Single browser instance

4.2 Phase 2 - Service Layer (Current Step)

Simple BrowserFarmService
Returns hardcoded endpoint
Basic health monitoring
Docker browser client integration

4.3 Phase 3 - Integration

Update BrowserConfig for Docker mode
Modify BrowserManager for remote connections
Test with hello world example

5. Future Enhancements

5.1 Service Features

Multiple browser instances
Resource monitoring
Auto-scaling
Connection pooling
Pre-initialized pages

5.2 Management Features

Browser health checks
Crash recovery
Session persistence
Resource optimization

5.3 Developer Tools

CLI for cluster management
Monitoring dashboard
Configuration management
Debugging tools

6. Current Implementation Status

Docker container working
Basic endpoint structure defined
Starting service layer implementation
Need to:
1. Complete BrowserFarmService
2. Implement DockerBrowser
3. Update BrowserManager
4. Test integration

7. Key Files

crawl4ai/
├── browser_farm/
│   ├── service.py         # BrowserFarmService
│   ├── docker_browser.py  # RemoteBrowserClient
│   └── models.py         # Shared models
├── async_configs.py      # BrowserConfig updates
└── async_crawler_strategy.py  # BrowserManager changes

Is this outline detailed enough for starting the new session?

Unclecode

Sounds good, yes, correct. Remember my philosophy, build as little as possible but not incomplete then test it. So let's do this:

1/ Manually we create a docker, make it ready to have a browser. Already Done:

FROM python:3.10-slim

# System dependencies for Playwright/Chromium
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    libglib2.0-0 \
    libnss3 \
    libnspr4 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libcups2 \
    libdrm2 \
    libdbus-1-3 \
    libxcb1 \
    libxkbcommon0 \
    libx11-6 \
    libx11-xcb1 \
    libxcb-dri3-0 \
    libxcomposite1 \
    libxdamage1 \
    libxext6 \
    libxfixes3 \
    libxrandr2 \
    libgbm1 \
    libpango-1.0-0 \
    libcairo2 \
    libasound2 \
    socat \
    && rm -rf /var/lib/apt/lists/*

# Install crawl4ai and setup
RUN pip install crawl4ai
RUN crawl4ai-setup

# Add startup script
RUN echo '#!/bin/bash\n\
/root/.cache/ms-playwright/chromium-1148/chrome-linux/chrome \
  --remote-debugging-port=9222 \
  --no-sandbox \
  --headless=new \
  --disable-gpu &\n\
sleep 2\n\
socat TCP-LISTEN:9223,fork,reuseaddr TCP:127.0.0.1:9222\n' > /start.sh && \
    chmod +x /start.sh

EXPOSE 9223
CMD ["/start.sh"]

I tested like this:

docker build -t crawl4ai-browser . docker run -d -p 9333:9223 crawl4ai-browser curl http://localhost:9333/json/version

2/ We create BrowserFarmService, but for now just back the host:port we created 3/ We focus on BrowserManager, and create DockerBrowser, to connect dots 4/ We run the hello word but on Docker

So we ignore: 1/ pre-filled pages 2/ running service

Ok, continue from 2 on ward

AI

Got it - let's implement this step by step.

Simple BrowserFarmService that returns our hardcoded browser:

# crawl4ai/browser_farm/service.py
from dataclasses import dataclass

@dataclass
class BrowserEndpoint:
    host: str
    port: int

class BrowserFarmService:
    """MVP: Returns hardcoded browser endpoint"""
    def __init__(self):
        # For MVP, hardcode the values 
        self._browser = BrowserEndpoint(
            host="localhost",
            port=9333  # Mapped from Docker's 9223
        )
    
    async def get_available_browser(self) -> BrowserEndpoint:
        """Returns our single browser endpoint"""
        return self._browser

    async def health_check(self) -> bool:
        """Basic health check - just verify endpoint responds"""
        import aiohttp
        try:
            async with aiohttp.ClientSession() as session:
                url = f"http://{self._browser.host}:{self._browser.port}/json/version"
                async with session.get(url) as response:
                    return response.status == 200
        except:
            return False

DockerBrowser to connect BrowserManager with the service:

# crawl4ai/browser_farm/docker_browser.py
from .service import BrowserFarmService

class DockerBrowser:
    """Client to get browser endpoints from BrowserFarmService"""
    
    def __init__(self):
        self.service = BrowserFarmService()
    
    async def get_browser_endpoint(self) -> tuple[str, int]:
        """Get host/port for the browser"""
        endpoint = await self.service.get_available_browser()
        if not await self.service.health_check():
            raise ConnectionError("Browser endpoint not healthy")
        return endpoint.host, endpoint.port

Update BrowserConfig and BrowserManager:

# crawl4ai/async_configs.py
class BrowserConfig:
    def __init__(self, 
                 # ... existing params ...
                 use_docker: bool = False,
                 **kwargs):
        self.use_docker = use_docker
        # ... rest of init

# In BrowserManager.__init__
    def __init__(self, browser_config: BrowserConfig, logger=None):
        # ... existing init code ...
        
        # Initialize DockerBrowser if needed
        self.docker_browser = None
        if self.config.use_docker:
            from .browser_farm.docker_browser import DockerBrowser
            self.docker_browser = DockerBrowser()

# In BrowserManager.start
    async def start(self):
        if self.playwright is None:
            from playwright.async_api import async_playwright
            self.playwright = await async_playwright().start()

        if self.config.use_docker:
            host, port = await self.docker_browser.get_browser_endpoint()
            cdp_url = f"http://{host}:{port}"
            self.browser = await self.playwright.chromium.connect_over_cdp(cdp_url)
            # ... rest of the connection code same as use_remote_browser

Want me to show the hello world example with these changes?

26 KiB Raw Permalink Blame History Unescape Escape

Unclecode:

AI

Unclecode

AI

Unclecode

AI

Unclecode

AI

Unclecode

AI

Unclecode

AI

Unclecode

AI

Overview: Crawl4ai Docker Browser Management System

1. Understanding Current System

2. Problem Statement

3. Proposed Architecture

3.1 Components

3.2 Flow

4. MVP Implementation (Current Focus)

4.1 Phase 1 - Basic Docker Setup

4.2 Phase 2 - Service Layer (Current Step)

4.3 Phase 3 - Integration

5. Future Enhancements

5.1 Service Features

5.2 Management Features

5.3 Developer Tools

6. Current Implementation Status

7. Key Files

Unclecode

AI

26 KiB

Raw Permalink Blame History