Implement initial MVP for Docker-based browser management in Crawl4ai, enabling
remote browser execution in containerized environments.
Key Changes:
- Add browser_farm module with Docker support components:
* BrowserFarmService: Manages browser endpoints
* DockerBrowser: Handles Docker browser communication
* Basic health check implementation
* Dockerfile with optimized Chrome/Playwright setup:
- Based on python:3.10-slim for minimal size
- Includes all required system dependencies
- Auto-installs crawl4ai and sets up Playwright
- Configures Chrome with remote debugging
- Uses socat for port forwarding (9223)
- Update core components:
* Rename use_managed_browser to use_remote_browser for clarity
* Modify BrowserManager to support Docker mode
* Add Docker configuration in BrowserConfig
* Update context handling for remote browsers
- Add example:
* hello_world_docker.py demonstrating Docker browser usage
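The relationship between these pieces can be sketched roughly as follows. Note that `BrowserEndpoint`, the method names, and the selection logic here are simplified stand-ins to illustrate the idea of endpoint management, not the actual API of the new `browser_farm` module:

```python
from dataclasses import dataclass, field

@dataclass
class BrowserEndpoint:
    """One remote browser reachable over CDP (host/port are illustrative)."""
    host: str
    port: int
    healthy: bool = True

    @property
    def cdp_url(self) -> str:
        # A CDP client (e.g. Playwright's connect_over_cdp) takes an
        # http:// endpoint of this shape.
        return f"http://{self.host}:{self.port}"

@dataclass
class BrowserFarmService:
    """Simplified endpoint manager: register endpoints, hand out a healthy one."""
    endpoints: list = field(default_factory=list)

    def register(self, host: str, port: int) -> BrowserEndpoint:
        ep = BrowserEndpoint(host, port)
        self.endpoints.append(ep)
        return ep

    def get_endpoint(self) -> BrowserEndpoint:
        # MVP strategy: first healthy endpoint wins; no load balancing yet.
        for ep in self.endpoints:
            if ep.healthy:
                return ep
        raise RuntimeError("no healthy browser endpoints available")

farm = BrowserFarmService()
farm.register("localhost", 9333)  # host port mapped to the container's 9223
print(farm.get_endpoint().cdp_url)  # → http://localhost:9333
```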
Technical Details:
- Docker container exposes port 9223 (mapped to host:9333)
- Uses CDP (Chrome DevTools Protocol) for remote connection
- Maintains compatibility with existing managed browser features
- Simplified endpoint management for MVP phase
- Optimized Docker setup:
* Minimal dependencies installation
* Proper Chrome flags for containerized environment
* Headless mode with GPU disabled
* Security considerations (no-sandbox mode)
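Chrome's remote-debugging interface serves browser metadata at `/json/version`, which is a natural target for the basic health check. The sketch below is an assumption about how such a check could look, not the module's actual implementation; the host/port defaults follow the 9333→9223 mapping above, and the network call is kept out of the demonstration:

```python
import json
from urllib.request import urlopen

def parse_version(payload: bytes) -> dict:
    """Extract the fields a health check cares about from CDP's /json/version."""
    info = json.loads(payload)
    return {
        "browser": info.get("Browser"),
        "ws_url": info.get("webSocketDebuggerUrl"),
        # A usable endpoint advertises a browser-level WebSocket URL.
        "healthy": "webSocketDebuggerUrl" in info,
    }

def check_endpoint(host: str = "localhost", port: int = 9333,
                   timeout: float = 3.0) -> dict:
    """Hit the CDP version endpoint exposed through the socat forward."""
    with urlopen(f"http://{host}:{port}/json/version", timeout=timeout) as resp:
        return parse_version(resp.read())

# Offline demonstration with a canned CDP response (no running container needed):
sample = (b'{"Browser": "Chrome/120.0.0.0", '
          b'"webSocketDebuggerUrl": "ws://localhost:9333/devtools/browser/abc"}')
print(parse_version(sample)["healthy"])  # → True
```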
Testing:
- Extensive Docker configuration testing and optimization
- Verified with hello_world_docker.py example
- Confirmed remote browser connection and crawling functionality
- Tested basic health checks
This is the first step towards a scalable browser farm solution, setting up
the foundation for future enhancements like resource monitoring, multiple
browser instances, and container lifecycle management.
Dockerfile:
FROM python:3.10-slim

# System dependencies for Playwright/Chromium
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    libglib2.0-0 \
    libnss3 \
    libnspr4 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libcups2 \
    libdrm2 \
    libdbus-1-3 \
    libxcb1 \
    libxkbcommon0 \
    libx11-6 \
    libx11-xcb1 \
    libxcb-dri3-0 \
    libxcomposite1 \
    libxdamage1 \
    libxext6 \
    libxfixes3 \
    libxrandr2 \
    libgbm1 \
    libpango-1.0-0 \
    libcairo2 \
    libasound2 \
    socat \
    && rm -rf /var/lib/apt/lists/*

# Install crawl4ai and setup
RUN pip install crawl4ai
RUN crawl4ai-setup

# Add startup script
RUN echo '#!/bin/bash\n\
/root/.cache/ms-playwright/chromium-1148/chrome-linux/chrome \
    --remote-debugging-port=9222 \
    --no-sandbox \
    --headless=new \
    --disable-gpu &\n\
sleep 2\n\
socat TCP-LISTEN:9223,fork,reuseaddr TCP:127.0.0.1:9222\n' > /start.sh && \
    chmod +x /start.sh

EXPOSE 9223

CMD ["/start.sh"]
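Assuming an image built from this Dockerfile, a typical build/run sequence would look like the following. The image and container names are illustrative; the port mapping matches the host:9333 → container:9223 mapping described above:

```shell
# Build the image and run it, publishing the container's forwarded CDP
# port 9223 on host port 9333 (commented out: requires a Docker daemon):
#   docker build -t crawl4ai-browser .
#   docker run -d --name crawl4ai-browser -p 9333:9223 crawl4ai-browser

# The CDP endpoint a client or health check would then target:
HOST_PORT=9333
CDP_URL="http://localhost:${HOST_PORT}/json/version"
echo "$CDP_URL"
```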