crawl4ai/crawl4ai/browser_farm/Dockerfile
UncleCode 7aaaaae461 feat(browser-farm): Add Docker browser support for remote crawling
Implement initial MVP for Docker-based browser management in Crawl4ai, enabling
remote browser execution in containerized environments.

Key Changes:
- Add browser_farm module with Docker support components:
  * BrowserFarmService: Manages browser endpoints
  * DockerBrowser: Handles Docker browser communication
  * Basic health check implementation
  * Dockerfile with optimized Chrome/Playwright setup:
    - Based on python:3.10-slim for minimal size
    - Includes all required system dependencies
    - Auto-installs crawl4ai and sets up Playwright
    - Configures Chrome with remote debugging
    - Uses socat to forward container port 9223 to Chrome's debugging port (127.0.0.1:9222)

- Update core components:
  * Rename use_managed_browser to use_remote_browser for clarity
  * Modify BrowserManager to support Docker mode
  * Add Docker configuration in BrowserConfig
  * Update context handling for remote browsers

- Add example:
  * hello_world_docker.py demonstrating Docker browser usage

Technical Details:
- Docker container exposes port 9223, mapped to port 9333 on the host
- Uses CDP (Chrome DevTools Protocol) for remote connection
- Maintains compatibility with existing managed browser features
- Simplified endpoint management for MVP phase
- Optimized Docker setup:
  * Minimal dependencies installation
  * Proper Chrome flags for containerized environment
  * Headless mode with GPU disabled
  * Security trade-off: Chrome runs with --no-sandbox, so its process sandbox is disabled inside the container
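
As an illustration of the CDP connection described above, here is a minimal Python sketch of how a client on the host might discover the browser's WebSocket endpoint. It assumes the container is published on host port 9333 as stated in this commit message, and it relies on Chrome's standard `/json/version` discovery endpoint. The function names `cdp_version_url`, `extract_ws_endpoint`, and `fetch_ws_endpoint` are hypothetical and are not part of crawl4ai.

```python
import json
import urllib.request


def cdp_version_url(host: str = "localhost", port: int = 9333) -> str:
    """Build the URL of Chrome's /json/version discovery endpoint.

    The default port 9333 is the host-side mapping mentioned in the
    commit message; adjust it if the container is published elsewhere.
    """
    return f"http://{host}:{port}/json/version"


def extract_ws_endpoint(payload: str) -> str:
    """Return the browser-level WebSocket URL from a /json/version response body."""
    info = json.loads(payload)
    return info["webSocketDebuggerUrl"]


def fetch_ws_endpoint(host: str = "localhost", port: int = 9333,
                      timeout: float = 5.0) -> str:
    """Query the running browser and return its CDP WebSocket URL."""
    with urllib.request.urlopen(cdp_version_url(host, port), timeout=timeout) as resp:
        return extract_ws_endpoint(resp.read().decode("utf-8"))
```

With the container running, calling `fetch_ws_endpoint()` should return a `ws://` URL that a CDP client can connect to. The parsing step is kept separate from the network call so it can be checked without a live browser.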

Testing:
- Extensive Docker configuration testing and optimization
- Verified with hello_world_docker.py example
- Confirmed remote browser connection and crawling functionality
- Tested basic health checks

This is the first step towards a scalable browser farm solution, setting up
the foundation for future enhancements like resource monitoring, multiple
browser instances, and container lifecycle management.
2025-01-02 18:41:36 +08:00

FROM python:3.10-slim
# System dependencies for Playwright/Chromium
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    libglib2.0-0 \
    libnss3 \
    libnspr4 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libcups2 \
    libdrm2 \
    libdbus-1-3 \
    libxcb1 \
    libxkbcommon0 \
    libx11-6 \
    libx11-xcb1 \
    libxcb-dri3-0 \
    libxcomposite1 \
    libxdamage1 \
    libxext6 \
    libxfixes3 \
    libxrandr2 \
    libgbm1 \
    libpango-1.0-0 \
    libcairo2 \
    libasound2 \
    socat \
    && rm -rf /var/lib/apt/lists/*
# Install crawl4ai and run its setup step (sets up Playwright)
RUN pip install crawl4ai
RUN crawl4ai-setup
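# Add startup script: launch headless Chromium with remote debugging on
# port 9222, then forward container port 9223 to 127.0.0.1:9222 with socat.
# The chromium-1148 directory is tied to the installed Playwright build and
# must be updated if that build changes.
RUN echo '#!/bin/bash\n\
/root/.cache/ms-playwright/chromium-1148/chrome-linux/chrome \
    --remote-debugging-port=9222 \
    --no-sandbox \
    --headless=new \
    --disable-gpu &\n\
sleep 2\n\
socat TCP-LISTEN:9223,fork,reuseaddr TCP:127.0.0.1:9222\n' > /start.sh && \
    chmod +x /start.sh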
EXPOSE 9223
CMD ["/start.sh"]
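
The startup script waits a fixed two seconds before starting socat, so a client that connects immediately after the container starts may find the browser not yet ready. One way to handle this on the client side is to poll until the CDP HTTP endpoint answers. The Python sketch below shows that idea under stated assumptions: the helper names `http_ok` and `wait_until` are hypothetical, and the URL in the usage note assumes the host port 9333 mapping from the commit message.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 1.0) -> bool:
    """Return True if a GET on url answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses; HTTPException covers
        # malformed or truncated responses while the browser is still starting.
        return False


def wait_until(probe: Callable[[], bool], timeout: float = 10.0,
               interval: float = 0.25) -> bool:
    """Call probe repeatedly until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller could then use `wait_until(lambda: http_ok("http://localhost:9333/json/version"))` before attempting the CDP connection. An HTTP probe is used rather than a bare TCP connect because a published Docker port can accept connections before the process inside the container is listening.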