feat(browser-farm): Add Docker browser support for remote crawling

Implement initial MVP for Docker-based browser management in Crawl4ai, enabling
remote browser execution in containerized environments.

Key Changes:
- Add browser_farm module with Docker support components:
  * BrowserFarmService: Manages browser endpoints
  * DockerBrowser: Handles Docker browser communication
  * Basic health check implementation
  * Dockerfile with optimized Chrome/Playwright setup:
    - Based on python:3.10-slim for minimal size
    - Includes all required system dependencies
    - Auto-installs crawl4ai and sets up Playwright
    - Configures Chrome with remote debugging
    - Uses socat for port forwarding (9223)

- Update core components:
  * Rename use_managed_browser to use_remote_browser for clarity
  * Modify BrowserManager to support Docker mode
  * Add Docker configuration in BrowserConfig
  * Update context handling for remote browsers

- Add example:
  * hello_world_docker.py demonstrating Docker browser usage

Technical Details:
- Docker container exposes port 9223 (mapped to host:9333)
- Uses CDP (Chrome DevTools Protocol) for remote connection
- Maintains compatibility with existing managed browser features
- Simplified endpoint management for MVP phase
- Optimized Docker setup:
  * Minimal dependencies installation
  * Proper Chrome flags for containerized environment
  * Headless mode with GPU disabled
  * Security considerations (no-sandbox mode)

Testing:
- Extensive Docker configuration testing and optimization
- Verified with hello_world_docker.py example
- Confirmed remote browser connection and crawling functionality
- Tested basic health checks

This is the first step towards a scalable browser farm solution, setting up
the foundation for future enhancements like resource monitoring, multiple
browser instances, and container lifecycle management.
This commit is contained in:
UncleCode
2025-01-02 18:41:36 +08:00
parent 24b3da717a
commit 7aaaaae461
16 changed files with 1072 additions and 73 deletions

View File

@@ -10,7 +10,7 @@
| async_crawler_strategy.py | headers | `kwargs.get("headers", {})` | AsyncPlaywrightCrawlerStrategy | Custom HTTP headers for requests |
| async_crawler_strategy.py | verbose | `kwargs.get("verbose", False)` | AsyncPlaywrightCrawlerStrategy | Enable detailed logging output |
| async_crawler_strategy.py | sleep_on_close | `kwargs.get("sleep_on_close", False)` | AsyncPlaywrightCrawlerStrategy | Add delay before closing browser |
| async_crawler_strategy.py | use_managed_browser | `kwargs.get("use_managed_browser", False)` | AsyncPlaywrightCrawlerStrategy | Use managed browser instance |
| async_crawler_strategy.py | use_remote_browser | `kwargs.get("use_remote_browser", False)` | AsyncPlaywrightCrawlerStrategy | Use managed browser instance |
| async_crawler_strategy.py | user_data_dir | `kwargs.get("user_data_dir", None)` | AsyncPlaywrightCrawlerStrategy | Custom directory for browser profile data |
| async_crawler_strategy.py | session_id | `kwargs.get("session_id")` | AsyncPlaywrightCrawlerStrategy | Unique identifier for browser session |
| async_crawler_strategy.py | override_navigator | `kwargs.get("override_navigator", False)` | AsyncPlaywrightCrawlerStrategy | Override browser navigator properties |