Implement initial MVP for Docker-based browser management in Crawl4ai, enabling
remote browser execution in containerized environments.
Key Changes:
- Add browser_farm module with Docker support components:
* BrowserFarmService: Manages browser endpoints
* DockerBrowser: Handles Docker browser communication
* Basic health check implementation
* Dockerfile with optimized Chrome/Playwright setup:
- Based on python:3.10-slim for minimal size
- Includes all required system dependencies
- Auto-installs crawl4ai and sets up Playwright
- Configures Chrome with remote debugging
- Uses socat for port forwarding (9223)
- Update core components:
* Rename use_managed_browser to use_remote_browser for clarity
* Modify BrowserManager to support Docker mode
* Add Docker configuration in BrowserConfig
* Update context handling for remote browsers
- Add example:
* hello_world_docker.py demonstrating Docker browser usage
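Connecting to the Docker browser starts with discovering its CDP WebSocket endpoint from Chrome's `/json/version` endpoint. The sketch below illustrates that discovery step only; the helper names and the host/port defaults are assumptions for illustration, not the actual DockerBrowser API:

```python
import json
import urllib.request
from typing import Optional


def extract_ws_url(version_payload: dict) -> Optional[str]:
    """Pull the DevTools WebSocket URL out of a /json/version response."""
    return version_payload.get("webSocketDebuggerUrl")


def discover_ws_url(host: str = "localhost", port: int = 9333,
                    timeout: float = 5.0) -> Optional[str]:
    """Query Chrome's /json/version endpoint and return the CDP WebSocket URL."""
    url = f"http://{host}:{port}/json/version"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_ws_url(json.load(resp))
```

A CDP client (e.g. Playwright's `connect_over_cdp`) can then attach to the returned `ws://...` URL.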
Technical Details:
- Docker container exposes port 9223 (mapped to host:9333)
- Uses CDP (Chrome DevTools Protocol) for remote connection
- Maintains compatibility with existing managed browser features
- Simplified endpoint management for MVP phase
- Optimized Docker setup:
* Minimal dependencies installation
* Proper Chrome flags for containerized environment
* Headless mode with GPU disabled
* Security considerations (no-sandbox mode)
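The containerized Chrome configuration above can be assembled along these lines. This is an illustrative sketch: the exact flag set in the Dockerfile may differ, and `--disable-dev-shm-usage` is a common containerization practice added here as an assumption:

```python
def chrome_launch_args(debug_port: int = 9223) -> list:
    """Build Chrome flags for a headless, containerized environment."""
    return [
        "--headless",               # no display server inside the container
        "--disable-gpu",            # GPU acceleration unavailable in-container
        "--no-sandbox",             # needed when Chrome runs as root in Docker
        "--disable-dev-shm-usage",  # avoid the small default /dev/shm
        f"--remote-debugging-port={debug_port}",
        "--remote-debugging-address=0.0.0.0",  # reachable from outside the container
    ]
```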
Testing:
- Extensive Docker configuration testing and optimization
- Verified with hello_world_docker.py example
- Confirmed remote browser connection and crawling functionality
- Tested basic health checks
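For the MVP, a basic health check can be as simple as probing the remote debugging port. The name and semantics below are assumptions, not the actual BrowserFarmService interface:

```python
import socket


def is_browser_healthy(host: str = "localhost", port: int = 9333,
                       timeout: float = 2.0) -> bool:
    """Return True if the remote browser's debugging port accepts a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```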
This is the first step towards a scalable browser farm solution, setting up
the foundation for future enhancements like resource monitoring, multiple
browser instances, and container lifecycle management.
Parameter Reference Table
| File Name | Parameter Name | Code Usage | Strategy/Class | Description |
|---|---|---|---|---|
| async_crawler_strategy.py | user_agent | kwargs.get("user_agent") | AsyncPlaywrightCrawlerStrategy | User agent string for browser identification |
| async_crawler_strategy.py | proxy | kwargs.get("proxy") | AsyncPlaywrightCrawlerStrategy | Proxy server configuration for network requests |
| async_crawler_strategy.py | proxy_config | kwargs.get("proxy_config") | AsyncPlaywrightCrawlerStrategy | Detailed proxy configuration including auth |
| async_crawler_strategy.py | headless | kwargs.get("headless", True) | AsyncPlaywrightCrawlerStrategy | Whether to run browser in headless mode |
| async_crawler_strategy.py | browser_type | kwargs.get("browser_type", "chromium") | AsyncPlaywrightCrawlerStrategy | Type of browser to use (chromium/firefox/webkit) |
| async_crawler_strategy.py | headers | kwargs.get("headers", {}) | AsyncPlaywrightCrawlerStrategy | Custom HTTP headers for requests |
| async_crawler_strategy.py | verbose | kwargs.get("verbose", False) | AsyncPlaywrightCrawlerStrategy | Enable detailed logging output |
| async_crawler_strategy.py | sleep_on_close | kwargs.get("sleep_on_close", False) | AsyncPlaywrightCrawlerStrategy | Add delay before closing browser |
| async_crawler_strategy.py | use_remote_browser | kwargs.get("use_remote_browser", False) | AsyncPlaywrightCrawlerStrategy | Connect to a remote (managed) browser instance instead of launching one locally |
| async_crawler_strategy.py | user_data_dir | kwargs.get("user_data_dir", None) | AsyncPlaywrightCrawlerStrategy | Custom directory for browser profile data |
| async_crawler_strategy.py | session_id | kwargs.get("session_id") | AsyncPlaywrightCrawlerStrategy | Unique identifier for browser session |
| async_crawler_strategy.py | override_navigator | kwargs.get("override_navigator", False) | AsyncPlaywrightCrawlerStrategy | Override browser navigator properties |
| async_crawler_strategy.py | simulate_user | kwargs.get("simulate_user", False) | AsyncPlaywrightCrawlerStrategy | Simulate human-like behavior |
| async_crawler_strategy.py | magic | kwargs.get("magic", False) | AsyncPlaywrightCrawlerStrategy | Enable advanced anti-detection features |
| async_crawler_strategy.py | log_console | kwargs.get("log_console", False) | AsyncPlaywrightCrawlerStrategy | Log browser console messages |
| async_crawler_strategy.py | js_only | kwargs.get("js_only", False) | AsyncPlaywrightCrawlerStrategy | Only execute JavaScript without page load |
| async_crawler_strategy.py | page_timeout | kwargs.get("page_timeout", 60000) | AsyncPlaywrightCrawlerStrategy | Timeout for page load in milliseconds |
| async_crawler_strategy.py | ignore_body_visibility | kwargs.get("ignore_body_visibility", True) | AsyncPlaywrightCrawlerStrategy | Process page even if body is hidden |
| async_crawler_strategy.py | js_code | kwargs.get("js_code", kwargs.get("js", self.js_code)) | AsyncPlaywrightCrawlerStrategy | Custom JavaScript code to execute |
| async_crawler_strategy.py | wait_for | kwargs.get("wait_for") | AsyncPlaywrightCrawlerStrategy | Wait for specific element/condition |
| async_crawler_strategy.py | process_iframes | kwargs.get("process_iframes", False) | AsyncPlaywrightCrawlerStrategy | Extract content from iframes |
| async_crawler_strategy.py | delay_before_return_html | kwargs.get("delay_before_return_html") | AsyncPlaywrightCrawlerStrategy | Additional delay before returning HTML |
| async_crawler_strategy.py | remove_overlay_elements | kwargs.get("remove_overlay_elements", False) | AsyncPlaywrightCrawlerStrategy | Remove pop-ups and overlay elements |
| async_crawler_strategy.py | screenshot | kwargs.get("screenshot") | AsyncPlaywrightCrawlerStrategy | Take page screenshot |
| async_crawler_strategy.py | screenshot_wait_for | kwargs.get("screenshot_wait_for") | AsyncPlaywrightCrawlerStrategy | Wait before taking screenshot |
| async_crawler_strategy.py | semaphore_count | kwargs.get("semaphore_count", 5) | AsyncPlaywrightCrawlerStrategy | Concurrent request limit |
| async_webcrawler.py | verbose | kwargs.get("verbose", False) | AsyncWebCrawler | Enable detailed logging |
| async_webcrawler.py | warmup | kwargs.get("warmup", True) | AsyncWebCrawler | Initialize crawler with warmup request |
| async_webcrawler.py | session_id | kwargs.get("session_id", None) | AsyncWebCrawler | Session identifier for browser reuse |
| async_webcrawler.py | only_text | kwargs.get("only_text", False) | AsyncWebCrawler | Extract only text content |
| async_webcrawler.py | bypass_cache | kwargs.get("bypass_cache", False) | AsyncWebCrawler | Skip cache and force fresh crawl |
| async_webcrawler.py | cache_mode | kwargs.get("cache_mode", CacheMode.ENABLE) | AsyncWebCrawler | Cache handling mode for request |