Implement initial MVP for Docker-based browser management in Crawl4ai, enabling
remote browser execution in containerized environments.
Key Changes:
- Add browser_farm module with Docker support components:
* BrowserFarmService: Manages browser endpoints
* DockerBrowser: Handles Docker browser communication
* Basic health check implementation
* Dockerfile with optimized Chrome/Playwright setup:
- Based on python:3.10-slim for minimal size
- Includes all required system dependencies
- Auto-installs crawl4ai and sets up Playwright
- Configures Chrome with remote debugging
- Uses socat for port forwarding (9223)
- Update core components:
* Rename use_managed_browser to use_remote_browser for clarity
* Modify BrowserManager to support Docker mode
* Add Docker configuration in BrowserConfig
* Update context handling for remote browsers
- Add example:
* hello_world_docker.py demonstrating Docker browser usage
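Connecting to the Docker browser starts with discovering its CDP WebSocket endpoint from Chrome's `/json/version` endpoint. The sketch below illustrates that discovery step only; the helper names and the host/port defaults are assumptions for illustration, not the actual DockerBrowser API:

```python
import json
import urllib.request
from typing import Optional


def extract_ws_url(version_payload: dict) -> Optional[str]:
    """Pull the DevTools WebSocket URL out of a /json/version response."""
    return version_payload.get("webSocketDebuggerUrl")


def discover_ws_url(host: str = "localhost", port: int = 9333,
                    timeout: float = 5.0) -> Optional[str]:
    """Query Chrome's /json/version endpoint and return the CDP WebSocket URL."""
    url = f"http://{host}:{port}/json/version"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_ws_url(json.load(resp))
```

A CDP client (e.g. Playwright's `connect_over_cdp`) can then attach to the returned `ws://...` URL.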
Technical Details:
- Docker container exposes port 9223 (mapped to host:9333)
- Uses CDP (Chrome DevTools Protocol) for remote connection
- Maintains compatibility with existing managed browser features
- Simplified endpoint management for MVP phase
- Optimized Docker setup:
* Minimal dependencies installation
* Proper Chrome flags for containerized environment
* Headless mode with GPU disabled
* Security considerations (no-sandbox mode)
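The containerized Chrome configuration above can be assembled along these lines. This is an illustrative sketch: the exact flag set in the Dockerfile may differ, and `--disable-dev-shm-usage` is a common containerization practice added here as an assumption:

```python
def chrome_launch_args(debug_port: int = 9223) -> list:
    """Build Chrome flags for a headless, containerized environment."""
    return [
        "--headless",               # no display server inside the container
        "--disable-gpu",            # GPU acceleration unavailable in-container
        "--no-sandbox",             # needed when Chrome runs as root in Docker
        "--disable-dev-shm-usage",  # avoid the small default /dev/shm
        f"--remote-debugging-port={debug_port}",
        "--remote-debugging-address=0.0.0.0",  # reachable from outside the container
    ]
```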
Testing:
- Extensive Docker configuration testing and optimization
- Verified with hello_world_docker.py example
- Confirmed remote browser connection and crawling functionality
- Tested basic health checks
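For the MVP, a basic health check can be as simple as probing the remote debugging port. The name and semantics below are assumptions, not the actual BrowserFarmService interface:

```python
import socket


def is_browser_healthy(host: str = "localhost", port: int = 9333,
                       timeout: float = 2.0) -> bool:
    """Return True if the remote browser's debugging port accepts a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```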
This is the first step towards a scalable browser farm solution, setting up
the foundation for future enhancements like resource monitoring, multiple
browser instances, and container lifecycle management.
Parameter Reference Table
| File Name | Parameter Name | Code Usage | Strategy/Class | Description |
|---|---|---|---|---|
| async_crawler_strategy.py | user_agent | kwargs.get("user_agent") | AsyncPlaywrightCrawlerStrategy | User agent string for browser identification |
| async_crawler_strategy.py | proxy | kwargs.get("proxy") | AsyncPlaywrightCrawlerStrategy | Proxy server configuration for network requests |
| async_crawler_strategy.py | proxy_config | kwargs.get("proxy_config") | AsyncPlaywrightCrawlerStrategy | Detailed proxy configuration including auth |
| async_crawler_strategy.py | headless | kwargs.get("headless", True) | AsyncPlaywrightCrawlerStrategy | Whether to run browser in headless mode |
| async_crawler_strategy.py | browser_type | kwargs.get("browser_type", "chromium") | AsyncPlaywrightCrawlerStrategy | Type of browser to use (chromium/firefox/webkit) |
| async_crawler_strategy.py | headers | kwargs.get("headers", {}) | AsyncPlaywrightCrawlerStrategy | Custom HTTP headers for requests |
| async_crawler_strategy.py | verbose | kwargs.get("verbose", False) | AsyncPlaywrightCrawlerStrategy | Enable detailed logging output |
| async_crawler_strategy.py | sleep_on_close | kwargs.get("sleep_on_close", False) | AsyncPlaywrightCrawlerStrategy | Add delay before closing browser |
| async_crawler_strategy.py | use_remote_browser | kwargs.get("use_remote_browser", False) | AsyncPlaywrightCrawlerStrategy | Connect to a remote (managed) browser instance instead of launching one locally |
| async_crawler_strategy.py | user_data_dir | kwargs.get("user_data_dir", None) | AsyncPlaywrightCrawlerStrategy | Custom directory for browser profile data |
| async_crawler_strategy.py | session_id | kwargs.get("session_id") | AsyncPlaywrightCrawlerStrategy | Unique identifier for browser session |
| async_crawler_strategy.py | override_navigator | kwargs.get("override_navigator", False) | AsyncPlaywrightCrawlerStrategy | Override browser navigator properties |
| async_crawler_strategy.py | simulate_user | kwargs.get("simulate_user", False) | AsyncPlaywrightCrawlerStrategy | Simulate human-like behavior |
| async_crawler_strategy.py | magic | kwargs.get("magic", False) | AsyncPlaywrightCrawlerStrategy | Enable advanced anti-detection features |
| async_crawler_strategy.py | log_console | kwargs.get("log_console", False) | AsyncPlaywrightCrawlerStrategy | Log browser console messages |
| async_crawler_strategy.py | js_only | kwargs.get("js_only", False) | AsyncPlaywrightCrawlerStrategy | Only execute JavaScript without page load |
| async_crawler_strategy.py | page_timeout | kwargs.get("page_timeout", 60000) | AsyncPlaywrightCrawlerStrategy | Timeout for page load in milliseconds |
| async_crawler_strategy.py | ignore_body_visibility | kwargs.get("ignore_body_visibility", True) | AsyncPlaywrightCrawlerStrategy | Process page even if body is hidden |
| async_crawler_strategy.py | js_code | kwargs.get("js_code", kwargs.get("js", self.js_code)) | AsyncPlaywrightCrawlerStrategy | Custom JavaScript code to execute |
| async_crawler_strategy.py | wait_for | kwargs.get("wait_for") | AsyncPlaywrightCrawlerStrategy | Wait for specific element/condition |
| async_crawler_strategy.py | process_iframes | kwargs.get("process_iframes", False) | AsyncPlaywrightCrawlerStrategy | Extract content from iframes |
| async_crawler_strategy.py | delay_before_return_html | kwargs.get("delay_before_return_html") | AsyncPlaywrightCrawlerStrategy | Additional delay before returning HTML |
| async_crawler_strategy.py | remove_overlay_elements | kwargs.get("remove_overlay_elements", False) | AsyncPlaywrightCrawlerStrategy | Remove pop-ups and overlay elements |
| async_crawler_strategy.py | screenshot | kwargs.get("screenshot") | AsyncPlaywrightCrawlerStrategy | Take page screenshot |
| async_crawler_strategy.py | screenshot_wait_for | kwargs.get("screenshot_wait_for") | AsyncPlaywrightCrawlerStrategy | Wait before taking screenshot |
| async_crawler_strategy.py | semaphore_count | kwargs.get("semaphore_count", 5) | AsyncPlaywrightCrawlerStrategy | Concurrent request limit |
| async_webcrawler.py | verbose | kwargs.get("verbose", False) | AsyncWebCrawler | Enable detailed logging |
| async_webcrawler.py | warmup | kwargs.get("warmup", True) | AsyncWebCrawler | Initialize crawler with warmup request |
| async_webcrawler.py | session_id | kwargs.get("session_id", None) | AsyncWebCrawler | Session identifier for browser reuse |
| async_webcrawler.py | only_text | kwargs.get("only_text", False) | AsyncWebCrawler | Extract only text content |
| async_webcrawler.py | bypass_cache | kwargs.get("bypass_cache", False) | AsyncWebCrawler | Skip cache and force fresh crawl |
| async_webcrawler.py | cache_mode | kwargs.get("cache_mode", CacheMode.ENABLE) | AsyncWebCrawler | Cache handling mode for request |