feat(browser-farm): Add Docker browser support for remote crawling

Implement initial MVP for Docker-based browser management in Crawl4ai, enabling
remote browser execution in containerized environments.

Key Changes:
- Add browser_farm module with Docker support components:
  * BrowserFarmService: Manages browser endpoints
  * DockerBrowser: Handles Docker browser communication
  * Basic health check implementation
  * Dockerfile with optimized Chrome/Playwright setup:
    - Based on python:3.10-slim for minimal size
    - Includes all required system dependencies
    - Auto-installs crawl4ai and sets up Playwright
    - Configures Chrome with remote debugging
    - Uses socat for port forwarding (9223)
- Update core components:
  * Rename use_managed_browser to use_remote_browser for clarity
  * Modify BrowserManager to support Docker mode
  * Add Docker configuration in BrowserConfig
  * Update context handling for remote browsers
- Add example:
  * hello_world_docker.py demonstrating Docker browser usage

Technical Details:
- Docker container exposes port 9223 (mapped to host:9333)
- Uses CDP (Chrome DevTools Protocol) for remote connection
- Maintains compatibility with existing managed browser features
- Simplified endpoint management for MVP phase
- Optimized Docker setup:
  * Minimal dependencies installation
  * Proper Chrome flags for containerized environment
  * Headless mode with GPU disabled
  * Security considerations (no-sandbox mode)
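
The container setup listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of the actual Dockerfile: the internal Chrome debugging port (9222), the binary name `google-chrome`, the image name `crawl4ai-browser`, and all helper names are hypothetical. Only ports 9223 and 9333, headless mode, disabled GPU, no-sandbox mode, and the use of socat come from this commit message.

```python
from typing import List

CONTAINER_PORT = 9223  # port exposed by the container (from the commit message)
HOST_PORT = 9333       # host port mapped to the container port (from the commit message)
CHROME_PORT = 9222     # ASSUMPTION: port Chrome listens on inside the container


def chrome_args(binary: str = "google-chrome") -> List[str]:
    """Build a Chrome command line for a headless, containerized browser."""
    return [
        binary,                                    # ASSUMPTION: binary name
        "--headless",                              # no display inside the container
        "--disable-gpu",                           # GPU disabled, as noted above
        "--no-sandbox",                            # weakens isolation; see security note
        f"--remote-debugging-port={CHROME_PORT}",  # enables the CDP endpoint
    ]


def socat_args() -> List[str]:
    """Forward the exposed container port to Chrome's local debugging port."""
    return [
        "socat",
        f"TCP-LISTEN:{CONTAINER_PORT},fork,reuseaddr",
        f"TCP:127.0.0.1:{CHROME_PORT}",
    ]


def docker_run_args(image: str = "crawl4ai-browser") -> List[str]:
    """Publish the container port on the host (image name is hypothetical)."""
    return ["docker", "run", "-d", "-p", f"{HOST_PORT}:{CONTAINER_PORT}", image]


if __name__ == "__main__":
    # Print the three commands instead of executing them.
    for cmd in (chrome_args(), socat_args(), docker_run_args()):
        print(" ".join(cmd))
```

Keeping the commands as argument lists rather than shell strings makes them safe to pass to `subprocess.run` without quoting issues.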

Testing:
- Extensive Docker configuration testing and optimization
- Verified with hello_world_docker.py example
- Confirmed remote browser connection and crawling functionality
- Tested basic health checks
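
A basic health check of the kind mentioned above might look like the sketch below. It does not reproduce the health check implemented in this commit. It assumes the container is reachable on the host port 9333 and relies on Chrome's standard DevTools HTTP endpoint `/json/version`, which reports a `webSocketDebuggerUrl` when the browser is accepting CDP connections. The function names are hypothetical.

```python
import http.client
import json
import urllib.request
from typing import Optional


def version_url(host: str = "localhost", port: int = 9333) -> str:
    """URL of the DevTools version endpoint on the mapped host port."""
    return f"http://{host}:{port}/json/version"


def extract_ws_url(payload: bytes) -> Optional[str]:
    """Return the CDP WebSocket URL from a /json/version response, if any."""
    try:
        data = json.loads(payload)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    if not isinstance(data, dict):
        return None
    return data.get("webSocketDebuggerUrl")


def is_healthy(host: str = "localhost", port: int = 9333,
               timeout: float = 2.0) -> bool:
    """True if the remote browser answers and advertises a CDP endpoint."""
    try:
        with urllib.request.urlopen(version_url(host, port), timeout=timeout) as resp:
            return extract_ws_url(resp.read()) is not None
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed HTTP response.
        return False
```

Calling `is_healthy()` before handing the endpoint to a crawler lets a caller fail fast when the container is not yet ready.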

This is the first step towards a scalable browser farm solution, setting up
the foundation for future enhancements like resource monitoring, multiple
browser instances, and container lifecycle management.
@@ -63,7 +63,7 @@ async with AsyncWebCrawler(config=config) as crawler:
 ```
 
 ### Managed Browser
-The `ManagedBrowser` class offers a high-level abstraction for managing browser instances, emphasizing resource management, debugging capabilities, and anti-bot measures.
+The `RemoteConnector` class offers a high-level abstraction for managing browser instances, emphasizing resource management, debugging capabilities, and anti-bot measures.
 
 #### How It Works
 - **Browser Process Management**:
@@ -82,7 +82,7 @@ The `ManagedBrowser` class offers a high-level abstraction for managing browser
 - **Scalability**:
   - Handles multiple browser instances concurrently, scaling dynamically based on workload.
 
-#### Example: Using `ManagedBrowser`
+#### Example: Using `RemoteConnector`
 ```python
 from crawl4ai import AsyncWebCrawler, BrowserConfig
 