Implement initial MVP for Docker-based browser management in Crawl4ai, enabling
remote browser execution in containerized environments.
Key Changes:
- Add browser_farm module with Docker support components:
* BrowserFarmService: Manages browser endpoints
* DockerBrowser: Handles Docker browser communication
* Basic health check implementation
* Dockerfile with optimized Chrome/Playwright setup:
- Based on python:3.10-slim for minimal size
- Includes all required system dependencies
- Auto-installs crawl4ai and sets up Playwright
- Configures Chrome with remote debugging
- Uses socat for port forwarding (9223)
- Update core components:
* Rename use_managed_browser to use_remote_browser for clarity
* Modify BrowserManager to support Docker mode
* Add Docker configuration in BrowserConfig
* Update context handling for remote browsers
- Add example:
* hello_world_docker.py demonstrating Docker browser usage
Technical Details:
- Docker container exposes port 9223 (mapped to host:9333)
- Uses CDP (Chrome DevTools Protocol) for remote connection
- Maintains compatibility with existing managed browser features
- Simplified endpoint management for MVP phase
- Optimized Docker setup:
* Minimal dependencies installation
* Proper Chrome flags for containerized environment
* Headless mode with GPU disabled
* Security considerations (no-sandbox mode)
Testing:
- Extensive Docker configuration testing and optimization
- Verified with hello_world_docker.py example
- Confirmed remote browser connection and crawling functionality
- Tested basic health checks
This is the first step towards a scalable browser farm solution, setting up
the foundation for future enhancements like resource monitoring, multiple
browser instances, and container lifecycle management.
- Set wait_for_images default to false for better performance
- Simplify response attribute copying in AsyncWebCrawler
- Update hello_world example with proper content filtering
BREAKING CHANGE: Updated `chrome_channel` to "chromium" to fix compatibility with the new Chromium headless implementation. This resolves the error `playwright._impl._errors.Error: BrowserType.launch: Chromium distribution 'chrome' is not found`, caused by the removal of the old headless mode in Chromium.
With this change, channels like "chrome" and "msedge" now default to the new headless mode, aligning with upstream updates in Playwright v1.49. The new headless mode uses the real Chrome browser, offering more authenticity, reliability, and feature parity with the full browser.
Additionally, simplified fallback logic by directly assigning `chrome_channel` based on `browser_type` or defaulting to "chromium".
Refer to:
- https://playwright.dev/python/docs/browsers#chromium
- https://github.com/microsoft/playwright/issues/33566
- Replace explicit package listing with setuptools.find
- Include all crawl4ai.* packages automatically
- Use `packages = {find = {where = ["."], include = ["crawl4ai*"]}}` syntax
- Bump version to 0.4.243
This change simplifies package maintenance by automatically discovering
all subpackages under crawl4ai namespace instead of listing them manually.
- Add --force flag to Playwright browser installation
- Add doctor command to test crawling functionality
- Install Chrome and Chromium browsers explicitly
- Add crawl4ai-doctor entry point in pyproject.toml
- Implement simple health check focused on crawling test
- Add pyproject.toml for PEP 517 build system support
- Configure dependencies, scripts, and metadata in pyproject.toml
- Set Python requirement to >=3.9 and add support up to 3.13
- Keep setup.py for backwards compatibility
- Move package dependencies and entry points to pyproject.toml
- Remove setup_docs() call from post_install()
- Simplify error messages for Playwright installation failures
- Use sys.executable for more accurate Python path in error messages
- Add --with-deps flag to Playwright install command