feat(browser-farm): Add Docker browser support for remote crawling

Implement initial MVP for Docker-based browser management in Crawl4ai,
enabling remote browser execution in containerized environments.

Key Changes:
- Add browser_farm module with Docker support components:
  * BrowserFarmService: Manages browser endpoints
  * DockerBrowser: Handles Docker browser communication
  * Basic health check implementation
  * Dockerfile with optimized Chrome/Playwright setup:
    - Based on python:3.10-slim for minimal size
    - Includes all required system dependencies
    - Auto-installs crawl4ai and sets up Playwright
    - Configures Chrome with remote debugging
    - Uses socat for port forwarding (9223)
- Update core components:
  * Rename use_managed_browser to use_remote_browser for clarity
  * Modify BrowserManager to support Docker mode
  * Add Docker configuration in BrowserConfig
  * Update context handling for remote browsers
- Add example:
  * hello_world_docker.py demonstrating Docker browser usage

Technical Details:
- Docker container exposes port 9223 (mapped to host:9333)
- Uses CDP (Chrome DevTools Protocol) for remote connection
- Maintains compatibility with existing managed browser features
- Simplified endpoint management for MVP phase
- Optimized Docker setup:
  * Minimal dependencies installation
  * Proper Chrome flags for containerized environment
  * Headless mode with GPU disabled
  * Security considerations (no-sandbox mode)

Testing:
- Extensive Docker configuration testing and optimization
- Verified with hello_world_docker.py example
- Confirmed remote browser connection and crawling functionality
- Tested basic health checks

This is the first step towards a scalable browser farm solution, setting up
the foundation for future enhancements like resource monitoring, multiple
browser instances, and container lifecycle management.
2025-01-02 18:41:36 +08:00
parent 24b3da717a
commit 7aaaaae461
16 changed files with 1072 additions and 73 deletions
--- a/docs/examples/hello_world_docker.py
+++ b/docs/examples/hello_world_docker.py
@@ -0,0 +1,31 @@
+import asyncio
+from crawl4ai import *
+
+async def main():
+    # Configure browser to use Docker
+    browser_config = BrowserConfig(
+        headless=True, 
+        verbose=True,
+        use_docker=True  # Enable Docker browser
+    )
+    
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        crawler_config = CrawlerRunConfig(
+            cache_mode=CacheMode.BYPASS,
+            markdown_generator=DefaultMarkdownGenerator(
+                content_filter=PruningContentFilter(
+                    threshold=0.48, 
+                    threshold_type="fixed",
+                    min_word_threshold=0
+                )
+            )
+        )
+        
+        result = await crawler.arun(
+            url="https://www.helloworld.org",
+            config=crawler_config
+        )
+        print(result.markdown_v2.raw_markdown[:500])
+
+if __name__ == "__main__":
+    asyncio.run(main())
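
Reviewer note: the example above only works if the container's CDP endpoint is reachable from the host. The sketch below shows one way to probe it before crawling. It is not the health check added in this commit. It assumes the host-side port 9333 from the commit message and relies on Chrome's standard `/json/version` DevTools endpoint, which reports a `webSocketDebuggerUrl` field. The function names (`cdp_version_url`, `extract_ws_endpoint`, `is_browser_healthy`) are hypothetical and do not exist in crawl4ai.

```python
import http.client
import json
import urllib.request
from typing import Optional


def cdp_version_url(host: str = "localhost", port: int = 9333) -> str:
    """Build the URL of Chrome's DevTools version endpoint.

    The default port 9333 is the host-side mapping named in the commit
    message; adjust it if the container is published differently.
    """
    return f"http://{host}:{port}/json/version"


def extract_ws_endpoint(payload: str) -> Optional[str]:
    """Return the browser-level WebSocket URL from a /json/version body.

    Returns None if the body is not valid JSON or lacks the field.
    """
    try:
        data = json.loads(payload)
    except ValueError:
        return None
    if not isinstance(data, dict):
        return None
    ws_url = data.get("webSocketDebuggerUrl")
    return ws_url if isinstance(ws_url, str) and ws_url else None


def is_browser_healthy(
    host: str = "localhost", port: int = 9333, timeout: float = 2.0
) -> bool:
    """Report whether a CDP endpoint answers with a usable WebSocket URL."""
    try:
        with urllib.request.urlopen(
            cdp_version_url(host, port), timeout=timeout
        ) as resp:
            body = resp.read().decode("utf-8")
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a malformed reply.
        return False
    return extract_ws_endpoint(body) is not None


if __name__ == "__main__":
    print("browser reachable:", is_browser_healthy())
```

Splitting the URL construction and response parsing from the network call keeps the parsing logic testable without a running container.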