feat(docker): implement smart browser pool with 10x memory efficiency

Major refactoring to eliminate memory leaks and enable high-scale crawling: - **Smart 3-Tier Browser Pool**: - Permanent browser (always-ready default config) - Hot pool (configs used 3+ times, longer TTL) - Cold pool (new/rare configs, short TTL) - Auto-promotion: cold → hot after 3 uses - 100% pool reuse achieved in tests - **Container-Aware Memory Detection**: - Read cgroup v1/v2 memory limits (not host metrics) - Accurate memory pressure detection in Docker - Memory-based browser creation blocking - **Adaptive Janitor**: - Dynamic cleanup intervals (10s/30s/60s based on memory) - Tiered TTLs: cold 30-300s, hot 120-600s - Aggressive cleanup at high memory pressure - **Unified Pool Usage**: - All endpoints now use pool (/html, /screenshot, /pdf, /execute_js, /md, /llm) - Fixed config signature mismatch (permanent browser matches endpoints) - get_default_browser_config() helper for consistency - **Configuration**: - Reduced idle_ttl: 1800s → 300s (30min → 5min) - Fixed port: 11234 → 11235 (match Gunicorn) **Performance Results** (from stress tests): - Memory: 10x reduction (500-700MB × N → 270MB permanent) - Latency: 30-50x faster (<100ms pool hits vs 3-5s startup) - Reuse: 100% for default config, 60%+ for variants - Capacity: 100+ concurrent requests (vs ~20 before) - Leak: 0 MB/cycle (stable across tests) **Test Infrastructure**: - 7-phase sequential test suite (tests/) - Docker stats integration + log analysis - Pool promotion verification - Memory leak detection - Full endpoint coverage Fixes memory issues reported in production deployments.
2025-10-17 20:38:39 +08:00
parent 216019f29a
commit b97eaeea4c
14 changed files with 1979 additions and 118 deletions
--- a/deploy/docker/api.py
+++ b/deploy/docker/api.py
@@ -66,6 +66,7 @@ async def handle_llm_qa(
    config: dict
 ) -> str:
    """Process QA using LLM with crawled content as context."""
+    from crawler_pool import get_crawler
    try:
        if not url.startswith(('http://', 'https://')) and not url.startswith(("raw:", "raw://")):
            url = 'https://' + url
@@ -74,15 +75,21 @@ async def handle_llm_qa(
        if last_q_index != -1:
            url = url[:last_q_index]

-        # Get markdown content
-        async with AsyncWebCrawler() as crawler:
-            result = await crawler.arun(url)
-            if not result.success:
-                raise HTTPException(
-                    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-                    detail=result.error_message
-                )
-            content = result.markdown.fit_markdown or result.markdown.raw_markdown
+        # Get markdown content (use default config)
+        from utils import load_config
+        cfg = load_config()
+        browser_cfg = BrowserConfig(
+            extra_args=cfg["crawler"]["browser"].get("extra_args", []),
+            **cfg["crawler"]["browser"].get("kwargs", {}),
+        )
+        crawler = await get_crawler(browser_cfg)
+        result = await crawler.arun(url)
+        if not result.success:
+            raise HTTPException(
+                status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                detail=result.error_message
+            )
+        content = result.markdown.fit_markdown or result.markdown.raw_markdown

        # Create prompt and get LLM response
        prompt = f"""Use the following content as context to answer the question.
@@ -224,25 +231,32 @@ async def handle_markdown_request(

        cache_mode = CacheMode.ENABLED if cache == "1" else CacheMode.WRITE_ONLY

-        async with AsyncWebCrawler() as crawler:
-            result = await crawler.arun(
-                url=decoded_url,
-                config=CrawlerRunConfig(
-                    markdown_generator=md_generator,
-                    scraping_strategy=LXMLWebScrapingStrategy(),
-                    cache_mode=cache_mode
-                )
+        from crawler_pool import get_crawler
+        from utils import load_config as _load_config
+        _cfg = _load_config()
+        browser_cfg = BrowserConfig(
+            extra_args=_cfg["crawler"]["browser"].get("extra_args", []),
+            **_cfg["crawler"]["browser"].get("kwargs", {}),
+        )
+        crawler = await get_crawler(browser_cfg)
+        result = await crawler.arun(
+            url=decoded_url,
+            config=CrawlerRunConfig(
+                markdown_generator=md_generator,
+                scraping_strategy=LXMLWebScrapingStrategy(),
+                cache_mode=cache_mode
            )
-            
-            if not result.success:
-                raise HTTPException(
-                    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-                    detail=result.error_message
-                )
+        )

-            return (result.markdown.raw_markdown 
-                   if filter_type == FilterType.RAW 
-                   else result.markdown.fit_markdown)
+        if not result.success:
+            raise HTTPException(
+                status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                detail=result.error_message
+            )
+
+        return (result.markdown.raw_markdown
+               if filter_type == FilterType.RAW
+               else result.markdown.fit_markdown)

    except Exception as e:
        logger.error(f"Markdown error: {str(e)}", exc_info=True)