Compare commits

..

15 Commits

Author SHA1 Message Date
UncleCode
4dfd270161 fix: #855
feat(deep-crawling): enhance dispatcher to handle multi-page crawl results

Modify MemoryAdaptiveDispatcher to properly handle results from deep crawling operations:
- Add support for processing multiple results from deep crawling
- Implement memory usage distribution across multiple results
- Update task monitoring for deep crawling scenarios
- Modify return types in deep crawling strategies to use RunManyReturn
2025-03-24 22:54:53 +08:00
UncleCode
8c08521301 feat(browser): add Docker-based browser automation strategy
Implements a new browser strategy that runs Chrome in Docker containers,
providing better isolation and cross-platform consistency. Features include:
- Connect and launch modes for different container configurations
- Persistent storage support for maintaining browser state
- Container registry for efficient reuse
- Comprehensive test suite for Docker browser functionality

This addition allows users to run browser automation workloads in isolated
containers, improving security and resource management.
2025-03-24 21:36:58 +08:00
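The container-registry idea described in this commit can be sketched roughly as follows; `DockerBrowserRegistry` and its methods are illustrative stand-ins, not the actual crawl4ai API:

```python
# Hypothetical sketch of a container registry that reuses Docker-backed
# browsers keyed by their configuration; names are illustrative only.
class DockerBrowserRegistry:
    def __init__(self):
        self._containers = {}  # config key -> container id

    def _key(self, mode: str, image: str) -> str:
        return f"{mode}:{image}"

    def acquire(self, mode: str = "launch", image: str = "chromium:latest") -> str:
        """Return an existing container for this config, or register a new one."""
        key = self._key(mode, image)
        if key not in self._containers:
            # In the real strategy this would start (launch mode) or attach to
            # (connect mode) a Docker container and wait for its CDP endpoint.
            self._containers[key] = f"container-{len(self._containers) + 1}"
        return self._containers[key]

registry = DockerBrowserRegistry()
a = registry.acquire("launch", "chromium:latest")
b = registry.acquire("launch", "chromium:latest")  # same config, container reused
```

Reusing containers this way avoids paying the container startup cost on every crawl with an identical configuration.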
UncleCode
462d5765e2 fix(browser): improve storage state persistence in CDP strategy
Enhance storage state persistence mechanism in CDP browser strategy by:
- Explicitly saving storage state for each browser context
- Using proper file path for storage state
- Removing unnecessary sleep delay

Also includes test improvements:
- Simplified test configurations in playwright tests
- Temporarily disabled some CDP tests
2025-03-23 21:06:41 +08:00
UncleCode
6eeb2e4076 feat(browser): enhance browser context creation with user data directory support and improved storage state handling 2025-03-23 19:07:13 +08:00
UncleCode
0094cac675 refactor(browser): improve parallel crawling and browser management
Remove PagePoolConfig in favor of direct page management in browser strategies.
Add get_pages() method for efficient parallel page creation.
Improve storage state handling and persistence.
Add comprehensive parallel crawling tests and performance analysis.

BREAKING CHANGE: Removed PagePoolConfig class and related functionality.
2025-03-23 18:53:24 +08:00
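A minimal sketch of the `get_pages()` idea, using a dummy coroutine in place of Playwright's `context.new_page()`:

```python
import asyncio

# Sketch: create N pages concurrently instead of one at a time. The dummy
# coroutine below stands in for a real browser round-trip.
async def new_page(i: int) -> str:
    await asyncio.sleep(0.01)  # simulate browser round-trip latency
    return f"page-{i}"

async def get_pages(count: int) -> list:
    # gather() issues all page creations at once, so total latency is roughly
    # one round-trip rather than `count` sequential round-trips.
    return await asyncio.gather(*(new_page(i) for i in range(count)))

pages = asyncio.run(get_pages(4))
```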
UncleCode
4ab0893ffb feat(browser): implement modular browser management system
Adds a new browser management system with strategy pattern implementation:
- Introduces BrowserManager class with strategy pattern support
- Adds PlaywrightBrowserStrategy, CDPBrowserStrategy, and BuiltinBrowserStrategy
- Implements BrowserProfileManager for profile management
- Adds PagePoolConfig for browser page pooling
- Includes comprehensive test suite for all browser strategies

BREAKING CHANGE: Browser management has been moved to browser/ module. Direct usage of browser_manager.py and browser_profiler.py is deprecated.
2025-03-21 22:50:00 +08:00
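The strategy pattern described above might look like this in miniature (class bodies are reduced to stubs; the real strategies manage actual browser processes):

```python
from abc import ABC, abstractmethod

# Miniature sketch of the strategy pattern in the browser/ module: the
# manager delegates to whichever strategy it was constructed with, so
# callers never branch on the browser backend themselves.
class BaseBrowserStrategy(ABC):
    @abstractmethod
    def start(self) -> str: ...

class PlaywrightBrowserStrategy(BaseBrowserStrategy):
    def start(self) -> str:
        return "playwright"

class CDPBrowserStrategy(BaseBrowserStrategy):
    def start(self) -> str:
        return "cdp"

class BrowserManager:
    def __init__(self, strategy: BaseBrowserStrategy):
        self.strategy = strategy  # swap strategies without touching callers

    def start(self) -> str:
        return self.strategy.start()
```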
UncleCode
6432ff1257 feat(browser): add builtin browser management system
Implements a persistent browser management system that allows running a single shared browser instance
that can be reused across multiple crawler sessions. Key changes include:

- Added browser_mode config option with 'builtin', 'dedicated', and 'custom' modes
- Implemented builtin browser management in BrowserProfiler
- Added CLI commands for managing builtin browser (start, stop, status, restart, view)
- Modified browser process handling to support detached processes
- Added automatic builtin browser setup during package installation

BREAKING CHANGE: The browser_mode config option changes how browser instances are managed
2025-03-20 12:13:59 +08:00
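A hedged sketch of how the `browser_mode` option could map onto the managed-browser flag, based on the modes listed above (the real logic lives inside `BrowserConfig`):

```python
from typing import Optional

# Illustrative mapping from browser_mode to managed-browser behavior:
# builtin and explicit-CDP custom modes connect over CDP, dedicated does not.
def resolve_managed_browser(browser_mode: str, cdp_url: Optional[str] = None) -> bool:
    if browser_mode == "builtin":
        return True   # connect to the shared builtin CDP endpoint
    if browser_mode == "custom" and cdp_url:
        return True   # explicit CDP settings supplied by the caller
    return False      # "dedicated": fresh browser instance each time
```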
UncleCode
5358ac0fc2 refactor: clean up imports and improve JSON schema generation instructions 2025-03-18 18:53:34 +08:00
UncleCode
a24799918c feat(llm): add additional LLM configuration parameters
Extend LLMConfig class to support more fine-grained control over LLM behavior by adding:
- temperature control
- max tokens limit
- top_p sampling
- frequency and presence penalties
- stop sequences
- number of completions

These parameters allow for better customization of LLM responses.
2025-03-14 21:36:23 +08:00
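The added knobs correspond to the standard completion-API sampling parameters. One way to assemble them, as a sketch (the helper name is illustrative, not part of crawl4ai), is to forward only the explicitly set values so provider defaults apply otherwise:

```python
from typing import Any, Dict, List, Optional

# Illustrative helper: collect only the LLM parameters the caller actually
# set, so unset knobs fall back to the provider's own defaults.
def build_completion_params(
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
    top_p: Optional[float] = None,
    frequency_penalty: Optional[float] = None,
    presence_penalty: Optional[float] = None,
    stop: Optional[List[str]] = None,
    n: Optional[int] = None,
) -> Dict[str, Any]:
    params = {
        "temperature": temperature,
        "max_tokens": max_tokens,
        "top_p": top_p,
        "frequency_penalty": frequency_penalty,
        "presence_penalty": presence_penalty,
        "stop": stop,
        "n": n,
    }
    # drop unset values so they are not sent to the provider at all
    return {k: v for k, v in params.items() if v is not None}
```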
UncleCode
a31d7b86be feat(changelog): update CHANGELOG for version 0.5.0.post5 with new features, changes, fixes, and breaking changes 2025-03-14 15:26:37 +08:00
UncleCode
7884a98be7 feat(crawler): add experimental parameters support and optimize browser handling
Add experimental parameters dictionary to CrawlerRunConfig to support beta features
Make CSP nonce headers optional via experimental config
Remove default cookie injection
Clean up browser context creation code
Improve code formatting in API handler

BREAKING CHANGE: Default cookie injection has been removed from page initialization
2025-03-14 14:39:24 +08:00
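The CSP-nonce gating can be sketched like this; the `use_csp_nonce` key matches the experimental flag referenced in this changeset, while the function itself is an illustration rather than the actual implementation:

```python
import hashlib
import os

# Sketch of gating a beta feature behind the experimental dict, as this
# commit does for CSP nonce headers.
def maybe_csp_header(experimental: dict) -> dict:
    if not experimental.get("use_csp_nonce", False):
        return {}  # feature off: send no extra headers
    nonce = hashlib.sha256(os.urandom(32)).hexdigest()
    return {
        "Content-Security-Policy":
            f"default-src 'self'; script-src 'self' 'nonce-{nonce}' 'strict-dynamic'"
    }
```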
UncleCode
6e3c048328 feat(api): refactor crawl request handling to streamline single and multiple URL processing 2025-03-13 22:30:38 +08:00
UncleCode
b750542e6d feat(crawler): optimize single URL handling and add performance comparison
Add special handling for single URL requests in Docker API to use arun() instead of arun_many()
Add new example script demonstrating performance differences between sequential and parallel crawling
Update cache mode from aggressive to bypass in examples and tests
Remove unused dependencies (zstandard, msgpack)

BREAKING CHANGE: Changed default cache_mode from aggressive to bypass in examples
2025-03-13 22:15:15 +08:00
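The single-URL fast path can be sketched as a thin router (crawler methods are stubbed here; the real ones are `AsyncWebCrawler.arun()` and `arun_many()`):

```python
import asyncio

# Stub standing in for AsyncWebCrawler, just to show the dispatch shape.
class StubCrawler:
    async def arun(self, url):
        return [f"result:{url}"]

    async def arun_many(self, urls):
        return [f"result:{u}" for u in urls]

async def crawl(crawler, urls):
    # One URL skips the multi-URL dispatcher and its queueing overhead.
    if len(urls) == 1:
        return await crawler.arun(urls[0])
    return await crawler.arun_many(urls)

results = asyncio.run(crawl(StubCrawler(), ["https://example.com"]))
```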
UncleCode
dc36997a08 feat(schema): improve HTML preprocessing for schema generation
Add new preprocess_html_for_schema utility function to better handle HTML cleaning
for schema generation. This replaces the previous optimize_html function in the
GoogleSearchCrawler and includes smarter attribute handling and pattern detection.

Other changes:
- Update default provider to gpt-4o
- Add DEFAULT_PROVIDER_API_KEY constant
- Make LLMConfig creation more flexible with create_llm_config helper
- Add new dependencies: zstandard and msgpack

This change improves schema generation reliability while reducing noise in the
processed HTML.
2025-03-12 22:40:46 +08:00
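A rough idea of what such preprocessing does, as a hedged regex sketch (the real `preprocess_html_for_schema` is smarter about attribute handling and pattern detection; treat this purely as an illustration):

```python
import re

# Regex-based sketch: drop noise (scripts, styles, comments, most attributes)
# while keeping class/id, which usually carry the selector signal.
def clean_html_for_schema(html: str) -> str:
    html = re.sub(r"<(script|style)\b[^>]*>.*?</\1>", "", html, flags=re.S | re.I)
    html = re.sub(r"<!--.*?-->", "", html, flags=re.S)

    def strip_attrs(match):
        tag, attrs = match.group(1), match.group(2)
        kept = re.findall(r'(?:class|id)="[^"]*"', attrs)
        return f"<{tag}{''.join(' ' + a for a in kept)}>"

    html = re.sub(r"<(\w+)([^>]*)>", strip_attrs, html)
    return re.sub(r"\s+", " ", html).strip()
```

Feeding an LLM this reduced markup keeps token counts down while preserving the structure the schema generator needs.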
UncleCode
1630fbdafe feat(monitor): add real-time crawler monitoring system with memory management
Implements a comprehensive monitoring and visualization system for tracking web crawler operations in real-time. The system includes:
- Terminal-based dashboard with rich UI for displaying task statuses
- Memory pressure monitoring and adaptive dispatch control
- Queue statistics and performance metrics tracking
- Detailed task progress visualization
- Stress testing framework for memory management

This addition helps operators track crawler performance and manage memory usage more effectively.
2025-03-12 19:05:24 +08:00
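The dashboard's aggregated view boils down to status counts and percentages; a small illustrative sketch of that computation:

```python
from collections import Counter

# Sketch of the aggregated-statistics view the monitor renders: a count and
# percentage per status, as in the "Crawler Status Overview" table.
def aggregate(statuses: list) -> dict:
    total = len(statuses)
    if not total:
        return {}
    counts = Counter(statuses)
    return {
        status: f"{counts.get(status, 0)} ({counts.get(status, 0) / total * 100:.1f}%)"
        for status in ("QUEUED", "IN_PROGRESS", "COMPLETED", "FAILED")
    }
```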
95 changed files with 10329 additions and 6886 deletions

.gitignore

@@ -255,3 +255,6 @@ continue_config.json
.llm.env
.private/
CLAUDE_MONITOR.md
CLAUDE.md


@@ -5,6 +5,39 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## Version 0.5.0.post5 (2025-03-14)
### Added
- *(crawler)* Add experimental parameters dictionary to CrawlerRunConfig to support beta features
- *(tables)* Add comprehensive table detection and extraction functionality with scoring system
- *(monitor)* Add real-time crawler monitoring system with memory management
- *(content)* Add target_elements parameter for selective content extraction
- *(browser)* Add standalone CDP browser launch capability
- *(schema)* Add preprocess_html_for_schema utility for better HTML cleaning
- *(api)* Add special handling for single URL requests in Docker API
### Changed
- *(filters)* Add reverse option to URLPatternFilter for inverting filter logic
- *(browser)* Make CSP nonce headers optional via experimental config
- *(browser)* Remove default cookie injection from page initialization
- *(crawler)* Optimize response handling for single-URL processing
- *(api)* Refactor crawl request handling to streamline processing
- *(config)* Update default provider to gpt-4o
- *(cache)* Change default cache_mode from aggressive to bypass in examples
### Fixed
- *(browser)* Clean up browser context creation code
- *(api)* Improve code formatting in API handler
### Breaking Changes
- WebScrapingStrategy no longer returns 'scraped_html' in its output dictionary
- Table extraction logic has been modified to better handle thead/tbody structures
- Default cookie injection has been removed from page initialization
## Version 0.5.0 (2025-03-02)
### Added


@@ -33,13 +33,12 @@ from .content_filter_strategy import (
    LLMContentFilter,
    RelevantContentFilter,
)
from .models import CrawlResult, MarkdownGenerationResult
from .models import CrawlResult, MarkdownGenerationResult, DisplayMode
from .components.crawler_monitor import CrawlerMonitor
from .async_dispatcher import (
    MemoryAdaptiveDispatcher,
    SemaphoreDispatcher,
    RateLimiter,
    CrawlerMonitor,
    DisplayMode,
    BaseDispatcher,
)
from .docker_client import Crawl4aiDockerClient


@@ -1,6 +1,7 @@
import os
from .config import (
    DEFAULT_PROVIDER,
    DEFAULT_PROVIDER_API_KEY,
    MIN_WORD_THRESHOLD,
    IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
    PROVIDER_MODELS,
@@ -27,6 +28,10 @@ from typing import Any, Dict, Optional
from enum import Enum
from .proxy_strategy import ProxyConfig
try:
from .browser.docker_config import DockerConfig
except ImportError:
DockerConfig = None
def to_serializable_dict(obj: Any, ignore_default_value : bool = False) -> Dict:
@@ -168,6 +173,12 @@ class BrowserConfig:
        Default: "chromium".
    headless (bool): Whether to run the browser in headless mode (no visible GUI).
        Default: True.
browser_mode (str): Determines how the browser should be initialized:
"builtin" - use the builtin CDP browser running in background
"dedicated" - create a new dedicated browser instance each time
"custom" - use explicit CDP settings provided in cdp_url
"docker" - run browser in Docker container with isolation
Default: "dedicated"
    use_managed_browser (bool): Launch the browser using a managed approach (e.g., via CDP), allowing
        advanced manipulation. Default: False.
    cdp_url (str): URL for the Chrome DevTools Protocol (CDP) endpoint. Default: "ws://localhost:9222/devtools/browser/".
@@ -184,6 +195,8 @@ class BrowserConfig:
        Default: None.
    proxy_config (ProxyConfig or dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
        If None, no additional proxy config. Default: None.
docker_config (DockerConfig or dict or None): Configuration for Docker-based browser automation.
Contains settings for Docker container operation. Default: None.
    viewport_width (int): Default viewport width for pages. Default: 1080.
    viewport_height (int): Default viewport height for pages. Default: 600.
    viewport (dict): Default viewport dimensions for pages. If set, overrides viewport_width and viewport_height.
@@ -194,7 +207,7 @@ class BrowserConfig:
        Default: False.
    downloads_path (str or None): Directory to store downloaded files. If None and accept_downloads is True,
        a default path will be created. Default: None.
    storage_state (str or dict or None): Path or object describing storage state (cookies, localStorage).
    storage_state (str or dict or None): An in-memory storage state (cookies, localStorage).
        Default: None.
    ignore_https_errors (bool): Ignore HTTPS certificate errors. Default: True.
    java_script_enabled (bool): Enable JavaScript execution in pages. Default: True.
@@ -220,6 +233,7 @@ class BrowserConfig:
        self,
        browser_type: str = "chromium",
        headless: bool = True,
browser_mode: str = "dedicated",
        use_managed_browser: bool = False,
        cdp_url: str = None,
        use_persistent_context: bool = False,
@@ -228,6 +242,7 @@ class BrowserConfig:
        channel: str = "chromium",
        proxy: str = None,
        proxy_config: Union[ProxyConfig, dict, None] = None,
docker_config: Union["DockerConfig", dict, None] = None,
        viewport_width: int = 1080,
        viewport_height: int = 600,
        viewport: dict = None,
@@ -256,6 +271,7 @@ class BrowserConfig:
    ):
        self.browser_type = browser_type
        self.headless = headless
self.browser_mode = browser_mode
        self.use_managed_browser = use_managed_browser
        self.cdp_url = cdp_url
        self.use_persistent_context = use_persistent_context
@@ -267,6 +283,12 @@ class BrowserConfig:
        self.chrome_channel = ""
        self.proxy = proxy
        self.proxy_config = proxy_config
# Handle docker configuration
if isinstance(docker_config, dict) and DockerConfig is not None:
self.docker_config = DockerConfig.from_kwargs(docker_config)
else:
self.docker_config = docker_config
        self.viewport_width = viewport_width
        self.viewport_height = viewport_height
        self.viewport = viewport
@@ -289,6 +311,7 @@ class BrowserConfig:
        self.sleep_on_close = sleep_on_close
        self.verbose = verbose
        self.debugging_port = debugging_port
self.host = host
        fa_user_agenr_generator = ValidUAGenerator()
        if self.user_agent_mode == "random":
@@ -301,6 +324,22 @@ class BrowserConfig:
            self.browser_hint = UAGen.generate_client_hints(self.user_agent)
            self.headers.setdefault("sec-ch-ua", self.browser_hint)
# Set appropriate browser management flags based on browser_mode
if self.browser_mode == "builtin":
# Builtin mode uses managed browser connecting to builtin CDP endpoint
self.use_managed_browser = True
# cdp_url will be set later by browser_manager
elif self.browser_mode == "docker":
# Docker mode uses managed browser with CDP to connect to browser in container
self.use_managed_browser = True
# cdp_url will be set later by docker browser strategy
elif self.browser_mode == "custom" and self.cdp_url:
# Custom mode with explicit CDP URL
self.use_managed_browser = True
elif self.browser_mode == "dedicated":
# Dedicated mode uses a new browser instance each time
pass
        # If persistent context is requested, ensure managed browser is enabled
        if self.use_persistent_context:
            self.use_managed_browser = True
@@ -310,6 +349,7 @@ class BrowserConfig:
        return BrowserConfig(
            browser_type=kwargs.get("browser_type", "chromium"),
            headless=kwargs.get("headless", True),
browser_mode=kwargs.get("browser_mode", "dedicated"),
            use_managed_browser=kwargs.get("use_managed_browser", False),
            cdp_url=kwargs.get("cdp_url"),
            use_persistent_context=kwargs.get("use_persistent_context", False),
@@ -318,6 +358,7 @@ class BrowserConfig:
            channel=kwargs.get("channel", "chromium"),
            proxy=kwargs.get("proxy"),
            proxy_config=kwargs.get("proxy_config", None),
docker_config=kwargs.get("docker_config", None),
            viewport_width=kwargs.get("viewport_width", 1080),
            viewport_height=kwargs.get("viewport_height", 600),
            accept_downloads=kwargs.get("accept_downloads", False),
@@ -337,12 +378,15 @@ class BrowserConfig:
            text_mode=kwargs.get("text_mode", False),
            light_mode=kwargs.get("light_mode", False),
            extra_args=kwargs.get("extra_args", []),
debugging_port=kwargs.get("debugging_port", 9222),
host=kwargs.get("host", "localhost"),
        )
    def to_dict(self):
        return {
        result = {
            "browser_type": self.browser_type,
            "headless": self.headless,
"browser_mode": self.browser_mode,
            "use_managed_browser": self.use_managed_browser,
            "cdp_url": self.cdp_url,
            "use_persistent_context": self.use_persistent_context,
@@ -369,7 +413,17 @@ class BrowserConfig:
            "sleep_on_close": self.sleep_on_close,
            "verbose": self.verbose,
            "debugging_port": self.debugging_port,
"host": self.host,
        }
# Include docker_config if it exists
if hasattr(self, "docker_config") and self.docker_config is not None:
if hasattr(self.docker_config, "to_dict"):
result["docker_config"] = self.docker_config.to_dict()
else:
result["docker_config"] = self.docker_config
return result
    def clone(self, **kwargs):
        """Create a copy of this configuration with updated values.
@@ -649,6 +703,12 @@ class CrawlerRunConfig():
        user_agent_generator_config (dict or None): Configuration for user agent generation if user_agent_mode is set.
            Default: None.
# Experimental Parameters
experimental (dict): Dictionary containing experimental parameters that are in beta phase.
This allows passing temporary features that are not yet fully integrated
into the main parameter set.
Default: None.
        url: str = None  # This is not a compulsory parameter
    """
@@ -731,6 +791,8 @@ class CrawlerRunConfig():
        user_agent_generator_config: dict = {},
        # Deep Crawl Parameters
        deep_crawl_strategy: Optional[DeepCrawlStrategy] = None,
# Experimental Parameters
experimental: Dict[str, Any] = None,
    ):
        # TODO: Planning to set properties dynamically based on the __init__ signature
        self.url = url
@@ -844,6 +906,9 @@ class CrawlerRunConfig():
        # Deep Crawl Parameters
        self.deep_crawl_strategy = deep_crawl_strategy
# Experimental Parameters
self.experimental = experimental or {}
    def __getattr__(self, name):
@@ -952,6 +1017,8 @@ class CrawlerRunConfig():
            # Deep Crawl Parameters
            deep_crawl_strategy=kwargs.get("deep_crawl_strategy"),
            url=kwargs.get("url"),
# Experimental Parameters
experimental=kwargs.get("experimental"),
        )
# Create a funciton returns dict of the object # Create a funciton returns dict of the object
@@ -1036,6 +1103,7 @@ class CrawlerRunConfig():
            "user_agent_generator_config": self.user_agent_generator_config,
            "deep_crawl_strategy": self.deep_crawl_strategy,
            "url": self.url,
"experimental": self.experimental,
        }

    def clone(self, **kwargs):
@@ -1071,6 +1139,13 @@ class LLMConfig:
        provider: str = DEFAULT_PROVIDER,
        api_token: Optional[str] = None,
        base_url: Optional[str] = None,
temprature: Optional[float] = None,
max_tokens: Optional[int] = None,
top_p: Optional[float] = None,
frequency_penalty: Optional[float] = None,
presence_penalty: Optional[float] = None,
stop: Optional[List[str]] = None,
n: Optional[int] = None,
    ):
        """Configuaration class for LLM provider and API token."""
        self.provider = provider
@@ -1080,10 +1155,16 @@ class LLMConfig:
            self.api_token = os.getenv(api_token[4:])
        else:
            self.api_token = PROVIDER_MODELS.get(provider, "no-token") or os.getenv(
                "OPENAI_API_KEY"
                DEFAULT_PROVIDER_API_KEY
            )
        self.base_url = base_url
self.temprature = temprature
self.max_tokens = max_tokens
self.top_p = top_p
self.frequency_penalty = frequency_penalty
self.presence_penalty = presence_penalty
self.stop = stop
self.n = n
    @staticmethod
    def from_kwargs(kwargs: dict) -> "LLMConfig":
@@ -1091,13 +1172,27 @@ class LLMConfig:
            provider=kwargs.get("provider", DEFAULT_PROVIDER),
            api_token=kwargs.get("api_token"),
            base_url=kwargs.get("base_url"),
temprature=kwargs.get("temprature"),
max_tokens=kwargs.get("max_tokens"),
top_p=kwargs.get("top_p"),
frequency_penalty=kwargs.get("frequency_penalty"),
presence_penalty=kwargs.get("presence_penalty"),
stop=kwargs.get("stop"),
n=kwargs.get("n")
        )

    def to_dict(self):
        return {
            "provider": self.provider,
            "api_token": self.api_token,
            "base_url": self.base_url
            "base_url": self.base_url,
"temprature": self.temprature,
"max_tokens": self.max_tokens,
"top_p": self.top_p,
"frequency_penalty": self.frequency_penalty,
"presence_penalty": self.presence_penalty,
"stop": self.stop,
"n": self.n
        }

    def clone(self, **kwargs):


@@ -507,10 +507,12 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        # Get page for session
        page, context = await self.browser_manager.get_page(crawlerRunConfig=config)
# await page.goto(URL)
        # Add default cookie
        await context.add_cookies(
            [{"name": "cookiesEnabled", "value": "true", "url": url}]
        )
        # await context.add_cookies(
        #     [{"name": "cookiesEnabled", "value": "true", "url": url}]
        # )
        # Handle navigator overrides
        if config.override_navigator or config.simulate_user or config.magic:
@@ -562,14 +564,15 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        try:
            # Generate a unique nonce for this request
            nonce = hashlib.sha256(os.urandom(32)).hexdigest()
            if config.experimental.get("use_csp_nonce", False):
                nonce = hashlib.sha256(os.urandom(32)).hexdigest()
            # Add CSP headers to the request
            await page.set_extra_http_headers(
                {
                    "Content-Security-Policy": f"default-src 'self'; script-src 'self' 'nonce-{nonce}' 'strict-dynamic'"
                }
            )
            response = await page.goto(
                url, wait_until=config.wait_until, timeout=config.page_timeout


@@ -4,19 +4,14 @@ import aiosqlite
import asyncio
from typing import Optional, Dict
from contextlib import asynccontextmanager
import json  # Added for serialization/deserialization
import json
from .utils import ensure_content_dirs, generate_content_hash
from .models import CrawlResult, MarkdownGenerationResult, StringCompatibleMarkdown
# , StringCompatibleMarkdown
import aiofiles
from .utils import VersionManager
from .async_logger import AsyncLogger
from .utils import get_error_context, create_box_message
# Set up logging
# logging.basicConfig(level=logging.INFO)
# logger = logging.getLogger(__name__)
# logger.setLevel(logging.INFO)
from .utils import ensure_content_dirs, generate_content_hash
from .utils import VersionManager
from .utils import get_error_context, create_box_message
base_directory = DB_PATH = os.path.join(
    os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"


@@ -1,20 +1,18 @@
from typing import Dict, Optional, List, Tuple
from typing import Dict, Optional, List, Tuple, Union
from .async_configs import CrawlerRunConfig
from .models import (
    CrawlResult,
    CrawlerTaskResult,
    CrawlStatus,
    DisplayMode,
    CrawlStats,
    DomainState,
)
from rich.live import Live
from rich.table import Table
from rich.console import Console
from rich import box
from datetime import timedelta, datetime
from .components.crawler_monitor import CrawlerMonitor
from .types import AsyncWebCrawler
from collections.abc import AsyncGenerator
import time
import psutil
import asyncio
@@ -24,8 +22,6 @@ from urllib.parse import urlparse
import random
from abc import ABC, abstractmethod
from math import inf as infinity
class RateLimiter:
    def __init__(
@@ -87,201 +83,6 @@ class RateLimiter:
        return True
class CrawlerMonitor:
def __init__(
self,
max_visible_rows: int = 15,
display_mode: DisplayMode = DisplayMode.DETAILED,
):
self.console = Console()
self.max_visible_rows = max_visible_rows
self.display_mode = display_mode
self.stats: Dict[str, CrawlStats] = {}
self.process = psutil.Process()
self.start_time = time.time()
self.live = Live(self._create_table(), refresh_per_second=2)
def start(self):
self.live.start()
def stop(self):
self.live.stop()
def add_task(self, task_id: str, url: str):
self.stats[task_id] = CrawlStats(
task_id=task_id, url=url, status=CrawlStatus.QUEUED
)
self.live.update(self._create_table())
def update_task(self, task_id: str, **kwargs):
if task_id in self.stats:
for key, value in kwargs.items():
setattr(self.stats[task_id], key, value)
self.live.update(self._create_table())
def _create_aggregated_table(self) -> Table:
"""Creates a compact table showing only aggregated statistics"""
table = Table(
box=box.ROUNDED,
title="Crawler Status Overview",
title_style="bold magenta",
header_style="bold blue",
show_lines=True,
)
# Calculate statistics
total_tasks = len(self.stats)
queued = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.QUEUED
)
in_progress = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
)
completed = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
)
failed = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
)
# Memory statistics
current_memory = self.process.memory_info().rss / (1024 * 1024)
total_task_memory = sum(stat.memory_usage for stat in self.stats.values())
peak_memory = max(
(stat.peak_memory for stat in self.stats.values()), default=0.0
)
# Duration
duration = time.time() - self.start_time
# Create status row
table.add_column("Status", style="bold cyan")
table.add_column("Count", justify="right")
table.add_column("Percentage", justify="right")
table.add_row("Total Tasks", str(total_tasks), "100%")
table.add_row(
"[yellow]In Queue[/yellow]",
str(queued),
f"{(queued / total_tasks * 100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[blue]In Progress[/blue]",
str(in_progress),
f"{(in_progress / total_tasks * 100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[green]Completed[/green]",
str(completed),
f"{(completed / total_tasks * 100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[red]Failed[/red]",
str(failed),
f"{(failed / total_tasks * 100):.1f}%" if total_tasks > 0 else "0%",
)
# Add memory information
table.add_section()
table.add_row(
"[magenta]Current Memory[/magenta]", f"{current_memory:.1f} MB", ""
)
table.add_row(
"[magenta]Total Task Memory[/magenta]", f"{total_task_memory:.1f} MB", ""
)
table.add_row(
"[magenta]Peak Task Memory[/magenta]", f"{peak_memory:.1f} MB", ""
)
table.add_row(
"[yellow]Runtime[/yellow]",
str(timedelta(seconds=int(duration))),
"",
)
return table
    def _create_detailed_table(self) -> Table:
        table = Table(
            box=box.ROUNDED,
            title="Crawler Performance Monitor",
            title_style="bold magenta",
            header_style="bold blue",
        )

        # Add columns
        table.add_column("Task ID", style="cyan", no_wrap=True)
        table.add_column("URL", style="cyan", no_wrap=True)
        table.add_column("Status", style="bold")
        table.add_column("Memory (MB)", justify="right")
        table.add_column("Peak (MB)", justify="right")
        table.add_column("Duration", justify="right")
        table.add_column("Info", style="italic")

        # Add summary row
        total_memory = sum(stat.memory_usage for stat in self.stats.values())
        active_count = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
        )
        completed_count = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
        )
        failed_count = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
        )
        table.add_row(
            "[bold yellow]SUMMARY",
            f"Total: {len(self.stats)}",
            f"Active: {active_count}",
            f"{total_memory:.1f}",
            f"{self.process.memory_info().rss / (1024 * 1024):.1f}",
            str(timedelta(seconds=int(time.time() - self.start_time))),
            f"✓{completed_count} ✗{failed_count}",
            style="bold",
        )
        table.add_section()

        # Add rows for each task
        visible_stats = sorted(
            self.stats.values(),
            key=lambda x: (
                x.status != CrawlStatus.IN_PROGRESS,
                x.status != CrawlStatus.QUEUED,
                x.end_time or float("inf"),
            ),
        )[: self.max_visible_rows]
        for stat in visible_stats:
            status_style = {
                CrawlStatus.QUEUED: "white",
                CrawlStatus.IN_PROGRESS: "yellow",
                CrawlStatus.COMPLETED: "green",
                CrawlStatus.FAILED: "red",
            }[stat.status]
            table.add_row(
                stat.task_id[:8],  # Show first 8 chars of task ID
                stat.url[:40] + "..." if len(stat.url) > 40 else stat.url,
                f"[{status_style}]{stat.status.value}[/{status_style}]",
                f"{stat.memory_usage:.1f}",
                f"{stat.peak_memory:.1f}",
                stat.duration,
                stat.error_message[:40] if stat.error_message else "",
            )
        return table

    def _create_table(self) -> Table:
        """Creates the appropriate table based on display mode"""
        if self.display_mode == DisplayMode.AGGREGATED:
            return self._create_aggregated_table()
        return self._create_detailed_table()
class BaseDispatcher(ABC):
    def __init__(
@@ -309,7 +110,7 @@ class BaseDispatcher(ABC):
    async def run_urls(
        self,
        urls: List[str],
-        crawler: "AsyncWebCrawler",  # noqa: F821
+        crawler: AsyncWebCrawler,  # noqa: F821
        config: CrawlerRunConfig,
        monitor: Optional[CrawlerMonitor] = None,
    ) -> List[CrawlerTaskResult]:
@@ -320,71 +121,189 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
    def __init__(
        self,
        memory_threshold_percent: float = 90.0,
+        critical_threshold_percent: float = 95.0,  # New critical threshold
+        recovery_threshold_percent: float = 85.0,  # New recovery threshold
        check_interval: float = 1.0,
        max_session_permit: int = 20,
-        memory_wait_timeout: float = 300.0,  # 5 minutes default timeout
+        fairness_timeout: float = 600.0,  # 10 minutes before prioritizing long-waiting URLs
        rate_limiter: Optional[RateLimiter] = None,
        monitor: Optional[CrawlerMonitor] = None,
    ):
        super().__init__(rate_limiter, monitor)
        self.memory_threshold_percent = memory_threshold_percent
+        self.critical_threshold_percent = critical_threshold_percent
+        self.recovery_threshold_percent = recovery_threshold_percent
        self.check_interval = check_interval
        self.max_session_permit = max_session_permit
-        self.memory_wait_timeout = memory_wait_timeout
-        self.result_queue = asyncio.Queue()  # Queue for storing results
+        self.fairness_timeout = fairness_timeout
+        self.result_queue = asyncio.Queue()
+        self.task_queue = asyncio.PriorityQueue()  # Priority queue for better management
+        self.memory_pressure_mode = False  # Flag to indicate when we're in memory pressure mode
+        self.current_memory_percent = 0.0  # Track current memory usage

+    async def _memory_monitor_task(self):
+        """Background task to continuously monitor memory usage and update state"""
+        while True:
+            self.current_memory_percent = psutil.virtual_memory().percent
+            # Enter memory pressure mode if we cross the threshold
+            if not self.memory_pressure_mode and self.current_memory_percent >= self.memory_threshold_percent:
+                self.memory_pressure_mode = True
+                if self.monitor:
+                    self.monitor.update_memory_status("PRESSURE")
+            # Exit memory pressure mode if we go below recovery threshold
+            elif self.memory_pressure_mode and self.current_memory_percent <= self.recovery_threshold_percent:
+                self.memory_pressure_mode = False
+                if self.monitor:
+                    self.monitor.update_memory_status("NORMAL")
+            # In critical mode, we might need to take more drastic action
+            if self.current_memory_percent >= self.critical_threshold_percent:
+                if self.monitor:
+                    self.monitor.update_memory_status("CRITICAL")
+                # We could implement additional memory-saving measures here
+            await asyncio.sleep(self.check_interval)

+    def _get_priority_score(self, wait_time: float, retry_count: int) -> float:
+        """Calculate priority score (lower is higher priority)
+
+        - URLs waiting longer than fairness_timeout get higher priority
+        - More retry attempts decreases priority
+        """
+        if wait_time > self.fairness_timeout:
+            # High priority for long-waiting URLs
+            return -wait_time
+        # Standard priority based on retries
+        return retry_count
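
The scoring rule above can be exercised on its own. A minimal standalone sketch (the `priority_score` name and the 600 s default are illustrative, mirroring the new `fairness_timeout`; this is not the dispatcher's API):

```python
def priority_score(wait_time: float, retry_count: int,
                   fairness_timeout: float = 600.0) -> float:
    """Lower score = higher priority, mirroring _get_priority_score."""
    if wait_time > fairness_timeout:
        # URLs waiting past the fairness window jump the queue;
        # the longer the wait, the more negative (higher-priority) the score.
        return -wait_time
    # Otherwise priority simply degrades with each retry.
    return retry_count

# A task starved for 700 s outranks any fresh task, regardless of retries.
fresh = priority_score(wait_time=5.0, retry_count=0)      # 0
retried = priority_score(wait_time=5.0, retry_count=3)    # 3
starved = priority_score(wait_time=700.0, retry_count=0)  # -700.0
assert starved < fresh < retried
```

Because `asyncio.PriorityQueue` pops the smallest tuple first, returning a negative wait time is enough to move starved URLs to the front without touching the normal retry-based ordering.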
    async def crawl_url(
        self,
        url: str,
        config: CrawlerRunConfig,
        task_id: str,
-    ) -> CrawlerTaskResult:
+        retry_count: int = 0,
+    ) -> Union[CrawlerTaskResult, List[CrawlerTaskResult]]:
        start_time = time.time()
        error_message = ""
        memory_usage = peak_memory = 0.0
+        # Get starting memory for accurate measurement
+        process = psutil.Process()
+        start_memory = process.memory_info().rss / (1024 * 1024)
        try:
            if self.monitor:
                self.monitor.update_task(
-                    task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
+                    task_id,
+                    status=CrawlStatus.IN_PROGRESS,
+                    start_time=start_time,
+                    retry_count=retry_count
                )
            self.concurrent_sessions += 1

            if self.rate_limiter:
                await self.rate_limiter.wait_if_needed(url)

-            process = psutil.Process()
-            start_memory = process.memory_info().rss / (1024 * 1024)
+            # Check if we're in critical memory state
+            if self.current_memory_percent >= self.critical_threshold_percent:
+                # Requeue this task with increased priority and retry count
+                enqueue_time = time.time()
+                priority = self._get_priority_score(enqueue_time - start_time, retry_count + 1)
+                await self.task_queue.put((priority, (url, task_id, retry_count + 1, enqueue_time)))
+                # Update monitoring
+                if self.monitor:
+                    self.monitor.update_task(
+                        task_id,
+                        status=CrawlStatus.QUEUED,
+                        error_message="Requeued due to critical memory pressure"
+                    )
+                # Return placeholder result with requeued status
+                return CrawlerTaskResult(
+                    task_id=task_id,
+                    url=url,
+                    result=CrawlResult(
+                        url=url, html="", metadata={"status": "requeued"},
+                        success=False, error_message="Requeued due to critical memory pressure"
+                    ),
+                    memory_usage=0,
+                    peak_memory=0,
+                    start_time=start_time,
+                    end_time=time.time(),
+                    error_message="Requeued due to critical memory pressure",
+                    retry_count=retry_count + 1
+                )

+            # Execute the crawl
            result = await self.crawler.arun(url, config=config, session_id=task_id)
+            # Measure memory usage
            end_memory = process.memory_info().rss / (1024 * 1024)
            memory_usage = peak_memory = end_memory - start_memory

-            if self.rate_limiter and result.status_code:
+            # Check if we have a container with multiple results (deep crawl result)
+            if isinstance(result, list) or (hasattr(result, '_results') and len(result._results) > 1):
+                # Handle deep crawling results - create a list of task results
+                task_results = []
+                result_list = result if isinstance(result, list) else result._results
+                for idx, single_result in enumerate(result_list):
+                    # Create individual task result for each crawled page
+                    sub_task_id = f"{task_id}_{idx}"
+                    single_memory = memory_usage / len(result_list)  # Distribute memory usage
+                    # Only update rate limiter for first result which corresponds to the original URL
+                    if idx == 0 and self.rate_limiter and hasattr(single_result, 'status_code') and single_result.status_code:
+                        if not self.rate_limiter.update_delay(url, single_result.status_code):
+                            error_msg = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
+                            if self.monitor:
+                                self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
+                    task_result = CrawlerTaskResult(
+                        task_id=sub_task_id,
+                        url=single_result.url,
+                        result=single_result,
+                        memory_usage=single_memory,
+                        peak_memory=single_memory,
+                        start_time=start_time,
+                        end_time=time.time(),
+                        error_message=single_result.error_message if not single_result.success else "",
+                        retry_count=retry_count
+                    )
+                    task_results.append(task_result)
+                # Update monitor with completion status based on the first/primary result
+                if self.monitor:
+                    primary_result = result_list[0]
+                    if not primary_result.success:
+                        self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
+                    else:
+                        self.monitor.update_task(
+                            task_id,
+                            status=CrawlStatus.COMPLETED,
+                            extra_info=f"Deep crawl: {len(result_list)} pages"
+                        )
+                return task_results

+            # Handle single result (original behavior)
+            if self.rate_limiter and hasattr(result, 'status_code') and result.status_code:
                if not self.rate_limiter.update_delay(url, result.status_code):
                    error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
                    if self.monitor:
                        self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-                    result = CrawlerTaskResult(
-                        task_id=task_id,
-                        url=url,
-                        result=result,
-                        memory_usage=memory_usage,
-                        peak_memory=peak_memory,
-                        start_time=start_time,
-                        end_time=time.time(),
-                        error_message=error_message,
-                    )
-                    await self.result_queue.put(result)
-                    return result

+            # Update status based on result
            if not result.success:
                error_message = result.error_message
                if self.monitor:
                    self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
            elif self.monitor:
                self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
        except Exception as e:
            error_message = str(e)
            if self.monitor:
@@ -392,7 +311,7 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
            result = CrawlResult(
                url=url, html="", metadata={}, success=False, error_message=str(e)
            )
        finally:
            end_time = time.time()
            if self.monitor:
@@ -402,9 +321,10 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
                    memory_usage=memory_usage,
                    peak_memory=peak_memory,
                    error_message=error_message,
+                    retry_count=retry_count
                )
            self.concurrent_sessions -= 1
            return CrawlerTaskResult(
                task_id=task_id,
                url=url,
@@ -414,116 +334,245 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
                start_time=start_time,
                end_time=end_time,
                error_message=error_message,
+                retry_count=retry_count
            )

    async def run_urls(
        self,
        urls: List[str],
-        crawler: "AsyncWebCrawler",  # noqa: F821
+        crawler: AsyncWebCrawler,
        config: CrawlerRunConfig,
    ) -> List[CrawlerTaskResult]:
        self.crawler = crawler
+        # Start the memory monitor task
+        memory_monitor = asyncio.create_task(self._memory_monitor_task())
        if self.monitor:
            self.monitor.start()
+        results = []
        try:
-            pending_tasks = []
-            active_tasks = []
-            task_queue = []
-
-            for url in urls:
-                task_id = str(uuid.uuid4())
-                if self.monitor:
-                    self.monitor.add_task(task_id, url)
-                task_queue.append((url, task_id))
-
-            while task_queue or active_tasks:
-                wait_start_time = time.time()
-                while len(active_tasks) < self.max_session_permit and task_queue:
-                    if psutil.virtual_memory().percent >= self.memory_threshold_percent:
-                        # Check if we've exceeded the timeout
-                        if time.time() - wait_start_time > self.memory_wait_timeout:
-                            raise MemoryError(
-                                f"Memory usage above threshold ({self.memory_threshold_percent}%) for more than {self.memory_wait_timeout} seconds"
-                            )
-                        await asyncio.sleep(self.check_interval)
-                        continue
-                    url, task_id = task_queue.pop(0)
-                    task = asyncio.create_task(self.crawl_url(url, config, task_id))
-                    active_tasks.append(task)
-
-                if not active_tasks:
-                    await asyncio.sleep(self.check_interval)
-                    continue
-
-                done, pending = await asyncio.wait(
-                    active_tasks, return_when=asyncio.FIRST_COMPLETED
-                )
-                pending_tasks.extend(done)
-                active_tasks = list(pending)
-
-            return await asyncio.gather(*pending_tasks)
-        finally:
-            if self.monitor:
-                self.monitor.stop()
-
-    async def run_urls_stream(
-        self,
-        urls: List[str],
-        crawler: "AsyncWebCrawler",  # noqa: F821
-        config: CrawlerRunConfig,
-    ) -> AsyncGenerator[CrawlerTaskResult, None]:
-        self.crawler = crawler
-        if self.monitor:
-            self.monitor.start()
-        try:
-            active_tasks = []
-            task_queue = []
-            completed_count = 0
-            total_urls = len(urls)
-
            # Initialize task queue
            for url in urls:
                task_id = str(uuid.uuid4())
                if self.monitor:
                    self.monitor.add_task(task_id, url)
-                task_queue.append((url, task_id))
+                # Add to queue with initial priority 0, retry count 0, and current time
+                await self.task_queue.put((0, (url, task_id, 0, time.time())))

-            while completed_count < total_urls:
-                # Start new tasks if memory permits
-                while len(active_tasks) < self.max_session_permit and task_queue:
-                    if psutil.virtual_memory().percent >= self.memory_threshold_percent:
-                        await asyncio.sleep(self.check_interval)
-                        continue
-                    url, task_id = task_queue.pop(0)
-                    task = asyncio.create_task(self.crawl_url(url, config, task_id))
-                    active_tasks.append(task)
-
-                if not active_tasks and not task_queue:
-                    break
-
-                # Wait for any task to complete and yield results
+            active_tasks = []
+            # Process until both queues are empty
+            while not self.task_queue.empty() or active_tasks:
+                # If memory pressure is low, start new tasks
+                if not self.memory_pressure_mode and len(active_tasks) < self.max_session_permit:
+                    try:
+                        # Try to get a task with timeout to avoid blocking indefinitely
+                        priority, (url, task_id, retry_count, enqueue_time) = await asyncio.wait_for(
+                            self.task_queue.get(), timeout=0.1
+                        )
+                        # Create and start the task
+                        task = asyncio.create_task(
+                            self.crawl_url(url, config, task_id, retry_count)
+                        )
+                        active_tasks.append(task)
+                        # Update waiting time in monitor
+                        if self.monitor:
+                            wait_time = time.time() - enqueue_time
+                            self.monitor.update_task(
+                                task_id,
+                                wait_time=wait_time,
+                                status=CrawlStatus.IN_PROGRESS
+                            )
+                    except asyncio.TimeoutError:
+                        # No tasks in queue, that's fine
+                        pass

+                # Wait for completion even if queue is starved
                if active_tasks:
                    done, pending = await asyncio.wait(
                        active_tasks, timeout=0.1, return_when=asyncio.FIRST_COMPLETED
                    )
+                    # Process completed tasks
                    for completed_task in done:
-                        result = await completed_task
-                        completed_count += 1
-                        yield result
+                        task_result = await completed_task
+                        # Handle both single results and lists of results
+                        if isinstance(task_result, list):
+                            results.extend(task_result)
+                        else:
+                            results.append(task_result)
+                    # Update active tasks list
                    active_tasks = list(pending)
                else:
-                    await asyncio.sleep(self.check_interval)
+                    # If no active tasks but still waiting, sleep briefly
+                    await asyncio.sleep(self.check_interval / 2)
+                # Update priorities for waiting tasks if needed
+                await self._update_queue_priorities()

+            return results
+        except Exception as e:
+            if self.monitor:
+                self.monitor.update_memory_status(f"QUEUE_ERROR: {str(e)}")
        finally:
+            # Clean up
+            memory_monitor.cancel()
            if self.monitor:
                self.monitor.stop()
+    async def _update_queue_priorities(self):
+        """Periodically update priorities of items in the queue to prevent starvation"""
+        # Skip if queue is empty
+        if self.task_queue.empty():
+            return
+
+        # Use a drain-and-refill approach to update all priorities
+        temp_items = []
+
+        # Drain the queue (with a safety timeout to prevent blocking)
+        try:
+            drain_start = time.time()
+            while not self.task_queue.empty() and time.time() - drain_start < 5.0:  # 5 second safety timeout
+                try:
+                    # Get item from queue with timeout
+                    priority, (url, task_id, retry_count, enqueue_time) = await asyncio.wait_for(
+                        self.task_queue.get(), timeout=0.1
+                    )
+                    # Calculate new priority based on current wait time
+                    current_time = time.time()
+                    wait_time = current_time - enqueue_time
+                    new_priority = self._get_priority_score(wait_time, retry_count)
+                    # Store with updated priority
+                    temp_items.append((new_priority, (url, task_id, retry_count, enqueue_time)))
+                    # Update monitoring stats for this task
+                    if self.monitor and task_id in self.monitor.stats:
+                        self.monitor.update_task(task_id, wait_time=wait_time)
+                except asyncio.TimeoutError:
+                    # Queue might be empty or very slow
+                    break
+        except Exception as e:
+            # If anything goes wrong, make sure we refill the queue with what we've got
+            if self.monitor:
+                self.monitor.update_memory_status(f"QUEUE_ERROR: {str(e)}")
+
+        # Calculate queue statistics
+        if temp_items and self.monitor:
+            total_queued = len(temp_items)
+            wait_times = [item[1][3] for item in temp_items]
+            highest_wait_time = time.time() - min(wait_times) if wait_times else 0
+            avg_wait_time = sum(time.time() - t for t in wait_times) / len(wait_times) if wait_times else 0
+            # Update queue statistics in monitor
+            self.monitor.update_queue_statistics(
+                total_queued=total_queued,
+                highest_wait_time=highest_wait_time,
+                avg_wait_time=avg_wait_time
+            )
+
+        # Sort by priority (lowest number = highest priority)
+        temp_items.sort(key=lambda x: x[0])
+        # Refill the queue with updated priorities
+        for item in temp_items:
+            await self.task_queue.put(item)
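
The drain-and-refill pass above can be reproduced on a bare `asyncio.PriorityQueue`. A self-contained sketch under illustrative names (`refresh_priorities` and the 600 s fairness window mirror the dispatcher's defaults; none of this is the library's API):

```python
import asyncio
import time

async def refresh_priorities(queue: asyncio.PriorityQueue,
                             fairness_timeout: float = 600.0) -> None:
    """Drain the queue, recompute each item's priority from its wait time,
    then refill in sorted order (lowest score = highest priority)."""
    items = []
    while not queue.empty():
        priority, (payload, enqueue_time) = queue.get_nowait()
        wait_time = time.time() - enqueue_time
        # Items past the fairness window get a negative (front-of-queue) score.
        new_priority = -wait_time if wait_time > fairness_timeout else priority
        items.append((new_priority, (payload, enqueue_time)))
    items.sort(key=lambda x: x[0])
    for item in items:
        queue.put_nowait(item)

async def demo() -> str:
    q = asyncio.PriorityQueue()
    now = time.time()
    await q.put((0, ("recent", now)))          # normal priority, just enqueued
    await q.put((1, ("starved", now - 1000)))  # waiting longer than fairness_timeout
    await refresh_priorities(q)
    _, (first, _) = await q.get()
    return first  # the starved item is served first after the refresh
```

The single-consumer drain is safe here because `get_nowait()` is only called while the loop holds the event loop; the real dispatcher additionally bounds the drain with timeouts.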
+    async def run_urls_stream(
+        self,
+        urls: List[str],
+        crawler: AsyncWebCrawler,
+        config: CrawlerRunConfig,
+    ) -> AsyncGenerator[CrawlerTaskResult, None]:
+        self.crawler = crawler
+        # Start the memory monitor task
+        memory_monitor = asyncio.create_task(self._memory_monitor_task())
+        if self.monitor:
+            self.monitor.start()
+
+        try:
+            # Initialize task queue
+            for url in urls:
+                task_id = str(uuid.uuid4())
+                if self.monitor:
+                    self.monitor.add_task(task_id, url)
+                # Add to queue with initial priority 0, retry count 0, and current time
+                await self.task_queue.put((0, (url, task_id, 0, time.time())))
+
+            active_tasks = []
+            completed_count = 0
+            total_urls = len(urls)
+
+            while completed_count < total_urls:
+                # If memory pressure is low, start new tasks
+                if not self.memory_pressure_mode and len(active_tasks) < self.max_session_permit:
+                    try:
+                        # Try to get a task with timeout
+                        priority, (url, task_id, retry_count, enqueue_time) = await asyncio.wait_for(
+                            self.task_queue.get(), timeout=0.1
+                        )
+                        # Create and start the task
+                        task = asyncio.create_task(
+                            self.crawl_url(url, config, task_id, retry_count)
+                        )
+                        active_tasks.append(task)
+                        # Update waiting time in monitor
+                        if self.monitor:
+                            wait_time = time.time() - enqueue_time
+                            self.monitor.update_task(
+                                task_id,
+                                wait_time=wait_time,
+                                status=CrawlStatus.IN_PROGRESS
+                            )
+                    except asyncio.TimeoutError:
+                        # No tasks in queue, that's fine
+                        pass
+
+                # Process completed tasks and yield results
+                if active_tasks:
+                    done, pending = await asyncio.wait(
+                        active_tasks, timeout=0.1, return_when=asyncio.FIRST_COMPLETED
+                    )
+                    for completed_task in done:
+                        result = await completed_task
+                        # Only count as completed if it wasn't requeued
+                        if "requeued" not in result.error_message:
+                            completed_count += 1
+                        yield result
+                    # Update active tasks list
+                    active_tasks = list(pending)
+                else:
+                    # If no active tasks but still waiting, sleep briefly
+                    await asyncio.sleep(self.check_interval / 2)
+                # Update priorities for waiting tasks if needed
+                await self._update_queue_priorities()
+        finally:
+            # Clean up
+            memory_monitor.cancel()
+            if self.monitor:
+                self.monitor.stop()
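
The memory monitor that both run methods start implements simple hysteresis: pressure mode switches on at `memory_threshold_percent` and only clears again below `recovery_threshold_percent`, with a separate critical band on top. The state transitions can be sketched without psutil (the 90/85/95 defaults mirror the new constructor; the class name is illustrative):

```python
class MemoryPressureGate:
    """Hysteresis gate: ON at >= threshold, OFF again only at <= recovery."""

    def __init__(self, threshold: float = 90.0, recovery: float = 85.0,
                 critical: float = 95.0):
        self.threshold = threshold
        self.recovery = recovery
        self.critical = critical
        self.pressure = False

    def update(self, memory_percent: float) -> str:
        if not self.pressure and memory_percent >= self.threshold:
            self.pressure = True
        elif self.pressure and memory_percent <= self.recovery:
            self.pressure = False
        if memory_percent >= self.critical:
            return "CRITICAL"
        return "PRESSURE" if self.pressure else "NORMAL"

gate = MemoryPressureGate()
states = [gate.update(p) for p in (80.0, 91.0, 88.0, 96.0, 84.0, 80.0)]
# 88% stays in PRESSURE (still above recovery); only dropping to 84% clears it.
```

The gap between 90% and 85% is what prevents the dispatcher from flapping between starting and pausing tasks when memory hovers right at the threshold.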
class SemaphoreDispatcher(BaseDispatcher):
    def __init__(
@@ -620,7 +669,7 @@ class SemaphoreDispatcher(BaseDispatcher):
    async def run_urls(
        self,
-        crawler: "AsyncWebCrawler",  # noqa: F821
+        crawler: AsyncWebCrawler,  # noqa: F821
        urls: List[str],
        config: CrawlerRunConfig,
    ) -> List[CrawlerTaskResult]:
@@ -644,4 +693,4 @@ class SemaphoreDispatcher(BaseDispatcher):
            return await asyncio.gather(*tasks, return_exceptions=True)
        finally:
            if self.monitor:
                self.monitor.stop()


@@ -13,11 +13,10 @@ from contextlib import asynccontextmanager
from .models import CrawlResult, MarkdownGenerationResult, DispatchResult, ScrapingResult
from .async_database import async_db_manager
from .chunking_strategy import *  # noqa: F403
-from .chunking_strategy import RegexChunking, ChunkingStrategy, IdentityChunking
+from .chunking_strategy import IdentityChunking
from .content_filter_strategy import *  # noqa: F403
-from .content_filter_strategy import RelevantContentFilter
from .extraction_strategy import *  # noqa: F403
-from .extraction_strategy import NoExtractionStrategy, ExtractionStrategy
+from .extraction_strategy import NoExtractionStrategy
from .async_crawler_strategy import (
    AsyncCrawlerStrategy,
    AsyncPlaywrightCrawlerStrategy,
@@ -34,7 +33,6 @@ from .async_configs import BrowserConfig, CrawlerRunConfig
from .async_dispatcher import *  # noqa: F403
from .async_dispatcher import BaseDispatcher, MemoryAdaptiveDispatcher, RateLimiter
-from .config import MIN_WORD_THRESHOLD
from .utils import (
    sanitize_input_encode,
    InvalidCSSSelectorError,
@@ -203,13 +201,35 @@ class AsyncWebCrawler:
        This is equivalent to using 'async with' but gives more control over the lifecycle.

        This method will:
-        1. Initialize the browser and context
-        2. Perform warmup sequence
-        3. Return the crawler instance for method chaining
+        1. Check for builtin browser if browser_mode is 'builtin'
+        2. Initialize the browser and context
+        3. Perform warmup sequence
+        4. Return the crawler instance for method chaining

        Returns:
            AsyncWebCrawler: The initialized crawler instance
        """
+        # Check for builtin browser if requested
+        if self.browser_config.browser_mode == "builtin" and not self.browser_config.cdp_url:
+            # Import here to avoid circular imports
+            from .browser_profiler import BrowserProfiler
+            profiler = BrowserProfiler(logger=self.logger)
+            # Get builtin browser info or launch if needed
+            browser_info = profiler.get_builtin_browser_info()
+            if not browser_info:
+                self.logger.info("Builtin browser not found, launching new instance...", tag="BROWSER")
+                cdp_url = await profiler.launch_builtin_browser()
+                if not cdp_url:
+                    self.logger.warning("Failed to launch builtin browser, falling back to dedicated browser", tag="BROWSER")
+                else:
+                    self.browser_config.cdp_url = cdp_url
+                    self.browser_config.use_managed_browser = True
+            else:
+                self.logger.info(f"Using existing builtin browser at {browser_info.get('cdp_url')}", tag="BROWSER")
+                self.browser_config.cdp_url = browser_info.get('cdp_url')
+                self.browser_config.use_managed_browser = True
+
        await self.crawler_strategy.__aenter__()
        await self.awarmup()
        return self
@@ -282,6 +302,10 @@ class AsyncWebCrawler:
        Returns:
            CrawlResult: The result of crawling and processing
        """
+        # Auto-start if not ready
+        if not self.ready:
+            await self.start()
+
        config = config or CrawlerRunConfig()
        if not isinstance(url, str) or not url:
            raise ValueError("Invalid URL, make sure the URL is a non-empty string")
@@ -649,18 +673,6 @@ class AsyncWebCrawler:
        urls: List[str],
        config: Optional[CrawlerRunConfig] = None,
        dispatcher: Optional[BaseDispatcher] = None,
-        # Legacy parameters maintained for backwards compatibility
-        # word_count_threshold=MIN_WORD_THRESHOLD,
-        # extraction_strategy: ExtractionStrategy = None,
-        # chunking_strategy: ChunkingStrategy = RegexChunking(),
-        # content_filter: RelevantContentFilter = None,
-        # cache_mode: Optional[CacheMode] = None,
-        # bypass_cache: bool = False,
-        # css_selector: str = None,
-        # screenshot: bool = False,
-        # pdf: bool = False,
-        # user_agent: str = None,
-        # verbose=True,
        **kwargs
    ) -> RunManyReturn:
        """
@@ -694,20 +706,7 @@ class AsyncWebCrawler:
            print(f"Processed {result.url}: {len(result.markdown)} chars")
        """
        config = config or CrawlerRunConfig()
-        # if config is None:
-        #     config = CrawlerRunConfig(
-        #         word_count_threshold=word_count_threshold,
-        #         extraction_strategy=extraction_strategy,
-        #         chunking_strategy=chunking_strategy,
-        #         content_filter=content_filter,
-        #         cache_mode=cache_mode,
-        #         bypass_cache=bypass_cache,
-        #         css_selector=css_selector,
-        #         screenshot=screenshot,
-        #         pdf=pdf,
-        #         verbose=verbose,
-        #         **kwargs,
-        #     )
        if dispatcher is None:
            dispatcher = MemoryAdaptiveDispatcher(
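
The "auto-start if not ready" guard added to `arun()` is a small lazy-initialization pattern worth isolating: explicit `start()` stays available, but the first call does it for you. A toy sketch of the same idea (class and method names here are illustrative, not the crawler's API):

```python
import asyncio

class LazyStarter:
    """Minimal sketch of the 'auto-start on first use' pattern."""

    def __init__(self):
        self.ready = False
        self.start_calls = 0

    async def start(self):
        # Expensive one-time setup would live here (browser launch, warmup...)
        self.start_calls += 1
        self.ready = True

    async def run(self, job: str) -> str:
        # Auto-start if not ready, so callers may skip explicit start()
        if not self.ready:
            await self.start()
        return f"ran {job}"

async def demo() -> int:
    c = LazyStarter()
    await c.run("a")
    await c.run("b")
    return c.start_calls  # setup fired exactly once
```

The guard makes `start()` idempotent from the caller's point of view while keeping the explicit-lifecycle path (`await crawler.start()` before many `arun()` calls) unchanged.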


@@ -0,0 +1,10 @@
"""Browser management module for Crawl4AI.

This module provides browser management capabilities using different strategies
for browser creation and interaction.
"""

from .manager import BrowserManager
from .profiles import BrowserProfileManager

__all__ = ['BrowserManager', 'BrowserProfileManager']


@@ -0,0 +1,61 @@
FROM ubuntu:22.04
# Install dependencies with comprehensive Chromium support
RUN apt-get update && apt-get install -y --no-install-recommends \
wget \
gnupg \
ca-certificates \
fonts-liberation \
# Sound support
libasound2 \
# Accessibility support
libatspi2.0-0 \
libatk1.0-0 \
libatk-bridge2.0-0 \
# Graphics and rendering
libdrm2 \
libgbm1 \
libgtk-3-0 \
libxcomposite1 \
libxdamage1 \
libxext6 \
libxfixes3 \
libxrandr2 \
# X11 and window system
libx11-6 \
libxcb1 \
libxkbcommon0 \
# Text and internationalization
libpango-1.0-0 \
libcairo2 \
# Printing support
libcups2 \
# System libraries
libdbus-1-3 \
libnss3 \
libnspr4 \
libglib2.0-0 \
# Utilities
xdg-utils \
socat \
# Process management
procps \
# Clean up
&& rm -rf /var/lib/apt/lists/*
# Install Chrome
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - && \
echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list && \
apt-get update && \
apt-get install -y google-chrome-stable && \
rm -rf /var/lib/apt/lists/*
# Create data directory for user data
RUN mkdir -p /data && chmod 777 /data
# Add a startup script
COPY start.sh /start.sh
RUN chmod +x /start.sh
# Set entrypoint
ENTRYPOINT ["/start.sh"]


@@ -0,0 +1,57 @@
FROM ubuntu:22.04
# Install dependencies with comprehensive Chromium support
RUN apt-get update && apt-get install -y --no-install-recommends \
wget \
gnupg \
ca-certificates \
fonts-liberation \
# Sound support
libasound2 \
# Accessibility support
libatspi2.0-0 \
libatk1.0-0 \
libatk-bridge2.0-0 \
# Graphics and rendering
libdrm2 \
libgbm1 \
libgtk-3-0 \
libxcomposite1 \
libxdamage1 \
libxext6 \
libxfixes3 \
libxrandr2 \
# X11 and window system
libx11-6 \
libxcb1 \
libxkbcommon0 \
# Text and internationalization
libpango-1.0-0 \
libcairo2 \
# Printing support
libcups2 \
# System libraries
libdbus-1-3 \
libnss3 \
libnspr4 \
libglib2.0-0 \
# Utilities
xdg-utils \
socat \
# Process management
procps \
# Clean up
&& rm -rf /var/lib/apt/lists/*
# Install Chrome
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - && \
echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list && \
apt-get update && \
apt-get install -y google-chrome-stable && \
rm -rf /var/lib/apt/lists/*
# Create data directory for user data
RUN mkdir -p /data && chmod 777 /data
# Keep container running without starting Chrome
CMD ["tail", "-f", "/dev/null"]


@@ -0,0 +1,133 @@
"""Docker configuration module for Crawl4AI browser automation.
This module provides configuration classes for Docker-based browser automation,
allowing flexible configuration of Docker containers for browsing.
"""
from typing import Dict, List, Optional, Union
class DockerConfig:
"""Configuration for Docker-based browser automation.
This class contains Docker-specific settings to avoid cluttering BrowserConfig.
Attributes:
mode (str): Docker operation mode - "connect" or "launch".
- "connect": Uses a container with Chrome already running
- "launch": Dynamically configures and starts Chrome in container
image (str): Docker image to use. If None, defaults from DockerUtils are used.
registry_file (str): Path to container registry file for persistence.
persistent (bool): Keep container running after browser closes.
remove_on_exit (bool): Remove container on exit when not persistent.
network (str): Docker network to use.
volumes (List[str]): Volume mappings (e.g., ["host_path:container_path"]).
env_vars (Dict[str, str]): Environment variables to set in container.
extra_args (List[str]): Additional docker run arguments.
host_port (int): Host port to map to container's 9223 port.
user_data_dir (str): Path to user data directory on host.
container_user_data_dir (str): Path to user data directory in container.
"""
def __init__(
self,
mode: str = "connect", # "connect" or "launch"
image: Optional[str] = None, # Docker image to use
registry_file: Optional[str] = None, # Path to registry file
persistent: bool = False, # Keep container running after browser closes
remove_on_exit: bool = True, # Remove container on exit when not persistent
network: Optional[str] = None, # Docker network to use
volumes: List[str] = None, # Volume mappings
env_vars: Dict[str, str] = None, # Environment variables
extra_args: List[str] = None, # Additional docker run arguments
host_port: Optional[int] = None, # Host port to map to container's 9223
user_data_dir: Optional[str] = None, # Path to user data directory on host
container_user_data_dir: str = "/data", # Path to user data directory in container
):
"""Initialize Docker configuration.
Args:
mode: Docker operation mode ("connect" or "launch")
image: Docker image to use
registry_file: Path to container registry file
persistent: Whether to keep container running after browser closes
remove_on_exit: Whether to remove container on exit when not persistent
network: Docker network to use
volumes: Volume mappings as list of strings
env_vars: Environment variables as dictionary
extra_args: Additional docker run arguments
host_port: Host port to map to container's 9223
user_data_dir: Path to user data directory on host
container_user_data_dir: Path to user data directory in container
"""
self.mode = mode
self.image = image # If None, defaults will be used from DockerUtils
self.registry_file = registry_file
self.persistent = persistent
self.remove_on_exit = remove_on_exit
self.network = network
self.volumes = volumes or []
self.env_vars = env_vars or {}
self.extra_args = extra_args or []
self.host_port = host_port
self.user_data_dir = user_data_dir
self.container_user_data_dir = container_user_data_dir
def to_dict(self) -> Dict:
"""Convert this configuration to a dictionary.
Returns:
Dictionary representation of this configuration
"""
return {
"mode": self.mode,
"image": self.image,
"registry_file": self.registry_file,
"persistent": self.persistent,
"remove_on_exit": self.remove_on_exit,
"network": self.network,
"volumes": self.volumes,
"env_vars": self.env_vars,
"extra_args": self.extra_args,
"host_port": self.host_port,
"user_data_dir": self.user_data_dir,
"container_user_data_dir": self.container_user_data_dir
}
@staticmethod
def from_kwargs(kwargs: Dict) -> "DockerConfig":
"""Create a DockerConfig from a dictionary of keyword arguments.
Args:
kwargs: Dictionary of configuration options
Returns:
New DockerConfig instance
"""
return DockerConfig(
mode=kwargs.get("mode", "connect"),
image=kwargs.get("image"),
registry_file=kwargs.get("registry_file"),
persistent=kwargs.get("persistent", False),
remove_on_exit=kwargs.get("remove_on_exit", True),
network=kwargs.get("network"),
volumes=kwargs.get("volumes"),
env_vars=kwargs.get("env_vars"),
extra_args=kwargs.get("extra_args"),
host_port=kwargs.get("host_port"),
user_data_dir=kwargs.get("user_data_dir"),
container_user_data_dir=kwargs.get("container_user_data_dir", "/data")
)
def clone(self, **kwargs) -> "DockerConfig":
"""Create a copy of this configuration with updated values.
Args:
**kwargs: Key-value pairs of configuration options to update
Returns:
DockerConfig: A new instance with the specified updates
"""
config_dict = self.to_dict()
config_dict.update(kwargs)
return DockerConfig.from_kwargs(config_dict)


@@ -0,0 +1,174 @@
"""Docker registry module for Crawl4AI.
This module provides a registry system for tracking and reusing Docker containers
across browser sessions, improving performance and resource utilization.
"""
import os
import json
import time
from typing import Dict, Optional
from ..utils import get_home_folder
class DockerRegistry:
"""Manages a registry of Docker containers used for browser automation.
This registry tracks containers by configuration hash, allowing reuse of appropriately
configured containers instead of creating new ones for each session.
Attributes:
registry_file (str): Path to the registry file
containers (dict): Dictionary of container information
port_map (dict): Map of host ports to container IDs
last_port (int): Last port assigned
"""
def __init__(self, registry_file: Optional[str] = None):
"""Initialize the registry with an optional path to the registry file.
Args:
registry_file: Path to the registry file. If None, uses default path.
"""
self.registry_file = registry_file or os.path.join(get_home_folder(), "docker_browser_registry.json")
self.containers = {}
self.port_map = {}
self.last_port = 9222
self.load()
def load(self):
"""Load container registry from file."""
if os.path.exists(self.registry_file):
try:
with open(self.registry_file, 'r') as f:
registry_data = json.load(f)
self.containers = registry_data.get("containers", {})
self.port_map = registry_data.get("ports", {})
self.last_port = registry_data.get("last_port", 9222)
except Exception:
# Reset to defaults on error
self.containers = {}
self.port_map = {}
self.last_port = 9222
else:
# Initialize with defaults if file doesn't exist
self.containers = {}
self.port_map = {}
self.last_port = 9222
def save(self):
"""Save container registry to file."""
os.makedirs(os.path.dirname(self.registry_file), exist_ok=True)
with open(self.registry_file, 'w') as f:
json.dump({
"containers": self.containers,
"ports": self.port_map,
"last_port": self.last_port
}, f, indent=2)
def register_container(self, container_id: str, host_port: int, config_hash: str):
"""Register a container with its configuration hash and port mapping.
Args:
container_id: Docker container ID
host_port: Host port mapped to container
config_hash: Hash of configuration used to create container
"""
self.containers[container_id] = {
"host_port": host_port,
"config_hash": config_hash,
"created_at": time.time()
}
self.port_map[str(host_port)] = container_id
self.save()
def unregister_container(self, container_id: str):
"""Unregister a container.
Args:
container_id: Docker container ID to unregister
"""
if container_id in self.containers:
host_port = self.containers[container_id]["host_port"]
if str(host_port) in self.port_map:
del self.port_map[str(host_port)]
del self.containers[container_id]
self.save()
def find_container_by_config(self, config_hash: str, docker_utils) -> Optional[str]:
"""Find a container that matches the given configuration hash.
Args:
config_hash: Hash of configuration to match
docker_utils: DockerUtils instance to check running containers
Returns:
Container ID if found, None otherwise
"""
for container_id, data in self.containers.items():
if data["config_hash"] == config_hash and docker_utils.is_container_running(container_id):
return container_id
return None
def get_container_host_port(self, container_id: str) -> Optional[int]:
"""Get the host port mapped to the container.
Args:
container_id: Docker container ID
Returns:
Host port if container is registered, None otherwise
"""
if container_id in self.containers:
return self.containers[container_id]["host_port"]
return None
def get_next_available_port(self, docker_utils) -> int:
"""Get the next available host port for Docker mapping.
Args:
docker_utils: DockerUtils instance to check port availability
Returns:
Available port number
"""
# Start from last port + 1
port = self.last_port + 1
# Check if port is in use (either in our registry or system-wide)
while port in self.port_map or docker_utils.is_port_in_use(port):
port += 1
# Update last port
self.last_port = port
self.save()
return port
def get_container_config_hash(self, container_id: str) -> Optional[str]:
"""Get the configuration hash for a container.
Args:
container_id: Docker container ID
Returns:
Configuration hash if container is registered, None otherwise
"""
if container_id in self.containers:
return self.containers[container_id]["config_hash"]
return None
def cleanup_stale_containers(self, docker_utils):
"""Clean up containers that are no longer running.
Args:
docker_utils: DockerUtils instance to check container status
"""
to_remove = []
for container_id in self.containers:
if not docker_utils.is_container_running(container_id):
to_remove.append(container_id)
for container_id in to_remove:
self.unregister_container(container_id)
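The registry is ultimately a JSON file keyed by container ID, with a reverse map from host port to container. A condensed sketch of the persistence round-trip (field names mirror the code above; `MiniRegistry` itself is illustrative):

```python
import json
import os
import tempfile
import time


class MiniRegistry:
    """Stripped-down version of the register/save/load cycle."""

    def __init__(self, path: str):
        self.path = path
        self.containers = {}
        self.port_map = {}
        if os.path.exists(path):
            with open(path) as f:
                data = json.load(f)
            self.containers = data.get("containers", {})
            self.port_map = data.get("ports", {})

    def register(self, container_id: str, host_port: int, config_hash: str):
        self.containers[container_id] = {
            "host_port": host_port,
            "config_hash": config_hash,
            "created_at": time.time(),
        }
        # JSON object keys must be strings, hence str(host_port).
        self.port_map[str(host_port)] = container_id
        with open(self.path, "w") as f:
            json.dump({"containers": self.containers, "ports": self.port_map}, f)


path = os.path.join(tempfile.mkdtemp(), "registry.json")
MiniRegistry(path).register("abc123", 9224, "deadbeef")
reloaded = MiniRegistry(path)  # a fresh instance sees the persisted entry
```

Saving on every mutation (as the real class does) means a crashed process never loses track of a container it already paid to start.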


@@ -0,0 +1,286 @@
"""Docker browser strategy module for Crawl4AI.
This module provides browser strategies for running browsers in Docker containers,
which offers better isolation, consistency across platforms, and easy scaling.
"""
import os
import uuid
import asyncio
from typing import Dict, List, Optional, Tuple, Union
from pathlib import Path
from playwright.async_api import Page, BrowserContext
from ..async_logger import AsyncLogger
from ..async_configs import BrowserConfig, CrawlerRunConfig
from .docker_config import DockerConfig
from .docker_registry import DockerRegistry
from .docker_utils import DockerUtils
from .strategies import BuiltinBrowserStrategy
class DockerBrowserStrategy(BuiltinBrowserStrategy):
"""Docker-based browser strategy.
Extends the BuiltinBrowserStrategy to run browsers in Docker containers.
Supports two modes:
1. "connect" - Uses a Docker image with Chrome already running
2. "launch" - Starts Chrome within the container with custom settings
Attributes:
docker_config: Docker-specific configuration options
container_id: ID of current Docker container
container_name: Name assigned to the container
registry: Registry for tracking and reusing containers
docker_utils: Utilities for Docker operations
chrome_process_id: Process ID of Chrome within container
socat_process_id: Process ID of socat within container
internal_cdp_port: Chrome's internal CDP port
internal_mapped_port: Port that socat maps to internally
"""
def __init__(self, config: BrowserConfig, logger: Optional[AsyncLogger] = None):
"""Initialize the Docker browser strategy.
Args:
config: Browser configuration including Docker-specific settings
logger: Logger for recording events and errors
"""
super().__init__(config, logger)
# Initialize Docker-specific attributes
self.docker_config = self.config.docker_config or DockerConfig()
self.container_id = None
self.container_name = f"crawl4ai-browser-{uuid.uuid4().hex[:8]}"
self.registry = DockerRegistry(self.docker_config.registry_file)
self.docker_utils = DockerUtils(logger)
self.chrome_process_id = None
self.socat_process_id = None
self.internal_cdp_port = 9222 # Chrome's internal CDP port
self.internal_mapped_port = 9223 # Port that socat maps to internally
self.shutting_down = False
async def _generate_config_hash(self) -> str:
"""Generate a hash of the configuration for container matching.
Returns:
Hash string uniquely identifying this configuration
"""
# Create a dict with the relevant parts of the config
config_dict = {
"image": self.docker_config.image,
"mode": self.docker_config.mode,
"browser_type": self.config.browser_type,
"headless": self.config.headless,
}
# Add browser-specific config if in launch mode
if self.docker_config.mode == "launch":
config_dict.update({
"text_mode": self.config.text_mode,
"light_mode": self.config.light_mode,
"viewport_width": self.config.viewport_width,
"viewport_height": self.config.viewport_height,
})
# Use the utility method to generate the hash
return self.docker_utils.generate_config_hash(config_dict)
async def _get_or_create_cdp_url(self) -> str:
"""Get CDP URL by either creating a new container or using an existing one.
Returns:
CDP URL for connecting to the browser
Raises:
Exception: If container creation or browser launch fails
"""
# If CDP URL is explicitly provided, use it
if self.config.cdp_url:
return self.config.cdp_url
# Ensure Docker image exists (will build if needed)
image_name = await self.docker_utils.ensure_docker_image_exists(
self.docker_config.image,
self.docker_config.mode
)
# Generate config hash for container matching
config_hash = await self._generate_config_hash()
# Look for existing container with matching config
container_id = self.registry.find_container_by_config(config_hash, self.docker_utils)
if container_id:
# Use existing container
self.container_id = container_id
host_port = self.registry.get_container_host_port(container_id)
if self.logger:
self.logger.info(f"Using existing Docker container: {container_id[:12]}", tag="DOCKER")
else:
# Get a port for the new container
host_port = self.docker_config.host_port or self.registry.get_next_available_port(self.docker_utils)
# Prepare volumes list
volumes = list(self.docker_config.volumes)
# Add user data directory if specified
if self.docker_config.user_data_dir:
# Ensure user data directory exists
os.makedirs(self.docker_config.user_data_dir, exist_ok=True)
volumes.append(f"{self.docker_config.user_data_dir}:{self.docker_config.container_user_data_dir}")
# Update config user_data_dir to point to container path
self.config.user_data_dir = self.docker_config.container_user_data_dir
# Create a new container
container_id = await self.docker_utils.create_container(
image_name=image_name,
host_port=host_port,
container_name=self.container_name,
volumes=volumes,
network=self.docker_config.network,
env_vars=self.docker_config.env_vars,
extra_args=self.docker_config.extra_args
)
if not container_id:
raise Exception("Failed to create Docker container")
self.container_id = container_id
# Register the container
self.registry.register_container(container_id, host_port, config_hash)
# Wait for container to be ready
await self.docker_utils.wait_for_container_ready(container_id)
# Handle specific setup based on mode
if self.docker_config.mode == "launch":
# In launch mode, we need to start socat and Chrome
await self.docker_utils.start_socat_in_container(container_id)
# Build browser arguments
browser_args = self._build_browser_args()
# Launch Chrome
await self.docker_utils.launch_chrome_in_container(container_id, browser_args)
# Get PIDs for later cleanup
self.chrome_process_id = await self.docker_utils.get_process_id_in_container(
container_id, "chrome"
)
self.socat_process_id = await self.docker_utils.get_process_id_in_container(
container_id, "socat"
)
# Wait for CDP to be ready
await self.docker_utils.wait_for_cdp_ready(host_port)
if self.logger:
self.logger.success(f"Docker container ready: {container_id[:12]} on port {host_port}", tag="DOCKER")
# Return CDP URL
return f"http://localhost:{host_port}"
def _build_browser_args(self) -> List[str]:
"""Build Chrome command line arguments based on BrowserConfig.
Returns:
List of command line arguments for Chrome
"""
args = [
"--no-sandbox",
"--disable-gpu",
f"--remote-debugging-port={self.internal_cdp_port}",
"--remote-debugging-address=0.0.0.0", # Allow external connections
"--disable-dev-shm-usage",
]
if self.config.headless:
args.append("--headless=new")
if self.config.viewport_width and self.config.viewport_height:
args.append(f"--window-size={self.config.viewport_width},{self.config.viewport_height}")
if self.config.user_agent:
args.append(f"--user-agent={self.config.user_agent}")
if self.config.text_mode:
args.extend([
"--blink-settings=imagesEnabled=false",
"--disable-remote-fonts",
"--disable-images",
"--disable-javascript",
])
if self.config.light_mode:
# Import here to avoid circular import
from .utils import get_browser_disable_options
args.extend(get_browser_disable_options())
if self.config.user_data_dir:
args.append(f"--user-data-dir={self.config.user_data_dir}")
if self.config.extra_args:
args.extend(self.config.extra_args)
return args
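The same flag assembly can be reproduced outside the class; here is a sketch that builds an equivalent argument list from a plain dict (the flag strings are real Chrome switches, the helper itself is illustrative):

```python
from typing import List


def build_chrome_args(cfg: dict, cdp_port: int = 9222) -> List[str]:
    """Assemble Chrome CLI flags the way the strategy's _build_browser_args does."""
    args = [
        "--no-sandbox",
        "--disable-gpu",
        f"--remote-debugging-port={cdp_port}",
        "--remote-debugging-address=0.0.0.0",  # allow connections from outside the container
        "--disable-dev-shm-usage",
    ]
    if cfg.get("headless"):
        args.append("--headless=new")
    if cfg.get("viewport_width") and cfg.get("viewport_height"):
        args.append(f"--window-size={cfg['viewport_width']},{cfg['viewport_height']}")
    if cfg.get("user_data_dir"):
        args.append(f"--user-data-dir={cfg['user_data_dir']}")
    return args


args = build_chrome_args({"headless": True, "viewport_width": 1280, "viewport_height": 800})
```

Note the ordering: baseline isolation flags first, then config-dependent flags, so user-supplied `extra_args` (appended last in the real method) can override earlier switches.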
async def close(self):
"""Close the browser and clean up Docker container if needed."""
# Set shutting_down flag to prevent race conditions
self.shutting_down = True
# Store state if needed before closing
if self.browser and self.docker_config.user_data_dir and self.docker_config.persistent:
for context in self.browser.contexts:
try:
storage_path = os.path.join(self.docker_config.user_data_dir, "storage_state.json")
await context.storage_state(path=storage_path)
if self.logger:
self.logger.debug("Persisted storage state before closing browser", tag="DOCKER")
except Exception as e:
if self.logger:
self.logger.warning(
message="Failed to persist storage state: {error}",
tag="DOCKER",
params={"error": str(e)}
)
# Close browser connection (but not container)
if self.browser:
await self.browser.close()
self.browser = None
# Only clean up container if not persistent
if self.container_id and not self.docker_config.persistent:
# Stop Chrome process in "launch" mode
if self.docker_config.mode == "launch" and self.chrome_process_id:
await self.docker_utils.stop_process_in_container(
self.container_id, self.chrome_process_id
)
# Stop socat process in "launch" mode
if self.docker_config.mode == "launch" and self.socat_process_id:
await self.docker_utils.stop_process_in_container(
self.container_id, self.socat_process_id
)
# Remove or stop container based on configuration
if self.docker_config.remove_on_exit:
await self.docker_utils.remove_container(self.container_id)
# Unregister from registry
self.registry.unregister_container(self.container_id)
else:
await self.docker_utils.stop_container(self.container_id)
self.container_id = None
# Close Playwright
if self.playwright:
await self.playwright.stop()
self.playwright = None
self.shutting_down = False


@@ -0,0 +1,582 @@
import os
import json
import asyncio
import hashlib
import tempfile
import shutil
import socket
import subprocess
from typing import Dict, List, Optional, Tuple, Union
class DockerUtils:
"""Utility class for Docker operations in browser automation.
This class provides methods for managing Docker images, containers,
and related operations needed for browser automation. It handles
image building, container lifecycle, port management, and registry operations.
Attributes:
DOCKER_FOLDER (str): Path to folder containing Docker files
DOCKER_CONNECT_FILE (str): Path to Dockerfile for connect mode
DOCKER_LAUNCH_FILE (str): Path to Dockerfile for launch mode
DOCKER_START_SCRIPT (str): Path to startup script for connect mode
DEFAULT_CONNECT_IMAGE (str): Default image name for connect mode
DEFAULT_LAUNCH_IMAGE (str): Default image name for launch mode
logger: Optional logger instance
"""
# File paths for Docker resources
DOCKER_FOLDER = os.path.join(os.path.dirname(__file__), "docker")
DOCKER_CONNECT_FILE = os.path.join(DOCKER_FOLDER, "connect.Dockerfile")
DOCKER_LAUNCH_FILE = os.path.join(DOCKER_FOLDER, "launch.Dockerfile")
DOCKER_START_SCRIPT = os.path.join(DOCKER_FOLDER, "start.sh")
# Default image names
DEFAULT_CONNECT_IMAGE = "crawl4ai/browser-connect:latest"
DEFAULT_LAUNCH_IMAGE = "crawl4ai/browser-launch:latest"
def __init__(self, logger=None):
"""Initialize Docker utilities.
Args:
logger: Optional logger for recording operations
"""
self.logger = logger
# Image Management Methods
async def check_image_exists(self, image_name: str) -> bool:
"""Check if a Docker image exists.
Args:
image_name: Name of the Docker image to check
Returns:
bool: True if the image exists, False otherwise
"""
cmd = ["docker", "image", "inspect", image_name]
try:
process = await asyncio.create_subprocess_exec(
*cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE
)
_, _ = await process.communicate()
return process.returncode == 0
except Exception as e:
if self.logger:
self.logger.debug(f"Error checking if image exists: {str(e)}", tag="DOCKER")
return False
async def build_docker_image(self, image_name: str, dockerfile_path: str,
                             files_to_copy: Optional[Dict[str, str]] = None) -> bool:
"""Build a Docker image from a Dockerfile.
Args:
image_name: Name to give the built image
dockerfile_path: Path to the Dockerfile
files_to_copy: Dict of {dest_name: source_path} for files to copy to build context
Returns:
bool: True if image was built successfully, False otherwise
"""
# Create a temporary build context
with tempfile.TemporaryDirectory() as temp_dir:
# Copy the Dockerfile
shutil.copy(dockerfile_path, os.path.join(temp_dir, "Dockerfile"))
# Copy any additional files needed
if files_to_copy:
for dest_name, source_path in files_to_copy.items():
shutil.copy(source_path, os.path.join(temp_dir, dest_name))
# Build the image
cmd = [
"docker", "build",
"-t", image_name,
temp_dir
]
if self.logger:
self.logger.debug(f"Building Docker image with command: {' '.join(cmd)}", tag="DOCKER")
process = await asyncio.create_subprocess_exec(
*cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE
)
stdout, stderr = await process.communicate()
if process.returncode != 0:
if self.logger:
self.logger.error(
message="Failed to build Docker image: {error}",
tag="DOCKER",
params={"error": stderr.decode()}
)
return False
if self.logger:
self.logger.success(f"Successfully built Docker image: {image_name}", tag="DOCKER")
return True
async def ensure_docker_image_exists(self, image_name: str, mode: str = "connect") -> str:
"""Ensure the required Docker image exists, creating it if necessary.
Args:
image_name: Name of the Docker image
mode: Either "connect" or "launch" to determine which image to build
Returns:
str: Name of the available Docker image
Raises:
Exception: If image doesn't exist and can't be built
"""
# If image name is not specified, use default based on mode
if not image_name:
image_name = self.DEFAULT_CONNECT_IMAGE if mode == "connect" else self.DEFAULT_LAUNCH_IMAGE
# Check if the image already exists
if await self.check_image_exists(image_name):
if self.logger:
self.logger.debug(f"Docker image {image_name} already exists", tag="DOCKER")
return image_name
# If we're using a custom image that doesn't exist, warn and fail
if (image_name != self.DEFAULT_CONNECT_IMAGE and image_name != self.DEFAULT_LAUNCH_IMAGE):
if self.logger:
self.logger.warning(
f"Custom Docker image {image_name} not found and cannot be automatically created",
tag="DOCKER"
)
raise Exception(f"Docker image {image_name} not found")
# Build the appropriate default image
if self.logger:
self.logger.info(f"Docker image {image_name} not found, creating it now...", tag="DOCKER")
if mode == "connect":
success = await self.build_docker_image(
image_name,
self.DOCKER_CONNECT_FILE,
{"start.sh": self.DOCKER_START_SCRIPT}
)
else:
success = await self.build_docker_image(
image_name,
self.DOCKER_LAUNCH_FILE
)
if not success:
raise Exception(f"Failed to create Docker image {image_name}")
return image_name
# Container Management Methods
async def create_container(self, image_name: str, host_port: int,
                           container_name: Optional[str] = None,
                           volumes: Optional[List[str]] = None,
                           network: Optional[str] = None,
                           env_vars: Optional[Dict[str, str]] = None,
                           extra_args: Optional[List[str]] = None) -> Optional[str]:
"""Create a new Docker container.
Args:
image_name: Docker image to use
host_port: Port on host to map to container port 9223
container_name: Optional name for the container
volumes: List of volume mappings (e.g., ["host_path:container_path"])
network: Optional Docker network to use
env_vars: Dictionary of environment variables
extra_args: Additional docker run arguments
Returns:
str: Container ID if successful, None otherwise
"""
# Prepare container command
cmd = [
"docker", "run",
"--detach",
]
# Add container name if specified
if container_name:
cmd.extend(["--name", container_name])
# Add port mapping
cmd.extend(["-p", f"{host_port}:9223"])
# Add volumes
if volumes:
for volume in volumes:
cmd.extend(["-v", volume])
# Add network if specified
if network:
cmd.extend(["--network", network])
# Add environment variables
if env_vars:
for key, value in env_vars.items():
cmd.extend(["-e", f"{key}={value}"])
# Add extra args
if extra_args:
cmd.extend(extra_args)
# Add image
cmd.append(image_name)
if self.logger:
self.logger.debug(f"Creating Docker container with command: {' '.join(cmd)}", tag="DOCKER")
# Run docker command
try:
process = await asyncio.create_subprocess_exec(
*cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE
)
stdout, stderr = await process.communicate()
if process.returncode != 0:
if self.logger:
self.logger.error(
message="Failed to create Docker container: {error}",
tag="DOCKER",
params={"error": stderr.decode()}
)
return None
# Get container ID
container_id = stdout.decode().strip()
if self.logger:
self.logger.success(f"Created Docker container: {container_id[:12]}", tag="DOCKER")
return container_id
except Exception as e:
if self.logger:
self.logger.error(
message="Error creating Docker container: {error}",
tag="DOCKER",
params={"error": str(e)}
)
return None
async def is_container_running(self, container_id: str) -> bool:
"""Check if a container is running.
Args:
container_id: ID of the container to check
Returns:
bool: True if the container is running, False otherwise
"""
cmd = ["docker", "inspect", "--format", "{{.State.Running}}", container_id]
try:
process = await asyncio.create_subprocess_exec(
*cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE
)
stdout, _ = await process.communicate()
return process.returncode == 0 and stdout.decode().strip() == "true"
except Exception as e:
if self.logger:
self.logger.debug(f"Error checking if container is running: {str(e)}", tag="DOCKER")
return False
async def wait_for_container_ready(self, container_id: str, timeout: int = 30) -> bool:
"""Wait for the container to be in running state.
Args:
container_id: ID of the container to wait for
timeout: Maximum time to wait in seconds
Returns:
bool: True if container is ready, False if timeout occurred
"""
for _ in range(timeout):
if await self.is_container_running(container_id):
return True
await asyncio.sleep(1)
if self.logger:
self.logger.warning(f"Container {container_id[:12]} not ready after {timeout}s timeout", tag="DOCKER")
return False
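`wait_for_container_ready` is a plain poll loop: retry a predicate once per interval until it succeeds or the time budget runs out. The same shape, generalized into a reusable helper (illustrative, not part of the library):

```python
import asyncio
from typing import Awaitable, Callable


async def poll_until(check: Callable[[], Awaitable[bool]],
                     timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Retry an async predicate until it returns True or the timeout elapses."""
    for _ in range(int(timeout / interval)):
        if await check():
            return True
        await asyncio.sleep(interval)
    return False


async def demo() -> bool:
    attempts = {"n": 0}

    async def ready() -> bool:
        # Simulated slow resource: becomes ready on the third probe.
        attempts["n"] += 1
        return attempts["n"] >= 3

    return await poll_until(ready, timeout=5, interval=0.01)
```

The container-ready and CDP-ready waits in this file are both instances of this loop with different predicates.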
async def stop_container(self, container_id: str) -> bool:
"""Stop a Docker container.
Args:
container_id: ID of the container to stop
Returns:
bool: True if stopped successfully, False otherwise
"""
cmd = ["docker", "stop", container_id]
try:
process = await asyncio.create_subprocess_exec(*cmd)
await process.communicate()
if self.logger:
self.logger.debug(f"Stopped container: {container_id[:12]}", tag="DOCKER")
return process.returncode == 0
except Exception as e:
if self.logger:
self.logger.warning(
message="Failed to stop container: {error}",
tag="DOCKER",
params={"error": str(e)}
)
return False
async def remove_container(self, container_id: str, force: bool = True) -> bool:
"""Remove a Docker container.
Args:
container_id: ID of the container to remove
force: Whether to force removal
Returns:
bool: True if removed successfully, False otherwise
"""
cmd = ["docker", "rm"]
if force:
cmd.append("-f")
cmd.append(container_id)
try:
process = await asyncio.create_subprocess_exec(*cmd)
await process.communicate()
if self.logger:
self.logger.debug(f"Removed container: {container_id[:12]}", tag="DOCKER")
return process.returncode == 0
except Exception as e:
if self.logger:
self.logger.warning(
message="Failed to remove container: {error}",
tag="DOCKER",
params={"error": str(e)}
)
return False
# Container Command Execution Methods
async def exec_in_container(self, container_id: str, command: List[str],
detach: bool = False) -> Tuple[int, str, str]:
"""Execute a command in a running container.
Args:
container_id: ID of the container
command: Command to execute as a list of strings
detach: Whether to run the command in detached mode
Returns:
Tuple of (return_code, stdout, stderr)
"""
cmd = ["docker", "exec"]
if detach:
cmd.append("-d")
cmd.append(container_id)
cmd.extend(command)
try:
process = await asyncio.create_subprocess_exec(
*cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE
)
stdout, stderr = await process.communicate()
return process.returncode, stdout.decode(), stderr.decode()
except Exception as e:
if self.logger:
self.logger.error(
message="Error executing command in container: {error}",
tag="DOCKER",
params={"error": str(e)}
)
return -1, "", str(e)
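`exec_in_container` wraps `docker exec` in `asyncio.create_subprocess_exec` and returns `(returncode, stdout, stderr)`. The same wrapper works for any command; a Docker-free sketch using `echo` so it runs anywhere:

```python
import asyncio
from typing import List, Tuple


async def run(cmd: List[str]) -> Tuple[int, str, str]:
    """Run a command and capture (returncode, stdout, stderr), as exec_in_container does."""
    proc = await asyncio.create_subprocess_exec(
        *cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE
    )
    stdout, stderr = await proc.communicate()
    return proc.returncode, stdout.decode(), stderr.decode()


code, out, err = asyncio.run(run(["echo", "hello"]))
```

In the real method the command list is prefixed with `["docker", "exec", container_id]` (plus `-d` for detached mode, in which case stdout is not the command's output but Docker's).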
async def start_socat_in_container(self, container_id: str) -> bool:
"""Start socat in the container to map port 9222 to 9223.
Args:
container_id: ID of the container
Returns:
bool: True if socat started successfully, False otherwise
"""
# Command to run socat as a background process
cmd = ["socat", "TCP-LISTEN:9223,fork", "TCP:localhost:9222"]
returncode, _, stderr = await self.exec_in_container(container_id, cmd, detach=True)
if returncode != 0:
if self.logger:
self.logger.error(
message="Failed to start socat in container: {error}",
tag="DOCKER",
params={"error": stderr}
)
return False
if self.logger:
self.logger.debug(f"Started socat in container: {container_id[:12]}", tag="DOCKER")
# Wait a moment for socat to start
await asyncio.sleep(1)
return True
async def launch_chrome_in_container(self, container_id: str, browser_args: List[str]) -> bool:
"""Launch Chrome inside the container with specified arguments.
Args:
container_id: ID of the container
browser_args: Chrome command line arguments
Returns:
bool: True if Chrome started successfully, False otherwise
"""
# Build Chrome command
chrome_cmd = ["google-chrome"]
chrome_cmd.extend(browser_args)
returncode, _, stderr = await self.exec_in_container(container_id, chrome_cmd, detach=True)
if returncode != 0:
if self.logger:
self.logger.error(
message="Failed to launch Chrome in container: {error}",
tag="DOCKER",
params={"error": stderr}
)
return False
if self.logger:
self.logger.debug(f"Launched Chrome in container: {container_id[:12]}", tag="DOCKER")
return True
async def get_process_id_in_container(self, container_id: str, process_name: str) -> Optional[int]:
"""Get the process ID for a process in the container.
Args:
container_id: ID of the container
process_name: Name pattern to search for
Returns:
int: Process ID if found, None otherwise
"""
cmd = ["pgrep", "-f", process_name]
returncode, stdout, _ = await self.exec_in_container(container_id, cmd)
if returncode == 0 and stdout.strip():
pid = int(stdout.strip().split("\n")[0])
return pid
return None
async def stop_process_in_container(self, container_id: str, pid: int) -> bool:
"""Stop a process in the container by PID.
Args:
container_id: ID of the container
pid: Process ID to stop
Returns:
bool: True if process was stopped, False otherwise
"""
cmd = ["kill", "-TERM", str(pid)]
returncode, _, stderr = await self.exec_in_container(container_id, cmd)
if returncode != 0:
if self.logger:
self.logger.warning(
message="Failed to stop process in container: {error}",
tag="DOCKER",
params={"error": stderr}
)
return False
if self.logger:
self.logger.debug(f"Stopped process {pid} in container: {container_id[:12]}", tag="DOCKER")
return True
# Network and Port Methods
async def wait_for_cdp_ready(self, host_port: int, timeout: int = 30) -> bool:
"""Wait for the CDP endpoint to be ready.
Args:
host_port: Port to check for CDP endpoint
timeout: Maximum time to wait in seconds
Returns:
bool: True if CDP endpoint is ready, False if timeout occurred
"""
import aiohttp
url = f"http://localhost:{host_port}/json/version"
for _ in range(timeout):
try:
async with aiohttp.ClientSession() as session:
async with session.get(url, timeout=1) as response:
if response.status == 200:
if self.logger:
self.logger.debug(f"CDP endpoint ready on port {host_port}", tag="DOCKER")
return True
except Exception:
pass
await asyncio.sleep(1)
if self.logger:
self.logger.warning(f"CDP endpoint not ready on port {host_port} after {timeout}s timeout", tag="DOCKER")
return False
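Readiness here means `GET /json/version` on the CDP port answers with HTTP 200. A synchronous stdlib version of the same probe (no `aiohttp` dependency; illustrative, not the library's API):

```python
import time
import urllib.error
import urllib.request


def wait_for_cdp_ready(port: int, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll Chrome's /json/version endpoint until it answers 200 or time runs out."""
    url = f"http://localhost:{port}/json/version"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=1) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # endpoint not up yet; retry after a short pause
        time.sleep(interval)
    return False


# Port 1 is essentially never serving CDP, so this times out quickly.
ready = wait_for_cdp_ready(1, timeout=0.3, interval=0.1)
```

`/json/version` is a stable CDP discovery endpoint, which makes it a better health check than merely seeing the TCP port open (socat accepts connections before Chrome is up).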
def is_port_in_use(self, port: int) -> bool:
"""Check if a port is already in use on the host.
Args:
port: Port number to check
Returns:
bool: True if port is in use, False otherwise
"""
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
return s.connect_ex(('localhost', port)) == 0
def get_next_available_port(self, start_port: int = 9223) -> int:
"""Get the next available port starting from a given port.
Args:
start_port: Port number to start checking from
Returns:
int: First available port number
"""
port = start_port
while self.is_port_in_use(port):
port += 1
return port
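The port check relies on `connect_ex`, which returns 0 only when something actually accepts the connection; the scan is then a linear walk from the starting port. A standalone sketch (illustrative names):

```python
import socket


def is_port_in_use(port: int) -> bool:
    """True if something on localhost accepts a TCP connection on this port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(("localhost", port)) == 0


def next_free_port(start: int) -> int:
    port = start
    while is_port_in_use(port):
        port += 1
    return port


# Occupy an ephemeral port, then verify the check sees it as busy.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("localhost", 0))
server.listen(1)
busy_port = server.getsockname()[1]
```

Note this is inherently racy: another process can grab the port between the scan and `docker run`, which is one reason the registry also records its own port assignments.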
# Configuration Hash Methods
def generate_config_hash(self, config_dict: Dict) -> str:
"""Generate a hash of the configuration for container matching.
Args:
config_dict: Dictionary of configuration parameters
Returns:
str: Hash string uniquely identifying this configuration
"""
# Convert to canonical JSON string and hash
config_json = json.dumps(config_dict, sort_keys=True)
return hashlib.sha256(config_json.encode()).hexdigest()
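Canonical JSON plus SHA-256 yields an order-independent fingerprint, which is what makes container reuse by configuration possible: two sessions with the same settings hash to the same container. The whole mechanism in a few lines:

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Hash a config dict; key order never changes the result."""
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()


a = config_hash({"image": "crawl4ai/browser-launch:latest", "headless": True})
b = config_hash({"headless": True, "image": "crawl4ai/browser-launch:latest"})
```

`sort_keys=True` is the load-bearing part: without it, Python dict insertion order would leak into the JSON and identical configs could hash differently.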

crawl4ai/browser/manager.py

@@ -0,0 +1,204 @@
"""Browser manager module for Crawl4AI.
This module provides a central browser management class that uses the
strategy pattern internally while maintaining the existing API.
It also implements a page pooling mechanism for improved performance.
"""
import asyncio
import time
from typing import Optional, Tuple, List
from playwright.async_api import Page, BrowserContext
from ..async_logger import AsyncLogger
from ..async_configs import BrowserConfig, CrawlerRunConfig
from .strategies import (
BaseBrowserStrategy,
PlaywrightBrowserStrategy,
CDPBrowserStrategy,
BuiltinBrowserStrategy
)
# Import DockerBrowserStrategy if available
try:
from .docker_strategy import DockerBrowserStrategy
except ImportError:
DockerBrowserStrategy = None
class BrowserManager:
"""Main interface for browser management in Crawl4AI.
This class maintains backward compatibility with the existing implementation
while using the strategy pattern internally for different browser types.
Attributes:
config (BrowserConfig): Configuration object containing all browser settings
logger: Logger instance for recording events and errors
browser: The browser instance
default_context: The default browser context
managed_browser: The managed browser instance
playwright: The Playwright instance
sessions: Dictionary to store session information
session_ttl: Session timeout in seconds
"""
def __init__(self, browser_config: Optional[BrowserConfig] = None, logger: Optional[AsyncLogger] = None):
"""Initialize the BrowserManager with a browser configuration.
Args:
browser_config: Configuration object containing all browser settings
logger: Logger instance for recording events and errors
"""
self.config = browser_config or BrowserConfig()
self.logger = logger
# Create strategy based on configuration
self._strategy = self._create_strategy()
# Initialize state variables for compatibility with existing code
self.browser = None
self.default_context = None
self.managed_browser = None
self.playwright = None
# For session management (from existing implementation)
self.sessions = {}
self.session_ttl = 1800 # 30 minutes
def _create_strategy(self) -> BaseBrowserStrategy:
"""Create appropriate browser strategy based on configuration.
Returns:
BaseBrowserStrategy: The selected browser strategy
"""
if self.config.browser_mode == "builtin":
return BuiltinBrowserStrategy(self.config, self.logger)
elif self.config.browser_mode == "docker":
if DockerBrowserStrategy is None:
if self.logger:
self.logger.error(
"Docker browser strategy requested but not available. "
"Falling back to PlaywrightBrowserStrategy.",
tag="BROWSER"
)
return PlaywrightBrowserStrategy(self.config, self.logger)
return DockerBrowserStrategy(self.config, self.logger)
elif self.config.cdp_url or self.config.use_managed_browser:
return CDPBrowserStrategy(self.config, self.logger)
else:
return PlaywrightBrowserStrategy(self.config, self.logger)
async def start(self):
"""Start the browser instance and set up the default context.
Returns:
self: For method chaining
"""
# Start the strategy
await self._strategy.start()
# Update legacy references
self.browser = self._strategy.browser
self.default_context = self._strategy.default_context
# Set browser process reference (for CDP strategy)
if hasattr(self._strategy, 'browser_process'):
self.managed_browser = self._strategy
# Set Playwright reference
self.playwright = self._strategy.playwright
# Sync sessions if needed
if hasattr(self._strategy, 'sessions'):
self.sessions = self._strategy.sessions
self.session_ttl = self._strategy.session_ttl
return self
async def get_page(self, crawlerRunConfig: CrawlerRunConfig) -> Tuple[Page, BrowserContext]:
"""Get a page for the given configuration.
Args:
crawlerRunConfig: Configuration object for the crawler run
Returns:
Tuple of (Page, BrowserContext)
"""
# Delegate to strategy
page, context = await self._strategy.get_page(crawlerRunConfig)
# Sync sessions if needed
if hasattr(self._strategy, 'sessions'):
self.sessions = self._strategy.sessions
return page, context
async def get_pages(self, crawlerRunConfig: CrawlerRunConfig, count: int = 1) -> List[Tuple[Page, BrowserContext]]:
"""Get multiple pages with the same configuration.
This method efficiently creates multiple browser pages using the same configuration,
which is useful for parallel crawling of multiple URLs.
Args:
crawlerRunConfig: Configuration for the pages
count: Number of pages to create
Returns:
List of (Page, Context) tuples
"""
# Delegate to strategy
pages = await self._strategy.get_pages(crawlerRunConfig, count)
# Sync sessions if needed
if hasattr(self._strategy, 'sessions'):
self.sessions = self._strategy.sessions
return pages
async def kill_session(self, session_id: str):
"""Kill a browser session and clean up resources.
Args:
session_id: The session ID to kill
"""
# Handle kill_session via our strategy if it supports it
if hasattr(self._strategy, '_kill_session'):
await self._strategy._kill_session(session_id)
elif session_id in self.sessions:
context, page, _ = self.sessions[session_id]
await page.close()
# Only close context if not using CDP
if not self.config.use_managed_browser and not self.config.cdp_url and self.config.browser_mode != "builtin":
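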
await context.close()
del self.sessions[session_id]
def _cleanup_expired_sessions(self):
"""Clean up expired sessions based on TTL."""
# Use strategy's implementation if available
if hasattr(self._strategy, '_cleanup_expired_sessions'):
self._strategy._cleanup_expired_sessions()
return
# Otherwise use our own implementation
current_time = time.time()
expired_sessions = [
sid
for sid, (_, _, last_used) in self.sessions.items()
if current_time - last_used > self.session_ttl
]
for sid in expired_sessions:
asyncio.create_task(self.kill_session(sid))
async def close(self):
"""Close the browser and clean up resources."""
# Delegate to strategy
await self._strategy.close()
# Reset legacy references
self.browser = None
self.default_context = None
self.managed_browser = None
self.playwright = None
self.sessions = {}
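`_create_strategy` above is effectively a four-way decision table. Sketched as a hypothetical standalone function with the same ordering (the function name and the `docker_available` flag are illustrative, not part of the API):

```python
from typing import Optional

def select_strategy(browser_mode: str,
                    cdp_url: Optional[str],
                    use_managed_browser: bool,
                    docker_available: bool = True) -> str:
    # Mirrors the if/elif order in BrowserManager._create_strategy:
    # builtin and docker modes win over CDP, which wins over plain Playwright
    if browser_mode == "builtin":
        return "builtin"
    if browser_mode == "docker":
        # Falls back to Playwright when the Docker strategy failed to import
        return "docker" if docker_available else "playwright"
    if cdp_url or use_managed_browser:
        return "cdp"
    return "playwright"
```

The ordering matters: an explicit `browser_mode` is honored before the CDP checks, so a config with both `browser_mode="builtin"` and a `cdp_url` still selects the builtin strategy.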


@@ -0,0 +1,457 @@
"""Browser profile management module for Crawl4AI.
This module provides functionality for creating and managing browser profiles
that can be used for authenticated browsing.
"""
import os
import asyncio
import signal
import sys
import datetime
import uuid
import shutil
from typing import List, Dict, Optional, Any
from colorama import Fore, Style, init
from ..async_configs import BrowserConfig
from ..async_logger import AsyncLogger, AsyncLoggerBase
from ..utils import get_home_folder
class BrowserProfileManager:
"""Manages browser profiles for Crawl4AI.
This class provides functionality to create and manage browser profiles
that can be used for authenticated browsing with Crawl4AI.
Profiles are stored by default in ~/.crawl4ai/profiles/
"""
def __init__(self, logger: Optional[AsyncLoggerBase] = None):
"""Initialize the BrowserProfileManager.
Args:
logger: Logger for outputting messages. If None, a default AsyncLogger is created.
"""
# Initialize colorama for colorful terminal output
init()
# Create a logger if not provided
if logger is None:
self.logger = AsyncLogger(verbose=True)
elif not isinstance(logger, AsyncLoggerBase):
self.logger = AsyncLogger(verbose=True)
else:
self.logger = logger
# Ensure profiles directory exists
self.profiles_dir = os.path.join(get_home_folder(), "profiles")
os.makedirs(self.profiles_dir, exist_ok=True)
async def create_profile(self,
profile_name: Optional[str] = None,
browser_config: Optional[BrowserConfig] = None) -> Optional[str]:
"""Create a browser profile interactively.
Args:
profile_name: Name for the profile. If None, a name is generated.
browser_config: Configuration for the browser. If None, a default configuration is used.
Returns:
Path to the created profile directory, or None if creation failed
"""
# Create default browser config if none provided
if browser_config is None:
browser_config = BrowserConfig(
browser_type="chromium",
headless=False, # Must be visible for user interaction
verbose=True
)
else:
# Ensure headless is False for user interaction
browser_config.headless = False
# Generate profile name if not provided
if not profile_name:
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
profile_name = f"profile_{timestamp}_{uuid.uuid4().hex[:6]}"
# Sanitize profile name (replace spaces and special chars)
profile_name = "".join(c if c.isalnum() or c in "-_" else "_" for c in profile_name)
# Set user data directory
profile_path = os.path.join(self.profiles_dir, profile_name)
os.makedirs(profile_path, exist_ok=True)
# Print instructions for the user with colorama formatting
border = f"{Fore.CYAN}{'='*80}{Style.RESET_ALL}"
self.logger.info(f"\n{border}", tag="PROFILE")
self.logger.info(f"Creating browser profile: {Fore.GREEN}{profile_name}{Style.RESET_ALL}", tag="PROFILE")
self.logger.info(f"Profile directory: {Fore.YELLOW}{profile_path}{Style.RESET_ALL}", tag="PROFILE")
self.logger.info("\nInstructions:", tag="PROFILE")
self.logger.info("1. A browser window will open for you to set up your profile.", tag="PROFILE")
self.logger.info(f"2. {Fore.CYAN}Log in to websites{Style.RESET_ALL}, configure settings, etc. as needed.", tag="PROFILE")
self.logger.info(f"3. When you're done, {Fore.YELLOW}press 'q' in this terminal{Style.RESET_ALL} to close the browser.", tag="PROFILE")
self.logger.info("4. The profile will be saved and ready to use with Crawl4AI.", tag="PROFILE")
self.logger.info(f"{border}\n", tag="PROFILE")
# Import the necessary classes with local imports to avoid circular references
from .strategies import CDPBrowserStrategy
# Set browser config to use the profile path
browser_config.user_data_dir = profile_path
# Create a CDP browser strategy for the profile creation
browser_strategy = CDPBrowserStrategy(browser_config, self.logger)
# Set up signal handlers to ensure cleanup on interrupt
original_sigint = signal.getsignal(signal.SIGINT)
original_sigterm = signal.getsignal(signal.SIGTERM)
# Define cleanup handler for signals
async def cleanup_handler(sig, frame):
self.logger.warning("\nCleaning up browser process...", tag="PROFILE")
await browser_strategy.close()
# Restore original signal handlers
signal.signal(signal.SIGINT, original_sigint)
signal.signal(signal.SIGTERM, original_sigterm)
if sig == signal.SIGINT:
self.logger.error("Profile creation interrupted. Profile may be incomplete.", tag="PROFILE")
sys.exit(1)
# Set signal handlers
def sigint_handler(sig, frame):
asyncio.create_task(cleanup_handler(sig, frame))
signal.signal(signal.SIGINT, sigint_handler)
signal.signal(signal.SIGTERM, sigint_handler)
# Event to signal when user is done with the browser
user_done_event = asyncio.Event()
# Run keyboard input loop in a separate task
async def listen_for_quit_command():
import termios
import tty
import select
# First output the prompt
self.logger.info(f"{Fore.CYAN}Press '{Fore.WHITE}q{Fore.CYAN}' when you've finished using the browser...{Style.RESET_ALL}", tag="PROFILE")
# Save original terminal settings
fd = sys.stdin.fileno()
old_settings = termios.tcgetattr(fd)
try:
# Switch to non-canonical mode (no line buffering)
tty.setcbreak(fd)
while True:
# Check if input is available (non-blocking)
readable, _, _ = select.select([sys.stdin], [], [], 0.5)
if readable:
key = sys.stdin.read(1)
if key.lower() == 'q':
self.logger.info(f"{Fore.GREEN}Closing browser and saving profile...{Style.RESET_ALL}", tag="PROFILE")
user_done_event.set()
return
# Check if the browser process has already exited
if browser_strategy.browser_process and browser_strategy.browser_process.poll() is not None:
self.logger.info("Browser already closed. Ending input listener.", tag="PROFILE")
user_done_event.set()
return
await asyncio.sleep(0.1)
finally:
# Restore terminal settings
termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
try:
# Start the browser
await browser_strategy.start()
# Check if browser started successfully
if not browser_strategy.browser_process:
self.logger.error("Failed to start browser process.", tag="PROFILE")
return None
self.logger.info(f"Browser launched. {Fore.CYAN}Waiting for you to finish...{Style.RESET_ALL}", tag="PROFILE")
# Start listening for keyboard input
listener_task = asyncio.create_task(listen_for_quit_command())
# Wait for either the user to press 'q' or for the browser process to exit naturally
while not user_done_event.is_set() and browser_strategy.browser_process.poll() is None:
await asyncio.sleep(0.5)
# Cancel the listener task if it's still running
if not listener_task.done():
listener_task.cancel()
try:
await listener_task
except asyncio.CancelledError:
pass
# If the browser is still running and the user pressed 'q', terminate it
if browser_strategy.browser_process.poll() is None and user_done_event.is_set():
self.logger.info("Terminating browser process...", tag="PROFILE")
await browser_strategy.close()
self.logger.success(f"Browser closed. Profile saved at: {Fore.GREEN}{profile_path}{Style.RESET_ALL}", tag="PROFILE")
except Exception as e:
self.logger.error(f"Error creating profile: {str(e)}", tag="PROFILE")
await browser_strategy.close()
return None
finally:
# Restore original signal handlers
signal.signal(signal.SIGINT, original_sigint)
signal.signal(signal.SIGTERM, original_sigterm)
# Make sure browser is fully cleaned up
await browser_strategy.close()
# Return the profile path
return profile_path
def list_profiles(self) -> List[Dict[str, Any]]:
"""List all available browser profiles.
Returns:
List of dictionaries containing profile information
"""
if not os.path.exists(self.profiles_dir):
return []
profiles = []
for name in os.listdir(self.profiles_dir):
profile_path = os.path.join(self.profiles_dir, name)
# Skip if not a directory
if not os.path.isdir(profile_path):
continue
# Check if this looks like a valid browser profile
# For Chromium: Look for Preferences file
# For Firefox: Look for prefs.js file
is_valid = False
if os.path.exists(os.path.join(profile_path, "Preferences")) or \
os.path.exists(os.path.join(profile_path, "Default", "Preferences")):
is_valid = "chromium"
elif os.path.exists(os.path.join(profile_path, "prefs.js")):
is_valid = "firefox"
if is_valid:
# Get creation time
created = datetime.datetime.fromtimestamp(
os.path.getctime(profile_path)
)
profiles.append({
"name": name,
"path": profile_path,
"created": created,
"type": is_valid
})
# Sort by creation time, newest first
profiles.sort(key=lambda x: x["created"], reverse=True)
return profiles
def get_profile_path(self, profile_name: str) -> Optional[str]:
"""Get the full path to a profile by name.
Args:
profile_name: Name of the profile (not the full path)
Returns:
Full path to the profile directory, or None if not found
"""
profile_path = os.path.join(self.profiles_dir, profile_name)
# Check if path exists and is a valid profile
if not os.path.isdir(profile_path):
# Check if profile_name itself is full path
if os.path.isabs(profile_name):
profile_path = profile_name
else:
return None
# Look for profile indicators
is_profile = (
os.path.exists(os.path.join(profile_path, "Preferences")) or
os.path.exists(os.path.join(profile_path, "Default", "Preferences")) or
os.path.exists(os.path.join(profile_path, "prefs.js"))
)
if not is_profile:
return None # Not a valid browser profile
return profile_path
def delete_profile(self, profile_name_or_path: str) -> bool:
"""Delete a browser profile by name or path.
Args:
profile_name_or_path: Name of the profile or full path to profile directory
Returns:
True if the profile was deleted successfully, False otherwise
"""
# Determine if input is a name or a path
if os.path.isabs(profile_name_or_path):
# Full path provided
profile_path = profile_name_or_path
else:
# Just a name provided, construct path
profile_path = os.path.join(self.profiles_dir, profile_name_or_path)
# Check if path exists and is a valid profile
if not os.path.isdir(profile_path):
return False
# Look for profile indicators
is_profile = (
os.path.exists(os.path.join(profile_path, "Preferences")) or
os.path.exists(os.path.join(profile_path, "Default", "Preferences")) or
os.path.exists(os.path.join(profile_path, "prefs.js"))
)
if not is_profile:
return False # Not a valid browser profile
# Delete the profile directory
try:
shutil.rmtree(profile_path)
return True
except Exception:
return False
async def interactive_manager(self, crawl_callback=None):
"""Launch an interactive profile management console.
Args:
crawl_callback: Function to call when selecting option to use
a profile for crawling. It will be called with (profile_path, url).
"""
while True:
self.logger.info(f"\n{Fore.CYAN}Profile Management Options:{Style.RESET_ALL}", tag="MENU")
self.logger.info(f"1. {Fore.GREEN}Create a new profile{Style.RESET_ALL}", tag="MENU")
self.logger.info(f"2. {Fore.YELLOW}List available profiles{Style.RESET_ALL}", tag="MENU")
self.logger.info(f"3. {Fore.RED}Delete a profile{Style.RESET_ALL}", tag="MENU")
# Only show crawl option if callback provided
if crawl_callback:
self.logger.info(f"4. {Fore.CYAN}Use a profile to crawl a website{Style.RESET_ALL}", tag="MENU")
self.logger.info(f"5. {Fore.MAGENTA}Exit{Style.RESET_ALL}", tag="MENU")
exit_option = "5"
else:
self.logger.info(f"4. {Fore.MAGENTA}Exit{Style.RESET_ALL}", tag="MENU")
exit_option = "4"
choice = input(f"\n{Fore.CYAN}Enter your choice (1-{exit_option}): {Style.RESET_ALL}")
if choice == "1":
# Create new profile
name = input(f"{Fore.GREEN}Enter a name for the new profile (or press Enter for auto-generated name): {Style.RESET_ALL}")
await self.create_profile(name or None)
elif choice == "2":
# List profiles
profiles = self.list_profiles()
if not profiles:
self.logger.warning(" No profiles found. Create one first with option 1.", tag="PROFILES")
continue
# Print profile information with colorama formatting
self.logger.info("\nAvailable profiles:", tag="PROFILES")
for i, profile in enumerate(profiles):
self.logger.info(f"[{i+1}] {Fore.CYAN}{profile['name']}{Style.RESET_ALL}", tag="PROFILES")
self.logger.info(f" Path: {Fore.YELLOW}{profile['path']}{Style.RESET_ALL}", tag="PROFILES")
self.logger.info(f" Created: {profile['created'].strftime('%Y-%m-%d %H:%M:%S')}", tag="PROFILES")
self.logger.info(f" Browser type: {profile['type']}", tag="PROFILES")
self.logger.info("", tag="PROFILES") # Empty line for spacing
elif choice == "3":
# Delete profile
profiles = self.list_profiles()
if not profiles:
self.logger.warning("No profiles found to delete", tag="PROFILES")
continue
# Display numbered list
self.logger.info(f"\n{Fore.YELLOW}Available profiles:{Style.RESET_ALL}", tag="PROFILES")
for i, profile in enumerate(profiles):
self.logger.info(f"[{i+1}] {profile['name']}", tag="PROFILES")
# Get profile to delete
profile_idx = input(f"{Fore.RED}Enter the number of the profile to delete (or 'c' to cancel): {Style.RESET_ALL}")
if profile_idx.lower() == 'c':
continue
try:
idx = int(profile_idx) - 1
if 0 <= idx < len(profiles):
profile_name = profiles[idx]["name"]
self.logger.info(f"Deleting profile: {Fore.YELLOW}{profile_name}{Style.RESET_ALL}", tag="PROFILES")
# Confirm deletion
confirm = input(f"{Fore.RED}Are you sure you want to delete this profile? (y/n): {Style.RESET_ALL}")
if confirm.lower() == 'y':
success = self.delete_profile(profiles[idx]["path"])
if success:
self.logger.success(f"Profile {Fore.GREEN}{profile_name}{Style.RESET_ALL} deleted successfully", tag="PROFILES")
else:
self.logger.error(f"Failed to delete profile {Fore.RED}{profile_name}{Style.RESET_ALL}", tag="PROFILES")
else:
self.logger.error("Invalid profile number", tag="PROFILES")
except ValueError:
self.logger.error("Please enter a valid number", tag="PROFILES")
elif choice == "4" and crawl_callback:
# Use profile to crawl a site
profiles = self.list_profiles()
if not profiles:
self.logger.warning("No profiles found. Create one first.", tag="PROFILES")
continue
# Display numbered list
self.logger.info(f"\n{Fore.YELLOW}Available profiles:{Style.RESET_ALL}", tag="PROFILES")
for i, profile in enumerate(profiles):
self.logger.info(f"[{i+1}] {profile['name']}", tag="PROFILES")
# Get profile to use
profile_idx = input(f"{Fore.CYAN}Enter the number of the profile to use (or 'c' to cancel): {Style.RESET_ALL}")
if profile_idx.lower() == 'c':
continue
try:
idx = int(profile_idx) - 1
if 0 <= idx < len(profiles):
profile_path = profiles[idx]["path"]
url = input(f"{Fore.CYAN}Enter the URL to crawl: {Style.RESET_ALL}")
if url:
# Call the provided crawl callback
await crawl_callback(profile_path, url)
else:
self.logger.error("No URL provided", tag="CRAWL")
else:
self.logger.error("Invalid profile number", tag="PROFILES")
except ValueError:
self.logger.error("Please enter a valid number", tag="PROFILES")
elif (choice == "4" and not crawl_callback) or (choice == "5" and crawl_callback):
# Exit
self.logger.info("Exiting profile management", tag="MENU")
break
else:
self.logger.error(f"Invalid choice. Please enter a number between 1 and {exit_option}.", tag="MENU")
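Profile detection is repeated verbatim in `list_profiles`, `get_profile_path`, and `delete_profile`; sketched once as a standalone helper (a refactoring suggestion, not code from the module):

```python
import os
from typing import Optional

def detect_profile_type(profile_path: str) -> Optional[str]:
    # Chromium keeps a Preferences file at the profile root or under Default/
    if os.path.exists(os.path.join(profile_path, "Preferences")) or \
       os.path.exists(os.path.join(profile_path, "Default", "Preferences")):
        return "chromium"
    # Firefox profiles carry prefs.js at the profile root
    if os.path.exists(os.path.join(profile_path, "prefs.js")):
        return "firefox"
    return None
```

As in `list_profiles`, the Chromium check runs first, so a directory that somehow contains both marker files is classified as chromium.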

File diff suppressed because it is too large.

crawl4ai/browser/utils.py (new file, 328 lines)

@@ -0,0 +1,328 @@
"""Browser utilities module for Crawl4AI.
This module provides utility functions for browser management,
including process management, CDP connection utilities,
and Playwright instance management.
"""
import asyncio
import os
import sys
import time
import tempfile
import subprocess
from typing import Optional
from playwright.async_api import async_playwright
from ..utils import get_chromium_path
from ..async_configs import BrowserConfig, CrawlerRunConfig
from ..async_logger import AsyncLogger
_playwright_instance = None
async def get_playwright():
"""Get or create the Playwright instance (singleton pattern).
Returns:
Playwright: The Playwright instance
"""
global _playwright_instance
if _playwright_instance is None:
_playwright_instance = await async_playwright().start()
return _playwright_instance
async def get_browser_executable(browser_type: str) -> str:
"""Get the path to browser executable, with platform-specific handling.
Args:
browser_type: Type of browser (chromium, firefox, webkit)
Returns:
Path to browser executable
"""
return await get_chromium_path(browser_type)
def create_temp_directory(prefix="browser-profile-") -> str:
"""Create a temporary directory for browser data.
Args:
prefix: Prefix for the temporary directory name
Returns:
Path to the created temporary directory
"""
return tempfile.mkdtemp(prefix=prefix)
def is_windows() -> bool:
"""Check if the current platform is Windows.
Returns:
True if Windows, False otherwise
"""
return sys.platform == "win32"
def is_macos() -> bool:
"""Check if the current platform is macOS.
Returns:
True if macOS, False otherwise
"""
return sys.platform == "darwin"
def is_linux() -> bool:
"""Check if the current platform is Linux.
Returns:
True if Linux, False otherwise
"""
return not (is_windows() or is_macos())
def is_browser_running(pid: Optional[int]) -> bool:
"""Check if a process with the given PID is running.
Args:
pid: Process ID to check
Returns:
bool: True if the process is running, False otherwise
"""
if not pid:
return False
try:
# Check if the process exists
if is_windows():
process = subprocess.run(["tasklist", "/FI", f"PID eq {pid}"],
capture_output=True, text=True)
return str(pid) in process.stdout
else:
# Unix-like systems
os.kill(pid, 0) # This doesn't actually kill the process, just checks if it exists
return True
except (ProcessLookupError, PermissionError, OSError):
return False
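The liveness check above deserves a note: on Unix, `os.kill(pid, 0)` delivers no signal at all; it only asks the kernel whether the PID exists. A trimmed-down standalone sketch with the same semantics:

```python
import os
import subprocess
import sys
from typing import Optional

def is_process_running(pid: Optional[int]) -> bool:
    # None or 0 means "no process to check"
    if not pid:
        return False
    try:
        if sys.platform == "win32":
            # tasklist filtered by PID; the PID appears in stdout only if alive
            result = subprocess.run(["tasklist", "/FI", f"PID eq {pid}"],
                                    capture_output=True, text=True)
            return str(pid) in result.stdout
        os.kill(pid, 0)  # Signal 0: existence check only, nothing is delivered
        return True
    except (ProcessLookupError, PermissionError, OSError):
        return False
```

Strictly speaking, a `PermissionError` from `os.kill` means the process exists but belongs to another user; treating it as "not running", as both the sketch and the original do, is a conservative simplification for browser processes this library spawned itself.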
def get_browser_disable_options() -> list:
"""Get standard list of browser disable options for performance.
Returns:
List of command-line options to disable various browser features
"""
return [
"--disable-background-networking",
"--disable-background-timer-throttling",
"--disable-backgrounding-occluded-windows",
"--disable-breakpad",
"--disable-client-side-phishing-detection",
"--disable-component-extensions-with-background-pages",
"--disable-default-apps",
"--disable-extensions",
"--disable-features=TranslateUI",
"--disable-hang-monitor",
"--disable-ipc-flooding-protection",
"--disable-popup-blocking",
"--disable-prompt-on-repost",
"--disable-sync",
"--force-color-profile=srgb",
"--metrics-recording-only",
"--no-first-run",
"--password-store=basic",
"--use-mock-keychain",
]
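These flags are typically concatenated with per-run arguments; when doing so, it helps to dedupe by flag name so a caller's override of, say, `--force-color-profile` replaces the default rather than appearing twice on the command line. A hypothetical merge helper (not part of the module, shown only as a sketch):

```python
from typing import Iterable, List

def merge_browser_args(base: Iterable[str], overrides: Iterable[str]) -> List[str]:
    # Index by the flag name (the part before '='); later entries win,
    # so an override replaces the matching base flag in place
    merged = {}
    for arg in list(base) + list(overrides):
        merged[arg.split("=", 1)[0]] = arg
    return list(merged.values())
```

This relies on Python dicts preserving insertion order while updating the value in place, so the merged list keeps the base ordering.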
async def find_optimal_browser_config(total_urls=50, verbose=True, rate_limit_delay=0.2):
"""Find optimal browser configuration for crawling a specific number of URLs.
Args:
total_urls: Number of URLs to crawl
verbose: Whether to print progress
rate_limit_delay: Delay between page loads to avoid rate limiting
Returns:
dict: Contains fastest, lowest_memory, and optimal configurations
"""
from .manager import BrowserManager
if verbose:
print(f"\n=== Finding optimal configuration for crawling {total_urls} URLs ===\n")
# Generate test URLs with timestamp to avoid caching
timestamp = int(time.time())
urls = [f"https://example.com/page_{i}?t={timestamp}" for i in range(total_urls)]
# Limit browser configurations to test (1 browser to max 10)
max_browsers = min(10, total_urls)
configs_to_test = []
# Generate configurations (browser count, pages distribution)
for num_browsers in range(1, max_browsers + 1):
base_pages = total_urls // num_browsers
remainder = total_urls % num_browsers
# Create distribution array like [3, 3, 2, 2] (some browsers get one more page)
if remainder > 0:
distribution = [base_pages + 1] * remainder + [base_pages] * (num_browsers - remainder)
else:
distribution = [base_pages] * num_browsers
configs_to_test.append((num_browsers, distribution))
results = []
# Test each configuration
for browser_count, page_distribution in configs_to_test:
if verbose:
print(f"Testing {browser_count} browsers with distribution {tuple(page_distribution)}")
try:
# Track memory if possible
try:
import psutil
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024) # MB
except ImportError:
if verbose:
print("Memory tracking not available (psutil not installed)")
start_memory = 0
# Start browsers in parallel
managers = []
start_tasks = []
start_time = time.time()
logger = AsyncLogger(verbose=True, log_file=None)
for i in range(browser_count):
config = BrowserConfig(headless=True)
manager = BrowserManager(browser_config=config, logger=logger)
start_tasks.append(manager.start())
managers.append(manager)
await asyncio.gather(*start_tasks)
# Distribute URLs among browsers
urls_per_manager = {}
url_index = 0
for i, manager in enumerate(managers):
pages_for_this_browser = page_distribution[i]
end_index = url_index + pages_for_this_browser
urls_per_manager[manager] = urls[url_index:end_index]
url_index = end_index
# Create pages for each browser
all_pages = []
for manager, manager_urls in urls_per_manager.items():
if not manager_urls:
continue
pages = await manager.get_pages(CrawlerRunConfig(), count=len(manager_urls))
all_pages.extend(zip(pages, manager_urls))
# Crawl pages with delay to avoid rate limiting
async def crawl_page(page_ctx, url):
page, _ = page_ctx
try:
await page.goto(url)
if rate_limit_delay > 0:
await asyncio.sleep(rate_limit_delay)
title = await page.title()
return title
finally:
await page.close()
crawl_start = time.time()
crawl_tasks = [crawl_page(page_ctx, url) for page_ctx, url in all_pages]
await asyncio.gather(*crawl_tasks)
crawl_time = time.time() - crawl_start
total_time = time.time() - start_time
# Measure final memory usage
if start_memory > 0:
end_memory = process.memory_info().rss / (1024 * 1024)
memory_used = end_memory - start_memory
else:
memory_used = 0
# Close all browsers
for manager in managers:
await manager.close()
# Calculate metrics
pages_per_second = total_urls / crawl_time
# Calculate efficiency score (higher is better)
# This balances speed vs memory
if memory_used > 0:
efficiency = pages_per_second / (memory_used + 1)
else:
efficiency = pages_per_second
# Store result
result = {
"browser_count": browser_count,
"distribution": tuple(page_distribution),
"crawl_time": crawl_time,
"total_time": total_time,
"memory_used": memory_used,
"pages_per_second": pages_per_second,
"efficiency": efficiency
}
results.append(result)
if verbose:
print(f" ✓ Crawled {total_urls} pages in {crawl_time:.2f}s ({pages_per_second:.1f} pages/sec)")
if memory_used > 0:
print(f" ✓ Memory used: {memory_used:.1f}MB ({memory_used/total_urls:.1f}MB per page)")
print(f" ✓ Efficiency score: {efficiency:.4f}")
except Exception as e:
if verbose:
print(f" ✗ Error: {str(e)}")
# Clean up
for manager in managers:
try:
await manager.close()
except Exception:
pass
# If no successful results, return None
if not results:
return None
# Find best configurations
fastest = sorted(results, key=lambda x: x["crawl_time"])[0]
# Only consider memory if available
memory_results = [r for r in results if r["memory_used"] > 0]
if memory_results:
lowest_memory = sorted(memory_results, key=lambda x: x["memory_used"])[0]
else:
lowest_memory = fastest
# Find most efficient (balanced speed vs memory)
optimal = sorted(results, key=lambda x: x["efficiency"], reverse=True)[0]
# Print summary
if verbose:
print("\n=== OPTIMAL CONFIGURATIONS ===")
print(f"⚡ Fastest: {fastest['browser_count']} browsers {fastest['distribution']}")
print(f" {fastest['crawl_time']:.2f}s, {fastest['pages_per_second']:.1f} pages/sec")
print(f"💾 Memory-efficient: {lowest_memory['browser_count']} browsers {lowest_memory['distribution']}")
if lowest_memory["memory_used"] > 0:
print(f" {lowest_memory['memory_used']:.1f}MB, {lowest_memory['memory_used']/total_urls:.2f}MB per page")
print(f"🌟 Balanced optimal: {optimal['browser_count']} browsers {optimal['distribution']}")
print(f" {optimal['crawl_time']:.2f}s, {optimal['pages_per_second']:.1f} pages/sec, score: {optimal['efficiency']:.4f}")
return {
"fastest": fastest,
"lowest_memory": lowest_memory,
"optimal": optimal,
"all_configs": results
}
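The per-browser page split inside `find_optimal_browser_config` is the classic even-division-with-remainder pattern; isolated here for clarity (a sketch of the same arithmetic, not a function the module exports):

```python
from typing import List

def distribute_pages(total_urls: int, num_browsers: int) -> List[int]:
    # The first `remainder` browsers take one extra page,
    # e.g. 10 URLs over 4 browsers -> [3, 3, 2, 2]
    base_pages = total_urls // num_browsers
    remainder = total_urls % num_browsers
    return [base_pages + 1] * remainder + [base_pages] * (num_browsers - remainder)
```

The invariant the search loop depends on: the counts always sum to `total_urls` and never differ by more than one page between browsers.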


@@ -145,17 +145,60 @@ class ManagedBrowser:
# Start browser process # Start browser process
try: try:
self.browser_process = subprocess.Popen( # Use DETACHED_PROCESS flag on Windows to fully detach the process
args, stdout=subprocess.PIPE, stderr=subprocess.PIPE # On Unix, we'll use preexec_fn=os.setpgrp to start the process in a new process group
) if sys.platform == "win32":
# Monitor browser process output for errors self.browser_process = subprocess.Popen(
asyncio.create_task(self._monitor_browser_process()) args,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
creationflags=subprocess.DETACHED_PROCESS | subprocess.CREATE_NEW_PROCESS_GROUP
)
else:
self.browser_process = subprocess.Popen(
args,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
preexec_fn=os.setpgrp # Start in a new process group
)
# We'll monitor for a short time to make sure it starts properly, but won't keep monitoring
await asyncio.sleep(0.5) # Give browser time to start
await self._initial_startup_check()
await asyncio.sleep(2) # Give browser time to start await asyncio.sleep(2) # Give browser time to start
return f"http://{self.host}:{self.debugging_port}" return f"http://{self.host}:{self.debugging_port}"
except Exception as e: except Exception as e:
await self.cleanup() await self.cleanup()
raise Exception(f"Failed to start browser: {e}") raise Exception(f"Failed to start browser: {e}")
async def _initial_startup_check(self):
"""
Perform a quick check to make sure the browser started successfully.
This only runs once at startup rather than continuously monitoring.
"""
if not self.browser_process:
return
# Check that process started without immediate termination
await asyncio.sleep(0.5)
if self.browser_process.poll() is not None:
# Process already terminated
stdout, stderr = b"", b""
try:
stdout, stderr = self.browser_process.communicate(timeout=0.5)
except subprocess.TimeoutExpired:
pass
self.logger.error(
message="Browser process terminated during startup | Code: {code} | STDOUT: {stdout} | STDERR: {stderr}",
tag="ERROR",
params={
"code": self.browser_process.returncode,
"stdout": stdout.decode() if stdout else "",
"stderr": stderr.decode() if stderr else "",
},
)
async def _monitor_browser_process(self): async def _monitor_browser_process(self):
""" """
Monitor the browser process for unexpected termination. Monitor the browser process for unexpected termination.
@@ -167,6 +210,7 @@ class ManagedBrowser:
4. If any other error occurs, log the error message. 4. If any other error occurs, log the error message.
Note: This method should be called in a separate task to avoid blocking the main event loop. Note: This method should be called in a separate task to avoid blocking the main event loop.
This is DEPRECATED and should not be used for builtin browsers that need to outlive the Python process.
""" """
if self.browser_process: if self.browser_process:
try: try:
@@ -261,22 +305,33 @@ class ManagedBrowser:
if self.browser_process: if self.browser_process:
try: try:
# For builtin browsers that should persist, we should check if it's a detached process
# Only terminate if we have proper control over the process
if not self.browser_process.poll():
    # Process is still running
    self.browser_process.terminate()
    # Wait for process to end gracefully
    for _ in range(10):  # 10 attempts, 100ms each
        if self.browser_process.poll() is not None:
            break
        await asyncio.sleep(0.1)
    # Force kill if still running
    if self.browser_process.poll() is None:
        if sys.platform == "win32":
            # On Windows we might need taskkill for detached processes
            try:
                subprocess.run(["taskkill", "/F", "/PID", str(self.browser_process.pid)])
            except Exception:
                self.browser_process.kill()
        else:
            self.browser_process.kill()
        await asyncio.sleep(0.1)  # Brief wait for kill to take effect
except Exception as e:
self.logger.error(
message="Error terminating browser: {error}",
tag="ERROR",
params={"error": str(e)},
)
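The graceful-shutdown logic above follows a common pattern: terminate, poll for exit within a grace period, then escalate to a hard kill. A minimal stdlib-only sketch of that pattern (the helper name `stop_process` is illustrative, not part of the codebase; the Windows `taskkill` fallback mirrors what the diff adds for detached processes):

```python
import subprocess
import sys
import time

def stop_process(proc: subprocess.Popen, grace_period: float = 1.0) -> None:
    """Terminate a child process gracefully, escalating to a hard kill."""
    if proc.poll() is not None:
        return  # Already exited
    proc.terminate()
    deadline = time.monotonic() + grace_period
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return  # Exited gracefully
        time.sleep(0.1)
    # Still running after the grace period: force kill
    if sys.platform == "win32":
        # taskkill can reach detached processes that proc.kill() may miss
        subprocess.run(["taskkill", "/F", "/PID", str(proc.pid)], check=False)
    else:
        proc.kill()
    proc.wait(timeout=5)
```

The grace period gives the browser a chance to flush profile data to disk before the hard kill.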
@@ -379,7 +434,15 @@ class BrowserManager:
sessions (dict): Dictionary to store session information
session_ttl (int): Session timeout in seconds
"""
_playwright_instance = None
@classmethod
async def get_playwright(cls):
from playwright.async_api import async_playwright
if cls._playwright_instance is None:
cls._playwright_instance = await async_playwright().start()
return cls._playwright_instance
def __init__(self, browser_config: BrowserConfig, logger=None):
"""
@@ -429,6 +492,7 @@ class BrowserManager:
Note: This method should be called in a separate task to avoid blocking the main event loop.
"""
self.playwright = await self.get_playwright()
if self.playwright is None:
from playwright.async_api import async_playwright
@@ -443,19 +507,6 @@ class BrowserManager:
self.default_context = contexts[0]
else:
self.default_context = await self.create_browser_context()
# self.default_context = await self.browser.new_context(
# viewport={
# "width": self.config.viewport_width,
# "height": self.config.viewport_height,
# },
# storage_state=self.config.storage_state,
# user_agent=self.config.headers.get(
# "User-Agent", self.config.user_agent
# ),
# accept_downloads=self.config.accept_downloads,
# ignore_https_errors=self.config.ignore_https_errors,
# java_script_enabled=self.config.java_script_enabled,
# )
await self.setup_context(self.default_context)
else:
browser_args = self._build_browser_args()
@@ -470,6 +521,7 @@ class BrowserManager:
self.default_context = self.browser
def _build_browser_args(self) -> dict:
"""Build browser launch arguments from config."""
args = [


@@ -12,7 +12,10 @@ import sys
import datetime
import uuid
import shutil
import json
import subprocess
import time
from typing import List, Dict, Optional, Any, Tuple
from colorama import Fore, Style, init
from .async_configs import BrowserConfig
@@ -56,6 +59,11 @@ class BrowserProfiler:
# Ensure profiles directory exists
self.profiles_dir = os.path.join(get_home_folder(), "profiles")
os.makedirs(self.profiles_dir, exist_ok=True)
# Builtin browser config file
self.builtin_browser_dir = os.path.join(get_home_folder(), "builtin-browser")
self.builtin_config_file = os.path.join(self.builtin_browser_dir, "browser_config.json")
os.makedirs(self.builtin_browser_dir, exist_ok=True)
async def create_profile(self,
profile_name: Optional[str] = None,
@@ -547,12 +555,12 @@ class BrowserProfiler:
else:
self.logger.error(f"Invalid choice. Please enter a number between 1 and {exit_option}.", tag="MENU")
async def launch_standalone_browser(self,
browser_type: str = "chromium",
user_data_dir: Optional[str] = None,
debugging_port: int = 9222,
headless: bool = False,
save_as_builtin: bool = False) -> Optional[str]:
""" """
Launch a standalone browser with CDP debugging enabled and keep it running Launch a standalone browser with CDP debugging enabled and keep it running
until the user presses 'q'. Returns and displays the CDP URL. until the user presses 'q'. Returns and displays the CDP URL.
@@ -766,4 +774,201 @@ class BrowserProfiler:
# Return the CDP URL
return cdp_url
async def launch_builtin_browser(self,
browser_type: str = "chromium",
debugging_port: int = 9222,
headless: bool = True) -> Optional[str]:
"""
Launch a browser in the background for use as the builtin browser.
Args:
browser_type (str): Type of browser to launch ('chromium' or 'firefox')
debugging_port (int): Port to use for CDP debugging
headless (bool): Whether to run in headless mode
Returns:
str: CDP URL for the browser, or None if launch failed
"""
# Check if there's an existing browser still running
browser_info = self.get_builtin_browser_info()
if browser_info and self._is_browser_running(browser_info.get('pid')):
self.logger.info("Builtin browser is already running", tag="BUILTIN")
return browser_info.get('cdp_url')
# Create a user data directory for the builtin browser
user_data_dir = os.path.join(self.builtin_browser_dir, "user_data")
os.makedirs(user_data_dir, exist_ok=True)
# Create managed browser instance
managed_browser = ManagedBrowser(
browser_type=browser_type,
user_data_dir=user_data_dir,
headless=headless,
logger=self.logger,
debugging_port=debugging_port
)
try:
# Start the browser
await managed_browser.start()
# Check if browser started successfully
browser_process = managed_browser.browser_process
if not browser_process:
self.logger.error("Failed to start browser process.", tag="BUILTIN")
return None
# Get CDP URL
cdp_url = f"http://localhost:{debugging_port}"
# Try to verify browser is responsive by fetching version info
import aiohttp
json_url = f"{cdp_url}/json/version"
config_json = None
try:
async with aiohttp.ClientSession() as session:
for _ in range(10): # Try multiple times
try:
async with session.get(json_url) as response:
if response.status == 200:
config_json = await response.json()
break
except Exception:
pass
await asyncio.sleep(0.5)
except Exception as e:
self.logger.warning(f"Could not verify browser: {str(e)}", tag="BUILTIN")
# Save browser info
browser_info = {
'pid': browser_process.pid,
'cdp_url': cdp_url,
'user_data_dir': user_data_dir,
'browser_type': browser_type,
'debugging_port': debugging_port,
'start_time': time.time(),
'config': config_json
}
with open(self.builtin_config_file, 'w') as f:
json.dump(browser_info, f, indent=2)
# Detach from the browser process - don't keep any references
# This is important to allow the Python script to exit while the browser continues running
# We'll just record the PID and other info, and the browser will run independently
managed_browser.browser_process = None
self.logger.success(f"Builtin browser launched at CDP URL: {cdp_url}", tag="BUILTIN")
return cdp_url
except Exception as e:
self.logger.error(f"Error launching builtin browser: {str(e)}", tag="BUILTIN")
if managed_browser:
await managed_browser.cleanup()
return None
def get_builtin_browser_info(self) -> Optional[Dict[str, Any]]:
"""
Get information about the builtin browser.
Returns:
dict: Browser information or None if no builtin browser is configured
"""
if not os.path.exists(self.builtin_config_file):
return None
try:
with open(self.builtin_config_file, 'r') as f:
browser_info = json.load(f)
# Check if the browser is still running
if not self._is_browser_running(browser_info.get('pid')):
self.logger.warning("Builtin browser is not running", tag="BUILTIN")
return None
return browser_info
except Exception as e:
self.logger.error(f"Error reading builtin browser config: {str(e)}", tag="BUILTIN")
return None
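`launch_builtin_browser` and `get_builtin_browser_info` share a simple contract: browser metadata is persisted as JSON so a later process can reattach to the still-running browser. A stripped-down sketch of that round-trip (the function names here are hypothetical, not the crawl4ai API):

```python
import json
import os
import time
from typing import Optional

def save_browser_info(config_file: str, pid: int, cdp_url: str) -> None:
    """Persist browser metadata so a later process can reattach."""
    info = {"pid": pid, "cdp_url": cdp_url, "start_time": time.time()}
    with open(config_file, "w") as f:
        json.dump(info, f, indent=2)

def load_browser_info(config_file: str) -> Optional[dict]:
    """Load persisted metadata, or None if it is missing or corrupt."""
    if not os.path.exists(config_file):
        return None
    try:
        with open(config_file) as f:
            return json.load(f)
    except (json.JSONDecodeError, OSError):
        return None
```

Treating a corrupt or missing file as "no builtin browser" keeps startup resilient: the caller simply launches a fresh browser instead of crashing.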
def _is_browser_running(self, pid: Optional[int]) -> bool:
"""Check if a process with the given PID is running"""
if not pid:
return False
try:
# Check if the process exists
if sys.platform == "win32":
process = subprocess.run(["tasklist", "/FI", f"PID eq {pid}"],
capture_output=True, text=True)
return str(pid) in process.stdout
else:
# Unix-like systems
os.kill(pid, 0) # This doesn't actually kill the process, just checks if it exists
return True
except (ProcessLookupError, PermissionError, OSError):
return False
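The PID liveness probe above relies on `os.kill(pid, 0)` on Unix (signal 0 checks existence without sending anything) and `tasklist` output on Windows. A self-contained sketch of the same check, with an illustrative helper name:

```python
import os
import subprocess
import sys
from typing import Optional

def pid_is_running(pid: Optional[int]) -> bool:
    """Return True if a process with this PID currently exists."""
    if not pid:
        return False
    try:
        if sys.platform == "win32":
            # tasklist prints the PID only if the process exists
            out = subprocess.run(["tasklist", "/FI", f"PID eq {pid}"],
                                 capture_output=True, text=True)
            return str(pid) in out.stdout
        os.kill(pid, 0)  # Signal 0: probe only, never delivered
        return True
    except (ProcessLookupError, PermissionError, OSError):
        return False
```

Note that, like the source, this treats `PermissionError` as "not running"; a stricter check could return True there, since the PID exists but belongs to another user.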
async def kill_builtin_browser(self) -> bool:
"""
Kill the builtin browser if it's running.
Returns:
bool: True if the browser was killed, False otherwise
"""
browser_info = self.get_builtin_browser_info()
if not browser_info:
self.logger.warning("No builtin browser found", tag="BUILTIN")
return False
pid = browser_info.get('pid')
if not pid:
return False
try:
if sys.platform == "win32":
subprocess.run(["taskkill", "/F", "/PID", str(pid)], check=True)
else:
os.kill(pid, signal.SIGTERM)
# Wait for termination
for _ in range(5):
if not self._is_browser_running(pid):
break
await asyncio.sleep(0.5)
else:
# Force kill if still running
os.kill(pid, signal.SIGKILL)
# Remove config file
if os.path.exists(self.builtin_config_file):
os.unlink(self.builtin_config_file)
self.logger.success("Builtin browser terminated", tag="BUILTIN")
return True
except Exception as e:
self.logger.error(f"Error killing builtin browser: {str(e)}", tag="BUILTIN")
return False
async def get_builtin_browser_status(self) -> Dict[str, Any]:
"""
Get status information about the builtin browser.
Returns:
dict: Status information with running, cdp_url, and info fields
"""
browser_info = self.get_builtin_browser_info()
if not browser_info:
return {
'running': False,
'cdp_url': None,
'info': None
}
return {
'running': True,
'cdp_url': browser_info.get('cdp_url'),
'info': browser_info
}


@@ -341,6 +341,32 @@ For more documentation visit: https://github.com/unclecode/crawl4ai
crwl profiles # Select "Create new profile" option
# 2. Then use that profile to crawl authenticated content:
crwl https://site-requiring-login.com/dashboard -p my-profile-name
🔄 Builtin Browser Management:
# Start a builtin browser (runs in the background)
crwl browser start
# Check builtin browser status
crwl browser status
# Open a visible window to see the browser
crwl browser view --url https://example.com
# Stop the builtin browser
crwl browser stop
# Restart with different options
crwl browser restart --browser-type chromium --port 9223 --no-headless
# Use the builtin browser in your code
# (Just set browser_mode="builtin" in your BrowserConfig)
browser_config = BrowserConfig(
browser_mode="builtin",
headless=True
)
# Usage via CLI:
crwl https://example.com -b "browser_mode=builtin"
""" """
click.echo(examples) click.echo(examples)
@@ -575,6 +601,307 @@ def cli():
pass
@cli.group("browser")
def browser_cmd():
"""Manage browser instances for Crawl4AI
Commands to manage browser instances for Crawl4AI, including:
- status - Check status of the builtin browser
- start - Start a new builtin browser
- stop - Stop the running builtin browser
- restart - Restart the builtin browser
"""
pass
@browser_cmd.command("status")
def browser_status_cmd():
"""Show status of the builtin browser"""
profiler = BrowserProfiler()
try:
status = anyio.run(profiler.get_builtin_browser_status)
if status["running"]:
info = status["info"]
console.print(Panel(
f"[green]Builtin browser is running[/green]\n\n"
f"CDP URL: [cyan]{info['cdp_url']}[/cyan]\n"
f"Process ID: [yellow]{info['pid']}[/yellow]\n"
f"Browser type: [blue]{info['browser_type']}[/blue]\n"
f"User data directory: [magenta]{info['user_data_dir']}[/magenta]\n"
f"Started: [cyan]{time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(info['start_time']))}[/cyan]",
title="Builtin Browser Status",
border_style="green"
))
else:
console.print(Panel(
"[yellow]Builtin browser is not running[/yellow]\n\n"
"Use 'crwl browser start' to start a builtin browser",
title="Builtin Browser Status",
border_style="yellow"
))
except Exception as e:
console.print(f"[red]Error checking browser status: {str(e)}[/red]")
sys.exit(1)
@browser_cmd.command("start")
@click.option("--browser-type", "-b", type=click.Choice(["chromium", "firefox"]), default="chromium",
help="Browser type (default: chromium)")
@click.option("--port", "-p", type=int, default=9222, help="Debugging port (default: 9222)")
@click.option("--headless/--no-headless", default=True, help="Run browser in headless mode")
def browser_start_cmd(browser_type: str, port: int, headless: bool):
"""Start a builtin browser instance
This will start a persistent browser instance that can be used by Crawl4AI
by setting browser_mode="builtin" in BrowserConfig.
"""
profiler = BrowserProfiler()
# First check if browser is already running
status = anyio.run(profiler.get_builtin_browser_status)
if status["running"]:
console.print(Panel(
"[yellow]Builtin browser is already running[/yellow]\n\n"
f"CDP URL: [cyan]{status['cdp_url']}[/cyan]\n\n"
"Use 'crwl browser restart' to restart the browser",
title="Builtin Browser Start",
border_style="yellow"
))
return
try:
console.print(Panel(
f"[cyan]Starting builtin browser[/cyan]\n\n"
f"Browser type: [green]{browser_type}[/green]\n"
f"Debugging port: [yellow]{port}[/yellow]\n"
f"Headless: [cyan]{'Yes' if headless else 'No'}[/cyan]",
title="Builtin Browser Start",
border_style="cyan"
))
cdp_url = anyio.run(
profiler.launch_builtin_browser,
browser_type,
port,
headless
)
if cdp_url:
console.print(Panel(
f"[green]Builtin browser started successfully[/green]\n\n"
f"CDP URL: [cyan]{cdp_url}[/cyan]\n\n"
"This browser will be used automatically when setting browser_mode='builtin'",
title="Builtin Browser Start",
border_style="green"
))
else:
console.print(Panel(
"[red]Failed to start builtin browser[/red]",
title="Builtin Browser Start",
border_style="red"
))
sys.exit(1)
except Exception as e:
console.print(f"[red]Error starting builtin browser: {str(e)}[/red]")
sys.exit(1)
@browser_cmd.command("stop")
def browser_stop_cmd():
"""Stop the running builtin browser"""
profiler = BrowserProfiler()
try:
# First check if browser is running
status = anyio.run(profiler.get_builtin_browser_status)
if not status["running"]:
console.print(Panel(
"[yellow]No builtin browser is currently running[/yellow]",
title="Builtin Browser Stop",
border_style="yellow"
))
return
console.print(Panel(
"[cyan]Stopping builtin browser...[/cyan]",
title="Builtin Browser Stop",
border_style="cyan"
))
success = anyio.run(profiler.kill_builtin_browser)
if success:
console.print(Panel(
"[green]Builtin browser stopped successfully[/green]",
title="Builtin Browser Stop",
border_style="green"
))
else:
console.print(Panel(
"[red]Failed to stop builtin browser[/red]",
title="Builtin Browser Stop",
border_style="red"
))
sys.exit(1)
except Exception as e:
console.print(f"[red]Error stopping builtin browser: {str(e)}[/red]")
sys.exit(1)
@browser_cmd.command("view")
@click.option("--url", "-u", help="URL to navigate to (defaults to about:blank)")
def browser_view_cmd(url: Optional[str]):
"""
Open a visible window of the builtin browser
This command connects to the running builtin browser and opens a visible window,
allowing you to see what the browser is currently viewing or navigate to a URL.
"""
profiler = BrowserProfiler()
try:
# First check if browser is running
status = anyio.run(profiler.get_builtin_browser_status)
if not status["running"]:
console.print(Panel(
"[yellow]No builtin browser is currently running[/yellow]\n\n"
"Use 'crwl browser start' to start a builtin browser first",
title="Builtin Browser View",
border_style="yellow"
))
return
info = status["info"]
cdp_url = info["cdp_url"]
console.print(Panel(
f"[cyan]Opening visible window connected to builtin browser[/cyan]\n\n"
f"CDP URL: [green]{cdp_url}[/green]\n"
f"URL to load: [yellow]{url or 'about:blank'}[/yellow]",
title="Builtin Browser View",
border_style="cyan"
))
# Use the CDP URL to launch a new visible window
import subprocess
import os
# Determine the browser command based on platform
if sys.platform == "darwin": # macOS
browser_cmd = ["/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"]
elif sys.platform == "win32": # Windows
browser_cmd = ["C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe"]
else: # Linux
browser_cmd = ["google-chrome"]
# Add arguments
browser_args = [
f"--remote-debugging-port={info['debugging_port']}",
"--remote-debugging-address=localhost",
"--no-first-run",
"--no-default-browser-check"
]
# Add URL if provided
if url:
browser_args.append(url)
# Launch browser
try:
subprocess.Popen(browser_cmd + browser_args)
console.print("[green]Browser window opened. Close it when finished viewing.[/green]")
except Exception as e:
console.print(f"[red]Error launching browser: {str(e)}[/red]")
console.print(f"[yellow]Try connecting manually to {cdp_url} in Chrome or using the '--remote-debugging-port' flag.[/yellow]")
except Exception as e:
console.print(f"[red]Error viewing builtin browser: {str(e)}[/red]")
sys.exit(1)
@browser_cmd.command("restart")
@click.option("--browser-type", "-b", type=click.Choice(["chromium", "firefox"]), default=None,
help="Browser type (defaults to same as current)")
@click.option("--port", "-p", type=int, default=None, help="Debugging port (defaults to same as current)")
@click.option("--headless/--no-headless", default=None, help="Run browser in headless mode")
def browser_restart_cmd(browser_type: Optional[str], port: Optional[int], headless: Optional[bool]):
"""Restart the builtin browser
Stops the current builtin browser if running and starts a new one.
By default, uses the same configuration as the current browser.
"""
profiler = BrowserProfiler()
try:
# First check if browser is running and get its config
status = anyio.run(profiler.get_builtin_browser_status)
current_config = {}
if status["running"]:
info = status["info"]
current_config = {
"browser_type": info["browser_type"],
"port": info["debugging_port"],
"headless": True # Default assumption
}
# Stop the browser
console.print(Panel(
"[cyan]Stopping current builtin browser...[/cyan]",
title="Builtin Browser Restart",
border_style="cyan"
))
success = anyio.run(profiler.kill_builtin_browser)
if not success:
console.print(Panel(
"[red]Failed to stop current browser[/red]",
title="Builtin Browser Restart",
border_style="red"
))
sys.exit(1)
# Use provided options or defaults from current config
browser_type = browser_type or current_config.get("browser_type", "chromium")
port = port or current_config.get("port", 9222)
headless = headless if headless is not None else current_config.get("headless", True)
# Start a new browser
console.print(Panel(
f"[cyan]Starting new builtin browser[/cyan]\n\n"
f"Browser type: [green]{browser_type}[/green]\n"
f"Debugging port: [yellow]{port}[/yellow]\n"
f"Headless: [cyan]{'Yes' if headless else 'No'}[/cyan]",
title="Builtin Browser Restart",
border_style="cyan"
))
cdp_url = anyio.run(
profiler.launch_builtin_browser,
browser_type,
port,
headless
)
if cdp_url:
console.print(Panel(
f"[green]Builtin browser restarted successfully[/green]\n\n"
f"CDP URL: [cyan]{cdp_url}[/cyan]",
title="Builtin Browser Restart",
border_style="green"
))
else:
console.print(Panel(
"[red]Failed to restart builtin browser[/red]",
title="Builtin Browser Restart",
border_style="red"
))
sys.exit(1)
except Exception as e:
console.print(f"[red]Error restarting builtin browser: {str(e)}[/red]")
sys.exit(1)
@cli.command("cdp") @cli.command("cdp")
@click.option("--user-data-dir", "-d", help="Directory to use for browser data (will be created if it doesn't exist)") @click.option("--user-data-dir", "-d", help="Directory to use for browser data (will be created if it doesn't exist)")
@click.option("--port", "-P", type=int, default=9222, help="Debugging port (default: 9222)") @click.option("--port", "-P", type=int, default=9222, help="Debugging port (default: 9222)")
@@ -834,6 +1161,7 @@ def default(url: str, example: bool, browser_config: str, crawler_config: str, f
crwl profiles - Manage browser profiles for identity-based crawling
crwl crawl - Crawl a website with advanced options
crwl cdp - Launch browser with CDP debugging enabled
crwl browser - Manage builtin browser (start, stop, status, restart)
crwl examples - Show more usage examples
"""


@@ -0,0 +1,837 @@
import time
import uuid
import threading
import psutil
from datetime import datetime, timedelta
from typing import Dict, Optional, List
from rich.console import Console
from rich.layout import Layout
from rich.panel import Panel
from rich.table import Table
from rich.text import Text
from rich.live import Live
from rich import box
from ..models import CrawlStatus
class TerminalUI:
"""Terminal user interface for CrawlerMonitor using rich library."""
def __init__(self, refresh_rate: float = 1.0, max_width: int = 120):
"""
Initialize the terminal UI.
Args:
refresh_rate: How often to refresh the UI (in seconds)
max_width: Maximum width of the UI in characters
"""
self.console = Console(width=max_width)
self.layout = Layout()
self.refresh_rate = refresh_rate
self.stop_event = threading.Event()
self.ui_thread = None
self.monitor = None # Will be set by CrawlerMonitor
self.max_width = max_width
# Setup layout - vertical layout (top to bottom)
self.layout.split(
Layout(name="header", size=3),
Layout(name="pipeline_status", size=10),
Layout(name="task_details", ratio=1),
Layout(name="footer", size=3) # Increased footer size to fit all content
)
def start(self, monitor):
"""Start the UI thread."""
self.monitor = monitor
self.stop_event.clear()
self.ui_thread = threading.Thread(target=self._ui_loop)
self.ui_thread.daemon = True
self.ui_thread.start()
def stop(self):
"""Stop the UI thread."""
if self.ui_thread and self.ui_thread.is_alive():
self.stop_event.set()
# Only try to join if we're not in the UI thread
# This prevents "cannot join current thread" errors
if threading.current_thread() != self.ui_thread:
self.ui_thread.join(timeout=5.0)
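The guard in `stop()` matters because `Thread.join()` raises `RuntimeError` when called on the current thread, which would happen whenever the UI loop itself triggers shutdown (e.g. the user presses 'q'). A minimal sketch of the pattern, with hypothetical names:

```python
import threading

class Worker:
    """Background loop whose stop() is safe to call from any thread."""

    def __init__(self):
        self.stop_event = threading.Event()
        self.thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        # wait() doubles as the sleep and the stop check
        while not self.stop_event.wait(timeout=0.05):
            pass  # periodic work would go here

    def start(self):
        self.thread.start()

    def stop(self):
        self.stop_event.set()
        # Joining the current thread raises RuntimeError, so guard first
        if threading.current_thread() is not self.thread:
            self.thread.join(timeout=5.0)
```

When `stop()` is called from inside `_loop`, the event is set and the loop exits on its own; only external callers actually block on the join.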
def _ui_loop(self):
"""Main UI rendering loop."""
import sys
import select
import termios
import tty
# Setup terminal for non-blocking input
old_settings = termios.tcgetattr(sys.stdin)
try:
tty.setcbreak(sys.stdin.fileno())
# Use Live display to render the UI
with Live(self.layout, refresh_per_second=1/self.refresh_rate, screen=True) as live:
self.live = live # Store the live display for updates
# Main UI loop
while not self.stop_event.is_set():
self._update_display()
# Check for key press (non-blocking)
if select.select([sys.stdin], [], [], 0)[0]:
key = sys.stdin.read(1)
# Check for 'q' to quit
if key == 'q':
# Signal stop but don't call monitor.stop() from UI thread
# as it would cause the thread to try to join itself
self.stop_event.set()
self.monitor.is_running = False
break
time.sleep(self.refresh_rate)
# Just check if the monitor was stopped
if not self.monitor.is_running:
break
finally:
# Restore terminal settings
termios.tcsetattr(sys.stdin, termios.TCSADRAIN, old_settings)
def _update_display(self):
"""Update the terminal display with current statistics."""
if not self.monitor:
return
# Update crawler status panel
self.layout["header"].update(self._create_status_panel())
# Update pipeline status panel and task details panel
self.layout["pipeline_status"].update(self._create_pipeline_panel())
self.layout["task_details"].update(self._create_task_details_panel())
# Update footer
self.layout["footer"].update(self._create_footer())
def _create_status_panel(self) -> Panel:
"""Create the crawler status panel."""
summary = self.monitor.get_summary()
# Format memory status with icon
memory_status = self.monitor.get_memory_status()
memory_icon = "🟢" # Default NORMAL
if memory_status == "PRESSURE":
memory_icon = "🟠"
elif memory_status == "CRITICAL":
memory_icon = "🔴"
# Get current memory usage
current_memory = psutil.Process().memory_info().rss / (1024 * 1024) # MB
memory_percent = (current_memory / psutil.virtual_memory().total) * 100
# Format runtime
runtime = self.monitor._format_time(time.time() - self.monitor.start_time if self.monitor.start_time else 0)
# Create the status text
status_text = Text()
status_text.append(f"Web Crawler Dashboard | Runtime: {runtime} | Memory: {memory_percent:.1f}% {memory_icon}\n")
status_text.append(f"Status: {memory_status} | URLs: {summary['urls_completed']}/{summary['urls_total']} | ")
status_text.append(f"Peak Mem: {summary['peak_memory_percent']:.1f}% at {self.monitor._format_time(summary['peak_memory_time'])}")
return Panel(status_text, title="Crawler Status", border_style="blue")
def _create_pipeline_panel(self) -> Panel:
"""Create the pipeline status panel."""
summary = self.monitor.get_summary()
queue_stats = self.monitor.get_queue_stats()
# Create a table for status counts
table = Table(show_header=True, box=None)
table.add_column("Status", style="cyan")
table.add_column("Count", justify="right")
table.add_column("Percentage", justify="right")
table.add_column("Stat", style="cyan")
table.add_column("Value", justify="right")
# Calculate overall progress
progress = f"{summary['urls_completed']}/{summary['urls_total']}"
progress_percent = f"{summary['completion_percentage']:.1f}%"
# Add rows for each status
table.add_row(
"Overall Progress",
progress,
progress_percent,
"Est. Completion",
summary.get('estimated_completion_time', "N/A")
)
# Add rows for each status
status_counts = summary['status_counts']
total = summary['urls_total'] or 1 # Avoid division by zero
# Status rows
table.add_row(
"Completed",
str(status_counts.get(CrawlStatus.COMPLETED.name, 0)),
f"{status_counts.get(CrawlStatus.COMPLETED.name, 0) / total * 100:.1f}%",
"Avg. Time/URL",
f"{summary.get('avg_task_duration', 0):.2f}s"
)
table.add_row(
"Failed",
str(status_counts.get(CrawlStatus.FAILED.name, 0)),
f"{status_counts.get(CrawlStatus.FAILED.name, 0) / total * 100:.1f}%",
"Concurrent Tasks",
str(status_counts.get(CrawlStatus.IN_PROGRESS.name, 0))
)
table.add_row(
"In Progress",
str(status_counts.get(CrawlStatus.IN_PROGRESS.name, 0)),
f"{status_counts.get(CrawlStatus.IN_PROGRESS.name, 0) / total * 100:.1f}%",
"Queue Size",
str(queue_stats['total_queued'])
)
table.add_row(
"Queued",
str(status_counts.get(CrawlStatus.QUEUED.name, 0)),
f"{status_counts.get(CrawlStatus.QUEUED.name, 0) / total * 100:.1f}%",
"Max Wait Time",
f"{queue_stats['highest_wait_time']:.1f}s"
)
# Requeued is a special case as it's not a status
requeued_count = summary.get('requeued_count', 0)
table.add_row(
"Requeued",
str(requeued_count),
f"{summary.get('requeue_rate', 0):.1f}%",
"Avg Wait Time",
f"{queue_stats['avg_wait_time']:.1f}s"
)
# Add empty row for spacing
table.add_row(
"",
"",
"",
"Requeue Rate",
f"{summary.get('requeue_rate', 0):.1f}%"
)
return Panel(table, title="Pipeline Status", border_style="green")
def _create_task_details_panel(self) -> Panel:
"""Create the task details panel."""
# Create a table for task details
table = Table(show_header=True, expand=True)
table.add_column("Task ID", style="cyan", no_wrap=True, width=10)
table.add_column("URL", style="blue", ratio=3)
table.add_column("Status", style="green", width=15)
table.add_column("Memory", justify="right", width=8)
table.add_column("Peak", justify="right", width=8)
table.add_column("Duration", justify="right", width=10)
# Get all task stats
task_stats = self.monitor.get_all_task_stats()
# Add summary row
active_tasks = sum(1 for stats in task_stats.values()
if stats['status'] == CrawlStatus.IN_PROGRESS.name)
total_memory = sum(stats['memory_usage'] for stats in task_stats.values())
total_peak = sum(stats['peak_memory'] for stats in task_stats.values())
# Summary row with separators
table.add_row(
"SUMMARY",
f"Total: {len(task_stats)}",
f"Active: {active_tasks}",
f"{total_memory:.1f}",
f"{total_peak:.1f}",
"N/A"
)
# Add a separator
table.add_row("─" * 10, "─" * 20, "─" * 10, "─" * 8, "─" * 8, "─" * 10)
# Status icons
status_icons = {
CrawlStatus.QUEUED.name: "⏳",
CrawlStatus.IN_PROGRESS.name: "🔄",
CrawlStatus.COMPLETED.name: "✅",
CrawlStatus.FAILED.name: "❌"
}
# Calculate how many rows we can display based on available space
# We can display more rows now that we have a dedicated panel
display_count = min(len(task_stats), 20) # Display up to 20 tasks
# Add rows for each task
for task_id, stats in sorted(
list(task_stats.items())[:display_count],
# Sort: 1. IN_PROGRESS first, 2. QUEUED, 3. COMPLETED/FAILED by recency
key=lambda x: (
0 if x[1]['status'] == CrawlStatus.IN_PROGRESS.name else
1 if x[1]['status'] == CrawlStatus.QUEUED.name else
2,
-1 * (x[1].get('end_time', 0) or 0) # Most recent first
)
):
# Truncate task_id and URL for display
short_id = task_id[:8]
url = stats['url']
if len(url) > 50: # Allow longer URLs in the dedicated panel
url = url[:47] + "..."
# Format status with icon
status = f"{status_icons.get(stats['status'], '?')} {stats['status']}"
# Add row
table.add_row(
short_id,
url,
status,
f"{stats['memory_usage']:.1f}",
f"{stats['peak_memory']:.1f}",
stats['duration'] if 'duration' in stats else "0:00"
)
return Panel(table, title="Task Details", border_style="yellow")
def _create_footer(self) -> Panel:
"""Create the footer panel."""
from rich.columns import Columns
from rich.align import Align
memory_status = self.monitor.get_memory_status()
memory_icon = "🟢" # Default NORMAL
if memory_status == "PRESSURE":
memory_icon = "🟠"
elif memory_status == "CRITICAL":
memory_icon = "🔴"
# Left section - memory status
left_text = Text()
left_text.append("Memory Status: ", style="bold")
status_style = "green" if memory_status == "NORMAL" else "yellow" if memory_status == "PRESSURE" else "red bold"
left_text.append(f"{memory_icon} {memory_status}", style=status_style)
# Center section - copyright
center_text = Text("© Crawl4AI 2025 | Made by UncleCode", style="cyan italic")
# Right section - quit instruction
right_text = Text()
right_text.append("Press ", style="bold")
right_text.append("q", style="white on blue")
right_text.append(" to quit", style="bold")
# Create columns with the three sections
footer_content = Columns(
[
Align.left(left_text),
Align.center(center_text),
Align.right(right_text)
],
expand=True
)
# Create a more visible footer panel
return Panel(
footer_content,
border_style="white",
padding=(0, 1) # Add padding for better visibility
)
class CrawlerMonitor:
"""
Comprehensive monitoring and visualization system for tracking web crawler operations in real-time.
Provides a terminal-based dashboard that displays task statuses, memory usage, queue statistics,
and performance metrics.
"""
def __init__(
self,
urls_total: int = 0,
refresh_rate: float = 1.0,
enable_ui: bool = True,
max_width: int = 120
):
"""
Initialize the CrawlerMonitor.
Args:
urls_total: Total number of URLs to be crawled
refresh_rate: How often to refresh the UI (in seconds)
enable_ui: Whether to display the terminal UI
max_width: Maximum width of the UI in characters
"""
# Core monitoring attributes
self.stats = {} # Task ID -> stats dict
self.memory_status = "NORMAL"
self.start_time = None
self.end_time = None
self.is_running = False
self.queue_stats = {
"total_queued": 0,
"highest_wait_time": 0.0,
"avg_wait_time": 0.0
}
self.urls_total = urls_total
self.urls_completed = 0
self.peak_memory_percent = 0.0
self.peak_memory_time = 0.0
# Status counts
self.status_counts = {
CrawlStatus.QUEUED.name: 0,
CrawlStatus.IN_PROGRESS.name: 0,
CrawlStatus.COMPLETED.name: 0,
CrawlStatus.FAILED.name: 0
}
# Requeue tracking
self.requeued_count = 0
# Thread-safety
self._lock = threading.RLock()
# Terminal UI
self.enable_ui = enable_ui
self.terminal_ui = TerminalUI(
refresh_rate=refresh_rate,
max_width=max_width
) if enable_ui else None
def start(self):
"""
Start the monitoring session.
- Initializes the start_time
- Sets is_running to True
- Starts the terminal UI if enabled
"""
with self._lock:
self.start_time = time.time()
self.is_running = True
# Start the terminal UI
if self.enable_ui and self.terminal_ui:
self.terminal_ui.start(self)
def stop(self):
"""
Stop the monitoring session.
- Records end_time
- Sets is_running to False
- Stops the terminal UI
- Generates final summary statistics
"""
with self._lock:
self.end_time = time.time()
self.is_running = False
# Stop the terminal UI
if self.enable_ui and self.terminal_ui:
self.terminal_ui.stop()
def add_task(self, task_id: str, url: str):
"""
Register a new task with the monitor.
Args:
task_id: Unique identifier for the task
url: URL being crawled
The task is initialized with:
- status: QUEUED
- url: The URL to crawl
- enqueue_time: Current time
- memory_usage: 0
- peak_memory: 0
- wait_time: 0
- retry_count: 0
"""
with self._lock:
self.stats[task_id] = {
"task_id": task_id,
"url": url,
"status": CrawlStatus.QUEUED.name,
"enqueue_time": time.time(),
"start_time": None,
"end_time": None,
"memory_usage": 0.0,
"peak_memory": 0.0,
"error_message": "",
"wait_time": 0.0,
"retry_count": 0,
"duration": "0:00",
"counted_requeue": False
}
# Update status counts
self.status_counts[CrawlStatus.QUEUED.name] += 1
def update_task(
self,
task_id: str,
status: Optional[CrawlStatus] = None,
start_time: Optional[float] = None,
end_time: Optional[float] = None,
memory_usage: Optional[float] = None,
peak_memory: Optional[float] = None,
error_message: Optional[str] = None,
retry_count: Optional[int] = None,
wait_time: Optional[float] = None
):
"""
Update statistics for a specific task.
Args:
task_id: Unique identifier for the task
status: New status (QUEUED, IN_PROGRESS, COMPLETED, FAILED)
start_time: When task execution started
end_time: When task execution ended
memory_usage: Current memory usage in MB
peak_memory: Maximum memory usage in MB
error_message: Error description if failed
retry_count: Number of retry attempts
wait_time: Time spent in queue
Updates task statistics and updates status counts.
If status changes, decrements old status count and
increments new status count.
"""
with self._lock:
# Check if task exists
if task_id not in self.stats:
return
task_stats = self.stats[task_id]
# Update status counts if status is changing
old_status = task_stats["status"]
if status and status.name != old_status:
self.status_counts[old_status] -= 1
self.status_counts[status.name] += 1
# Track completion
if status == CrawlStatus.COMPLETED:
self.urls_completed += 1
# Track requeues
if old_status in [CrawlStatus.COMPLETED.name, CrawlStatus.FAILED.name] and not task_stats.get("counted_requeue", False):
self.requeued_count += 1
task_stats["counted_requeue"] = True
# Update task statistics
if status:
task_stats["status"] = status.name
if start_time is not None:
task_stats["start_time"] = start_time
if end_time is not None:
task_stats["end_time"] = end_time
if memory_usage is not None:
task_stats["memory_usage"] = memory_usage
# Update peak memory if necessary
                current_percent = (memory_usage * 1024 * 1024 / psutil.virtual_memory().total) * 100  # memory_usage is in MB; total is in bytes
if current_percent > self.peak_memory_percent:
self.peak_memory_percent = current_percent
self.peak_memory_time = time.time()
if peak_memory is not None:
task_stats["peak_memory"] = peak_memory
if error_message is not None:
task_stats["error_message"] = error_message
if retry_count is not None:
task_stats["retry_count"] = retry_count
if wait_time is not None:
task_stats["wait_time"] = wait_time
# Calculate duration
if task_stats["start_time"]:
end = task_stats["end_time"] or time.time()
duration = end - task_stats["start_time"]
task_stats["duration"] = self._format_time(duration)
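The percent-of-total figure used for peak-memory tracking can be sketched as a standalone, unit-explicit helper. This is a hedged sketch, not the class's own API: the name `memory_percent` is hypothetical, and it assumes `memory_usage` arrives in MB (as the docstring above states) while the system total is in bytes.

```python
def memory_percent(memory_usage_mb: float, total_bytes: int) -> float:
    """Convert a per-task memory figure in MB to a percentage of total system memory.

    Hypothetical helper; keeps the MB -> bytes conversion explicit so the
    two quantities are compared in the same unit.
    """
    return (memory_usage_mb * 1024 * 1024 / total_bytes) * 100
```

For example, a 512 MB task on an 8 GiB machine comes out to 6.25%.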
def update_memory_status(self, status: str):
"""
Update the current memory status.
Args:
status: Memory status (NORMAL, PRESSURE, CRITICAL, or custom)
Also updates the UI to reflect the new status.
"""
with self._lock:
self.memory_status = status
def update_queue_statistics(
self,
total_queued: int,
highest_wait_time: float,
avg_wait_time: float
):
"""
Update statistics related to the task queue.
Args:
total_queued: Number of tasks currently in queue
highest_wait_time: Longest wait time of any queued task
avg_wait_time: Average wait time across all queued tasks
"""
with self._lock:
self.queue_stats = {
"total_queued": total_queued,
"highest_wait_time": highest_wait_time,
"avg_wait_time": avg_wait_time
}
def get_task_stats(self, task_id: str) -> Dict:
"""
Get statistics for a specific task.
Args:
task_id: Unique identifier for the task
Returns:
Dictionary containing all task statistics
"""
with self._lock:
return self.stats.get(task_id, {}).copy()
def get_all_task_stats(self) -> Dict[str, Dict]:
"""
Get statistics for all tasks.
Returns:
Dictionary mapping task_ids to their statistics
"""
with self._lock:
return self.stats.copy()
def get_memory_status(self) -> str:
"""
Get the current memory status.
Returns:
Current memory status string
"""
with self._lock:
return self.memory_status
def get_queue_stats(self) -> Dict:
"""
Get current queue statistics.
Returns:
Dictionary with queue statistics including:
- total_queued: Number of tasks in queue
- highest_wait_time: Longest wait time
- avg_wait_time: Average wait time
"""
with self._lock:
return self.queue_stats.copy()
def get_summary(self) -> Dict:
"""
Get a summary of all crawler statistics.
Returns:
Dictionary containing:
- runtime: Total runtime in seconds
- urls_total: Total URLs to process
- urls_completed: Number of completed URLs
- completion_percentage: Percentage complete
- status_counts: Count of tasks in each status
- memory_status: Current memory status
- peak_memory_percent: Highest memory usage
- peak_memory_time: When peak memory occurred
- avg_task_duration: Average task processing time
- estimated_completion_time: Projected finish time
- requeue_rate: Percentage of tasks requeued
"""
with self._lock:
# Calculate runtime
current_time = time.time()
runtime = current_time - (self.start_time or current_time)
# Calculate completion percentage
completion_percentage = 0
if self.urls_total > 0:
completion_percentage = (self.urls_completed / self.urls_total) * 100
# Calculate average task duration for completed tasks
completed_tasks = [
task for task in self.stats.values()
if task["status"] == CrawlStatus.COMPLETED.name and task.get("start_time") and task.get("end_time")
]
avg_task_duration = 0
if completed_tasks:
total_duration = sum(task["end_time"] - task["start_time"] for task in completed_tasks)
avg_task_duration = total_duration / len(completed_tasks)
# Calculate requeue rate
requeue_rate = 0
if len(self.stats) > 0:
requeue_rate = (self.requeued_count / len(self.stats)) * 100
# Calculate estimated completion time
estimated_completion_time = "N/A"
if avg_task_duration > 0 and self.urls_total > 0 and self.urls_completed > 0:
remaining_tasks = self.urls_total - self.urls_completed
estimated_seconds = remaining_tasks * avg_task_duration
estimated_completion_time = self._format_time(estimated_seconds)
return {
"runtime": runtime,
"urls_total": self.urls_total,
"urls_completed": self.urls_completed,
"completion_percentage": completion_percentage,
"status_counts": self.status_counts.copy(),
"memory_status": self.memory_status,
"peak_memory_percent": self.peak_memory_percent,
"peak_memory_time": self.peak_memory_time,
"avg_task_duration": avg_task_duration,
"estimated_completion_time": estimated_completion_time,
"requeue_rate": requeue_rate,
"requeued_count": self.requeued_count
}
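The completion estimate computed in `get_summary` reduces to a small standalone helper. A sketch under stated assumptions: the name `estimate_remaining_seconds` is hypothetical, and like the method above it projects serially from average task duration.

```python
def estimate_remaining_seconds(urls_total: int, urls_completed: int,
                               avg_task_duration: float):
    """Project seconds remaining, or None when there is no basis for a projection.

    Mirrors the guard conditions used above: a projection needs a positive
    average duration, a known total, and at least one completed task.
    """
    if avg_task_duration <= 0 or urls_total <= 0 or urls_completed <= 0:
        return None
    return (urls_total - urls_completed) * avg_task_duration
```

Note the projection assumes tasks complete one after another; with N concurrent crawl workers the real wall-clock finish time is roughly N times shorter.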
def render(self):
"""
Render the terminal UI.
This is the main UI rendering loop that:
1. Updates all statistics
2. Formats the display
3. Renders the ASCII interface
4. Handles keyboard input
Note: The actual rendering is handled by the TerminalUI class
which uses the rich library's Live display.
"""
if self.enable_ui and self.terminal_ui:
# Force an update of the UI
if hasattr(self.terminal_ui, '_update_display'):
self.terminal_ui._update_display()
def _format_time(self, seconds: float) -> str:
"""
Format time in hours:minutes:seconds.
Args:
seconds: Time in seconds
Returns:
Formatted time string (e.g., "1:23:45")
"""
        # Work from raw seconds; timedelta.seconds would drop whole days for long runs
        total_seconds = int(seconds)
        hours, remainder = divmod(total_seconds, 3600)
        minutes, seconds = divmod(remainder, 60)
if hours > 0:
return f"{hours}:{minutes:02}:{seconds:02}"
else:
return f"{minutes}:{seconds:02}"
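The duration formatting above can be sketched as a standalone function (the name `format_duration` is hypothetical). Dividing raw seconds directly keeps hours correct for crawls that run past 24 hours:

```python
def format_duration(seconds: float) -> str:
    """Format a duration as H:MM:SS, or M:SS when under an hour.

    Standalone sketch of the _format_time logic: divmod on total seconds,
    zero-padded minutes and seconds, hours shown only when nonzero.
    """
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours > 0:
        return f"{hours}:{minutes:02}:{secs:02}"
    return f"{minutes}:{secs:02}"
```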
def _calculate_estimated_completion(self) -> str:
"""
Calculate estimated completion time based on current progress.
Returns:
Formatted time string
"""
summary = self.get_summary()
return summary.get("estimated_completion_time", "N/A")
# Example code for testing
if __name__ == "__main__":
# Initialize the monitor
monitor = CrawlerMonitor(urls_total=100)
# Start monitoring
monitor.start()
try:
# Simulate some tasks
for i in range(20):
task_id = str(uuid.uuid4())
url = f"https://example.com/page{i}"
monitor.add_task(task_id, url)
# Simulate 20% of tasks are already running
if i < 4:
monitor.update_task(
task_id=task_id,
status=CrawlStatus.IN_PROGRESS,
start_time=time.time() - 30, # Started 30 seconds ago
memory_usage=10.5
)
# Simulate 10% of tasks are completed
if i >= 4 and i < 6:
start_time = time.time() - 60
end_time = time.time() - 15
monitor.update_task(
task_id=task_id,
status=CrawlStatus.IN_PROGRESS,
start_time=start_time,
memory_usage=8.2
)
monitor.update_task(
task_id=task_id,
status=CrawlStatus.COMPLETED,
end_time=end_time,
memory_usage=0,
peak_memory=15.7
)
# Simulate 5% of tasks fail
if i >= 6 and i < 7:
start_time = time.time() - 45
end_time = time.time() - 20
monitor.update_task(
task_id=task_id,
status=CrawlStatus.IN_PROGRESS,
start_time=start_time,
memory_usage=12.3
)
monitor.update_task(
task_id=task_id,
status=CrawlStatus.FAILED,
end_time=end_time,
memory_usage=0,
peak_memory=18.2,
error_message="Connection timeout"
)
# Simulate memory pressure
monitor.update_memory_status("PRESSURE")
# Simulate queue statistics
monitor.update_queue_statistics(
total_queued=16, # 20 - 4 (in progress)
highest_wait_time=120.5,
avg_wait_time=60.2
)
# Keep the monitor running for a demonstration
print("Crawler Monitor is running. Press 'q' to exit.")
while monitor.is_running:
time.sleep(0.1)
except KeyboardInterrupt:
print("\nExiting crawler monitor...")
finally:
# Stop the monitor
monitor.stop()
print("Crawler monitor exited successfully.")

View File

@@ -4,7 +4,8 @@ from dotenv import load_dotenv
 load_dotenv()  # Load environment variables from .env file
 # Default provider, ONLY used when the extraction strategy is LLMExtractionStrategy
-DEFAULT_PROVIDER = "openai/gpt-4o-mini"
+DEFAULT_PROVIDER = "openai/gpt-4o"
+DEFAULT_PROVIDER_API_KEY = "OPENAI_API_KEY"
 MODEL_REPO_BRANCH = "new-release-0.0.2"
 # Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy
 PROVIDER_MODELS = {

View File

@@ -1,6 +1,6 @@
 from crawl4ai import BrowserConfig, AsyncWebCrawler, CrawlerRunConfig, CacheMode
 from crawl4ai.hub import BaseCrawler
-from crawl4ai.utils import optimize_html, get_home_folder
+from crawl4ai.utils import optimize_html, get_home_folder, preprocess_html_for_schema
 from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
 from pathlib import Path
 import json
@@ -68,7 +68,8 @@ class GoogleSearchCrawler(BaseCrawler):
         home_dir = get_home_folder() if not schema_cache_path else schema_cache_path
         os.makedirs(f"{home_dir}/schema", exist_ok=True)
-        cleaned_html = optimize_html(html, threshold=100)
+        # cleaned_html = optimize_html(html, threshold=100)
+        cleaned_html = preprocess_html_for_schema(html)
         organic_schema = None
         if os.path.exists(f"{home_dir}/schema/organic_schema.json"):

View File

@@ -7,6 +7,7 @@ from contextvars import ContextVar
 from ..types import AsyncWebCrawler, CrawlerRunConfig, CrawlResult, RunManyReturn

 class DeepCrawlDecorator:
     """Decorator that adds deep crawling capability to arun method."""
     deep_crawl_active = ContextVar("deep_crawl_active", default=False)
@@ -59,7 +60,8 @@ class DeepCrawlStrategy(ABC):
         start_url: str,
         crawler: AsyncWebCrawler,
         config: CrawlerRunConfig,
-    ) -> List[CrawlResult]:
+    # ) -> List[CrawlResult]:
+    ) -> RunManyReturn:
         """
         Batch (non-streaming) mode:
         Processes one BFS level at a time, then yields all the results.
@@ -72,7 +74,8 @@ class DeepCrawlStrategy(ABC):
         start_url: str,
         crawler: AsyncWebCrawler,
         config: CrawlerRunConfig,
-    ) -> AsyncGenerator[CrawlResult, None]:
+    # ) -> AsyncGenerator[CrawlResult, None]:
+    ) -> RunManyReturn:
         """
         Streaming mode:
         Processes one BFS level at a time and yields results immediately as they arrive.

View File

@@ -9,7 +9,7 @@ from ..models import TraversalStats
 from .filters import FilterChain
 from .scorers import URLScorer
 from . import DeepCrawlStrategy
-from ..types import AsyncWebCrawler, CrawlerRunConfig, CrawlResult
+from ..types import AsyncWebCrawler, CrawlerRunConfig, CrawlResult, RunManyReturn
 from ..utils import normalize_url_for_deep_crawl, efficient_normalize_url_for_deep_crawl
 from math import inf as infinity
@@ -143,7 +143,8 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
         start_url: str,
         crawler: AsyncWebCrawler,
         config: CrawlerRunConfig,
-    ) -> List[CrawlResult]:
+    # ) -> List[CrawlResult]:
+    ) -> RunManyReturn:
         """
         Batch (non-streaming) mode:
         Processes one BFS level at a time, then yields all the results.
@@ -191,7 +192,8 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
         start_url: str,
         crawler: AsyncWebCrawler,
         config: CrawlerRunConfig,
-    ) -> AsyncGenerator[CrawlResult, None]:
+    # ) -> AsyncGenerator[CrawlResult, None]:
+    ) -> RunManyReturn:
         """
         Streaming mode:
         Processes one BFS level at a time and yields results immediately as they arrive.

View File

@@ -3,7 +3,7 @@ from typing import AsyncGenerator, Optional, Set, Dict, List, Tuple
 from ..models import CrawlResult
 from .bfs_strategy import BFSDeepCrawlStrategy  # noqa
-from ..types import AsyncWebCrawler, CrawlerRunConfig
+from ..types import AsyncWebCrawler, CrawlerRunConfig, RunManyReturn

 class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
     """
@@ -17,7 +17,8 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
         start_url: str,
         crawler: AsyncWebCrawler,
         config: CrawlerRunConfig,
-    ) -> List[CrawlResult]:
+    # ) -> List[CrawlResult]:
+    ) -> RunManyReturn:
         """
         Batch (non-streaming) DFS mode.
         Uses a stack to traverse URLs in DFS order, aggregating CrawlResults into a list.
@@ -65,7 +66,8 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
         start_url: str,
         crawler: AsyncWebCrawler,
         config: CrawlerRunConfig,
-    ) -> AsyncGenerator[CrawlResult, None]:
+    # ) -> AsyncGenerator[CrawlResult, None]:
+    ) -> RunManyReturn:
         """
         Streaming DFS mode.
         Uses a stack to traverse URLs in DFS order and yields CrawlResults as they become available.

View File

@@ -34,7 +34,7 @@ from .model_loader import (
     calculate_batch_size
 )
-from .types import LLMConfig
+from .types import LLMConfig, create_llm_config
 from functools import partial
 import numpy as np
@@ -757,8 +757,6 @@ class LLMExtractionStrategy(ExtractionStrategy):
 #######################################################
 # New extraction strategies for JSON-based extraction #
 #######################################################
 class JsonElementExtractionStrategy(ExtractionStrategy):
     """
     Abstract base class for extracting structured JSON from HTML content.
@@ -1049,7 +1047,7 @@ class JsonElementExtractionStrategy(ExtractionStrategy):
         schema_type: str = "CSS", # or XPATH
         query: str = None,
         target_json_example: str = None,
-        llm_config: 'LLMConfig' = None,
+        llm_config: 'LLMConfig' = create_llm_config(),
         provider: str = None,
         api_token: str = None,
         **kwargs
@@ -1081,7 +1079,7 @@ class JsonElementExtractionStrategy(ExtractionStrategy):
         # Build the prompt
         system_message = {
             "role": "system",
-            "content": f"""You specialize in generating special JSON schemas for web scraping. This schema uses CSS or XPATH selectors to present a repetitive pattern in crawled HTML, such as a product in a product list or a search result item in a list of search results. You use this JSON schema to pass to a language model along with the HTML content to extract structured data from the HTML. The language model uses the JSON schema to extract data from the HTML and retrieve values for fields in the JSON schema, following the schema.
+            "content": f"""You specialize in generating special JSON schemas for web scraping. This schema uses CSS or XPATH selectors to present a repetitive pattern in crawled HTML, such as a product in a product list or a search result item in a list of search results. We use this JSON schema to pass to a language model along with the HTML content to extract structured data from the HTML. The language model uses the JSON schema to extract data from the HTML and retrieve values for fields in the JSON schema, following the schema.

 Generating this HTML manually is not feasible, so you need to generate the JSON schema using the HTML content. The HTML copied from the crawled website is provided below, which we believe contains the repetitive pattern.
@@ -1095,9 +1093,10 @@ Generating this HTML manually is not feasible, so you need to generate the JSON
 In this context, the following items may or may not be present:
 - Example of target JSON object: This is a sample of the final JSON object that we hope to extract from the HTML using the schema you are generating.
 - Extra Instructions: This is optional instructions to consider when generating the schema provided by the user.
+- Query or explanation of target/goal data item: This is a description of what data we are trying to extract from the HTML. This explanation means we're not sure about the rigid schema of the structures we want, so we leave it to you to use your expertise to create the best and most comprehensive structures aimed at maximizing data extraction from this page. You must ensure that you do not pick up nuances that may exist on a particular page. The focus should be on the data we are extracting, and it must be valid, safe, and robust based on the given HTML.

-# What if there is no example of target JSON object?
-In this scenario, use your best judgment to generate the schema. Try to maximize the number of fields that you can extract from the HTML.
+# What if there is no example of target JSON object and also no extra instructions or even no explanation of target/goal data item?
+In this scenario, use your best judgment to generate the schema. You need to examine the content of the page and understand the data it provides. If the page contains repetitive data, such as lists of items, products, jobs, places, books, or movies, focus on one single item that repeats. If the page is a detailed page about one product or item, create a schema to extract the entire structured data. At this stage, you must think and decide for yourself. Try to maximize the number of fields that you can extract from the HTML.

 # What are the instructions and details for this schema generation?
 {prompt_template}"""
@@ -1114,11 +1113,18 @@ In this scenario, use your best judgment to generate the schema. Try to maximize
         }
         if query:
-            user_message["content"] += f"\n\nImportant Notes to Consider:\n{query}"
+            user_message["content"] += f"\n\n## Query or explanation of target/goal data item:\n{query}"
         if target_json_example:
-            user_message["content"] += f"\n\nExample of target JSON object:\n{target_json_example}"
+            user_message["content"] += f"\n\n## Example of target JSON object:\n```json\n{target_json_example}\n```"
+
+        if query and not target_json_example:
+            user_message["content"] += """IMPORTANT: To remind you, in this process, we are not providing a rigid example of the adjacent objects we seek. We rely on your understanding of the explanation provided in the above section. Make sure to grasp what we are looking for and, based on that, create the best schema.."""
+        elif not query and target_json_example:
+            user_message["content"] += """IMPORTANT: Please remember that in this process, we provided a proper example of a target JSON object. Make sure to adhere to the structure and create a schema that exactly fits this example. If you find that some elements on the page do not match completely, vote for the majority."""
+        elif not query and not target_json_example:
+            user_message["content"] += """IMPORTANT: Since we neither have a query nor an example, it is crucial to rely solely on the HTML content provided. Leverage your expertise to determine the schema based on the repetitive patterns observed in the content."""
+
-        user_message["content"] += """IMPORTANT: Ensure your schema is reliable, meaning do not use selectors that seem to generate dynamically and are not reliable. A reliable schema is what you want, as it consistently returns the same data even after many reloads of the page.
+        user_message["content"] += """IMPORTANT: Ensure your schema remains reliable by avoiding selectors that appear to generate dynamically and are not dependable. You want a reliable schema, as it consistently returns the same data even after many page reloads.

 Analyze the HTML and generate a JSON schema that follows the specified format. Only output valid JSON schema, nothing else.
 """
@@ -1140,7 +1146,6 @@ In this scenario, use your best judgment to generate the schema. Try to maximize
         except Exception as e:
             raise Exception(f"Failed to generate schema: {str(e)}")

 class JsonCssExtractionStrategy(JsonElementExtractionStrategy):
     """
     Concrete implementation of `JsonElementExtractionStrategy` using CSS selectors.

View File

@@ -45,7 +45,34 @@ def post_install():
     setup_home_directory()
     install_playwright()
     run_migration()
+    setup_builtin_browser()
     logger.success("Post-installation setup completed!", tag="COMPLETE")

+def setup_builtin_browser():
+    """Set up a builtin browser for use with Crawl4AI"""
+    try:
+        logger.info("Setting up builtin browser...", tag="INIT")
+        asyncio.run(_setup_builtin_browser())
+        logger.success("Builtin browser setup completed!", tag="COMPLETE")
+    except Exception as e:
+        logger.warning(f"Failed to set up builtin browser: {e}")
+        logger.warning("You can manually set up a builtin browser using 'crawl4ai-doctor builtin-browser-start'")
+
+async def _setup_builtin_browser():
+    try:
+        # Import BrowserProfiler here to avoid circular imports
+        from .browser_profiler import BrowserProfiler
+        profiler = BrowserProfiler(logger=logger)
+        # Launch the builtin browser
+        cdp_url = await profiler.launch_builtin_browser(headless=True)
+        if cdp_url:
+            logger.success(f"Builtin browser launched at {cdp_url}", tag="BROWSER")
+        else:
+            logger.warning("Failed to launch builtin browser", tag="BROWSER")
+    except Exception as e:
+        logger.warning(f"Error setting up builtin browser: {e}", tag="BROWSER")
+        raise
 def install_playwright():

View File

@@ -1,4 +1,3 @@
-from re import U
 from pydantic import BaseModel, HttpUrl, PrivateAttr
 from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
 from enum import Enum
@@ -28,6 +27,12 @@ class CrawlerTaskResult:
     start_time: Union[datetime, float]
     end_time: Union[datetime, float]
     error_message: str = ""
+    retry_count: int = 0
+    wait_time: float = 0.0
+
+    @property
+    def success(self) -> bool:
+        return self.result.success

 class CrawlStatus(Enum):
@@ -67,6 +72,9 @@ class CrawlStats:
     memory_usage: float = 0.0
     peak_memory: float = 0.0
     error_message: str = ""
+    wait_time: float = 0.0
+    retry_count: int = 0
+    counted_requeue: bool = False

     @property
     def duration(self) -> str:
@@ -87,6 +95,7 @@ class CrawlStats:
         duration = end - start
         return str(timedelta(seconds=int(duration.total_seconds())))

 class DisplayMode(Enum):
     DETAILED = "DETAILED"
     AGGREGATED = "AGGREGATED"

View File

@@ -178,4 +178,10 @@ if TYPE_CHECKING:
     BestFirstCrawlingStrategy as BestFirstCrawlingStrategyType,
     DFSDeepCrawlStrategy as DFSDeepCrawlStrategyType,
     DeepCrawlDecorator as DeepCrawlDecoratorType,
 )
+
+def create_llm_config(*args, **kwargs) -> 'LLMConfigType':
+    from .async_configs import LLMConfig
+    return LLMConfig(*args, **kwargs)

View File

@@ -26,7 +26,7 @@ import cProfile
import pstats import pstats
from functools import wraps from functools import wraps
import asyncio import asyncio
from lxml import etree, html as lhtml
import sqlite3 import sqlite3
import hashlib import hashlib
@@ -2617,3 +2617,116 @@ class HeadPeekr:
def get_title(head_content: str): def get_title(head_content: str):
title_match = re.search(r'<title>(.*?)</title>', head_content, re.IGNORECASE | re.DOTALL) title_match = re.search(r'<title>(.*?)</title>', head_content, re.IGNORECASE | re.DOTALL)
return title_match.group(1) if title_match else None return title_match.group(1) if title_match else None
def preprocess_html_for_schema(html_content, text_threshold=100, attr_value_threshold=200, max_size=100000):
"""
Preprocess HTML to reduce size while preserving structure for schema generation.
Args:
html_content (str): Raw HTML content
text_threshold (int): Maximum length for text nodes before truncation
attr_value_threshold (int): Maximum length for attribute values before truncation
max_size (int): Target maximum size for output HTML
Returns:
str: Preprocessed HTML content
"""
try:
# Parse HTML with error recovery
parser = etree.HTMLParser(remove_comments=True, remove_blank_text=True)
tree = lhtml.fromstring(html_content, parser=parser)
# 1. Remove HEAD section (keep only BODY)
head_elements = tree.xpath('//head')
for head in head_elements:
if head.getparent() is not None:
head.getparent().remove(head)
# 2. Define tags to remove completely
tags_to_remove = [
'script', 'style', 'noscript', 'iframe', 'canvas', 'svg',
'video', 'audio', 'source', 'track', 'map', 'area'
]
# Remove unwanted elements
for tag in tags_to_remove:
elements = tree.xpath(f'//{tag}')
for element in elements:
if element.getparent() is not None:
element.getparent().remove(element)
# 3. Process remaining elements to clean attributes and truncate text
for element in tree.iter():
# Skip if we're at the root level
if element.getparent() is None:
continue
# Clean non-essential attributes but preserve structural ones
            # Keep only structural attributes; data-* attributes are preserved
            # by the startswith check below. This is more aggressive than the
            # previous version, which also kept href, src, and similar values.
            attribs_to_keep = {'id', 'class', 'name', 'type', 'value'}
            # Attributes exempt from truncation. Deliberately empty: if a kept
            # attribute value is too long it gets truncated, and an overly long
            # selector should be replaced with a better one when building a schema.
            attributes_hates_truncate = []
# Process each attribute
for attrib in list(element.attrib.keys()):
# Keep if it's essential or starts with data-
if not (attrib in attribs_to_keep or attrib.startswith('data-')):
element.attrib.pop(attrib)
# Truncate long attribute values except for selectors
elif attrib not in attributes_hates_truncate and len(element.attrib[attrib]) > attr_value_threshold:
element.attrib[attrib] = element.attrib[attrib][:attr_value_threshold] + '...'
# Truncate text content if it's too long
if element.text and len(element.text.strip()) > text_threshold:
element.text = element.text.strip()[:text_threshold] + '...'
# Also truncate tail text if present
if element.tail and len(element.tail.strip()) > text_threshold:
element.tail = element.tail.strip()[:text_threshold] + '...'
# 4. Find repeated patterns and keep only a few examples
# This is a simplistic approach - more sophisticated pattern detection could be implemented
pattern_elements = {}
        for element in tree.xpath('//*[@class]'):  # only elements that declare a class
parent = element.getparent()
if parent is None:
continue
# Create a signature based on tag and classes
classes = element.get('class', '')
if not classes:
continue
signature = f"{element.tag}.{classes}"
if signature in pattern_elements:
pattern_elements[signature].append(element)
else:
pattern_elements[signature] = [element]
# Keep only 3 examples of each repeating pattern
for signature, elements in pattern_elements.items():
if len(elements) > 3:
# Keep the first 2 and last elements
for element in elements[2:-1]:
if element.getparent() is not None:
element.getparent().remove(element)
# 5. Convert back to string
result = etree.tostring(tree, encoding='unicode', method='html')
# If still over the size limit, apply more aggressive truncation
if len(result) > max_size:
return result[:max_size] + "..."
return result
    except Exception:
        # Fallback for parsing errors: naive truncation
        return html_content[:max_size] if len(html_content) > max_size else html_content
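For illustration, here is a minimal standalone sketch of the core lxml cleanup pass used above (the sample HTML and variable names are hypothetical; the real helper additionally truncates attributes and prunes repeated patterns):

```python
# Parse with error recovery, drop <head>/<script>/<style>, and serialize
# what remains -- the same skeleton as preprocess_html_for_schema above.
from lxml import etree
from lxml import html as lhtml

raw = "<html><head><title>T</title></head><body><script>x()</script><p class='a'>hello</p></body></html>"
parser = etree.HTMLParser(remove_comments=True, remove_blank_text=True)
tree = lhtml.fromstring(raw, parser=parser)
for el in tree.xpath('//head | //script | //style'):
    if el.getparent() is not None:
        el.getparent().remove(el)
result = etree.tostring(tree, encoding='unicode', method='html')
print(result)  # the <p> survives; head and script are gone
```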


@@ -1,137 +0,0 @@
FROM python:3.10-slim
# Set build arguments
ARG APP_HOME=/app
ARG GITHUB_REPO=https://github.com/unclecode/crawl4ai.git
ARG GITHUB_BRANCH=next
ARG USE_LOCAL=False
ARG CONFIG_PATH=""
ENV PYTHONFAULTHANDLER=1 \
PYTHONHASHSEED=random \
PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1 \
PYTHONDONTWRITEBYTECODE=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1 \
PIP_DEFAULT_TIMEOUT=100 \
DEBIAN_FRONTEND=noninteractive \
REDIS_HOST=localhost \
REDIS_PORT=6379
ARG PYTHON_VERSION=3.10
ARG INSTALL_TYPE=default
ARG ENABLE_GPU=false
ARG TARGETARCH
LABEL maintainer="unclecode"
LABEL description="🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & scraper"
LABEL version="1.0"
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
curl \
wget \
gnupg \
git \
cmake \
pkg-config \
python3-dev \
libjpeg-dev \
redis-server \
supervisor \
&& rm -rf /var/lib/apt/lists/*
RUN apt-get update && apt-get install -y --no-install-recommends \
libglib2.0-0 \
libnss3 \
libnspr4 \
libatk1.0-0 \
libatk-bridge2.0-0 \
libcups2 \
libdrm2 \
libdbus-1-3 \
libxcb1 \
libxkbcommon0 \
libx11-6 \
libxcomposite1 \
libxdamage1 \
libxext6 \
libxfixes3 \
libxrandr2 \
libgbm1 \
libpango-1.0-0 \
libcairo2 \
libasound2 \
libatspi2.0-0 \
&& rm -rf /var/lib/apt/lists/*
RUN if [ "$ENABLE_GPU" = "true" ] && [ "$TARGETARCH" = "amd64" ] ; then \
apt-get update && apt-get install -y --no-install-recommends \
nvidia-cuda-toolkit \
&& rm -rf /var/lib/apt/lists/* ; \
else \
echo "Skipping NVIDIA CUDA Toolkit installation (unsupported platform or GPU disabled)"; \
fi
RUN if [ "$TARGETARCH" = "arm64" ]; then \
echo "🦾 Installing ARM-specific optimizations"; \
apt-get update && apt-get install -y --no-install-recommends \
libopenblas-dev \
&& rm -rf /var/lib/apt/lists/*; \
elif [ "$TARGETARCH" = "amd64" ]; then \
echo "🖥️ Installing AMD64-specific optimizations"; \
apt-get update && apt-get install -y --no-install-recommends \
libomp-dev \
&& rm -rf /var/lib/apt/lists/*; \
else \
echo "Skipping platform-specific optimizations (unsupported platform)"; \
fi
WORKDIR ${APP_HOME}
RUN git clone --branch ${GITHUB_BRANCH} ${GITHUB_REPO} /tmp/crawl4ai
COPY docker/supervisord.conf .
COPY docker/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
pip install "/tmp/crawl4ai/[all]" && \
python -m nltk.downloader punkt stopwords && \
python -m crawl4ai.model_loader ; \
elif [ "$INSTALL_TYPE" = "torch" ] ; then \
pip install "/tmp/crawl4ai/[torch]" ; \
elif [ "$INSTALL_TYPE" = "transformer" ] ; then \
pip install "/tmp/crawl4ai/[transformer]" && \
python -m crawl4ai.model_loader ; \
else \
pip install "/tmp/crawl4ai" ; \
fi
RUN pip install --no-cache-dir --upgrade pip && \
python -c "import crawl4ai; print('✅ crawl4ai is ready to rock!')" && \
python -c "from playwright.sync_api import sync_playwright; print('✅ Playwright is feeling dramatic!')"
RUN playwright install --with-deps chromium
COPY docker/* ${APP_HOME}/
RUN if [ -n "$CONFIG_PATH" ] && [ -f "$CONFIG_PATH" ]; then \
echo "Using custom config from $CONFIG_PATH" && \
cp $CONFIG_PATH /app/config.yml; \
fi
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD bash -c '\
MEM=$(free -m | awk "/^Mem:/{print \$2}"); \
if [ $MEM -lt 2048 ]; then \
echo "⚠️ Warning: Less than 2GB RAM available! Your container might need a memory boost! 🚀"; \
exit 1; \
fi && \
redis-cli ping > /dev/null && \
curl -f http://localhost:8000/health || exit 1'
# EXPOSE 6379
CMD ["supervisord", "-c", "supervisord.conf"]


@@ -1,3 +0,0 @@
project_name: PROJECT_NAME
domain_name: DOMAIN_NAME
aws_region: AWS_REGION


@@ -1,729 +0,0 @@
#!/usr/bin/env python3
import argparse
import subprocess
import sys
import time
import json
import yaml
import requests
import os
# Steps for deployment
STEPS = [
"refresh_aws_auth",
"fetch_or_create_vpc_and_subnets",
"create_ecr_repositories",
"create_iam_role",
"create_security_groups",
"request_acm_certificate",
"build_and_push_docker",
"create_task_definition",
"setup_alb",
"deploy_ecs_service",
"configure_custom_domain",
"test_endpoints"
]
# Utility function to prompt user for confirmation
def confirm_step(step_name):
while True:
response = input(f"Proceed with {step_name}? (yes/no): ").strip().lower()
if response in ["yes", "no"]:
return response == "yes"
print("Please enter 'yes' or 'no'.")
# Utility function to run AWS CLI or shell commands and handle errors
def run_command(command, error_message, additional_diagnostics=None, cwd="."):
try:
result = subprocess.run(command, capture_output=True, text=True, check=True, cwd=cwd)
return result
except subprocess.CalledProcessError as e:
with open("error_context.md", "w") as f:
f.write(f"{error_message}:\n")
f.write(f"Command: {' '.join(command)}\n")
f.write(f"Exit Code: {e.returncode}\n")
f.write(f"Stdout: {e.stdout}\n")
f.write(f"Stderr: {e.stderr}\n")
if additional_diagnostics:
for diag_cmd in additional_diagnostics:
diag_result = subprocess.run(diag_cmd, capture_output=True, text=True)
f.write(f"\nDiagnostic command: {' '.join(diag_cmd)}\n")
f.write(f"Stdout: {diag_result.stdout}\n")
f.write(f"Stderr: {diag_result.stderr}\n")
raise Exception(f"{error_message}: {e.stderr}")
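The error-wrapping pattern in `run_command` can be exercised in isolation. A self-contained sketch (the demo command is hypothetical; the real helper also writes diagnostics to `error_context.md`):

```python
import subprocess
import sys

def run_command(command, error_message):
    # Same shape as the helper above, minus the error_context.md reporting
    try:
        return subprocess.run(command, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError as e:
        raise Exception(f"{error_message}: {e.stderr}")

# A deliberately failing command to show the wrapped error
try:
    run_command([sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(2)"],
                "Demo step failed")
except Exception as err:
    print(err)  # Demo step failed: boom
```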
# Utility function to load or initialize state
def load_state(project_name):
state_file = f"{project_name}-state.json"
if os.path.exists(state_file):
with open(state_file, "r") as f:
return json.load(f)
return {"last_step": -1}
# Utility function to save state
def save_state(project_name, state):
state_file = f"{project_name}-state.json"
with open(state_file, "w") as f:
json.dump(state, f, indent=4)
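Together, `load_state` and `save_state` implement a simple resumable-checkpoint pattern: each completed step bumps `last_step` on disk, so an interrupted deploy picks up where it left off. A self-contained sketch (the file path is illustrative):

```python
import json
import os
import tempfile

def load_state(state_file):
    # Resume from disk if a previous run left state behind
    if os.path.exists(state_file):
        with open(state_file, "r") as f:
            return json.load(f)
    return {"last_step": -1}

def save_state(state_file, state):
    with open(state_file, "w") as f:
        json.dump(state, f, indent=4)

state_file = os.path.join(tempfile.mkdtemp(), "demo-state.json")
state = load_state(state_file)        # fresh run: {"last_step": -1}
state["last_step"] = 3
save_state(state_file, state)
print(load_state(state_file))         # {'last_step': 3}
```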
# DNS Check Function
def check_dns_propagation(domain, alb_dns):
try:
result = subprocess.run(["dig", "+short", domain], capture_output=True, text=True)
if alb_dns in result.stdout:
return True
return False
except Exception as e:
print(f"Failed to check DNS: {e}")
return False
# Step Functions
def refresh_aws_auth(project_name, state, config):
if state["last_step"] >= 0:
print("Skipping refresh_aws_auth (already completed)")
return
if not confirm_step("Refresh AWS authentication"):
sys.exit("User aborted.")
run_command(
["aws", "sts", "get-caller-identity"],
"Failed to verify AWS credentials"
)
print("AWS authentication verified.")
state["last_step"] = 0
save_state(project_name, state)
def fetch_or_create_vpc_and_subnets(project_name, state, config):
if state["last_step"] >= 1:
print("Skipping fetch_or_create_vpc_and_subnets (already completed)")
return state["vpc_id"], state["public_subnets"]
if not confirm_step("Fetch or Create VPC and Subnets"):
sys.exit("User aborted.")
# Fetch AWS account ID
result = run_command(
["aws", "sts", "get-caller-identity"],
"Failed to get AWS account ID"
)
account_id = json.loads(result.stdout)["Account"]
# Fetch default VPC
result = run_command(
["aws", "ec2", "describe-vpcs", "--filters", "Name=isDefault,Values=true", "--region", config["aws_region"]],
"Failed to describe VPCs"
)
vpcs = json.loads(result.stdout).get("Vpcs", [])
if not vpcs:
result = run_command(
["aws", "ec2", "create-vpc", "--cidr-block", "10.0.0.0/16", "--region", config["aws_region"]],
"Failed to create VPC"
)
vpc_id = json.loads(result.stdout)["Vpc"]["VpcId"]
run_command(
["aws", "ec2", "modify-vpc-attribute", "--vpc-id", vpc_id, "--enable-dns-hostnames", "--region", config["aws_region"]],
"Failed to enable DNS hostnames"
)
else:
vpc_id = vpcs[0]["VpcId"]
# Fetch or create subnets
result = run_command(
["aws", "ec2", "describe-subnets", "--filters", f"Name=vpc-id,Values={vpc_id}", "--region", config["aws_region"]],
"Failed to describe subnets"
)
subnets = json.loads(result.stdout).get("Subnets", [])
if len(subnets) < 2:
azs = json.loads(run_command(
["aws", "ec2", "describe-availability-zones", "--region", config["aws_region"]],
"Failed to describe availability zones"
).stdout)["AvailabilityZones"][:2]
subnet_ids = []
for i, az in enumerate(azs):
az_name = az["ZoneName"]
result = run_command(
["aws", "ec2", "create-subnet", "--vpc-id", vpc_id, "--cidr-block", f"10.0.{i}.0/24", "--availability-zone", az_name, "--region", config["aws_region"]],
f"Failed to create subnet in {az_name}"
)
subnet_id = json.loads(result.stdout)["Subnet"]["SubnetId"]
subnet_ids.append(subnet_id)
run_command(
["aws", "ec2", "modify-subnet-attribute", "--subnet-id", subnet_id, "--map-public-ip-on-launch", "--region", config["aws_region"]],
f"Failed to make subnet {subnet_id} public"
)
else:
subnet_ids = [s["SubnetId"] for s in subnets[:2]]
# Ensure internet gateway
result = run_command(
["aws", "ec2", "describe-internet-gateways", "--filters", f"Name=attachment.vpc-id,Values={vpc_id}", "--region", config["aws_region"]],
"Failed to describe internet gateways"
)
igws = json.loads(result.stdout).get("InternetGateways", [])
if not igws:
result = run_command(
["aws", "ec2", "create-internet-gateway", "--region", config["aws_region"]],
"Failed to create internet gateway"
)
igw_id = json.loads(result.stdout)["InternetGateway"]["InternetGatewayId"]
run_command(
["aws", "ec2", "attach-internet-gateway", "--vpc-id", vpc_id, "--internet-gateway-id", igw_id, "--region", config["aws_region"]],
"Failed to attach internet gateway"
)
state["vpc_id"] = vpc_id
state["public_subnets"] = subnet_ids
state["last_step"] = 1
save_state(project_name, state)
print(f"VPC ID: {vpc_id}, Subnets: {subnet_ids}")
return vpc_id, subnet_ids
def create_ecr_repositories(project_name, state, config):
if state["last_step"] >= 2:
print("Skipping create_ecr_repositories (already completed)")
return
if not confirm_step("Create ECR Repositories"):
sys.exit("User aborted.")
account_id = json.loads(run_command(
["aws", "sts", "get-caller-identity"],
"Failed to get AWS account ID"
).stdout)["Account"]
repos = [project_name, f"{project_name}-nginx"]
for repo in repos:
result = subprocess.run(
["aws", "ecr", "describe-repositories", "--repository-names", repo, "--region", config["aws_region"]],
capture_output=True, text=True
)
if result.returncode != 0:
run_command(
["aws", "ecr", "create-repository", "--repository-name", repo, "--region", config["aws_region"]],
f"Failed to create ECR repository {repo}"
)
print(f"ECR repository {repo} is ready.")
state["last_step"] = 2
save_state(project_name, state)
def create_iam_role(project_name, state, config):
if state["last_step"] >= 3:
print("Skipping create_iam_role (already completed)")
return
if not confirm_step("Create IAM Role"):
sys.exit("User aborted.")
account_id = json.loads(run_command(
["aws", "sts", "get-caller-identity"],
"Failed to get AWS account ID"
).stdout)["Account"]
role_name = "ecsTaskExecutionRole"
trust_policy = {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {"Service": "ecs-tasks.amazonaws.com"},
"Action": "sts:AssumeRole"
}
]
}
with open("trust_policy.json", "w") as f:
json.dump(trust_policy, f)
result = subprocess.run(
["aws", "iam", "get-role", "--role-name", role_name],
capture_output=True, text=True
)
if result.returncode != 0:
run_command(
["aws", "iam", "create-role", "--role-name", role_name, "--assume-role-policy-document", "file://trust_policy.json"],
f"Failed to create IAM role {role_name}"
)
run_command(
["aws", "iam", "attach-role-policy", "--role-name", role_name, "--policy-arn", "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"],
"Failed to attach ECS task execution policy"
)
os.remove("trust_policy.json")
state["execution_role_arn"] = f"arn:aws:iam::{account_id}:role/{role_name}"
state["last_step"] = 3
save_state(project_name, state)
print(f"IAM role {role_name} configured.")
def create_security_groups(project_name, state, config):
if state["last_step"] >= 4:
print("Skipping create_security_groups (already completed)")
return state["alb_sg_id"], state["ecs_sg_id"]
if not confirm_step("Create Security Groups"):
sys.exit("User aborted.")
vpc_id = state["vpc_id"]
alb_sg_name = f"{project_name}-alb-sg"
result = run_command(
["aws", "ec2", "describe-security-groups", "--filters", f"Name=vpc-id,Values={vpc_id}", f"Name=group-name,Values={alb_sg_name}", "--region", config["aws_region"]],
"Failed to describe ALB security group"
)
if not json.loads(result.stdout).get("SecurityGroups"):
result = run_command(
["aws", "ec2", "create-security-group", "--group-name", alb_sg_name, "--description", "Security group for ALB", "--vpc-id", vpc_id, "--region", config["aws_region"]],
"Failed to create ALB security group"
)
alb_sg_id = json.loads(result.stdout)["GroupId"]
run_command(
["aws", "ec2", "authorize-security-group-ingress", "--group-id", alb_sg_id, "--protocol", "tcp", "--port", "80", "--cidr", "0.0.0.0/0", "--region", config["aws_region"]],
"Failed to authorize HTTP ingress"
)
run_command(
["aws", "ec2", "authorize-security-group-ingress", "--group-id", alb_sg_id, "--protocol", "tcp", "--port", "443", "--cidr", "0.0.0.0/0", "--region", config["aws_region"]],
"Failed to authorize HTTPS ingress"
)
else:
alb_sg_id = json.loads(result.stdout)["SecurityGroups"][0]["GroupId"]
ecs_sg_name = f"{project_name}-ecs-sg"
result = run_command(
["aws", "ec2", "describe-security-groups", "--filters", f"Name=vpc-id,Values={vpc_id}", f"Name=group-name,Values={ecs_sg_name}", "--region", config["aws_region"]],
"Failed to describe ECS security group"
)
if not json.loads(result.stdout).get("SecurityGroups"):
result = run_command(
["aws", "ec2", "create-security-group", "--group-name", ecs_sg_name, "--description", "Security group for ECS tasks", "--vpc-id", vpc_id, "--region", config["aws_region"]],
"Failed to create ECS security group"
)
ecs_sg_id = json.loads(result.stdout)["GroupId"]
run_command(
["aws", "ec2", "authorize-security-group-ingress", "--group-id", ecs_sg_id, "--protocol", "tcp", "--port", "80", "--source-group", alb_sg_id, "--region", config["aws_region"]],
"Failed to authorize ECS ingress"
)
else:
ecs_sg_id = json.loads(result.stdout)["SecurityGroups"][0]["GroupId"]
state["alb_sg_id"] = alb_sg_id
state["ecs_sg_id"] = ecs_sg_id
state["last_step"] = 4
save_state(project_name, state)
print("Security groups configured.")
return alb_sg_id, ecs_sg_id
def request_acm_certificate(project_name, state, config):
if state["last_step"] >= 5:
print("Skipping request_acm_certificate (already completed)")
return state["cert_arn"]
if not confirm_step("Request ACM Certificate"):
sys.exit("User aborted.")
domain_name = config["domain_name"]
    result = run_command(
        ["aws", "acm", "list-certificates", "--certificate-statuses", "ISSUED", "--region", config["aws_region"]],
        "Failed to list certificates"
    )
certificates = json.loads(result.stdout).get("CertificateSummaryList", [])
cert_arn = next((c["CertificateArn"] for c in certificates if c["DomainName"] == domain_name), None)
if not cert_arn:
result = run_command(
["aws", "acm", "request-certificate", "--domain-name", domain_name, "--validation-method", "DNS", "--region", config["aws_region"]],
"Failed to request ACM certificate"
)
cert_arn = json.loads(result.stdout)["CertificateArn"]
time.sleep(10)
result = run_command(
["aws", "acm", "describe-certificate", "--certificate-arn", cert_arn, "--region", config["aws_region"]],
"Failed to describe certificate"
)
cert_details = json.loads(result.stdout)["Certificate"]
dns_validations = cert_details.get("DomainValidationOptions", [])
for validation in dns_validations:
if validation["ValidationMethod"] == "DNS" and "ResourceRecord" in validation:
record = validation["ResourceRecord"]
print(f"Please add this DNS record to validate the certificate for {domain_name}:")
print(f"Name: {record['Name']}")
print(f"Type: {record['Type']}")
print(f"Value: {record['Value']}")
print("Press Enter after adding the DNS record...")
input()
while True:
result = run_command(
["aws", "acm", "describe-certificate", "--certificate-arn", cert_arn, "--region", config["aws_region"]],
"Failed to check certificate status"
)
status = json.loads(result.stdout)["Certificate"]["Status"]
if status == "ISSUED":
break
elif status in ["FAILED", "REVOKED", "INACTIVE"]:
print("Certificate issuance failed.")
sys.exit(1)
time.sleep(10)
state["cert_arn"] = cert_arn
state["last_step"] = 5
save_state(project_name, state)
print(f"Certificate ARN: {cert_arn}")
return cert_arn
def build_and_push_docker(project_name, state, config):
if state["last_step"] >= 6:
print("Skipping build_and_push_docker (already completed)")
return state["fastapi_image"], state["nginx_image"]
if not confirm_step("Build and Push Docker Images"):
sys.exit("User aborted.")
with open("./version.txt", "r") as f:
version = f.read().strip()
account_id = json.loads(run_command(
["aws", "sts", "get-caller-identity"],
"Failed to get AWS account ID"
).stdout)["Account"]
region = config["aws_region"]
login_password = run_command(
["aws", "ecr", "get-login-password", "--region", region],
"Failed to get ECR login password"
).stdout.strip()
run_command(
["docker", "login", "--username", "AWS", "--password", login_password, f"{account_id}.dkr.ecr.{region}.amazonaws.com"],
"Failed to authenticate Docker to ECR"
)
fastapi_image = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{project_name}:{version}"
run_command(
["docker", "build", "-f", "Dockerfile", "-t", fastapi_image, "."],
"Failed to build FastAPI Docker image"
)
run_command(
["docker", "push", fastapi_image],
"Failed to push FastAPI image"
)
nginx_image = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{project_name}-nginx:{version}"
run_command(
["docker", "build", "-f", "Dockerfile", "-t", nginx_image, "."],
"Failed to build Nginx Docker image",
cwd="./nginx"
)
run_command(
["docker", "push", nginx_image],
"Failed to push Nginx image"
)
state["fastapi_image"] = fastapi_image
state["nginx_image"] = nginx_image
state["last_step"] = 6
save_state(project_name, state)
print("Docker images built and pushed.")
return fastapi_image, nginx_image
def create_task_definition(project_name, state, config):
if state["last_step"] >= 7:
print("Skipping create_task_definition (already completed)")
return state["task_def_arn"]
if not confirm_step("Create Task Definition"):
sys.exit("User aborted.")
log_group = f"/ecs/{project_name}-logs"
result = run_command(
["aws", "logs", "describe-log-groups", "--log-group-name-prefix", log_group, "--region", config["aws_region"]],
"Failed to describe log groups"
)
if not any(lg["logGroupName"] == log_group for lg in json.loads(result.stdout).get("logGroups", [])):
run_command(
["aws", "logs", "create-log-group", "--log-group-name", log_group, "--region", config["aws_region"]],
f"Failed to create log group {log_group}"
)
task_definition = {
"family": f"{project_name}-taskdef",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "2048",
"executionRoleArn": state["execution_role_arn"],
"containerDefinitions": [
{
"name": "fastapi",
"image": state["fastapi_image"],
"portMappings": [{"containerPort": 8000, "hostPort": 8000, "protocol": "tcp"}],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": log_group,
"awslogs-region": config["aws_region"],
"awslogs-stream-prefix": "fastapi"
}
}
},
{
"name": "nginx",
"image": state["nginx_image"],
"portMappings": [{"containerPort": 80, "hostPort": 80, "protocol": "tcp"}],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": log_group,
"awslogs-region": config["aws_region"],
"awslogs-stream-prefix": "nginx"
}
}
}
]
}
with open("task_def.json", "w") as f:
json.dump(task_definition, f)
result = run_command(
["aws", "ecs", "register-task-definition", "--cli-input-json", "file://task_def.json", "--region", config["aws_region"]],
"Failed to register task definition"
)
task_def_arn = json.loads(result.stdout)["taskDefinition"]["taskDefinitionArn"]
os.remove("task_def.json")
state["task_def_arn"] = task_def_arn
state["last_step"] = 7
save_state(project_name, state)
print("Task definition created.")
return task_def_arn
def setup_alb(project_name, state, config):
if state["last_step"] >= 8:
print("Skipping setup_alb (already completed)")
return state["alb_arn"], state["tg_arn"], state["alb_dns"]
if not confirm_step("Set Up ALB"):
sys.exit("User aborted.")
vpc_id = state["vpc_id"]
public_subnets = state["public_subnets"]
alb_name = f"{project_name}-alb"
result = subprocess.run(
["aws", "elbv2", "describe-load-balancers", "--names", alb_name, "--region", config["aws_region"]],
capture_output=True, text=True
)
if result.returncode != 0:
run_command(
["aws", "elbv2", "create-load-balancer", "--name", alb_name, "--subnets"] + public_subnets + ["--security-groups", state["alb_sg_id"], "--region", config["aws_region"]],
"Failed to create ALB"
)
    alb_info = json.loads(run_command(
        ["aws", "elbv2", "describe-load-balancers", "--names", alb_name, "--region", config["aws_region"]],
        "Failed to describe ALB"
    ).stdout)["LoadBalancers"][0]
    alb_arn = alb_info["LoadBalancerArn"]
    alb_dns = alb_info["DNSName"]
tg_name = f"{project_name}-tg"
result = subprocess.run(
["aws", "elbv2", "describe-target-groups", "--names", tg_name, "--region", config["aws_region"]],
capture_output=True, text=True
)
if result.returncode != 0:
run_command(
["aws", "elbv2", "create-target-group", "--name", tg_name, "--protocol", "HTTP", "--port", "80", "--vpc-id", vpc_id, "--region", config["aws_region"]],
"Failed to create target group"
)
tg_arn = json.loads(run_command(
["aws", "elbv2", "describe-target-groups", "--names", tg_name, "--region", config["aws_region"]],
"Failed to describe target group"
).stdout)["TargetGroups"][0]["TargetGroupArn"]
result = run_command(
["aws", "elbv2", "describe-listeners", "--load-balancer-arn", alb_arn, "--region", config["aws_region"]],
"Failed to describe listeners"
)
listeners = json.loads(result.stdout).get("Listeners", [])
if not any(l["Port"] == 80 for l in listeners):
run_command(
["aws", "elbv2", "create-listener", "--load-balancer-arn", alb_arn, "--protocol", "HTTP", "--port", "80", "--default-actions", "Type=redirect,RedirectConfig={Protocol=HTTPS,Port=443,StatusCode=HTTP_301}", "--region", config["aws_region"]],
"Failed to create HTTP listener"
)
if not any(l["Port"] == 443 for l in listeners):
run_command(
["aws", "elbv2", "create-listener", "--load-balancer-arn", alb_arn, "--protocol", "HTTPS", "--port", "443", "--certificates", f"CertificateArn={state['cert_arn']}", "--default-actions", f"Type=forward,TargetGroupArn={tg_arn}", "--region", config["aws_region"]],
"Failed to create HTTPS listener"
)
state["alb_arn"] = alb_arn
state["tg_arn"] = tg_arn
state["alb_dns"] = alb_dns
state["last_step"] = 8
save_state(project_name, state)
print("ALB configured.")
return alb_arn, tg_arn, alb_dns
def deploy_ecs_service(project_name, state, config):
if state["last_step"] >= 9:
print("Skipping deploy_ecs_service (already completed)")
return
if not confirm_step("Deploy ECS Service"):
sys.exit("User aborted.")
cluster_name = f"{project_name}-cluster"
result = run_command(
["aws", "ecs", "describe-clusters", "--clusters", cluster_name, "--region", config["aws_region"]],
"Failed to describe clusters"
)
    clusters = json.loads(result.stdout).get("clusters", [])
    if not any(c.get("status") == "ACTIVE" for c in clusters):
run_command(
["aws", "ecs", "create-cluster", "--cluster-name", cluster_name, "--region", config["aws_region"]],
"Failed to create ECS cluster"
)
service_name = f"{project_name}-service"
result = run_command(
["aws", "ecs", "describe-services", "--cluster", cluster_name, "--services", service_name, "--region", config["aws_region"]],
"Failed to describe services",
additional_diagnostics=[["aws", "ecs", "list-tasks", "--cluster", cluster_name, "--service-name", service_name, "--region", config["aws_region"]]]
)
services = json.loads(result.stdout).get("services", [])
if not services or services[0]["status"] == "INACTIVE":
run_command(
["aws", "ecs", "create-service", "--cluster", cluster_name, "--service-name", service_name, "--task-definition", state["task_def_arn"], "--desired-count", "1", "--launch-type", "FARGATE", "--network-configuration", f"awsvpcConfiguration={{subnets={json.dumps(state['public_subnets'])},securityGroups=[{state['ecs_sg_id']}],assignPublicIp=ENABLED}}", "--load-balancers", f"targetGroupArn={state['tg_arn']},containerName=nginx,containerPort=80", "--region", config["aws_region"]],
"Failed to create ECS service"
)
else:
run_command(
["aws", "ecs", "update-service", "--cluster", cluster_name, "--service", service_name, "--task-definition", state["task_def_arn"], "--region", config["aws_region"]],
"Failed to update ECS service"
)
state["last_step"] = 9
save_state(project_name, state)
print("ECS service deployed.")
def configure_custom_domain(project_name, state, config):
if state["last_step"] >= 10:
print("Skipping configure_custom_domain (already completed)")
return
if not confirm_step("Configure Custom Domain"):
sys.exit("User aborted.")
domain_name = config["domain_name"]
alb_dns = state["alb_dns"]
print(f"Please add a CNAME record for {domain_name} pointing to {alb_dns} in your DNS provider.")
print("Press Enter after updating the DNS record...")
input()
while not check_dns_propagation(domain_name, alb_dns):
print("DNS propagation not complete. Waiting 30 seconds before retrying...")
time.sleep(30)
print("DNS propagation confirmed.")
state["last_step"] = 10
save_state(project_name, state)
print("Custom domain configured.")
def test_endpoints(project_name, state, config):
if state["last_step"] >= 11:
print("Skipping test_endpoints (already completed)")
return
if not confirm_step("Test Endpoints"):
sys.exit("User aborted.")
domain = config["domain_name"]
time.sleep(30) # Wait for service to stabilize
response = requests.get(f"https://{domain}/health", verify=False)
if response.status_code != 200:
with open("error_context.md", "w") as f:
f.write("Health endpoint test failed:\n")
f.write(f"Status Code: {response.status_code}\n")
f.write(f"Response: {response.text}\n")
sys.exit(1)
print("Health endpoint test passed.")
payload = {
"urls": ["https://example.com"],
"browser_config": {"headless": True},
"crawler_config": {"stream": False}
}
response = requests.post(f"https://{domain}/crawl", json=payload, verify=False)
if response.status_code != 200:
with open("error_context.md", "w") as f:
f.write("Crawl endpoint test failed:\n")
f.write(f"Status Code: {response.status_code}\n")
f.write(f"Response: {response.text}\n")
sys.exit(1)
print("Crawl endpoint test passed.")
state["last_step"] = 11
save_state(project_name, state)
print("Endpoints tested successfully.")
# Main Deployment Function
def deploy(project_name, force=False):
config_file = f"{project_name}-config.yml"
if not os.path.exists(config_file):
print(f"Configuration file {config_file} not found. Run 'init' first.")
sys.exit(1)
with open(config_file, "r") as f:
config = yaml.safe_load(f)
state = load_state(project_name)
if force:
state = {"last_step": -1}
last_step = state.get("last_step", -1)
for step_idx, step_name in enumerate(STEPS):
if step_idx <= last_step:
print(f"Skipping {step_name} (already completed)")
continue
        print(f"Executing step: {step_name}")
        # Every step shares the same signature and persists its outputs in
        # state, so the return values need not be captured here
        globals()[step_name](project_name, state, config)
# Init Command
def init(project_name, domain_name, aws_region):
config = {
"project_name": project_name,
"domain_name": domain_name,
"aws_region": aws_region
}
config_file = f"{project_name}-config.yml"
with open(config_file, "w") as f:
yaml.dump(config, f)
print(f"Configuration file {config_file} created.")
# Argument Parser
parser = argparse.ArgumentParser(description="Crawl4AI Deployment Script")
subparsers = parser.add_subparsers(dest="command")
# Init Parser
init_parser = subparsers.add_parser("init", help="Initialize configuration")
init_parser.add_argument("--project", required=True, help="Project name")
init_parser.add_argument("--domain", required=True, help="Domain name")
init_parser.add_argument("--region", required=True, help="AWS region")
# Deploy Parser
deploy_parser = subparsers.add_parser("deploy", help="Deploy the project")
deploy_parser.add_argument("--project", required=True, help="Project name")
deploy_parser.add_argument("--force", action="store_true", help="Force redeployment from start")
args = parser.parse_args()
if args.command == "init":
init(args.project, args.domain, args.region)
elif args.command == "deploy":
deploy(args.project, args.force)
else:
parser.print_help()


@@ -1,31 +0,0 @@
# .dockerignore
*
# Allow specific files and directories when using local installation
!crawl4ai/
!docs/
!deploy/docker/
!setup.py
!pyproject.toml
!README.md
!LICENSE
!MANIFEST.in
!setup.cfg
!mkdocs.yml
.git/
__pycache__/
*.pyc
*.pyo
*.pyd
.DS_Store
.env
.venv
venv/
tests/
coverage.xml
*.log
*.swp
*.egg-info/
dist/
build/


@@ -1,8 +0,0 @@
# LLM Provider Keys
OPENAI_API_KEY=your_openai_key_here
DEEPSEEK_API_KEY=your_deepseek_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
GROQ_API_KEY=your_groq_key_here
TOGETHER_API_KEY=your_together_key_here
MISTRAL_API_KEY=your_mistral_key_here
GEMINI_API_TOKEN=your_gemini_key_here


@@ -1,847 +0,0 @@
# Crawl4AI Docker Guide 🐳
## Table of Contents
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Local Build](#local-build)
- [Docker Hub](#docker-hub)
- [Dockerfile Parameters](#dockerfile-parameters)
- [Using the API](#using-the-api)
- [Understanding Request Schema](#understanding-request-schema)
- [REST API Examples](#rest-api-examples)
- [Python SDK](#python-sdk)
- [Metrics & Monitoring](#metrics--monitoring)
- [Deployment Scenarios](#deployment-scenarios)
- [Complete Examples](#complete-examples)
- [Getting Help](#getting-help)
## Prerequisites
Before we dive in, make sure you have:
- Docker installed and running (version 20.10.0 or higher)
- At least 4GB of RAM available for the container
- Python 3.10+ (if using the Python SDK)
- Node.js 16+ (if using the Node.js examples)
> 💡 **Pro tip**: Run `docker info` to check your Docker installation and available resources.
## Installation
### Local Build
Let's get your local environment set up step by step!
#### 1. Building the Image
First, clone the repository and build the Docker image:
```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai/deploy
# Build the Docker image
docker build --platform=linux/amd64 --no-cache -t crawl4ai .
# Or build for arm64
docker build --platform=linux/arm64 --no-cache -t crawl4ai .
```
#### 2. Environment Setup
If you plan to use LLMs (Language Models), you'll need to set up your API keys. Create a `.llm.env` file:
```env
# OpenAI
OPENAI_API_KEY=sk-your-key
# Anthropic
ANTHROPIC_API_KEY=your-anthropic-key
# DeepSeek
DEEPSEEK_API_KEY=your-deepseek-key
# Check out https://docs.litellm.ai/docs/providers for more providers!
```
> 🔑 **Note**: Keep your API keys secure! Never commit them to version control.
#### 3. Running the Container
You have several options for running the container:
Basic run (no LLM support):
```bash
docker run -d -p 8000:8000 --name crawl4ai crawl4ai
```
With LLM support:
```bash
docker run -d -p 8000:8000 \
--env-file .llm.env \
--name crawl4ai \
crawl4ai
```
Using host environment variables (Not a good practice, but works for local testing):
```bash
docker run -d -p 8000:8000 \
--env-file .llm.env \
--env "$(env)" \
--name crawl4ai \
crawl4ai
```
#### Multi-Platform Build
For distributing your image across different architectures, use `buildx`:
```bash
# Set up buildx builder
docker buildx create --use
# Build for multiple platforms
docker buildx build \
--platform linux/amd64,linux/arm64 \
-t crawl4ai \
--push \
.
```
> 💡 **Note**: Multi-platform builds require Docker Buildx and need to be pushed to a registry.
#### Development Build
For development, you might want to enable all features:
```bash
docker build -t crawl4ai \
--build-arg INSTALL_TYPE=all \
--build-arg PYTHON_VERSION=3.10 \
--build-arg ENABLE_GPU=true \
.
```
#### GPU-Enabled Build
If you plan to use GPU acceleration:
```bash
docker build -t crawl4ai \
--build-arg ENABLE_GPU=true \
deploy/docker/
```
### Build Arguments Explained
| Argument | Description | Default | Options |
|----------|-------------|---------|----------|
| PYTHON_VERSION | Python version | 3.10 | 3.8, 3.9, 3.10 |
| INSTALL_TYPE | Feature set | default | default, all, torch, transformer |
| ENABLE_GPU | GPU support | false | true, false |
| APP_HOME | Install path | /app | any valid path |
### Build Best Practices
1. **Choose the Right Install Type**
- `default`: Basic installation with the smallest image; honestly, this is what I use most of the time.
- `all`: Full feature set with a larger image (includes transformers and NLTK; make sure you really need them).
2. **Platform Considerations**
- Let Docker auto-detect platform unless you need cross-compilation
- Use --platform for specific architecture requirements
- Consider buildx for multi-architecture distribution
3. **Performance Optimization**
- The image automatically includes platform-specific optimizations
- AMD64 gets OpenMP optimizations
- ARM64 gets OpenBLAS optimizations
### Docker Hub
> 🚧 Coming soon! The image will be available at `crawl4ai`. Stay tuned!
## Using the API
In the following sections, we cover two ways to communicate with the Docker server. The first is the client SDK I developed for Python (a Node.js version is on the way); I highly recommend this approach to avoid mistakes. Alternatively, you can take the more technical route: build the JSON request structure yourself and pass it to the endpoints, which I explain in detail below.
### Python SDK
The SDK makes things easier! Here's how to use it:
```python
import asyncio

from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig

async def main():
    async with Crawl4aiDockerClient(base_url="http://localhost:8000", verbose=True) as client:
        # If JWT is enabled, you can authenticate like this: (more on this later)
        # await client.authenticate("test@example.com")

        # Non-streaming crawl
        results = await client.crawl(
            ["https://example.com", "https://python.org"],
            browser_config=BrowserConfig(headless=True),
            crawler_config=CrawlerRunConfig()
        )
        print(f"Non-streaming results: {results}")

        # Streaming crawl
        crawler_config = CrawlerRunConfig(stream=True)
        async for result in await client.crawl(
            ["https://example.com", "https://python.org"],
            browser_config=BrowserConfig(headless=True),
            crawler_config=crawler_config
        ):
            print(f"Streamed result: {result}")

        # Get schema
        schema = await client.get_schema()
        print(f"Schema: {schema}")

if __name__ == "__main__":
    asyncio.run(main())
```
`Crawl4aiDockerClient` is an async context manager that handles the connection for you. You can pass in optional parameters for more control:
- `base_url` (str): Base URL of the Crawl4AI Docker server
- `timeout` (float): Default timeout for requests in seconds
- `verify_ssl` (bool): Whether to verify SSL certificates
- `verbose` (bool): Whether to show logging output
- `log_file` (str, optional): Path to log file if file logging is desired
This client SDK generates a properly structured JSON request for the server's HTTP API.
## Second Approach: Direct API Calls
This is super important! The API expects a specific structure that matches our Python classes. Let me show you how it works.
### Understanding Configuration Structure
Let's dive deep into how configurations work in Crawl4AI. Every configuration object follows a consistent pattern of `type` and `params`. This structure enables complex, nested configurations while maintaining clarity.
#### The Basic Pattern
Try this in Python to understand the structure:
```python
from crawl4ai import BrowserConfig
# Create a config and see its structure
config = BrowserConfig(headless=True)
print(config.dump())
```
This outputs:
```json
{
  "type": "BrowserConfig",
  "params": {
    "headless": true
  }
}
```
#### Simple vs Complex Values
The structure follows these rules:
- Simple values (strings, numbers, booleans, lists) are passed directly
- Complex values (classes, dictionaries) use the type-params pattern
For example, with dictionaries:
```json
{
  "browser_config": {
    "type": "BrowserConfig",
    "params": {
      "headless": true,   // Simple boolean - direct value
      "viewport": {       // Complex dictionary - needs type-params
        "type": "dict",
        "value": {
          "width": 1200,
          "height": 800
        }
      }
    }
  }
}
```
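For readers generating payloads programmatically, the wrapping rule above can be sketched in a few lines of Python (`wrap_param` is a hypothetical helper for illustration, not part of the SDK):

```python
def wrap_param(value):
    """Apply the simple-vs-complex rule: dicts get the type/value
    wrapper, while primitives and lists pass through unchanged."""
    if isinstance(value, dict):
        return {"type": "dict", "value": value}
    return value

# A viewport dictionary gets wrapped; a boolean does not.
viewport = wrap_param({"width": 1200, "height": 800})
headless = wrap_param(True)
```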
#### Strategy Pattern and Nesting
Strategies (like chunking or content filtering) demonstrate why we need this structure. Consider this chunking configuration:
```json
{
  "crawler_config": {
    "type": "CrawlerRunConfig",
    "params": {
      "chunking_strategy": {
        "type": "RegexChunking",   // Strategy implementation
        "params": {
          "patterns": ["\n\n", "\\.\\s+"]
        }
      }
    }
  }
}
```
Here, `chunking_strategy` accepts any chunking implementation. The `type` field tells the system which strategy to use, and `params` configures that specific strategy.
#### Complex Nested Example
Let's look at a more complex example with content filtering:
```json
{
  "crawler_config": {
    "type": "CrawlerRunConfig",
    "params": {
      "markdown_generator": {
        "type": "DefaultMarkdownGenerator",
        "params": {
          "content_filter": {
            "type": "PruningContentFilter",
            "params": {
              "threshold": 0.48,
              "threshold_type": "fixed"
            }
          }
        }
      }
    }
  }
}
```
This shows how deeply configurations can nest while maintaining a consistent structure.
#### Quick Grammar Overview
```
config := {
  "type": string,
  "params": {
    key: simple_value | complex_value
  }
}

simple_value := string | number | boolean | [simple_value]
complex_value := config | dict_value

dict_value := {
  "type": "dict",
  "value": object
}
```
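A quick way to sanity-check a payload against this grammar before sending it is a small recursive validator (a sketch for illustration, not shipped with Crawl4AI):

```python
def is_simple(value):
    """Simple values: strings, numbers, booleans, and lists of simple values."""
    if isinstance(value, (str, int, float, bool)):
        return True
    return isinstance(value, list) and all(is_simple(v) for v in value)

def is_valid_config(obj):
    """Recursively check an object against the type/params grammar above."""
    if not isinstance(obj, dict) or "type" not in obj:
        return False
    if obj["type"] == "dict":
        return isinstance(obj.get("value"), dict)
    params = obj.get("params", {})
    if not isinstance(params, dict):
        return False
    return all(is_simple(v) or is_valid_config(v) for v in params.values())
```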
#### Important Rules 🚨
- Always use the type-params pattern for class instances
- Use direct values for primitives (numbers, strings, booleans)
- Wrap dictionaries with {"type": "dict", "value": {...}}
- Arrays/lists are passed directly without type-params
- All parameters are optional unless specifically required
#### Pro Tip 💡
The easiest way to get the correct structure is to:
1. Create configuration objects in Python
2. Use the `dump()` method to see their JSON representation
3. Use that JSON in your API calls
Example:
```python
from crawl4ai import CrawlerRunConfig, PruningContentFilter

config = CrawlerRunConfig(
    content_filter=PruningContentFilter(threshold=0.48)
)
print(config.dump())  # Use this JSON in your API calls
```
#### More Examples
**Advanced Crawler Configuration**
```json
{
  "urls": ["https://example.com"],
  "crawler_config": {
    "type": "CrawlerRunConfig",
    "params": {
      "cache_mode": "bypass",
      "markdown_generator": {
        "type": "DefaultMarkdownGenerator",
        "params": {
          "content_filter": {
            "type": "PruningContentFilter",
            "params": {
              "threshold": 0.48,
              "threshold_type": "fixed",
              "min_word_threshold": 0
            }
          }
        }
      }
    }
  }
}
```
**Extraction Strategy**:
```json
{
  "crawler_config": {
    "type": "CrawlerRunConfig",
    "params": {
      "extraction_strategy": {
        "type": "JsonCssExtractionStrategy",
        "params": {
          "schema": {
            "baseSelector": "article.post",
            "fields": [
              {"name": "title", "selector": "h1", "type": "text"},
              {"name": "content", "selector": ".content", "type": "html"}
            ]
          }
        }
      }
    }
  }
}
```
**LLM Extraction Strategy**
```json
{
  "crawler_config": {
    "type": "CrawlerRunConfig",
    "params": {
      "extraction_strategy": {
        "type": "LLMExtractionStrategy",
        "params": {
          "instruction": "Extract article title, author, publication date and main content",
          "provider": "openai/gpt-4",
          "api_token": "your-api-token",
          "schema": {
            "type": "dict",
            "value": {
              "title": "Article Schema",
              "type": "object",
              "properties": {
                "title": {
                  "type": "string",
                  "description": "The article's headline"
                },
                "author": {
                  "type": "string",
                  "description": "The author's name"
                },
                "published_date": {
                  "type": "string",
                  "format": "date-time",
                  "description": "Publication date and time"
                },
                "content": {
                  "type": "string",
                  "description": "The main article content"
                }
              },
              "required": ["title", "content"]
            }
          }
        }
      }
    }
  }
}
```
**Deep Crawler Example**
```json
{
  "crawler_config": {
    "type": "CrawlerRunConfig",
    "params": {
      "deep_crawl_strategy": {
        "type": "BFSDeepCrawlStrategy",
        "params": {
          "max_depth": 3,
          "max_pages": 100,
          "filter_chain": {
            "type": "FastFilterChain",
            "params": {
              "filters": [
                {
                  "type": "FastContentTypeFilter",
                  "params": {
                    "allowed_types": ["text/html", "application/xhtml+xml"]
                  }
                },
                {
                  "type": "FastDomainFilter",
                  "params": {
                    "allowed_domains": ["blog.*", "docs.*"],
                    "blocked_domains": ["ads.*", "analytics.*"]
                  }
                },
                {
                  "type": "FastURLPatternFilter",
                  "params": {
                    "allowed_patterns": ["^/blog/", "^/docs/"],
                    "blocked_patterns": [".*/ads/", ".*/sponsored/"]
                  }
                }
              ]
            }
          },
          "url_scorer": {
            "type": "FastCompositeScorer",
            "params": {
              "scorers": [
                {
                  "type": "FastKeywordRelevanceScorer",
                  "params": {
                    "keywords": ["tutorial", "guide", "documentation"],
                    "weight": 1.0
                  }
                },
                {
                  "type": "FastPathDepthScorer",
                  "params": {
                    "weight": 0.5,
                    "preferred_depth": 2
                  }
                },
                {
                  "type": "FastFreshnessScorer",
                  "params": {
                    "weight": 0.8,
                    "max_age_days": 365
                  }
                }
              ]
            }
          }
        }
      }
    }
  }
}
```
### REST API Examples
Let's look at some practical examples:
#### Simple Crawl
```python
import requests

crawl_payload = {
    "urls": ["https://example.com"],
    "browser_config": {"headless": True},
    "crawler_config": {"stream": False}
}
response = requests.post(
    "http://localhost:8000/crawl",
    # headers={"Authorization": f"Bearer {token}"},  # If JWT is enabled, more on this later
    json=crawl_payload
)
print(response.json())  # Print the response for debugging
```
#### Streaming Results
```python
import json

async def test_stream_crawl(session, token: str):
    """Test the /crawl/stream endpoint with multiple URLs."""
    url = "http://localhost:8000/crawl/stream"
    payload = {
        "urls": [
            "https://example.com",
            "https://example.com/page1",
            "https://example.com/page2",
            "https://example.com/page3",
        ],
        "browser_config": {"headless": True, "viewport": {"width": 1200}},
        "crawler_config": {"stream": True, "cache_mode": "aggressive"}
    }
    headers = {}
    # headers = {"Authorization": f"Bearer {token}"}  # If JWT is enabled, more on this later
    try:
        async with session.post(url, json=payload, headers=headers) as response:
            status = response.status
            print(f"Status: {status} (Expected: 200)")
            assert status == 200, f"Expected 200, got {status}"
            # Read streaming response line-by-line (NDJSON)
            async for line in response.content:
                if line:
                    data = json.loads(line.decode('utf-8').strip())
                    print(f"Streamed Result: {json.dumps(data, indent=2)}")
    except Exception as e:
        print(f"Error in streaming crawl test: {str(e)}")
```
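Because the stream is NDJSON, each line is an independent JSON record, and records may be split across network chunks. If you consume the raw byte stream yourself instead of iterating line-by-line, a small buffering helper (hypothetical, not part of the SDK) reassembles them:

```python
import json

def iter_ndjson(chunks):
    """Yield parsed JSON records from an iterable of byte chunks,
    buffering until each newline-terminated record is complete."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)

# Records survive arbitrary chunk boundaries:
records = list(iter_ndjson([b'{"url": "https://example.com"}\n{"sta', b'tus": "completed"}\n']))
```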
## Metrics & Monitoring
Keep an eye on your crawler with these endpoints:
- `/health` - Quick health check
- `/metrics` - Detailed Prometheus metrics
- `/schema` - Full API schema
Example health check:
```bash
curl http://localhost:8000/health
```
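The `/metrics` endpoint returns the standard Prometheus text exposition format. If you just want a couple of values in a script without a Prometheus client library, a minimal parser like this works for the common `name value` sample lines (a sketch; it skips HELP/TYPE metadata and does not group labeled samples):

```python
def parse_prometheus_samples(text):
    """Map metric sample lines of a Prometheus exposition payload to floats,
    skipping comments (# HELP / # TYPE) and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # skip malformed lines rather than fail
    return samples
```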
## Deployment Scenarios
> 🚧 Coming soon! We'll cover:
> - Kubernetes deployment
> - Cloud provider setups (AWS, GCP, Azure)
> - High-availability configurations
> - Load balancing strategies
## Complete Examples
Check out the `examples` folder in our repository for full working examples! Here are two to get you started:
- [Using Client SDK](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_python_sdk_example.py)
- [Using REST API](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_python_rest_api_example.py)
## Server Configuration
The server's behavior can be customized through the `config.yml` file. Let's explore how to configure your Crawl4AI server for optimal performance and security.
### Understanding config.yml
The configuration file is located at `deploy/docker/config.yml`. You can either modify this file before building the image or mount a custom configuration when running the container.
Here's a detailed breakdown of the configuration options:
```yaml
# Application Configuration
app:
  title: "Crawl4AI API"        # Server title in OpenAPI docs
  version: "1.0.0"             # API version
  host: "0.0.0.0"              # Listen on all interfaces
  port: 8000                   # Server port
  reload: True                 # Enable hot reloading (development only)
  timeout_keep_alive: 300      # Keep-alive timeout in seconds

# Rate Limiting Configuration
rate_limiting:
  enabled: True                # Enable/disable rate limiting
  default_limit: "100/minute"  # Rate limit format: "number/timeunit"
  trusted_proxies: []          # List of trusted proxy IPs
  storage_uri: "memory://"     # Use "redis://localhost:6379" for production

# Security Configuration
security:
  enabled: false               # Master toggle for security features
  jwt_enabled: true            # Enable JWT authentication
  https_redirect: True         # Force HTTPS
  trusted_hosts: ["*"]         # Allowed hosts (use specific domains in production)
  headers:                     # Security headers
    x_content_type_options: "nosniff"
    x_frame_options: "DENY"
    content_security_policy: "default-src 'self'"
    strict_transport_security: "max-age=63072000; includeSubDomains"

# Crawler Configuration
crawler:
  memory_threshold_percent: 95.0  # Memory usage threshold
  rate_limiter:
    base_delay: [1.0, 2.0]        # Min and max delay between requests
  timeouts:
    stream_init: 30.0             # Stream initialization timeout
    batch_process: 300.0          # Batch processing timeout

# Logging Configuration
logging:
  level: "INFO"                # Log level (DEBUG, INFO, WARNING, ERROR)
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

# Observability Configuration
observability:
  prometheus:
    enabled: True              # Enable Prometheus metrics
    endpoint: "/metrics"       # Metrics endpoint
  health_check:
    endpoint: "/health"        # Health check endpoint
```
### JWT Authentication
When `security.jwt_enabled` is set to `true` in your config.yml, all endpoints require JWT authentication via bearer tokens. Here's how it works:
#### Getting a Token
```http
POST /token
Content-Type: application/json

{
  "email": "user@example.com"
}
```
The endpoint returns:
```json
{
  "email": "user@example.com",
  "access_token": "eyJ0eXAiOiJKV1QiLCJhbGciOi...",
  "token_type": "bearer"
}
```
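With the standard library alone, requesting a token looks like this (`build_token_request` and `get_token` are illustrative helpers; the live call assumes a running server):

```python
import json
import urllib.request

def build_token_request(base_url, email):
    """Construct the POST /token request described above."""
    return urllib.request.Request(
        f"{base_url}/token",
        data=json.dumps({"email": email}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def get_token(base_url, email):
    """Send the request and return the bearer token (server must be running)."""
    with urllib.request.urlopen(build_token_request(base_url, email)) as resp:
        return json.loads(resp.read())["access_token"]
```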
#### Using the Token
Add the token to your requests:
```bash
curl -H "Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGci..." http://localhost:8000/crawl
```
Using the Python SDK:
```python
from crawl4ai.docker_client import Crawl4aiDockerClient

async with Crawl4aiDockerClient() as client:
    # Authenticate first
    await client.authenticate("user@example.com")
    # Now all requests will include the token automatically
    result = await client.crawl(urls=["https://example.com"])
```
#### Production Considerations 💡
The default implementation uses a simple email verification. For production use, consider:
- Email verification via OTP/magic links
- OAuth2 integration
- Rate limiting token generation
- Token expiration and refresh mechanisms
- IP-based restrictions
### Configuration Tips and Best Practices
1. **Production Settings** 🏭
```yaml
app:
  reload: False                # Disable reload in production
  timeout_keep_alive: 120      # Lower timeout for better resource management

rate_limiting:
  storage_uri: "redis://redis:6379"  # Use Redis for distributed rate limiting
  default_limit: "50/minute"         # More conservative rate limit

security:
  enabled: true                        # Enable all security features
  trusted_hosts: ["your-domain.com"]   # Restrict to your domain
```
2. **Development Settings** 🛠️
```yaml
app:
  reload: True                 # Enable hot reloading
  timeout_keep_alive: 300      # Longer timeout for debugging

logging:
  level: "DEBUG"               # More verbose logging
```
3. **High-Traffic Settings** 🚦
```yaml
crawler:
  memory_threshold_percent: 85.0  # More conservative memory limit
  rate_limiter:
    base_delay: [2.0, 4.0]        # More aggressive rate limiting
```
### Customizing Your Configuration
#### Method 1: Pre-build Configuration
```bash
# Copy and modify config before building
cd crawl4ai/deploy
vim custom-config.yml # Or use any editor
# Build with custom config
docker build --platform=linux/amd64 --no-cache -t crawl4ai:latest .
```
#### Method 2: Build-time Configuration
Use a custom config during build:
```bash
# Build with custom config
docker build --platform=linux/amd64 --no-cache \
--build-arg CONFIG_PATH=/path/to/custom-config.yml \
-t crawl4ai:latest .
```
#### Method 3: Runtime Configuration
```bash
# Mount custom config at runtime
docker run -d -p 8000:8000 \
-v $(pwd)/custom-config.yml:/app/config.yml \
crawl4ai-server:prod
```
> 💡 Note: When using Method 2, `/path/to/custom-config.yml` is relative to the `deploy` directory.
> 💡 Note: When using Method 3, ensure your custom config file has all required fields as the container will use this instead of the built-in config.
### Configuration Recommendations
1. **Security First** 🔒
- Always enable security in production
- Use specific trusted_hosts instead of wildcards
- Set up proper rate limiting to protect your server
- Consider your environment before enabling HTTPS redirect
2. **Resource Management** 💻
- Adjust memory_threshold_percent based on available RAM
- Set timeouts according to your content size and network conditions
- Use Redis for rate limiting in multi-container setups
3. **Monitoring** 📊
- Enable Prometheus if you need metrics
- Set DEBUG logging in development, INFO in production
- Regular health check monitoring is crucial
4. **Performance Tuning** ⚡
- Start with conservative rate limiter delays
- Increase batch_process timeout for large content
- Adjust stream_init timeout based on initial response times
## Getting Help
We're here to help you succeed with Crawl4AI! Here's how to get support:
- 📖 Check our [full documentation](https://docs.crawl4ai.com)
- 🐛 Found a bug? [Open an issue](https://github.com/unclecode/crawl4ai/issues)
- 💬 Join our [Discord community](https://discord.gg/crawl4ai)
- ⭐ Star us on GitHub to show support!
## Summary
In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:
- Building and running the Docker container
- Configuring the environment
- Making API requests with proper typing
- Using the Python SDK
- Monitoring your deployment
Remember, the examples in the `examples` folder are your friends - they show real-world usage patterns that you can adapt for your needs.
Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀
Happy crawling! 🕷️

View File

@@ -1,442 +0,0 @@
import os
import json
import asyncio
from typing import List, Tuple
import logging
from typing import Optional, AsyncGenerator
from urllib.parse import unquote
from fastapi import HTTPException, Request, status
from fastapi.background import BackgroundTasks
from fastapi.responses import JSONResponse
from redis import asyncio as aioredis
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    LLMExtractionStrategy,
    CacheMode,
    BrowserConfig,
    MemoryAdaptiveDispatcher,
    RateLimiter
)
from crawl4ai.utils import perform_completion_with_backoff
from crawl4ai.content_filter_strategy import (
    PruningContentFilter,
    BM25ContentFilter,
    LLMContentFilter
)
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from utils import (
    TaskStatus,
    FilterType,
    get_base_url,
    is_task_id,
    should_cleanup_task,
    decode_redis_hash
)
logger = logging.getLogger(__name__)
async def handle_llm_qa(
    url: str,
    query: str,
    config: dict
) -> str:
    """Process QA using LLM with crawled content as context."""
    try:
        # Extract base URL by finding last '?q=' occurrence
        last_q_index = url.rfind('?q=')
        if last_q_index != -1:
            url = url[:last_q_index]

        # Get markdown content
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url)
            if not result.success:
                raise HTTPException(
                    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
                    detail=result.error_message
                )
            content = result.markdown_v2.fit_markdown

        # Create prompt and get LLM response
        prompt = f"""Use the following content as context to answer the question.
Content:
{content}
Question: {query}
Answer:"""
        response = perform_completion_with_backoff(
            provider=config["llm"]["provider"],
            prompt_with_variables=prompt,
            api_token=os.environ.get(config["llm"].get("api_key_env", ""))
        )
        return response.choices[0].message.content
    except Exception as e:
        logger.error(f"QA processing error: {str(e)}", exc_info=True)
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=str(e)
        )
async def process_llm_extraction(
    redis: aioredis.Redis,
    config: dict,
    task_id: str,
    url: str,
    instruction: str,
    schema: Optional[str] = None,
    cache: str = "0"
) -> None:
    """Process LLM extraction in background."""
    try:
        # If config['llm'] has api_key then ignore the api_key_env
        api_key = ""
        if "api_key" in config["llm"]:
            api_key = config["llm"]["api_key"]
        else:
            api_key = os.environ.get(config["llm"].get("api_key_env", ""), "")
        llm_strategy = LLMExtractionStrategy(
            provider=config["llm"]["provider"],
            api_token=api_key,
            instruction=instruction,
            schema=json.loads(schema) if schema else None,
        )
        cache_mode = CacheMode.ENABLED if cache == "1" else CacheMode.WRITE_ONLY
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                config=CrawlerRunConfig(
                    extraction_strategy=llm_strategy,
                    scraping_strategy=LXMLWebScrapingStrategy(),
                    cache_mode=cache_mode
                )
            )
            if not result.success:
                await redis.hset(f"task:{task_id}", mapping={
                    "status": TaskStatus.FAILED,
                    "error": result.error_message
                })
                return
            try:
                content = json.loads(result.extracted_content)
            except json.JSONDecodeError:
                content = result.extracted_content
            await redis.hset(f"task:{task_id}", mapping={
                "status": TaskStatus.COMPLETED,
                "result": json.dumps(content)
            })
    except Exception as e:
        logger.error(f"LLM extraction error: {str(e)}", exc_info=True)
        await redis.hset(f"task:{task_id}", mapping={
            "status": TaskStatus.FAILED,
            "error": str(e)
        })
async def handle_markdown_request(
    url: str,
    filter_type: FilterType,
    query: Optional[str] = None,
    cache: str = "0",
    config: Optional[dict] = None
) -> str:
    """Handle markdown generation requests."""
    try:
        decoded_url = unquote(url)
        if not decoded_url.startswith(('http://', 'https://')):
            decoded_url = 'https://' + decoded_url

        if filter_type == FilterType.RAW:
            md_generator = DefaultMarkdownGenerator()
        else:
            content_filter = {
                FilterType.FIT: PruningContentFilter(),
                FilterType.BM25: BM25ContentFilter(user_query=query or ""),
                FilterType.LLM: LLMContentFilter(
                    provider=config["llm"]["provider"],
                    api_token=os.environ.get(config["llm"].get("api_key_env", ""), ""),
                    instruction=query or "Extract main content"
                )
            }[filter_type]
            md_generator = DefaultMarkdownGenerator(content_filter=content_filter)
        cache_mode = CacheMode.ENABLED if cache == "1" else CacheMode.WRITE_ONLY

        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=decoded_url,
                config=CrawlerRunConfig(
                    markdown_generator=md_generator,
                    scraping_strategy=LXMLWebScrapingStrategy(),
                    cache_mode=cache_mode
                )
            )
            if not result.success:
                raise HTTPException(
                    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
                    detail=result.error_message
                )
            return (result.markdown_v2.raw_markdown
                    if filter_type == FilterType.RAW
                    else result.markdown_v2.fit_markdown)
    except Exception as e:
        logger.error(f"Markdown error: {str(e)}", exc_info=True)
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=str(e)
        )
async def handle_llm_request(
    redis: aioredis.Redis,
    background_tasks: BackgroundTasks,
    request: Request,
    input_path: str,
    query: Optional[str] = None,
    schema: Optional[str] = None,
    cache: str = "0",
    config: Optional[dict] = None
) -> JSONResponse:
    """Handle LLM extraction requests."""
    base_url = get_base_url(request)
    try:
        if is_task_id(input_path):
            return await handle_task_status(
                redis, input_path, base_url
            )
        if not query:
            return JSONResponse({
                "message": "Please provide an instruction",
                "_links": {
                    "example": {
                        "href": f"{base_url}/llm/{input_path}?q=Extract+main+content",
                        "title": "Try this example"
                    }
                }
            })
        return await create_new_task(
            redis,
            background_tasks,
            input_path,
            query,
            schema,
            cache,
            base_url,
            config
        )
    except Exception as e:
        logger.error(f"LLM endpoint error: {str(e)}", exc_info=True)
        return JSONResponse({
            "error": str(e),
            "_links": {
                "retry": {"href": str(request.url)}
            }
        }, status_code=status.HTTP_500_INTERNAL_SERVER_ERROR)
async def handle_task_status(
    redis: aioredis.Redis,
    task_id: str,
    base_url: str
) -> JSONResponse:
    """Handle task status check requests."""
    task = await redis.hgetall(f"task:{task_id}")
    if not task:
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail="Task not found"
        )
    task = decode_redis_hash(task)
    response = create_task_response(task, task_id, base_url)
    if task["status"] in [TaskStatus.COMPLETED, TaskStatus.FAILED]:
        if should_cleanup_task(task["created_at"]):
            await redis.delete(f"task:{task_id}")
    return JSONResponse(response)
async def create_new_task(
    redis: aioredis.Redis,
    background_tasks: BackgroundTasks,
    input_path: str,
    query: str,
    schema: Optional[str],
    cache: str,
    base_url: str,
    config: dict
) -> JSONResponse:
    """Create and initialize a new task."""
    decoded_url = unquote(input_path)
    if not decoded_url.startswith(('http://', 'https://')):
        decoded_url = 'https://' + decoded_url

    from datetime import datetime
    task_id = f"llm_{int(datetime.now().timestamp())}_{id(background_tasks)}"
    await redis.hset(f"task:{task_id}", mapping={
        "status": TaskStatus.PROCESSING,
        "created_at": datetime.now().isoformat(),
        "url": decoded_url
    })
    background_tasks.add_task(
        process_llm_extraction,
        redis,
        config,
        task_id,
        decoded_url,
        query,
        schema,
        cache
    )
    return JSONResponse({
        "task_id": task_id,
        "status": TaskStatus.PROCESSING,
        "url": decoded_url,
        "_links": {
            "self": {"href": f"{base_url}/llm/{task_id}"},
            "status": {"href": f"{base_url}/llm/{task_id}"}
        }
    })
def create_task_response(task: dict, task_id: str, base_url: str) -> dict:
    """Create response for task status check."""
    response = {
        "task_id": task_id,
        "status": task["status"],
        "created_at": task["created_at"],
        "url": task["url"],
        "_links": {
            "self": {"href": f"{base_url}/llm/{task_id}"},
            "refresh": {"href": f"{base_url}/llm/{task_id}"}
        }
    }
    if task["status"] == TaskStatus.COMPLETED:
        response["result"] = json.loads(task["result"])
    elif task["status"] == TaskStatus.FAILED:
        response["error"] = task["error"]
    return response
async def stream_results(crawler: AsyncWebCrawler, results_gen: AsyncGenerator) -> AsyncGenerator[bytes, None]:
    """Stream results with heartbeats and completion markers."""
    import json
    from utils import datetime_handler
    try:
        async for result in results_gen:
            try:
                result_dict = result.model_dump()
                logger.info(f"Streaming result for {result_dict.get('url', 'unknown')}")
                data = json.dumps(result_dict, default=datetime_handler) + "\n"
                yield data.encode('utf-8')
            except Exception as e:
                logger.error(f"Serialization error: {e}")
                error_response = {"error": str(e), "url": getattr(result, 'url', 'unknown')}
                yield (json.dumps(error_response) + "\n").encode('utf-8')
        # Terminate the NDJSON stream with a newline so clients can parse the marker
        yield (json.dumps({"status": "completed"}) + "\n").encode('utf-8')
    except asyncio.CancelledError:
        logger.warning("Client disconnected during streaming")
    finally:
        try:
            await crawler.close()
        except Exception as e:
            logger.error(f"Crawler cleanup error: {e}")
async def handle_crawl_request(
    urls: List[str],
    browser_config: dict,
    crawler_config: dict,
    config: dict
) -> dict:
    """Handle non-streaming crawl requests."""
    try:
        browser_config = BrowserConfig.load(browser_config)
        crawler_config = CrawlerRunConfig.load(crawler_config)
        dispatcher = MemoryAdaptiveDispatcher(
            memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
            rate_limiter=RateLimiter(
                base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"])
            )
        )
        async with AsyncWebCrawler(config=browser_config) as crawler:
            results = await crawler.arun_many(
                urls=urls,
                config=crawler_config,
                dispatcher=dispatcher
            )
            return {
                "success": True,
                "results": [result.model_dump() for result in results]
            }
    except Exception as e:
        logger.error(f"Crawl error: {str(e)}", exc_info=True)
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=str(e)
        )
async def handle_stream_crawl_request(
    urls: List[str],
    browser_config: dict,
    crawler_config: dict,
    config: dict
) -> Tuple[AsyncWebCrawler, AsyncGenerator]:
    """Handle streaming crawl requests."""
    try:
        browser_config = BrowserConfig.load(browser_config)
        browser_config.verbose = True
        crawler_config = CrawlerRunConfig.load(crawler_config)
        crawler_config.scraping_strategy = LXMLWebScrapingStrategy()
        dispatcher = MemoryAdaptiveDispatcher(
            memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
            rate_limiter=RateLimiter(
                base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"])
            )
        )
        crawler = AsyncWebCrawler(config=browser_config)
        await crawler.start()
        results_gen = await crawler.arun_many(
            urls=urls,
            config=crawler_config,
            dispatcher=dispatcher
        )
        return crawler, results_gen
    except Exception as e:
        if 'crawler' in locals():
            await crawler.close()
        logger.error(f"Stream crawl error: {str(e)}", exc_info=True)
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=str(e)
        )

View File

@@ -1,46 +0,0 @@
import os
import base64
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

from jwt import JWT, jwk_from_dict
from jwt.utils import get_int_from_datetime
from fastapi import Depends, HTTPException
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from pydantic import EmailStr
from pydantic.main import BaseModel

instance = JWT()
security = HTTPBearer()
SECRET_KEY = os.environ.get("SECRET_KEY", "mysecret")
ACCESS_TOKEN_EXPIRE_MINUTES = 60

def get_jwk_from_secret(secret: str):
    """Convert a secret string into a JWK object."""
    secret_bytes = secret.encode('utf-8')
    b64_secret = base64.urlsafe_b64encode(secret_bytes).rstrip(b'=').decode('utf-8')
    return jwk_from_dict({"kty": "oct", "k": b64_secret})

def create_access_token(data: dict, expires_delta: Optional[timedelta] = None) -> str:
    """Create a JWT access token with an expiration."""
    to_encode = data.copy()
    expire = datetime.now(timezone.utc) + (expires_delta or timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES))
    to_encode.update({"exp": get_int_from_datetime(expire)})
    signing_key = get_jwk_from_secret(SECRET_KEY)
    return instance.encode(to_encode, signing_key, alg='HS256')

def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)) -> Dict:
    """Verify the JWT token from the Authorization header."""
    token = credentials.credentials
    verifying_key = get_jwk_from_secret(SECRET_KEY)
    try:
        payload = instance.decode(token, verifying_key, do_time_check=True, algorithms='HS256')
        return payload
    except Exception:
        raise HTTPException(status_code=401, detail="Invalid or expired token")

def get_token_dependency(config: Dict):
    """Return the token dependency if JWT is enabled, else None."""
    return verify_token if config.get("security", {}).get("jwt_enabled", False) else None

class TokenRequest(BaseModel):
    email: EmailStr
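For intuition, here is a minimal stdlib-only sketch of what the HS256 signing performed above by the `jwt` library amounts to. This is illustrative only (a hand-rolled JWT; use the library in production), and the secret and claims are placeholder values:

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT compact format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_hs256(payload: dict, secret: str) -> str:
    """Build a compact JWT: b64url(header).b64url(payload).b64url(HMAC-SHA256 sig)."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{b64url(sig)}"

token = sign_hs256({"sub": "user@example.com", "exp": int(time.time()) + 3600}, "mysecret")
print(token.count("."))  # 2: header.payload.signature
```

The `exp` claim mirrors what `create_access_token` sets via `get_int_from_datetime`; a verifier recomputes the HMAC over the first two segments and compares.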

View File

@@ -1,71 +0,0 @@
# Application Configuration
app:
  title: "Crawl4AI API"
  version: "1.0.0"
  host: "0.0.0.0"
  port: 8000
  reload: True
  timeout_keep_alive: 300

# Default LLM Configuration
llm:
  provider: "openai/gpt-4o-mini"
  api_key_env: "OPENAI_API_KEY"
  # api_key: sk-...  # If you pass the API key directly, api_key_env is ignored

# Redis Configuration
redis:
  host: "localhost"
  port: 6379
  db: 0
  password: ""
  ssl: False
  ssl_cert_reqs: None
  ssl_ca_certs: None
  ssl_certfile: None
  ssl_keyfile: None

# Rate Limiting Configuration
rate_limiting:
  enabled: True
  default_limit: "1000/minute"
  trusted_proxies: []
  storage_uri: "memory://"  # Use "redis://localhost:6379" for production

# Security Configuration
security:
  enabled: true
  jwt_enabled: true
  https_redirect: false
  trusted_hosts: ["*"]
  headers:
    x_content_type_options: "nosniff"
    x_frame_options: "DENY"
    content_security_policy: "default-src 'self'"
    strict_transport_security: "max-age=63072000; includeSubDomains"

# Crawler Configuration
crawler:
  memory_threshold_percent: 95.0
  rate_limiter:
    base_delay: [1.0, 2.0]
  timeouts:
    stream_init: 30.0     # Timeout for stream initialization
    batch_process: 300.0  # Timeout for batch processing

# Logging Configuration
logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

# Observability Configuration
observability:
  prometheus:
    enabled: True
    endpoint: "/metrics"
  health_check:
    endpoint: "/health"
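For reference, a minimal stdlib-only sketch of how the `crawler` block above is consumed when the dispatcher is built in `api.py` — note that YAML lists arrive as Python lists, while `RateLimiter` is given a `(min, max)` tuple. The nested dict here mirrors the YAML rather than loading the file:

```python
# Stand-in for yaml.safe_load(open("config.yml")) — values copied from the YAML above.
config = {
    "crawler": {
        "memory_threshold_percent": 95.0,
        "rate_limiter": {"base_delay": [1.0, 2.0]},
        "timeouts": {"stream_init": 30.0, "batch_process": 300.0},
    }
}

# The list from YAML is converted to a tuple before being handed to RateLimiter.
base_delay = tuple(config["crawler"]["rate_limiter"]["base_delay"])
threshold = config["crawler"]["memory_threshold_percent"]
print(base_delay, threshold)
```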

View File

@@ -1,10 +0,0 @@
crawl4ai
fastapi
uvicorn
gunicorn>=23.0.0
slowapi>=0.1.9
prometheus-fastapi-instrumentator>=7.0.2
redis>=5.2.1
jwt>=1.3.1
dnspython>=2.7.0
email-validator>=2.2.0

View File

@@ -1,181 +0,0 @@
import os
import sys
import time
from typing import List, Optional, Dict

from fastapi import FastAPI, HTTPException, Request, Query, Path, Depends
from fastapi.responses import StreamingResponse, RedirectResponse, PlainTextResponse, JSONResponse
from fastapi.middleware.httpsredirect import HTTPSRedirectMiddleware
from fastapi.middleware.trustedhost import TrustedHostMiddleware
from pydantic import BaseModel, Field
from slowapi import Limiter
from slowapi.util import get_remote_address
from prometheus_fastapi_instrumentator import Instrumentator
from redis import asyncio as aioredis

sys.path.append(os.path.dirname(os.path.realpath(__file__)))

from utils import FilterType, load_config, setup_logging, verify_email_domain
from api import (
    handle_markdown_request,
    handle_llm_qa,
    handle_stream_crawl_request,
    handle_crawl_request,
    stream_results
)
from auth import create_access_token, get_token_dependency, TokenRequest  # Import from auth.py

__version__ = "0.2.6"

class CrawlRequest(BaseModel):
    urls: List[str] = Field(min_length=1, max_length=100)
    browser_config: Optional[Dict] = Field(default_factory=dict)
    crawler_config: Optional[Dict] = Field(default_factory=dict)

# Load configuration and set up logging
config = load_config()
setup_logging(config)

# Initialize Redis
redis = aioredis.from_url(config["redis"].get("uri", "redis://localhost"))

# Initialize rate limiter
limiter = Limiter(
    key_func=get_remote_address,
    default_limits=[config["rate_limiting"]["default_limit"]],
    storage_uri=config["rate_limiting"]["storage_uri"]
)

app = FastAPI(
    title=config["app"]["title"],
    version=config["app"]["version"]
)

# Configure middleware
def setup_security_middleware(app, config):
    sec_config = config.get("security", {})
    if sec_config.get("enabled", False):
        if sec_config.get("https_redirect", False):
            app.add_middleware(HTTPSRedirectMiddleware)
        if sec_config.get("trusted_hosts", []) != ["*"]:
            app.add_middleware(TrustedHostMiddleware, allowed_hosts=sec_config["trusted_hosts"])

setup_security_middleware(app, config)

# Prometheus instrumentation
if config["observability"]["prometheus"]["enabled"]:
    Instrumentator().instrument(app).expose(app)

# Get token dependency based on config
token_dependency = get_token_dependency(config)

# Middleware for security headers
@app.middleware("http")
async def add_security_headers(request: Request, call_next):
    response = await call_next(request)
    if config["security"]["enabled"]:
        response.headers.update(config["security"]["headers"])
    return response

# Token endpoint (always available, but usage depends on config)
@app.post("/token")
async def get_token(request_data: TokenRequest):
    if not verify_email_domain(request_data.email):
        raise HTTPException(status_code=400, detail="Invalid email domain")
    token = create_access_token({"sub": request_data.email})
    return {"email": request_data.email, "access_token": token, "token_type": "bearer"}

# Endpoints with conditional auth
@app.get("/md/{url:path}")
@limiter.limit(config["rate_limiting"]["default_limit"])
async def get_markdown(
    request: Request,
    url: str,
    f: FilterType = FilterType.FIT,
    q: Optional[str] = None,
    c: Optional[str] = "0",
    token_data: Optional[Dict] = Depends(token_dependency)
):
    result = await handle_markdown_request(url, f, q, c, config)
    return PlainTextResponse(result)

@app.get("/llm/{url:path}", description="URL should be without http/https prefix")
async def llm_endpoint(
    request: Request,
    url: str = Path(...),
    q: Optional[str] = Query(None),
    token_data: Optional[Dict] = Depends(token_dependency)
):
    if not q:
        raise HTTPException(status_code=400, detail="Query parameter 'q' is required")
    if not url.startswith(('http://', 'https://')):
        url = 'https://' + url
    try:
        answer = await handle_llm_qa(url, q, config)
        return JSONResponse({"answer": answer})
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/schema")
async def get_schema():
    from crawl4ai import BrowserConfig, CrawlerRunConfig
    return {"browser": BrowserConfig().dump(), "crawler": CrawlerRunConfig().dump()}

@app.get(config["observability"]["health_check"]["endpoint"])
async def health():
    return {"status": "ok", "timestamp": time.time(), "version": __version__}

@app.get(config["observability"]["prometheus"]["endpoint"])
async def metrics():
    return RedirectResponse(url=config["observability"]["prometheus"]["endpoint"])

@app.post("/crawl")
@limiter.limit(config["rate_limiting"]["default_limit"])
async def crawl(
    request: Request,
    crawl_request: CrawlRequest,
    token_data: Optional[Dict] = Depends(token_dependency)
):
    if not crawl_request.urls:
        raise HTTPException(status_code=400, detail="At least one URL required")
    results = await handle_crawl_request(
        urls=crawl_request.urls,
        browser_config=crawl_request.browser_config,
        crawler_config=crawl_request.crawler_config,
        config=config
    )
    return JSONResponse(results)

@app.post("/crawl/stream")
@limiter.limit(config["rate_limiting"]["default_limit"])
async def crawl_stream(
    request: Request,
    crawl_request: CrawlRequest,
    token_data: Optional[Dict] = Depends(token_dependency)
):
    if not crawl_request.urls:
        raise HTTPException(status_code=400, detail="At least one URL required")
    crawler, results_gen = await handle_stream_crawl_request(
        urls=crawl_request.urls,
        browser_config=crawl_request.browser_config,
        crawler_config=crawl_request.crawler_config,
        config=config
    )
    return StreamingResponse(
        stream_results(crawler, results_gen),
        media_type='application/x-ndjson',
        headers={'Cache-Control': 'no-cache', 'Connection': 'keep-alive', 'X-Stream-Status': 'active'}
    )

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(
        "server:app",
        host=config["app"]["host"],
        port=config["app"]["port"],
        reload=config["app"]["reload"],
        timeout_keep_alive=config["app"]["timeout_keep_alive"]
    )

View File

@@ -1,12 +0,0 @@
[supervisord]
nodaemon=true
[program:redis]
command=redis-server
autorestart=true
priority=10
[program:gunicorn]
command=gunicorn --bind 0.0.0.0:8000 --workers 4 --threads 2 --timeout 300 --graceful-timeout 60 --keep-alive 65 --log-level debug --worker-class uvicorn.workers.UvicornWorker --max-requests 1000 --max-requests-jitter 50 server:app
autorestart=true
priority=20

View File

@@ -1,66 +0,0 @@
import logging
from datetime import datetime
from enum import Enum
from pathlib import Path
from typing import Any, Dict, Optional

import dns.resolver
import yaml
from fastapi import Request

class TaskStatus(str, Enum):
    PROCESSING = "processing"
    FAILED = "failed"
    COMPLETED = "completed"

class FilterType(str, Enum):
    RAW = "raw"
    FIT = "fit"
    BM25 = "bm25"
    LLM = "llm"

def load_config() -> Dict:
    """Load and return application configuration."""
    config_path = Path(__file__).parent / "config.yml"
    with open(config_path, "r") as config_file:
        return yaml.safe_load(config_file)

def setup_logging(config: Dict) -> None:
    """Configure application logging."""
    logging.basicConfig(
        level=config["logging"]["level"],
        format=config["logging"]["format"]
    )

def get_base_url(request: Request) -> str:
    """Get base URL including scheme and host."""
    return f"{request.url.scheme}://{request.url.netloc}"

def is_task_id(value: str) -> bool:
    """Check if the value matches the task ID pattern."""
    return value.startswith("llm_") and "_" in value

def datetime_handler(obj: Any) -> Optional[str]:
    """Handle datetime serialization for JSON."""
    if hasattr(obj, 'isoformat'):
        return obj.isoformat()
    raise TypeError(f"Object of type {type(obj)} is not JSON serializable")

def should_cleanup_task(created_at: str) -> bool:
    """Check if a task should be cleaned up based on its creation time."""
    created = datetime.fromisoformat(created_at)
    return (datetime.now() - created).total_seconds() > 3600

def decode_redis_hash(hash_data: Dict[bytes, bytes]) -> Dict[str, str]:
    """Decode Redis hash data from bytes to strings."""
    return {k.decode('utf-8'): v.decode('utf-8') for k, v in hash_data.items()}

def verify_email_domain(email: str) -> bool:
    """Check that the email's domain resolves at least one MX record."""
    try:
        domain = email.split('@')[1]
        records = dns.resolver.resolve(domain, 'MX')
        return bool(records)
    except Exception:
        return False
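The two task-state helpers are easiest to understand in isolation. Below is a self-contained usage sketch (the helpers are restated so the snippet runs standalone; the sample hash values are made up):

```python
from datetime import datetime, timedelta

def decode_redis_hash(hash_data):
    """Decode a Redis hash (bytes keys and values) into a str -> str dict."""
    return {k.decode("utf-8"): v.decode("utf-8") for k, v in hash_data.items()}

def should_cleanup_task(created_at: str) -> bool:
    """A task older than one hour is eligible for cleanup."""
    created = datetime.fromisoformat(created_at)
    return (datetime.now() - created).total_seconds() > 3600

# Redis returns bytes; decode before serializing to JSON.
raw = {b"status": b"completed", b"url": b"https://example.com"}
decoded = decode_redis_hash(raw)

# A task created two hours ago is past the one-hour retention window.
stale = (datetime.now() - timedelta(hours=2)).isoformat()
print(decoded, should_cleanup_task(stale))
```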

View File

@@ -1,77 +0,0 @@
# Crawl4AI API Quickstart
This document shows how to call the `/crawl` and `/md` endpoints using an API token (obtained from the `/token` endpoint when JWT auth is enabled).
---
## 1. Crawl Example
Send a POST request to `/crawl` with the following JSON payload:
```json
{
"urls": ["https://example.com"],
"browser_config": { "headless": true, "verbose": true },
"crawler_config": { "stream": false, "cache_mode": "enabled" }
}
```
**cURL Command:**
```bash
curl -X POST "https://api.crawl4ai.com/crawl" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"browser_config": {"headless": true, "verbose": true},
"crawler_config": {"stream": false, "cache_mode": "enabled"}
}'
```
---
## 2. Markdown Retrieval Example
To retrieve markdown from a given URL (e.g., `https://example.com`), use:
```bash
curl -X GET "https://api.crawl4ai.com/md/example.com" \
-H "Authorization: Bearer YOUR_API_TOKEN"
```
---
## 3. Python Code Example (Using `requests`)
Below is a sample Python script that demonstrates using the `requests` library to call the API endpoints:
```python
import requests
BASE_URL = "https://api.crawl4ai.com"
TOKEN = "YOUR_API_TOKEN" # Replace with your actual token
headers = {
"Authorization": f"Bearer {TOKEN}",
"Content-Type": "application/json"
}
# Crawl endpoint example
crawl_payload = {
"urls": ["https://example.com"],
"browser_config": {"headless": True, "verbose": True},
"crawler_config": {"stream": False, "cache_mode": "enabled"}
}
crawl_response = requests.post(f"{BASE_URL}/crawl", json=crawl_payload, headers=headers)
print("Crawl Response:", crawl_response.json())
# /md endpoint example
md_response = requests.get(f"{BASE_URL}/md/example.com", headers=headers)
print("Markdown Content:", md_response.text)
```
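The server also exposes a streaming endpoint, `/crawl/stream`, which returns results as newline-delimited JSON (`application/x-ndjson`). A sketch of consuming that format — the pure parsing helper below is ours for illustration, and the commented HTTP portion assumes the same `BASE_URL`, `headers`, and `crawl_payload` as above:

```python
import json

def parse_ndjson(chunk: str):
    """Split an NDJSON payload into a list of dicts, ignoring blank lines."""
    return [json.loads(line) for line in chunk.splitlines() if line.strip()]

# Over HTTP this pairs with requests' streaming mode, roughly:
#   with requests.post(f"{BASE_URL}/crawl/stream", json=crawl_payload,
#                      headers=headers, stream=True) as resp:
#       for line in resp.iter_lines(decode_unicode=True):
#           if line:
#               handle(json.loads(line))

sample = '{"url": "https://example.com", "success": true}\n' \
         '{"url": "https://example.org", "success": true}\n'
results = parse_ndjson(sample)
print(len(results), results[0]["url"])
```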
---
Happy crawling!

View File

@@ -1,2 +0,0 @@
FROM nginx:alpine
COPY nginx.conf /etc/nginx/conf.d/default.conf

View File

@@ -1,55 +0,0 @@
server {
    listen 80;
    server_name api.crawl4ai.com;

    # Main logging settings
    error_log /var/log/nginx/error.log debug;
    access_log /var/log/nginx/access.log combined buffer=512k flush=1m;

    # Timeout and buffering settings
    proxy_connect_timeout 300;
    proxy_send_timeout 300;
    proxy_read_timeout 300;
    send_timeout 300;
    proxy_buffer_size 128k;
    proxy_buffers 4 256k;
    proxy_busy_buffers_size 256k;

    # Health check location
    location /health {
        proxy_pass http://127.0.0.1:8000/health;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Main proxy for application endpoints
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        add_header X-Debug-Info $request_uri;
        proxy_request_buffering off;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;
    }

    # New endpoint: serve the Nginx error log
    location /nginx/error {
        # "alias" serves the error log file directly
        alias /var/log/nginx/error.log;
        # Optionally restrict access with "allow" and "deny" directives.
    }

    # New endpoint: serve the Nginx access log
    location /nginx/access {
        alias /var/log/nginx/access.log;
    }

    client_max_body_size 10M;
    client_body_buffer_size 128k;
}

View File

@@ -1 +0,0 @@
v0.1.0

View File

@@ -554,7 +554,7 @@ async def test_stream_crawl(session, token: str):
             "https://example.com/page3",
         ],
         "browser_config": {"headless": True, "viewport": {"width": 1200}},
-        "crawler_config": {"stream": True, "cache_mode": "aggressive"}
+        "crawler_config": {"stream": True, "cache_mode": "bypass"}
     }
     # headers = {"Authorization": f"Bearer {token}"}  # If JWT is enabled, more on this later
View File

@@ -2,6 +2,7 @@ import os
 import json
 import asyncio
 from typing import List, Tuple
+from functools import partial
 import logging
 from typing import Optional, AsyncGenerator
@@ -388,12 +389,13 @@ async def handle_crawl_request(
         )
         async with AsyncWebCrawler(config=browser_config) as crawler:
-            results = await crawler.arun_many(
-                urls=urls,
-                config=crawler_config,
-                dispatcher=dispatcher
-            )
+            results = []
+            func = getattr(crawler, "arun" if len(urls) == 1 else "arun_many")
+            partial_func = partial(func,
+                urls[0] if len(urls) == 1 else urls,
+                config=crawler_config,
+                dispatcher=dispatcher)
+            results = await partial_func()
             return {
                 "success": True,
                 "results": [result.model_dump() for result in results]
View File

@@ -1,63 +0,0 @@
FROM --platform=linux/amd64 python:3.10-slim
# Install system dependencies required for Chromium and Git
RUN apt-get update && apt-get install -y \
python3-dev \
pkg-config \
libjpeg-dev \
gcc \
build-essential \
libnss3 \
libnspr4 \
libatk1.0-0 \
libatk-bridge2.0-0 \
libcups2 \
libdrm2 \
libxkbcommon0 \
libxcomposite1 \
libxdamage1 \
libxfixes3 \
libxrandr2 \
libgbm1 \
libasound2 \
libpango-1.0-0 \
libcairo2 \
procps \
git \
socat \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Optional: make a directory for crawl4ai and call it crawl4ai_repo
# RUN mkdir crawl4ai_repo
# # Clone Crawl4ai from the next branch and install it
# RUN git clone --branch next https://github.com/unclecode/crawl4ai.git ./crawl4ai_repo \
# && cd crawl4ai_repo \
# && pip install . \
# && cd .. \
# && rm -rf crawl4ai_repo
RUN python3 -m venv /app/venv
ENV PATH="/app/venv/bin:$PATH"
# RUN pip install git+https://github.com/unclecode/crawl4ai.git@next
# Copy requirements and install remaining dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy application files
COPY resources /app/resources
COPY main.py .
COPY start.sh .
# Set permissions for Chrome binary and start script
RUN chmod +x /app/resources/chrome/headless_shell && \
chmod -R 755 /app/resources/chrome && \
chmod +x start.sh
ENV FUNCTION_TARGET=crawl
EXPOSE 8080 9223
CMD ["/app/start.sh"]

View File

@@ -1,8 +0,0 @@
project_id: PROJECT_ID
region: REGION_NAME
artifact_repo: ARTIFACT_REPO_NAME
function_name: FUNCTION_NAME
memory: "2048MB"
timeout: "540s"
local_image: "gcr.io/ARTIFACT_REPO_NAME/crawl4ai:latest"
test_query_url: "https://example.com"

View File

@@ -1,187 +0,0 @@
#!/usr/bin/env python3
import argparse
import subprocess
import sys

import requests
import yaml

def run_command(cmd, explanation, require_confirm=True, allow_already_exists=False):
    print("\n=== {} ===".format(explanation))
    if require_confirm:
        input("Press Enter to run: [{}]\n".format(cmd))
    print("Running: {}".format(cmd))
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        if allow_already_exists and "ALREADY_EXISTS" in result.stderr:
            print("Repository already exists, skipping creation.")
            return ""
        print("Error:\n{}".format(result.stderr))
        sys.exit(1)
    out = result.stdout.strip()
    if out:
        print("Output:\n{}".format(out))
    return out

def load_config():
    try:
        with open("config.yml", "r") as f:
            config = yaml.safe_load(f)
    except Exception as e:
        print("Failed to load config.yml: {}".format(e))
        sys.exit(1)
    required = ["project_id", "region", "artifact_repo", "function_name", "local_image"]
    for key in required:
        if key not in config or not config[key]:
            print("Missing required config parameter: {}".format(key))
            sys.exit(1)
    return config
def deploy_function(config):
    project_id = config["project_id"]
    region = config["region"]
    artifact_repo = config["artifact_repo"]
    function_name = config["function_name"]
    memory = config.get("memory", "2048MB")
    timeout = config.get("timeout", "540s")
    local_image = config["local_image"]
    test_query_url = config.get("test_query_url", "https://example.com")
    # Repository image format: "<region>-docker.pkg.dev/<project_id>/<artifact_repo>/<function_name>:latest"
    repo_image = f"{region}-docker.pkg.dev/{project_id}/{artifact_repo}/{function_name}:latest"

    # 1. Create Artifact Registry repository (skip if it exists)
    cmd = f"gcloud artifacts repositories create {artifact_repo} --repository-format=docker --location={region} --project={project_id}"
    run_command(cmd, "Creating Artifact Registry repository (if it doesn't exist)", allow_already_exists=True)

    # 2. Tag the local Docker image with the repository image name
    cmd = f"docker tag {local_image} {repo_image}"
    run_command(cmd, "Tagging Docker image for Artifact Registry")

    # 3. Authenticate Docker to Artifact Registry
    cmd = f"gcloud auth configure-docker {region}-docker.pkg.dev"
    run_command(cmd, "Authenticating Docker to Artifact Registry")

    # 4. Push the tagged Docker image to Artifact Registry
    cmd = f"docker push {repo_image}"
    run_command(cmd, "Pushing Docker image to Artifact Registry")

    # 5. Deploy the Cloud Function using the custom container
    cmd = (
        f"gcloud beta functions deploy {function_name} "
        f"--gen2 "
        f"--runtime=python310 "
        f"--entry-point=crawl "
        f"--region={region} "
        f"--docker-repository={region}-docker.pkg.dev/{project_id}/{artifact_repo} "
        f"--trigger-http "
        f"--memory={memory} "
        f"--timeout={timeout} "
        f"--project={project_id}"
    )
    run_command(cmd, "Deploying Cloud Function using custom container")

    # 6. Set the Cloud Function to allow public (unauthenticated) invocations
    cmd = (
        f"gcloud functions add-iam-policy-binding {function_name} "
        f"--region={region} "
        f"--member='allUsers' "
        f"--role='roles/cloudfunctions.invoker' "
        f"--project={project_id} "  # trailing space needed, or --quiet fuses into --project
        f"--quiet"
    )
    run_command(cmd, "Setting Cloud Function IAM to allow public invocations")

    # 7. Retrieve the deployed Cloud Function URL
    cmd = (
        f"gcloud functions describe {function_name} "
        f"--region={region} "
        f"--project={project_id} "
        f"--format='value(serviceConfig.uri)'"
    )
    deployed_url = run_command(cmd, "Extracting deployed Cloud Function URL", require_confirm=False)
    print("\nDeployed URL: {}\n".format(deployed_url))

    # 8. Test the deployed function
    test_url = f"{deployed_url}?url={test_query_url}"
    print("Testing function with: {}".format(test_url))
    try:
        response = requests.get(test_url)
        print("Response status: {}".format(response.status_code))
        print("Response body:\n{}".format(response.text))
        if response.status_code == 200:
            print("Test successful!")
        else:
            print("Non-200 response; check function logs.")
    except Exception as e:
        print("Test request error: {}".format(e))
        sys.exit(1)

    # 9. Final usage help
    print("\nDeployment complete!")
    print("Invoke your function with:")
    print(f"curl '{deployed_url}?url={test_query_url}'")
    print("For further instructions, refer to your documentation.")
def delete_function(config):
    project_id = config["project_id"]
    region = config["region"]
    function_name = config["function_name"]
    cmd = f"gcloud functions delete {function_name} --region={region} --project={project_id} --quiet"
    run_command(cmd, "Deleting Cloud Function")

def describe_function(config):
    project_id = config["project_id"]
    region = config["region"]
    function_name = config["function_name"]
    cmd = (
        f"gcloud functions describe {function_name} "
        f"--region={region} "
        f"--project={project_id} "
        f"--format='value(serviceConfig.uri)'"
    )
    deployed_url = run_command(cmd, "Describing Cloud Function to extract URL", require_confirm=False)
    print("\nCloud Function URL: {}\n".format(deployed_url))

def clear_all(config):
    print("\n=== CLEAR ALL RESOURCES ===")
    project_id = config["project_id"]
    region = config["region"]
    artifact_repo = config["artifact_repo"]
    confirm = input("WARNING: This will DELETE the Cloud Function and the Artifact Registry repository. Are you sure? (y/N): ")
    if confirm.lower() != "y":
        print("Aborting clear operation.")
        sys.exit(0)
    # Delete the Cloud Function
    delete_function(config)
    # Delete the Artifact Registry repository
    cmd = f"gcloud artifacts repositories delete {artifact_repo} --location={region} --project={project_id} --quiet"
    run_command(cmd, "Deleting Artifact Registry repository", require_confirm=False)
    print("All resources cleared.")

def main():
    parser = argparse.ArgumentParser(description="Deploy, delete, describe, or clear Cloud Function resources using config.yml")
    subparsers = parser.add_subparsers(dest="command", required=True)
    subparsers.add_parser("deploy", help="Deploy the Cloud Function")
    subparsers.add_parser("delete", help="Delete the deployed Cloud Function")
    subparsers.add_parser("describe", help="Describe the Cloud Function and return its URL")
    subparsers.add_parser("clear", help="Delete the Cloud Function and Artifact Registry repository")
    args = parser.parse_args()
    config = load_config()
    if args.command == "deploy":
        deploy_function(config)
    elif args.command == "delete":
        delete_function(config)
    elif args.command == "describe":
        describe_function(config)
    elif args.command == "clear":
        clear_all(config)
    else:
        parser.print_help()

if __name__ == "__main__":
    main()

View File

@@ -1,204 +0,0 @@
# Deploying Crawl4ai on Google Cloud Functions
This guide explains how to deploy **Crawl4ai**—an open-source web crawler library—on Google Cloud Functions Gen2 using a custom container. We assume your project folder already includes:
- **Dockerfile:** Builds your container image (which installs Crawl4ai from its Git repository).
- **start.sh:** Activates your virtual environment and starts the function (using the Functions Framework).
- **main.py:** Contains your function logic with the entry point `crawl` (and imports Crawl4ai).
The guide is divided into two parts:
1. Manual deployment steps (using CLI commands)
2. Automated deployment using a Python script (`deploy.py`)
---
## Part 1: Manual Deployment Process
### Prerequisites
- **Google Cloud Project:** Ensure your project is active and billing is enabled.
- **Google Cloud CLI & Docker:** Installed and configured on your local machine.
- **Permissions:** You must have rights to create Cloud Functions and Artifact Registry repositories.
- **Files:** Your Dockerfile, start.sh, and main.py should be in the same directory.
### Step 1: Build Your Docker Image
Your Dockerfile packages Crawl4ai along with all its dependencies. Build your image with:
```bash
docker build -t gcr.io/<PROJECT_ID>/<FUNCTION_NAME>:latest .
```
Replace `<PROJECT_ID>` with your Google Cloud project ID and `<FUNCTION_NAME>` with your chosen function name (for example, `crawl4ai-t1`).
### Step 2: Create an Artifact Registry Repository
Cloud Functions Gen2 requires your custom container image to reside in an Artifact Registry repository. Create one by running:
```bash
gcloud artifacts repositories create <ARTIFACT_REPO> \
--repository-format=docker \
--location=<REGION> \
--project=<PROJECT_ID>
```
Replace `<ARTIFACT_REPO>` (for example, `crawl4ai`) and `<REGION>` (for example, `asia-east1`).
> **Note:** If you receive an `ALREADY_EXISTS` error, the repository is already created; simply proceed to the next step.
### Step 3: Tag Your Docker Image
Tag your locally built Docker image so it matches the Artifact Registry format:
```bash
docker tag gcr.io/<PROJECT_ID>/<FUNCTION_NAME>:latest <REGION>-docker.pkg.dev/<PROJECT_ID>/<ARTIFACT_REPO>/<FUNCTION_NAME>:latest
```
This step “renames” the image so you can push it to your repository.
### Step 4: Authenticate Docker to Artifact Registry
Configure Docker authentication to the Artifact Registry:
```bash
gcloud auth configure-docker <REGION>-docker.pkg.dev
```
This ensures Docker can securely push images to your registry using your Cloud credentials.
### Step 5: Push the Docker Image
Push the tagged image to Artifact Registry:
```bash
docker push <REGION>-docker.pkg.dev/<PROJECT_ID>/<ARTIFACT_REPO>/<FUNCTION_NAME>:latest
```
Once complete, your container image (with Crawl4ai installed) is hosted in Artifact Registry.
### Step 6: Deploy the Cloud Function
Deploy your function using the custom container image. Run:
```bash
gcloud beta functions deploy <FUNCTION_NAME> \
--gen2 \
--region=<REGION> \
--docker-repository=<REGION>-docker.pkg.dev/<PROJECT_ID>/<ARTIFACT_REPO> \
--trigger-http \
--memory=2048MB \
--timeout=540s \
--project=<PROJECT_ID>
```
This command tells Cloud Functions Gen2 to pull your container image from Artifact Registry and deploy it. Make sure your main.py defines the `crawl` entry point.
### Step 7: Make the Function Public
To allow external (unauthenticated) access, update the function's IAM policy:
```bash
gcloud functions add-iam-policy-binding <FUNCTION_NAME> \
--region=<REGION> \
--member="allUsers" \
--role="roles/cloudfunctions.invoker" \
--project=<PROJECT_ID> \
--quiet
```
Using the `--quiet` flag ensures the command runs non-interactively so the policy is applied immediately.
### Step 8: Retrieve and Test Your Function URL
Get the URL for your deployed function:
```bash
gcloud functions describe <FUNCTION_NAME> \
--region=<REGION> \
--project=<PROJECT_ID> \
--format='value(serviceConfig.uri)'
```
Test your deployment with a sample GET request (using curl or your browser):
```bash
curl "<FUNCTION_URL>?url=https://example.com"
```
Replace `<FUNCTION_URL>` with the output URL from the previous command. A successful test (HTTP status 200) means Crawl4ai is running on Cloud Functions.
---
## Part 2: Automated Deployment with deploy.py
For a more streamlined process, use the provided `deploy.py` script. This Python script automates the manual steps, prompting you to confirm key actions and providing detailed logs throughout the process.
### What deploy.py Does:
- **Reads Parameters:** It loads a `config.yml` file containing all necessary parameters such as `project_id`, `region`, `artifact_repo`, `function_name`, `local_image`, etc.
- **Creates/Skips Repository:** It creates the Artifact Registry repository (or skips if it already exists).
- **Tags & Pushes:** It tags your local Docker image and pushes it to the Artifact Registry.
- **Deploys the Function:** It deploys the Cloud Function with your custom container.
- **Updates IAM:** It sets the IAM policy to allow public access (using the `--quiet` flag).
- **Tests the Deployment:** It extracts the deployed URL and performs a test request.
- **Additional Commands:** You can also use subcommands in the script to delete or describe the deployed function, or even clear all resources.
### Example config.yml
Create a `config.yml` file in the same folder as your Dockerfile. An example configuration:
```yaml
project_id: your-project-id
region: asia-east1
artifact_repo: crawl4ai
function_name: crawl4ai-t1
memory: "2048MB"
timeout: "540s"
local_image: "gcr.io/your-project-id/crawl4ai-t1:latest"
test_query_url: "https://example.com"
```
### How to Use deploy.py
- **Deploy the Function:**
```bash
python deploy.py deploy
```
The script will guide you through each step, display the output, and ask for confirmation before executing critical commands.
- **Describe the Function:**
If you forget the function URL and want to retrieve it later:
```bash
python deploy.py describe
```
- **Delete the Function:**
To remove just the Cloud Function:
```bash
python deploy.py delete
```
- **Clear All Resources:**
To delete both the Cloud Function and the Artifact Registry repository:
```bash
python deploy.py clear
```
---
## Conclusion
This guide has walked you through two deployment methods for Crawl4ai on Google Cloud Functions Gen2:
1. **Manual Deployment:** Building your Docker image, pushing it to Artifact Registry, deploying the Cloud Function, and setting up IAM.
2. **Automated Deployment:** Using `deploy.py` with a configuration file to handle the entire process interactively.
By following these instructions, you can deploy, test, and manage your Crawl4ai-based Cloud Function with ease. Enjoy using Crawl4ai in your cloud environment!

View File

@@ -1,158 +0,0 @@
# Cleanup Chrome process on module unload
import atexit
import asyncio
import logging
import functions_framework
from flask import jsonify, Request
import os
import sys
import time
import subprocess
import signal
import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info(f"Python version: {sys.version}")
logger.info(f"Python path: {sys.path}")

# Try to find where crawl4ai is coming from
try:
    import crawl4ai
    logger.info(f"Crawl4AI module location: {crawl4ai.__file__}")
    logger.info(f"Contents of crawl4ai: {dir(crawl4ai)}")
except ImportError:
    logger.error("Crawl4AI module not found")

# Now attempt the import
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, CrawlResult

# Configure logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# Paths and constants
FUNCTION_DIR = os.path.dirname(os.path.realpath(__file__))
CHROME_BINARY = os.path.join(FUNCTION_DIR, "resources/chrome/headless_shell")
CDP_PORT = 9222

def start_chrome():
    """Start Chrome process synchronously with exponential backoff."""
    logger.debug("Starting Chrome process...")
    chrome_args = [
        CHROME_BINARY,
        f"--remote-debugging-port={CDP_PORT}",
        "--remote-debugging-address=0.0.0.0",
        "--no-sandbox",
        "--disable-setuid-sandbox",
        "--headless=new",
        "--disable-gpu",
        "--disable-dev-shm-usage",
        "--no-zygote",
        "--single-process",
        "--disable-features=site-per-process",
        "--no-first-run",
        "--disable-extensions"
    ]
    process = subprocess.Popen(
        chrome_args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        preexec_fn=os.setsid
    )
    logger.debug(f"Chrome process started with PID: {process.pid}")

    # Wait for CDP endpoint with exponential backoff
    wait_time = 1       # Start with 1 second
    max_wait_time = 16  # Cap at 16 seconds per retry
    max_attempts = 10   # Total attempts
    for attempt in range(max_attempts):
        try:
            response = requests.get(f"http://127.0.0.1:{CDP_PORT}/json/version", timeout=2)
            if response.status_code == 200:
                # Get ws URL from response
                ws_url = response.json()['webSocketDebuggerUrl']
                logger.debug("Chrome CDP is ready")
                logger.debug(f"CDP URL: {ws_url}")
                return process
        except requests.exceptions.ConnectionError:
            logger.debug(f"Waiting for CDP endpoint (attempt {attempt + 1}/{max_attempts}), retrying in {wait_time} seconds")
        time.sleep(wait_time)
        wait_time = min(wait_time * 2, max_wait_time)  # Double wait time, up to max

    # If we get here, all retries failed
    stdout, stderr = process.communicate()  # Get output for debugging
    logger.error(f"Chrome stdout: {stdout.decode()}")
    logger.error(f"Chrome stderr: {stderr.decode()}")
    raise Exception("Chrome CDP endpoint failed to start after retries")

async def fetch_with_crawl4ai(url: str) -> dict:
    """Fetch page content using Crawl4ai and return the result object"""
    # Get CDP URL from the running Chrome instance
    version_response = requests.get(f'http://localhost:{CDP_PORT}/json/version')
    cdp_url = version_response.json()['webSocketDebuggerUrl']

    # Configure and run Crawl4ai
    browser_config = BrowserConfig(cdp_url=cdp_url, use_managed_browser=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
        )
        result: CrawlResult = await crawler.arun(
            url=url, config=crawler_config
        )
        return result.model_dump()  # Convert Pydantic model to dict for JSON response

# Start Chrome when the module loads
logger.info("Starting Chrome process on module load")
chrome_process = start_chrome()

@functions_framework.http
def crawl(request: Request):
    """HTTP Cloud Function to fetch web content using Crawl4ai"""
    try:
        url = request.args.get('url')
        if not url:
            return jsonify({'error': 'URL parameter is required', 'status': 400}), 400

        # Create and run an asyncio event loop
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        try:
            result = loop.run_until_complete(
                asyncio.wait_for(fetch_with_crawl4ai(url), timeout=10.0)
            )
            return jsonify({
                'status': 200,
                'data': result
            })
        finally:
            loop.close()
    except Exception as e:
        error_msg = f"Unexpected error: {str(e)}"
        logger.error(error_msg, exc_info=True)
        return jsonify({
            'error': error_msg,
            'status': 500,
            'details': {
                'error_type': type(e).__name__,
                'stack_trace': str(e),
                'chrome_running': chrome_process.poll() is None if chrome_process else False
            }
        }), 500

@atexit.register
def cleanup():
    """Cleanup Chrome process on shutdown"""
    if chrome_process and chrome_process.poll() is None:
        try:
            os.killpg(os.getpgid(chrome_process.pid), signal.SIGTERM)
            logger.info("Chrome process terminated")
        except Exception as e:
            logger.error(f"Failed to terminate Chrome process: {e}")


@@ -1,5 +0,0 @@
functions-framework==3.*
flask==2.3.3
requests==2.31.0
websockets==12.0
git+https://github.com/unclecode/crawl4ai.git@next


@@ -1,10 +0,0 @@
<?xml version="1.0" ?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
<dir>/var/task/.fonts</dir>
<dir>/var/task/fonts</dir>
<dir>/opt/fonts</dir>
<dir>/tmp/fonts</dir>
<cachedir>/tmp/fonts-cache/</cachedir>
<config></config>
</fontconfig>


@@ -1 +0,0 @@
{"file_format_version": "1.0.0", "ICD": {"library_path": "./libvk_swiftshader.so", "api_version": "1.0.5"}}


@@ -1,104 +0,0 @@
FROM python:3.12-bookworm AS python-builder
RUN pip install poetry
ENV POETRY_NO_INTERACTION=1 \
POETRY_CACHE_DIR=/tmp/poetry_cache
WORKDIR /app
COPY pyproject.toml poetry.lock ./
RUN --mount=type=cache,target=$POETRY_CACHE_DIR poetry export -f requirements.txt -o requirements.txt
# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
python3-dev \
python3-setuptools \
python3-wheel \
python3-pip \
gcc \
g++ \
&& rm -rf /var/lib/apt/lists/*
# Install specific dependencies that have build issues
RUN pip install --no-cache-dir cchardet
FROM python:3.12-bookworm
# Install AWS Lambda Runtime Interface Client
RUN python3 -m pip install --no-cache-dir awslambdaric
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
curl \
wget \
gnupg \
git \
cmake \
pkg-config \
python3-dev \
libjpeg-dev \
redis-server \
supervisor \
&& rm -rf /var/lib/apt/lists/*
RUN apt-get update && apt-get install -y --no-install-recommends \
libglib2.0-0 \
libnss3 \
libnspr4 \
libatk1.0-0 \
libatk-bridge2.0-0 \
libcups2 \
libdrm2 \
libdbus-1-3 \
libxcb1 \
libxkbcommon0 \
libx11-6 \
libxcomposite1 \
libxdamage1 \
libxext6 \
libxfixes3 \
libxrandr2 \
libgbm1 \
libpango-1.0-0 \
libcairo2 \
libasound2 \
libatspi2.0-0 \
&& rm -rf /var/lib/apt/lists/*
# Install build essentials for any compilations needed
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
python3-dev \
&& rm -rf /var/lib/apt/lists/*
# Set up function directory and browser path
ARG FUNCTION_DIR="/function"
RUN mkdir -p "${FUNCTION_DIR}/pw-browsers"
RUN mkdir -p "/tmp/.crawl4ai"
# Set critical environment variables
ENV PLAYWRIGHT_BROWSERS_PATH="${FUNCTION_DIR}/pw-browsers" \
HOME="/tmp" \
CRAWL4_AI_BASE_DIRECTORY="/tmp/.crawl4ai"
# Create Crawl4ai base directory
RUN mkdir -p ${CRAWL4_AI_BASE_DIRECTORY}
RUN pip install --no-cache-dir faust-cchardet
# Install Crawl4ai and dependencies
RUN pip install --no-cache-dir git+https://github.com/unclecode/crawl4ai.git@next
# Install Chromium only (no deps flag)
RUN playwright install chromium
# Copy function code
COPY lambda_function.py ${FUNCTION_DIR}/
# Set working directory
WORKDIR ${FUNCTION_DIR}
ENTRYPOINT [ "/usr/local/bin/python", "-m", "awslambdaric" ]
CMD [ "lambda_function.handler" ]

File diff suppressed because it is too large.


@@ -1,345 +0,0 @@
# Deploying Crawl4ai on AWS Lambda
This guide walks you through deploying Crawl4ai as an AWS Lambda function with API Gateway integration. You'll learn how to set up, test, and clean up your deployment.
## Prerequisites
Before you begin, ensure you have:
- AWS CLI installed and configured (`aws configure`)
- Docker installed and running
- Python 3.8+ installed
- Basic familiarity with AWS services
## Project Files
Your project directory should contain:
- `Dockerfile`: Container configuration for Lambda
- `lambda_function.py`: Lambda handler code
- `deploy.py`: Our deployment script
## Step 1: Install Required Python Packages
Install the Python packages needed for our deployment script:
```bash
pip install typer rich
```
## Step 2: Run the Deployment Script
Our Python script automates the entire deployment process:
```bash
python deploy.py
```
The script will guide you through:
1. Configuration setup (AWS region, function name, memory allocation)
2. Docker image building
3. ECR repository creation
4. Lambda function deployment
5. API Gateway setup
6. Provisioned concurrency configuration (optional)
Follow the prompts and confirm each step by pressing Enter.
## Step 3: Manual Deployment (Alternative to the Script)
If you prefer to deploy manually or understand what the script does, follow these steps:
### Building and Pushing the Docker Image
```bash
# Build the Docker image
docker build -t crawl4ai-lambda .
# Create an ECR repository (if it doesn't exist)
aws ecr create-repository --repository-name crawl4ai-lambda
# Get ECR login password and login
aws ecr get-login-password | docker login --username AWS --password-stdin $(aws sts get-caller-identity --query Account --output text).dkr.ecr.us-east-1.amazonaws.com
# Tag the image
ECR_URI=$(aws ecr describe-repositories --repository-names crawl4ai-lambda --query 'repositories[0].repositoryUri' --output text)
docker tag crawl4ai-lambda:latest $ECR_URI:latest
# Push the image to ECR
docker push $ECR_URI:latest
```
### Creating the Lambda Function
```bash
# Get IAM role ARN (create it if needed)
ROLE_ARN=$(aws iam get-role --role-name lambda-execution-role --query 'Role.Arn' --output text)
# Create Lambda function
aws lambda create-function \
--function-name crawl4ai-function \
--package-type Image \
--code ImageUri=$ECR_URI:latest \
--role $ROLE_ARN \
--timeout 300 \
--memory-size 4096 \
--ephemeral-storage Size=10240 \
--environment "Variables={CRAWL4_AI_BASE_DIRECTORY=/tmp/.crawl4ai,HOME=/tmp,PLAYWRIGHT_BROWSERS_PATH=/function/pw-browsers}"
```
If you're updating an existing function:
```bash
# Update function code
aws lambda update-function-code \
--function-name crawl4ai-function \
--image-uri $ECR_URI:latest
# Update function configuration
aws lambda update-function-configuration \
--function-name crawl4ai-function \
--timeout 300 \
--memory-size 4096 \
--ephemeral-storage Size=10240 \
--environment "Variables={CRAWL4_AI_BASE_DIRECTORY=/tmp/.crawl4ai,HOME=/tmp,PLAYWRIGHT_BROWSERS_PATH=/function/pw-browsers}"
```
### Setting Up API Gateway
```bash
# Create API Gateway
API_ID=$(aws apigateway create-rest-api --name crawl4ai-api --query 'id' --output text)
# Get root resource ID
PARENT_ID=$(aws apigateway get-resources --rest-api-id $API_ID --query 'items[?path==`/`].id' --output text)
# Create resource
RESOURCE_ID=$(aws apigateway create-resource --rest-api-id $API_ID --parent-id $PARENT_ID --path-part "crawl" --query 'id' --output text)
# Create POST method
aws apigateway put-method --rest-api-id $API_ID --resource-id $RESOURCE_ID --http-method POST --authorization-type NONE
# Get Lambda function ARN
LAMBDA_ARN=$(aws lambda get-function --function-name crawl4ai-function --query 'Configuration.FunctionArn' --output text)
# Set Lambda integration
aws apigateway put-integration \
--rest-api-id $API_ID \
--resource-id $RESOURCE_ID \
--http-method POST \
--type AWS_PROXY \
--integration-http-method POST \
--uri arn:aws:apigateway:us-east-1:lambda:path/2015-03-31/functions/$LAMBDA_ARN/invocations
# Deploy API
aws apigateway create-deployment --rest-api-id $API_ID --stage-name prod
# Set Lambda permission
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws lambda add-permission \
--function-name crawl4ai-function \
--statement-id apigateway \
--action lambda:InvokeFunction \
--principal apigateway.amazonaws.com \
--source-arn "arn:aws:execute-api:us-east-1:$ACCOUNT_ID:$API_ID/*/POST/crawl"
```
### Setting Up Provisioned Concurrency (Optional)
This reduces cold starts:
```bash
# Publish a version
VERSION=$(aws lambda publish-version --function-name crawl4ai-function --query 'Version' --output text)
# Create alias
aws lambda create-alias \
--function-name crawl4ai-function \
--name prod \
--function-version $VERSION
# Configure provisioned concurrency
aws lambda put-provisioned-concurrency-config \
--function-name crawl4ai-function \
--qualifier prod \
--provisioned-concurrent-executions 2
# Update API Gateway to use alias
LAMBDA_ALIAS_ARN="arn:aws:lambda:us-east-1:$ACCOUNT_ID:function:crawl4ai-function:prod"
aws apigateway put-integration \
--rest-api-id $API_ID \
--resource-id $RESOURCE_ID \
--http-method POST \
--type AWS_PROXY \
--integration-http-method POST \
--uri arn:aws:apigateway:us-east-1:lambda:path/2015-03-31/functions/$LAMBDA_ALIAS_ARN/invocations
# Redeploy API Gateway
aws apigateway create-deployment --rest-api-id $API_ID --stage-name prod
```
## Step 4: Testing the Deployment
Once deployed, test your function with:
```bash
ENDPOINT_URL="https://$API_ID.execute-api.us-east-1.amazonaws.com/prod/crawl"
# Test with curl
curl -X POST $ENDPOINT_URL \
-H "Content-Type: application/json" \
-d '{"url":"https://example.com"}'
```
Or using Python:
```python
import requests
import json

url = "https://your-api-id.execute-api.us-east-1.amazonaws.com/prod/crawl"
payload = {
    "url": "https://example.com",
    "browser_config": {
        "headless": True,
        "verbose": False
    },
    "crawler_config": {
        "crawler_config": {
            "type": "CrawlerRunConfig",
            "params": {
                "markdown_generator": {
                    "type": "DefaultMarkdownGenerator",
                    "params": {
                        "content_filter": {
                            "type": "PruningContentFilter",
                            "params": {
                                "threshold": 0.48,
                                "threshold_type": "fixed"
                            }
                        }
                    }
                }
            }
        }
    }
}

response = requests.post(url, json=payload)
result = response.json()
print(json.dumps(result, indent=2))
```
## Step 5: Cleaning Up Resources
To remove all AWS resources created for this deployment:
```bash
python deploy.py cleanup
```
Or manually:
```bash
# Delete API Gateway
aws apigateway delete-rest-api --rest-api-id $API_ID
# Remove provisioned concurrency (if configured)
aws lambda delete-provisioned-concurrency-config \
--function-name crawl4ai-function \
--qualifier prod
# Delete alias (if created)
aws lambda delete-alias \
--function-name crawl4ai-function \
--name prod
# Delete Lambda function
aws lambda delete-function --function-name crawl4ai-function
# Delete ECR repository
aws ecr delete-repository --repository-name crawl4ai-lambda --force
```
## Troubleshooting
### Cold Start Issues
If experiencing long cold starts:
- Enable provisioned concurrency
- Increase memory allocation (4096 MB recommended)
- Ensure the Lambda function has enough ephemeral storage
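One quick way to confirm that cold starts (rather than slow crawls) are the problem is to time a few consecutive invocations; a large gap between the first and later calls points to a cold start. A minimal sketch (the endpoint URL is hypothetical and the helper is not part of the deployment):

```python
import time

def time_invocations(invoke, n=3):
    """Call `invoke` n times and return each call's latency in seconds.

    A much slower first call usually indicates a cold start.
    """
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        invoke()
        latencies.append(time.perf_counter() - start)
    return latencies

# Against the deployed endpoint (assumed URL):
# import requests
# url = "https://your-api-id.execute-api.us-east-1.amazonaws.com/prod/crawl"
# print(time_invocations(lambda: requests.post(url, json={"url": "https://example.com"})))
```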
### Permission Errors
If you encounter permission errors:
- Check the IAM role has the necessary permissions
- Ensure API Gateway has permission to invoke the Lambda function
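To check the second point programmatically, you can inspect the function's resource-based policy for a statement that lets `apigateway.amazonaws.com` invoke it. A sketch, assuming AWS credentials are configured; the parsing helper below works on any policy document:

```python
import json

def allows_apigateway_invoke(policy_json: str) -> bool:
    """Return True if the resource policy contains an Allow statement
    granting lambda:InvokeFunction to API Gateway."""
    policy = json.loads(policy_json)
    for stmt in policy.get("Statement", []):
        principal = stmt.get("Principal", {})
        service = principal.get("Service", "") if isinstance(principal, dict) else principal
        if (stmt.get("Effect") == "Allow"
                and stmt.get("Action") == "lambda:InvokeFunction"
                and service == "apigateway.amazonaws.com"):
            return True
    return False

# Fetch the live policy with boto3 (requires AWS credentials):
# import boto3
# policy_json = boto3.client("lambda").get_policy(FunctionName="crawl4ai-function")["Policy"]
# print(allows_apigateway_invoke(policy_json))
```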
### Container Size Issues
If your container is too large:
- Optimize the Dockerfile
- Use multi-stage builds
- Consider removing unnecessary dependencies
## Performance Considerations
- Lambda allocates CPU power in proportion to memory, so a higher memory setting also means faster execution
- Provisioned concurrency eliminates cold starts but costs more
- Optimize the Playwright setup for faster browser initialization
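As a rough illustration of the memory/cost trade-off, Lambda bills per GB-second plus a per-request fee. The sketch below estimates cost for a given configuration; the two prices are assumptions based on typical x86 Lambda pricing, so check the current AWS pricing page before relying on them:

```python
def estimate_cost_usd(memory_mb: int, duration_ms: float, invocations: int,
                      price_per_gb_second: float = 0.0000166667,
                      price_per_request: float = 0.0000002) -> float:
    """Rough Lambda cost estimate (prices are assumptions, not authoritative)."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000) * invocations
    return gb_seconds * price_per_gb_second + invocations * price_per_request

# e.g. 4096 MB, 5 s average crawl, 10,000 crawls:
# estimate_cost_usd(4096, 5000, 10_000)
```

Because higher memory also buys more CPU, doubling memory often less than doubles cost if it cuts the crawl duration.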
## Security Best Practices
- Use the principle of least privilege for IAM roles
- Implement API Gateway authentication for production deployments
- Consider using AWS KMS for storing sensitive configuration
## Useful AWS Console Links
Here are quick links to access important AWS console pages for monitoring and managing your deployment:
| Resource | Console Link |
|----------|-------------|
| Lambda Functions | [AWS Lambda Console](https://console.aws.amazon.com/lambda/home#/functions) |
| Lambda Function Logs | [CloudWatch Logs](https://console.aws.amazon.com/cloudwatch/home#logsV2:log-groups) |
| API Gateway | [API Gateway Console](https://console.aws.amazon.com/apigateway/home) |
| ECR Repositories | [ECR Console](https://console.aws.amazon.com/ecr/repositories) |
| IAM Roles | [IAM Console](https://console.aws.amazon.com/iamv2/home#/roles) |
| CloudWatch Metrics | [CloudWatch Metrics](https://console.aws.amazon.com/cloudwatch/home#metricsV2) |
### Monitoring Lambda Execution
To monitor your Lambda function:
1. Go to the [Lambda function console](https://console.aws.amazon.com/lambda/home#/functions)
2. Select your function (`crawl4ai-function`)
3. Click the "Monitor" tab to see:
- Invocation metrics
- Success/failure rates
- Duration statistics
### Viewing Lambda Logs
To see detailed execution logs:
1. Go to [CloudWatch Logs](https://console.aws.amazon.com/cloudwatch/home#logsV2:log-groups)
2. Find the log group named `/aws/lambda/crawl4ai-function`
3. Click to see the latest log streams
4. Each stream contains logs from a function execution
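The same logs can be pulled from a script instead of the console, e.g. via boto3's CloudWatch Logs client (requires AWS credentials; the log group name is the one from this deployment). A minimal sketch:

```python
from datetime import datetime, timezone

def format_events(events):
    """Render CloudWatch log events as '<UTC time> <message>' lines."""
    lines = []
    for e in events:
        ts = datetime.fromtimestamp(e["timestamp"] / 1000, tz=timezone.utc)
        lines.append(f"{ts.isoformat()} {e['message'].rstrip()}")
    return lines

# With AWS credentials configured:
# import boto3
# resp = boto3.client("logs").filter_log_events(
#     logGroupName="/aws/lambda/crawl4ai-function", limit=50)
# print("\n".join(format_events(resp["events"])))
```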
### Checking API Gateway Traffic
To monitor API requests:
1. Go to the [API Gateway console](https://console.aws.amazon.com/apigateway/home)
2. Select your API (`crawl4ai-api`)
3. Click "Dashboard" to see:
- API calls
- Latency
- Error rates
## Conclusion
You now have Crawl4ai running as a serverless function on AWS Lambda! This setup allows you to crawl websites on-demand without maintaining infrastructure, while paying only for the compute time you use.


@@ -1,107 +0,0 @@
import json
import asyncio
import os

# Ensure environment variables and directories are set
os.environ['CRAWL4_AI_BASE_DIRECTORY'] = '/tmp/.crawl4ai'
os.environ['HOME'] = '/tmp'

# Create directory if it doesn't exist
os.makedirs('/tmp/.crawl4ai', exist_ok=True)

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode
)

def handler(event, context):
    # Parse the incoming event (API Gateway request)
    try:
        body = json.loads(event.get('body', '{}'))
        url = body.get('url')
        if not url:
            return {
                'statusCode': 400,
                'body': json.dumps({'error': 'URL is required'})
            }

        # Get optional configurations or use defaults
        browser_config_dict = body.get('browser_config', {})
        crawler_config_dict = body.get('crawler_config', {})

        # Run the crawler
        result = asyncio.run(crawl(url, browser_config_dict, crawler_config_dict))

        # Return successful response
        return {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json'
            },
            'body': json.dumps(result)
        }
    except Exception as e:
        # Handle errors
        import traceback
        return {
            'statusCode': 500,
            'body': json.dumps({
                'error': str(e),
                'traceback': traceback.format_exc()
            })
        }

async def crawl(url, browser_config_dict, crawler_config_dict):
    """
    Run the crawler with the provided configurations, with Lambda-specific settings
    """
    # Start with user-provided config but override with Lambda-required settings
    base_browser_config = BrowserConfig.load(browser_config_dict) if browser_config_dict else BrowserConfig()

    # Apply Lambda-specific browser configurations
    browser_config = BrowserConfig(
        verbose=True,
        browser_type="chromium",
        headless=True,
        user_agent_mode="random",
        light_mode=True,
        use_managed_browser=False,
        extra_args=[
            "--headless=new",
            "--no-sandbox",
            "--disable-dev-shm-usage",
            "--disable-setuid-sandbox",
            "--remote-allow-origins=*",
            "--autoplay-policy=user-gesture-required",
            "--single-process",
        ],
        # # Carry over any other settings from user config that aren't overridden
        # **{k: v for k, v in base_browser_config.model_dump().items()
        #    if k not in ['verbose', 'browser_type', 'headless', 'user_agent_mode',
        #                 'light_mode', 'use_managed_browser', 'extra_args']}
    )

    # Start with user-provided crawler config but ensure cache is bypassed
    base_crawler_config = CrawlerRunConfig.load(crawler_config_dict) if crawler_config_dict else CrawlerRunConfig()

    # Apply Lambda-specific crawler configurations
    crawler_config = CrawlerRunConfig(
        exclude_external_links=base_crawler_config.exclude_external_links,
        remove_overlay_elements=True,
        magic=True,
        cache_mode=CacheMode.BYPASS,
        # Carry over markdown generator and other settings
        markdown_generator=base_crawler_config.markdown_generator
    )

    # Perform the crawl with Lambda-optimized settings
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url, config=crawler_config)
        # Return serializable results
        return result.model_dump()


@@ -1,543 +0,0 @@
import os
import time
import uuid
from datetime import datetime
from typing import Dict, Any, Optional, List

import modal
from modal import Image, App, Volume, Secret, web_endpoint, function

# Configuration
APP_NAME = "crawl4ai-api"
CRAWL4AI_VERSION = "next"  # Using the 'next' branch
PYTHON_VERSION = "3.10"    # Compatible with playwright
DEFAULT_CREDITS = 1000

# Create a custom image with Crawl4ai and its dependencies
image = Image.debian_slim(python_version=PYTHON_VERSION).pip_install(
    ["fastapi[standard]", "pymongo", "pydantic"]
).run_commands(
    "apt-get update",
    "apt-get install -y software-properties-common",
    "apt-get install -y git",
    "apt-add-repository non-free",
    "apt-add-repository contrib",
    # Install crawl4ai from the next branch
    f"pip install -U git+https://github.com/unclecode/crawl4ai.git@{CRAWL4AI_VERSION}",
    "pip install -U fastapi[standard]",
    "pip install -U pydantic",
    # Install playwright and browsers
    "crawl4ai-setup",
)

# Create persistent volume for user database
user_db = Volume.from_name("crawl4ai-users", create_if_missing=True)

# Create admin secret for secure operations
admin_secret = Secret.from_name("admin-secret", create_if_missing=True)

# Define the app
app = App(APP_NAME, image=image)

# Default configurations
DEFAULT_BROWSER_CONFIG = {
    "headless": True,
    "verbose": False,
}
DEFAULT_CRAWLER_CONFIG = {
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "markdown_generator": {
                "type": "DefaultMarkdownGenerator",
                "params": {
                    "content_filter": {
                        "type": "PruningContentFilter",
                        "params": {
                            "threshold": 0.48,
                            "threshold_type": "fixed"
                        }
                    }
                }
            }
        }
    }
}

# Database operations
@app.function(volumes={"/data": user_db})
def init_db() -> None:
    """Initialize database with indexes."""
    from pymongo import MongoClient, ASCENDING
    client = MongoClient("mongodb://localhost:27017")
    db = client.crawl4ai_db
    # Ensure indexes for faster lookups
    db.users.create_index([("api_token", ASCENDING)], unique=True)
    db.users.create_index([("email", ASCENDING)], unique=True)
    # Create usage stats collection
    db.usage_stats.create_index([("user_id", ASCENDING), ("timestamp", ASCENDING)])
    print("Database initialized with required indexes")

@app.function(volumes={"/data": user_db})
def get_user_by_token(api_token: str) -> Optional[Dict[str, Any]]:
    """Get user by API token."""
    from pymongo import MongoClient
    from bson.objectid import ObjectId
    client = MongoClient("mongodb://localhost:27017")
    db = client.crawl4ai_db
    user = db.users.find_one({"api_token": api_token})
    if not user:
        return None
    # Convert ObjectId to string for serialization
    user["_id"] = str(user["_id"])
    return user

@app.function(volumes={"/data": user_db})
def create_user(email: str, name: str) -> Dict[str, Any]:
    """Create a new user with initial credits."""
    from pymongo import MongoClient
    from bson.objectid import ObjectId
    client = MongoClient("mongodb://localhost:27017")
    db = client.crawl4ai_db
    # Generate API token
    api_token = str(uuid.uuid4())
    user = {
        "email": email,
        "name": name,
        "api_token": api_token,
        "credits": DEFAULT_CREDITS,
        "created_at": datetime.utcnow(),
        "updated_at": datetime.utcnow(),
        "is_active": True
    }
    try:
        result = db.users.insert_one(user)
        user["_id"] = str(result.inserted_id)
        return user
    except Exception as e:
        if "duplicate key error" in str(e):
            return {"error": "User with this email already exists"}
        raise

@app.function(volumes={"/data": user_db})
def update_user_credits(api_token: str, amount: int) -> Dict[str, Any]:
    """Update user credits (add or subtract)."""
    from pymongo import MongoClient
    client = MongoClient("mongodb://localhost:27017")
    db = client.crawl4ai_db
    # First get current user to check credits
    user = db.users.find_one({"api_token": api_token})
    if not user:
        return {"success": False, "error": "User not found"}
    # For deductions, ensure sufficient credits
    if amount < 0 and user["credits"] + amount < 0:
        return {"success": False, "error": "Insufficient credits"}
    # Update credits
    result = db.users.update_one(
        {"api_token": api_token},
        {
            "$inc": {"credits": amount},
            "$set": {"updated_at": datetime.utcnow()}
        }
    )
    if result.modified_count == 1:
        # Get updated user
        updated_user = db.users.find_one({"api_token": api_token})
        return {
            "success": True,
            "credits": updated_user["credits"]
        }
    else:
        return {"success": False, "error": "Failed to update credits"}

@app.function(volumes={"/data": user_db})
def log_usage(user_id: str, url: str, success: bool, error: Optional[str] = None) -> None:
    """Log usage statistics."""
    from pymongo import MongoClient
    from bson.objectid import ObjectId
    client = MongoClient("mongodb://localhost:27017")
    db = client.crawl4ai_db
    log_entry = {
        "user_id": user_id,
        "url": url,
        "timestamp": datetime.utcnow(),
        "success": success,
        "error": error
    }
    db.usage_stats.insert_one(log_entry)

# Main crawling function
@app.function(timeout=300)  # 5 minute timeout
async def crawl(
    url: str,
    browser_config: Optional[Dict[str, Any]] = None,
    crawler_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """
    Crawl a given URL using Crawl4ai.

    Args:
        url: The URL to crawl
        browser_config: Optional browser configuration to override defaults
        crawler_config: Optional crawler configuration to override defaults

    Returns:
        A dictionary containing the crawl results
    """
    from crawl4ai import (
        AsyncWebCrawler,
        BrowserConfig,
        CrawlerRunConfig,
        CrawlResult
    )
    # Prepare browser config using the loader method
    if browser_config is None:
        browser_config = DEFAULT_BROWSER_CONFIG
    browser_config_obj = BrowserConfig.load(browser_config)
    # Prepare crawler config using the loader method
    if crawler_config is None:
        crawler_config = DEFAULT_CRAWLER_CONFIG
    crawler_config_obj = CrawlerRunConfig.load(crawler_config)
    # Perform the crawl
    async with AsyncWebCrawler(config=browser_config_obj) as crawler:
        result: CrawlResult = await crawler.arun(url=url, config=crawler_config_obj)
        # Return serializable results
        try:
            # Try newer Pydantic v2 method
            return result.model_dump()
        except AttributeError:
            try:
                # Try older Pydantic v1 method
                return result.dict()
            except AttributeError:
                # Fallback to manual conversion
                return {
                    "url": result.url,
                    "title": result.title,
                    "status": result.status,
                    "content": str(result.content) if hasattr(result, "content") else None,
                    "links": [{"url": link.url, "text": link.text} for link in result.links] if hasattr(result, "links") else [],
                    "markdown_v2": {
                        "raw_markdown": result.markdown_v2.raw_markdown if hasattr(result, "markdown_v2") else None
                    }
                }

# API endpoints
@app.function()
@web_endpoint(method="POST")
def crawl_endpoint(data: Dict[str, Any]) -> Dict[str, Any]:
    """
    Web endpoint that accepts POST requests with JSON data containing:
    - api_token: User's API token
    - url: The URL to crawl
    - browser_config: Optional browser configuration
    - crawler_config: Optional crawler configuration

    Returns the crawl results and remaining credits.
    """
    # Extract and validate API token
    api_token = data.get("api_token")
    if not api_token:
        return {
            "success": False,
            "error": "API token is required",
            "status_code": 401
        }
    # Verify user
    user = get_user_by_token.remote(api_token)
    if not user:
        return {
            "success": False,
            "error": "Invalid API token",
            "status_code": 401
        }
    if not user.get("is_active", False):
        return {
            "success": False,
            "error": "Account is inactive",
            "status_code": 403
        }
    # Validate URL
    url = data.get("url")
    if not url:
        return {
            "success": False,
            "error": "URL is required",
            "status_code": 400
        }
    # Check credits
    if user.get("credits", 0) <= 0:
        return {
            "success": False,
            "error": "Insufficient credits",
            "status_code": 403
        }
    # Deduct credit first (1 credit per call)
    credit_result = update_user_credits.remote(api_token, -1)
    if not credit_result.get("success", False):
        return {
            "success": False,
            "error": credit_result.get("error", "Failed to process credits"),
            "status_code": 500
        }
    # Extract configs
    browser_config = data.get("browser_config")
    crawler_config = data.get("crawler_config")
    # Perform crawl
    try:
        start_time = time.time()
        result = crawl.remote(url, browser_config, crawler_config)
        execution_time = time.time() - start_time
        # Log successful usage
        log_usage.spawn(user["_id"], url, True)
        return {
            "success": True,
            "data": result,
            "credits_remaining": credit_result.get("credits"),
            "execution_time_seconds": round(execution_time, 2),
            "status_code": 200
        }
    except Exception as e:
        # Log failed usage
        log_usage.spawn(user["_id"], url, False, str(e))
        # Return error
        return {
            "success": False,
            "error": f"Crawling error: {str(e)}",
            "credits_remaining": credit_result.get("credits"),
            "status_code": 500
        }

# Admin endpoints
@app.function(secrets=[admin_secret])
@web_endpoint(method="POST")
def admin_create_user(data: Dict[str, Any]) -> Dict[str, Any]:
    """Admin endpoint to create new users."""
    # Validate admin token
    admin_token = data.get("admin_token")
    if admin_token != os.environ.get("ADMIN_TOKEN"):
        return {
            "success": False,
            "error": "Invalid admin token",
            "status_code": 401
        }
    # Validate input
    email = data.get("email")
    name = data.get("name")
    if not email or not name:
        return {
            "success": False,
            "error": "Email and name are required",
            "status_code": 400
        }
    # Create user
    user = create_user.remote(email, name)
    if "error" in user:
        return {
            "success": False,
            "error": user["error"],
            "status_code": 400
        }
    return {
        "success": True,
        "data": {
            "user_id": user["_id"],
            "email": user["email"],
            "name": user["name"],
            "api_token": user["api_token"],
            "credits": user["credits"],
            "created_at": user["created_at"].isoformat() if isinstance(user["created_at"], datetime) else user["created_at"]
        },
        "status_code": 201
    }

@app.function(secrets=[admin_secret])
@web_endpoint(method="POST")
def admin_update_credits(data: Dict[str, Any]) -> Dict[str, Any]:
    """Admin endpoint to update user credits."""
    # Validate admin token
    admin_token = data.get("admin_token")
    if admin_token != os.environ.get("ADMIN_TOKEN"):
        return {
            "success": False,
            "error": "Invalid admin token",
            "status_code": 401
        }
    # Validate input
    api_token = data.get("api_token")
    amount = data.get("amount")
    if not api_token:
        return {
            "success": False,
            "error": "API token is required",
            "status_code": 400
        }
    if not isinstance(amount, int):
        return {
            "success": False,
            "error": "Amount must be an integer",
            "status_code": 400
        }
    # Update credits
    result = update_user_credits.remote(api_token, amount)
    if not result.get("success", False):
        return {
            "success": False,
            "error": result.get("error", "Failed to update credits"),
            "status_code": 400
        }
    return {
        "success": True,
        "data": {
            "credits": result["credits"]
        },
        "status_code": 200
    }

@app.function(secrets=[admin_secret])
@web_endpoint(method="GET")
def admin_get_users(admin_token: str) -> Dict[str, Any]:
    """Admin endpoint to list all users."""
    # Validate admin token
    if admin_token != os.environ.get("ADMIN_TOKEN"):
        return {
            "success": False,
            "error": "Invalid admin token",
            "status_code": 401
        }
    users = get_all_users.remote()
    return {
        "success": True,
        "data": users,
        "status_code": 200
    }

@app.function(volumes={"/data": user_db})
def get_all_users() -> List[Dict[str, Any]]:
    """Get all users (for admin)."""
    from pymongo import MongoClient
    client = MongoClient("mongodb://localhost:27017")
    db = client.crawl4ai_db
    users = []
    for user in db.users.find():
        # Convert ObjectId to string
        user["_id"] = str(user["_id"])
        # Convert datetime to ISO format
        for field in ["created_at", "updated_at"]:
            if field in user and isinstance(user[field], datetime):
                user[field] = user[field].isoformat()
        users.append(user)
    return users

# Public endpoints
@app.function()
@web_endpoint(method="GET")
def health_check() -> Dict[str, Any]:
    """Health check endpoint."""
    return {
        "status": "online",
        "service": APP_NAME,
        "version": CRAWL4AI_VERSION,
        "timestamp": datetime.utcnow().isoformat()
    }

@app.function()
@web_endpoint(method="GET")
def check_credits(api_token: str) -> Dict[str, Any]:
    """Check user credits."""
    if not api_token:
        return {
            "success": False,
            "error": "API token is required",
            "status_code": 401
        }
    user = get_user_by_token.remote(api_token)
    if not user:
        return {
            "success": False,
            "error": "Invalid API token",
            "status_code": 401
        }
    return {
        "success": True,
        "data": {
            "credits": user["credits"],
            "email": user["email"],
            "name": user["name"]
        },
        "status_code": 200
    }

# Local entrypoint for testing
@app.local_entrypoint()
def main(url: str = "https://www.modal.com"):
    """Command line entrypoint for local testing."""
    print("Initializing database...")
    init_db.remote()
    print(f"Testing crawl on URL: {url}")
    result = crawl.remote(url)
    # Print sample of result
    print("\nCrawl Result Sample:")
    if "title" in result:
        print(f"Title: {result['title']}")
    if "status" in result:
        print(f"Status: {result['status']}")
    if "links" in result:
        print(f"Links found: {len(result['links'])}")
    if "markdown_v2" in result and result["markdown_v2"] and "raw_markdown" in result["markdown_v2"]:
        print("\nMarkdown Preview (first 300 chars):")
        print(result["markdown_v2"]["raw_markdown"][:300] + "...")


@@ -1,127 +0,0 @@
import modal
from typing import Optional, Dict, Any

# Create a custom image with Crawl4ai and its dependencies
# "pip install crawl4ai",
image = modal.Image.debian_slim(python_version="3.10").pip_install(["fastapi[standard]"]).run_commands(
    "apt-get update",
    "apt-get install -y software-properties-common",
    "apt-get install -y git",
    "apt-add-repository non-free",
    "apt-add-repository contrib",
    "pip install -U git+https://github.com/unclecode/crawl4ai.git@next",
    "pip install -U fastapi[standard]",
    "pip install -U pydantic",
    "crawl4ai-setup",  # This installs playwright and downloads chromium
    # Print the FastAPI version
    "python -m fastapi --version",
)

# Define the app
app = modal.App("crawl4ai", image=image)

# Define default configurations
DEFAULT_BROWSER_CONFIG = {
    "headless": True,
    "verbose": False,
}
DEFAULT_CRAWLER_CONFIG = {
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "markdown_generator": {
                "type": "DefaultMarkdownGenerator",
                "params": {
                    "content_filter": {
                        "type": "PruningContentFilter",
                        "params": {
                            "threshold": 0.48,
                            "threshold_type": "fixed"
                        }
                    }
                }
            }
        }
    }
}

@app.function(timeout=300)  # 5 minute timeout
async def crawl(
    url: str,
    browser_config: Optional[Dict[str, Any]] = None,
    crawler_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """
    Crawl a given URL using Crawl4ai.

    Args:
        url: The URL to crawl
        browser_config: Optional browser configuration to override defaults
        crawler_config: Optional crawler configuration to override defaults

    Returns:
        A dictionary containing the crawl results
    """
    from crawl4ai import (
        AsyncWebCrawler,
        BrowserConfig,
        CrawlerRunConfig,
        CrawlResult
    )

    # Prepare browser config using the loader method
    if browser_config is None:
        browser_config = DEFAULT_BROWSER_CONFIG
    browser_config_obj = BrowserConfig.load(browser_config)

    # Prepare crawler config using the loader method
    if crawler_config is None:
        crawler_config = DEFAULT_CRAWLER_CONFIG
    crawler_config_obj = CrawlerRunConfig.load(crawler_config)

    # Perform the crawl
    async with AsyncWebCrawler(config=browser_config_obj) as crawler:
        result: CrawlResult = await crawler.arun(url=url, config=crawler_config_obj)

        # Return serializable results
        try:
            # Try the newer Pydantic v2 method
            return result.model_dump()
        except AttributeError:
            try:
                # Fall back to the instance dictionary
                return result.__dict__
            except AttributeError:
                # Fallback to returning the raw result
                return result

@app.function()
@modal.web_endpoint(method="POST")
def crawl_endpoint(data: Dict[str, Any]) -> Dict[str, Any]:
    """
    Web endpoint that accepts POST requests with JSON data containing:
    - url: The URL to crawl
    - browser_config: Optional browser configuration
    - crawler_config: Optional crawler configuration

    Returns the crawl results.
    """
    url = data.get("url")
    if not url:
        return {"error": "URL is required"}
    browser_config = data.get("browser_config")
    crawler_config = data.get("crawler_config")
    return crawl.remote(url, browser_config, crawler_config)

@app.local_entrypoint()
def main(url: str = "https://www.modal.com"):
    """
    Command line entrypoint for local testing.
    """
    result = crawl.remote(url)
    print(result)


@@ -1,453 +0,0 @@
# Deploying Crawl4ai with Modal: A Comprehensive Tutorial
Hey there! UncleCode here. I'm excited to show you how to deploy Crawl4ai using Modal - a fantastic serverless platform that makes deployment super simple and scalable.
In this tutorial, I'll walk you through deploying your own Crawl4ai instance on Modal's infrastructure. This will give you a powerful, scalable web crawling solution without having to worry about infrastructure management.
## What is Modal?
Modal is a serverless platform that allows you to run Python functions in the cloud without managing servers. It's perfect for deploying Crawl4ai because:
1. It handles all the infrastructure for you
2. It scales automatically based on demand
3. It makes deployment incredibly simple
## Prerequisites
Before we get started, you'll need:
- A Modal account (sign up at [modal.com](https://modal.com))
- Python 3.10 or later installed on your local machine
- Basic familiarity with Python and command-line operations
## Step 1: Setting Up Your Modal Account
First, sign up for a Modal account at [modal.com](https://modal.com) if you haven't already. Modal offers a generous free tier that's perfect for getting started.
After signing up, install the Modal CLI and authenticate:
```bash
pip install modal
modal token new
```
This will open a browser window where you can authenticate and generate a token for the CLI.
## Step 2: Creating Your Crawl4ai Deployment
Now, let's create a Python file called `crawl4ai_modal.py` with our deployment code:
```python
import modal
from typing import Optional, Dict, Any

# Create a custom image with Crawl4ai and its dependencies
image = modal.Image.debian_slim(python_version="3.10").pip_install(
    ["fastapi[standard]"]
).run_commands(
    "apt-get update",
    "apt-get install -y software-properties-common",
    "apt-get install -y git",
    "apt-add-repository non-free",
    "apt-add-repository contrib",
    "pip install -U crawl4ai",
    "pip install -U fastapi[standard]",
    "pip install -U pydantic",
    "crawl4ai-setup",  # This installs playwright and downloads chromium
)

# Define the app
app = modal.App("crawl4ai", image=image)

# Define default configurations
DEFAULT_BROWSER_CONFIG = {
    "headless": True,
    "verbose": False,
}
DEFAULT_CRAWLER_CONFIG = {
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "markdown_generator": {
                "type": "DefaultMarkdownGenerator",
                "params": {
                    "content_filter": {
                        "type": "PruningContentFilter",
                        "params": {
                            "threshold": 0.48,
                            "threshold_type": "fixed"
                        }
                    }
                }
            }
        }
    }
}

@app.function(timeout=300)  # 5 minute timeout
async def crawl(
    url: str,
    browser_config: Optional[Dict[str, Any]] = None,
    crawler_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """
    Crawl a given URL using Crawl4ai.

    Args:
        url: The URL to crawl
        browser_config: Optional browser configuration to override defaults
        crawler_config: Optional crawler configuration to override defaults

    Returns:
        A dictionary containing the crawl results
    """
    from crawl4ai import (
        AsyncWebCrawler,
        BrowserConfig,
        CrawlerRunConfig,
        CrawlResult
    )

    # Prepare browser config using the loader method
    if browser_config is None:
        browser_config = DEFAULT_BROWSER_CONFIG
    browser_config_obj = BrowserConfig.load(browser_config)

    # Prepare crawler config using the loader method
    if crawler_config is None:
        crawler_config = DEFAULT_CRAWLER_CONFIG
    crawler_config_obj = CrawlerRunConfig.load(crawler_config)

    # Perform the crawl
    async with AsyncWebCrawler(config=browser_config_obj) as crawler:
        result: CrawlResult = await crawler.arun(url=url, config=crawler_config_obj)

        # Return serializable results
        try:
            # Try newer Pydantic v2 method
            return result.model_dump()
        except AttributeError:
            try:
                # Try older Pydantic v1 method
                return result.dict()
            except AttributeError:
                # Fallback to manual conversion
                return {
                    "url": result.url,
                    "title": result.title,
                    "status": result.status,
                    "content": str(result.content) if hasattr(result, "content") else None,
                    "links": [{"url": link.url, "text": link.text} for link in result.links] if hasattr(result, "links") else [],
                    "markdown_v2": {
                        "raw_markdown": result.markdown_v2.raw_markdown if hasattr(result, "markdown_v2") else None
                    }
                }

@app.function()
@modal.web_endpoint(method="POST")
def crawl_endpoint(data: Dict[str, Any]) -> Dict[str, Any]:
    """
    Web endpoint that accepts POST requests with JSON data containing:
    - url: The URL to crawl
    - browser_config: Optional browser configuration
    - crawler_config: Optional crawler configuration

    Returns the crawl results.
    """
    url = data.get("url")
    if not url:
        return {"error": "URL is required"}
    browser_config = data.get("browser_config")
    crawler_config = data.get("crawler_config")
    return crawl.remote(url, browser_config, crawler_config)

@app.local_entrypoint()
def main(url: str = "https://www.modal.com"):
    """
    Command line entrypoint for local testing.
    """
    result = crawl.remote(url)
    print(result)
```
## Step 3: Understanding the Code Components
Let's break down what's happening in this code:
### 1. Image Definition
```python
image = modal.Image.debian_slim(python_version="3.10").pip_install(
    ["fastapi[standard]"]
).run_commands(
    "apt-get update",
    "apt-get install -y software-properties-common",
    "apt-get install -y git",
    "apt-add-repository non-free",
    "apt-add-repository contrib",
    "pip install -U git+https://github.com/unclecode/crawl4ai.git@next",
    "pip install -U fastapi[standard]",
    "pip install -U pydantic",
    "crawl4ai-setup",  # This installs playwright and downloads chromium
)
```
This section defines the container image that Modal will use to run your code. It:
- Starts with a Debian Slim base image with Python 3.10
- Installs FastAPI
- Updates the system packages
- Installs Git and other dependencies
- Installs Crawl4ai from the GitHub repository
- Runs the Crawl4ai setup to install Playwright and download Chromium
### 2. Modal App Definition
```python
app = modal.App("crawl4ai", image=image)
```
This creates a Modal application named "crawl4ai" that uses the image we defined above.
### 3. Default Configurations
```python
DEFAULT_BROWSER_CONFIG = {
    "headless": True,
    "verbose": False,
}
DEFAULT_CRAWLER_CONFIG = {
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "markdown_generator": {
                "type": "DefaultMarkdownGenerator",
                "params": {
                    "content_filter": {
                        "type": "PruningContentFilter",
                        "params": {
                            "threshold": 0.48,
                            "threshold_type": "fixed"
                        }
                    }
                }
            }
        }
    }
}
```
These define the default configurations for the browser and crawler. You can customize these settings based on your specific needs.
### 4. The Crawl Function
```python
@app.function(timeout=300)
async def crawl(url, browser_config, crawler_config):
    # Function implementation
```
This is the main function that performs the crawling. It:
- Takes a URL and optional configurations
- Sets up the browser and crawler with those configurations
- Performs the crawl
- Returns the results in a serializable format
The `@app.function(timeout=300)` decorator tells Modal to run this function in the cloud with a 5-minute timeout.
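The try/except chain used to serialize the result can be factored into a small standalone helper. The sketch below mirrors that fallback order (Pydantic v2 `model_dump()`, then v1 `dict()`, then the raw `__dict__`); the `FakeV2Result` and `PlainResult` classes are stand-ins for demonstration, not Crawl4ai types:

```python
def to_serializable(result):
    """Convert a result object to a plain dict: Pydantic v2, then v1, then __dict__."""
    for attr in ("model_dump", "dict"):
        method = getattr(result, attr, None)
        if callable(method):
            return method()
    # Last resort: the instance attribute dict (or the object itself)
    return getattr(result, "__dict__", result)

class FakeV2Result:
    """Pretend Pydantic v2 model."""
    def model_dump(self):
        return {"url": "https://example.com", "status": 200}

class PlainResult:
    """Plain object with no Pydantic methods."""
    def __init__(self):
        self.url = "https://example.com"
```

Returning a plain dict matters on Modal because the return value must survive serialization back to the caller.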
### 5. The Web Endpoint
```python
@app.function()
@modal.web_endpoint(method="POST")
def crawl_endpoint(data: Dict[str, Any]) -> Dict[str, Any]:
    # Function implementation
```
This creates a web endpoint that accepts POST requests. It:
- Extracts the URL and configurations from the request
- Calls the crawl function with those parameters
- Returns the results
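The endpoint's input guard (return an error dict when `url` is missing) can be expressed as a reusable check. A minimal sketch; `validate_payload` is a hypothetical helper, not part of the deployment above:

```python
from typing import Any, Dict, Optional, Tuple

def validate_payload(data: Optional[Dict[str, Any]]) -> Tuple[Optional[str], Optional[Dict[str, str]]]:
    """Return (url, None) for a valid payload, or (None, error_dict) otherwise."""
    url = (data or {}).get("url")
    if not url or not isinstance(url, str):
        return None, {"error": "URL is required"}
    return url, None
```

Inside `crawl_endpoint` you would call it first and return the error dict early, keeping the happy path unindented.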
### 6. Local Entrypoint
```python
@app.local_entrypoint()
def main(url: str = "https://www.modal.com"):
    # Function implementation
```
This provides a way to test the application from the command line.
## Step 4: Testing Locally
Before deploying, let's test our application locally:
```bash
modal run crawl4ai_modal.py --url "https://example.com"
```
This command will:
1. Upload your code to Modal
2. Create the necessary containers
3. Run the `main` function with the specified URL
4. Return the results
Modal will handle all the infrastructure setup for you. You should see the crawling results printed to your console.
## Step 5: Deploying Your Application
Once you're satisfied with the local testing, it's time to deploy:
```bash
modal deploy crawl4ai_modal.py
```
This will deploy your application to Modal's cloud. The deployment process will output URLs for your web endpoints.
You should see output similar to:
```
✓ Deployed crawl4ai.
URLs:
crawl_endpoint => https://your-username--crawl-endpoint.modal.run
```
Save this URL - you'll need it to make requests to your deployment.
## Step 6: Using Your Deployment
Now that your application is deployed, you can use it by sending POST requests to the endpoint URL:
```bash
curl -X POST https://your-username--crawl-endpoint.modal.run \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```
Or in Python:
```python
import requests

response = requests.post(
    "https://your-username--crawl-endpoint.modal.run",
    json={"url": "https://example.com"}
)
result = response.json()
print(result)
```
You can also customize the browser and crawler configurations:
```python
requests.post(
    "https://your-username--crawl-endpoint.modal.run",
    json={
        "url": "https://example.com",
        "browser_config": {
            "headless": False,
            "verbose": True
        },
        "crawler_config": {
            "crawler_config": {
                "type": "CrawlerRunConfig",
                "params": {
                    "markdown_generator": {
                        "type": "DefaultMarkdownGenerator",
                        "params": {
                            "content_filter": {
                                "type": "PruningContentFilter",
                                "params": {
                                    "threshold": 0.6,  # Adjusted threshold
                                    "threshold_type": "fixed"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
)
```
## Step 7: Calling Your Deployment from Another Python Script
You can also call your deployed function directly from another Python script:
```python
import modal
# Get a reference to the deployed function
crawl_function = modal.Function.from_name("crawl4ai", "crawl")
# Call the function
result = crawl_function.remote("https://example.com")
print(result)
```
## Understanding Modal's Execution Flow
To understand how Modal works, it's important to know:
1. **Local vs. Remote Execution**: When you call a function with `.remote()`, it runs in Modal's cloud, not on your local machine.
2. **Container Lifecycle**: Modal creates containers on-demand and destroys them when they're not needed.
3. **Caching**: Modal caches your container images to speed up subsequent runs.
4. **Serverless Scaling**: Modal automatically scales your application based on demand.
## Customizing Your Deployment
You can customize your deployment in several ways:
### Changing the Crawl4ai Version
To use a different version of Crawl4ai, update the installation command in the image definition:
```python
"pip install -U git+https://github.com/unclecode/crawl4ai.git@main", # Use main branch
```
### Adjusting Resource Limits
You can change the resources allocated to your functions:
```python
@app.function(timeout=600, cpu=2, memory=4096)  # 10 minute timeout, 2 CPUs, 4GB RAM
async def crawl(...):
    # Function implementation
```
### Keeping Containers Warm
To reduce cold start times, you can keep containers warm:
```python
@app.function(keep_warm=1)  # Keep 1 container warm
async def crawl(...):
    # Function implementation
```
## Conclusion
That's it! You've successfully deployed Crawl4ai on Modal. You now have a scalable web crawling solution that can handle as many requests as you need without requiring any infrastructure management.
The beauty of this setup is its simplicity - Modal handles all the hard parts, letting you focus on using Crawl4ai to extract the data you need.
Feel free to reach out if you have any questions or need help with your deployment!
Happy crawling!
- UncleCode
## Additional Resources
- [Modal Documentation](https://modal.com/docs)
- [Crawl4ai GitHub Repository](https://github.com/unclecode/crawl4ai)
- [Crawl4ai Documentation](https://docs.crawl4ai.com)


@@ -1,317 +0,0 @@
#!/usr/bin/env python3
"""
Crawl4ai API Testing Script

This script tests all endpoints of the Crawl4ai API service and demonstrates their usage.
"""
import argparse
import json
import sys
import time
from typing import Dict, Any, List, Optional

import requests

# Colors for terminal output
class Colors:
    HEADER = '\033[95m'
    BLUE = '\033[94m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

def print_header(text: str) -> None:
    """Print a formatted header."""
    print(f"\n{Colors.HEADER}{Colors.BOLD}{'=' * 80}{Colors.ENDC}")
    print(f"{Colors.HEADER}{Colors.BOLD}{text.center(80)}{Colors.ENDC}")
    print(f"{Colors.HEADER}{Colors.BOLD}{'=' * 80}{Colors.ENDC}\n")

def print_step(text: str) -> None:
    """Print a formatted step description."""
    print(f"{Colors.BLUE}{Colors.BOLD}>> {text}{Colors.ENDC}")

def print_success(text: str) -> None:
    """Print a success message."""
    print(f"{Colors.GREEN}{text}{Colors.ENDC}")

def print_warning(text: str) -> None:
    """Print a warning message."""
    print(f"{Colors.YELLOW}{text}{Colors.ENDC}")

def print_error(text: str) -> None:
    """Print an error message."""
    print(f"{Colors.RED}{text}{Colors.ENDC}")

def print_json(data: Dict[str, Any]) -> None:
    """Pretty print JSON data."""
    print(json.dumps(data, indent=2))

def make_request(method: str, url: str, params: Optional[Dict[str, Any]] = None,
                 json_data: Optional[Dict[str, Any]] = None,
                 expected_status: int = 200) -> Dict[str, Any]:
    """Make an HTTP request and handle errors."""
    print_step(f"Making {method.upper()} request to {url}")
    if params:
        print(f"  Parameters: {params}")
    if json_data:
        print(f"  JSON Data: {json_data}")
    try:
        response = requests.request(
            method=method,
            url=url,
            params=params,
            json=json_data,
            timeout=300  # 5 minute timeout for crawling operations
        )
        status_code = response.status_code
        print(f"  Status Code: {status_code}")
        try:
            data = response.json()
            print("  Response:")
            print_json(data)
            if status_code != expected_status:
                print_error(f"Expected status code {expected_status}, got {status_code}")
                return data
            print_success("Request successful")
            return data
        except ValueError:
            print_error("Response is not valid JSON")
            print(response.text)
            return {"error": "Invalid JSON response"}
    except requests.RequestException as e:
        print_error(f"Request failed: {str(e)}")
        return {"error": str(e)}
def test_health_check(base_url: str) -> bool:
    """Test the health check endpoint."""
    print_header("Testing Health Check Endpoint")
    response = make_request("GET", f"{base_url}/health_check")
    if "status" in response and response["status"] == "online":
        print_success("Health check passed")
        return True
    else:
        print_error("Health check failed")
        return False

def test_admin_create_user(base_url: str, admin_token: str, email: str, name: str) -> Optional[str]:
    """Test creating a new user."""
    print_header("Testing Admin User Creation")
    response = make_request(
        "POST",
        f"{base_url}/admin_create_user",
        json_data={
            "admin_token": admin_token,
            "email": email,
            "name": name
        },
        expected_status=201
    )
    if response.get("success") and "data" in response:
        api_token = response["data"].get("api_token")
        if api_token:
            print_success(f"User created successfully with API token: {api_token}")
            return api_token
    print_error("Failed to create user")
    return None

def test_check_credits(base_url: str, api_token: str) -> Optional[int]:
    """Test checking user credits."""
    print_header("Testing Check Credits Endpoint")
    response = make_request(
        "GET",
        f"{base_url}/check_credits",
        params={"api_token": api_token}
    )
    if response.get("success") and "data" in response:
        credits = response["data"].get("credits")
        if credits is not None:
            print_success(f"User has {credits} credits")
            return credits
    print_error("Failed to check credits")
    return None

def test_crawl_endpoint(base_url: str, api_token: str, url: str) -> bool:
    """Test the crawl endpoint."""
    print_header("Testing Crawl Endpoint")
    response = make_request(
        "POST",
        f"{base_url}/crawl_endpoint",
        json_data={
            "api_token": api_token,
            "url": url
        }
    )
    if response.get("success") and "data" in response:
        print_success("Crawl completed successfully")
        # Display some crawl result data
        data = response["data"]
        if "title" in data:
            print(f"Page Title: {data['title']}")
        if "status" in data:
            print(f"Status: {data['status']}")
        if "links" in data:
            print(f"Links found: {len(data['links'])}")
        if "markdown_v2" in data and data["markdown_v2"] and "raw_markdown" in data["markdown_v2"]:
            print("Markdown Preview (first 200 chars):")
            print(data["markdown_v2"]["raw_markdown"][:200] + "...")
        credits_remaining = response.get("credits_remaining")
        if credits_remaining is not None:
            print(f"Credits remaining: {credits_remaining}")
        return True
    print_error("Crawl failed")
    return False

def test_admin_update_credits(base_url: str, admin_token: str, api_token: str, amount: int) -> bool:
    """Test updating user credits."""
    print_header("Testing Admin Update Credits")
    response = make_request(
        "POST",
        f"{base_url}/admin_update_credits",
        json_data={
            "admin_token": admin_token,
            "api_token": api_token,
            "amount": amount
        }
    )
    if response.get("success") and "data" in response:
        print_success(f"Credits updated successfully, new balance: {response['data'].get('credits')}")
        return True
    print_error("Failed to update credits")
    return False

def test_admin_get_users(base_url: str, admin_token: str) -> List[Dict[str, Any]]:
    """Test getting all users."""
    print_header("Testing Admin Get All Users")
    response = make_request(
        "GET",
        f"{base_url}/admin_get_users",
        params={"admin_token": admin_token}
    )
    if response.get("success") and "data" in response:
        users = response["data"]
        print_success(f"Retrieved {len(users)} users")
        return users
    print_error("Failed to get users")
    return []
def run_full_test(base_url: str, admin_token: str) -> None:
    """Run all tests in sequence."""
    # Remove trailing slash if present
    base_url = base_url.rstrip('/')

    # Test 1: Health Check
    if not test_health_check(base_url):
        print_error("Health check failed, aborting tests")
        sys.exit(1)

    # Test 2: Create a test user
    email = f"test-user-{int(time.time())}@example.com"
    name = "Test User"
    api_token = test_admin_create_user(base_url, admin_token, email, name)
    if not api_token:
        print_error("User creation failed, aborting tests")
        sys.exit(1)

    # Test 3: Check initial credits
    initial_credits = test_check_credits(base_url, api_token)
    if initial_credits is None:
        print_error("Credit check failed, aborting tests")
        sys.exit(1)

    # Test 4: Perform a crawl
    test_url = "https://news.ycombinator.com"
    crawl_success = test_crawl_endpoint(base_url, api_token, test_url)
    if not crawl_success:
        print_warning("Crawl test failed, but continuing with other tests")

    # Test 5: Check credits after crawl
    post_crawl_credits = test_check_credits(base_url, api_token)
    if post_crawl_credits is not None and initial_credits is not None:
        if post_crawl_credits == initial_credits - 1:
            print_success("Credit deduction verified")
        else:
            print_warning(f"Unexpected credit change: {initial_credits} -> {post_crawl_credits}")

    # Test 6: Add credits
    add_credits_amount = 50
    if test_admin_update_credits(base_url, admin_token, api_token, add_credits_amount):
        print_success(f"Added {add_credits_amount} credits")

    # Test 7: Check credits after addition
    post_addition_credits = test_check_credits(base_url, api_token)
    if post_addition_credits is not None and post_crawl_credits is not None:
        if post_addition_credits == post_crawl_credits + add_credits_amount:
            print_success("Credit addition verified")
        else:
            print_warning(f"Unexpected credit change: {post_crawl_credits} -> {post_addition_credits}")

    # Test 8: Get all users
    users = test_admin_get_users(base_url, admin_token)
    if users:
        # Check if our test user is in the list
        test_user = next((user for user in users if user.get("email") == email), None)
        if test_user:
            print_success("Test user found in users list")
        else:
            print_warning("Test user not found in users list")

    # Final report
    print_header("Test Summary")
    print_success("All endpoints tested successfully")
    print(f"Test user created with email: {email}")
    print(f"API token: {api_token}")
    print(f"Final credit balance: {post_addition_credits}")

def main():
    parser = argparse.ArgumentParser(description="Test Crawl4ai API endpoints")
    parser.add_argument("--base-url", required=True, help="Base URL of the Crawl4ai API (e.g., https://username--crawl4ai-api.modal.run)")
    parser.add_argument("--admin-token", required=True, help="Admin token for authentication")
    args = parser.parse_args()

    print_header("Crawl4ai API Test Script")
    print(f"Testing API at: {args.base_url}")
    run_full_test(args.base_url, args.admin_token)

if __name__ == "__main__":
    main()


@@ -0,0 +1,123 @@
# Builtin Browser in Crawl4AI
This document explains the builtin browser feature in Crawl4AI and how to use it effectively.
## What is the Builtin Browser?
The builtin browser is a persistent Chrome instance that Crawl4AI manages for you. It runs in the background and can be used by multiple crawling operations, eliminating the need to start and stop browsers for each crawl.
Benefits include:
- **Faster startup times** - The browser is already running, so your scripts start faster
- **Shared resources** - All your crawling scripts can use the same browser instance
- **Simplified management** - No need to worry about CDP URLs or browser processes
- **Persistent cookies and sessions** - Browser state persists between script runs
- **Less resource usage** - Only one browser instance for multiple scripts
## Using the Builtin Browser
### In Python Code
Using the builtin browser in your code is simple:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

# Create browser config with builtin mode
browser_config = BrowserConfig(
    browser_mode="builtin",  # This is the key setting!
    headless=True            # Can be headless or not
)

# Create the crawler
crawler = AsyncWebCrawler(config=browser_config)

# Use it - no need to explicitly start()
result = await crawler.arun("https://example.com")
```
Key points:
1. Set `browser_mode="builtin"` in your BrowserConfig
2. No need for explicit `start()` call - the crawler will automatically connect to the builtin browser
3. No need to use a context manager or call `close()` - the browser stays running
### Via CLI
The CLI provides commands to manage the builtin browser:
```bash
# Start the builtin browser
crwl browser start
# Check its status
crwl browser status
# Open a visible window to see what the browser is doing
crwl browser view --url https://example.com
# Stop it when no longer needed
crwl browser stop
# Restart with different settings
crwl browser restart --no-headless
```
When crawling via CLI, simply add the builtin browser mode:
```bash
crwl https://example.com -b "browser_mode=builtin"
```
## How It Works
1. When a crawler with `browser_mode="builtin"` is created:
   - It checks if a builtin browser is already running
   - If not, it automatically launches one
   - It connects to the browser via CDP (Chrome DevTools Protocol)
2. The browser process continues running after your script exits
   - This means it's ready for the next crawl
   - You can manage it via the CLI commands
3. During installation, Crawl4AI attempts to create a builtin browser automatically
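The "is a builtin browser already running" check boils down to probing the CDP endpoint. A minimal sketch, assuming the browser exposes CDP on a local TCP port (9222 is a common Chrome default, not necessarily the port Crawl4AI picks):

```python
import socket

def cdp_port_open(host: str = "localhost", port: int = 9222, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: nothing is listening there
        return False
```

A real implementation would go further and issue an HTTP request to `/json/version` to confirm the listener is actually a CDP endpoint, but a TCP probe is the first gate.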
## Example
See the [builtin_browser_example.py](builtin_browser_example.py) file for a complete example.
Run it with:
```bash
python builtin_browser_example.py
```
## When to Use
The builtin browser is ideal for:
- Scripts that run frequently
- Development and testing workflows
- Applications that need to minimize startup time
- Systems where you want to manage browser instances centrally
You might not want to use it when:
- Running one-off scripts
- When you need different browser configurations for different tasks
- In environments where persistent processes are not allowed
## Troubleshooting
If you encounter issues:
1. Check the browser status:

   ```bash
   crwl browser status
   ```

2. Try restarting it:

   ```bash
   crwl browser restart
   ```

3. If problems persist, stop it and let Crawl4AI start a fresh one:

   ```bash
   crwl browser stop
   ```


@@ -0,0 +1,79 @@
import asyncio
import time

from crawl4ai.async_webcrawler import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, RateLimiter

VERBOSE = False

async def crawl_sequential(urls):
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, verbose=VERBOSE)
    results = []
    start_time = time.perf_counter()
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result_container = await crawler.arun(url=url, config=config)
            results.append(result_container[0])
    total_time = time.perf_counter() - start_time
    return total_time, results

async def crawl_parallel_dispatcher(urls):
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, verbose=VERBOSE)
    # Dispatcher with rate limiter enabled (default behavior)
    dispatcher = MemoryAdaptiveDispatcher(
        rate_limiter=RateLimiter(base_delay=(1.0, 3.0), max_delay=60.0, max_retries=3),
        max_session_permit=50,
    )
    start_time = time.perf_counter()
    async with AsyncWebCrawler() as crawler:
        result_container = await crawler.arun_many(urls=urls, config=config, dispatcher=dispatcher)
        results = []
        if isinstance(result_container, list):
            results = result_container
        else:
            async for res in result_container:
                results.append(res)
    total_time = time.perf_counter() - start_time
    return total_time, results

async def crawl_parallel_no_rate_limit(urls):
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, verbose=VERBOSE)
    # Dispatcher with no rate limiter and a high session permit to avoid queuing
    dispatcher = MemoryAdaptiveDispatcher(
        rate_limiter=None,
        max_session_permit=len(urls)  # allow all URLs concurrently
    )
    start_time = time.perf_counter()
    async with AsyncWebCrawler() as crawler:
        result_container = await crawler.arun_many(urls=urls, config=config, dispatcher=dispatcher)
        results = []
        if isinstance(result_container, list):
            results = result_container
        else:
            async for res in result_container:
                results.append(res)
    total_time = time.perf_counter() - start_time
    return total_time, results

async def main():
    urls = ["https://example.com"] * 100

    print(f"Crawling {len(urls)} URLs sequentially...")
    seq_time, seq_results = await crawl_sequential(urls)
    print(f"Sequential crawling took: {seq_time:.2f} seconds\n")

    print(f"Crawling {len(urls)} URLs in parallel using arun_many with dispatcher (with rate limit)...")
    disp_time, disp_results = await crawl_parallel_dispatcher(urls)
    print(f"Parallel (dispatcher with rate limiter) took: {disp_time:.2f} seconds\n")

    print(f"Crawling {len(urls)} URLs in parallel using dispatcher with no rate limiter...")
    no_rl_time, no_rl_results = await crawl_parallel_no_rate_limit(urls)
    print(f"Parallel (dispatcher without rate limiter) took: {no_rl_time:.2f} seconds\n")

    print("Crawl4ai - Crawling Comparison")
    print("--------------------------------------------------------")
    print(f"Sequential crawling took: {seq_time:.2f} seconds")
    print(f"Parallel (dispatcher with rate limiter) took: {disp_time:.2f} seconds")
    print(f"Parallel (dispatcher without rate limiter) took: {no_rl_time:.2f} seconds")

if __name__ == "__main__":
    asyncio.run(main())


@@ -0,0 +1,86 @@
#!/usr/bin/env python3
"""
Builtin Browser Example

This example demonstrates how to use Crawl4AI's builtin browser feature,
which simplifies the browser management process. With builtin mode:
- No need to manually start or connect to a browser
- No need to manage CDP URLs or browser processes
- Automatically connects to an existing browser or launches one if needed
- Browser persists between script runs, reducing startup time
- No explicit cleanup or close() calls needed

The example also demonstrates "auto-starting", where you don't need to
explicitly call the start() method on the crawler.
"""
import asyncio
import time

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def crawl_with_builtin_browser():
    """
    Simple example of crawling with the builtin browser.

    Key features:
    1. browser_mode="builtin" in BrowserConfig
    2. No explicit start() call needed
    3. No explicit close() needed
    """
    print("\n=== Crawl4AI Builtin Browser Example ===\n")

    # Create a browser configuration with builtin mode
    browser_config = BrowserConfig(
        browser_mode="builtin",  # This is the key setting!
        headless=True            # Can run headless for background operation
    )

    # Create crawler run configuration
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,  # Skip cache for this demo
        screenshot=True,              # Take a screenshot
        verbose=True                  # Show verbose logging
    )

    # Create the crawler instance
    # Note: We don't need to use the "async with" context manager
    crawler = AsyncWebCrawler(config=browser_config)

    # Start crawling several URLs - no explicit start() needed!
    # The crawler will automatically connect to the builtin browser
    print("\n➡️ Crawling first URL...")
    t0 = time.time()
    result1 = await crawler.arun(
        url="https://crawl4ai.com",
        config=crawler_config
    )
    t1 = time.time()
    print(f"✅ First URL crawled in {t1-t0:.2f} seconds")
    print(f"   Got {len(result1.markdown.raw_markdown)} characters of content")
    print(f"   Title: {result1.metadata.get('title', 'No title')}")

    # Try another URL - the browser is already running, so this should be faster
    print("\n➡️ Crawling second URL...")
    t0 = time.time()
    result2 = await crawler.arun(
        url="https://example.com",
        config=crawler_config
    )
    t1 = time.time()
    print(f"✅ Second URL crawled in {t1-t0:.2f} seconds")
    print(f"   Got {len(result2.markdown.raw_markdown)} characters of content")
    print(f"   Title: {result2.metadata.get('title', 'No title')}")

    # The builtin browser continues running in the background
    # No need to explicitly close it
    print("\n🔄 The builtin browser remains running for future use")
    print("   You can use 'crwl browser status' to check its status")
    print("   or 'crwl browser stop' to stop it when completely done")

async def main():
    """Run the example"""
    await crawl_with_builtin_browser()

if __name__ == "__main__":
    asyncio.run(main())

View File

@@ -0,0 +1,209 @@
"""
CrawlerMonitor Example
This example demonstrates how to use the CrawlerMonitor component
to visualize and track web crawler operations in real-time.
"""
import time
import uuid
import random
import threading
from crawl4ai.components.crawler_monitor import CrawlerMonitor
from crawl4ai.models import CrawlStatus
def simulate_webcrawler_operations(monitor, num_tasks=20):
"""
Simulates a web crawler's operations with multiple tasks and different states.
Args:
monitor: The CrawlerMonitor instance
num_tasks: Number of tasks to simulate
"""
print(f"Starting simulation with {num_tasks} tasks...")
# Create and register all tasks first
task_ids = []
for i in range(num_tasks):
task_id = str(uuid.uuid4())
url = f"https://example.com/page{i}"
monitor.add_task(task_id, url)
task_ids.append((task_id, url))
# Small delay between task creation
time.sleep(0.2)
# Process tasks with a variety of different behaviors
threads = []
for i, (task_id, url) in enumerate(task_ids):
# Create a thread for each task
thread = threading.Thread(
target=process_task,
args=(monitor, task_id, url, i)
)
thread.daemon = True
threads.append(thread)
# Start threads in batches to simulate concurrent processing
batch_size = 4 # Process 4 tasks at a time
for i in range(0, len(threads), batch_size):
batch = threads[i:i+batch_size]
for thread in batch:
thread.start()
time.sleep(0.5) # Stagger thread start times
# Wait a bit before starting next batch
time.sleep(random.uniform(1.0, 3.0))
# Update queue statistics
update_queue_stats(monitor)
# Simulate memory pressure changes
active_threads = [t for t in threads if t.is_alive()]
if len(active_threads) > 8:
monitor.update_memory_status("CRITICAL")
elif len(active_threads) > 4:
monitor.update_memory_status("PRESSURE")
else:
monitor.update_memory_status("NORMAL")
# Wait for all threads to complete
for thread in threads:
thread.join()
# Final updates
update_queue_stats(monitor)
monitor.update_memory_status("NORMAL")
print("Simulation completed!")
def process_task(monitor, task_id, url, index):
"""Simulate processing of a single task."""
# Tasks start in queued state (already added)
# Simulate waiting in queue
wait_time = random.uniform(0.5, 3.0)
time.sleep(wait_time)
# Start processing - move to IN_PROGRESS
monitor.update_task(
task_id=task_id,
status=CrawlStatus.IN_PROGRESS,
start_time=time.time(),
wait_time=wait_time
)
# Simulate task processing with memory usage changes
total_process_time = random.uniform(2.0, 10.0)
step_time = total_process_time / 5 # Update in 5 steps
for step in range(5):
# Simulate increasing then decreasing memory usage
if step < 3: # First 3 steps - increasing
memory_usage = random.uniform(5.0, 20.0) * (step + 1)
else: # Last 2 steps - decreasing
memory_usage = random.uniform(5.0, 20.0) * (5 - step)
# Update peak memory if this is higher
peak = max(memory_usage, monitor.get_task_stats(task_id).get("peak_memory", 0))
monitor.update_task(
task_id=task_id,
memory_usage=memory_usage,
peak_memory=peak
)
time.sleep(step_time)
# Determine final state - 80% success, 20% failure
if index % 5 == 0: # Every 5th task fails
monitor.update_task(
task_id=task_id,
status=CrawlStatus.FAILED,
end_time=time.time(),
memory_usage=0.0,
error_message="Connection timeout"
)
else:
monitor.update_task(
task_id=task_id,
status=CrawlStatus.COMPLETED,
end_time=time.time(),
memory_usage=0.0
)
def update_queue_stats(monitor):
"""Update queue statistics based on current tasks."""
task_stats = monitor.get_all_task_stats()
# Count queued tasks
queued_tasks = [
stats for stats in task_stats.values()
if stats["status"] == CrawlStatus.QUEUED.name
]
total_queued = len(queued_tasks)
if total_queued > 0:
current_time = time.time()
# Calculate wait times
wait_times = [
current_time - stats.get("enqueue_time", current_time)
for stats in queued_tasks
]
highest_wait_time = max(wait_times) if wait_times else 0.0
avg_wait_time = sum(wait_times) / len(wait_times) if wait_times else 0.0
else:
highest_wait_time = 0.0
avg_wait_time = 0.0
# Update monitor
monitor.update_queue_statistics(
total_queued=total_queued,
highest_wait_time=highest_wait_time,
avg_wait_time=avg_wait_time
)
def main():
# Initialize the monitor
monitor = CrawlerMonitor(
urls_total=20, # Total URLs to process
refresh_rate=0.5, # Update UI twice per second
enable_ui=True, # Enable terminal UI
max_width=120 # Set maximum width to 120 characters
)
# Start the monitor
monitor.start()
try:
# Run simulation
simulate_webcrawler_operations(monitor)
# Keep monitor running a bit to see final state
print("Waiting to view final state...")
time.sleep(5)
except KeyboardInterrupt:
print("\nExample interrupted by user")
finally:
# Stop the monitor
monitor.stop()
print("Example completed!")
# Print some statistics
summary = monitor.get_summary()
print("\nCrawler Statistics Summary:")
print(f"Total URLs: {summary['urls_total']}")
print(f"Completed: {summary['urls_completed']}")
print(f"Completion percentage: {summary['completion_percentage']:.1f}%")
print(f"Peak memory usage: {summary['peak_memory_percent']:.1f}%")
# Print task status counts
status_counts = summary['status_counts']
print("\nTask Status Counts:")
for status, count in status_counts.items():
print(f" {status}: {count}")
if __name__ == "__main__":
main()
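The simulation above only touches a small part of the monitor's surface: `add_task`, `update_task`, `get_task_stats`, and `get_all_task_stats`. As a rough mental model of that surface, here is a minimal dict-backed stand-in (an illustrative assumption, not the real implementation — the actual `CrawlerMonitor` adds the terminal UI, queue statistics, and memory-status tracking):

```python
import time
from enum import Enum

class CrawlStatus(Enum):
    QUEUED = "QUEUED"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"

class MiniMonitor:
    """Dict-backed stand-in for the monitor methods the example calls."""
    def __init__(self):
        self._tasks = {}

    def add_task(self, task_id, url):
        # New tasks start queued, with an enqueue timestamp for wait-time stats
        self._tasks[task_id] = {
            "url": url,
            "status": CrawlStatus.QUEUED.name,
            "enqueue_time": time.time(),
            "peak_memory": 0.0,
        }

    def update_task(self, task_id, status=None, **fields):
        stats = self._tasks[task_id]
        if status is not None:
            stats["status"] = status.name
        stats.update(fields)

    def get_task_stats(self, task_id):
        return self._tasks[task_id]

    def get_all_task_stats(self):
        return self._tasks

monitor = MiniMonitor()
monitor.add_task("t1", "https://example.com/page0")
monitor.update_task("t1", status=CrawlStatus.IN_PROGRESS,
                    memory_usage=12.5, peak_memory=12.5)
monitor.update_task("t1", status=CrawlStatus.COMPLETED, memory_usage=0.0)
print(monitor.get_task_stats("t1")["status"])  # COMPLETED
```

The `update_queue_stats` helper in the example works against exactly this shape: it filters `get_all_task_stats()` by `status == CrawlStatus.QUEUED.name` and derives wait times from `enqueue_time`.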

View File

@@ -73,7 +73,7 @@ async def test_stream_crawl(session, token: str):
# "https://news.ycombinator.com/news" # "https://news.ycombinator.com/news"
], ],
"browser_config": {"headless": True, "viewport": {"width": 1200}}, "browser_config": {"headless": True, "viewport": {"width": 1200}},
"crawler_config": {"stream": True, "cache_mode": "aggressive"} "crawler_config": {"stream": True, "cache_mode": "bypass"}
} }
headers = {"Authorization": f"Bearer {token}"} headers = {"Authorization": f"Bearer {token}"}
print(f"\nTesting Streaming Crawl: {url}") print(f"\nTesting Streaming Crawl: {url}")

View File

@@ -9,6 +9,26 @@ from crawl4ai import (
    CrawlResult
)
async def example_cdp():
    browser_conf = BrowserConfig(
        headless=False,
        cdp_url="http://localhost:9223"
    )
    crawler_config = CrawlerRunConfig(
        session_id="test",
        js_code="""(() => { return {"result": "Hello World!"} })()""",
        js_only=True
    )
    async with AsyncWebCrawler(
        config=browser_conf,
        verbose=True,
    ) as crawler:
        result : CrawlResult = await crawler.arun(
            url="https://www.helloworld.org",
            config=crawler_config,
        )
        print(result.js_execution_result)
async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
@@ -16,18 +36,15 @@ async def main():
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
-            # content_filter=PruningContentFilter(
-            #     threshold=0.48, threshold_type="fixed", min_word_threshold=0
-            # )
+            content_filter=PruningContentFilter(
+                threshold=0.48, threshold_type="fixed", min_word_threshold=0
+            )
        ),
    )
    result : CrawlResult = await crawler.arun(
-        # url="https://www.helloworld.org", config=crawler_config
-        url="https://www.kidocode.com", config=crawler_config
+        url="https://www.helloworld.org", config=crawler_config
    )
    print(result.markdown.raw_markdown[:500])
-    # print(result.model_dump())
if __name__ == "__main__":
    asyncio.run(main())

View File

@@ -42,7 +42,7 @@ dependencies = [
"pyperclip>=1.8.2", "pyperclip>=1.8.2",
"faust-cchardet>=2.1.19", "faust-cchardet>=2.1.19",
"aiohttp>=3.11.11", "aiohttp>=3.11.11",
"humanize>=4.10.0" "humanize>=4.10.0",
] ]
classifiers = [ classifiers = [
"Development Status :: 4 - Beta", "Development Status :: 4 - Beta",

View File

@@ -10,6 +10,7 @@ import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, JsonXPathExtractionStrategy
+from crawl4ai.utils import preprocess_html_for_schema, JsonXPathExtractionStrategy
import json
# Test HTML - A complex job board with companies, departments, and positions
View File

@@ -0,0 +1,4 @@
"""Docker browser strategy tests.
This package contains tests for the Docker browser strategy implementation.
"""

View File

@@ -0,0 +1,653 @@
"""Test examples for Docker Browser Strategy.
These examples demonstrate the functionality of Docker Browser Strategy
and serve as functional tests.
"""
import asyncio
import os
import sys
import shutil
import uuid
import json
from typing import List, Dict, Any, Optional, Tuple
# Add the project root to Python path if running directly
if __name__ == "__main__":
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../../..')))
from crawl4ai.browser import BrowserManager
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.async_logger import AsyncLogger
from crawl4ai.browser.docker_config import DockerConfig
from crawl4ai.browser.docker_registry import DockerRegistry
from crawl4ai.browser.docker_utils import DockerUtils
# Create a logger for clear terminal output
logger = AsyncLogger(verbose=True, log_file=None)
# Global Docker utils instance
docker_utils = DockerUtils(logger)
async def test_docker_components():
"""Test Docker utilities, registry, and image building.
This function tests the core Docker components before running the browser tests.
It validates DockerRegistry, DockerUtils, and builds test images to ensure
everything is functioning correctly.
"""
logger.info("Testing Docker components", tag="SETUP")
# Create a test registry directory
registry_dir = os.path.join(os.path.dirname(__file__), "test_registry")
registry_file = os.path.join(registry_dir, "test_registry.json")
os.makedirs(registry_dir, exist_ok=True)
try:
# 1. Test DockerRegistry
logger.info("Testing DockerRegistry...", tag="SETUP")
registry = DockerRegistry(registry_file)
# Test saving and loading registry
test_container_id = "test-container-123"
registry.register_container(test_container_id, 9876, "test-hash-123")
registry.save()
# Create a new registry instance that loads from the file
registry2 = DockerRegistry(registry_file)
port = registry2.get_container_host_port(test_container_id)
hash_value = registry2.get_container_config_hash(test_container_id)
if port != 9876 or hash_value != "test-hash-123":
logger.error("DockerRegistry persistence failed", tag="SETUP")
return False
# Clean up test container from registry
registry2.unregister_container(test_container_id)
logger.success("DockerRegistry works correctly", tag="SETUP")
# 2. Test DockerUtils
logger.info("Testing DockerUtils...", tag="SETUP")
# Test port detection
in_use = docker_utils.is_port_in_use(22) # SSH port is usually in use
logger.info(f"Port 22 in use: {in_use}", tag="SETUP")
# Get next available port
available_port = docker_utils.get_next_available_port(9000)
logger.info(f"Next available port: {available_port}", tag="SETUP")
# Test config hash generation
config_dict = {"mode": "connect", "headless": True}
config_hash = docker_utils.generate_config_hash(config_dict)
logger.info(f"Generated config hash: {config_hash[:8]}...", tag="SETUP")
# 3. Test Docker is available
logger.info("Checking Docker availability...", tag="SETUP")
if not await check_docker_available():
logger.error("Docker is not available - cannot continue tests", tag="SETUP")
return False
# 4. Test building connect image
logger.info("Building connect mode Docker image...", tag="SETUP")
connect_image = await docker_utils.ensure_docker_image_exists(None, "connect")
if not connect_image:
logger.error("Failed to build connect mode image", tag="SETUP")
return False
logger.success(f"Successfully built connect image: {connect_image}", tag="SETUP")
# 5. Test building launch image
logger.info("Building launch mode Docker image...", tag="SETUP")
launch_image = await docker_utils.ensure_docker_image_exists(None, "launch")
if not launch_image:
logger.error("Failed to build launch mode image", tag="SETUP")
return False
logger.success(f"Successfully built launch image: {launch_image}", tag="SETUP")
# 6. Test creating and removing container
logger.info("Testing container creation and removal...", tag="SETUP")
container_id = await docker_utils.create_container(
image_name=launch_image,
host_port=available_port,
container_name="crawl4ai-test-container"
)
if not container_id:
logger.error("Failed to create test container", tag="SETUP")
return False
logger.info(f"Created test container: {container_id[:12]}", tag="SETUP")
# Verify container is running
running = await docker_utils.is_container_running(container_id)
if not running:
logger.error("Test container is not running", tag="SETUP")
await docker_utils.remove_container(container_id)
return False
# Test commands in container
logger.info("Testing command execution in container...", tag="SETUP")
returncode, stdout, stderr = await docker_utils.exec_in_container(
container_id, ["ls", "-la", "/"]
)
if returncode != 0:
logger.error(f"Command execution failed: {stderr}", tag="SETUP")
await docker_utils.remove_container(container_id)
return False
# Verify Chrome is installed in the container
returncode, stdout, stderr = await docker_utils.exec_in_container(
container_id, ["which", "google-chrome"]
)
if returncode != 0:
logger.error("Chrome not found in container", tag="SETUP")
await docker_utils.remove_container(container_id)
return False
chrome_path = stdout.strip()
logger.info(f"Chrome found at: {chrome_path}", tag="SETUP")
# Test Chrome version
returncode, stdout, stderr = await docker_utils.exec_in_container(
container_id, ["google-chrome", "--version"]
)
if returncode != 0:
logger.error(f"Failed to get Chrome version: {stderr}", tag="SETUP")
await docker_utils.remove_container(container_id)
return False
logger.info(f"Chrome version: {stdout.strip()}", tag="SETUP")
# Remove test container
removed = await docker_utils.remove_container(container_id)
if not removed:
logger.error("Failed to remove test container", tag="SETUP")
return False
logger.success("Test container removed successfully", tag="SETUP")
# All components tested successfully
logger.success("All Docker components tested successfully", tag="SETUP")
return True
except Exception as e:
logger.error(f"Docker component tests failed: {str(e)}", tag="SETUP")
return False
finally:
# Clean up registry test directory
if os.path.exists(registry_dir):
shutil.rmtree(registry_dir)
async def test_docker_connect_mode():
"""Test Docker browser in connect mode.
This tests the basic functionality of creating a browser in Docker
connect mode and using it for navigation.
"""
logger.info("Testing Docker browser in connect mode", tag="TEST")
# Create temp directory for user data
temp_dir = os.path.join(os.path.dirname(__file__), "tmp_user_data")
os.makedirs(temp_dir, exist_ok=True)
try:
# Create Docker configuration
docker_config = DockerConfig(
mode="connect",
persistent=False,
remove_on_exit=True,
user_data_dir=temp_dir
)
# Create browser configuration
browser_config = BrowserConfig(
browser_mode="docker",
headless=True,
docker_config=docker_config
)
# Create browser manager
manager = BrowserManager(browser_config=browser_config, logger=logger)
# Start the browser
await manager.start()
logger.info("Browser started successfully", tag="TEST")
# Create crawler config
crawler_config = CrawlerRunConfig(url="https://example.com")
# Get a page
page, context = await manager.get_page(crawler_config)
logger.info("Got page successfully", tag="TEST")
# Navigate to a website
await page.goto("https://example.com")
logger.info("Navigated to example.com", tag="TEST")
# Get page title
title = await page.title()
logger.info(f"Page title: {title}", tag="TEST")
# Clean up
await manager.close()
logger.info("Browser closed successfully", tag="TEST")
return True
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
# Ensure cleanup
try:
await manager.close()
except:
pass
return False
finally:
# Clean up the temp directory
if os.path.exists(temp_dir):
shutil.rmtree(temp_dir)
async def test_docker_launch_mode():
"""Test Docker browser in launch mode.
This tests launching a Chrome browser within a Docker container
on demand with custom settings.
"""
logger.info("Testing Docker browser in launch mode", tag="TEST")
# Create temp directory for user data
temp_dir = os.path.join(os.path.dirname(__file__), "tmp_user_data_launch")
os.makedirs(temp_dir, exist_ok=True)
try:
# Create Docker configuration
docker_config = DockerConfig(
mode="launch",
persistent=False,
remove_on_exit=True,
user_data_dir=temp_dir
)
# Create browser configuration
browser_config = BrowserConfig(
browser_mode="docker",
headless=True,
text_mode=True, # Enable text mode for faster operation
docker_config=docker_config
)
# Create browser manager
manager = BrowserManager(browser_config=browser_config, logger=logger)
# Start the browser
await manager.start()
logger.info("Browser started successfully", tag="TEST")
# Create crawler config
crawler_config = CrawlerRunConfig(url="https://example.com")
# Get a page
page, context = await manager.get_page(crawler_config)
logger.info("Got page successfully", tag="TEST")
# Navigate to a website
await page.goto("https://example.com")
logger.info("Navigated to example.com", tag="TEST")
# Get page title
title = await page.title()
logger.info(f"Page title: {title}", tag="TEST")
# Clean up
await manager.close()
logger.info("Browser closed successfully", tag="TEST")
return True
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
# Ensure cleanup
try:
await manager.close()
except:
pass
return False
finally:
# Clean up the temp directory
if os.path.exists(temp_dir):
shutil.rmtree(temp_dir)
async def test_docker_persistent_storage():
"""Test Docker browser with persistent storage.
This tests creating localStorage data in one session and verifying
it persists to another session when using persistent storage.
"""
logger.info("Testing Docker browser with persistent storage", tag="TEST")
# Create a unique temp directory
test_id = uuid.uuid4().hex[:8]
temp_dir = os.path.join(os.path.dirname(__file__), f"tmp_user_data_persist_{test_id}")
os.makedirs(temp_dir, exist_ok=True)
manager1 = None
manager2 = None
try:
# Create Docker configuration with persistence
docker_config = DockerConfig(
mode="connect",
persistent=True, # Keep container running between sessions
user_data_dir=temp_dir,
container_user_data_dir="/data"
)
# Create browser configuration
browser_config = BrowserConfig(
browser_mode="docker",
headless=True,
docker_config=docker_config
)
# Create first browser manager
manager1 = BrowserManager(browser_config=browser_config, logger=logger)
# Start the browser
await manager1.start()
logger.info("First browser started successfully", tag="TEST")
# Create crawler config
crawler_config = CrawlerRunConfig()
# Get a page
page1, context1 = await manager1.get_page(crawler_config)
# Navigate to example.com
await page1.goto("https://example.com")
# Set localStorage item
test_value = f"test_value_{test_id}"
await page1.evaluate(f"localStorage.setItem('test_key', '{test_value}')")
logger.info(f"Set localStorage test_key = {test_value}", tag="TEST")
# Close the first browser manager
await manager1.close()
logger.info("First browser closed", tag="TEST")
# Create second browser manager with same config
manager2 = BrowserManager(browser_config=browser_config, logger=logger)
# Start the browser
await manager2.start()
logger.info("Second browser started successfully", tag="TEST")
# Get a page
page2, context2 = await manager2.get_page(crawler_config)
# Navigate to same site
await page2.goto("https://example.com")
# Get localStorage item
value = await page2.evaluate("localStorage.getItem('test_key')")
logger.info(f"Retrieved localStorage test_key = {value}", tag="TEST")
# Check if persistence worked
if value == test_value:
logger.success("Storage persistence verified!", tag="TEST")
else:
logger.error(f"Storage persistence failed! Expected {test_value}, got {value}", tag="TEST")
# Clean up
await manager2.close()
logger.info("Second browser closed successfully", tag="TEST")
return value == test_value
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
# Ensure cleanup
try:
if manager1:
await manager1.close()
if manager2:
await manager2.close()
except:
pass
return False
finally:
# Clean up the temp directory
if os.path.exists(temp_dir):
shutil.rmtree(temp_dir)
async def test_docker_parallel_pages():
"""Test Docker browser with parallel page creation.
This tests the ability to create and use multiple pages in parallel
from a single Docker browser instance.
"""
logger.info("Testing Docker browser with parallel pages", tag="TEST")
try:
# Create Docker configuration
docker_config = DockerConfig(
mode="connect",
persistent=False,
remove_on_exit=True
)
# Create browser configuration
browser_config = BrowserConfig(
browser_mode="docker",
headless=True,
docker_config=docker_config
)
# Create browser manager
manager = BrowserManager(browser_config=browser_config, logger=logger)
# Start the browser
await manager.start()
logger.info("Browser started successfully", tag="TEST")
# Create crawler config
crawler_config = CrawlerRunConfig()
# Get multiple pages
page_count = 3
pages = await manager.get_pages(crawler_config, count=page_count)
logger.info(f"Got {len(pages)} pages successfully", tag="TEST")
if len(pages) != page_count:
logger.error(f"Expected {page_count} pages, got {len(pages)}", tag="TEST")
await manager.close()
return False
# Navigate to different sites with each page
tasks = []
for i, (page, _) in enumerate(pages):
tasks.append(page.goto(f"https://example.com?page={i}"))
# Wait for all navigations to complete
await asyncio.gather(*tasks)
logger.info("All pages navigated successfully", tag="TEST")
# Get titles from all pages
titles = []
for i, (page, _) in enumerate(pages):
title = await page.title()
titles.append(title)
logger.info(f"Page {i+1} title: {title}", tag="TEST")
# Clean up
await manager.close()
logger.info("Browser closed successfully", tag="TEST")
return True
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
# Ensure cleanup
try:
await manager.close()
except:
pass
return False
async def test_docker_registry_reuse():
"""Test Docker container reuse via registry.
This tests that containers with matching configurations
are reused rather than creating new ones.
"""
logger.info("Testing Docker container reuse via registry", tag="TEST")
# Create registry for this test
registry_dir = os.path.join(os.path.dirname(__file__), "registry_reuse_test")
registry_file = os.path.join(registry_dir, "registry.json")
os.makedirs(registry_dir, exist_ok=True)
manager1 = None
manager2 = None
container_id1 = None
try:
# Create identical Docker configurations with custom registry
docker_config1 = DockerConfig(
mode="connect",
persistent=True, # Keep container running after closing
registry_file=registry_file
)
# Create first browser configuration
browser_config1 = BrowserConfig(
browser_mode="docker",
headless=True,
docker_config=docker_config1
)
# Create first browser manager
manager1 = BrowserManager(browser_config=browser_config1, logger=logger)
# Start the first browser
await manager1.start()
logger.info("First browser started successfully", tag="TEST")
# Get container ID from the strategy
docker_strategy1 = manager1._strategy
container_id1 = docker_strategy1.container_id
logger.info(f"First browser container ID: {container_id1[:12]}", tag="TEST")
# Close the first manager but keep container running
await manager1.close()
logger.info("First browser closed", tag="TEST")
# Create second Docker configuration identical to first
docker_config2 = DockerConfig(
mode="connect",
persistent=True,
registry_file=registry_file
)
# Create second browser configuration
browser_config2 = BrowserConfig(
browser_mode="docker",
headless=True,
docker_config=docker_config2
)
# Create second browser manager
manager2 = BrowserManager(browser_config=browser_config2, logger=logger)
# Start the second browser - should reuse existing container
await manager2.start()
logger.info("Second browser started successfully", tag="TEST")
# Get container ID from the second strategy
docker_strategy2 = manager2._strategy
container_id2 = docker_strategy2.container_id
logger.info(f"Second browser container ID: {container_id2[:12]}", tag="TEST")
# Verify container reuse
if container_id1 == container_id2:
logger.success("Container reuse successful - using same container!", tag="TEST")
else:
logger.error("Container reuse failed - new container created!", tag="TEST")
# Clean up
docker_strategy2.docker_config.persistent = False
docker_strategy2.docker_config.remove_on_exit = True
await manager2.close()
logger.info("Second browser closed and container removed", tag="TEST")
return container_id1 == container_id2
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
# Ensure cleanup
try:
if manager1:
await manager1.close()
if manager2:
await manager2.close()
# Make sure container is removed
if container_id1:
await docker_utils.remove_container(container_id1, force=True)
except:
pass
return False
finally:
# Clean up registry directory
if os.path.exists(registry_dir):
shutil.rmtree(registry_dir)
async def run_tests():
"""Run all tests sequentially."""
results = []
logger.info("Starting Docker Browser Strategy tests", tag="TEST")
# Check if Docker is available
if not await check_docker_available():
logger.error("Docker is not available - skipping tests", tag="TEST")
return
# First test Docker components
setup_result = await test_docker_components()
if not setup_result:
logger.error("Docker component tests failed - skipping browser tests", tag="TEST")
return
# Run browser tests
results.append(await test_docker_connect_mode())
results.append(await test_docker_launch_mode())
results.append(await test_docker_persistent_storage())
results.append(await test_docker_parallel_pages())
results.append(await test_docker_registry_reuse())
# Print summary
total = len(results)
passed = sum(1 for r in results if r)
logger.info(f"Tests complete: {passed}/{total} passed", tag="SUMMARY")
if passed == total:
logger.success("All tests passed!", tag="SUMMARY")
else:
logger.error(f"{total - passed} tests failed", tag="SUMMARY")
async def check_docker_available() -> bool:
"""Check if Docker is available on the system.
Returns:
bool: True if Docker is available, False otherwise
"""
try:
proc = await asyncio.create_subprocess_exec(
"docker", "--version",
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE
)
stdout, _ = await proc.communicate()
        return proc.returncode == 0 and bool(stdout)
    except Exception:
        return False
if __name__ == "__main__":
asyncio.run(run_tests())
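Both the component test and the registry-reuse test lean on `generate_config_hash` producing a stable key for a configuration dict, so that logically identical configs map to the same container. A minimal sketch of how such a hash can be computed (an assumed implementation for illustration — the real `DockerUtils.generate_config_hash` may differ):

```python
import hashlib
import json

def generate_config_hash(config: dict) -> str:
    # Serialize with sorted keys so logically equal configs
    # produce byte-identical JSON, hence identical hashes
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

h1 = generate_config_hash({"mode": "connect", "headless": True})
h2 = generate_config_hash({"headless": True, "mode": "connect"})
assert h1 == h2  # key order does not affect the hash
```

The important property is determinism across processes: the registry file persists `(container_id, port, config_hash)` triples, and a later run recomputes the hash from its own config to decide whether an existing container can be reused.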

View File

@@ -0,0 +1,190 @@
"""Test examples for BrowserManager.
These examples demonstrate the functionality of BrowserManager
and serve as functional tests.
"""
import asyncio
import os
import sys
from typing import List
# Add the project root to Python path if running directly
if __name__ == "__main__":
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
from crawl4ai.browser import BrowserManager
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.async_logger import AsyncLogger
# Create a logger for clear terminal output
logger = AsyncLogger(verbose=True, log_file=None)
async def test_basic_browser_manager():
"""Test basic BrowserManager functionality with default configuration."""
logger.info("Starting test_basic_browser_manager", tag="TEST")
try:
# Create a browser manager with default config
manager = BrowserManager(logger=logger)
# Start the browser
await manager.start()
logger.info("Browser started successfully", tag="TEST")
# Get a page
crawler_config = CrawlerRunConfig(url="https://example.com")
page, context = await manager.get_page(crawler_config)
logger.info("Page created successfully", tag="TEST")
# Navigate to a website
await page.goto("https://example.com")
title = await page.title()
logger.info(f"Page title: {title}", tag="TEST")
# Clean up
await manager.close()
logger.success("test_basic_browser_manager completed successfully", tag="TEST")
return True
except Exception as e:
logger.error(f"test_basic_browser_manager failed: {str(e)}", tag="TEST")
return False
async def test_custom_browser_config():
"""Test BrowserManager with custom browser configuration."""
logger.info("Starting test_custom_browser_config", tag="TEST")
try:
# Create a custom browser config
browser_config = BrowserConfig(
browser_type="chromium",
headless=True,
viewport_width=1280,
viewport_height=800,
light_mode=True
)
# Create browser manager with the config
manager = BrowserManager(browser_config=browser_config, logger=logger)
# Start the browser
await manager.start()
logger.info("Browser started successfully with custom config", tag="TEST")
# Get a page
crawler_config = CrawlerRunConfig(url="https://example.com")
page, context = await manager.get_page(crawler_config)
# Navigate to a website
await page.goto("https://example.com")
title = await page.title()
logger.info(f"Page title: {title}", tag="TEST")
# Verify viewport size
viewport_size = await page.evaluate("() => ({ width: window.innerWidth, height: window.innerHeight })")
logger.info(f"Viewport size: {viewport_size}", tag="TEST")
# Clean up
await manager.close()
logger.success("test_custom_browser_config completed successfully", tag="TEST")
return True
except Exception as e:
logger.error(f"test_custom_browser_config failed: {str(e)}", tag="TEST")
return False
async def test_multiple_pages():
"""Test BrowserManager with multiple pages."""
logger.info("Starting test_multiple_pages", tag="TEST")
try:
# Create browser manager
manager = BrowserManager(logger=logger)
# Start the browser
await manager.start()
logger.info("Browser started successfully", tag="TEST")
# Create multiple pages
pages = []
urls = ["https://example.com", "https://example.org", "https://mozilla.org"]
for i, url in enumerate(urls):
crawler_config = CrawlerRunConfig(url=url)
page, context = await manager.get_page(crawler_config)
await page.goto(url)
pages.append((page, url))
logger.info(f"Created page {i+1} for {url}", tag="TEST")
# Verify all pages are loaded correctly
for i, (page, url) in enumerate(pages):
title = await page.title()
logger.info(f"Page {i+1} title: {title}", tag="TEST")
# Clean up
await manager.close()
logger.success("test_multiple_pages completed successfully", tag="TEST")
return True
except Exception as e:
logger.error(f"test_multiple_pages failed: {str(e)}", tag="TEST")
return False
async def test_session_management():
"""Test session management in BrowserManager."""
logger.info("Starting test_session_management", tag="TEST")
try:
# Create browser manager
manager = BrowserManager(logger=logger)
# Start the browser
await manager.start()
logger.info("Browser started successfully", tag="TEST")
# Create a session
session_id = "test_session_1"
crawler_config = CrawlerRunConfig(url="https://example.com", session_id=session_id)
page1, context1 = await manager.get_page(crawler_config)
await page1.goto("https://example.com")
logger.info(f"Created session with ID: {session_id}", tag="TEST")
# Get the same session again
page2, context2 = await manager.get_page(crawler_config)
# Verify it's the same page/context
is_same_page = page1 == page2
is_same_context = context1 == context2
logger.info(f"Same page: {is_same_page}, Same context: {is_same_context}", tag="TEST")
# Kill the session
await manager.kill_session(session_id)
logger.info(f"Killed session with ID: {session_id}", tag="TEST")
# Clean up
await manager.close()
logger.success("test_session_management completed successfully", tag="TEST")
return True
except Exception as e:
logger.error(f"test_session_management failed: {str(e)}", tag="TEST")
return False
async def run_tests():
"""Run all tests sequentially."""
results = []
results.append(await test_basic_browser_manager())
results.append(await test_custom_browser_config())
results.append(await test_multiple_pages())
results.append(await test_session_management())
# Print summary
total = len(results)
passed = sum(results)
logger.info(f"Tests complete: {passed}/{total} passed", tag="SUMMARY")
if passed == total:
logger.success("All tests passed!", tag="SUMMARY")
else:
logger.error(f"{total - passed} tests failed", tag="SUMMARY")
if __name__ == "__main__":
asyncio.run(run_tests())
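The `test_session_management` example above depends on `get_page()` returning the same page and context when called again with the same `session_id`. A minimal standalone sketch of that reuse contract (hypothetical `SessionRegistry`, not the actual crawl4ai implementation):

```python
# Hypothetical sketch of session reuse keyed by session_id.
# Not the crawl4ai implementation; it only illustrates the contract
# that test_session_management() verifies.

class SessionRegistry:
    def __init__(self):
        self._sessions = {}  # session_id -> (page, context)

    def get_page(self, session_id, factory):
        """Return the cached (page, context) for session_id,
        creating it with factory() on first use."""
        if session_id not in self._sessions:
            self._sessions[session_id] = factory()
        return self._sessions[session_id]

    def kill_session(self, session_id):
        """Drop the session so the next get_page() creates a fresh pair."""
        self._sessions.pop(session_id, None)


if __name__ == "__main__":
    registry = SessionRegistry()
    make = lambda: (object(), object())  # stand-ins for (page, context)
    p1 = registry.get_page("test_session_1", make)
    p2 = registry.get_page("test_session_1", make)
    print(p1 is p2)  # the same pair is reused for the same session_id
    registry.kill_session("test_session_1")
    p3 = registry.get_page("test_session_1", make)
    print(p1 is p3)  # a fresh pair after kill_session
```

The equality checks in the test (`page1 == page2`, `context1 == context2`) are exactly this identity property.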


@@ -0,0 +1,808 @@
"""
Test script for builtin browser functionality in the browser module.
This script tests:
1. Creating a builtin browser
2. Getting browser information
3. Killing the browser
4. Restarting the browser
5. Testing operations with different browser strategies
6. Testing edge cases
"""
import asyncio
import os
import sys
import time
from typing import List, Dict, Any
from colorama import Fore, Style, init
# Add the project root to the path for imports
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../..")))
from rich.console import Console
from rich.table import Table
from rich.panel import Panel
from rich.text import Text
from rich.box import Box, SIMPLE
from crawl4ai.browser import BrowserManager
from crawl4ai.browser.strategies import BuiltinBrowserStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.async_logger import AsyncLogger
# Initialize colorama for cross-platform colored terminal output
init()
# Define colors for pretty output
SUCCESS = Fore.GREEN
WARNING = Fore.YELLOW
ERROR = Fore.RED
INFO = Fore.CYAN
RESET = Fore.RESET
# Create logger
logger = AsyncLogger(verbose=True)
async def test_builtin_browser_creation():
"""Test creating a builtin browser using the BrowserManager with BuiltinBrowserStrategy"""
print(f"\n{INFO}========== Testing Builtin Browser Creation =========={RESET}")
# Step 1: Create a BrowserManager with builtin mode
print(f"\n{INFO}1. Creating BrowserManager with builtin mode{RESET}")
browser_config = BrowserConfig(browser_mode="builtin", headless=True, verbose=True)
manager = BrowserManager(browser_config=browser_config, logger=logger)
# Step 2: Check if we have a BuiltinBrowserStrategy
print(f"\n{INFO}2. Checking if we have a BuiltinBrowserStrategy{RESET}")
if isinstance(manager._strategy, BuiltinBrowserStrategy):
print(
f"{SUCCESS}Correct strategy type: {manager._strategy.__class__.__name__}{RESET}"
)
else:
print(
f"{ERROR}Wrong strategy type: {manager._strategy.__class__.__name__}{RESET}"
)
return None, None
# Step 3: Start the manager to launch or connect to builtin browser
print(f"\n{INFO}3. Starting the browser manager{RESET}")
try:
await manager.start()
print(f"{SUCCESS}Browser manager started successfully{RESET}")
except Exception as e:
print(f"{ERROR}Failed to start browser manager: {str(e)}{RESET}")
return None, None
# Step 4: Get browser info from the strategy
print(f"\n{INFO}4. Getting browser information{RESET}")
browser_info = manager._strategy.get_builtin_browser_info()
if browser_info:
print(f"{SUCCESS}Browser info retrieved:{RESET}")
for key, value in browser_info.items():
if key != "config": # Skip the verbose config section
print(f" {key}: {value}")
cdp_url = browser_info.get("cdp_url")
print(f"{SUCCESS}CDP URL: {cdp_url}{RESET}")
else:
print(f"{ERROR}Failed to get browser information{RESET}")
cdp_url = None
# Save manager for later tests
return manager, cdp_url
async def test_page_operations(manager: BrowserManager):
"""Test page operations with the builtin browser"""
print(
f"\n{INFO}========== Testing Page Operations with Builtin Browser =========={RESET}"
)
# Step 1: Get a single page
print(f"\n{INFO}1. Getting a single page{RESET}")
try:
crawler_config = CrawlerRunConfig()
page, context = await manager.get_page(crawler_config)
print(f"{SUCCESS}Got page successfully{RESET}")
# Navigate to a test URL
await page.goto("https://example.com")
title = await page.title()
print(f"{SUCCESS}Page title: {title}{RESET}")
# Close the page
await page.close()
print(f"{SUCCESS}Page closed successfully{RESET}")
except Exception as e:
print(f"{ERROR}Page operation failed: {str(e)}{RESET}")
return False
# Step 2: Get multiple pages
print(f"\n{INFO}2. Getting multiple pages with get_pages(){RESET}")
try:
# Request 3 pages
crawler_config = CrawlerRunConfig()
pages = await manager.get_pages(crawler_config, count=3)
print(f"{SUCCESS}Got {len(pages)} pages{RESET}")
# Test each page
for i, (page, context) in enumerate(pages):
await page.goto(f"https://example.com?test={i}")
title = await page.title()
print(f"{SUCCESS}Page {i + 1} title: {title}{RESET}")
await page.close()
print(f"{SUCCESS}All pages tested and closed successfully{RESET}")
except Exception as e:
print(f"{ERROR}Multiple page operation failed: {str(e)}{RESET}")
return False
return True
async def test_browser_status_management(manager: BrowserManager):
"""Test browser status and management operations"""
print(f"\n{INFO}========== Testing Browser Status and Management =========={RESET}")
# Step 1: Get browser status
print(f"\n{INFO}1. Getting browser status{RESET}")
try:
status = await manager._strategy.get_builtin_browser_status()
print(f"{SUCCESS}Browser status:{RESET}")
print(f" Running: {status['running']}")
print(f" CDP URL: {status['cdp_url']}")
except Exception as e:
print(f"{ERROR}Failed to get browser status: {str(e)}{RESET}")
return False
# Step 2: Test killing the browser
print(f"\n{INFO}2. Testing killing the browser{RESET}")
try:
result = await manager._strategy.kill_builtin_browser()
if result:
print(f"{SUCCESS}Browser killed successfully{RESET}")
else:
print(f"{ERROR}Failed to kill browser{RESET}")
except Exception as e:
print(f"{ERROR}Browser kill operation failed: {str(e)}{RESET}")
return False
# Step 3: Check status after kill
print(f"\n{INFO}3. Checking status after kill{RESET}")
try:
status = await manager._strategy.get_builtin_browser_status()
if not status["running"]:
print(f"{SUCCESS}Browser is correctly reported as not running{RESET}")
else:
print(f"{ERROR}Browser is incorrectly reported as still running{RESET}")
except Exception as e:
print(f"{ERROR}Failed to get browser status: {str(e)}{RESET}")
return False
# Step 4: Launch a new browser
print(f"\n{INFO}4. Launching a new browser{RESET}")
try:
cdp_url = await manager._strategy.launch_builtin_browser(
browser_type="chromium", headless=True
)
if cdp_url:
print(f"{SUCCESS}New browser launched at: {cdp_url}{RESET}")
else:
print(f"{ERROR}Failed to launch new browser{RESET}")
return False
except Exception as e:
print(f"{ERROR}Browser launch failed: {str(e)}{RESET}")
return False
return True
async def test_multiple_managers():
"""Test creating multiple BrowserManagers that use the same builtin browser"""
print(f"\n{INFO}========== Testing Multiple Browser Managers =========={RESET}")
# Step 1: Create first manager
print(f"\n{INFO}1. Creating first browser manager{RESET}")
browser_config1 = BrowserConfig(browser_mode="builtin", headless=True)
manager1 = BrowserManager(browser_config=browser_config1, logger=logger)
# Step 2: Create second manager
print(f"\n{INFO}2. Creating second browser manager{RESET}")
browser_config2 = BrowserConfig(browser_mode="builtin", headless=True)
manager2 = BrowserManager(browser_config=browser_config2, logger=logger)
# Step 3: Start both managers (should connect to the same builtin browser)
print(f"\n{INFO}3. Starting both managers{RESET}")
try:
await manager1.start()
print(f"{SUCCESS}First manager started{RESET}")
await manager2.start()
print(f"{SUCCESS}Second manager started{RESET}")
# Check if they got the same CDP URL
cdp_url1 = manager1._strategy.config.cdp_url
cdp_url2 = manager2._strategy.config.cdp_url
if cdp_url1 == cdp_url2:
print(
f"{SUCCESS}Both managers connected to the same browser: {cdp_url1}{RESET}"
)
else:
print(
f"{WARNING}Managers connected to different browsers: {cdp_url1} and {cdp_url2}{RESET}"
)
except Exception as e:
print(f"{ERROR}Failed to start managers: {str(e)}{RESET}")
return False
# Step 4: Test using both managers
print(f"\n{INFO}4. Testing operations with both managers{RESET}")
try:
# First manager creates a page
page1, ctx1 = await manager1.get_page(CrawlerRunConfig())
await page1.goto("https://example.com")
title1 = await page1.title()
print(f"{SUCCESS}Manager 1 page title: {title1}{RESET}")
# Second manager creates a page
page2, ctx2 = await manager2.get_page(CrawlerRunConfig())
await page2.goto("https://example.org")
title2 = await page2.title()
print(f"{SUCCESS}Manager 2 page title: {title2}{RESET}")
# Clean up
await page1.close()
await page2.close()
except Exception as e:
print(f"{ERROR}Failed to use both managers: {str(e)}{RESET}")
return False
# Step 5: Close both managers
print(f"\n{INFO}5. Closing both managers{RESET}")
try:
await manager1.close()
print(f"{SUCCESS}First manager closed{RESET}")
await manager2.close()
print(f"{SUCCESS}Second manager closed{RESET}")
except Exception as e:
print(f"{ERROR}Failed to close managers: {str(e)}{RESET}")
return False
return True
async def test_edge_cases():
"""Test edge cases like multiple starts, killing browser during operations, etc."""
print(f"\n{INFO}========== Testing Edge Cases =========={RESET}")
# Step 1: Test multiple starts with the same manager
print(f"\n{INFO}1. Testing multiple starts with the same manager{RESET}")
browser_config = BrowserConfig(browser_mode="builtin", headless=True)
manager = BrowserManager(browser_config=browser_config, logger=logger)
try:
await manager.start()
print(f"{SUCCESS}First start successful{RESET}")
# Try to start again
await manager.start()
print(f"{SUCCESS}Second start completed without errors{RESET}")
# Test if it's still functional
page, context = await manager.get_page(CrawlerRunConfig())
await page.goto("https://example.com")
title = await page.title()
print(
f"{SUCCESS}Page operations work after multiple starts. Title: {title}{RESET}"
)
await page.close()
except Exception as e:
print(f"{ERROR}Multiple starts test failed: {str(e)}{RESET}")
return False
finally:
await manager.close()
# Step 2: Test killing the browser while manager is active
print(f"\n{INFO}2. Testing killing the browser while manager is active{RESET}")
manager = BrowserManager(browser_config=browser_config, logger=logger)
try:
await manager.start()
print(f"{SUCCESS}Manager started{RESET}")
# Kill the browser directly
print(f"{INFO}Killing the browser...{RESET}")
await manager._strategy.kill_builtin_browser()
print(f"{SUCCESS}Browser killed{RESET}")
# Try to get a page (should fail or launch a new browser)
try:
page, context = await manager.get_page(CrawlerRunConfig())
print(
f"{WARNING}Page request succeeded despite killed browser (might have auto-restarted){RESET}"
)
title = await page.title()
print(f"{SUCCESS}Got page title: {title}{RESET}")
await page.close()
except Exception as e:
print(
f"{SUCCESS}Page request failed as expected after browser was killed: {str(e)}{RESET}"
)
except Exception as e:
print(f"{ERROR}Kill during operation test failed: {str(e)}{RESET}")
return False
finally:
await manager.close()
return True
async def cleanup_browsers():
"""Clean up any remaining builtin browsers"""
print(f"\n{INFO}========== Cleaning Up Builtin Browsers =========={RESET}")
browser_config = BrowserConfig(browser_mode="builtin", headless=True)
manager = BrowserManager(browser_config=browser_config, logger=logger)
try:
# No need to start, just access the strategy directly
strategy = manager._strategy
if isinstance(strategy, BuiltinBrowserStrategy):
result = await strategy.kill_builtin_browser()
if result:
print(f"{SUCCESS}Successfully killed all builtin browsers{RESET}")
else:
print(f"{WARNING}No builtin browsers found to kill{RESET}")
else:
print(f"{ERROR}Wrong strategy type: {strategy.__class__.__name__}{RESET}")
except Exception as e:
print(f"{ERROR}Cleanup failed: {str(e)}{RESET}")
finally:
# Just to be safe
try:
await manager.close()
except:
pass
async def test_performance_scaling():
"""Test performance with multiple browsers and pages.
This test creates multiple browsers on different ports,
spawns multiple pages per browser, and measures performance metrics.
"""
print(f"\n{INFO}========== Testing Performance Scaling =========={RESET}")
# Configuration parameters
num_browsers = 10
pages_per_browser = 10
total_pages = num_browsers * pages_per_browser
base_port = 9222
# Set up a measuring mechanism for memory
import psutil
import gc
# Force garbage collection before starting
gc.collect()
process = psutil.Process()
initial_memory = process.memory_info().rss / 1024 / 1024 # in MB
peak_memory = initial_memory
# Report initial configuration
print(
f"{INFO}Test configuration: {num_browsers} browsers × {pages_per_browser} pages = {total_pages} total crawls{RESET}"
)
# List to track managers
managers: List[BrowserManager] = []
all_pages = []
# Get crawl4ai home directory
crawl4ai_home = os.path.expanduser("~/.crawl4ai")
temp_dir = os.path.join(crawl4ai_home, "temp")
os.makedirs(temp_dir, exist_ok=True)
# Create all managers but don't start them yet
manager_configs = []
for i in range(num_browsers):
port = base_port + i
browser_config = BrowserConfig(
browser_mode="builtin",
headless=True,
debugging_port=port,
user_data_dir=os.path.join(temp_dir, f"browser_profile_{i}"),
)
manager = BrowserManager(browser_config=browser_config, logger=logger)
manager._strategy.shutting_down = True
manager_configs.append((manager, i, port))
# Define async function to start a single manager
async def start_manager(manager, index, port):
try:
await manager.start()
return manager
except Exception as e:
print(
f"{ERROR}Failed to start browser {index + 1} on port {port}: {str(e)}{RESET}"
)
return None
# Start all managers in parallel
start_tasks = [
start_manager(manager, i, port) for manager, i, port in manager_configs
]
started_managers = await asyncio.gather(*start_tasks)
# Filter out None values (failed starts) and add to managers list
managers = [m for m in started_managers if m is not None]
if len(managers) == 0:
print(f"{ERROR}All browser managers failed to start. Aborting test.{RESET}")
return False
if len(managers) < num_browsers:
print(
f"{WARNING}Only {len(managers)} out of {num_browsers} browser managers started successfully{RESET}"
)
# Create pages for each browser
for i, manager in enumerate(managers):
try:
pages = await manager.get_pages(CrawlerRunConfig(), count=pages_per_browser)
all_pages.extend(pages)
except Exception as e:
print(f"{ERROR}Failed to create pages for browser {i + 1}: {str(e)}{RESET}")
# Check memory after page creation
gc.collect()
current_memory = process.memory_info().rss / 1024 / 1024
peak_memory = max(peak_memory, current_memory)
# Ask for confirmation before loading
confirmation = input(
f"{WARNING}Do you want to proceed with loading pages? (y/n): {RESET}"
)
# Start timing before the load phase (managers were already started above)
start_time = time.time()
if confirmation.lower() == "y":
load_start_time = time.time()
# Function to load a single page
async def load_page(page_ctx, index):
page, _ = page_ctx
try:
await page.goto(f"https://example.com/page{index}", timeout=30000)
title = await page.title()
return title
except Exception as e:
return f"Error: {str(e)}"
# Load all pages concurrently
load_tasks = [load_page(page_ctx, i) for i, page_ctx in enumerate(all_pages)]
load_results = await asyncio.gather(*load_tasks, return_exceptions=True)
# Count successes and failures
successes = sum(
1 for r in load_results if isinstance(r, str) and not r.startswith("Error")
)
failures = len(load_results) - successes
load_time = time.time() - load_start_time
total_test_time = time.time() - start_time
# Check memory after loading (peak memory)
gc.collect()
current_memory = process.memory_info().rss / 1024 / 1024
peak_memory = max(peak_memory, current_memory)
# Calculate key metrics
memory_per_page = peak_memory / successes if successes > 0 else 0
time_per_crawl = total_test_time / successes if successes > 0 else 0
crawls_per_second = successes / total_test_time if total_test_time > 0 else 0
crawls_per_minute = crawls_per_second * 60
crawls_per_hour = crawls_per_minute * 60
# Print simplified performance summary
from rich.console import Console
from rich.table import Table
console = Console()
# Create a simple summary table
table = Table(title="CRAWL4AI PERFORMANCE SUMMARY")
table.add_column("Metric", style="cyan")
table.add_column("Value", style="green")
table.add_row("Total Crawls Completed", f"{successes}")
table.add_row("Total Time", f"{total_test_time:.2f} seconds")
table.add_row("Time Per Crawl", f"{time_per_crawl:.2f} seconds")
table.add_row("Crawling Speed", f"{crawls_per_second:.2f} crawls/second")
table.add_row("Projected Rate (1 minute)", f"{crawls_per_minute:.0f} crawls")
table.add_row("Projected Rate (1 hour)", f"{crawls_per_hour:.0f} crawls")
table.add_row("Peak Memory Usage", f"{peak_memory:.2f} MB")
table.add_row("Memory Per Crawl", f"{memory_per_page:.2f} MB")
# Display the table
console.print(table)
# Ask confirmation before cleanup
confirmation = input(
f"{WARNING}Do you want to proceed with cleanup? (y/n): {RESET}"
)
if confirmation.lower() != "y":
print(f"{WARNING}Cleanup aborted by user{RESET}")
return False
# Close all pages
for page, _ in all_pages:
try:
await page.close()
except:
pass
# Close all managers
for manager in managers:
try:
await manager.close()
except:
pass
# Remove the temp directory
import shutil
if os.path.exists(temp_dir):
shutil.rmtree(temp_dir)
return True
async def test_performance_scaling_lab(num_browsers: int = 10, pages_per_browser: int = 10):
"""Test performance with multiple browsers and pages.
This test creates multiple browsers on different ports,
spawns multiple pages per browser, and measures performance metrics.
"""
print(f"\n{INFO}========== Testing Performance Scaling =========={RESET}")
# Configuration parameters come from the function arguments
total_pages = num_browsers * pages_per_browser
base_port = 9222
# Set up a measuring mechanism for memory
import psutil
import gc
# Force garbage collection before starting
gc.collect()
process = psutil.Process()
initial_memory = process.memory_info().rss / 1024 / 1024 # in MB
peak_memory = initial_memory
# Report initial configuration
print(
f"{INFO}Test configuration: {num_browsers} browsers × {pages_per_browser} pages = {total_pages} total crawls{RESET}"
)
# List to track managers
managers: List[BrowserManager] = []
all_pages = []
# Get crawl4ai home directory
crawl4ai_home = os.path.expanduser("~/.crawl4ai")
temp_dir = os.path.join(crawl4ai_home, "temp")
os.makedirs(temp_dir, exist_ok=True)
# Create all managers but don't start them yet
manager_configs = []
for i in range(num_browsers):
port = base_port + i
browser_config = BrowserConfig(
browser_mode="builtin",
headless=True,
debugging_port=port,
user_data_dir=os.path.join(temp_dir, f"browser_profile_{i}"),
)
manager = BrowserManager(browser_config=browser_config, logger=logger)
manager._strategy.shutting_down = True
manager_configs.append((manager, i, port))
# Define async function to start a single manager
async def start_manager(manager, index, port):
try:
await manager.start()
return manager
except Exception as e:
print(
f"{ERROR}Failed to start browser {index + 1} on port {port}: {str(e)}{RESET}"
)
return None
# Start all managers in parallel
start_tasks = [
start_manager(manager, i, port) for manager, i, port in manager_configs
]
started_managers = await asyncio.gather(*start_tasks)
# Filter out None values (failed starts) and add to managers list
managers = [m for m in started_managers if m is not None]
if len(managers) == 0:
print(f"{ERROR}All browser managers failed to start. Aborting test.{RESET}")
return False
if len(managers) < num_browsers:
print(
f"{WARNING}Only {len(managers)} out of {num_browsers} browser managers started successfully{RESET}"
)
# Create pages for each browser
for i, manager in enumerate(managers):
try:
pages = await manager.get_pages(CrawlerRunConfig(), count=pages_per_browser)
all_pages.extend(pages)
except Exception as e:
print(f"{ERROR}Failed to create pages for browser {i + 1}: {str(e)}{RESET}")
# Check memory after page creation
gc.collect()
current_memory = process.memory_info().rss / 1024 / 1024
peak_memory = max(peak_memory, current_memory)
# Ask for confirmation before loading
confirmation = input(
f"{WARNING}Do you want to proceed with loading pages? (y/n): {RESET}"
)
# Start timing before the load phase (managers were already started above)
start_time = time.time()
if confirmation.lower() == "y":
load_start_time = time.time()
# Function to load a single page
async def load_page(page_ctx, index):
page, _ = page_ctx
try:
await page.goto(f"https://example.com/page{index}", timeout=30000)
title = await page.title()
return title
except Exception as e:
return f"Error: {str(e)}"
# Load all pages concurrently
load_tasks = [load_page(page_ctx, i) for i, page_ctx in enumerate(all_pages)]
load_results = await asyncio.gather(*load_tasks, return_exceptions=True)
# Count successes and failures
successes = sum(
1 for r in load_results if isinstance(r, str) and not r.startswith("Error")
)
failures = len(load_results) - successes
load_time = time.time() - load_start_time
total_test_time = time.time() - start_time
# Check memory after loading (peak memory)
gc.collect()
current_memory = process.memory_info().rss / 1024 / 1024
peak_memory = max(peak_memory, current_memory)
# Calculate key metrics
memory_per_page = peak_memory / successes if successes > 0 else 0
time_per_crawl = total_test_time / successes if successes > 0 else 0
crawls_per_second = successes / total_test_time if total_test_time > 0 else 0
crawls_per_minute = crawls_per_second * 60
crawls_per_hour = crawls_per_minute * 60
# Print simplified performance summary
from rich.console import Console
from rich.table import Table
console = Console()
# Create a simple summary table
table = Table(title="CRAWL4AI PERFORMANCE SUMMARY")
table.add_column("Metric", style="cyan")
table.add_column("Value", style="green")
table.add_row("Total Crawls Completed", f"{successes}")
table.add_row("Total Time", f"{total_test_time:.2f} seconds")
table.add_row("Time Per Crawl", f"{time_per_crawl:.2f} seconds")
table.add_row("Crawling Speed", f"{crawls_per_second:.2f} crawls/second")
table.add_row("Projected Rate (1 minute)", f"{crawls_per_minute:.0f} crawls")
table.add_row("Projected Rate (1 hour)", f"{crawls_per_hour:.0f} crawls")
table.add_row("Peak Memory Usage", f"{peak_memory:.2f} MB")
table.add_row("Memory Per Crawl", f"{memory_per_page:.2f} MB")
# Display the table
console.print(table)
# Ask confirmation before cleanup
confirmation = input(
f"{WARNING}Do you want to proceed with cleanup? (y/n): {RESET}"
)
if confirmation.lower() != "y":
print(f"{WARNING}Cleanup aborted by user{RESET}")
return False
# Close all pages
for page, _ in all_pages:
try:
await page.close()
except:
pass
# Close all managers
for manager in managers:
try:
await manager.close()
except:
pass
# Remove the temp directory
import shutil
if os.path.exists(temp_dir):
shutil.rmtree(temp_dir)
return True
async def main():
"""Run all tests"""
try:
print(f"{INFO}Starting builtin browser tests with browser module{RESET}")
# # Run browser creation test
# manager, cdp_url = await test_builtin_browser_creation()
# if not manager:
# print(f"{ERROR}Browser creation failed, cannot continue tests{RESET}")
# return
# # Run page operations test
# await test_page_operations(manager)
# # Run browser status and management test
# await test_browser_status_management(manager)
# # Close manager before multiple manager test
# await manager.close()
# Run multiple managers test
# await test_multiple_managers()
# Run performance scaling test
await test_performance_scaling()
# Run cleanup test
# await cleanup_browsers()
# Run edge cases test
# await test_edge_cases()
print(f"\n{SUCCESS}All tests completed!{RESET}")
except Exception as e:
print(f"\n{ERROR}Test failed with error: {str(e)}{RESET}")
import traceback
traceback.print_exc()
finally:
# Clean up: kill any remaining builtin browsers
await cleanup_browsers()
print(f"{SUCCESS}Test cleanup complete{RESET}")
if __name__ == "__main__":
asyncio.run(main())
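The performance summary above is built from a handful of ratios. A standalone sketch of those metric calculations, using the same formulas as `test_performance_scaling()` with made-up inputs:

```python
# Standalone sketch of the throughput/memory metrics reported by
# test_performance_scaling(). The inputs below are made up for illustration.

def crawl_metrics(successes, total_time_s, peak_memory_mb):
    """Return the derived metrics shown in the summary table."""
    time_per_crawl = total_time_s / successes if successes else 0.0
    crawls_per_second = successes / total_time_s if total_time_s else 0.0
    return {
        "time_per_crawl": time_per_crawl,
        "crawls_per_second": crawls_per_second,
        "crawls_per_minute": crawls_per_second * 60,
        "crawls_per_hour": crawls_per_second * 3600,
        "memory_per_crawl_mb": peak_memory_mb / successes if successes else 0.0,
    }


if __name__ == "__main__":
    m = crawl_metrics(successes=100, total_time_s=50.0, peak_memory_mb=2048.0)
    print(f"{m['crawls_per_second']:.2f} crawls/second")   # 2.00 crawls/second
    print(f"{m['crawls_per_minute']:.0f} crawls/minute")   # 120 crawls/minute
    print(f"{m['memory_per_crawl_mb']:.2f} MB per crawl")  # 20.48 MB per crawl
```

Note that "memory per crawl" divides the process's peak RSS (which includes the baseline interpreter footprint) by the success count, so it is an upper bound rather than true marginal memory.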


@@ -0,0 +1,160 @@
"""Test examples for BuiltinBrowserStrategy.
These examples demonstrate the functionality of BuiltinBrowserStrategy
and serve as functional tests.
"""
import asyncio
import os
import sys
# Add the project root to Python path if running directly
if __name__ == "__main__":
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
from crawl4ai.browser import BrowserManager
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.async_logger import AsyncLogger
# Create a logger for clear terminal output
logger = AsyncLogger(verbose=True, log_file=None)
async def test_builtin_browser():
"""Test using a builtin browser that persists between sessions."""
logger.info("Testing builtin browser", tag="TEST")
browser_config = BrowserConfig(
browser_mode="builtin",
headless=True
)
manager = BrowserManager(browser_config=browser_config, logger=logger)
try:
# Start should connect to existing builtin browser or create one
await manager.start()
logger.info("Connected to builtin browser", tag="TEST")
# Test page creation
crawler_config = CrawlerRunConfig()
page, context = await manager.get_page(crawler_config)
# Test navigation
await page.goto("https://example.com")
title = await page.title()
logger.info(f"Page title: {title}", tag="TEST")
# Close manager (should not close the builtin browser)
await manager.close()
logger.info("First session closed", tag="TEST")
# Create a second manager to verify browser persistence
logger.info("Creating second session to verify persistence", tag="TEST")
manager2 = BrowserManager(browser_config=browser_config, logger=logger)
await manager2.start()
logger.info("Connected to existing builtin browser", tag="TEST")
page2, context2 = await manager2.get_page(crawler_config)
await page2.goto("https://example.org")
title2 = await page2.title()
logger.info(f"Second session page title: {title2}", tag="TEST")
await manager2.close()
logger.info("Second session closed successfully", tag="TEST")
return True
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
try:
await manager.close()
except:
pass
return False
async def test_builtin_browser_status():
"""Test getting status of the builtin browser."""
logger.info("Testing builtin browser status", tag="TEST")
from crawl4ai.browser.strategies import BuiltinBrowserStrategy
browser_config = BrowserConfig(
browser_mode="builtin",
headless=True
)
# Create strategy directly to access its status methods
strategy = BuiltinBrowserStrategy(browser_config, logger)
try:
# Get status before starting (should be not running)
status_before = await strategy.get_builtin_browser_status()
logger.info(f"Initial status: {status_before}", tag="TEST")
# Start the browser
await strategy.start()
logger.info("Browser started successfully", tag="TEST")
# Get status after starting
status_after = await strategy.get_builtin_browser_status()
logger.info(f"Status after start: {status_after}", tag="TEST")
# Create a page to verify functionality
crawler_config = CrawlerRunConfig()
page, context = await strategy.get_page(crawler_config)
await page.goto("https://example.com")
title = await page.title()
logger.info(f"Page title: {title}", tag="TEST")
# Close strategy (should not kill the builtin browser)
await strategy.close()
logger.info("Strategy closed successfully", tag="TEST")
# Create a new strategy object
strategy2 = BuiltinBrowserStrategy(browser_config, logger)
# Get status again (should still be running)
status_final = await strategy2.get_builtin_browser_status()
logger.info(f"Final status: {status_final}", tag="TEST")
# Verify that the status shows the browser is running
is_running = status_final.get('running', False)
logger.info(f"Builtin browser persistence confirmed: {is_running}", tag="TEST")
# Kill the builtin browser to clean up
logger.info("Killing builtin browser", tag="TEST")
success = await strategy2.kill_builtin_browser()
logger.info(f"Killed builtin browser successfully: {success}", tag="TEST")
return is_running and success
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
try:
await strategy.close()
# Try to kill the builtin browser to clean up
strategy2 = BuiltinBrowserStrategy(browser_config, logger)
await strategy2.kill_builtin_browser()
except:
pass
return False
async def run_tests():
"""Run all tests sequentially."""
results = []
results.append(await test_builtin_browser())
results.append(await test_builtin_browser_status())
# Print summary
total = len(results)
passed = sum(results)
logger.info(f"Tests complete: {passed}/{total} passed", tag="SUMMARY")
if passed == total:
logger.success("All tests passed!", tag="SUMMARY")
else:
logger.error(f"{total - passed} tests failed", tag="SUMMARY")
if __name__ == "__main__":
asyncio.run(run_tests())
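The status checks above treat the builtin-browser status as a plain dict with at least `running` and `cdp_url` keys. A small helper sketch for summarizing such a dict (hypothetical helper; the real status object may carry additional fields):

```python
# Hypothetical helper for the status dicts used above; it assumes only the
# 'running' and 'cdp_url' keys that the tests actually read.

def summarize_status(status):
    """Render a one-line summary of a builtin-browser status dict."""
    running = bool(status.get("running", False))
    cdp_url = status.get("cdp_url") or "n/a"
    state = "running" if running else "not running"
    return f"builtin browser {state} (CDP: {cdp_url})"


if __name__ == "__main__":
    print(summarize_status({"running": True, "cdp_url": "http://localhost:9222"}))
    print(summarize_status({"running": False}))
```

Defaulting the missing keys keeps the helper safe to call both before the browser is launched and after `kill_builtin_browser()`.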


@@ -0,0 +1,227 @@
"""Test examples for CDPBrowserStrategy.
These examples demonstrate the functionality of CDPBrowserStrategy
and serve as functional tests.
"""
import asyncio
import os
import sys
# Add the project root to Python path if running directly
if __name__ == "__main__":
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
from crawl4ai.browser import BrowserManager
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.async_logger import AsyncLogger
# Create a logger for clear terminal output
logger = AsyncLogger(verbose=True, log_file=None)
async def test_cdp_launch_connect():
"""Test launching a browser and connecting via CDP."""
logger.info("Testing launch and connect via CDP", tag="TEST")
browser_config = BrowserConfig(
use_managed_browser=True,
headless=True
)
manager = BrowserManager(browser_config=browser_config, logger=logger)
try:
await manager.start()
logger.info("Browser launched and connected via CDP", tag="TEST")
# Test with multiple pages
pages = []
for i in range(3):
crawler_config = CrawlerRunConfig()
page, context = await manager.get_page(crawler_config)
await page.goto(f"https://example.com?test={i}")
pages.append(page)
logger.info(f"Created page {i+1}", tag="TEST")
# Verify all pages are working
for i, page in enumerate(pages):
title = await page.title()
logger.info(f"Page {i+1} title: {title}", tag="TEST")
await manager.close()
logger.info("Browser closed successfully", tag="TEST")
return True
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
try:
await manager.close()
except:
pass
return False
async def test_cdp_with_user_data_dir():
"""Test CDP browser with a user data directory."""
logger.info("Testing CDP browser with user data directory", tag="TEST")
# Create a temporary user data directory
import tempfile
user_data_dir = tempfile.mkdtemp(prefix="crawl4ai-test-")
logger.info(f"Created temporary user data directory: {user_data_dir}", tag="TEST")
browser_config = BrowserConfig(
use_managed_browser=True,
headless=True,
user_data_dir=user_data_dir
)
manager = BrowserManager(browser_config=browser_config, logger=logger)
try:
await manager.start()
logger.info("Browser launched with user data directory", tag="TEST")
# Navigate to a page and store some data
crawler_config = CrawlerRunConfig()
page, context = await manager.get_page(crawler_config)
# Set a cookie
await context.add_cookies([{
"name": "test_cookie",
"value": "test_value",
"url": "https://example.com"
}])
# Visit the site
await page.goto("https://example.com")
# Verify cookie was set
cookies = await context.cookies(["https://example.com"])
has_test_cookie = any(cookie["name"] == "test_cookie" for cookie in cookies)
logger.info(f"Cookie set successfully: {has_test_cookie}", tag="TEST")
# Close the browser
await manager.close()
logger.info("First browser session closed", tag="TEST")
# Start a new browser with the same user data directory
logger.info("Starting second browser session with same user data directory", tag="TEST")
manager2 = BrowserManager(browser_config=browser_config, logger=logger)
await manager2.start()
# Get a new page and check if the cookie persists
page2, context2 = await manager2.get_page(crawler_config)
await page2.goto("https://example.com")
# Verify cookie persisted
cookies2 = await context2.cookies(["https://example.com"])
has_test_cookie2 = any(cookie["name"] == "test_cookie" for cookie in cookies2)
logger.info(f"Cookie persisted across sessions: {has_test_cookie2}", tag="TEST")
# Clean up
await manager2.close()
# Remove temporary directory
import shutil
shutil.rmtree(user_data_dir, ignore_errors=True)
logger.info(f"Removed temporary user data directory: {user_data_dir}", tag="TEST")
return has_test_cookie and has_test_cookie2
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
try:
await manager.close()
except Exception:
pass
# Clean up temporary directory
try:
import shutil
shutil.rmtree(user_data_dir, ignore_errors=True)
except Exception:
pass
return False
async def test_cdp_session_management():
"""Test session management with CDP browser."""
logger.info("Testing session management with CDP browser", tag="TEST")
browser_config = BrowserConfig(
use_managed_browser=True,
headless=True
)
manager = BrowserManager(browser_config=browser_config, logger=logger)
try:
await manager.start()
logger.info("Browser launched successfully", tag="TEST")
# Create two sessions
session1_id = "test_session_1"
session2_id = "test_session_2"
# Set up first session
crawler_config1 = CrawlerRunConfig(session_id=session1_id)
page1, context1 = await manager.get_page(crawler_config1)
await page1.goto("https://example.com")
await page1.evaluate("localStorage.setItem('session1_data', 'test_value')")
logger.info(f"Set up session 1 with ID: {session1_id}", tag="TEST")
# Set up second session
crawler_config2 = CrawlerRunConfig(session_id=session2_id)
page2, context2 = await manager.get_page(crawler_config2)
await page2.goto("https://example.org")
await page2.evaluate("localStorage.setItem('session2_data', 'test_value2')")
logger.info(f"Set up session 2 with ID: {session2_id}", tag="TEST")
# Get first session again
page1_again, _ = await manager.get_page(crawler_config1)
# Verify it's the same page and data persists
is_same_page = page1 == page1_again
data1 = await page1_again.evaluate("localStorage.getItem('session1_data')")
logger.info(f"Session 1 reuse successful: {is_same_page}, data: {data1}", tag="TEST")
# Kill first session
await manager.kill_session(session1_id)
logger.info(f"Killed session with ID: {session1_id}", tag="TEST")
# Verify second session still works
data2 = await page2.evaluate("localStorage.getItem('session2_data')")
logger.info(f"Session 2 still functional after killing session 1, data: {data2}", tag="TEST")
# Clean up
await manager.close()
logger.info("Browser closed successfully", tag="TEST")
return is_same_page and data1 == "test_value" and data2 == "test_value2"
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
try:
await manager.close()
except Exception:
pass
return False
async def run_tests():
"""Run all tests sequentially."""
results = []
# results.append(await test_cdp_launch_connect())
# results.append(await test_cdp_with_user_data_dir())
results.append(await test_cdp_session_management())
# Print summary
total = len(results)
passed = sum(results)
logger.info(f"Tests complete: {passed}/{total} passed", tag="SUMMARY")
if passed == total:
logger.success("All tests passed!", tag="SUMMARY")
else:
logger.error(f"{total - passed} tests failed", tag="SUMMARY")
if __name__ == "__main__":
asyncio.run(run_tests())
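The session-management behavior this test exercises (same `session_id` returns the same page; killing a session leaves other sessions untouched) can be sketched with a plain dictionary-backed registry. This is an illustrative stand-in, not the crawl4ai implementation; `SessionRegistry` and its methods are hypothetical names.

```python
# Minimal sketch of session reuse semantics: a session_id maps to one
# long-lived page object; killing the session discards that mapping.
class SessionRegistry:
    def __init__(self):
        self._sessions = {}

    def get_page(self, session_id=None):
        # No session_id: return a fresh, anonymous page each time.
        if session_id is None:
            return object()
        # Reuse the existing page for a known session_id.
        if session_id not in self._sessions:
            self._sessions[session_id] = object()
        return self._sessions[session_id]

    def kill_session(self, session_id):
        self._sessions.pop(session_id, None)

registry = SessionRegistry()
p1 = registry.get_page("test_session_1")
p2 = registry.get_page("test_session_2")
assert registry.get_page("test_session_1") is p1   # reused, not recreated
registry.kill_session("test_session_1")
assert registry.get_page("test_session_1") is not p1  # recreated after kill
assert registry.get_page("test_session_2") is p2   # other session unaffected
```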


@@ -0,0 +1,77 @@
"""Combined test runner for all browser module tests.
This script runs all the browser module tests in sequence and
provides a comprehensive summary.
"""
import asyncio
import os
import sys
import time
# Add the project root to Python path if running directly
if __name__ == "__main__":
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
from crawl4ai.async_logger import AsyncLogger
# Create a logger for clear terminal output
logger = AsyncLogger(verbose=True, log_file=None)
async def run_test_module(module_name, header):
"""Run all tests in a module and return results."""
logger.info(f"\n{'-'*30}", tag="TEST")
logger.info(f"RUNNING: {header}", tag="TEST")
logger.info(f"{'-'*30}", tag="TEST")
# Import the module dynamically
module = __import__(f"tests.browser.{module_name}", fromlist=["run_tests"])
# Track time for performance measurement
start_time = time.time()
# Run the tests
await module.run_tests()
# Calculate time taken
time_taken = time.time() - start_time
logger.info(f"Time taken: {time_taken:.2f} seconds", tag="TIMING")
return time_taken
async def main():
"""Run all test modules."""
logger.info("STARTING COMPREHENSIVE BROWSER MODULE TESTS", tag="MAIN")
# List of test modules to run
test_modules = [
("test_browser_manager", "Browser Manager Tests"),
("test_playwright_strategy", "Playwright Strategy Tests"),
("test_cdp_strategy", "CDP Strategy Tests"),
("test_builtin_strategy", "Builtin Browser Strategy Tests"),
("test_profiles", "Profile Management Tests")
]
# Run each test module
timings = {}
for module_name, header in test_modules:
try:
time_taken = await run_test_module(module_name, header)
timings[module_name] = time_taken
except Exception as e:
logger.error(f"Error running {module_name}: {str(e)}", tag="ERROR")
# Print summary
logger.info("\n\nTEST SUMMARY:", tag="SUMMARY")
logger.info(f"{'-'*50}", tag="SUMMARY")
for module_name, header in test_modules:
if module_name in timings:
logger.info(f"{header}: {timings[module_name]:.2f} seconds", tag="SUMMARY")
else:
logger.error(f"{header}: FAILED TO RUN", tag="SUMMARY")
logger.info(f"{'-'*50}", tag="SUMMARY")
total_time = sum(timings.values())
logger.info(f"Total time: {total_time:.2f} seconds", tag="SUMMARY")
if __name__ == "__main__":
asyncio.run(main())
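A note on the dynamic import in `run_test_module`: `importlib.import_module` is the clearer modern equivalent of `__import__(..., fromlist=[...])`. A small sketch, using the stdlib `json` module as a stand-in for a `tests.browser.<name>` module:

```python
# Equivalent of: module = __import__("tests.browser.x", fromlist=["run_tests"])
import importlib

module = importlib.import_module("json")   # stand-in for "tests.browser.<name>"
entry = getattr(module, "loads")           # stand-in for module.run_tests
assert entry('{"ok": true}') == {"ok": True}
```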


@@ -0,0 +1,902 @@
"""
Test examples for parallel crawling with the browser module.
These examples demonstrate the functionality of parallel page creation
and serve as functional tests for multi-page crawling performance.
"""
import asyncio
import os
import sys
import time
from typing import List
# Add the project root to Python path if running directly
if __name__ == "__main__":
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
from crawl4ai.browser import BrowserManager
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.async_logger import AsyncLogger
# Create a logger for clear terminal output
logger = AsyncLogger(verbose=True, log_file=None)
async def test_get_pages_basic():
"""Test basic functionality of get_pages method."""
logger.info("Testing basic get_pages functionality", tag="TEST")
browser_config = BrowserConfig(headless=True)
manager = BrowserManager(browser_config=browser_config, logger=logger)
try:
await manager.start()
# Request 3 pages
crawler_config = CrawlerRunConfig()
pages = await manager.get_pages(crawler_config, count=3)
# Verify we got the correct number of pages
assert len(pages) == 3, f"Expected 3 pages, got {len(pages)}"
# Verify each page is valid
for i, (page, context) in enumerate(pages):
await page.goto("https://example.com")
title = await page.title()
logger.info(f"Page {i+1} title: {title}", tag="TEST")
assert title, f"Page {i+1} has no title"
await manager.close()
logger.success("Basic get_pages test completed successfully", tag="TEST")
return True
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
try:
await manager.close()
except Exception:
pass
return False
async def test_parallel_approaches_comparison():
"""Compare two parallel crawling approaches:
1. Create a page for each URL on-demand (get_page + gather)
2. Get all pages upfront with get_pages, then use them (get_pages + gather)
"""
logger.info("Comparing different parallel crawling approaches", tag="TEST")
urls = [
"https://example.com/page1",
"https://crawl4ai.com",
"https://kidocode.com",
"https://bbc.com",
# "https://example.com/page1",
# "https://example.com/page2",
# "https://example.com/page3",
# "https://example.com/page4",
]
browser_config = BrowserConfig(headless=False)
manager = BrowserManager(browser_config=browser_config, logger=logger)
try:
await manager.start()
# Approach 1: Create a page for each URL on-demand and run in parallel
logger.info("Testing approach 1: get_page for each URL + gather", tag="TEST")
start_time = time.time()
async def fetch_title_approach1(url):
"""Create a new page for each URL, go to the URL, and get title"""
crawler_config = CrawlerRunConfig(url=url)
page, context = await manager.get_page(crawler_config)
try:
await page.goto(url)
title = await page.title()
return title
finally:
await page.close()
# Run fetch_title_approach1 for each URL in parallel
tasks = [fetch_title_approach1(url) for url in urls]
approach1_results = await asyncio.gather(*tasks)
approach1_time = time.time() - start_time
logger.info(f"Approach 1 time (get_page + gather): {approach1_time:.2f}s", tag="TEST")
# Approach 2: Get all pages upfront with get_pages, then use them in parallel
logger.info("Testing approach 2: get_pages upfront + gather", tag="TEST")
start_time = time.time()
# Get all pages upfront
crawler_config = CrawlerRunConfig()
pages = await manager.get_pages(crawler_config, count=len(urls))
async def fetch_title_approach2(page_ctx, url):
"""Use a pre-created page to go to URL and get title"""
page, _ = page_ctx
try:
await page.goto(url)
title = await page.title()
return title
finally:
await page.close()
# Use the pre-created pages to fetch titles in parallel
tasks = [fetch_title_approach2(page_ctx, url) for page_ctx, url in zip(pages, urls)]
approach2_results = await asyncio.gather(*tasks)
approach2_time = time.time() - start_time
logger.info(f"Approach 2 time (get_pages + gather): {approach2_time:.2f}s", tag="TEST")
# Compare results and performance
speedup = approach1_time / approach2_time if approach2_time > 0 else 0
if speedup > 1:
logger.success(f"Approach 2 (get_pages upfront) was {speedup:.2f}x faster", tag="TEST")
elif speedup > 0:
logger.info(f"Approach 1 (get_page + gather) was {1/speedup:.2f}x faster", tag="TEST")
else:
logger.info("Timings too small to compare approaches", tag="TEST")
# Verify same content was retrieved in both approaches
assert len(approach1_results) == len(approach2_results), "Result count mismatch"
# Sort results for comparison since parallel execution might complete in different order
assert sorted(approach1_results) == sorted(approach2_results), "Results content mismatch"
await manager.close()
return True
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
try:
await manager.close()
except Exception:
pass
return False
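The speedup comparison above needs a guard: dividing by `speedup` when it is zero raises `ZeroDivisionError`. A self-contained sketch of a safe comparison (the `compare` helper is illustrative, not part of the crawl4ai API):

```python
# Guarded speedup comparison: never divides by a zero or near-zero timing.
def compare(t1, t2):
    """Return a human-readable comparison of two approach timings."""
    if t2 > 0 and t1 >= t2:
        return f"approach 2 was {t1 / t2:.2f}x faster"
    if t1 > 0:
        return f"approach 1 was {t2 / t1:.2f}x faster"
    return "timings too small to compare"

assert compare(4.0, 2.0) == "approach 2 was 2.00x faster"
assert compare(2.0, 4.0) == "approach 1 was 2.00x faster"
assert compare(0.0, 0.0) == "timings too small to compare"
```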
async def test_multi_browser_scaling(num_browsers=3, pages_per_browser=5):
"""Test performance with multiple browsers and pages per browser.
Compares two approaches:
1. On-demand page creation (get_page + gather)
2. Pre-created pages (get_pages + gather)
"""
logger.info(f"Testing multi-browser scaling with {num_browsers} browsers × {pages_per_browser} pages", tag="TEST")
# Generate test URLs
total_pages = num_browsers * pages_per_browser
urls = [f"https://example.com/page_{i}" for i in range(total_pages)]
# Create browser managers
managers = []
try:
# Start all browsers in parallel
start_tasks = []
for i in range(num_browsers):
browser_config = BrowserConfig(
headless=True  # Headless mode keeps the scaling test resource-light
)
manager = BrowserManager(browser_config=browser_config, logger=logger)
start_tasks.append(manager.start())
managers.append(manager)
await asyncio.gather(*start_tasks)
# Distribute URLs among managers
urls_per_manager = {}
for i, manager in enumerate(managers):
start_idx = i * pages_per_browser
end_idx = min(start_idx + pages_per_browser, len(urls))
urls_per_manager[manager] = urls[start_idx:end_idx]
# Approach 1: Create a page for each URL on-demand and run in parallel
logger.info("Testing approach 1: get_page for each URL + gather", tag="TEST")
start_time = time.time()
async def fetch_title_approach1(manager, url):
"""Create a new page for the URL, go to the URL, and get title"""
crawler_config = CrawlerRunConfig(url=url)
page, context = await manager.get_page(crawler_config)
try:
await page.goto(url)
title = await page.title()
return title
finally:
await page.close()
# Run fetch_title_approach1 for each URL in parallel
tasks = []
for manager, manager_urls in urls_per_manager.items():
for url in manager_urls:
tasks.append(fetch_title_approach1(manager, url))
approach1_results = await asyncio.gather(*tasks)
approach1_time = time.time() - start_time
logger.info(f"Approach 1 time (get_page + gather): {approach1_time:.2f}s", tag="TEST")
# Approach 2: Get all pages upfront with get_pages, then use them in parallel
logger.info("Testing approach 2: get_pages upfront + gather", tag="TEST")
start_time = time.time()
# Get all pages upfront for each manager
all_pages = []
for manager, manager_urls in urls_per_manager.items():
crawler_config = CrawlerRunConfig()
pages = await manager.get_pages(crawler_config, count=len(manager_urls))
all_pages.extend(zip(pages, manager_urls))
async def fetch_title_approach2(page_ctx, url):
"""Use a pre-created page to go to URL and get title"""
page, _ = page_ctx
try:
await page.goto(url)
title = await page.title()
return title
finally:
await page.close()
# Use the pre-created pages to fetch titles in parallel
tasks = [fetch_title_approach2(page_ctx, url) for page_ctx, url in all_pages]
approach2_results = await asyncio.gather(*tasks)
approach2_time = time.time() - start_time
logger.info(f"Approach 2 time (get_pages + gather): {approach2_time:.2f}s", tag="TEST")
# Compare results and performance
speedup = approach1_time / approach2_time if approach2_time > 0 else 0
pages_per_second = total_pages / approach2_time
# Show a simple summary
logger.info(f"📊 Summary: {num_browsers} browsers × {pages_per_browser} pages = {total_pages} total crawls", tag="TEST")
logger.info(f"⚡ Performance: {pages_per_second:.1f} pages/second ({pages_per_second*60:.0f} pages/minute)", tag="TEST")
logger.info(f"🚀 Total crawl time: {approach2_time:.2f} seconds", tag="TEST")
if speedup > 1:
logger.success(f"✅ Approach 2 (get_pages upfront) was {speedup:.2f}x faster", tag="TEST")
elif speedup > 0:
logger.info(f"✅ Approach 1 (get_page + gather) was {1/speedup:.2f}x faster", tag="TEST")
else:
logger.info("Timings too small to compare approaches", tag="TEST")
# Close all managers
for manager in managers:
await manager.close()
return True
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
# Clean up
for manager in managers:
try:
await manager.close()
except Exception:
pass
return False
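The URL partitioning used above (each browser takes a contiguous slice of `pages_per_browser` URLs, with the last slice possibly shorter) can be factored into a small helper. `partition_urls` is an illustrative name, not part of the crawl4ai API:

```python
# Contiguous slicing: browser i gets urls[i*k : i*k + k], truncated at the end.
def partition_urls(urls, num_browsers, pages_per_browser):
    buckets = []
    for i in range(num_browsers):
        start = i * pages_per_browser
        end = min(start + pages_per_browser, len(urls))
        buckets.append(urls[start:end])
    return buckets

urls = [f"https://example.com/page_{i}" for i in range(7)]
buckets = partition_urls(urls, 3, 3)
assert [len(b) for b in buckets] == [3, 3, 1]  # last browser gets the remainder
```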
async def grid_search_optimal_configuration(total_urls=50):
"""Perform a grid search to find the optimal balance between number of browsers and pages per browser.
This function tests different combinations of browser count and pages per browser,
while keeping the total number of URLs constant. It measures performance metrics
for each configuration to find the "sweet spot" that provides the best speed
with reasonable memory usage.
Args:
total_urls: Total number of URLs to crawl (default: 50)
"""
logger.info(f"=== GRID SEARCH FOR OPTIMAL CRAWLING CONFIGURATION ({total_urls} URLs) ===", tag="TEST")
# Generate test URLs once
urls = [f"https://example.com/page_{i}" for i in range(total_urls)]
# Define grid search configurations
# A more flexible approach: test every browser count from 1 to min(20, total_urls)
# and distribute pages evenly (some browsers may get one more page than others)
configurations = []
# Maximum number of browsers to test
max_browsers_to_test = min(20, total_urls)
# Try configurations with 1 to max_browsers_to_test browsers
for num_browsers in range(1, max_browsers_to_test + 1):
base_pages_per_browser = total_urls // num_browsers
remainder = total_urls % num_browsers
# Generate exact page distribution array
if remainder > 0:
# First 'remainder' browsers get one more page
page_distribution = [base_pages_per_browser + 1] * remainder + [base_pages_per_browser] * (num_browsers - remainder)
pages_distribution = f"{base_pages_per_browser+1} pages × {remainder} browsers, {base_pages_per_browser} pages × {num_browsers - remainder} browsers"
else:
# All browsers get the same number of pages
page_distribution = [base_pages_per_browser] * num_browsers
pages_distribution = f"{base_pages_per_browser} pages × {num_browsers} browsers"
# Format the distribution as a tuple string like (4, 4, 3, 3)
distribution_str = str(tuple(page_distribution))
configurations.append((num_browsers, base_pages_per_browser, pages_distribution, page_distribution, distribution_str))
# Track results
results = []
# Test each configuration
for num_browsers, pages_per_browser, pages_distribution, page_distribution, distribution_str in configurations:
logger.info("-" * 80, tag="TEST")
logger.info(f"Testing configuration: {num_browsers} browsers with distribution: {distribution_str}", tag="TEST")
logger.info(f"Details: {pages_distribution}", tag="TEST")
# Sleep a bit for randomness
await asyncio.sleep(0.5)
try:
# Import psutil for memory tracking
try:
import psutil
process = psutil.Process()
initial_memory = process.memory_info().rss / (1024 * 1024) # MB
except ImportError:
logger.warning("psutil not available, memory metrics will not be tracked", tag="TEST")
initial_memory = 0
# Create and start browser managers
managers = []
start_time = time.time()
# Start all browsers in parallel
start_tasks = []
for i in range(num_browsers):
browser_config = BrowserConfig(
headless=True
)
manager = BrowserManager(browser_config=browser_config, logger=logger)
start_tasks.append(manager.start())
managers.append(manager)
await asyncio.gather(*start_tasks)
browser_startup_time = time.time() - start_time
# Measure memory after browser startup
if initial_memory > 0:
browser_memory = process.memory_info().rss / (1024 * 1024) - initial_memory
else:
browser_memory = 0
# Distribute URLs among managers using the exact page distribution
urls_per_manager = {}
total_assigned = 0
for i, manager in enumerate(managers):
if i < len(page_distribution):
# Get the exact number of pages for this browser from our distribution
manager_pages = page_distribution[i]
# Get the URL slice for this manager
start_idx = total_assigned
end_idx = start_idx + manager_pages
urls_per_manager[manager] = urls[start_idx:end_idx]
total_assigned += manager_pages
else:
# If we have more managers than our distribution (should never happen)
urls_per_manager[manager] = []
# Use the more efficient approach (pre-created pages)
logger.info("Running page crawling test...", tag="TEST")
crawl_start_time = time.time()
# Get all pages upfront for each manager
all_pages = []
for manager, manager_urls in urls_per_manager.items():
if not manager_urls: # Skip managers with no URLs
continue
crawler_config = CrawlerRunConfig()
pages = await manager.get_pages(crawler_config, count=len(manager_urls))
all_pages.extend(zip(pages, manager_urls))
# Measure memory after page creation
if initial_memory > 0:
pages_memory = process.memory_info().rss / (1024 * 1024) - browser_memory - initial_memory
else:
pages_memory = 0
# Function to crawl a URL with a pre-created page
async def fetch_title(page_ctx, url):
page, _ = page_ctx
try:
await page.goto(url)
title = await page.title()
return title
finally:
await page.close()
# Use the pre-created pages to fetch titles in parallel
tasks = [fetch_title(page_ctx, url) for page_ctx, url in all_pages]
crawl_results = await asyncio.gather(*tasks)
crawl_time = time.time() - crawl_start_time
total_time = time.time() - start_time
# Final memory measurement
if initial_memory > 0:
peak_memory = max(browser_memory + pages_memory, process.memory_info().rss / (1024 * 1024) - initial_memory)
else:
peak_memory = 0
# Close all managers
for manager in managers:
await manager.close()
# Calculate metrics
pages_per_second = total_urls / crawl_time
# Store result metrics
result = {
"num_browsers": num_browsers,
"pages_per_browser": pages_per_browser,
"page_distribution": page_distribution,
"distribution_str": distribution_str,
"total_urls": total_urls,
"browser_startup_time": browser_startup_time,
"crawl_time": crawl_time,
"total_time": total_time,
"browser_memory": browser_memory,
"pages_memory": pages_memory,
"peak_memory": peak_memory,
"pages_per_second": pages_per_second,
# Calculate efficiency score (higher is better)
# This balances speed vs memory usage
"efficiency_score": pages_per_second / (peak_memory + 1) if peak_memory > 0 else pages_per_second,
}
results.append(result)
# Log the results
logger.info(f"Browser startup: {browser_startup_time:.2f}s", tag="TEST")
logger.info(f"Crawl time: {crawl_time:.2f}s", tag="TEST")
logger.info(f"Total time: {total_time:.2f}s", tag="TEST")
logger.info(f"Performance: {pages_per_second:.1f} pages/second", tag="TEST")
if peak_memory > 0:
logger.info(f"Browser memory: {browser_memory:.1f}MB", tag="TEST")
logger.info(f"Pages memory: {pages_memory:.1f}MB", tag="TEST")
logger.info(f"Peak memory: {peak_memory:.1f}MB", tag="TEST")
logger.info(f"Efficiency score: {result['efficiency_score']:.6f}", tag="TEST")
except Exception as e:
logger.error(f"Error testing configuration: {str(e)}", tag="TEST")
import traceback
traceback.print_exc()
# Clean up
for manager in managers:
try:
await manager.close()
except Exception:
pass
# Print summary of all configurations
logger.info("=" * 100, tag="TEST")
logger.info("GRID SEARCH RESULTS SUMMARY", tag="TEST")
logger.info("=" * 100, tag="TEST")
# Rank configurations by efficiency score
ranked_results = sorted(results, key=lambda x: x["efficiency_score"], reverse=True)
# Also determine rankings by different metrics
fastest = sorted(results, key=lambda x: x["crawl_time"])[0]
lowest_memory = sorted(results, key=lambda x: x["peak_memory"] if x["peak_memory"] > 0 else float('inf'))[0]
most_efficient = ranked_results[0]
# Print top performers by category
logger.info("🏆 TOP PERFORMERS BY CATEGORY:", tag="TEST")
logger.info(f"⚡ Fastest: {fastest['num_browsers']} browsers × ~{fastest['pages_per_browser']} pages " +
f"({fastest['crawl_time']:.2f}s, {fastest['pages_per_second']:.1f} pages/s)", tag="TEST")
if lowest_memory["peak_memory"] > 0:
logger.info(f"💾 Lowest memory: {lowest_memory['num_browsers']} browsers × ~{lowest_memory['pages_per_browser']} pages " +
f"({lowest_memory['peak_memory']:.1f}MB)", tag="TEST")
logger.info(f"🌟 Most efficient: {most_efficient['num_browsers']} browsers × ~{most_efficient['pages_per_browser']} pages " +
f"(score: {most_efficient['efficiency_score']:.6f})", tag="TEST")
# Print result table header
logger.info("\n📊 COMPLETE RANKING TABLE (SORTED BY EFFICIENCY SCORE):", tag="TEST")
logger.info("-" * 120, tag="TEST")
# Define table header
header = f"{'Rank':<5} | {'Browsers':<8} | {'Distribution':<55} | {'Total Time(s)':<12} | {'Speed(p/s)':<12} | {'Memory(MB)':<12} | {'Efficiency':<10} | {'Notes'}"
logger.info(header, tag="TEST")
logger.info("-" * 120, tag="TEST")
# Print each configuration in ranked order
for rank, result in enumerate(ranked_results, 1):
# Add special notes for top performers
notes = []
if result == fastest:
notes.append("⚡ Fastest")
if result == lowest_memory:
notes.append("💾 Lowest Memory")
if result == most_efficient:
notes.append("🌟 Most Efficient")
notes_str = " | ".join(notes) if notes else ""
# Format memory if available
memory_str = f"{result['peak_memory']:.1f}" if result['peak_memory'] > 0 else "N/A"
# Get the distribution string
dist_str = result.get('distribution_str', str(tuple([result['pages_per_browser']] * result['num_browsers'])))
# Build the row
row = f"{rank:<5} | {result['num_browsers']:<8} | {dist_str:<55} | {result['total_time']:<12.2f} | "
row += f"{result['pages_per_second']:<12.2f} | {memory_str:<12} | {result['efficiency_score']:<10.4f} | {notes_str}"
logger.info(row, tag="TEST")
logger.info("-" * 120, tag="TEST")
# Generate visualization if matplotlib is available
try:
import matplotlib.pyplot as plt
import numpy as np
# Extract data for plotting from ranked results
browser_counts = [r["num_browsers"] for r in ranked_results]
efficiency_scores = [r["efficiency_score"] for r in ranked_results]
crawl_times = [r["crawl_time"] for r in ranked_results]
total_times = [r["total_time"] for r in ranked_results]
# Filter results with memory data
memory_results = [r for r in ranked_results if r["peak_memory"] > 0]
memory_browser_counts = [r["num_browsers"] for r in memory_results]
peak_memories = [r["peak_memory"] for r in memory_results]
# Create figure with clean design
plt.figure(figsize=(14, 12), facecolor='white')
plt.style.use('ggplot')
# Create grid for subplots
gs = plt.GridSpec(3, 1, height_ratios=[1, 1, 1], hspace=0.3)
# Plot 1: Efficiency Score (higher is better)
ax1 = plt.subplot(gs[0])
bar_colors = ['#3498db'] * len(browser_counts)
# Highlight the most efficient
most_efficient_idx = browser_counts.index(most_efficient["num_browsers"])
bar_colors[most_efficient_idx] = '#e74c3c' # Red for most efficient
bars = ax1.bar(range(len(browser_counts)), efficiency_scores, color=bar_colors)
ax1.set_xticks(range(len(browser_counts)))
ax1.set_xticklabels([f"{bc}" for bc in browser_counts], rotation=45)
ax1.set_xlabel('Number of Browsers')
ax1.set_ylabel('Efficiency Score (higher is better)')
ax1.set_title('Browser Configuration Efficiency (higher is better)')
# Add value labels on top of bars
for bar, score in zip(bars, efficiency_scores):
height = bar.get_height()
ax1.text(bar.get_x() + bar.get_width()/2., height + 0.02*max(efficiency_scores),
f'{score:.3f}', ha='center', va='bottom', rotation=90, fontsize=8)
# Highlight best configuration
ax1.text(0.02, 0.90, f"🌟 Most Efficient: {most_efficient['num_browsers']} browsers with ~{most_efficient['pages_per_browser']} pages",
transform=ax1.transAxes, fontsize=12, verticalalignment='top',
bbox=dict(boxstyle='round,pad=0.5', facecolor='yellow', alpha=0.3))
# Plot 2: Time Performance
ax2 = plt.subplot(gs[1])
# Plot both total time and crawl time
ax2.plot(browser_counts, crawl_times, 'bo-', label='Crawl Time (s)', linewidth=2)
ax2.plot(browser_counts, total_times, 'go--', label='Total Time (s)', linewidth=2, alpha=0.6)
# Mark the fastest configuration
fastest_idx = browser_counts.index(fastest["num_browsers"])
ax2.plot(browser_counts[fastest_idx], crawl_times[fastest_idx], 'ro', ms=10,
label=f'Fastest: {fastest["num_browsers"]} browsers')
ax2.set_xlabel('Number of Browsers')
ax2.set_ylabel('Time (seconds)')
ax2.set_title(f'Time Performance for {total_urls} URLs by Browser Count')
ax2.grid(True, linestyle='--', alpha=0.7)
ax2.legend(loc='upper right')
# Plot pages per second on second y-axis
pages_per_second = [total_urls/t for t in crawl_times]
ax2_twin = ax2.twinx()
ax2_twin.plot(browser_counts, pages_per_second, 'r^--', label='Pages/second', alpha=0.5)
ax2_twin.set_ylabel('Pages per second')
# Add note about the fastest configuration
ax2.text(0.02, 0.90, f"⚡ Fastest: {fastest['num_browsers']} browsers with ~{fastest['pages_per_browser']} pages" +
f"\n {fastest['crawl_time']:.2f}s ({fastest['pages_per_second']:.1f} pages/s)",
transform=ax2.transAxes, fontsize=12, verticalalignment='top',
bbox=dict(boxstyle='round,pad=0.5', facecolor='lightblue', alpha=0.3))
# Plot 3: Memory Usage (if available)
if memory_results:
ax3 = plt.subplot(gs[2])
# Prepare data for grouped bar chart
memory_per_browser = [m/n for m, n in zip(peak_memories, memory_browser_counts)]
memory_per_page = [m/(n*p) for m, n, p in zip(
[r["peak_memory"] for r in memory_results],
[r["num_browsers"] for r in memory_results],
[r["pages_per_browser"] for r in memory_results])]
x = np.arange(len(memory_browser_counts))
width = 0.35
# Create grouped bars
ax3.bar(x - width/2, peak_memories, width, label='Total Memory (MB)', color='#9b59b6')
ax3.bar(x + width/2, memory_per_browser, width, label='Memory per Browser (MB)', color='#3498db')
# Configure axis
ax3.set_xticks(x)
ax3.set_xticklabels([f"{bc}" for bc in memory_browser_counts], rotation=45)
ax3.set_xlabel('Number of Browsers')
ax3.set_ylabel('Memory (MB)')
ax3.set_title('Memory Usage by Browser Configuration')
ax3.legend(loc='upper left')
ax3.grid(True, linestyle='--', alpha=0.7)
# Add second y-axis for memory per page
ax3_twin = ax3.twinx()
ax3_twin.plot(x, memory_per_page, 'ro-', label='Memory per Page (MB)')
ax3_twin.set_ylabel('Memory per Page (MB)')
# Get lowest memory configuration
lowest_memory_idx = memory_browser_counts.index(lowest_memory["num_browsers"])
# Add note about lowest memory configuration
ax3.text(0.02, 0.90, f"💾 Lowest Memory: {lowest_memory['num_browsers']} browsers with ~{lowest_memory['pages_per_browser']} pages" +
f"\n {lowest_memory['peak_memory']:.1f}MB ({lowest_memory['peak_memory']/total_urls:.2f}MB per page)",
transform=ax3.transAxes, fontsize=12, verticalalignment='top',
bbox=dict(boxstyle='round,pad=0.5', facecolor='lightgreen', alpha=0.3))
# Add overall title
plt.suptitle(f'Browser Scaling Grid Search Results for {total_urls} URLs', fontsize=16, y=0.98)
# Add timestamp and info at the bottom
plt.figtext(0.5, 0.01, f"Generated by Crawl4AI at {time.strftime('%Y-%m-%d %H:%M:%S')}",
ha="center", fontsize=10, style='italic')
# Get current directory and save the figure there
import os
__current_file = os.path.abspath(__file__)
current_dir = os.path.dirname(__current_file)
output_file = os.path.join(current_dir, 'browser_scaling_grid_search.png')
# Adjust layout and save figure with high DPI
plt.tight_layout(rect=[0, 0.03, 1, 0.97])
plt.savefig(output_file, dpi=200, bbox_inches='tight')
logger.success(f"Visualization saved to {output_file}", tag="TEST")
except ImportError:
logger.warning("matplotlib not available, skipping visualization", tag="TEST")
return most_efficient["num_browsers"], most_efficient["pages_per_browser"]
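The even distribution with remainder used by both the grid search and `find_optimal_browser_config` (the first `remainder` browsers each take one extra page) reduces to a one-line `divmod`. A sketch; `page_distribution` is an illustrative helper name:

```python
# Spread total_urls across num_browsers as evenly as possible:
# the first `remainder` browsers get base + 1 pages, the rest get base.
def page_distribution(total_urls, num_browsers):
    base, remainder = divmod(total_urls, num_browsers)
    return [base + 1] * remainder + [base] * (num_browsers - remainder)

assert page_distribution(50, 4) == [13, 13, 12, 12]
assert page_distribution(9, 4) == [3, 2, 2, 2]
assert sum(page_distribution(50, 7)) == 50  # every URL is assigned exactly once
```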
async def find_optimal_browser_config(total_urls=50, verbose=True, rate_limit_delay=0.2):
"""Find optimal browser configuration for crawling a specific number of URLs.
Args:
total_urls: Number of URLs to crawl
verbose: Whether to print progress
rate_limit_delay: Delay between page loads to avoid rate limiting
Returns:
dict: Contains fastest, lowest_memory, and optimal configurations
"""
if verbose:
print(f"\n=== Finding optimal configuration for crawling {total_urls} URLs ===\n")
# Generate test URLs with timestamp to avoid caching
timestamp = int(time.time())
urls = [f"https://example.com/page_{i}?t={timestamp}" for i in range(total_urls)]
# Limit browser configurations to test (1 browser to max 10)
max_browsers = min(10, total_urls)
configs_to_test = []
# Generate configurations (browser count, pages distribution)
for num_browsers in range(1, max_browsers + 1):
base_pages = total_urls // num_browsers
remainder = total_urls % num_browsers
# Create distribution array like [3, 3, 2, 2] (some browsers get one more page)
if remainder > 0:
distribution = [base_pages + 1] * remainder + [base_pages] * (num_browsers - remainder)
else:
distribution = [base_pages] * num_browsers
configs_to_test.append((num_browsers, distribution))
results = []
# Test each configuration
for browser_count, page_distribution in configs_to_test:
if verbose:
print(f"Testing {browser_count} browsers with distribution {tuple(page_distribution)}")
try:
# Track memory if possible
try:
import psutil
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024) # MB
except ImportError:
if verbose:
print("Memory tracking not available (psutil not installed)")
start_memory = 0
# Start browsers in parallel
managers = []
start_tasks = []
start_time = time.time()
for i in range(browser_count):
config = BrowserConfig(headless=True)
manager = BrowserManager(browser_config=config, logger=logger)
start_tasks.append(manager.start())
managers.append(manager)
await asyncio.gather(*start_tasks)
# Distribute URLs among browsers
urls_per_manager = {}
url_index = 0
for i, manager in enumerate(managers):
pages_for_this_browser = page_distribution[i]
end_index = url_index + pages_for_this_browser
urls_per_manager[manager] = urls[url_index:end_index]
url_index = end_index
# Create pages for each browser
all_pages = []
for manager, manager_urls in urls_per_manager.items():
if not manager_urls:
continue
pages = await manager.get_pages(CrawlerRunConfig(), count=len(manager_urls))
all_pages.extend(zip(pages, manager_urls))
# Crawl pages with delay to avoid rate limiting
async def crawl_page(page_ctx, url):
page, _ = page_ctx
try:
await page.goto(url)
if rate_limit_delay > 0:
await asyncio.sleep(rate_limit_delay)
title = await page.title()
return title
finally:
await page.close()
crawl_start = time.time()
crawl_tasks = [crawl_page(page_ctx, url) for page_ctx, url in all_pages]
await asyncio.gather(*crawl_tasks)
crawl_time = time.time() - crawl_start
total_time = time.time() - start_time
# Measure final memory usage
if start_memory > 0:
end_memory = process.memory_info().rss / (1024 * 1024)
memory_used = end_memory - start_memory
else:
memory_used = 0
# Close all browsers
for manager in managers:
await manager.close()
# Calculate metrics
pages_per_second = total_urls / crawl_time
# Calculate efficiency score (higher is better)
# This balances speed vs memory
if memory_used > 0:
efficiency = pages_per_second / (memory_used + 1)
else:
efficiency = pages_per_second
# Store result
result = {
"browser_count": browser_count,
"distribution": tuple(page_distribution),
"crawl_time": crawl_time,
"total_time": total_time,
"memory_used": memory_used,
"pages_per_second": pages_per_second,
"efficiency": efficiency
}
results.append(result)
if verbose:
print(f" ✓ Crawled {total_urls} pages in {crawl_time:.2f}s ({pages_per_second:.1f} pages/sec)")
if memory_used > 0:
print(f" ✓ Memory used: {memory_used:.1f}MB ({memory_used/total_urls:.1f}MB per page)")
print(f" ✓ Efficiency score: {efficiency:.4f}")
except Exception as e:
if verbose:
print(f" ✗ Error: {str(e)}")
# Clean up
for manager in managers:
try:
await manager.close()
except Exception:
pass
# If no successful results, return None
if not results:
return None
# Find best configurations
fastest = sorted(results, key=lambda x: x["crawl_time"])[0]
# Only consider memory if available
memory_results = [r for r in results if r["memory_used"] > 0]
if memory_results:
lowest_memory = sorted(memory_results, key=lambda x: x["memory_used"])[0]
else:
lowest_memory = fastest
# Find most efficient (balanced speed vs memory)
optimal = sorted(results, key=lambda x: x["efficiency"], reverse=True)[0]
# Print summary
if verbose:
print("\n=== OPTIMAL CONFIGURATIONS ===")
print(f"⚡ Fastest: {fastest['browser_count']} browsers {fastest['distribution']}")
print(f" {fastest['crawl_time']:.2f}s, {fastest['pages_per_second']:.1f} pages/sec")
print(f"💾 Memory-efficient: {lowest_memory['browser_count']} browsers {lowest_memory['distribution']}")
if lowest_memory["memory_used"] > 0:
print(f" {lowest_memory['memory_used']:.1f}MB, {lowest_memory['memory_used']/total_urls:.2f}MB per page")
print(f"🌟 Balanced optimal: {optimal['browser_count']} browsers {optimal['distribution']}")
print(f" {optimal['crawl_time']:.2f}s, {optimal['pages_per_second']:.1f} pages/sec, score: {optimal['efficiency']:.4f}")
return {
"fastest": fastest,
"lowest_memory": lowest_memory,
"optimal": optimal,
"all_configs": results
}
async def run_tests():
"""Run all tests sequentially."""
results = []
# Find optimal configuration using our utility function
configs = await find_optimal_browser_config(
total_urls=20, # Use a small number for faster testing
verbose=True,
rate_limit_delay=0.2 # 200ms delay between page loads to avoid rate limiting
)
if configs:
# Show the optimal configuration
optimal = configs["optimal"]
print(f"\n🎯 Recommended configuration for production use:")
print(f" {optimal['browser_count']} browsers with distribution {optimal['distribution']}")
print(f" Estimated performance: {optimal['pages_per_second']:.1f} pages/second")
results.append(True)
else:
print("\n❌ Failed to find optimal configuration")
results.append(False)
# Print summary
total = len(results)
passed = sum(results)
print(f"\nTests complete: {passed}/{total} passed")
if passed == total:
print("All tests passed!")
else:
print(f"{total - passed} tests failed")
if __name__ == "__main__":
asyncio.run(run_tests())
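The page-distribution logic used by the grid search above (each browser gets `total_urls // num_browsers` pages, with the first `remainder` browsers taking one extra) can be sketched as a standalone helper. `split_pages` is a hypothetical illustration, not part of the crawl4ai API:

```python
def split_pages(total_urls: int, num_browsers: int) -> list[int]:
    """Distribute total_urls across num_browsers as evenly as possible.

    Mirrors the distribution built in find_optimal_browser_config: the
    first (total_urls % num_browsers) browsers each take one extra page.
    """
    base_pages, remainder = divmod(total_urls, num_browsers)
    return [base_pages + 1] * remainder + [base_pages] * (num_browsers - remainder)

print(split_pages(10, 4))  # → [3, 3, 2, 2]
```

The sum of the returned list always equals `total_urls`, so no URL is dropped regardless of how unevenly the division falls.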


@@ -0,0 +1,267 @@
"""Test examples for PlaywrightBrowserStrategy.
These examples demonstrate the functionality of PlaywrightBrowserStrategy
and serve as functional tests.
"""
import asyncio
import os
import sys
# Add the project root to Python path if running directly
if __name__ == "__main__":
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
from crawl4ai.browser import BrowserManager
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.async_logger import AsyncLogger
# Create a logger for clear terminal output
logger = AsyncLogger(verbose=True, log_file=None)
async def test_playwright_basic():
"""Test basic Playwright browser functionality."""
logger.info("Testing standard Playwright browser", tag="TEST")
# Create browser config for standard Playwright
browser_config = BrowserConfig(
headless=True,
viewport_width=1280,
viewport_height=800
)
# Create browser manager with the config
manager = BrowserManager(browser_config=browser_config, logger=logger)
try:
# Start the browser
await manager.start()
logger.info("Browser started successfully", tag="TEST")
# Create crawler config
crawler_config = CrawlerRunConfig(url="https://example.com")
# Get a page
page, context = await manager.get_page(crawler_config)
logger.info("Got page successfully", tag="TEST")
# Navigate to a website
await page.goto("https://example.com")
logger.info("Navigated to example.com", tag="TEST")
# Get page title
title = await page.title()
logger.info(f"Page title: {title}", tag="TEST")
# Clean up
await manager.close()
logger.info("Browser closed successfully", tag="TEST")
return True
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
# Ensure cleanup
try:
await manager.close()
except Exception:
pass
return False
async def test_playwright_text_mode():
"""Test Playwright browser in text-only mode."""
logger.info("Testing Playwright text mode", tag="TEST")
# Create browser config with text mode enabled
browser_config = BrowserConfig(
headless=True,
text_mode=True # Enable text-only mode
)
# Create browser manager with the config
manager = BrowserManager(browser_config=browser_config, logger=logger)
try:
# Start the browser
await manager.start()
logger.info("Browser started successfully in text mode", tag="TEST")
# Get a page
crawler_config = CrawlerRunConfig(url="https://example.com")
page, context = await manager.get_page(crawler_config)
# Navigate to a website
await page.goto("https://example.com")
logger.info("Navigated to example.com", tag="TEST")
# Get page title
title = await page.title()
logger.info(f"Page title: {title}", tag="TEST")
# Check if images are blocked in text mode
# We'll check if any image requests were made
has_images = False
async with page.expect_request("**/*.{png,jpg,jpeg,gif,webp,svg}", timeout=1000) as request_info:
try:
# Try to load a page with images
await page.goto("https://picsum.photos/", wait_until="domcontentloaded")
await request_info.value
has_images = True
except Exception:
# Timeout without image requests means text mode is working
has_images = False
logger.info(f"Text mode image blocking working: {not has_images}", tag="TEST")
# Clean up
await manager.close()
logger.info("Browser closed successfully", tag="TEST")
return True
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
# Ensure cleanup
try:
await manager.close()
except Exception:
pass
return False
async def test_playwright_context_reuse():
"""Test context caching and reuse with identical configurations."""
logger.info("Testing context reuse with identical configurations", tag="TEST")
# Create browser config
browser_config = BrowserConfig(headless=True)
# Create browser manager
manager = BrowserManager(browser_config=browser_config, logger=logger)
try:
# Start the browser
await manager.start()
logger.info("Browser started successfully", tag="TEST")
# Create identical crawler configs
crawler_config1 = CrawlerRunConfig(
css_selector="body",
)
crawler_config2 = CrawlerRunConfig(
css_selector="body",
)
# Get pages with these configs
page1, context1 = await manager.get_page(crawler_config1)
page2, context2 = await manager.get_page(crawler_config2)
# Check if contexts are reused
is_same_context = context1 == context2
logger.info(f"Contexts reused: {is_same_context}", tag="TEST")
# Now try with a different config
crawler_config3 = CrawlerRunConfig()
page3, context3 = await manager.get_page(crawler_config3)
# This should be a different context
is_different_context = context1 != context3
logger.info(f"Different contexts for different configs: {is_different_context}", tag="TEST")
# Clean up
await manager.close()
logger.info("Browser closed successfully", tag="TEST")
# Both tests should pass for success
return is_same_context and is_different_context
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
# Ensure cleanup
try:
await manager.close()
except Exception:
pass
return False
async def test_playwright_session_management():
"""Test session management with Playwright browser."""
logger.info("Testing session management with Playwright browser", tag="TEST")
browser_config = BrowserConfig(
headless=True
)
manager = BrowserManager(browser_config=browser_config, logger=logger)
try:
await manager.start()
logger.info("Browser launched successfully", tag="TEST")
# Create two sessions
session1_id = "playwright_session_1"
session2_id = "playwright_session_2"
# Set up first session
crawler_config1 = CrawlerRunConfig(session_id=session1_id, url="https://example.com")
page1, context1 = await manager.get_page(crawler_config1)
await page1.goto("https://example.com")
await page1.evaluate("localStorage.setItem('playwright_session1_data', 'test_value1')")
logger.info(f"Set up session 1 with ID: {session1_id}", tag="TEST")
# Set up second session
crawler_config2 = CrawlerRunConfig(session_id=session2_id, url="https://example.org")
page2, context2 = await manager.get_page(crawler_config2)
await page2.goto("https://example.org")
await page2.evaluate("localStorage.setItem('playwright_session2_data', 'test_value2')")
logger.info(f"Set up session 2 with ID: {session2_id}", tag="TEST")
# Get first session again
page1_again, context1_again = await manager.get_page(crawler_config1)
# Verify it's the same page and data persists
is_same_page = page1 == page1_again
is_same_context = context1 == context1_again
data1 = await page1_again.evaluate("localStorage.getItem('playwright_session1_data')")
logger.info(f"Session 1 reuse successful: {is_same_page}, data: {data1}", tag="TEST")
# Kill first session
await manager.kill_session(session1_id)
logger.info(f"Killed session 1", tag="TEST")
# Verify second session still works
data2 = await page2.evaluate("localStorage.getItem('playwright_session2_data')")
logger.info(f"Session 2 still functional after killing session 1, data: {data2}", tag="TEST")
# Clean up
await manager.close()
logger.info("Browser closed successfully", tag="TEST")
return is_same_page and is_same_context and data1 == "test_value1" and data2 == "test_value2"
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
try:
await manager.close()
except Exception:
pass
return False
async def run_tests():
"""Run all tests sequentially."""
results = []
results.append(await test_playwright_basic())
results.append(await test_playwright_text_mode())
results.append(await test_playwright_context_reuse())
results.append(await test_playwright_session_management())
# Print summary
total = len(results)
passed = sum(results)
logger.info(f"Tests complete: {passed}/{total} passed", tag="SUMMARY")
if passed == total:
logger.success("All tests passed!", tag="SUMMARY")
else:
logger.error(f"{total - passed} tests failed", tag="SUMMARY")
if __name__ == "__main__":
asyncio.run(run_tests())
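The context-reuse behaviour exercised in `test_playwright_context_reuse` (identical configs share a context, a different config gets a fresh one) amounts to caching keyed on a stable signature of the config. A minimal sketch of that pattern, using plain dicts in place of the real BrowserManager internals:

```python
import json

class ContextCache:
    """Cache 'contexts' keyed by a hash of run-config fields.

    Illustrative only: real browser contexts are stood in for by dicts,
    and the signature scheme is an assumption, not crawl4ai's actual one.
    """
    def __init__(self):
        self._contexts = {}
        self._counter = 0

    def _signature(self, config: dict) -> str:
        # Identical configs serialize identically with sorted keys
        return json.dumps(config, sort_keys=True)

    def get_context(self, config: dict) -> dict:
        key = self._signature(config)
        if key not in self._contexts:
            self._counter += 1
            self._contexts[key] = {"id": self._counter, "config": config}
        return self._contexts[key]

cache = ContextCache()
c1 = cache.get_context({"css_selector": "body"})
c2 = cache.get_context({"css_selector": "body"})
c3 = cache.get_context({})
print(c1 is c2, c1 is c3)  # → True False
```

The design choice being tested is exactly this: equality of configuration, not object identity, decides whether a browser context is reused.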


@@ -0,0 +1,176 @@
"""Test examples for BrowserProfileManager.
These examples demonstrate the functionality of BrowserProfileManager
and serve as functional tests.
"""
import asyncio
import os
import sys
import uuid
import shutil
# Add the project root to Python path if running directly
if __name__ == "__main__":
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
from crawl4ai.browser import BrowserManager, BrowserProfileManager
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.async_logger import AsyncLogger
# Create a logger for clear terminal output
logger = AsyncLogger(verbose=True, log_file=None)
async def test_profile_creation():
"""Test creating and managing browser profiles."""
logger.info("Testing profile creation and management", tag="TEST")
profile_manager = BrowserProfileManager(logger=logger)
try:
# List existing profiles
profiles = profile_manager.list_profiles()
logger.info(f"Found {len(profiles)} existing profiles", tag="TEST")
# Generate a unique profile name for testing
test_profile_name = f"test-profile-{uuid.uuid4().hex[:8]}"
# Create a test profile directory
profile_path = os.path.join(profile_manager.profiles_dir, test_profile_name)
os.makedirs(os.path.join(profile_path, "Default"), exist_ok=True)
# Create a dummy Preferences file to simulate a Chrome profile
with open(os.path.join(profile_path, "Default", "Preferences"), "w") as f:
f.write("{\"test\": true}")
logger.info(f"Created test profile at: {profile_path}", tag="TEST")
# Verify the profile is now in the list
profiles = profile_manager.list_profiles()
profile_found = any(p["name"] == test_profile_name for p in profiles)
logger.info(f"Profile found in list: {profile_found}", tag="TEST")
# Try to get the profile path
retrieved_path = profile_manager.get_profile_path(test_profile_name)
path_match = retrieved_path == profile_path
logger.info(f"Retrieved correct profile path: {path_match}", tag="TEST")
# Delete the profile
success = profile_manager.delete_profile(test_profile_name)
logger.info(f"Profile deletion successful: {success}", tag="TEST")
# Verify it's gone
profiles_after = profile_manager.list_profiles()
profile_removed = not any(p["name"] == test_profile_name for p in profiles_after)
logger.info(f"Profile removed from list: {profile_removed}", tag="TEST")
# Clean up just in case
if os.path.exists(profile_path):
shutil.rmtree(profile_path, ignore_errors=True)
return profile_found and path_match and success and profile_removed
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
# Clean up test directory
try:
if os.path.exists(profile_path):
shutil.rmtree(profile_path, ignore_errors=True)
except Exception:
pass
return False
async def test_profile_with_browser():
"""Test using a profile with a browser."""
logger.info("Testing using a profile with a browser", tag="TEST")
profile_manager = BrowserProfileManager(logger=logger)
test_profile_name = f"test-browser-profile-{uuid.uuid4().hex[:8]}"
profile_path = None
try:
# Create a test profile directory
profile_path = os.path.join(profile_manager.profiles_dir, test_profile_name)
os.makedirs(os.path.join(profile_path, "Default"), exist_ok=True)
# Create a dummy Preferences file to simulate a Chrome profile
with open(os.path.join(profile_path, "Default", "Preferences"), "w") as f:
f.write("{\"test\": true}")
logger.info(f"Created test profile at: {profile_path}", tag="TEST")
# Now use this profile with a browser
browser_config = BrowserConfig(
user_data_dir=profile_path,
headless=True
)
manager = BrowserManager(browser_config=browser_config, logger=logger)
# Start the browser with the profile
await manager.start()
logger.info("Browser started with profile", tag="TEST")
# Create a page
crawler_config = CrawlerRunConfig()
page, context = await manager.get_page(crawler_config)
# Navigate and set some data to verify profile works
await page.goto("https://example.com")
await page.evaluate("localStorage.setItem('test_data', 'profile_value')")
# Close browser
await manager.close()
logger.info("First browser session closed", tag="TEST")
# Create a new browser with the same profile
manager2 = BrowserManager(browser_config=browser_config, logger=logger)
await manager2.start()
logger.info("Second browser session started with same profile", tag="TEST")
# Get a page and check if the data persists
page2, context2 = await manager2.get_page(crawler_config)
await page2.goto("https://example.com")
data = await page2.evaluate("localStorage.getItem('test_data')")
# Verify data persisted
data_persisted = data == "profile_value"
logger.info(f"Data persisted across sessions: {data_persisted}", tag="TEST")
# Clean up
await manager2.close()
logger.info("Second browser session closed", tag="TEST")
# Delete the test profile
success = profile_manager.delete_profile(test_profile_name)
logger.info(f"Test profile deleted: {success}", tag="TEST")
return data_persisted and success
except Exception as e:
logger.error(f"Test failed: {str(e)}", tag="TEST")
# Clean up
try:
if profile_path and os.path.exists(profile_path):
shutil.rmtree(profile_path, ignore_errors=True)
except Exception:
pass
return False
async def run_tests():
"""Run all tests sequentially."""
results = []
results.append(await test_profile_creation())
results.append(await test_profile_with_browser())
# Print summary
total = len(results)
passed = sum(results)
logger.info(f"Tests complete: {passed}/{total} passed", tag="SUMMARY")
if passed == total:
logger.success("All tests passed!", tag="SUMMARY")
else:
logger.error(f"{total - passed} tests failed", tag="SUMMARY")
if __name__ == "__main__":
asyncio.run(run_tests())


@@ -73,7 +73,7 @@ async def test_stream_crawl(session, token: str):
        # "https://news.ycombinator.com/news"
    ],
    "browser_config": {"headless": True, "viewport": {"width": 1200}},
-   "crawler_config": {"stream": True, "cache_mode": "aggressive"}
+   "crawler_config": {"stream": True, "cache_mode": "bypass"}
}
headers = {"Authorization": f"Bearer {token}"}
print(f"\nTesting Streaming Crawl: {url}")


@@ -0,0 +1,168 @@
"""
Test script for the CrawlerMonitor component.
This script simulates a crawler with multiple tasks to demonstrate the real-time monitoring capabilities.
"""
import time
import uuid
import random
import threading
import sys
import os
# Add the parent directory to the path to import crawl4ai
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../..")))
from crawl4ai.components.crawler_monitor import CrawlerMonitor
from crawl4ai.models import CrawlStatus
def simulate_crawler_task(monitor, task_id, url, simulate_failure=False):
"""Simulate a crawler task with different states."""
# Task starts in the QUEUED state
wait_time = random.uniform(0.5, 3.0)
time.sleep(wait_time)
# Update to IN_PROGRESS state
monitor.update_task(
task_id=task_id,
status=CrawlStatus.IN_PROGRESS,
start_time=time.time(),
wait_time=wait_time
)
# Simulate task running
process_time = random.uniform(1.0, 5.0)
for i in range(int(process_time * 2)):
# Simulate memory usage changes
memory_usage = random.uniform(5.0, 25.0)
monitor.update_task(
task_id=task_id,
memory_usage=memory_usage,
peak_memory=max(memory_usage, monitor.get_task_stats(task_id).get("peak_memory", 0))
)
time.sleep(0.5)
# Update to COMPLETED or FAILED state
if simulate_failure and random.random() < 0.8: # 80% chance of failure if simulate_failure is True
monitor.update_task(
task_id=task_id,
status=CrawlStatus.FAILED,
end_time=time.time(),
error_message="Simulated failure: Connection timeout",
memory_usage=0.0
)
else:
monitor.update_task(
task_id=task_id,
status=CrawlStatus.COMPLETED,
end_time=time.time(),
memory_usage=0.0
)
def update_queue_stats(monitor, num_queued_tasks):
"""Update queue statistics periodically."""
while monitor.is_running:
queued_tasks = [
task for task_id, task in monitor.get_all_task_stats().items()
if task["status"] == CrawlStatus.QUEUED.name
]
total_queued = len(queued_tasks)
if total_queued > 0:
current_time = time.time()
wait_times = [
current_time - task.get("enqueue_time", current_time)
for task in queued_tasks
]
highest_wait_time = max(wait_times) if wait_times else 0.0
avg_wait_time = sum(wait_times) / len(wait_times) if wait_times else 0.0
else:
highest_wait_time = 0.0
avg_wait_time = 0.0
monitor.update_queue_statistics(
total_queued=total_queued,
highest_wait_time=highest_wait_time,
avg_wait_time=avg_wait_time
)
# Simulate memory pressure based on number of active tasks
active_tasks = len([
task for task_id, task in monitor.get_all_task_stats().items()
if task["status"] == CrawlStatus.IN_PROGRESS.name
])
if active_tasks > 8:
monitor.update_memory_status("CRITICAL")
elif active_tasks > 4:
monitor.update_memory_status("PRESSURE")
else:
monitor.update_memory_status("NORMAL")
time.sleep(1.0)
def test_crawler_monitor():
"""Test the CrawlerMonitor with simulated crawler tasks."""
# Total number of URLs to crawl
total_urls = 50
# Initialize the monitor
monitor = CrawlerMonitor(urls_total=total_urls, refresh_rate=0.5)
# Start the monitor
monitor.start()
# Start thread to update queue statistics
queue_stats_thread = threading.Thread(target=update_queue_stats, args=(monitor, total_urls))
queue_stats_thread.daemon = True
queue_stats_thread.start()
try:
# Create task threads
threads = []
for i in range(total_urls):
task_id = str(uuid.uuid4())
url = f"https://example.com/page{i}"
# Add task to monitor
monitor.add_task(task_id, url)
# Determine if this task should simulate failure
simulate_failure = (i % 10 == 0) # Every 10th task
# Create and start thread for this task
thread = threading.Thread(
target=simulate_crawler_task,
args=(monitor, task_id, url, simulate_failure)
)
thread.daemon = True
threads.append(thread)
# Start threads with delay to simulate tasks being added over time
batch_size = 5
for i in range(0, len(threads), batch_size):
batch = threads[i:i+batch_size]
for thread in batch:
thread.start()
time.sleep(0.5) # Small delay between starting threads
# Wait a bit before starting the next batch
time.sleep(2.0)
# Wait for all threads to complete
for thread in threads:
thread.join()
# Keep monitor running a bit longer to see the final state
time.sleep(5.0)
except KeyboardInterrupt:
print("\nTest interrupted by user")
finally:
# Stop the monitor
monitor.stop()
print("\nCrawler monitor test completed")
if __name__ == "__main__":
test_crawler_monitor()
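The `update_queue_stats` thread above boils down to computing the count, maximum, and mean of wait times over QUEUED tasks. A standalone sketch under the assumption that task stats are plain dicts with `status` and `enqueue_time` keys, as in the monitor test:

```python
def queue_wait_stats(tasks: list[dict], now: float) -> tuple[int, float, float]:
    """Return (total_queued, highest_wait_time, avg_wait_time).

    Tasks missing enqueue_time are treated as just enqueued (zero wait),
    matching the .get(..., current_time) fallback in the test above.
    """
    queued = [t for t in tasks if t["status"] == "QUEUED"]
    if not queued:
        return 0, 0.0, 0.0
    waits = [now - t.get("enqueue_time", now) for t in queued]
    return len(queued), max(waits), sum(waits) / len(waits)

tasks = [
    {"status": "QUEUED", "enqueue_time": 96.0},
    {"status": "QUEUED", "enqueue_time": 98.0},
    {"status": "IN_PROGRESS", "enqueue_time": 91.0},  # ignored: not queued
]
print(queue_wait_stats(tasks, now=100.0))  # → (2, 4.0, 3.0)
```

These three values are what `monitor.update_queue_statistics` receives each second.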


@@ -0,0 +1,410 @@
import asyncio
import time
import psutil
import logging
import random
from typing import List, Dict
import uuid
import sys
import os
# Import your crawler components
from crawl4ai.models import DisplayMode, CrawlStatus, CrawlResult
from crawl4ai.async_configs import CrawlerRunConfig, BrowserConfig, CacheMode
from crawl4ai import AsyncWebCrawler
from crawl4ai import MemoryAdaptiveDispatcher, CrawlerMonitor
# Global configuration
STREAM = False # Toggle between streaming and non-streaming modes
# Configure logging to file only (to avoid breaking the rich display)
os.makedirs("logs", exist_ok=True)
file_handler = logging.FileHandler("logs/memory_stress_test.log")
file_handler.setFormatter(logging.Formatter('%(asctime)s [%(levelname)s] %(message)s'))
# Root logger - only to file, not console
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
root_logger.addHandler(file_handler)
# Our test logger also writes to file only
logger = logging.getLogger("memory_stress_test")
logger.setLevel(logging.INFO)
logger.addHandler(file_handler)
logger.propagate = False # Don't propagate to root logger
# Create a memory restrictor to simulate limited memory environment
class MemorySimulator:
def __init__(self, target_percent: float = 85.0, aggressive: bool = False):
"""Simulates memory pressure by allocating memory"""
self.target_percent = target_percent
self.memory_blocks: List[bytearray] = []
self.aggressive = aggressive
def apply_pressure(self, additional_percent: float = 0.0):
"""Fill memory until we reach target percentage"""
current_percent = psutil.virtual_memory().percent
target = self.target_percent + additional_percent
if current_percent >= target:
return # Already at target
logger.info(f"Current memory: {current_percent}%, target: {target}%")
# Calculate how much memory we need to allocate
total_memory = psutil.virtual_memory().total
target_usage = (target / 100.0) * total_memory
current_usage = (current_percent / 100.0) * total_memory
bytes_to_allocate = int(target_usage - current_usage)
if bytes_to_allocate <= 0:
return
# Allocate in smaller chunks to avoid overallocation
if self.aggressive:
# Use larger chunks for faster allocation in aggressive mode
chunk_size = min(bytes_to_allocate, 200 * 1024 * 1024) # 200MB chunks
else:
chunk_size = min(bytes_to_allocate, 50 * 1024 * 1024) # 50MB chunks
try:
logger.info(f"Allocating {chunk_size / (1024 * 1024):.1f}MB to reach target memory usage")
self.memory_blocks.append(bytearray(chunk_size))
time.sleep(0.5) # Give system time to register the allocation
except MemoryError:
logger.warning("Unable to allocate more memory")
def release_pressure(self, percent: float = None):
"""
Release allocated memory
If percent is specified, release that percentage of blocks
"""
if not self.memory_blocks:
return
if percent is None:
# Release all
logger.info(f"Releasing all {len(self.memory_blocks)} memory blocks")
self.memory_blocks.clear()
else:
# Release specified percentage
blocks_to_release = int(len(self.memory_blocks) * (percent / 100.0))
if blocks_to_release > 0:
logger.info(f"Releasing {blocks_to_release} of {len(self.memory_blocks)} memory blocks ({percent}%)")
self.memory_blocks = self.memory_blocks[blocks_to_release:]
def spike_pressure(self, duration: float = 5.0):
"""
Create a temporary spike in memory pressure then release
Useful for forcing requeues
"""
logger.info(f"Creating memory pressure spike for {duration} seconds")
# Save current blocks count
initial_blocks = len(self.memory_blocks)
# Create spike with extra 5%
self.apply_pressure(additional_percent=5.0)
# Schedule release after duration
asyncio.create_task(self._delayed_release(duration, initial_blocks))
async def _delayed_release(self, delay: float, target_blocks: int):
"""Helper for spike_pressure - releases extra blocks after delay"""
await asyncio.sleep(delay)
# Remove blocks added since spike started
if len(self.memory_blocks) > target_blocks:
logger.info(f"Releasing memory spike ({len(self.memory_blocks) - target_blocks} blocks)")
self.memory_blocks = self.memory_blocks[:target_blocks]
# Test statistics collector
class TestResults:
def __init__(self):
self.start_time = time.time()
self.completed_urls: List[str] = []
self.failed_urls: List[str] = []
self.requeued_count = 0
self.memory_warnings = 0
self.max_memory_usage = 0.0
self.max_queue_size = 0
self.max_wait_time = 0.0
self.url_to_attempt: Dict[str, int] = {} # Track retries per URL
def log_summary(self):
duration = time.time() - self.start_time
logger.info("===== TEST SUMMARY =====")
logger.info(f"Stream mode: {'ON' if STREAM else 'OFF'}")
logger.info(f"Total duration: {duration:.1f} seconds")
logger.info(f"Completed URLs: {len(self.completed_urls)}")
logger.info(f"Failed URLs: {len(self.failed_urls)}")
logger.info(f"Requeue events: {self.requeued_count}")
logger.info(f"Memory warnings: {self.memory_warnings}")
logger.info(f"Max memory usage: {self.max_memory_usage:.1f}%")
logger.info(f"Max queue size: {self.max_queue_size}")
logger.info(f"Max wait time: {self.max_wait_time:.1f} seconds")
# Log URLs with multiple attempts
retried_urls = {url: count for url, count in self.url_to_attempt.items() if count > 1}
if retried_urls:
logger.info(f"URLs with retries: {len(retried_urls)}")
# Log the top 5 most retried
top_retries = sorted(retried_urls.items(), key=lambda x: x[1], reverse=True)[:5]
for url, count in top_retries:
logger.info(f" URL {url[-30:]} had {count} attempts")
# Write summary to a separate human-readable file
with open("logs/test_summary.txt", "w") as f:
f.write(f"Stream mode: {'ON' if STREAM else 'OFF'}\n")
f.write(f"Total duration: {duration:.1f} seconds\n")
f.write(f"Completed URLs: {len(self.completed_urls)}\n")
f.write(f"Failed URLs: {len(self.failed_urls)}\n")
f.write(f"Requeue events: {self.requeued_count}\n")
f.write(f"Memory warnings: {self.memory_warnings}\n")
f.write(f"Max memory usage: {self.max_memory_usage:.1f}%\n")
f.write(f"Max queue size: {self.max_queue_size}\n")
f.write(f"Max wait time: {self.max_wait_time:.1f} seconds\n")
# Custom monitor with stats tracking
# Custom monitor that extends CrawlerMonitor with test-specific tracking
class StressTestMonitor(CrawlerMonitor):
def __init__(self, test_results: TestResults, **kwargs):
# Initialize the parent CrawlerMonitor
super().__init__(**kwargs)
self.test_results = test_results
def update_memory_status(self, status: str):
if status != self.memory_status:
logger.info(f"Memory status changed: {self.memory_status} -> {status}")
if "CRITICAL" in status or "PRESSURE" in status:
self.test_results.memory_warnings += 1
# Track peak memory usage in test results
current_memory = psutil.virtual_memory().percent
self.test_results.max_memory_usage = max(self.test_results.max_memory_usage, current_memory)
# Call parent method to update the dashboard
super().update_memory_status(status)
def update_queue_statistics(self, total_queued: int, highest_wait_time: float, avg_wait_time: float):
# Track queue metrics in test results
self.test_results.max_queue_size = max(self.test_results.max_queue_size, total_queued)
self.test_results.max_wait_time = max(self.test_results.max_wait_time, highest_wait_time)
# Call parent method to update the dashboard
super().update_queue_statistics(total_queued, highest_wait_time, avg_wait_time)
def update_task(self, task_id: str, **kwargs):
# Track URL status changes for test results
if task_id in self.stats:
old_status = self.stats[task_id].status
# If this is a requeue event (requeued due to memory pressure)
if kwargs.get('error_message') and 'requeued' in kwargs['error_message']:
if not hasattr(self.stats[task_id], 'counted_requeue') or not self.stats[task_id].counted_requeue:
self.test_results.requeued_count += 1
self.stats[task_id].counted_requeue = True
# Track completion status for test results
if 'status' in kwargs:
new_status = kwargs['status']
if old_status != new_status:
if new_status == CrawlStatus.COMPLETED:
if task_id not in self.test_results.completed_urls:
self.test_results.completed_urls.append(task_id)
elif new_status == CrawlStatus.FAILED:
if task_id not in self.test_results.failed_urls:
self.test_results.failed_urls.append(task_id)
# Call parent method to update the dashboard
super().update_task(task_id, **kwargs)
self.live.update(self._create_table())
# Generate test URLs - use example.com with unique paths to avoid browser caching
def generate_test_urls(count: int) -> List[str]:
    urls = []
    for i in range(count):
        # Add a random path and query parameters to create unique URLs
        path = f"/path/{uuid.uuid4()}"
        query = f"?test={i}&random={random.randint(1, 100000)}"
        urls.append(f"https://example.com{path}{query}")
    return urls
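# Illustrative, standalone sketch (an assumption, not called by the test
# itself): the cache-busting scheme above only works if every generated URL
# is unique, which a simple set comparison can verify.
def _urls_are_unique(urls: List[str]) -> bool:
    # A set drops duplicates, so equal sizes mean no URL repeats.
    return len(set(urls)) == len(urls)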
# Process result callback
async def process_result(result, test_results: TestResults):
    # Track attempt counts
    if result.url not in test_results.url_to_attempt:
        test_results.url_to_attempt[result.url] = 1
    else:
        test_results.url_to_attempt[result.url] += 1

    # error_message may be None, so check it before the substring test
    if result.error_message and "requeued" in result.error_message:
        test_results.requeued_count += 1
        logger.debug(f"Requeued due to memory pressure: {result.url}")
    elif result.success:
        test_results.completed_urls.append(result.url)
        logger.debug(f"Successfully processed: {result.url}")
    else:
        test_results.failed_urls.append(result.url)
        logger.warning(f"Failed to process: {result.url} - {result.error_message}")
# Process multiple results (used in non-streaming mode)
async def process_results(results, test_results: TestResults):
    for result in results:
        await process_result(result, test_results)
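# Illustrative, standalone sketch (an assumption, mirroring the inline slicing
# idiom used in run_memory_stress_test below): split a URL list into
# fixed-size batches; the final batch may be shorter than the rest.
def _chunk_urls(items: List[str], size: int) -> List[List[str]]:
    # range(0, len(items), size) yields each batch's start index
    return [items[i:i + size] for i in range(0, len(items), size)]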
# Main test function for extreme memory pressure simulation
async def run_memory_stress_test(
    url_count: int = 100,
    target_memory_percent: float = 92.0,  # Push to dangerous levels
    chunk_size: int = 20,                 # Larger chunks for more chaos
    aggressive: bool = False,
    spikes: bool = True
):
    test_results = TestResults()
    memory_simulator = MemorySimulator(target_percent=target_memory_percent, aggressive=aggressive)

    logger.info(f"Starting stress test with {url_count} URLs in {'STREAM' if STREAM else 'NON-STREAM'} mode")
    logger.info(f"Target memory usage: {target_memory_percent}%")

    # First, elevate memory usage to create pressure
    logger.info("Creating initial memory pressure...")
    memory_simulator.apply_pressure()

    # Create test URLs in chunks to simulate real-world crawling where URLs are discovered over time
    all_urls = generate_test_urls(url_count)
    url_chunks = [all_urls[i:i + chunk_size] for i in range(0, len(all_urls), chunk_size)]

    # Set up the crawler components - low memory thresholds to create more requeues
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        verbose=False,
        stream=STREAM  # Use the global STREAM variable to set the mode
    )
    # Create the monitor with a reference to the test results
    monitor = StressTestMonitor(
        test_results=test_results,
        display_mode=DisplayMode.DETAILED,
        max_visible_rows=20,
        total_urls=url_count  # Pass the total URL count
    )

    # Create the dispatcher with EXTREME settings - pure survival mode
    # These settings are designed to create a memory battleground
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=63.0,    # Start throttling at just 63% memory
        critical_threshold_percent=70.0,  # Start requeuing at 70% - incredibly aggressive
        recovery_threshold_percent=55.0,  # Only resume normal ops when plenty of memory is available
        check_interval=0.1,               # Check extremely frequently (100ms)
        max_session_permit=20 if aggressive else 10,  # Double the concurrent sessions - pure chaos
        fairness_timeout=10.0,            # Extremely low timeout - rapid priority changes
        monitor=monitor
    )
    # Set up the spike schedule if enabled
    if spikes:
        spike_intervals = []
        # Create 3-5 random spike times
        num_spikes = random.randint(3, 5)
        for _ in range(num_spikes):
            # Schedule spikes at random chunks
            chunk_index = random.randint(1, len(url_chunks) - 1)
            spike_intervals.append(chunk_index)
        logger.info(f"Scheduled memory spikes at chunks: {spike_intervals}")
    try:
        async with AsyncWebCrawler(config=browser_config) as crawler:
            # Process URLs in chunks to simulate discovering URLs over time
            for chunk_index, url_chunk in enumerate(url_chunks):
                logger.info(f"Processing chunk {chunk_index + 1}/{len(url_chunks)} ({len(url_chunk)} URLs)")

                # Regular pressure increases
                if chunk_index % 2 == 0:
                    logger.info("Increasing memory pressure...")
                    memory_simulator.apply_pressure()

                # Memory spike if scheduled for this chunk
                if spikes and chunk_index in spike_intervals:
                    logger.info(f"⚠️ CREATING MASSIVE MEMORY SPIKE at chunk {chunk_index + 1} ⚠️")
                    # Create a nightmare scenario - multiple overlapping spikes
                    memory_simulator.spike_pressure(duration=10.0)  # 10-second spike
                    # 50% chance of a double spike (pure evil)
                    if random.random() < 0.5:
                        await asyncio.sleep(2.0)  # Wait 2 seconds
                        logger.info("💀 DOUBLE SPIKE - EXTREME MEMORY PRESSURE 💀")
                        memory_simulator.spike_pressure(duration=8.0)  # 8-second overlapping spike

                if STREAM:
                    # Stream mode - process results as they come in
                    async for result in dispatcher.run_urls_stream(
                        urls=url_chunk,
                        crawler=crawler,
                        config=run_config
                    ):
                        await process_result(result, test_results)
                else:
                    # Non-stream mode - get all results at once
                    results = await dispatcher.run_urls(
                        urls=url_chunk,
                        crawler=crawler,
                        config=run_config
                    )
                    await process_results(results, test_results)

                # Simulate discovering more URLs while others are still processing
                await asyncio.sleep(1)

                # RARELY release pressure - make the system fight for resources
                if chunk_index % 5 == 4:  # Less frequent releases
                    release_percent = random.choice([10, 15, 20])  # Smaller, inconsistent releases
                    logger.info(f"Releasing {release_percent}% of memory blocks - brief respite")
                    memory_simulator.release_pressure(percent=release_percent)
    except Exception as e:
        logger.error(f"Test error: {str(e)}")
        raise
    finally:
        # Release memory pressure
        memory_simulator.release_pressure()
        # Log final results
        test_results.log_summary()

    # Check the success criteria
    missing = url_count - len(test_results.completed_urls) - len(test_results.failed_urls)
    if missing > 0:
        logger.error(f"TEST FAILED: Not all URLs were processed. {missing} URLs missing.")
        return False

    logger.info("TEST PASSED: All URLs were processed without crashing.")
    return True
# Command-line entry point
if __name__ == "__main__":
    # Parse command-line arguments
    url_count = int(sys.argv[1]) if len(sys.argv) > 1 else 100
    target_memory = float(sys.argv[2]) if len(sys.argv) > 2 else 85.0

    # Check if stream mode is specified
    if len(sys.argv) > 3:
        STREAM = sys.argv[3].lower() in ('true', 'yes', '1', 'stream')

    # Check if aggressive mode is specified
    aggressive = False
    if len(sys.argv) > 4:
        aggressive = sys.argv[4].lower() in ('true', 'yes', '1', 'aggressive')

    print(f"Starting test with {url_count} URLs, {target_memory}% memory target")
    print(f"Stream mode: {STREAM}, Aggressive: {aggressive}")
    print("Logs will be written to the logs directory")
    print("Live display starting now...")

    # Run the test
    result = asyncio.run(run_memory_stress_test(
        url_count=url_count,
        target_memory_percent=target_memory,
        aggressive=aggressive
    ))

    # Exit with status code
    sys.exit(0 if result else 1)