diff --git a/CHANGELOG.md b/CHANGELOG.md
index b50e4eef..bc3da893 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,53 @@ All notable changes to Crawl4AI will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.6.0rc1-r1] - 2025-04-22
+
+### Added
+- Browser pooling with page pre-warming and fine-grained **geolocation, locale, and timezone** controls
+- Crawler pool manager (SDK + Docker API) for smarter resource allocation
+- Network & console log capture plus MHTML snapshot export
+- **Table extractor**: turn HTML `<table>`s into DataFrames or CSV with one flag
+- High-volume stress-test framework in `tests/memory` and API load scripts
+- MCP protocol endpoints with socket & SSE support; playground UI scaffold
+- Docs v2 revamp: TOC, GitHub badge, copy-code buttons, Docker API demo
+- “Ask AI” helper button *(work-in-progress, shipping soon)*
+- New examples: geo-location usage, network/console capture, Docker API, markdown source selection, crypto analysis
+- Expanded automated test suites for browser, Docker, MCP and memory benchmarks
+
+### Changed
+- Consolidated and renamed browser strategies; legacy docker strategy modules removed
+- `ProxyConfig` moved to `async_configs`
+- Server migrated to pool-based crawler management
+- FastAPI validators replace custom query validation
+- Docker build now uses Chromium base image
+- Large-scale repo tidy-up (≈36k insertions, ≈5k deletions)
+
+### Fixed
+- Async crawler session leak, duplicate-visit handling, URL normalisation
+- Target-element regressions in scraping strategies
+- Logged-URL readability, encoded-URL decoding, middle truncation for long URLs
+- Closed issues: #701, #733, #756, #774, #804, #822, #839, #841, #842, #843, #867, #902, #911
+
+### Removed
+- Obsolete modules under `crawl4ai/browser/*` superseded by the new pooled browser layer
+
+### Deprecated
+- Old markdown generator names now alias `DefaultMarkdownGenerator` and emit warnings
+
+---
+
+#### Upgrade notes
+1. Update any direct imports from `crawl4ai/browser/*` to the new pooled browser modules
+2. If you override `AsyncPlaywrightCrawlerStrategy.get_page`, adopt the new signature
+3. Rebuild Docker images to pull the new Chromium layer
+4. Switch to `DefaultMarkdownGenerator` (or silence the deprecation warning)
+
+---
+
+`121 files changed, ≈36,223 insertions, ≈4,975 deletions`
+
+
 ### [Feature] 2025-04-21
 - Implemented MCP protocol for machine-to-machine communication
 - Added WebSocket and SSE transport for MCP server
diff --git a/Dockerfile b/Dockerfile
index d32639a5..7ea648f9 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,5 +1,10 @@
 FROM python:3.10-slim
 
+# C4ai version
+ARG C4AI_VER=0.6.0
+ENV C4AI_VERSION=$C4AI_VER
+LABEL c4ai.version=$C4AI_VER
+
 # Set build arguments
 ARG APP_HOME=/app
 ARG GITHUB_REPO=https://github.com/unclecode/crawl4ai.git
diff --git a/crawl4ai/__version__.py b/crawl4ai/__version__.py
index cc2aaa57..06e10ed9 100644
--- a/crawl4ai/__version__.py
+++ b/crawl4ai/__version__.py
@@ -1,2 +1,3 @@
 # crawl4ai/_version.py
-__version__ = "0.5.0.post8"
+__version__ = "0.6.0rc1"
+
diff --git a/deploy/docker/README-new.md b/deploy/docker/README-new.md
deleted file mode 100644
index 3a9bdf52..00000000
--- a/deploy/docker/README-new.md
+++ /dev/null
@@ -1,644 +0,0 @@
-# Crawl4AI Docker Guide 🐳
-
-## Table of Contents
-- [Prerequisites](#prerequisites)
-- [Installation](#installation)
-  - [Option 1: Using Docker Compose (Recommended)](#option-1-using-docker-compose-recommended)
-  - [Option 2: Manual Local Build & Run](#option-2-manual-local-build--run)
-  - [Option 3: Using Pre-built Docker Hub Images](#option-3-using-pre-built-docker-hub-images)
-- [Dockerfile Parameters](#dockerfile-parameters)
-- [Using the API](#using-the-api)
-  - [Understanding Request Schema](#understanding-request-schema)
-  - [REST API Examples](#rest-api-examples)
-  - [Python SDK](#python-sdk)
-- [Metrics & Monitoring](#metrics--monitoring)
-- [Deployment Scenarios](#deployment-scenarios)
-- [Complete Examples](#complete-examples)
-- [Server Configuration](#server-configuration)
-  - [Understanding config.yml](#understanding-configyml)
-  - [JWT Authentication](#jwt-authentication)
-  - [Configuration Tips and Best Practices](#configuration-tips-and-best-practices)
-  - [Customizing Your Configuration](#customizing-your-configuration)
-  - [Configuration Recommendations](#configuration-recommendations)
-- [Getting Help](#getting-help)
-
-## Prerequisites
-
-Before we dive in, make sure you have:
-- Docker installed and running (version 20.10.0 or higher), including `docker compose` (usually bundled with Docker Desktop).
-- `git` for cloning the repository.
-- At least 4GB of RAM available for the container (more recommended for heavy use).
-- Python 3.10+ (if using the Python SDK).
-- Node.js 16+ (if using the Node.js examples).
-
-> 💡 **Pro tip**: Run `docker info` to check your Docker installation and available resources.
-
-## Installation
-
-We offer several ways to get the Crawl4AI server running. Docker Compose is the easiest way to manage local builds and runs.
-
-### Option 1: Using Docker Compose (Recommended)
-
-Docker Compose simplifies building and running the service, especially for local development and testing across different platforms.
-
-#### 1. Clone Repository
-
-```bash
-git clone https://github.com/unclecode/crawl4ai.git
-cd crawl4ai
-```
-
-#### 2. Environment Setup (API Keys)
-
-If you plan to use LLMs, copy the example environment file and add your API keys. This file should be in the **project root directory**.
- -```bash -# Make sure you are in the 'crawl4ai' root directory -cp deploy/docker/.llm.env.example .llm.env - -# Now edit .llm.env and add your API keys -# Example content: -# OPENAI_API_KEY=sk-your-key -# ANTHROPIC_API_KEY=your-anthropic-key -# ... -``` -> 🔑 **Note**: Keep your API keys secure! Never commit `.llm.env` to version control. - -#### 3. Build and Run with Compose - -The `docker-compose.yml` file in the project root defines services for different scenarios using **profiles**. - -* **Build and Run Locally (AMD64):** - ```bash - # Builds the image locally using Dockerfile and runs it - docker compose --profile local-amd64 up --build -d - ``` - -* **Build and Run Locally (ARM64):** - ```bash - # Builds the image locally using Dockerfile and runs it - docker compose --profile local-arm64 up --build -d - ``` - -* **Run Pre-built Image from Docker Hub (AMD64):** - ```bash - # Pulls and runs the specified AMD64 image from Docker Hub - # (Set VERSION env var for specific tags, e.g., VERSION=0.5.1-d1) - docker compose --profile hub-amd64 up -d - ``` - -* **Run Pre-built Image from Docker Hub (ARM64):** - ```bash - # Pulls and runs the specified ARM64 image from Docker Hub - docker compose --profile hub-arm64 up -d - ``` - -> The server will be available at `http://localhost:11235`. - -#### 4. Stopping Compose Services - -```bash -# Stop the service(s) associated with a profile (e.g., local-amd64) -docker compose --profile local-amd64 down -``` - -### Option 2: Manual Local Build & Run - -If you prefer not to use Docker Compose for local builds. - -#### 1. Clone Repository & Setup Environment - -Follow steps 1 and 2 from the Docker Compose section above (clone repo, `cd crawl4ai`, create `.llm.env` in the root). - -#### 2. Build the Image (Multi-Arch) - -Use `docker buildx` to build the image. This example builds for multiple platforms and loads the image matching your host architecture into the local Docker daemon. - -```bash -# Make sure you are in the 'crawl4ai' root directory -docker buildx build --platform linux/amd64,linux/arm64 -t crawl4ai-local:latest --load . -``` - -#### 3. Run the Container - -* **Basic run (no LLM support):** - ```bash - # Replace --platform if your host is ARM64 - docker run -d \ - -p 11235:11235 \ - --name crawl4ai-standalone \ - --shm-size=1g \ - --platform linux/amd64 \ - crawl4ai-local:latest - ``` - -* **With LLM support:** - ```bash - # Make sure .llm.env is in the current directory (project root) - # Replace --platform if your host is ARM64 - docker run -d \ - -p 11235:11235 \ - --name crawl4ai-standalone \ - --env-file .llm.env \ - --shm-size=1g \ - --platform linux/amd64 \ - crawl4ai-local:latest - ``` - -> The server will be available at `http://localhost:11235`. - -#### 4. Stopping the Manual Container - -```bash -docker stop crawl4ai-standalone && docker rm crawl4ai-standalone -``` - -### Option 3: Using Pre-built Docker Hub Images - -Pull and run images directly from Docker Hub without building locally. - -#### 1. Pull the Image - -We use a versioning scheme like `LIBRARY_VERSION-dREVISION` (e.g., `0.5.1-d1`). The `latest` tag points to the most recent stable release. Images are built with multi-arch manifests, so Docker usually pulls the correct version for your system automatically. - -```bash -# Pull a specific version (recommended for stability) -docker pull unclecode/crawl4ai:0.5.1-d1 - -# Or pull the latest stable version -docker pull unclecode/crawl4ai:latest -``` - -#### 2. 
Setup Environment (API Keys) - -If using LLMs, create the `.llm.env` file in a directory of your choice, similar to Step 2 in the Compose section. - -#### 3. Run the Container - -* **Basic run:** - ```bash - docker run -d \ - -p 11235:11235 \ - --name crawl4ai-hub \ - --shm-size=1g \ - unclecode/crawl4ai:0.5.1-d1 # Or use :latest - ``` - -* **With LLM support:** - ```bash - # Make sure .llm.env is in the current directory you are running docker from - docker run -d \ - -p 11235:11235 \ - --name crawl4ai-hub \ - --env-file .llm.env \ - --shm-size=1g \ - unclecode/crawl4ai:0.5.1-d1 # Or use :latest - ``` - -> The server will be available at `http://localhost:11235`. - -#### 4. Stopping the Hub Container - -```bash -docker stop crawl4ai-hub && docker rm crawl4ai-hub -``` - -#### Docker Hub Versioning Explained - -* **Image Name:** `unclecode/crawl4ai` -* **Tag Format:** `LIBRARY_VERSION-dREVISION` - * `LIBRARY_VERSION`: The Semantic Version of the core `crawl4ai` Python library included (e.g., `0.5.1`). - * `dREVISION`: An incrementing number (starting at `d1`) for Docker build changes made *without* changing the library version (e.g., base image updates, dependency fixes). Resets to `d1` for each new `LIBRARY_VERSION`. -* **Example:** `unclecode/crawl4ai:0.5.1-d1` -* **`latest` Tag:** Points to the most recent stable `LIBRARY_VERSION-dREVISION`. -* **Multi-Arch:** Images support `linux/amd64` and `linux/arm64`. Docker automatically selects the correct architecture. - ---- - -*(Rest of the document remains largely the same, but with key updates below)* - ---- - -## Dockerfile Parameters - -You can customize the image build process using build arguments (`--build-arg`). These are typically used via `docker buildx build` or within the `docker-compose.yml` file. - -```bash -# Example: Build with 'all' features using buildx -docker buildx build \ - --platform linux/amd64,linux/arm64 \ - --build-arg INSTALL_TYPE=all \ - -t yourname/crawl4ai-all:latest \ - --load \ - . # Build from root context -``` - -### Build Arguments Explained - -| Argument | Description | Default | Options | -| :----------- | :--------------------------------------- | :-------- | :--------------------------------- | -| INSTALL_TYPE | Feature set | `default` | `default`, `all`, `torch`, `transformer` | -| ENABLE_GPU | GPU support (CUDA for AMD64) | `false` | `true`, `false` | -| APP_HOME | Install path inside container (advanced) | `/app` | any valid path | -| USE_LOCAL | Install library from local source | `true` | `true`, `false` | -| GITHUB_REPO | Git repo to clone if USE_LOCAL=false | *(see Dockerfile)* | any git URL | -| GITHUB_BRANCH| Git branch to clone if USE_LOCAL=false | `main` | any branch name | - -*(Note: PYTHON_VERSION is fixed by the `FROM` instruction in the Dockerfile)* - -### Build Best Practices - -1. **Choose the Right Install Type** - * `default`: Basic installation, smallest image size. Suitable for most standard web scraping and markdown generation. - * `all`: Full features including `torch` and `transformers` for advanced extraction strategies (e.g., CosineStrategy, certain LLM filters). Significantly larger image. Ensure you need these extras. -2. **Platform Considerations** - * Use `buildx` for building multi-architecture images, especially for pushing to registries. - * Use `docker compose` profiles (`local-amd64`, `local-arm64`) for easy platform-specific local builds. -3. 
**Performance Optimization** - * The image automatically includes platform-specific optimizations (OpenMP for AMD64, OpenBLAS for ARM64). - ---- - -## Using the API - -Communicate with the running Docker server via its REST API (defaulting to `http://localhost:11235`). You can use the Python SDK or make direct HTTP requests. - -### Python SDK - -Install the SDK: `pip install crawl4ai` - -```python -import asyncio -from crawl4ai.docker_client import Crawl4aiDockerClient -from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode # Assuming you have crawl4ai installed - -async def main(): - # Point to the correct server port - async with Crawl4aiDockerClient(base_url="http://localhost:11235", verbose=True) as client: - # If JWT is enabled on the server, authenticate first: - # await client.authenticate("user@example.com") # See Server Configuration section - - # Example Non-streaming crawl - print("--- Running Non-Streaming Crawl ---") - results = await client.crawl( - ["https://httpbin.org/html"], - browser_config=BrowserConfig(headless=True), # Use library classes for config aid - crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS) - ) - if results: # client.crawl returns None on failure - print(f"Non-streaming results success: {results.success}") - if results.success: - for result in results: # Iterate through the CrawlResultContainer - print(f"URL: {result.url}, Success: {result.success}") - else: - print("Non-streaming crawl failed.") - - - # Example Streaming crawl - print("\n--- Running Streaming Crawl ---") - stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS) - try: - async for result in await client.crawl( # client.crawl returns an async generator for streaming - ["https://httpbin.org/html", "https://httpbin.org/links/5/0"], - browser_config=BrowserConfig(headless=True), - crawler_config=stream_config - ): - print(f"Streamed result: URL: {result.url}, Success: {result.success}") - except Exception as e: - print(f"Streaming crawl failed: {e}") - - - # Example Get schema - print("\n--- Getting Schema ---") - schema = await client.get_schema() - print(f"Schema received: {bool(schema)}") # Print whether schema was received - -if __name__ == "__main__": - asyncio.run(main()) -``` - -*(SDK parameters like timeout, verify_ssl etc. remain the same)* - -### Second Approach: Direct API Calls - -Crucially, when sending configurations directly via JSON, they **must** follow the `{"type": "ClassName", "params": {...}}` structure for any non-primitive value (like config objects or strategies). Dictionaries must be wrapped as `{"type": "dict", "value": {...}}`. 
- -*(Keep the detailed explanation of Configuration Structure, Basic Pattern, Simple vs Complex, Strategy Pattern, Complex Nested Example, Quick Grammar Overview, Important Rules, Pro Tip)* - -#### More Examples *(Ensure Schema example uses type/value wrapper)* - -**Advanced Crawler Configuration** -*(Keep example, ensure cache_mode uses valid enum value like "bypass")* - -**Extraction Strategy** -```json -{ - "crawler_config": { - "type": "CrawlerRunConfig", - "params": { - "extraction_strategy": { - "type": "JsonCssExtractionStrategy", - "params": { - "schema": { - "type": "dict", - "value": { - "baseSelector": "article.post", - "fields": [ - {"name": "title", "selector": "h1", "type": "text"}, - {"name": "content", "selector": ".content", "type": "html"} - ] - } - } - } - } - } - } -} -``` - -**LLM Extraction Strategy** *(Keep example, ensure schema uses type/value wrapper)* -*(Keep Deep Crawler Example)* - -### REST API Examples - -Update URLs to use port `11235`. - -#### Simple Crawl - -```python -import requests - -# Configuration objects converted to the required JSON structure -browser_config_payload = { - "type": "BrowserConfig", - "params": {"headless": True} -} -crawler_config_payload = { - "type": "CrawlerRunConfig", - "params": {"stream": False, "cache_mode": "bypass"} # Use string value of enum -} - -crawl_payload = { - "urls": ["https://httpbin.org/html"], - "browser_config": browser_config_payload, - "crawler_config": crawler_config_payload -} -response = requests.post( - "http://localhost:11235/crawl", # Updated port - # headers={"Authorization": f"Bearer {token}"}, # If JWT is enabled - json=crawl_payload -) -print(f"Status Code: {response.status_code}") -if response.ok: - print(response.json()) -else: - print(f"Error: {response.text}") - -``` - -#### Streaming Results - -```python -import json -import httpx # Use httpx for async streaming example - -async def test_stream_crawl(token: str = None): # Made token optional - """Test the /crawl/stream endpoint with multiple URLs.""" - url = "http://localhost:11235/crawl/stream" # Updated port - payload = { - "urls": [ - "https://httpbin.org/html", - "https://httpbin.org/links/5/0", - ], - "browser_config": { - "type": "BrowserConfig", - "params": {"headless": True, "viewport": {"type": "dict", "value": {"width": 1200, "height": 800}}} # Viewport needs type:dict - }, - "crawler_config": { - "type": "CrawlerRunConfig", - "params": {"stream": True, "cache_mode": "bypass"} - } - } - - headers = {} - # if token: - # headers = {"Authorization": f"Bearer {token}"} # If JWT is enabled - - try: - async with httpx.AsyncClient() as client: - async with client.stream("POST", url, json=payload, headers=headers, timeout=120.0) as response: - print(f"Status: {response.status_code} (Expected: 200)") - response.raise_for_status() # Raise exception for bad status codes - - # Read streaming response line-by-line (NDJSON) - async for line in response.aiter_lines(): - if line: - try: - data = json.loads(line) - # Check for completion marker - if data.get("status") == "completed": - print("Stream completed.") - break - print(f"Streamed Result: {json.dumps(data, indent=2)}") - except json.JSONDecodeError: - print(f"Warning: Could not decode JSON line: {line}") - - except httpx.HTTPStatusError as e: - print(f"HTTP error occurred: {e.response.status_code} - {e.response.text}") - except Exception as e: - print(f"Error in streaming crawl test: {str(e)}") - -# To run this example: -# import asyncio -# asyncio.run(test_stream_crawl()) -``` - ---- - -## 
Metrics & Monitoring - -Keep an eye on your crawler with these endpoints: - -- `/health` - Quick health check -- `/metrics` - Detailed Prometheus metrics -- `/schema` - Full API schema - -Example health check: -```bash -curl http://localhost:11235/health -``` - ---- - -*(Deployment Scenarios and Complete Examples sections remain the same, maybe update links if examples moved)* - ---- - -## Server Configuration - -The server's behavior can be customized through the `config.yml` file. - -### Understanding config.yml - -The configuration file is loaded from `/app/config.yml` inside the container. By default, the file from `deploy/docker/config.yml` in the repository is copied there during the build. - -Here's a detailed breakdown of the configuration options (using defaults from `deploy/docker/config.yml`): - -```yaml -# Application Configuration -app: - title: "Crawl4AI API" - version: "1.0.0" # Consider setting this to match library version, e.g., "0.5.1" - host: "0.0.0.0" - port: 8020 # NOTE: This port is used ONLY when running server.py directly. Gunicorn overrides this (see supervisord.conf). - reload: False # Default set to False - suitable for production - timeout_keep_alive: 300 - -# Default LLM Configuration -llm: - provider: "openai/gpt-4o-mini" - api_key_env: "OPENAI_API_KEY" - # api_key: sk-... # If you pass the API key directly then api_key_env will be ignored - -# Redis Configuration (Used by internal Redis server managed by supervisord) -redis: - host: "localhost" - port: 6379 - db: 0 - password: "" - # ... other redis options ... - -# Rate Limiting Configuration -rate_limiting: - enabled: True - default_limit: "1000/minute" - trusted_proxies: [] - storage_uri: "memory://" # Use "redis://localhost:6379" if you need persistent/shared limits - -# Security Configuration -security: - enabled: false # Master toggle for security features - jwt_enabled: false # Enable JWT authentication (requires security.enabled=true) - https_redirect: false # Force HTTPS (requires security.enabled=true) - trusted_hosts: ["*"] # Allowed hosts (use specific domains in production) - headers: # Security headers (applied if security.enabled=true) - x_content_type_options: "nosniff" - x_frame_options: "DENY" - content_security_policy: "default-src 'self'" - strict_transport_security: "max-age=63072000; includeSubDomains" - -# Crawler Configuration -crawler: - memory_threshold_percent: 95.0 - rate_limiter: - base_delay: [1.0, 2.0] # Min/max delay between requests in seconds for dispatcher - timeouts: - stream_init: 30.0 # Timeout for stream initialization - batch_process: 300.0 # Timeout for non-streaming /crawl processing - -# Logging Configuration -logging: - level: "INFO" - format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s" - -# Observability Configuration -observability: - prometheus: - enabled: True - endpoint: "/metrics" - health_check: - endpoint: "/health" -``` - -*(JWT Authentication section remains the same, just note the default port is now 11235 for requests)* - -*(Configuration Tips and Best Practices remain the same)* - -### Customizing Your Configuration - -You can override the default `config.yml`. - -#### Method 1: Modify Before Build - -1. Edit the `deploy/docker/config.yml` file in your local repository clone. -2. Build the image using `docker buildx` or `docker compose --profile local-... up --build`. The modified file will be copied into the image. - -#### Method 2: Runtime Mount (Recommended for Custom Deploys) - -1. 
Create your custom configuration file, e.g., `my-custom-config.yml` locally. Ensure it contains all necessary sections. -2. Mount it when running the container: - - * **Using `docker run`:** - ```bash - # Assumes my-custom-config.yml is in the current directory - docker run -d -p 11235:11235 \ - --name crawl4ai-custom-config \ - --env-file .llm.env \ - --shm-size=1g \ - -v $(pwd)/my-custom-config.yml:/app/config.yml \ - unclecode/crawl4ai:latest # Or your specific tag - ``` - - * **Using `docker-compose.yml`:** Add a `volumes` section to the service definition: - ```yaml - services: - crawl4ai-hub-amd64: # Or your chosen service - image: unclecode/crawl4ai:latest - profiles: ["hub-amd64"] - <<: *base-config - volumes: - # Mount local custom config over the default one in the container - - ./my-custom-config.yml:/app/config.yml - # Keep the shared memory volume from base-config - - /dev/shm:/dev/shm - ``` - *(Note: Ensure `my-custom-config.yml` is in the same directory as `docker-compose.yml`)* - -> 💡 When mounting, your custom file *completely replaces* the default one. Ensure it's a valid and complete configuration. - -### Configuration Recommendations - -1. **Security First** 🔒 - - Always enable security in production - - Use specific trusted_hosts instead of wildcards - - Set up proper rate limiting to protect your server - - Consider your environment before enabling HTTPS redirect - -2. **Resource Management** 💻 - - Adjust memory_threshold_percent based on available RAM - - Set timeouts according to your content size and network conditions - - Use Redis for rate limiting in multi-container setups - -3. **Monitoring** 📊 - - Enable Prometheus if you need metrics - - Set DEBUG logging in development, INFO in production - - Regular health check monitoring is crucial - -4. **Performance Tuning** ⚡ - - Start with conservative rate limiter delays - - Increase batch_process timeout for large content - - Adjust stream_init timeout based on initial response times - -## Getting Help - -We're here to help you succeed with Crawl4AI! Here's how to get support: - -- 📖 Check our [full documentation](https://docs.crawl4ai.com) -- 🐛 Found a bug? [Open an issue](https://github.com/unclecode/crawl4ai/issues) -- 💬 Join our [Discord community](https://discord.gg/crawl4ai) -- ⭐ Star us on GitHub to show support! - -## Summary - -In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment: -- Building and running the Docker container -- Configuring the environment -- Making API requests with proper typing -- Using the Python SDK -- Monitoring your deployment - -Remember, the examples in the `examples` folder are your friends - they show real-world usage patterns that you can adapt for your needs. - -Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀 - -Happy crawling! 
🕷️ diff --git a/deploy/docker/README.md b/deploy/docker/README.md index b4b6e414..1deebd50 100644 --- a/deploy/docker/README.md +++ b/deploy/docker/README.md @@ -3,395 +3,504 @@ ## Table of Contents - [Prerequisites](#prerequisites) - [Installation](#installation) - - [Local Build](#local-build) - - [Docker Hub](#docker-hub) + - [Option 1: Using Pre-built Docker Hub Images (Recommended)](#option-1-using-pre-built-docker-hub-images-recommended) + - [Option 2: Using Docker Compose](#option-2-using-docker-compose) + - [Option 3: Manual Local Build & Run](#option-3-manual-local-build--run) - [Dockerfile Parameters](#dockerfile-parameters) - [Using the API](#using-the-api) + - [Playground Interface](#playground-interface) + - [Python SDK](#python-sdk) - [Understanding Request Schema](#understanding-request-schema) - [REST API Examples](#rest-api-examples) - - [Python SDK](#python-sdk) +- [Additional API Endpoints](#additional-api-endpoints) + - [HTML Extraction Endpoint](#html-extraction-endpoint) + - [Screenshot Endpoint](#screenshot-endpoint) + - [PDF Export Endpoint](#pdf-export-endpoint) + - [JavaScript Execution Endpoint](#javascript-execution-endpoint) + - [Library Context Endpoint](#library-context-endpoint) +- [MCP (Model Context Protocol) Support](#mcp-model-context-protocol-support) + - [What is MCP?](#what-is-mcp) + - [Connecting via MCP](#connecting-via-mcp) + - [Using with Claude Code](#using-with-claude-code) + - [Available MCP Tools](#available-mcp-tools) + - [Testing MCP Connections](#testing-mcp-connections) + - [MCP Schemas](#mcp-schemas) - [Metrics & Monitoring](#metrics--monitoring) - [Deployment Scenarios](#deployment-scenarios) - [Complete Examples](#complete-examples) +- [Server Configuration](#server-configuration) + - [Understanding config.yml](#understanding-configyml) + - [JWT Authentication](#jwt-authentication) + - [Configuration Tips and Best Practices](#configuration-tips-and-best-practices) + - [Customizing Your Configuration](#customizing-your-configuration) + - [Configuration Recommendations](#configuration-recommendations) - [Getting Help](#getting-help) +- [Summary](#summary) ## Prerequisites Before we dive in, make sure you have: -- Docker installed and running (version 20.10.0 or higher) -- At least 4GB of RAM available for the container -- Python 3.10+ (if using the Python SDK) -- Node.js 16+ (if using the Node.js examples) +- Docker installed and running (version 20.10.0 or higher), including `docker compose` (usually bundled with Docker Desktop). +- `git` for cloning the repository. +- At least 4GB of RAM available for the container (more recommended for heavy use). +- Python 3.10+ (if using the Python SDK). +- Node.js 16+ (if using the Node.js examples). > 💡 **Pro tip**: Run `docker info` to check your Docker installation and available resources. ## Installation -### Local Build +We offer several ways to get the Crawl4AI server running. The quickest way is to use our pre-built Docker Hub images. -Let's get your local environment set up step by step! +### Option 1: Using Pre-built Docker Hub Images (Recommended) -#### 1. Building the Image +Pull and run images directly from Docker Hub without building locally. -First, clone the repository and build the Docker image: +#### 1. Pull the Image + +Our latest release candidate is `0.6.0rc1-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system. 
```bash -# Clone the repository -git clone https://github.com/unclecode/crawl4ai.git -cd crawl4ai/deploy +# Pull the release candidate (recommended for latest features) +docker pull unclecode/crawl4ai:0.6.0rc1-r1 -# Build the Docker image -docker build --platform=linux/amd64 --no-cache -t crawl4ai . - -# Or build for arm64 -docker build --platform=linux/arm64 --no-cache -t crawl4ai . +# Or pull the latest stable version +docker pull unclecode/crawl4ai:latest ``` -#### 2. Environment Setup +#### 2. Setup Environment (API Keys) -If you plan to use LLMs (Language Models), you'll need to set up your API keys. Create a `.llm.env` file: +If you plan to use LLMs, create a `.llm.env` file in your working directory: -```env +```bash +# Create a .llm.env file with your API keys +cat > .llm.env << EOL # OpenAI OPENAI_API_KEY=sk-your-key # Anthropic ANTHROPIC_API_KEY=your-anthropic-key -# DeepSeek -DEEPSEEK_API_KEY=your-deepseek-key - -# Check out https://docs.litellm.ai/docs/providers for more providers! +# Other providers as needed +# DEEPSEEK_API_KEY=your-deepseek-key +# GROQ_API_KEY=your-groq-key +# TOGETHER_API_KEY=your-together-key +# MISTRAL_API_KEY=your-mistral-key +# GEMINI_API_TOKEN=your-gemini-token +EOL ``` +> 🔑 **Note**: Keep your API keys secure! Never commit `.llm.env` to version control. -> 🔑 **Note**: Keep your API keys secure! Never commit them to version control. +#### 3. Run the Container -#### 3. Running the Container +* **Basic run:** + ```bash + docker run -d \ + -p 11235:11235 \ + --name crawl4ai \ + --shm-size=1g \ + unclecode/crawl4ai:0.6.0rc1-r1 + ``` -You have several options for running the container: +* **With LLM support:** + ```bash + # Make sure .llm.env is in the current directory + docker run -d \ + -p 11235:11235 \ + --name crawl4ai \ + --env-file .llm.env \ + --shm-size=1g \ + unclecode/crawl4ai:0.6.0rc1-r1 + ``` -Basic run (no LLM support): -```bash -docker run -d -p 8000:8000 --name crawl4ai crawl4ai -``` +> The server will be available at `http://localhost:11235`. Visit `/playground` to access the interactive testing interface. -With LLM support: -```bash -docker run -d -p 8000:8000 \ - --env-file .llm.env \ - --name crawl4ai \ - crawl4ai -``` - -Using host environment variables (Not a good practice, but works for local testing): -```bash -docker run -d -p 8000:8000 \ - --env-file .llm.env \ - --env "$(env)" \ - --name crawl4ai \ - crawl4ai -``` - -#### Multi-Platform Build -For distributing your image across different architectures, use `buildx`: +#### 4. Stopping the Container ```bash -# Set up buildx builder -docker buildx create --use +docker stop crawl4ai && docker rm crawl4ai +``` -# Build for multiple platforms +#### Docker Hub Versioning Explained + +* **Image Name:** `unclecode/crawl4ai` +* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.6.0rc1-r1`) + * `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library + * `SUFFIX`: Optional tag for release candidates (`rc1`) and revisions (`r1`) +* **`latest` Tag:** Points to the most recent stable version +* **Multi-Architecture Support:** All images support both `linux/amd64` and `linux/arm64` architectures through a single tag + +### Option 2: Using Docker Compose + +Docker Compose simplifies building and running the service, especially for local development and testing. + +#### 1. Clone Repository + +```bash +git clone https://github.com/unclecode/crawl4ai.git +cd crawl4ai +``` + +#### 2. 
Environment Setup (API Keys) + +If you plan to use LLMs, copy the example environment file and add your API keys. This file should be in the **project root directory**. + +```bash +# Make sure you are in the 'crawl4ai' root directory +cp deploy/docker/.llm.env.example .llm.env + +# Now edit .llm.env and add your API keys +``` + +#### 3. Build and Run with Compose + +The `docker-compose.yml` file in the project root provides a simplified approach that automatically handles architecture detection using buildx. + +* **Run Pre-built Image from Docker Hub:** + ```bash + # Pulls and runs the release candidate from Docker Hub + # Automatically selects the correct architecture + IMAGE=unclecode/crawl4ai:0.6.0rc1-r1 docker compose up -d + ``` + +* **Build and Run Locally:** + ```bash + # Builds the image locally using Dockerfile and runs it + # Automatically uses the correct architecture for your machine + docker compose up --build -d + ``` + +* **Customize the Build:** + ```bash + # Build with all features (includes torch and transformers) + INSTALL_TYPE=all docker compose up --build -d + + # Build with GPU support (for AMD64 platforms) + ENABLE_GPU=true docker compose up --build -d + ``` + +> The server will be available at `http://localhost:11235`. + +#### 4. Stopping the Service + +```bash +# Stop the service +docker compose down +``` + +### Option 3: Manual Local Build & Run + +If you prefer not to use Docker Compose for direct control over the build and run process. + +#### 1. Clone Repository & Setup Environment + +Follow steps 1 and 2 from the Docker Compose section above (clone repo, `cd crawl4ai`, create `.llm.env` in the root). + +#### 2. Build the Image (Multi-Arch) + +Use `docker buildx` to build the image. Crawl4AI now uses buildx to handle multi-architecture builds automatically. + +```bash +# Make sure you are in the 'crawl4ai' root directory +# Build for the current architecture and load it into Docker +docker buildx build -t crawl4ai-local:latest --load . + +# Or build for multiple architectures (useful for publishing) +docker buildx build --platform linux/amd64,linux/arm64 -t crawl4ai-local:latest --load . + +# Build with additional options +docker buildx build \ + --build-arg INSTALL_TYPE=all \ + --build-arg ENABLE_GPU=false \ + -t crawl4ai-local:latest --load . +``` + +#### 3. Run the Container + +* **Basic run (no LLM support):** + ```bash + docker run -d \ + -p 11235:11235 \ + --name crawl4ai-standalone \ + --shm-size=1g \ + crawl4ai-local:latest + ``` + +* **With LLM support:** + ```bash + # Make sure .llm.env is in the current directory (project root) + docker run -d \ + -p 11235:11235 \ + --name crawl4ai-standalone \ + --env-file .llm.env \ + --shm-size=1g \ + crawl4ai-local:latest + ``` + +> The server will be available at `http://localhost:11235`. + +#### 4. Stopping the Manual Container + +```bash +docker stop crawl4ai-standalone && docker rm crawl4ai-standalone +``` + +--- + +## MCP (Model Context Protocol) Support + +Crawl4AI server includes support for the Model Context Protocol (MCP), allowing you to connect the server's capabilities directly to MCP-compatible clients like Claude Code. + +### What is MCP? + +MCP is an open protocol that standardizes how applications provide context to LLMs. It allows AI models to access external tools, data sources, and services through a standardized interface. 
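Before wiring Crawl4AI into a full MCP client, it can help to confirm that the server's MCP endpoint is reachable at all. The snippet below is a minimal illustrative check against the SSE endpoint listed in the next subsection; it assumes the endpoint speaks standard `text/event-stream` SSE, and the exact events you see may vary between versions.

```bash
# Quick reachability check for the MCP SSE endpoint (illustrative sketch).
# -i prints the response headers, -N disables buffering so streamed events
# appear as they arrive; stop with Ctrl+C. A 200 response with a
# text/event-stream body indicates the MCP layer is up.
curl -i -N -H "Accept: text/event-stream" http://localhost:11235/mcp/sse
```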
+ +### Connecting via MCP + +The Crawl4AI server exposes two MCP endpoints: + +- **Server-Sent Events (SSE)**: `http://localhost:11235/mcp/sse` +- **WebSocket**: `ws://localhost:11235/mcp/ws` + +### Using with Claude Code + +You can add Crawl4AI as an MCP tool provider in Claude Code with a simple command: + +```bash +# Add the Crawl4AI server as an MCP provider +claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse + +# List all MCP providers to verify it was added +claude mcp list +``` + +Once connected, Claude Code can directly use Crawl4AI's capabilities like screenshot capture, PDF generation, and HTML processing without having to make separate API calls. + +### Available MCP Tools + +When connected via MCP, the following tools are available: + +- `md` - Generate markdown from web content +- `html` - Extract preprocessed HTML +- `screenshot` - Capture webpage screenshots +- `pdf` - Generate PDF documents +- `execute_js` - Run JavaScript on web pages +- `crawl` - Perform multi-URL crawling +- `ask` - Query the Crawl4AI library context + +### Testing MCP Connections + +You can test the MCP WebSocket connection using the test file included in the repository: + +```bash +# From the repository root +python tests/mcp/test_mcp_socket.py +``` + +### MCP Schemas + +Access the MCP tool schemas at `http://localhost:11235/mcp/schema` for detailed information on each tool's parameters and capabilities. + +--- + +## Additional API Endpoints + +In addition to the core `/crawl` and `/crawl/stream` endpoints, the server provides several specialized endpoints: + +### HTML Extraction Endpoint + +``` +POST /html +``` + +Crawls the URL and returns preprocessed HTML optimized for schema extraction. + +```json +{ + "url": "https://example.com" +} +``` + +### Screenshot Endpoint + +``` +POST /screenshot +``` + +Captures a full-page PNG screenshot of the specified URL. + +```json +{ + "url": "https://example.com", + "screenshot_wait_for": 2, + "output_path": "/path/to/save/screenshot.png" +} +``` + +- `screenshot_wait_for`: Optional delay in seconds before capture (default: 2) +- `output_path`: Optional path to save the screenshot (recommended) + +### PDF Export Endpoint + +``` +POST /pdf +``` + +Generates a PDF document of the specified URL. + +```json +{ + "url": "https://example.com", + "output_path": "/path/to/save/document.pdf" +} +``` + +- `output_path`: Optional path to save the PDF (recommended) + +### JavaScript Execution Endpoint + +``` +POST /execute_js +``` + +Executes JavaScript snippets on the specified URL and returns the full crawl result. + +```json +{ + "url": "https://example.com", + "scripts": [ + "return document.title", + "return Array.from(document.querySelectorAll('a')).map(a => a.href)" + ] +} +``` + +- `scripts`: List of JavaScript snippets to execute sequentially + +--- + +## Dockerfile Parameters + +You can customize the image build process using build arguments (`--build-arg`). These are typically used via `docker buildx build` or within the `docker-compose.yml` file. + +```bash +# Example: Build with 'all' features using buildx docker buildx build \ --platform linux/amd64,linux/arm64 \ - -t crawl4ai \ - --push \ - . -``` - -> 💡 **Note**: Multi-platform builds require Docker Buildx and need to be pushed to a registry. - -#### Development Build -For development, you might want to enable all features: - -```bash -docker build -t crawl4ai --build-arg INSTALL_TYPE=all \ - --build-arg PYTHON_VERSION=3.10 \ - --build-arg ENABLE_GPU=true \ - . 
-``` - -#### GPU-Enabled Build -If you plan to use GPU acceleration: - -```bash -docker build -t crawl4ai - --build-arg ENABLE_GPU=true \ - deploy/docker/ + -t yourname/crawl4ai-all:latest \ + --load \ + . # Build from root context ``` ### Build Arguments Explained -| Argument | Description | Default | Options | -|----------|-------------|---------|----------| -| PYTHON_VERSION | Python version | 3.10 | 3.8, 3.9, 3.10 | -| INSTALL_TYPE | Feature set | default | default, all, torch, transformer | -| ENABLE_GPU | GPU support | false | true, false | -| APP_HOME | Install path | /app | any valid path | +| Argument | Description | Default | Options | +| :----------- | :--------------------------------------- | :-------- | :--------------------------------- | +| INSTALL_TYPE | Feature set | `default` | `default`, `all`, `torch`, `transformer` | +| ENABLE_GPU | GPU support (CUDA for AMD64) | `false` | `true`, `false` | +| APP_HOME | Install path inside container (advanced) | `/app` | any valid path | +| USE_LOCAL | Install library from local source | `true` | `true`, `false` | +| GITHUB_REPO | Git repo to clone if USE_LOCAL=false | *(see Dockerfile)* | any git URL | +| GITHUB_BRANCH| Git branch to clone if USE_LOCAL=false | `main` | any branch name | + +*(Note: PYTHON_VERSION is fixed by the `FROM` instruction in the Dockerfile)* ### Build Best Practices -1. **Choose the Right Install Type** - - `default`: Basic installation, smallest image, to be honest, I use this most of the time. - - `all`: Full features, larger image (include transformer, and nltk, make sure you really need them) +1. **Choose the Right Install Type** + * `default`: Basic installation, smallest image size. Suitable for most standard web scraping and markdown generation. + * `all`: Full features including `torch` and `transformers` for advanced extraction strategies (e.g., CosineStrategy, certain LLM filters). Significantly larger image. Ensure you need these extras. +2. **Platform Considerations** + * Use `buildx` for building multi-architecture images, especially for pushing to registries. + * Use `docker compose` profiles (`local-amd64`, `local-arm64`) for easy platform-specific local builds. +3. **Performance Optimization** + * The image automatically includes platform-specific optimizations (OpenMP for AMD64, OpenBLAS for ARM64). -2. **Platform Considerations** - - Let Docker auto-detect platform unless you need cross-compilation - - Use --platform for specific architecture requirements - - Consider buildx for multi-architecture distribution - -3. **Performance Optimization** - - The image automatically includes platform-specific optimizations - - AMD64 gets OpenMP optimizations - - ARM64 gets OpenBLAS optimizations - -### Docker Hub - -> 🚧 Coming soon! The image will be available at `crawl4ai`. Stay tuned! +--- ## Using the API -In the following sections, we discuss two ways to communicate with the Docker server. One option is to use the client SDK that I developed for Python, and I will soon develop one for Node.js. I highly recommend this approach to avoid mistakes. Alternatively, you can take a more technical route by using the JSON structure and passing it to all the URLs, which I will explain in detail. +Communicate with the running Docker server via its REST API (defaulting to `http://localhost:11235`). You can use the Python SDK or make direct HTTP requests. + +### Playground Interface + +A built-in web playground is available at `http://localhost:11235/playground` for testing and generating API requests. 
The playground allows you to: + +1. Configure `CrawlerRunConfig` and `BrowserConfig` using the main library's Python syntax +2. Test crawling operations directly from the interface +3. Generate corresponding JSON for REST API requests based on your configuration + +This is the easiest way to translate Python configuration to JSON requests when building integrations. ### Python SDK -The SDK makes things easier! Here's how to use it: +Install the SDK: `pip install crawl4ai` ```python +import asyncio from crawl4ai.docker_client import Crawl4aiDockerClient -from crawl4ai import BrowserConfig, CrawlerRunConfig +from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode # Assuming you have crawl4ai installed async def main(): - async with Crawl4aiDockerClient(base_url="http://localhost:8000", verbose=True) as client: - # If JWT is enabled, you can authenticate like this: (more on this later) - # await client.authenticate("test@example.com") - - # Non-streaming crawl + # Point to the correct server port + async with Crawl4aiDockerClient(base_url="http://localhost:11235", verbose=True) as client: + # If JWT is enabled on the server, authenticate first: + # await client.authenticate("user@example.com") # See Server Configuration section + + # Example Non-streaming crawl + print("--- Running Non-Streaming Crawl ---") results = await client.crawl( - ["https://example.com", "https://python.org"], - browser_config=BrowserConfig(headless=True), - crawler_config=CrawlerRunConfig() + ["https://httpbin.org/html"], + browser_config=BrowserConfig(headless=True), # Use library classes for config aid + crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS) ) - print(f"Non-streaming results: {results}") - - # Streaming crawl - crawler_config = CrawlerRunConfig(stream=True) - async for result in await client.crawl( - ["https://example.com", "https://python.org"], - browser_config=BrowserConfig(headless=True), - crawler_config=crawler_config - ): - print(f"Streamed result: {result}") - - # Get schema + if results: # client.crawl returns None on failure + print(f"Non-streaming results success: {results.success}") + if results.success: + for result in results: # Iterate through the CrawlResultContainer + print(f"URL: {result.url}, Success: {result.success}") + else: + print("Non-streaming crawl failed.") + + + # Example Streaming crawl + print("\n--- Running Streaming Crawl ---") + stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS) + try: + async for result in await client.crawl( # client.crawl returns an async generator for streaming + ["https://httpbin.org/html", "https://httpbin.org/links/5/0"], + browser_config=BrowserConfig(headless=True), + crawler_config=stream_config + ): + print(f"Streamed result: URL: {result.url}, Success: {result.success}") + except Exception as e: + print(f"Streaming crawl failed: {e}") + + + # Example Get schema + print("\n--- Getting Schema ---") schema = await client.get_schema() - print(f"Schema: {schema}") + print(f"Schema received: {bool(schema)}") # Print whether schema was received if __name__ == "__main__": asyncio.run(main()) ``` -`Crawl4aiDockerClient` is an async context manager that handles the connection for you. You can pass in optional parameters for more control: +*(SDK parameters like timeout, verify_ssl etc. 
remain the same)* -- `base_url` (str): Base URL of the Crawl4AI Docker server -- `timeout` (float): Default timeout for requests in seconds -- `verify_ssl` (bool): Whether to verify SSL certificates -- `verbose` (bool): Whether to show logging output -- `log_file` (str, optional): Path to log file if file logging is desired +### Second Approach: Direct API Calls -This client SDK generates a properly structured JSON request for the server's HTTP API. +Crucially, when sending configurations directly via JSON, they **must** follow the `{"type": "ClassName", "params": {...}}` structure for any non-primitive value (like config objects or strategies). Dictionaries must be wrapped as `{"type": "dict", "value": {...}}`. -## Second Approach: Direct API Calls +*(Keep the detailed explanation of Configuration Structure, Basic Pattern, Simple vs Complex, Strategy Pattern, Complex Nested Example, Quick Grammar Overview, Important Rules, Pro Tip)* -This is super important! The API expects a specific structure that matches our Python classes. Let me show you how it works. - -### Understanding Configuration Structure - -Let's dive deep into how configurations work in Crawl4AI. Every configuration object follows a consistent pattern of `type` and `params`. This structure enables complex, nested configurations while maintaining clarity. - -#### The Basic Pattern - -Try this in Python to understand the structure: -```python -from crawl4ai import BrowserConfig - -# Create a config and see its structure -config = BrowserConfig(headless=True) -print(config.dump()) -``` - -This outputs: -```json -{ - "type": "BrowserConfig", - "params": { - "headless": true - } -} -``` - -#### Simple vs Complex Values - -The structure follows these rules: -- Simple values (strings, numbers, booleans, lists) are passed directly -- Complex values (classes, dictionaries) use the type-params pattern - -For example, with dictionaries: -```json -{ - "browser_config": { - "type": "BrowserConfig", - "params": { - "headless": true, // Simple boolean - direct value - "viewport": { // Complex dictionary - needs type-params - "type": "dict", - "value": { - "width": 1200, - "height": 800 - } - } - } - } -} -``` - -#### Strategy Pattern and Nesting - -Strategies (like chunking or content filtering) demonstrate why we need this structure. Consider this chunking configuration: - -```json -{ - "crawler_config": { - "type": "CrawlerRunConfig", - "params": { - "chunking_strategy": { - "type": "RegexChunking", // Strategy implementation - "params": { - "patterns": ["\n\n", "\\.\\s+"] - } - } - } - } -} -``` - -Here, `chunking_strategy` accepts any chunking implementation. The `type` field tells the system which strategy to use, and `params` configures that specific strategy. - -#### Complex Nested Example - -Let's look at a more complex example with content filtering: - -```json -{ - "crawler_config": { - "type": "CrawlerRunConfig", - "params": { - "markdown_generator": { - "type": "DefaultMarkdownGenerator", - "params": { - "content_filter": { - "type": "PruningContentFilter", - "params": { - "threshold": 0.48, - "threshold_type": "fixed" - } - } - } - } - } - } -} -``` - -This shows how deeply configurations can nest while maintaining a consistent structure. 
- -#### Quick Grammar Overview -``` -config := { - "type": string, - "params": { - key: simple_value | complex_value - } -} - -simple_value := string | number | boolean | [simple_value] -complex_value := config | dict_value - -dict_value := { - "type": "dict", - "value": object -} -``` - -#### Important Rules 🚨 - -- Always use the type-params pattern for class instances -- Use direct values for primitives (numbers, strings, booleans) -- Wrap dictionaries with {"type": "dict", "value": {...}} -- Arrays/lists are passed directly without type-params -- All parameters are optional unless specifically required - -#### Pro Tip 💡 - -The easiest way to get the correct structure is to: -1. Create configuration objects in Python -2. Use the `dump()` method to see their JSON representation -3. Use that JSON in your API calls - -Example: -```python -from crawl4ai import CrawlerRunConfig, PruningContentFilter - -config = CrawlerRunConfig( - markdown_generator=DefaultMarkdownGenerator( - content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed") - ), - cache_mode= CacheMode.BYPASS -) -print(config.dump()) # Use this JSON in your API calls -``` - - -#### More Examples +#### More Examples *(Ensure Schema example uses type/value wrapper)* **Advanced Crawler Configuration** +*(Keep example, ensure cache_mode uses valid enum value like "bypass")* -```json -{ - "urls": ["https://example.com"], - "crawler_config": { - "type": "CrawlerRunConfig", - "params": { - "cache_mode": "bypass", - "markdown_generator": { - "type": "DefaultMarkdownGenerator", - "params": { - "content_filter": { - "type": "PruningContentFilter", - "params": { - "threshold": 0.48, - "threshold_type": "fixed", - "min_word_threshold": 0 - } - } - } - } - } - } -} -``` - -**Extraction Strategy**: - +**Extraction Strategy** ```json { "crawler_config": { @@ -401,11 +510,14 @@ print(config.dump()) # Use this JSON in your API calls "type": "JsonCssExtractionStrategy", "params": { "schema": { - "baseSelector": "article.post", - "fields": [ - {"name": "title", "selector": "h1", "type": "text"}, - {"name": "content", "selector": ".content", "type": "html"} - ] + "type": "dict", + "value": { + "baseSelector": "article.post", + "fields": [ + {"name": "title", "selector": "h1", "type": "text"}, + {"name": "content", "selector": ".content", "type": "html"} + ] + } } } } @@ -414,166 +526,105 @@ print(config.dump()) # Use this JSON in your API calls } ``` -**LLM Extraction Strategy** - -```json -{ - "crawler_config": { - "type": "CrawlerRunConfig", - "params": { - "extraction_strategy": { - "type": "LLMExtractionStrategy", - "params": { - "instruction": "Extract article title, author, publication date and main content", - "provider": "openai/gpt-4", - "api_token": "your-api-token", - "schema": { - "type": "dict", - "value": { - "title": "Article Schema", - "type": "object", - "properties": { - "title": { - "type": "string", - "description": "The article's headline" - }, - "author": { - "type": "string", - "description": "The author's name" - }, - "published_date": { - "type": "string", - "format": "date-time", - "description": "Publication date and time" - }, - "content": { - "type": "string", - "description": "The main article content" - } - }, - "required": ["title", "content"] - } - } - } - } - } - } -} -``` - -**Deep Crawler Example** - -```json -{ - "crawler_config": { - "type": "CrawlerRunConfig", - "params": { - "deep_crawl_strategy": { - "type": "BFSDeepCrawlStrategy", - "params": { - "max_depth": 3, - "filter_chain": { - "type": 
"FilterChain", - "params": { - "filters": [ - { - "type": "ContentTypeFilter", - "params": { - "allowed_types": ["text/html", "application/xhtml+xml"] - } - }, - { - "type": "DomainFilter", - "params": { - "allowed_domains": ["blog.*", "docs.*"], - } - } - ] - } - }, - "url_scorer": { - "type": "CompositeScorer", - "params": { - "scorers": [ - { - "type": "KeywordRelevanceScorer", - "params": { - "keywords": ["tutorial", "guide", "documentation"], - } - }, - { - "type": "PathDepthScorer", - "params": { - "weight": 0.5, - "optimal_depth": 3 - } - } - ] - } - } - } - } - } - } -} -``` +**LLM Extraction Strategy** *(Keep example, ensure schema uses type/value wrapper)* +*(Keep Deep Crawler Example)* ### REST API Examples -Let's look at some practical examples: +Update URLs to use port `11235`. #### Simple Crawl ```python import requests +# Configuration objects converted to the required JSON structure +browser_config_payload = { + "type": "BrowserConfig", + "params": {"headless": True} +} +crawler_config_payload = { + "type": "CrawlerRunConfig", + "params": {"stream": False, "cache_mode": "bypass"} # Use string value of enum +} + crawl_payload = { - "urls": ["https://example.com"], - "browser_config": {"headless": True}, - "crawler_config": {"stream": False} + "urls": ["https://httpbin.org/html"], + "browser_config": browser_config_payload, + "crawler_config": crawler_config_payload } response = requests.post( - "http://localhost:8000/crawl", - # headers={"Authorization": f"Bearer {token}"}, # If JWT is enabled, more on this later + "http://localhost:11235/crawl", # Updated port + # headers={"Authorization": f"Bearer {token}"}, # If JWT is enabled json=crawl_payload ) -print(response.json()) # Print the response for debugging +print(f"Status Code: {response.status_code}") +if response.ok: + print(response.json()) +else: + print(f"Error: {response.text}") + ``` #### Streaming Results ```python -async def test_stream_crawl(session, token: str): +import json +import httpx # Use httpx for async streaming example + +async def test_stream_crawl(token: str = None): # Made token optional """Test the /crawl/stream endpoint with multiple URLs.""" - url = "http://localhost:8000/crawl/stream" + url = "http://localhost:11235/crawl/stream" # Updated port payload = { "urls": [ - "https://example.com", - "https://example.com/page1", - "https://example.com/page2", - "https://example.com/page3", + "https://httpbin.org/html", + "https://httpbin.org/links/5/0", ], - "browser_config": {"headless": True, "viewport": {"width": 1200}}, - "crawler_config": {"stream": True, "cache_mode": "bypass"} + "browser_config": { + "type": "BrowserConfig", + "params": {"headless": True, "viewport": {"type": "dict", "value": {"width": 1200, "height": 800}}} # Viewport needs type:dict + }, + "crawler_config": { + "type": "CrawlerRunConfig", + "params": {"stream": True, "cache_mode": "bypass"} + } } - # headers = {"Authorization": f"Bearer {token}"} # If JWT is enabled, more on this later - + headers = {} + # if token: + # headers = {"Authorization": f"Bearer {token}"} # If JWT is enabled + try: - async with session.post(url, json=payload, headers=headers) as response: - status = response.status - print(f"Status: {status} (Expected: 200)") - assert status == 200, f"Expected 200, got {status}" - - # Read streaming response line-by-line (NDJSON) - async for line in response.content: - if line: - data = json.loads(line.decode('utf-8').strip()) - print(f"Streamed Result: {json.dumps(data, indent=2)}") + async with httpx.AsyncClient() 
as client: + async with client.stream("POST", url, json=payload, headers=headers, timeout=120.0) as response: + print(f"Status: {response.status_code} (Expected: 200)") + response.raise_for_status() # Raise exception for bad status codes + + # Read streaming response line-by-line (NDJSON) + async for line in response.aiter_lines(): + if line: + try: + data = json.loads(line) + # Check for completion marker + if data.get("status") == "completed": + print("Stream completed.") + break + print(f"Streamed Result: {json.dumps(data, indent=2)}") + except json.JSONDecodeError: + print(f"Warning: Could not decode JSON line: {line}") + + except httpx.HTTPStatusError as e: + print(f"HTTP error occurred: {e.response.status_code} - {e.response.text}") except Exception as e: print(f"Error in streaming crawl test: {str(e)}") + +# To run this example: +# import asyncio +# asyncio.run(test_stream_crawl()) ``` +--- + ## Metrics & Monitoring Keep an eye on your crawler with these endpoints: @@ -584,57 +635,63 @@ Keep an eye on your crawler with these endpoints: Example health check: ```bash -curl http://localhost:8000/health +curl http://localhost:11235/health ``` -## Deployment Scenarios +--- -> 🚧 Coming soon! We'll cover: -> - Kubernetes deployment -> - Cloud provider setups (AWS, GCP, Azure) -> - High-availability configurations -> - Load balancing strategies +*(Deployment Scenarios and Complete Examples sections remain the same, maybe update links if examples moved)* -## Complete Examples - -Check out the `examples` folder in our repository for full working examples! Here are two to get you started: -[Using Client SDK](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_python_sdk.py) -[Using REST API](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_python_rest_api.py) +--- ## Server Configuration -The server's behavior can be customized through the `config.yml` file. Let's explore how to configure your Crawl4AI server for optimal performance and security. +The server's behavior can be customized through the `config.yml` file. ### Understanding config.yml -The configuration file is located at `deploy/docker/config.yml`. You can either modify this file before building the image or mount a custom configuration when running the container. +The configuration file is loaded from `/app/config.yml` inside the container. By default, the file from `deploy/docker/config.yml` in the repository is copied there during the build. -Here's a detailed breakdown of the configuration options: +Here's a detailed breakdown of the configuration options (using defaults from `deploy/docker/config.yml`): ```yaml # Application Configuration app: - title: "Crawl4AI API" # Server title in OpenAPI docs - version: "1.0.0" # API version - host: "0.0.0.0" # Listen on all interfaces - port: 8000 # Server port - reload: True # Enable hot reloading (development only) - timeout_keep_alive: 300 # Keep-alive timeout in seconds + title: "Crawl4AI API" + version: "1.0.0" # Consider setting this to match library version, e.g., "0.5.1" + host: "0.0.0.0" + port: 8020 # NOTE: This port is used ONLY when running server.py directly. Gunicorn overrides this (see supervisord.conf). + reload: False # Default set to False - suitable for production + timeout_keep_alive: 300 + +# Default LLM Configuration +llm: + provider: "openai/gpt-4o-mini" + api_key_env: "OPENAI_API_KEY" + # api_key: sk-... 
# If you pass the API key directly then api_key_env will be ignored + +# Redis Configuration (Used by internal Redis server managed by supervisord) +redis: + host: "localhost" + port: 6379 + db: 0 + password: "" + # ... other redis options ... # Rate Limiting Configuration rate_limiting: - enabled: True # Enable/disable rate limiting - default_limit: "100/minute" # Rate limit format: "number/timeunit" - trusted_proxies: [] # List of trusted proxy IPs - storage_uri: "memory://" # Use "redis://localhost:6379" for production + enabled: True + default_limit: "1000/minute" + trusted_proxies: [] + storage_uri: "memory://" # Use "redis://localhost:6379" if you need persistent/shared limits # Security Configuration security: - enabled: false # Master toggle for security features - jwt_enabled: true # Enable JWT authentication - https_redirect: True # Force HTTPS - trusted_hosts: ["*"] # Allowed hosts (use specific domains in production) - headers: # Security headers + enabled: false # Master toggle for security features + jwt_enabled: false # Enable JWT authentication (requires security.enabled=true) + https_redirect: false # Force HTTPS (requires security.enabled=true) + trusted_hosts: ["*"] # Allowed hosts (use specific domains in production) + headers: # Security headers (applied if security.enabled=true) x_content_type_options: "nosniff" x_frame_options: "DENY" content_security_policy: "default-src 'self'" @@ -642,148 +699,72 @@ security: # Crawler Configuration crawler: - memory_threshold_percent: 95.0 # Memory usage threshold + memory_threshold_percent: 95.0 rate_limiter: - base_delay: [1.0, 2.0] # Min and max delay between requests + base_delay: [1.0, 2.0] # Min/max delay between requests in seconds for dispatcher timeouts: - stream_init: 30.0 # Stream initialization timeout - batch_process: 300.0 # Batch processing timeout + stream_init: 30.0 # Timeout for stream initialization + batch_process: 300.0 # Timeout for non-streaming /crawl processing # Logging Configuration logging: - level: "INFO" # Log level (DEBUG, INFO, WARNING, ERROR) + level: "INFO" format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s" # Observability Configuration observability: prometheus: - enabled: True # Enable Prometheus metrics - endpoint: "/metrics" # Metrics endpoint + enabled: True + endpoint: "/metrics" health_check: - endpoint: "/health" # Health check endpoint + endpoint: "/health" ``` -### JWT Authentication +*(JWT Authentication section remains the same, just note the default port is now 11235 for requests)* -When `security.jwt_enabled` is set to `true` in your config.yml, all endpoints require JWT authentication via bearer tokens. Here's how it works: - -#### Getting a Token -```python -POST /token -Content-Type: application/json - -{ - "email": "user@example.com" -} -``` - -The endpoint returns: -```json -{ - "email": "user@example.com", - "access_token": "eyJ0eXAiOiJKV1QiLCJhbGciOi...", - "token_type": "bearer" -} -``` - -#### Using the Token -Add the token to your requests: -```bash -curl -H "Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGci..." 
http://localhost:8000/crawl -``` - -Using the Python SDK: -```python -from crawl4ai.docker_client import Crawl4aiDockerClient - -async with Crawl4aiDockerClient() as client: - # Authenticate first - await client.authenticate("user@example.com") - - # Now all requests will include the token automatically - result = await client.crawl(urls=["https://example.com"]) -``` - -#### Production Considerations 💡 -The default implementation uses a simple email verification. For production use, consider: -- Email verification via OTP/magic links -- OAuth2 integration -- Rate limiting token generation -- Token expiration and refresh mechanisms -- IP-based restrictions - -### Configuration Tips and Best Practices - -1. **Production Settings** 🏭 - - ```yaml - app: - reload: False # Disable reload in production - timeout_keep_alive: 120 # Lower timeout for better resource management - - rate_limiting: - storage_uri: "redis://redis:6379" # Use Redis for distributed rate limiting - default_limit: "50/minute" # More conservative rate limit - - security: - enabled: true # Enable all security features - trusted_hosts: ["your-domain.com"] # Restrict to your domain - ``` - -2. **Development Settings** 🛠️ - - ```yaml - app: - reload: True # Enable hot reloading - timeout_keep_alive: 300 # Longer timeout for debugging - - logging: - level: "DEBUG" # More verbose logging - ``` - -3. **High-Traffic Settings** 🚦 - - ```yaml - crawler: - memory_threshold_percent: 85.0 # More conservative memory limit - rate_limiter: - base_delay: [2.0, 4.0] # More aggressive rate limiting - ``` +*(Configuration Tips and Best Practices remain the same)* ### Customizing Your Configuration -#### Method 1: Pre-build Configuration +You can override the default `config.yml`. -```bash -# Copy and modify config before building -cd crawl4ai/deploy -vim custom-config.yml # Or use any editor +#### Method 1: Modify Before Build -# Build with custom config -docker build --platform=linux/amd64 --no-cache -t crawl4ai:latest . -``` +1. Edit the `deploy/docker/config.yml` file in your local repository clone. +2. Build the image using `docker buildx` or `docker compose --profile local-... up --build`. The modified file will be copied into the image. -#### Method 2: Build-time Configuration +#### Method 2: Runtime Mount (Recommended for Custom Deploys) -Use a custom config during build: +1. Create your custom configuration file, e.g., `my-custom-config.yml` locally. Ensure it contains all necessary sections. +2. Mount it when running the container: -```bash -# Build with custom config -docker build --platform=linux/amd64 --no-cache \ - --build-arg CONFIG_PATH=/path/to/custom-config.yml \ - -t crawl4ai:latest . 
-```
+
+    * **Using `docker run`:**
+        ```bash
+        # Assumes my-custom-config.yml is in the current directory
+        docker run -d -p 11235:11235 \
+          --name crawl4ai-custom-config \
+          --env-file .llm.env \
+          --shm-size=1g \
+          -v $(pwd)/my-custom-config.yml:/app/config.yml \
+          unclecode/crawl4ai:latest # Or your specific tag
+        ```
 
-#### Method 3: Runtime Configuration
-```bash
-# Mount custom config at runtime
-docker run -d -p 8000:8000 \
-    -v $(pwd)/custom-config.yml:/app/config.yml \
-    crawl4ai-server:prod
-```
+    * **Using `docker-compose.yml`:** Add a `volumes` section to the service definition:
+        ```yaml
+        services:
+          crawl4ai: # Matches the single service defined in the consolidated docker-compose.yml
+            image: unclecode/crawl4ai:latest
+            # No profile needed; the consolidated compose file enables this service by default
+            <<: *base-config
+            volumes:
+              # Mount local custom config over the default one in the container
+              - ./my-custom-config.yml:/app/config.yml
+              # Keep the shared memory volume from base-config
+              - /dev/shm:/dev/shm
+        ```
+        *(Note: Ensure `my-custom-config.yml` is in the same directory as `docker-compose.yml`)*
 
-> 💡 Note: When using Method 2, `/path/to/custom-config.yml` is relative to deploy directory.
-> 💡 Note: When using Method 3, ensure your custom config file has all required fields as the container will use this instead of the built-in config.
+> 💡 When mounting, your custom file *completely replaces* the default one. Ensure it's a valid and complete configuration.
 
 ### Configuration Recommendations
 
@@ -821,13 +802,20 @@ We're here to help you succeed with Crawl4AI! Here's how to get support:
 
 In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:
 
 - Building and running the Docker container
-- Configuring the environment
+- Configuring the environment
+- Using the interactive playground for testing
 - Making API requests with proper typing
 - Using the Python SDK
+- Leveraging specialized endpoints for screenshots, PDFs, and JavaScript execution
+- Connecting via the Model Context Protocol (MCP)
 - Monitoring your deployment
 
+The new playground interface at `http://localhost:11235/playground` makes it much easier to test configurations and generate the corresponding JSON for API requests.
+
+For AI application developers, the MCP integration allows tools like Claude Code to directly access Crawl4AI's capabilities without complex API handling.
+
 Remember, the examples in the `examples` folder are your friends - they show real-world usage patterns that you can adapt for your needs.
 
 Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀
 
-Happy crawling! 🕷️
\ No newline at end of file
+Happy crawling!
🕷️ diff --git a/deploy/docker/config.yml b/deploy/docker/config.yml index e93343c1..680765a3 100644 --- a/deploy/docker/config.yml +++ b/deploy/docker/config.yml @@ -3,9 +3,9 @@ app: title: "Crawl4AI API" version: "1.0.0" host: "0.0.0.0" - port: 8020 + port: 11235 reload: False - workers: 4 + workers: 1 timeout_keep_alive: 300 # Default LLM Configuration diff --git a/deploy/docker/requirements.txt b/deploy/docker/requirements.txt index 0dbb684c..dd489e28 100644 --- a/deploy/docker/requirements.txt +++ b/deploy/docker/requirements.txt @@ -1,5 +1,5 @@ -fastapi==0.115.12 -uvicorn==0.34.2 +fastapi>=0.115.12 +uvicorn>=0.34.2 gunicorn>=23.0.0 slowapi==0.1.9 prometheus-fastapi-instrumentator>=7.1.0 @@ -8,8 +8,9 @@ jwt>=1.3.1 dnspython>=2.7.0 email-validator==2.2.0 sse-starlette==2.2.1 -pydantic==2.11 +pydantic>=2.11 rank-bm25==0.2.2 anyio==4.9.0 PyJWT==2.10.1 - +mcp>=1.6.0 +websockets>=15.0.1 diff --git a/deploy/docker/server.py b/deploy/docker/server.py index 3cad8d05..bda9d891 100644 --- a/deploy/docker/server.py +++ b/deploy/docker/server.py @@ -629,6 +629,7 @@ async def get_context( # attach MCP layer (adds /mcp/ws, /mcp/sse, /mcp/schema) +print(f"MCP server running on {config['app']['host']}:{config['app']['port']}") attach_mcp( app, base_url=f"http://{config['app']['host']}:{config['app']['port']}" diff --git a/deploy/docker/static/playground/index.html b/deploy/docker/static/playground/index.html index 8c2b3fb9..8f0e2bdd 100644 --- a/deploy/docker/static/playground/index.html +++ b/deploy/docker/static/playground/index.html @@ -536,10 +536,14 @@ const endpointMap = { crawl: '/crawl', + }; + + /*const endpointMap = { + crawl: '/crawl', crawl_stream: '/crawl/stream', md: '/md', llm: '/llm' - }; + };*/ const api = endpointMap[endpoint]; const payload = { diff --git a/deploy/docker/supervisord.conf b/deploy/docker/supervisord.conf index d51cc953..a1b994aa 100644 --- a/deploy/docker/supervisord.conf +++ b/deploy/docker/supervisord.conf @@ -14,7 +14,7 @@ stderr_logfile=/dev/stderr ; Redirect redis stderr to container stderr stderr_logfile_maxbytes=0 [program:gunicorn] -command=/usr/local/bin/gunicorn --bind 0.0.0.0:11235 --workers 2 --threads 2 --timeout 120 --graceful-timeout 30 --keep-alive 60 --log-level info --worker-class uvicorn.workers.UvicornWorker server:app +command=/usr/local/bin/gunicorn --bind 0.0.0.0:11235 --workers 1 --threads 4 --timeout 1800 --graceful-timeout 30 --keep-alive 300 --log-level info --worker-class uvicorn.workers.UvicornWorker server:app directory=/app ; Working directory for the app user=appuser ; Run gunicorn as our non-root user autorestart=true diff --git a/docker-compose.yml b/docker-compose.yml index 4331d219..10ff3269 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -1,19 +1,11 @@ -# docker-compose.yml +version: '3.8' -# Base configuration anchor for reusability +# Shared configuration for all environments x-base-config: &base-config ports: - # Map host port 11235 to container port 11235 (where Gunicorn will listen) - - "11235:11235" - # - "8080:8080" # Uncomment if needed - - # Load API keys primarily from .llm.env file - # Create .llm.env in the root directory .llm.env.example + - "11235:11235" # Gunicorn port env_file: - - .llm.env - - # Define environment variables, allowing overrides from host environment - # Syntax ${VAR:-} uses host env var 'VAR' if set, otherwise uses value from .llm.env + - .llm.env # API keys (create from .llm.env.example) environment: - OPENAI_API_KEY=${OPENAI_API_KEY:-} - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-} @@ 
-22,10 +14,8 @@ x-base-config: &base-config - TOGETHER_API_KEY=${TOGETHER_API_KEY:-} - MISTRAL_API_KEY=${MISTRAL_API_KEY:-} - GEMINI_API_TOKEN=${GEMINI_API_TOKEN:-} - volumes: - # Mount /dev/shm for Chromium/Playwright performance - - /dev/shm:/dev/shm + - /dev/shm:/dev/shm # Chromium performance deploy: resources: limits: @@ -34,47 +24,26 @@ x-base-config: &base-config memory: 1G restart: unless-stopped healthcheck: - # IMPORTANT: Ensure Gunicorn binds to 11235 in supervisord.conf test: ["CMD", "curl", "-f", "http://localhost:11235/health"] interval: 30s timeout: 10s retries: 3 - start_period: 40s # Give the server time to start - # Run the container as the non-root user defined in the Dockerfile + start_period: 40s user: "appuser" services: - # --- Local Build Services --- - crawl4ai-local-amd64: + crawl4ai: + # 1. Default: Pull multi-platform test image from Docker Hub + # 2. Override with local image via: IMAGE=local-test docker compose up + image: ${IMAGE:-unclecode/crawl4ai:${TAG:-latest}} + + # Local build config (used with --build) build: - context: . # Build context is the root directory - dockerfile: Dockerfile # Dockerfile is in the root directory + context: . + dockerfile: Dockerfile args: INSTALL_TYPE: ${INSTALL_TYPE:-default} ENABLE_GPU: ${ENABLE_GPU:-false} - # PYTHON_VERSION arg is omitted as it's fixed by 'FROM python:3.10-slim' in Dockerfile - platform: linux/amd64 - profiles: ["local-amd64"] - <<: *base-config # Inherit base configuration - - crawl4ai-local-arm64: - build: - context: . # Build context is the root directory - dockerfile: Dockerfile # Dockerfile is in the root directory - args: - INSTALL_TYPE: ${INSTALL_TYPE:-default} - ENABLE_GPU: ${ENABLE_GPU:-false} - platform: linux/arm64 - profiles: ["local-arm64"] - <<: *base-config - - # --- Docker Hub Image Services --- - crawl4ai-hub-amd64: - image: unclecode/crawl4ai:${VERSION:-latest}-amd64 - profiles: ["hub-amd64"] - <<: *base-config - - crawl4ai-hub-arm64: - image: unclecode/crawl4ai:${VERSION:-latest}-arm64 - profiles: ["hub-arm64"] + + # Inherit shared config <<: *base-config \ No newline at end of file diff --git a/docs/md_v2/blog/releases/0.6.0.md b/docs/md_v2/blog/releases/0.6.0.md new file mode 100644 index 00000000..2e5bb63c --- /dev/null +++ b/docs/md_v2/blog/releases/0.6.0.md @@ -0,0 +1,51 @@ +# Crawl4AI 0.6.0 + +*Release date: 2025‑04‑22* + +0.6.0 is the **biggest jump** since the 0.5 series, packing a smarter browser core, pool‑based crawlers, and a ton of DX candy. Expect faster runs, lower RAM burn, and richer diagnostics. + +--- + +## 🚀 Key upgrades + +| Area | What changed | +|------|--------------| +| **Browser** | New **Browser** management with pooling, page pre‑warm, geolocation + locale + timezone switches | +| **Crawler** | Console and network log capture, MHTML snapshots, safer `get_page` API | +| **Server & API** | **Crawler Pool Manager** endpoint, MCP socket + SSE support | +| **Docs** | v2 layout, floating Ask‑AI helper, GitHub stats badge, copy‑code buttons, Docker API demo | +| **Tests** | Memory + load benchmarks, 90+ new cases covering MCP and Docker | + +--- + +## ⚠️ Breaking changes + +1. **`get_page` signature** – returns `(html, metadata)` instead of plain html. +2. **Docker** – new Chromium base layer, rebuild images. + +--- + +## How to upgrade + +```bash +pip install -U crawl4ai==0.6.0 +``` + +--- + +## Full changelog + +The diff between `main` and `next` spans **36 k insertions, 4.9 k deletions** over 121 files. 
Read the [compare view](https://github.com/unclecode/crawl4ai/compare/0.5.0.post8...0.6.0) or see `CHANGELOG.md` for the granular list.
+
+---
+
+## Upgrade tips
+
+* Using the Docker API? Pull `unclecode/crawl4ai:0.6.0`; the new arguments are documented in `deploy/docker/README.md`.
+* Stress‑test your stack with `tests/memory/run_benchmark.py` before rolling out to production.
+* Markdown generators were renamed, but the old names remain as aliases; update when convenient, and the deprecation warnings will remind you.
+
+---
+
+Happy crawling! Ping `@unclecode` on X for questions or memes.
+
diff --git a/pyproject.toml b/pyproject.toml
index 032e5cd6..cffef4de 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -8,7 +8,7 @@ dynamic = ["version"]
 description = "🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & scraper"
 readme = "README.md"
 requires-python = ">=3.9"
-license = {text = "MIT"}
+license = {text = "Apache-2.0"}
 authors = [
     {name = "Unclecode", email = "unclecode@kidocode.com"}
 ]
diff --git a/tests/mcp/test_mcp_socket.py b/tests/mcp/test_mcp_socket.py
index ecb3070f..32456b31 100644
--- a/tests/mcp/test_mcp_socket.py
+++ b/tests/mcp/test_mcp_socket.py
@@ -101,19 +101,19 @@ async def test_context(s: ClientSession):
 
 
 async def main() -> None:
-    async with websocket_client("ws://localhost:8020/mcp/ws") as (r, w):
+    async with websocket_client("ws://localhost:11235/mcp/ws") as (r, w):
         async with ClientSession(r, w) as s:
             await s.initialize()                        # handshake
             tools = (await s.list_tools()).tools
             print("tools:", [t.name for t in tools])
             # await test_list()
-            # await test_crawl(s)
-            # await test_md(s)
-            # await test_screenshot(s)
-            # await test_pdf(s)
-            # await test_execute_js(s)
-            # await test_html(s)
+            await test_crawl(s)
+            await test_md(s)
+            await test_screenshot(s)
+            await test_pdf(s)
+            await test_execute_js(s)
+            await test_html(s)
             await test_context(s)
 
 anyio.run(main)
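For readers who want to poke at the MCP endpoint outside the test suite, here is a minimal sketch that follows the same pattern as the updated `tests/mcp/test_mcp_socket.py` above. It assumes the server from this release is running locally on the new default port 11235, and that `websocket_client` and `ClientSession` are importable from the `mcp` client package as shown (the test's import block is not part of the hunk, so treat those import paths as assumptions).

```python
# Minimal sketch, not part of the patch above.
# Assumes a local Crawl4AI 0.6.0 server listening on the new default port 11235.
import anyio
from mcp import ClientSession
from mcp.client.websocket import websocket_client  # assumed import path for the helper used in the test

async def list_mcp_tools() -> None:
    # Connect to the MCP WebSocket endpoint attached by attach_mcp() in server.py
    async with websocket_client("ws://localhost:11235/mcp/ws") as (reader, writer):
        async with ClientSession(reader, writer) as session:
            await session.initialize()  # MCP handshake, as in the test's main()
            tools = (await session.list_tools()).tools
            print("Available MCP tools:", [t.name for t in tools])

anyio.run(list_mcp_tools)
```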