# Crawl4AI Docker Guide 🐳
## Table of Contents

- [Prerequisites](#prerequisites)
- [Installation](#installation)
  - [Option 1: Using Pre-built Docker Hub Images (Recommended)](#option-1-using-pre-built-docker-hub-images-recommended)
  - [Option 2: Using Docker Compose](#option-2-using-docker-compose)
  - [Option 3: Manual Local Build & Run](#option-3-manual-local-build--run)
- [Dockerfile Parameters](#dockerfile-parameters)
- [Using the API](#using-the-api)
  - [Playground Interface](#playground-interface)
  - [Python SDK](#python-sdk)
  - [Understanding Request Schema](#understanding-request-schema)
  - [REST API Examples](#rest-api-examples)
  - [Asynchronous Jobs with Webhooks](#asynchronous-jobs-with-webhooks)
- [Additional API Endpoints](#additional-api-endpoints)
  - [HTML Extraction Endpoint](#html-extraction-endpoint)
  - [Screenshot Endpoint](#screenshot-endpoint)
  - [PDF Export Endpoint](#pdf-export-endpoint)
  - [JavaScript Execution Endpoint](#javascript-execution-endpoint)
  - [Library Context Endpoint](#library-context-endpoint)
- [MCP (Model Context Protocol) Support](#mcp-model-context-protocol-support)
  - [What is MCP?](#what-is-mcp)
  - [Connecting via MCP](#connecting-via-mcp)
  - [Using with Claude Code](#using-with-claude-code)
  - [Available MCP Tools](#available-mcp-tools)
  - [Testing MCP Connections](#testing-mcp-connections)
  - [MCP Schemas](#mcp-schemas)
- [Metrics & Monitoring](#metrics--monitoring)
- [Deployment Scenarios](#deployment-scenarios)
- [Complete Examples](#complete-examples)
- [Server Configuration](#server-configuration)
  - [Understanding config.yml](#understanding-configyml)
  - [JWT Authentication](#jwt-authentication)
  - [Configuration Tips and Best Practices](#configuration-tips-and-best-practices)
  - [Customizing Your Configuration](#customizing-your-configuration)
  - [Configuration Recommendations](#configuration-recommendations)
- [Getting Help](#getting-help)
- [Summary](#summary)
## Prerequisites
Before we dive in, make sure you have:

- Docker installed and running (version 20.10.0 or higher), including `docker compose` (usually bundled with Docker Desktop).
- `git` for cloning the repository.
- At least 4GB of RAM available for the container (more recommended for heavy use).
- Python 3.10+ (if using the Python SDK).
- Node.js 16+ (if using the Node.js examples).

> 💡 **Pro tip**: Run `docker info` to check your Docker installation and available resources.
## Installation
We offer several ways to get the Crawl4AI server running. The quickest way is to use our pre-built Docker Hub images.

### Option 1: Using Pre-built Docker Hub Images (Recommended)

Pull and run images directly from Docker Hub without building locally.

#### 1. Pull the Image

Our latest stable release is `0.8.0`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.

```bash
# Pull the latest stable version (0.8.0)
docker pull unclecode/crawl4ai:0.8.0

# Or use the latest tag (points to 0.8.0)
docker pull unclecode/crawl4ai:latest
```

#### 2. Setup Environment (API Keys)

If you plan to use LLMs, create a `.llm.env` file in your working directory:

```bash
# Create a .llm.env file with your API keys
cat > .llm.env << EOL
# OpenAI
OPENAI_API_KEY=sk-your-key

# Anthropic
ANTHROPIC_API_KEY=your-anthropic-key

# Other providers as needed
# DEEPSEEK_API_KEY=your-deepseek-key
# GROQ_API_KEY=your-groq-key
# TOGETHER_API_KEY=your-together-key
# MISTRAL_API_KEY=your-mistral-key
# GEMINI_API_TOKEN=your-gemini-token
EOL
```

> 🔑 **Note**: Keep your API keys secure! Never commit `.llm.env` to version control.

#### 3. Run the Container

* **Basic run:**

  ```bash
  docker run -d \
    -p 11235:11235 \
    --name crawl4ai \
    --shm-size=1g \
    unclecode/crawl4ai:0.8.0
  ```

* **With LLM support:**

  ```bash
  # Make sure .llm.env is in the current directory
  docker run -d \
    -p 11235:11235 \
    --name crawl4ai \
    --env-file .llm.env \
    --shm-size=1g \
    unclecode/crawl4ai:0.8.0
  ```

> The server will be available at `http://localhost:11235`. Visit `/playground` to access the interactive testing interface.
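Once the container is running, you can sanity-check it from Python before wiring up a client. This is an illustrative snippet, not part of Crawl4AI; `server_alive` is a hypothetical helper that probes the `/playground` path mentioned above:

```python
import urllib.request
import urllib.error

def server_alive(base="http://localhost:11235", path="/playground", timeout=5):
    """Return True if the Crawl4AI server answers on the given endpoint."""
    try:
        with urllib.request.urlopen(base + path, timeout=timeout) as resp:
            # Any non-5xx response means the server is up and routing requests
            return resp.status < 500
    except (urllib.error.URLError, OSError):
        return False

print(server_alive())
```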
#### 4. Stopping the Container

```bash
docker stop crawl4ai && docker rm crawl4ai
```
#### Docker Hub Versioning Explained

* **Image Name:** `unclecode/crawl4ai`
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.0-r1`)
  * `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
  * `SUFFIX`: Optional suffix marking release candidates and revisions (e.g., `r1`)
* **`latest` Tag:** Points to the most recent stable version
* **Multi-Architecture Support:** All images support both `linux/amd64` and `linux/arm64` architectures through a single tag
### Option 2: Using Docker Compose
Docker Compose simplifies building and running the service, especially for local development and testing.

#### 1. Clone Repository

```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
```

#### 2. Environment Setup (API Keys)

If you plan to use LLMs, copy the example environment file and add your API keys. This file should be in the **project root directory**.

```bash
# Make sure you are in the 'crawl4ai' root directory
cp deploy/docker/.llm.env.example .llm.env

# Now edit .llm.env and add your API keys
```

**Flexible LLM Provider Configuration:**

The Docker setup now supports flexible LLM provider configuration through three methods:

1. **Environment Variable** (Highest Priority): Set `LLM_PROVIDER` to override the default

   ```bash
   export LLM_PROVIDER="anthropic/claude-3-opus"
   # Or in your .llm.env file:
   # LLM_PROVIDER=anthropic/claude-3-opus
   ```

2. **API Request Parameter**: Specify the provider per request

   ```json
   {
     "url": "https://example.com",
     "provider": "groq/mixtral-8x7b"
   }
   ```

3. **Config File Default**: Falls back to `config.yml` (default: `openai/gpt-4o-mini`)

The system automatically selects the appropriate API key based on the provider.
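The three-tier precedence above can be sketched in a few lines of Python. This is an illustrative model of the selection logic, not the server's actual code (`resolve_provider` and its signature are assumptions):

```python
import os

def resolve_provider(request_provider=None, config_default="openai/gpt-4o-mini"):
    """Pick the LLM provider: env var > per-request parameter > config default."""
    env_provider = os.environ.get("LLM_PROVIDER")
    if env_provider:           # 1. environment variable (highest priority)
        return env_provider
    if request_provider:       # 2. "provider" field in the API request
        return request_provider
    return config_default      # 3. default from config.yml
```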
#### 3. Build and Run with Compose

The `docker-compose.yml` file in the project root provides a simplified approach that automatically handles architecture detection using buildx.

* **Run Pre-built Image from Docker Hub:**

  ```bash
  # Pulls and runs version 0.8.0 from Docker Hub
  # Automatically selects the correct architecture
  IMAGE=unclecode/crawl4ai:0.8.0 docker compose up -d
  ```

* **Build and Run Locally:**

  ```bash
  # Builds the image locally using the Dockerfile and runs it
  # Automatically uses the correct architecture for your machine
  docker compose up --build -d
  ```

* **Customize the Build:**

  ```bash
  # Build with all features (includes torch and transformers)
  INSTALL_TYPE=all docker compose up --build -d

  # Build with GPU support (for AMD64 platforms)
  ENABLE_GPU=true docker compose up --build -d
  ```

> The server will be available at `http://localhost:11235`.

#### 4. Stopping the Service

```bash
# Stop the service
docker compose down
```
### Option 3: Manual Local Build & Run

Use this option if you prefer not to use Docker Compose and want direct control over the build and run process.

#### 1. Clone Repository & Setup Environment

Follow steps 1 and 2 from the Docker Compose section above (clone the repo, `cd crawl4ai`, create `.llm.env` in the root).

#### 2. Build the Image (Multi-Arch)

Use `docker buildx` to build the image. Crawl4AI now uses buildx to handle multi-architecture builds automatically.

```bash
# Make sure you are in the 'crawl4ai' root directory
# Build for the current architecture and load it into Docker
docker buildx build -t crawl4ai-local:latest --load .

# Or build for multiple architectures (useful for publishing; multi-arch
# images must be pushed to a registry, since --load supports only one platform)
docker buildx build --platform linux/amd64,linux/arm64 -t crawl4ai-local:latest --push .

# Build with additional options
docker buildx build \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=false \
  -t crawl4ai-local:latest --load .
```

#### 3. Run the Container

* **Basic run (no LLM support):**

  ```bash
  docker run -d \
    -p 11235:11235 \
    --name crawl4ai-standalone \
    --shm-size=1g \
    crawl4ai-local:latest
  ```

* **With LLM support:**

  ```bash
  # Make sure .llm.env is in the current directory (project root)
  docker run -d \
    -p 11235:11235 \
    --name crawl4ai-standalone \
    --env-file .llm.env \
    --shm-size=1g \
    crawl4ai-local:latest
  ```

> The server will be available at `http://localhost:11235`.

#### 4. Stopping the Manual Container

```bash
docker stop crawl4ai-standalone && docker rm crawl4ai-standalone
```
---
## MCP (Model Context Protocol) Support
The Crawl4AI server includes support for the Model Context Protocol (MCP), allowing you to connect the server's capabilities directly to MCP-compatible clients like Claude Code.

### What is MCP?

MCP is an open protocol that standardizes how applications provide context to LLMs. It allows AI models to access external tools, data sources, and services through a standardized interface.

### Connecting via MCP

The Crawl4AI server exposes two MCP endpoints:

- **Server-Sent Events (SSE)**: `http://localhost:11235/mcp/sse`
- **WebSocket**: `ws://localhost:11235/mcp/ws`

### Using with Claude Code

You can add Crawl4AI as an MCP tool provider in Claude Code with a simple command:

```bash
# Add the Crawl4AI server as an MCP provider
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse

# List all MCP providers to verify it was added
claude mcp list
```

Once connected, Claude Code can directly use Crawl4AI's capabilities like screenshot capture, PDF generation, and HTML processing without making separate API calls.

### Available MCP Tools

When connected via MCP, the following tools are available:

- `md` - Generate markdown from web content
- `html` - Extract preprocessed HTML
- `screenshot` - Capture webpage screenshots
- `pdf` - Generate PDF documents
- `execute_js` - Run JavaScript on web pages
- `crawl` - Perform multi-URL crawling
- `ask` - Query the Crawl4AI library context

### Testing MCP Connections

You can test the MCP WebSocket connection using the test file included in the repository:

```bash
# From the repository root
python tests/mcp/test_mcp_socket.py
```

### MCP Schemas

Access the MCP tool schemas at `http://localhost:11235/mcp/schema` for detailed information on each tool's parameters and capabilities.
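If you script against these endpoints, it helps to derive all three URLs from a single base so hosts and ports stay consistent. A small illustrative helper (not part of the library):

```python
def mcp_endpoints(base="http://localhost:11235"):
    """Return the SSE, WebSocket, and schema MCP endpoint URLs for a server base URL."""
    host = base.split("://", 1)[1]  # strip the scheme, keep host:port
    return {
        "sse": f"{base}/mcp/sse",
        "ws": f"ws://{host}/mcp/ws",
        "schema": f"{base}/mcp/schema",
    }
```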
---
## Additional API Endpoints

In addition to the core `/crawl` and `/crawl/stream` endpoints, the server provides several specialized endpoints:

### HTML Extraction Endpoint

```
POST /html
```

Crawls the URL and returns preprocessed HTML optimized for schema extraction.

```json
{
  "url": "https://example.com"
}
```

### Screenshot Endpoint

```
POST /screenshot
```

Captures a full-page PNG screenshot of the specified URL.

```json
{
  "url": "https://example.com",
  "screenshot_wait_for": 2,
  "output_path": "/path/to/save/screenshot.png"
}
```

- `screenshot_wait_for`: Optional delay in seconds before capture (default: 2)
- `output_path`: Optional path to save the screenshot (recommended)

### PDF Export Endpoint

```
POST /pdf
```

Generates a PDF document of the specified URL.

```json
{
  "url": "https://example.com",
  "output_path": "/path/to/save/document.pdf"
}
```

- `output_path`: Optional path to save the PDF (recommended)

### JavaScript Execution Endpoint

```
POST /execute_js
```

Executes JavaScript snippets on the specified URL and returns the full crawl result.

```json
{
  "url": "https://example.com",
  "scripts": [
    "return document.title",
    "return Array.from(document.querySelectorAll('a')).map(a => a.href)"
  ]
}
```

- `scripts`: List of JavaScript snippets to execute sequentially
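Because these payloads are plain JSON, they are easy to build programmatically. A short sketch for `/execute_js`; `js_payload` is a hypothetical helper, and the actual HTTP call is left commented out:

```python
def js_payload(url, *scripts):
    """Build a /execute_js request body; each snippet should return a value."""
    return {"url": url, "scripts": list(scripts)}

payload = js_payload(
    "https://example.com",
    "return document.title",
    "return Array.from(document.querySelectorAll('a')).map(a => a.href)",
)
# import requests
# result = requests.post("http://localhost:11235/execute_js", json=payload).json()
```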
---
## Dockerfile Parameters

You can customize the image build process using build arguments (`--build-arg`). These are typically used via `docker buildx build` or within the `docker-compose.yml` file.

```bash
# Example: Build with 'all' features using buildx
# (multi-arch builds must be pushed to a registry; --load supports one platform)
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --build-arg INSTALL_TYPE=all \
  -t yourname/crawl4ai-all:latest \
  --push \
  .  # Build from the root context
```

### Build Arguments Explained

| Argument      | Description                              | Default            | Options                                  |
| :------------ | :--------------------------------------- | :----------------- | :--------------------------------------- |
| INSTALL_TYPE  | Feature set                              | `default`          | `default`, `all`, `torch`, `transformer` |
| ENABLE_GPU    | GPU support (CUDA for AMD64)             | `false`            | `true`, `false`                          |
| APP_HOME      | Install path inside container (advanced) | `/app`             | any valid path                           |
| USE_LOCAL     | Install library from local source        | `true`             | `true`, `false`                          |
| GITHUB_REPO   | Git repo to clone if USE_LOCAL=false     | *(see Dockerfile)* | any git URL                              |
| GITHUB_BRANCH | Git branch to clone if USE_LOCAL=false   | `main`             | any branch name                          |

*(Note: PYTHON_VERSION is fixed by the `FROM` instruction in the Dockerfile)*

### Build Best Practices

1. **Choose the Right Install Type**
   * `default`: Basic installation, smallest image size. Suitable for most standard web scraping and markdown generation.
   * `all`: Full features including `torch` and `transformers` for advanced extraction strategies (e.g., CosineStrategy and certain LLM filters). Significantly larger image; make sure you need these extras.
2. **Platform Considerations**
   * Use `buildx` for building multi-architecture images, especially for pushing to registries.
   * Use `docker compose` profiles (`local-amd64`, `local-arm64`) for easy platform-specific local builds.
3. **Performance Optimization**
   * The image automatically includes platform-specific optimizations (OpenMP for AMD64, OpenBLAS for ARM64).

---
## Using the API
Communicate with the running Docker server via its REST API (default: `http://localhost:11235`). You can use the Python SDK or make direct HTTP requests.

### Playground Interface

A built-in web playground is available at `http://localhost:11235/playground` for testing and generating API requests. The playground allows you to:

1. Configure `CrawlerRunConfig` and `BrowserConfig` using the main library's Python syntax
2. Test crawling operations directly from the interface
3. Generate the corresponding JSON for REST API requests based on your configuration

This is the easiest way to translate a Python configuration into a JSON request when building integrations.

### Python SDK

Install the SDK: `pip install crawl4ai`
```python
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode  # Assuming you have crawl4ai installed

async def main():
    # Point to the correct server port
    async with Crawl4aiDockerClient(base_url="http://localhost:11235", verbose=True) as client:
        # If JWT is enabled on the server, authenticate first:
        # await client.authenticate("user@example.com")  # See Server Configuration section

        # Example: non-streaming crawl
        print("--- Running Non-Streaming Crawl ---")
        results = await client.crawl(
            ["https://httpbin.org/html"],
            browser_config=BrowserConfig(headless=True),  # Use library classes for config aid
            crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )
        if results:  # client.crawl returns None on failure
            print(f"Non-streaming results success: {results.success}")
            if results.success:
                for result in results:  # Iterate through the CrawlResultContainer
                    print(f"URL: {result.url}, Success: {result.success}")
        else:
            print("Non-streaming crawl failed.")

        # Example: streaming crawl
        print("\n--- Running Streaming Crawl ---")
        stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)
        try:
            async for result in await client.crawl(  # client.crawl returns an async generator for streaming
                ["https://httpbin.org/html", "https://httpbin.org/links/5/0"],
                browser_config=BrowserConfig(headless=True),
                crawler_config=stream_config
            ):
                print(f"Streamed result: URL: {result.url}, Success: {result.success}")
        except Exception as e:
            print(f"Streaming crawl failed: {e}")

        # Example: get schema
        print("\n--- Getting Schema ---")
        schema = await client.get_schema()
        print(f"Schema received: {bool(schema)}")

if __name__ == "__main__":
    asyncio.run(main())
```
*(The SDK also accepts parameters such as `timeout` and `verify_ssl`.)*
### Second Approach: Direct API Calls

When sending configurations directly as JSON, any non-primitive value (such as a config object or strategy) **must** follow the `{"type": "ClassName", "params": {...}}` structure, and plain dictionaries must be wrapped as `{"type": "dict", "value": {...}}`.
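The wrapper structure is mechanical, so it can be generated with a couple of helper functions. These helpers (`wire`, `wrap_dict`) are illustrative conveniences, not part of the SDK:

```python
def wire(class_name, **params):
    """Wrap a config object as {"type": ClassName, "params": {...}}."""
    return {"type": class_name, "params": params}

def wrap_dict(d):
    """Wrap a plain dictionary as {"type": "dict", "value": {...}}."""
    return {"type": "dict", "value": d}

# Build a nested crawler config without writing the envelope by hand
crawler_config = wire(
    "CrawlerRunConfig",
    cache_mode="bypass",
    extraction_strategy=wire(
        "JsonCssExtractionStrategy",
        schema=wrap_dict({"baseSelector": "article.post", "fields": []}),
    ),
)
```

Generating payloads this way keeps hand-written JSON to a minimum and makes the nesting rules hard to get wrong.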
#### More Examples

**Extraction Strategy**

```json
{
  "crawler_config": {
    "type": "CrawlerRunConfig",
    "params": {
      "extraction_strategy": {
        "type": "JsonCssExtractionStrategy",
        "params": {
          "schema": {
            "type": "dict",
            "value": {
              "baseSelector": "article.post",
              "fields": [
                {"name": "title", "selector": "h1", "type": "text"},
                {"name": "content", "selector": ".content", "type": "html"}
              ]
            }
          }
        }
      }
    }
  }
}
```
### REST API Examples
|
|
|
|
Update URLs to use port `11235`.
|
|
|
|
#### Simple Crawl
|
|
|
|
```python
import requests

# Configuration objects converted to the required JSON structure
browser_config_payload = {
    "type": "BrowserConfig",
    "params": {"headless": True}
}
crawler_config_payload = {
    "type": "CrawlerRunConfig",
    "params": {"stream": False, "cache_mode": "bypass"}  # Use the string value of the enum
}

crawl_payload = {
    "urls": ["https://httpbin.org/html"],
    "browser_config": browser_config_payload,
    "crawler_config": crawler_config_payload
}
response = requests.post(
    "http://localhost:11235/crawl",  # Updated port
    # headers={"Authorization": f"Bearer {token}"},  # If JWT is enabled
    json=crawl_payload
)
print(f"Status Code: {response.status_code}")
if response.ok:
    print(response.json())
else:
    print(f"Error: {response.text}")
```
#### Streaming Results
```python
import json
import httpx  # Use httpx for the async streaming example

async def test_stream_crawl(token: str = None):  # Token is optional
    """Test the /crawl/stream endpoint with multiple URLs."""
    url = "http://localhost:11235/crawl/stream"  # Updated port
    payload = {
        "urls": [
            "https://httpbin.org/html",
            "https://httpbin.org/links/5/0",
        ],
        "browser_config": {
            "type": "BrowserConfig",
            "params": {"headless": True, "viewport": {"type": "dict", "value": {"width": 1200, "height": 800}}}  # Viewport needs the type:dict wrapper
        },
        "crawler_config": {
            "type": "CrawlerRunConfig",
            "params": {"stream": True, "cache_mode": "bypass"}
        }
    }

    headers = {}
    # if token:
    #     headers = {"Authorization": f"Bearer {token}"}  # If JWT is enabled

    try:
        async with httpx.AsyncClient() as client:
            async with client.stream("POST", url, json=payload, headers=headers, timeout=120.0) as response:
                print(f"Status: {response.status_code} (Expected: 200)")
                response.raise_for_status()  # Raise an exception for bad status codes

                # Read the streaming response line by line (NDJSON)
                async for line in response.aiter_lines():
                    if line:
                        try:
                            data = json.loads(line)
                            # Check for the completion marker
                            if data.get("status") == "completed":
                                print("Stream completed.")
                                break
                            print(f"Streamed Result: {json.dumps(data, indent=2)}")
                        except json.JSONDecodeError:
                            print(f"Warning: Could not decode JSON line: {line}")

    except httpx.HTTPStatusError as e:
        print(f"HTTP error occurred: {e.response.status_code} - {e.response.text}")
    except Exception as e:
        print(f"Error in streaming crawl test: {str(e)}")

# To run this example:
# import asyncio
# asyncio.run(test_stream_crawl())
```
### Asynchronous Jobs with Webhooks

For long-running crawls, or when you want to avoid keeping connections open, use the job queue endpoints. Instead of polling for results, configure a webhook to receive a notification when the job completes.

#### Why Use Jobs & Webhooks?

- **No Polling Required** - Get notified when crawls complete instead of constantly checking status
- **Better Resource Usage** - Free up client connections while jobs run in the background
- **Scalable Architecture** - Ideal for high-volume crawling with TypeScript/Node.js clients or microservices
- **Reliable Delivery** - Automatic retry with exponential backoff (5 attempts: 1s → 2s → 4s → 8s → 16s)

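
The retry schedule above is standard exponential doubling. As a quick sketch (using the default retry values shown later in the webhook `config.yml` snippet — `max_attempts: 5`, `initial_delay_ms: 1000`, `max_delay_ms: 32000`):

```python
def backoff_schedule_ms(max_attempts: int = 5, initial_ms: int = 1000, max_ms: int = 32000) -> list:
    """Delay before each retry attempt: initial * 2^attempt, capped at max_ms."""
    return [min(initial_ms * 2 ** attempt, max_ms) for attempt in range(max_attempts)]

print(backoff_schedule_ms())  # [1000, 2000, 4000, 8000, 16000] -> 1s, 2s, 4s, 8s, 16s
```
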
#### How It Works

1. **Submit Job** → POST to `/crawl/job` with an optional `webhook_config`
2. **Get Task ID** → Receive a `task_id` immediately
3. **Job Runs** → The crawl executes in the background
4. **Webhook Fired** → The server POSTs a completion notification to your webhook URL
5. **Fetch Results** → If the data wasn't included in the webhook, GET `/crawl/job/{task_id}`
#### Quick Example

```bash
# Submit a crawl job with webhook notification
curl -X POST http://localhost:11235/crawl/job \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "webhook_config": {
      "webhook_url": "https://myapp.com/webhooks/crawl-complete",
      "webhook_data_in_payload": false
    }
  }'

# Response: {"task_id": "crawl_a1b2c3d4"}
```

**Your webhook receives:**

```json
{
  "task_id": "crawl_a1b2c3d4",
  "task_type": "crawl",
  "status": "completed",
  "timestamp": "2025-10-21T10:30:00.000000+00:00",
  "urls": ["https://example.com"]
}
```

Then fetch the results:

```bash
curl http://localhost:11235/crawl/job/crawl_a1b2c3d4
```
#### Include Data in Webhook

Set `webhook_data_in_payload: true` to receive the full crawl results directly in the webhook:

```bash
curl -X POST http://localhost:11235/crawl/job \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "webhook_config": {
      "webhook_url": "https://myapp.com/webhooks/crawl-complete",
      "webhook_data_in_payload": true
    }
  }'
```

**Your webhook receives the complete data:**

```json
{
  "task_id": "crawl_a1b2c3d4",
  "task_type": "crawl",
  "status": "completed",
  "timestamp": "2025-10-21T10:30:00.000000+00:00",
  "urls": ["https://example.com"],
  "data": {
    "markdown": "...",
    "html": "...",
    "links": {...},
    "metadata": {...}
  }
}
```
#### Webhook Authentication

Add custom headers for authentication:

```json
{
  "urls": ["https://example.com"],
  "webhook_config": {
    "webhook_url": "https://myapp.com/webhooks/crawl",
    "webhook_data_in_payload": false,
    "webhook_headers": {
      "X-Webhook-Secret": "your-secret-token",
      "X-Service-ID": "crawl4ai-prod"
    }
  }
}
```
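
On the receiving end, your handler should reject any request whose secret header doesn't match before trusting the payload. Here is a minimal stdlib-only sketch (the port and handler are assumptions for illustration; in production prefer a constant-time comparison such as `hmac.compare_digest`):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

EXPECTED_SECRET = "your-secret-token"  # must match webhook_headers in the job payload

def is_authorized(headers) -> bool:
    """True only if the shared-secret header matches."""
    return headers.get("X-Webhook-Secret") == EXPECTED_SECRET

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if not is_authorized(self.headers):
            self.send_response(403)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        print(f"Job {event['task_id']} finished with status: {event['status']}")
        self.send_response(200)
        self.end_headers()

# To run: HTTPServer(("0.0.0.0", 8000), WebhookHandler).serve_forever()
```
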
#### Global Default Webhook

Configure a default webhook URL in `config.yml` for all jobs:

```yaml
webhooks:
  enabled: true
  default_url: "https://myapp.com/webhooks/default"
  data_in_payload: false
  retry:
    max_attempts: 5
    initial_delay_ms: 1000
    max_delay_ms: 32000
    timeout_ms: 30000
```

Jobs submitted without a `webhook_config` now automatically use the default webhook.
#### Job Status Polling (Without Webhooks)

If you prefer polling over webhooks, simply omit `webhook_config`:

```bash
# Submit a job
curl -X POST http://localhost:11235/crawl/job \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"]}'
# Response: {"task_id": "crawl_xyz"}

# Poll for status
curl http://localhost:11235/crawl/job/crawl_xyz
```

The response includes a `status` field: `"processing"`, `"completed"`, or `"failed"`.
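
A small polling loop can wrap this pattern. The sketch below is stdlib-only and follows the endpoint shape shown above; the injectable `fetch` parameter is just a convenience (an assumption for testability, not an SDK feature):

```python
import json
import time
from urllib.request import urlopen

def fetch_job(task_id: str, base_url: str = "http://localhost:11235") -> dict:
    """GET /crawl/job/{task_id} and decode the JSON body."""
    with urlopen(f"{base_url}/crawl/job/{task_id}") as resp:
        return json.loads(resp.read())

def wait_for_job(task_id: str, fetch=fetch_job, interval: float = 2.0, timeout: float = 300.0) -> dict:
    """Poll until the job status is 'completed' or 'failed', or raise on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        job = fetch(task_id)
        if job.get("status") in ("completed", "failed"):
            return job
        time.sleep(interval)  # back off between polls
    raise TimeoutError(f"Job {task_id} did not finish within {timeout}s")
```
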
#### LLM Extraction Jobs with Webhooks

The same webhook system works for LLM extraction jobs via `/llm/job`:

```bash
# Submit an LLM extraction job with a webhook
curl -X POST http://localhost:11235/llm/job \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "q": "Extract the article title, author, and main points",
    "provider": "openai/gpt-4o-mini",
    "webhook_config": {
      "webhook_url": "https://myapp.com/webhooks/llm-complete",
      "webhook_data_in_payload": true,
      "webhook_headers": {
        "X-Webhook-Secret": "your-secret-token"
      }
    }
  }'

# Response: {"task_id": "llm_1234567890"}
```

**Your webhook receives:**

```json
{
  "task_id": "llm_1234567890",
  "task_type": "llm_extraction",
  "status": "completed",
  "timestamp": "2025-10-22T12:30:00.000000+00:00",
  "urls": ["https://example.com/article"],
  "data": {
    "extracted_content": {
      "title": "Understanding Web Scraping",
      "author": "John Doe",
      "main_points": ["Point 1", "Point 2", "Point 3"]
    }
  }
}
```

**Key Differences for LLM Jobs:**

- Task type is `"llm_extraction"` instead of `"crawl"`
- Extracted data lives in `data.extracted_content`
- A single URL only (not an array)
- Supports schema-based extraction via the `schema` parameter

> 💡 **Pro tip**: See [WEBHOOK_EXAMPLES.md](./WEBHOOK_EXAMPLES.md) for detailed examples, including TypeScript client code, Flask webhook handlers, and failure handling.
---

## Metrics & Monitoring

Keep an eye on your crawler with these endpoints:

- `/health` - Quick health check
- `/metrics` - Detailed Prometheus metrics
- `/schema` - Full API schema

Example health check:

```bash
curl http://localhost:11235/health
```
---

*(Deployment Scenarios and Complete Examples sections remain the same; update links if the examples have moved.)*

---

## Server Configuration

The server's behavior can be customized through the `config.yml` file.

### Understanding config.yml

The configuration file is loaded from `/app/config.yml` inside the container. By default, the file at `deploy/docker/config.yml` in the repository is copied there during the build.

Here's a detailed breakdown of the configuration options (using the defaults from `deploy/docker/config.yml`):
```yaml
# Application Configuration
app:
  title: "Crawl4AI API"
  version: "1.0.0" # Consider setting this to match the library version, e.g., "0.5.1"
  host: "0.0.0.0"
  port: 8020 # NOTE: This port is used ONLY when running server.py directly. Gunicorn overrides it (see supervisord.conf).
  reload: False # Default is False - suitable for production
  timeout_keep_alive: 300

# Default LLM Configuration
llm:
  provider: "openai/gpt-4o-mini" # Can be overridden by the LLM_PROVIDER env var
  # api_key: sk-... # If you pass the API key directly (not recommended)

# Redis Configuration (used by the internal Redis server managed by supervisord)
redis:
  host: "localhost"
  port: 6379
  db: 0
  password: ""
  # ... other redis options ...

# Rate Limiting Configuration
rate_limiting:
  enabled: True
  default_limit: "1000/minute"
  trusted_proxies: []
  storage_uri: "memory://" # Use "redis://localhost:6379" if you need persistent/shared limits

# Security Configuration
security:
  enabled: false # Master toggle for security features
  jwt_enabled: false # Enable JWT authentication (requires security.enabled=true)
  https_redirect: false # Force HTTPS (requires security.enabled=true)
  trusted_hosts: ["*"] # Allowed hosts (use specific domains in production)
  headers: # Security headers (applied if security.enabled=true)
    x_content_type_options: "nosniff"
    x_frame_options: "DENY"
    content_security_policy: "default-src 'self'"
    strict_transport_security: "max-age=63072000; includeSubDomains"

# Crawler Configuration
crawler:
  memory_threshold_percent: 95.0
  rate_limiter:
    base_delay: [1.0, 2.0] # Min/max delay between requests in seconds for the dispatcher
  timeouts:
    stream_init: 30.0 # Timeout for stream initialization
    batch_process: 300.0 # Timeout for non-streaming /crawl processing

# Logging Configuration
logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

# Observability Configuration
observability:
  prometheus:
    enabled: True
    endpoint: "/metrics"
  health_check:
    endpoint: "/health"
```
*(The JWT Authentication section remains the same; just note that the default port for requests is now 11235.)*

*(Configuration Tips and Best Practices remain the same.)*

### Customizing Your Configuration

You can override the default `config.yml`.

#### Method 1: Modify Before Build

1. Edit the `deploy/docker/config.yml` file in your local repository clone.
2. Build the image using `docker buildx` or `docker compose --profile local-... up --build`. The modified file will be copied into the image.

#### Method 2: Runtime Mount (Recommended for Custom Deploys)

1. Create your custom configuration file locally, e.g., `my-custom-config.yml`. Ensure it contains all necessary sections.
2. Mount it when running the container:
* **Using `docker run`:**

    ```bash
    # Assumes my-custom-config.yml is in the current directory
    docker run -d -p 11235:11235 \
      --name crawl4ai-custom-config \
      --env-file .llm.env \
      --shm-size=1g \
      -v $(pwd)/my-custom-config.yml:/app/config.yml \
      unclecode/crawl4ai:latest # Or your specific tag
    ```
* **Using `docker-compose.yml`:** Add a `volumes` section to the service definition:

    ```yaml
    services:
      crawl4ai-hub-amd64: # Or your chosen service
        image: unclecode/crawl4ai:latest
        profiles: ["hub-amd64"]
        <<: *base-config
        volumes:
          # Mount the local custom config over the default one in the container
          - ./my-custom-config.yml:/app/config.yml
          # Keep the shared memory volume from base-config
          - /dev/shm:/dev/shm
    ```

    *(Note: Ensure `my-custom-config.yml` is in the same directory as `docker-compose.yml`.)*

> 💡 When mounting, your custom file *completely replaces* the default one. Make sure it is a valid and complete configuration.
### Configuration Recommendations

1. **Security First** 🔒
   - Always enable security in production
   - Use specific `trusted_hosts` instead of wildcards
   - Set up proper rate limiting to protect your server
   - Consider your environment before enabling the HTTPS redirect

2. **Resource Management** 💻
   - Adjust `memory_threshold_percent` based on available RAM
   - Set timeouts according to your content size and network conditions
   - Use Redis for rate limiting in multi-container setups

3. **Monitoring** 📊
   - Enable Prometheus if you need metrics
   - Use DEBUG logging in development, INFO in production
   - Regular health-check monitoring is crucial

4. **Performance Tuning** ⚡
   - Start with conservative rate-limiter delays
   - Increase the `batch_process` timeout for large content
   - Adjust the `stream_init` timeout based on initial response times
## Getting Help

We're here to help you succeed with Crawl4AI! Here's how to get support:

- 📖 Check our [full documentation](https://docs.crawl4ai.com)
- 🐛 Found a bug? [Open an issue](https://github.com/unclecode/crawl4ai/issues)
- 💬 Join our [Discord community](https://discord.gg/crawl4ai)
- ⭐ Star us on GitHub to show your support!

## Summary

In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:

- Building and running the Docker container
- Configuring the environment
- Using the interactive playground for testing
- Making API requests with proper typing
- Using the Python SDK
- Running asynchronous job queues with webhook notifications
- Leveraging specialized endpoints for screenshots, PDFs, and JavaScript execution
- Connecting via the Model Context Protocol (MCP)
- Monitoring your deployment

The new playground interface at `http://localhost:11235/playground` makes it much easier to test configurations and generate the corresponding JSON for API requests.

For AI application developers, the MCP integration allows tools like Claude Code to directly access Crawl4AI's capabilities without complex API handling.

Remember, the examples in the `examples` folder are your friends - they show real-world usage patterns that you can adapt to your needs.

Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀

Happy crawling! 🕷️