```markdown
# Detailed Outline for crawl4ai - deployment Component

**Target Document Type:** reasoning
**Target Output Filename Suggestion:** `llm_reasoning_deployment.md`
**Library Version Context:** 0.6.0
**Outline Generation Date:** 2025-05-24

---

## 1. Introduction to Crawl4ai Deployment

This guide focuses on deploying and configuring `crawl4ai`, whether you're using it as a simple Python library or as a robust Dockerized server. Understanding the different deployment strategies will help you choose the best approach for your specific needs, from quick local scripts to scalable, API-driven crawling services.

* 1.1. Why Different Deployment Strategies Matter
* 1.1.1. Explaining the trade-offs: Library vs. Server (Docker) mode.
* **Library Mode:**
* **Pros:** Simplest to get started with for Python developers, direct integration into existing Python projects, easier debugging of Python-specific logic.
* **Cons:** Requires Python environment setup on every machine, can be harder to manage dependencies for larger teams or across different operating systems, and ties resource management (browsers, memory) directly to the script's host.
* **Why choose it?** Ideal for individual developers, small scripts, quick prototyping, or when `crawl4ai` is a component within a larger Python application.
* **Server (Docker) Mode:**
* **Pros:** Consistent environment (Docker handles dependencies), easy to scale, API-first (accessible from any language), better resource isolation and management, simplified deployment to cloud or on-premise servers.
* **Cons:** Requires Docker knowledge, slightly more initial setup, and debugging may involve reading container logs in addition to application logs.
* **Why choose it?** Best for team collaboration, production deployments, providing crawling as a service, language-agnostic access, or when you need robust, isolated browser instances.
* 1.1.2. When to choose simple library installation.
* Choose simple library installation when:
* You are primarily working in a Python environment.
* You need to quickly integrate crawling into an existing Python script or application.
* Your deployment target is a machine where you can easily manage Python environments and Playwright browser installations.
* You are prototyping or working on a small-scale project.
* 1.1.3. When a Dockerized server deployment is beneficial (scalability, isolation, API access).
* Opt for a Dockerized server when:
* You need a consistent, reproducible crawling environment across different machines or team members.
* You plan to offer crawling capabilities as an API to other services or applications (potentially written in different languages).
* You require better resource isolation for browser instances to prevent them from impacting other processes on the host machine.
* You anticipate needing to scale your crawling operations up or down based on demand.
* You are deploying to a cloud environment or a server where Docker is the preferred deployment method.

* 1.2. Overview of Installation Paths
* 1.2.1. Quick guide to choosing your installation path based on needs.
* **For local Python development/scripting:** Start with "Core Library Installation." Add "Advanced Library Installation" if you need features like local ML model inference.
* **For a standalone, API-accessible server:** Jump to "Docker Deployment." You can choose between pre-built images (easiest), Docker Compose (good for managing related services like Redis), or manual builds (for full control).
* 1.2.2. What this guide will cover for each path.
* This guide will provide step-by-step instructions, explanations of "why" certain steps are necessary, best practices, and troubleshooting tips for both library installation and the various Docker deployment options.

## 2. Core Library Installation & Usage

This section details how to get the `crawl4ai` library up and running directly in your Python environment.

* 2.1. Understanding the Basic Installation
* 2.1.1. **How-to:** Installing the core `crawl4ai` library.
* **Command:**
```bash
pip install crawl4ai
```
* **What core functionalities this provides:**
* The `AsyncWebCrawler` class and its associated configuration objects (`BrowserConfig`, `CrawlerRunConfig`).
* Core scraping capabilities (HTML, Markdown, links, media).
* Basic content processing and filtering.
* Support for Playwright-driven browser automation.
* The `crawl4ai-setup` and `crawl4ai-doctor` CLI tools.
* 2.1.2. The Importance of Post-Installation Setup (`crawl4ai-setup`)
* **Why `crawl4ai-setup` is crucial:** `crawl4ai` relies on Playwright for browser automation. Playwright, in turn, needs browser executables (like Chromium, Firefox, WebKit) to be downloaded and installed in a location it can find. `crawl4ai-setup` automates this process.
* **What it does:**
* Invokes Playwright's browser installation mechanism (e.g., `playwright install --with-deps chromium`).
* Performs OS-specific checks to ensure necessary libraries or dependencies for running headless browsers are present (especially important on Linux).
* Sets up the local Crawl4ai home directory structure (e.g., `~/.crawl4ai/cache`).
* **Troubleshooting common `crawl4ai-setup` issues:**
* **Permission errors:** Ensure you have write permissions to the Playwright browser installation directory (often in your user's home directory or a system-wide location if installing as root).
* **Network issues:** Browser downloads can be large; ensure a stable internet connection. Proxies might interfere if not configured correctly for Playwright.
* **Missing OS dependencies (Linux):** The script attempts to guide you, but you might need to manually install packages like `libnss3`, `libatk1.0-0`, etc.
* *Code Example: Running `crawl4ai-setup` and interpreting its output.*
```bash
crawl4ai-setup
```
**Expected Output (Success):**
```
[INIT] Running post-installation setup...
[SETUP] Playwright browser installation complete.
[COMPLETE] Post-installation setup completed!
```
**Potential Issue Output:**
```
[ERROR] Failed to install Playwright browsers. Please run 'playwright install --with-deps' manually.
```
* 2.1.3. Diagnosing Your Environment with `crawl4ai-doctor`
* **When and why to use `crawl4ai-doctor`:** Run this command if you encounter issues after installation, or if crawls are failing unexpectedly. It performs a series of checks to verify that your Python environment, Playwright installation, and browser executables are correctly set up and accessible.
* **Interpreting `crawl4ai-doctor` output for common problems:**
* It will check Python version compatibility.
* It verifies if Playwright is installed and if browsers can be launched.
* It might suggest solutions for common issues it detects.
* *Code Example: Running `crawl4ai-doctor` and typical successful/problematic outputs.*
```bash
crawl4ai-doctor
```
**Expected Output (Success):**
```
[INIT] Running Crawl4ai health check...
[INFO] Python version: 3.X.X
[INFO] Playwright version: X.Y.Z
[TEST] Testing crawling capabilities...
[COMPLETE] ✅ Crawling test passed!
Crawl4ai doctor check completed. All systems operational.
```
**Potential Issue Output:**
```
[ERROR] ❌ Test failed: Could not launch browser. Ensure Playwright browsers are installed (run 'crawl4ai-setup' or 'playwright install --with-deps chromium').
```
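
If you want to run the doctor check from a script (for example in CI), the sample outputs above suggest a simple approach: look for `[ERROR]` and `[COMPLETE]` prefixes. The sketch below assumes those prefixes are representative of the real output, which may differ between versions; `summarize_doctor_output` and `run_doctor` are hypothetical helpers, not part of `crawl4ai`.

```python
import shutil
import subprocess


def summarize_doctor_output(text: str) -> dict:
    """Classify output lines by their bracketed prefix (assumed format)."""
    lines = [line.strip() for line in text.splitlines()]
    errors = [line for line in lines if line.startswith("[ERROR]")]
    completed = any(line.startswith("[COMPLETE]") for line in lines)
    return {"ok": completed and not errors, "errors": errors}


def run_doctor() -> dict:
    """Run crawl4ai-doctor if it is on PATH and summarize what it printed."""
    exe = shutil.which("crawl4ai-doctor")
    if exe is None:
        return {"ok": False, "errors": ["crawl4ai-doctor not found on PATH"]}
    proc = subprocess.run([exe], capture_output=True, text=True)
    return summarize_doctor_output(proc.stdout + proc.stderr)
```

Calling `run_doctor()` returns a small dictionary you can turn into an exit code; treat it as a convenience on top of reading the output yourself, not a replacement for it.
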
* 2.1.4. Verifying Your Basic Installation: Your First Simple Crawl
* **Step-by-step guide:**
1. Create a new Python file (e.g., `test_crawl.py`).
2. Import necessary classes: `AsyncWebCrawler`, `BrowserConfig`, `CrawlerRunConfig` (plus `CacheMode` if you want to bypass the cache, as the example does).
3. Write an `async` function.
4. Inside the function, create a `BrowserConfig` instance (defaults are usually fine for a first test).
5. Create an `AsyncWebCrawler` instance, passing the `BrowserConfig`. Use an `async with` statement for proper resource management.
6. Create a `CrawlerRunConfig` instance (again, defaults are fine).
7. Call `crawler.arun(url="https://example.com", config=run_config)`.
8. Print a part of the result, e.g., `result.markdown.raw_markdown[:300]`.
9. Use `asyncio.run()` to execute your `async` function.
* **Expected output:** You should see the first 300 characters of the Markdown content extracted from `example.com`.
* **How to confirm success:** If the script runs without errors and prints Markdown content, your basic installation is working.
* *Code Example: A minimal Python script to crawl `example.com`.*
```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode


async def main():
    browser_cfg = BrowserConfig(headless=True)  # Keep headless for non-UI environments
    run_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)  # Bypass cache for a fresh fetch

    # The context manager starts the browser and closes it on exit
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        print("Attempting to crawl https://example.com...")
        result = await crawler.arun(url="https://example.com", config=run_cfg)
        if result.success:
            print("Crawl successful!")
            print("Markdown (first 300 chars):")
            if result.markdown:
                print(result.markdown.raw_markdown[:300])
            else:
                print("No markdown content generated.")
        else:
            print(f"Crawl failed: {result.error_message}")


if __name__ == "__main__":
    asyncio.run(main())
```

* 2.2. Advanced Library Installation: Extending Functionality
* 2.2.1. When to Consider Optional Features
* **Identifying use cases:**
* **Local Machine Learning/NLP tasks:** If you plan to use features like `CosineSimilarityFilter`, advanced `LLMContentFilter` modes that might leverage local sentence transformers, or other AI-driven text processing directly within your Python script without relying on an external LLM API for everything.
* **PyTorch-dependent features:** Some advanced filters or future AI integrations might specifically require PyTorch.
* **Hugging Face Transformers:** If you intend to use models directly from the Hugging Face Hub for tasks like summarization, classification, or custom embedding generation within your Crawl4ai workflow.
* **Understanding the additional capabilities:** These extras typically bring in libraries like `torch`, `transformers`, `scikit-learn`, and `nltk`, enabling more sophisticated local data processing and AI model inference.
* 2.2.2. **How-to:** Installing Optional Extras
* **Explaining `crawl4ai[torch]`:**
* Installs `torch` and related dependencies.
* **Why:** Necessary for features that perform local neural network inference, such as certain embedding models or advanced NLP tasks that are PyTorch-based.
* **Explaining `crawl4ai[transformer]`:**
* Installs `transformers` (from Hugging Face) and `tokenizers`.
* **Why:** Enables the use of a wide range of pre-trained transformer models for tasks like text generation, summarization, and classification directly within Crawl4ai, often in conjunction with `torch`.
* **Explaining `crawl4ai[all]`:**
* Installs all optional dependencies, including `torch`, `transformers`, `nltk`, `scikit-learn`, `PyPDF2`, etc.
* **When to use:** If you anticipate needing a broad range of features and don't mind a larger installation footprint. Convenient for development environments.
* **Potential downsides:** Significantly larger installation size and more dependencies to manage, which might increase the chance of conflicts in complex environments.
* *Code Example: `pip install crawl4ai[torch]` and `pip install crawl4ai[all]`.*
```bash
# For PyTorch related features
pip install crawl4ai[torch]

# For Hugging Face Transformers related features
pip install crawl4ai[transformer]

# To install all optional features
pip install crawl4ai[all]
```
* 2.2.3. Pre-fetching Models with `crawl4ai-download-models`
* **Why pre-fetch models:**
* **Offline use:** Allows features dependent on these models (e.g., certain embedding generators, classifiers) to run without an internet connection after initial download.
* **Faster startup:** Avoids download time on the first run of a script that uses these models.
* **Controlled environment:** Ensures you have the specific model versions Crawl4ai expects.
* **How to use the command:**
```bash
crawl4ai-download-models
```
* **Where models are stored:** Typically in a cache directory managed by the underlying libraries (e.g., Hugging Face's cache, usually in `~/.cache/huggingface`). `crawl4ai-download-models` simply triggers the download process via these libraries.

## 3. Docker Deployment: Running Crawl4ai as a Server

Deploying Crawl4ai with Docker provides a consistent, isolated, and scalable environment, making it ideal for production or when offering crawling as an API service.

* 3.1. Why Deploy Crawl4ai with Docker?
* 3.1.1. **Benefits:**
* **Isolation:** Browser instances and dependencies are contained within the Docker image, preventing conflicts with your host system or other applications.
* **Reproducibility:** Ensures that Crawl4ai runs the same way across different environments (development, staging, production).
* **Scalability:** Docker containers can be easily scaled up or down using orchestration tools like Kubernetes or Docker Swarm.
* **API-first Access:** Exposes Crawl4ai's functionality via a REST API, allowing applications written in any language to utilize its crawling capabilities.
* 3.1.2. Common use cases for a Dockerized Crawl4ai server.
* Providing a centralized crawling service for multiple applications or teams.
* Integrating Crawl4ai into non-Python microservices architectures.
* Deploying to cloud platforms that favor containerized applications.
* Ensuring consistent browser behavior and dependency management for critical crawling tasks.

* 3.2. Prerequisites for Docker Deployment
* 3.2.1. **Docker:** Ensure Docker Desktop (for Windows/Mac) or Docker Engine (for Linux) is installed and the Docker daemon is running.
* *Decision:* If you're new to Docker, visit the official Docker website for installation instructions specific to your OS.
* 3.2.2. **Git:** Required if you plan to build the Docker image locally from the source code or use Docker Compose with a local repository clone.
* *Decision:* If you only intend to use pre-built images from Docker Hub, Git might not be strictly necessary on the deployment machine, but it's good practice for managing configurations.
* 3.2.3. **RAM Requirements:** Web browsers, especially multiple concurrent instances, can be memory-intensive.
* **Guidance:**
* Minimum: At least 2GB RAM for the Docker container itself, plus additional RAM per concurrent browser page (roughly 250-500MB per page, depending on the sites you crawl).
* A common starting point for a server expected to handle a few concurrent crawls might be 4GB-8GB total allocated to Docker.
* Monitor your container's memory usage (`docker stats <container_id>`) and adjust resources as needed. Insufficient RAM can lead to browser crashes or slow performance.
* Remember to set `--shm-size` (shared memory size) in your `docker run` command (e.g., `--shm-size=1g`), because Chromium-based browsers use shared memory heavily. The `docker-compose.yml` already includes a `/dev/shm` mount.

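
The RAM guidance above is simple arithmetic: a fixed base plus a per-page cost. Here is a rough sketch of that estimate. The default figures come from the guidance (2GB base, 250-500MB per page); the function name is illustrative, and real usage depends heavily on the pages being crawled, so treat the result as a starting point to check against `docker stats`.

```python
def estimate_container_ram_mb(concurrent_pages: int,
                              per_page_mb: int = 500,
                              base_mb: int = 2048) -> int:
    """Rough RAM estimate: fixed base plus a per-page allowance."""
    if concurrent_pages < 0:
        raise ValueError("concurrent_pages must be non-negative")
    return base_mb + concurrent_pages * per_page_mb


# Print a low/high range for a few concurrency levels
for pages in (1, 4, 8):
    low = estimate_container_ram_mb(pages, per_page_mb=250)
    high = estimate_container_ram_mb(pages, per_page_mb=500)
    print(f"{pages} concurrent pages: {low}-{high} MB")
```

For four to eight concurrent pages this lands in the same 4GB-8GB range the guidance suggests.
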
* 3.3. Docker Installation Options: A Decision Guide
* 3.3.1. Option 1: Using Pre-built Images from Docker Hub
* **When to use:** This is the **easiest and quickest** way to get started if you don't need custom modifications to the Crawl4ai server image. It's ideal for standard use cases and trying out the server.
* **How-to:**
* **Pulling the image:**
```bash
docker pull unclecode/crawl4ai:latest # For the latest stable release
# Or, for a specific version (recommended for production):
docker pull unclecode/crawl4ai:0.6.0
```
* **Understanding Docker Hub Tags:**
* `latest`: Points to the most recent stable release. Use with caution in production as it can change unexpectedly.
* Specific versions (e.g., `0.6.0`): Recommended for production to ensure reproducibility and avoid breaking changes.
* Release candidates (e.g., `0.6.0-rc1`): Pre-release builds that are nearly stable.
* `dev`: Development builds from the `main` branch, potentially unstable.
* **Decision:** For production, always pin to a specific version tag. Use `latest` for quick tests or when you always want the newest features and are prepared for potential changes.
* **Setting up the environment:** Create a `.llm.env` file in your current directory to store API keys for LLM providers if you plan to use LLM-based extraction or filtering features.
* *Example `.llm.env` content:*
```env
OPENAI_API_KEY=sk-yourOpenAiApiKeyxxxxxxxxxxxx
ANTHROPIC_API_KEY=sk-ant-yourAnthropicApiKeyxxxxxxxx
GEMINI_API_TOKEN=yourGoogleAIGeminiApiKeyxxxxxxxx
# Add other LLM provider keys as needed
```
* **Running the container (basic, without LLM support):**
```bash
docker run -d -p 11235:11235 --name crawl4ai-server --shm-size=1g unclecode/crawl4ai:0.6.0
```
* `-d`: Run in detached mode (background).
* `-p 11235:11235`: Map port 11235 on your host to port 11235 in the container.
* `--name crawl4ai-server`: Assign a name to the container for easier management.
* `--shm-size=1g`: Allocate 1GB of shared memory, crucial for browser stability.
* **Running with LLM support (passing `.llm.env` via `--env-file`):**
```bash
docker run -d -p 11235:11235 --name crawl4ai-server --shm-size=1g --env-file .llm.env unclecode/crawl4ai:0.6.0
```
* **Stopping and removing the container:**
```bash
docker stop crawl4ai-server
docker rm crawl4ai-server
```
* **Best practices:**
* Always use specific version tags in production.
* Manage API keys securely using `.env` files or Docker secrets, not by hardcoding them into run commands or Dockerfiles.
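
Before starting a container with `--env-file .llm.env`, it can help to confirm that the file actually defines the keys you expect. The sketch below is a simplified reader for `KEY=VALUE` lines with `#` comments. It is not Docker's own parser, and both helper names are made up for illustration.

```python
def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(env: dict, required: list) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


# Sample content with placeholder values; read your real file with open(".llm.env")
sample = """\
OPENAI_API_KEY=sk-example
# Add other LLM provider keys as needed
ANTHROPIC_API_KEY=
"""
env = parse_env_text(sample)
print(missing_keys(env, ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]))
```

An empty list means every key you asked for has a value; the check says nothing about whether the keys themselves are valid.
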
* 3.3.2. Option 2: Using Docker Compose
* **When to use:**
* When you want an easier way to manage the container's configuration and lifecycle.
* If you plan to run related services (e.g., a dedicated Redis instance for rate limiting or job queues) alongside Crawl4ai.
* If you need to make minor local customizations to the build process (like choosing `INSTALL_TYPE`) without managing complex `docker build` commands.
* **How-to:**
1. **Cloning the `crawl4ai` repository:**
```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
```
2. **Setting up `.llm.env`:** Create this file in the root of the cloned repository if you need LLM support (see example above).
3. **Running with Pre-built Images (default in `docker-compose.yml`):**
```bash
# This will use the image specified in docker-compose.yml (e.g., unclecode/crawl4ai:latest or a specific version)
docker-compose up -d
```
* The `docker-compose.yml` file is pre-configured to pull official images and set up necessary volumes (like `/dev/shm`).
4. **Building Images Locally with Docker Compose:**
* **When this is preferred:** If you need to build the image with specific optional features (`INSTALL_TYPE`) or enable GPU support, and you prefer the `docker-compose` workflow.
* **How:** You'll modify the `docker-compose.yml` to use the `build` context or pass build arguments via the command line.
```bash
# Example: Build with all features
docker-compose build --build-arg INSTALL_TYPE=all
docker-compose up -d

# Example: Build with GPU support (ensure Dockerfile supports this and host has NVIDIA drivers/toolkit)
# Potentially requires modifying docker-compose.yml to pass GPU runtime flags
docker-compose build --build-arg ENABLE_GPU=true
docker-compose up -d
```
*Note: The provided `docker-compose.yml` already has a `build` section, so `docker-compose build` will use it. You can uncomment/modify `args` in the `build` section of `docker-compose.yml` as well.*
5. **Stopping services:**
```bash
docker-compose down
```
* **Advantages:** Simplifies managing container configurations, volumes, and networks, especially if you add more services later.
* 3.3.3. Option 3: Manual Local Build & Run
* **When to use:**
* When you need to make significant customizations to the `Dockerfile` itself.
* For development and testing of changes to the Crawl4ai server codebase.
* If you need to build for a specific architecture not readily available as a pre-built image variant (though `buildx` helps with this).
* **How-to:**
1. **Cloning the repository:**
```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
```
2. **Setting up `.llm.env`:** Create this file in the root directory.
3. **Building with `docker buildx` (recommended for multi-arch):**
* **Understanding multi-arch builds:** `docker buildx` allows you to build images for multiple architectures (e.g., `linux/amd64` for typical Intel/AMD servers, `linux/arm64` for ARM-based servers like AWS Graviton or Raspberry Pi).
* **Passing build arguments:**
```bash
# Example: Build for amd64 and arm64, with all features, and tag it
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=false \
  -t my-custom-crawl4ai:latest \
  --push . # Use --load to load into local Docker images instead of pushing
```
* Replace `--push` with `--load` if you want to use the image locally right away. With Docker's default image store, `--load` generally handles only one platform at a time, so pass a single `--platform` value in that case.
4. **Running the locally built container:**
```bash
docker run -d -p 11235:11235 --name my-crawl4ai-server --shm-size=1g --env-file .llm.env my-custom-crawl4ai:latest
```
5. **Stopping and removing the container:**
```bash
docker stop my-crawl4ai-server
docker rm my-crawl4ai-server
```
* **Considerations:** This method gives you the most control but requires a deeper understanding of Docker image building. Build times can be longer, especially with `INSTALL_TYPE=all`.

* 3.4. Understanding Dockerfile Build Parameters (`ARG` values)
* These arguments allow you to customize the Docker image during the build process (`docker build` or `docker-compose build`).
* `C4AI_VER`:
* **Role:** Specifies the version of Crawl4ai to install if not using local source. It's used in the Dockerfile if `USE_LOCAL=false`.
* **Why change:** You might want to build an image based on a specific older version or a development tag.
* `APP_HOME`:
* **Role:** Defines the working directory inside the container (e.g., `/app`).
* **Why change:** Rarely needed unless you have specific path requirements for integrations.
* `GITHUB_REPO`, `GITHUB_BRANCH`:
* **Role:** Used when `USE_LOCAL=false` to clone Crawl4ai from a specific GitHub repository and branch.
* **Why change:** To build from your own fork, a feature branch, or a specific commit for testing.
* `USE_LOCAL`:
* **Role:** A boolean (`true` or `false`). If `true`, the Docker build uses the local source code from the directory where the `Dockerfile` resides (copied via `COPY . /tmp/project/`). If `false`, it clones from `GITHUB_REPO` and `GITHUB_BRANCH`.
* **Why change:** Set to `true` when developing and wanting to build an image with your local changes. Set to `false` for CI/CD or building from a canonical Git source.
* `PYTHON_VERSION`:
* **Role:** Specifies the base Python slim image version (e.g., `3.12`).
* **Why change:** If you need to ensure compatibility with a specific Python version for your dependencies or environment.
* `INSTALL_TYPE`:
* **Role:** Controls which optional dependencies of `crawl4ai` are installed. Options include `default` (core), `all` (all extras), `torch`, `transformer`.
* **Impact:**
* `default`: Smallest image, fewest features.
* `all`: Largest image, all features (including ML/NLP capabilities).
* `torch`/`transformer`: Intermediate size, specific ML/NLP capabilities.
* **Why change:** To tailor the image size and included features to your specific needs, avoiding unnecessary bloat.
* `ENABLE_GPU`:
* **Role:** A boolean (`true` or `false`). If `true`, the Dockerfile attempts to install GPU-related dependencies (e.g., CUDA toolkit if `TARGETARCH` is compatible).
* **Why change:** Set to `true` if you have a compatible GPU on your Docker host and want to accelerate ML tasks (like local LLM inference or embeddings) inside the container. Requires appropriate Docker runtime configuration (e.g., `--gpus all`).
* `TARGETARCH`:
* **Role:** Automatically set by Docker Buildx based on the `--platform` flag. It informs the Dockerfile about the target architecture (e.g., `amd64`, `arm64`) so it can install architecture-specific dependencies (like OpenMP for AMD64 or OpenBLAS for ARM64, or CUDA for NVIDIA GPUs on compatible architectures).
* **Why be aware:** Essential for understanding multi-arch builds and ensuring correct dependencies are installed for the target platform.
* *Guidance: Best practices for setting these arguments:*
* For development with local changes: `USE_LOCAL=true`.
* For minimal production image: `INSTALL_TYPE=default` (if no advanced features needed).
* For ML-heavy tasks on GPU hardware: `ENABLE_GPU=true`, `INSTALL_TYPE=all` (or `torch`/`transformer`).
* Always specify `C4AI_VER` or `GITHUB_BRANCH` explicitly for reproducible builds if not using `USE_LOCAL=true`.

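
If you build images from a script, the build arguments above can be assembled programmatically so that invalid values are caught before Docker runs. This is a minimal sketch: `buildx_command` is a hypothetical helper, the argument names and allowed `INSTALL_TYPE` values are the ones listed above, and the tag in the usage line is only an example.

```python
import shlex


def buildx_command(tag: str,
                   platforms=("linux/amd64",),
                   install_type: str = "default",
                   enable_gpu: bool = False,
                   use_local: bool = True,
                   push: bool = False) -> list:
    """Assemble a `docker buildx build` argument list from the build ARGs."""
    allowed = {"default", "all", "torch", "transformer"}
    if install_type not in allowed:
        raise ValueError(f"INSTALL_TYPE must be one of {sorted(allowed)}")
    return [
        "docker", "buildx", "build",
        "--platform", ",".join(platforms),
        "--build-arg", f"INSTALL_TYPE={install_type}",
        "--build-arg", f"ENABLE_GPU={'true' if enable_gpu else 'false'}",
        "--build-arg", f"USE_LOCAL={'true' if use_local else 'false'}",
        "-t", tag,
        "--push" if push else "--load",
        ".",
    ]


# Print the command instead of running it, so it can be reviewed first
print(shlex.join(buildx_command("my-custom-crawl4ai:0.6.0", install_type="all")))
```

Passing the returned list to `subprocess.run` would execute the build; printing it first keeps the step easy to audit.
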
* 3.5. Server Configuration (`config.yml`)
The `config.yml` file (located at `/app/config.yml` inside the container, and `deploy/docker/config.yml` in the source) controls various aspects of the Crawl4ai server's behavior.
* 3.5.1. Overview of `config.yml` Structure
* **`app` section:**
* **Purpose:** Configures the FastAPI/Uvicorn server.
* `host`, `port`: Network interface and port the server listens on.
* `workers`: Number of Uvicorn worker processes (for handling concurrent requests).
* **Reasoning:** Adjust `workers` based on your server's CPU cores and expected load. Setting `host` to `0.0.0.0` binds all interfaces, which makes the server reachable from outside the container.
* **`llm` section:**
* **Purpose:** Default settings for LLM integrations.
* `provider`: Default LLM provider/model (e.g., `openai/gpt-4o-mini`).
* `api_key_env`: The environment variable name from which to read the API key for the default provider (e.g., `OPENAI_API_KEY`).
* `api_key`: (Optional, discouraged) Directly embed an API key. It's better to use `api_key_env`.
* **Reasoning:** Centralizes default LLM settings. API keys should almost always be managed via environment variables for security.
* **`redis` section:**
* **Purpose:** Configuration for connecting to a Redis instance.
* Used for distributed rate limiting (if `rate_limiting.storage_uri` points to Redis) and potentially for the job queue in future versions.
* **Reasoning:** Essential for robust rate limiting in a scaled environment. If not using distributed features, default `memory://` for rate limiting is simpler.
* **`rate_limiting` section:**
* **Purpose:** Controls API rate limiting to prevent abuse.
* `enabled`: `true` or `false`.
* `default_limit`: A limit string such as `"1000/minute"`.
* `storage_uri`: `memory://` (default, per-instance) or `redis://...` (for distributed).
* **Reasoning:** Always enable in production. Adjust limits based on expected traffic and capacity.
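
To reason about a `default_limit` value such as `"1000/minute"`, it helps to convert it into an average request rate. The sketch below handles only the simple `count/unit` form shown above; the server's rate-limiting library may accept other spellings, and `parse_limit` is an illustrative helper, not part of Crawl4ai.

```python
# Seconds per supported unit (an assumed, deliberately small set)
_UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def parse_limit(limit: str) -> tuple:
    """Split 'count/unit' into (count, window length in seconds)."""
    count_text, _, unit = limit.partition("/")
    unit = unit.strip().lower()
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"unsupported unit in limit: {limit!r}")
    return int(count_text), _UNIT_SECONDS[unit]


count, window = parse_limit("1000/minute")
print(f"{count / window:.1f} requests per second on average")
```

Comparing that average against how many crawls your container can actually sustain is a quick sanity check when choosing a limit.
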
* **`security` section:**
* **Purpose:** Security-related settings.
* `enabled`: Master switch for security features below.
* `jwt_enabled`: Enable/disable JWT token authentication for API endpoints.
* `https_redirect`: If `true`, redirects HTTP to HTTPS (requires a reverse proxy like Nginx to handle SSL termination).
* `trusted_hosts`: List of allowed host headers. `["*"]` allows all, but be more specific in production.
* `headers`: Default security headers (X-Content-Type-Options, X-Frame-Options, CSP, HSTS).
* **Reasoning:** Crucial for production. `jwt_enabled` protects your API. `trusted_hosts` prevents host header attacks. Default headers provide good baseline security.
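
The security and rate-limiting settings above lend themselves to a mechanical pre-deployment check. The following sketch assumes `config.yml` has already been loaded into a plain dictionary (for example with a YAML parser) and uses the key names described above; `production_warnings` is a hypothetical helper and the host name is a placeholder.

```python
def production_warnings(config: dict) -> list:
    """Return human-readable warnings for settings that look unsafe in production."""
    warnings = []
    security = config.get("security", {})
    rate_limiting = config.get("rate_limiting", {})

    if not security.get("enabled", False):
        warnings.append("security.enabled is off, so the settings below it are inactive")
    if not security.get("jwt_enabled", False):
        warnings.append("security.jwt_enabled is off; the API accepts unauthenticated requests")
    if "*" in security.get("trusted_hosts", ["*"]):
        warnings.append("security.trusted_hosts allows any Host header")
    if not rate_limiting.get("enabled", False):
        warnings.append("rate_limiting.enabled is off")
    return warnings


# Two example configurations: a permissive development one and a stricter one
dev = {"security": {"enabled": False, "jwt_enabled": False, "trusted_hosts": ["*"]},
       "rate_limiting": {"enabled": True}}
prod = {"security": {"enabled": True, "jwt_enabled": True,
                     "trusted_hosts": ["crawler.example.com"]},
        "rate_limiting": {"enabled": True}}

for name, cfg in (("dev", dev), ("prod", prod)):
    print(name, production_warnings(cfg))
```

An empty list is not proof that a deployment is secure; it only confirms that these particular switches are set as recommended.
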
* **`crawler` section:**
* **Purpose:** Default behaviors for the crawler instances managed by the server.
* `base_config`: Default `CrawlerRunConfig` parameters if not specified in the API request.
* `memory_threshold_percent`: For `MemoryAdaptiveDispatcher`, at what system memory percentage to start throttling.
* `rate_limiter`: Default settings for the `RateLimiter` used by dispatchers.
* `pool`:
* `max_pages`: Corresponds to `GLOBAL_SEM` in `server.py`. Max concurrent browser pages server-wide.
* `idle_ttl_sec`: How long an idle browser instance remains in the pool before being cleaned up by the `janitor`.
* `browser`: Default `BrowserConfig` parameters.
* `kwargs`: Passed to Playwright's browser launch.
* `extra_args`: Additional browser command-line flags.
* **Reasoning:** Fine-tune these based on server resources and crawling needs. `max_pages` is critical for stability. `idle_ttl_sec` balances responsiveness with resource conservation.
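
To see why a server-wide page cap such as `pool.max_pages` matters for stability, it can help to picture it as a semaphore around each page task. The sketch below illustrates that general pattern with `asyncio.Semaphore` and simulated work. It is not the server's implementation, and `crawl_with_cap` and `fake_page` are made-up names.

```python
import asyncio


async def crawl_with_cap(urls, max_pages):
    """Run one simulated page task per URL, never more than max_pages at once."""
    sem = asyncio.Semaphore(max_pages)
    active = 0  # pages currently "open"
    peak = 0    # highest concurrency observed

    async def fake_page(url):
        nonlocal active, peak
        async with sem:  # waits here while max_pages tasks are already running
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for real browser work
            active -= 1
        return url

    results = await asyncio.gather(*(fake_page(u) for u in urls))
    return results, peak


urls = [f"https://example.com/page/{i}" for i in range(10)]
results, peak = asyncio.run(crawl_with_cap(urls, max_pages=3))
print(f"crawled {len(results)} pages, peak concurrency {peak}")
```

All ten tasks complete, but the observed peak never exceeds the cap, which is what keeps memory use bounded when many requests arrive at once.
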
* **`logging` section:**
* **Purpose:** Controls server-side logging.
* `level`: `INFO`, `DEBUG`, `WARNING`, `ERROR`.
* `format`: Log message format.
* **Reasoning:** Set to `DEBUG` for detailed troubleshooting, `INFO` for general production logs.
* **`observability` section:**
* **Purpose:** Endpoints for monitoring.
* `prometheus.endpoint`: Path for Prometheus metrics (e.g., `/metrics`).
* `health_check.endpoint`: Path for health checks (e.g., `/health`).
* **Reasoning:** Essential for production monitoring and integration with alerting systems.
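
A health endpoint is also handy in deployment scripts that need to wait for the container to come up before sending work. Here is a small polling sketch using only the standard library. It assumes the health path is `/health` as in the example above and that the server answers with HTTP 200 when ready; both helper names are illustrative.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until(probe, timeout_s: float = 30.0, interval_s: float = 1.0) -> bool:
    """Call probe() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example call (performs network requests, so it is left commented out):
# ready = wait_until(lambda: http_ok("http://localhost:11235/health"))
```

Separating the probe from the retry loop keeps the loop reusable, for instance to wait on the metrics endpoint instead.
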
* 3.5.2. Securing Your Server: JWT Authentication
|
|
* **Why enable JWT authentication:** To protect your Crawl4ai server API from unauthorized access, especially if it's exposed to the internet or a shared network.
|
|
* **How to enable:** In `config.yml`, under the `security` section, set `jwt_enabled: true`.
|
|
* **Impact on API requests:** Most API endpoints (those decorated with `Depends(token_dep)`) will require an `Authorization: Bearer <your_jwt_token>` header.
|
|
* **Generating tokens via the `/token` endpoint:**
|
|
* The `/token` endpoint itself is *not* protected by JWT.
|
|
* You send a POST request with an email (currently, any email in a valid format works, but domain verification can be configured for more robust auth if needed for other systems; for Crawl4ai's purpose, the token is the primary gate).
|
|
* The server responds with an access token.
|
|
* *Example: Requesting a token with `curl`.*
|
|
```bash
|
|
curl -X POST "http://localhost:11235/token" \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"email": "user@example.com"}'
|
|
```
|
|
**Expected Response:**
|
|
```json
|
|
{
|
|
"email": "user@example.com",
|
|
"access_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
|
|
"token_type": "bearer"
|
|
}
|
|
```
|
|
* *Example: Requesting a token with Python `requests`.*
|
|
```python
import requests

# /token itself is not JWT-protected, so no Authorization header is needed here
response = requests.post(
    "http://localhost:11235/token",
    json={"email": "user@example.com"}
)
if response.status_code == 200:
    token_data = response.json()
    print(f"Access Token: {token_data['access_token']}")
else:
    print(f"Error getting token: {response.text}")
```
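Once a token has been issued, protected endpoints expect it in the `Authorization` header. The following is a minimal sketch, assuming the `/token` response shape shown above. `bearer_headers` and `fetch_token` are illustrative helper names (not part of Crawl4ai), and the sketch uses only the standard library:

```python
import json
import urllib.request

def bearer_headers(access_token: str) -> dict:
    """Build the headers a JWT-protected endpoint expects."""
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {access_token}",
    }

def fetch_token(base_url: str, email: str) -> str:
    """POST the email to /token and return the access token.

    Assumes the response contains an "access_token" field, as in the
    expected response above. Raises urllib.error.HTTPError on non-2xx replies.
    """
    body = json.dumps({"email": email}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/token",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["access_token"]

# Usage (requires a running server):
# token = fetch_token("http://localhost:11235", "user@example.com")
# headers = bearer_headers(token)
```

Keeping header construction separate from the network call makes it easy to reuse the same headers for `/crawl`, `/html`, and the other protected endpoints.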
|
|
* 3.5.3. Customizing `config.yml`
|
|
* **Method 1: Modifying `config.yml` before building a local Docker image.**
|
|
* **How:** If you're building your own Docker image (Option 3.3.3 or Docker Compose with a local build context), you can directly edit the `deploy/docker/config.yml` file in your cloned repository before running `docker build` or `docker-compose build`.
|
|
* **Why:** Best if you want the custom configuration to be part of the image itself, ensuring consistency if you distribute or version the image.
|
|
* **Method 2: Mounting a custom `config.yml` at runtime.**
|
|
* **How:** Create your custom `config.yml` file on your Docker host machine. Then, when running the container, use a volume mount to replace the default `config.yml` inside the container.
|
|
* *Code Example:*
|
|
```bash
# Assuming your custom config is at ./my-custom-config.yml on the host
docker run -d -p 11235:11235 \
  --name crawl4ai-server \
  --shm-size=1g \
  --env-file .llm.env \
  -v "$(pwd)/my-custom-config.yml:/app/config.yml" \
  unclecode/crawl4ai:0.6.0
```
|
|
* **Why:** Useful for quick configuration changes without rebuilding the image, or for managing configurations separately from the image, especially if you use pre-built images.
|
|
* 3.5.4. Key Configuration Recommendations
|
|
* **Security:**
|
|
* **Always** enable `security.jwt_enabled: true` in production or shared environments.
|
|
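* If using a reverse proxy for SSL, set `security.https_redirect: true`, and check that the proxy forwards the original request scheme to the server; otherwise the redirect can loop.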
|
|
* Configure `security.trusted_hosts` to your server's domain(s) instead of `["*"]` in production.
|
|
* Review default security headers in `security.headers` and customize if needed for your security policies (e.g., a stricter Content Security Policy).
|
|
* **Resource Management:**
|
|
* Adjust `crawler.pool.max_pages` based on your server's RAM and CPU. Too high can lead to instability; too low can underutilize resources.
|
|
* Set `app.workers` (Uvicorn workers) typically to `(2 * CPU_CORES) + 1` as a starting point, but benchmark for your specific workload.
|
|
* Tune `crawler.pool.idle_ttl_sec` to balance keeping browser instances warm (higher TTL) against conserving resources (lower TTL).
|
|
* **Monitoring:**
|
|
* Ensure `observability.prometheus.enabled: true` if you use Prometheus.
|
|
* Integrate the `observability.health_check.endpoint` into your load balancer or container orchestrator health checks.
|
|
* **Performance:**
|
|
* For `rate_limiting`, use a Redis backend (`storage_uri: redis://...`) if you have multiple server instances behind a load balancer to share rate limit state. For a single instance, `memory://` is fine.
|
|
* Adjust `rate_limiting.default_limit` to a reasonable value that protects your server and downstream services without unduly restricting legitimate users.
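The sizing guidance above can be turned into a quick starting-point calculation. This is a rough sketch, not a Crawl4ai utility: the worker formula comes from the recommendation above, while the per-page memory cost (`ram_per_page_mb=250`) and the reserve (`reserve_mb=1024`) are placeholder assumptions that you should replace with figures measured on your own workload.

```python
def suggested_workers(cpu_cores: int) -> int:
    """Starting point from the guideline above: (2 * CPU_CORES) + 1."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return 2 * cpu_cores + 1

def suggested_max_pages(
    available_ram_mb: int,
    ram_per_page_mb: int = 250,   # ASSUMPTION: measure this for your pages
    reserve_mb: int = 1024,       # ASSUMPTION: headroom for OS and server process
) -> int:
    """Rough cap: RAM left after the reserve, divided by the per-page cost."""
    usable = max(available_ram_mb - reserve_mb, 0)
    return max(usable // ram_per_page_mb, 1)

if __name__ == "__main__":
    print("app.workers starting point:", suggested_workers(4))
    print("crawler.pool.max_pages starting point:", suggested_max_pages(8192))
```

Treat both numbers as a first guess to benchmark against, not as values to deploy unchanged.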
|
|
|
|
* 3.6. Interacting with the Dockerized Crawl4ai Server
|
|
* 3.6.1. The Playground Interface (`/playground`)
|
|
* **How-to:** Open your web browser and navigate to `http://localhost:11235/playground` (or your server's address and port).
|
|
* **Purpose:**
|
|
* Provides an interactive UI (Swagger/OpenAPI) to explore all available API endpoints.
|
|
* Allows you to test API calls directly from your browser.
|
|
* Shows request and response schemas, making it easy to understand payload structures.
|
|
* Helps in generating example request payloads for your own client applications.
|
|
* **Key features to explore:**
|
|
* Expand each endpoint to see its parameters, request body schema, and possible responses.
|
|
* Use the "Try it out" button to send test requests.
|
|
* Examine the "Schemas" section at the bottom to understand the structure of objects like `CrawlRequest`, `BrowserConfig`, `CrawlerRunConfig`, and `CrawlResult`.
|
|
* 3.6.2. Using the Python SDK (`Crawl4aiDockerClient`)
|
|
* **How-to:**
|
|
```python
from crawl4ai.docker_client import Crawl4aiDockerClient
import asyncio

client = Crawl4aiDockerClient(base_url="http://localhost:11235")

async def run_crawl():
    # ... (define browser_config_dict and crawler_config_dict)
    # See "Constructing JSON Configuration Payloads" below for examples
    browser_config_dict = {"type": "BrowserConfig", "params": {"headless": True}}
    crawler_config_dict = {"type": "CrawlerRunConfig", "params": {"screenshot": True}}

    results = await client.crawl(
        urls=["https://example.com"],
        browser_config=browser_config_dict,
        crawler_config=crawler_config_dict
    )
    for result in results:
        if result.success:
            print(f"Crawled {result.url}, screenshot available: {bool(result.screenshot)}")
        else:
            print(f"Failed {result.url}: {result.error_message}")

# asyncio.run(run_crawl())
```
|
|
* **Authentication with the SDK when JWT is enabled:**
|
|
* If your server has `security.jwt_enabled: true`, you'll need to authenticate the client.
|
|
* *Code Example:*
|
|
```python
# client = Crawl4aiDockerClient(base_url="http://localhost:11235")
# await client.authenticate_with_email(email="user@example.com")
# Now client will automatically include the token in subsequent requests.
# Or, if you already have a token:
# client.set_token("your_jwt_token_here")
```
|
|
*Note: The `authenticate_with_email` method is a conceptual example. The actual SDK might require you to fetch the token separately and then use `client.set_token()`.*
|
|
* **Making `crawl()` requests:**
|
|
* **Non-streaming (default):**
|
|
* **When to use:** For a small number of URLs or when you need all results before proceeding.
|
|
* **How results are returned:** The `client.crawl()` call will block until all URLs are processed, then return a list of `CrawlResult` objects.
|
|
* **Streaming (`stream=True`):**
|
|
* **Benefits:** For long-running crawls involving many URLs or when processing time per URL is high. It allows you to process results incrementally as they become available, improving responsiveness and potentially reducing memory footprint if you process and discard results immediately.
|
|
* **How to process:** The `client.crawl(..., stream=True)` will return an async generator. You iterate over it using `async for`.
|
|
* *Code Example: Python snippet demonstrating both.*
|
|
```python
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode
import asyncio

client = Crawl4aiDockerClient(base_url="http://localhost:11235")
# Assume client is authenticated if JWT is enabled server-side

browser_cfg_dict = BrowserConfig(headless=True).dump()  # Use .dump() to get the serializable dict
crawler_cfg_dict_base = CrawlerRunConfig(cache_mode=CacheMode.BYPASS).dump()

urls_to_crawl = ["https://example.com", "https://crawl4ai.com"]

async def non_streaming_example():
    print("\n--- Non-Streaming Example ---")
    results_list = await client.crawl(
        urls=urls_to_crawl,
        browser_config=browser_cfg_dict,
        crawler_config=crawler_cfg_dict_base
    )
    for result_data in results_list:  # result_data is a dict here
        print(f"Non-Streamed: {result_data.get('url')} - Success: {result_data.get('success')}")

async def streaming_example():
    print("\n--- Streaming Example ---")
    crawler_cfg_dict_stream = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=True).dump()
    async for result_data in client.crawl(  # result_data is a dict here
        urls=urls_to_crawl,
        browser_config=browser_cfg_dict,
        crawler_config=crawler_cfg_dict_stream
    ):
        print(f"Streamed: {result_data.get('url')} - Success: {result_data.get('success')}")
        # Process each result as it arrives
        if result_data.get('status') == 'completed':  # Check for final stream completion marker
            print("Stream ended.")
            break

async def main_sdk():
    # If JWT is enabled:
    # success = await client.authenticate_with_email(email="user@example.com")
    # if not success:
    #     print("SDK Authentication failed.")
    #     return

    await non_streaming_example()
    await streaming_example()

# asyncio.run(main_sdk())
```
|
|
* **Fetching API Schema with `get_schema()`:**
|
|
* **How this helps:** The schema describes the structure of `BrowserConfig` and `CrawlerRunConfig`, including all available parameters, their types, and default values. This is useful for programmatically understanding configuration options or validating your payloads.
|
|
* *Code Example:*
|
|
```python
# import json
#
# async def show_schema():
#     schema = await client.get_schema()
#     print("BrowserConfig Schema:", json.dumps(schema['browser'], indent=2))
#     print("CrawlerRunConfig Schema:", json.dumps(schema['crawler'], indent=2))
# asyncio.run(show_schema())
```
|
|
* 3.6.3. Constructing JSON Configuration Payloads
|
|
* **Understanding the `{"type": "ClassName", "params": {...}}` pattern:**
|
|
* Crawl4ai uses this pattern for serializing and deserializing configuration objects that can have different underlying implementations (strategies).
|
|
* `"type"`: The Python class name (e.g., "BrowserConfig", "CrawlerRunConfig", "LLMExtractionStrategy").
|
|
* `"params"`: A dictionary of the parameters that would be passed to the class's `__init__` method.
|
|
* **Why this pattern?** It allows the server to dynamically instantiate the correct Python configuration objects from the JSON payload sent by the client.
|
|
* **How-to:** Translate Python class initializations to JSON:
|
|
* If you have `BrowserConfig(headless=False, browser_type="firefox")` in Python.
|
|
* The JSON equivalent is:
|
|
```json
{
  "type": "BrowserConfig",
  "params": {
    "headless": false,
    "browser_type": "firefox"
  }
}
```
|
|
* Python `BrowserConfig().dump()` or `CrawlerRunConfig().dump()` methods automatically generate this correct JSON-serializable dictionary structure.
|
|
* **Common pitfalls:**
|
|
* Forgetting the `"type"` field.
|
|
* Incorrectly nesting `"params"`.
|
|
* Writing Python booleans (`True`) in hand-written JSON instead of JSON booleans (`true`). Building the payload with `dump()` and serializing it with a JSON library avoids this.
|
|
* *Example: JSON payload for a complex `CrawlRequest` for the `/crawl` endpoint.*
|
|
```json
{
  "urls": ["https://example.com/news", "https://blog.example.com"],
  "browser_config": {
    "type": "BrowserConfig",
    "params": {
      "headless": true,
      "user_agent": "MyCustomCrawler/1.0"
    }
  },
  "crawler_config": {
    "type": "CrawlerRunConfig",
    "params": {
      "screenshot": true,
      "pdf": false,
      "word_count_threshold": 50,
      "cache_mode": "bypass"
    }
  }
}
```
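When the Crawl4ai package is not installed on the client (so `dump()` is unavailable), the same `{"type": ..., "params": ...}` structure can be assembled by hand. A minimal sketch follows; `config_payload` and `crawl_request` are illustrative helper names, not part of Crawl4ai, and the parameter names are the ones used in the examples above:

```python
import json

def config_payload(type_name: str, **params) -> dict:
    """Wrap keyword arguments in the {"type": ..., "params": ...} envelope."""
    return {"type": type_name, "params": params}

def crawl_request(urls, browser_params=None, crawler_params=None) -> dict:
    """Assemble a request body for POST /crawl."""
    return {
        "urls": list(urls),
        "browser_config": config_payload("BrowserConfig", **(browser_params or {})),
        "crawler_config": config_payload("CrawlerRunConfig", **(crawler_params or {})),
    }

payload = crawl_request(
    ["https://example.com"],
    browser_params={"headless": True},
    crawler_params={"screenshot": True, "cache_mode": "bypass"},
)

# json.dumps converts Python True/False into JSON true/false,
# which sidesteps the boolean pitfall listed above.
body = json.dumps(payload)
print(body)
```

Letting a JSON library do the serialization also guarantees that `"type"` and `"params"` are always present and correctly nested.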
|
|
* 3.6.4. Direct REST API Usage
|
|
* **When to prefer direct HTTP requests:**
|
|
* When integrating Crawl4ai into applications written in languages other than Python.
|
|
* For simple, one-off requests where setting up the SDK might be overkill.
|
|
* When you need fine-grained control over HTTP headers or request timing not exposed by the SDK.
|
|
* **How-to:** Making POST requests to `/crawl` (non-streaming).
|
|
* *Example: `curl` snippet for `/crawl`.*
|
|
```bash
# Ensure you have your JWT token if security is enabled
# export C4AI_TOKEN="your_jwt_token_here"
curl -X POST "http://localhost:11235/crawl" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $C4AI_TOKEN" \
  -d '{
    "urls": ["https://example.com"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": true}},
    "crawler_config": {"type": "CrawlerRunConfig", "params": {"screenshot": false}}
  }'
```
|
|
* *Python `requests` snippet for `/crawl`.*
|
|
```python
# import requests
# import json
#
# headers = {"Content-Type": "application/json"}
# # if jwt_token: headers["Authorization"] = f"Bearer {jwt_token}"
#
# payload = {
#     "urls": ["https://example.com"],
#     "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
#     "crawler_config": {"type": "CrawlerRunConfig", "params": {"screenshot": False}}
# }
# response = requests.post("http://localhost:11235/crawl", headers=headers, json=payload)
# if response.status_code == 200:
#     print(json.dumps(response.json(), indent=2))
# else:
#     print(f"Error: {response.status_code} - {response.text}")
```
|
|
* **How-to:** Making POST requests to `/crawl/stream` (streaming).
|
|
* **Understanding NDJSON:** The server will stream back results as Newline Delimited JSON. Each line is a complete JSON object representing a `CrawlResult` for one URL, or a status update.
|
|
* *Example: `curl` for `/crawl/stream` (NDJSON output will print to terminal).*
|
|
```bash
# curl -N -X POST "http://localhost:11235/crawl/stream" \
#   -H "Content-Type: application/json" \
#   -H "Authorization: Bearer $C4AI_TOKEN" \
#   -d '{
#     "urls": ["https://example.com", "https://crawl4ai.com"],
#     "browser_config": {"type": "BrowserConfig", "params": {"headless": true}},
#     "crawler_config": {"type": "CrawlerRunConfig", "params": {"stream": true}}
#   }'
```
|
|
* *Python `requests` snippet for `/crawl/stream` and processing NDJSON.*
|
|
```python
# import requests
# import json
#
# headers = {"Content-Type": "application/json"}
# # if jwt_token: headers["Authorization"] = f"Bearer {jwt_token}"
#
# payload = {
#     "urls": ["https://example.com", "https://crawl4ai.com"],
#     "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
#     "crawler_config": {"type": "CrawlerRunConfig", "params": {"stream": True}}  # stream implicitly handled by endpoint
# }
# with requests.post("http://localhost:11235/crawl/stream", headers=headers, json=payload, stream=True) as r:
#     if r.status_code == 200:
#         for line in r.iter_lines():
#             if line:
#                 result_data = json.loads(line.decode('utf-8'))
#                 print(f"Streamed API: {result_data.get('url')} - Success: {result_data.get('success')}")
#                 if result_data.get('status') == 'completed':
#                     print("Stream ended via API.")
#                     break
#     else:
#         print(f"Error: {r.status_code} - {r.text}")
```
|
|
|
|
* 3.7. Exploring Additional API Endpoints
|
|
These endpoints provide targeted functionalities beyond general crawling.
|
|
* 3.7.1. `/html`: Generating Preprocessed HTML
|
|
* **Purpose:** Use this when you need the HTML of a page after JavaScript execution and basic sanitization (e.g., removing scripts, styles), but *before* Crawl4ai's full Markdown conversion or complex filtering. It's ideal for feeding into custom HTML parsers or schema extraction tools that expect mostly clean, rendered HTML.
|
|
* **Request structure (`HTMLRequest`):**
|
|
* `url (str)`: The URL to fetch.
|
|
* **Response structure:**
|
|
* `html (str)`: The preprocessed HTML content.
|
|
* `url (str)`: The original URL requested.
|
|
* `success (bool)`: Indicates if the operation was successful.
|
|
* *Example: Use case: You have an external tool that extracts microdata from HTML. Use `/html` to get the rendered HTML for this tool.*
|
|
```bash
# curl -X POST "http://localhost:11235/html" \
#   -H "Content-Type: application/json" \
#   -H "Authorization: Bearer $C4AI_TOKEN" \
#   -d '{"url": "https://example.com/dynamic-page"}'
```
|
|
* 3.7.2. `/screenshot`: Capturing Web Pages
|
|
* **Purpose:** To obtain a visual snapshot (PNG) of a web page as it's rendered by the browser. Useful for archiving, visual verification, or when textual content alone isn't sufficient.
|
|
* **Key parameters (`ScreenshotRequest`):**
|
|
* `url (str)`: The URL to capture.
|
|
* `screenshot_wait_for (Optional[float])`: Seconds to wait after page load *before* taking the screenshot.
|
|
* **How to use:** Essential for pages with animations, delayed content loading via JS, or elements that appear after a short interval. Set this to a few seconds (e.g., `2.0` or `5.0`) to allow such content to render.
|
|
* `output_path (Optional[str])`:
|
|
* If provided (e.g., `"./screenshots/page.png"`), the server saves the screenshot to this path *relative to the server's filesystem*. The response will contain the absolute path.
|
|
* If `null` or omitted, the screenshot is returned as a base64-encoded string in the JSON response.
|
|
* **Decision:** Use `output_path` if the server has persistent storage and you want files saved directly. Omit it if the client needs to receive and handle the image data.
|
|
* **Response structure:**
|
|
* `success (bool)`
|
|
* `path (str)`: (If `output_path` was provided) Absolute path to the saved screenshot on the server.
|
|
* `screenshot (str)`: (If `output_path` was *not* provided) Base64 encoded PNG data.
|
|
* *Example: Capturing a screenshot of a dynamic page after a 2-second delay and receiving it as base64.*
|
|
```bash
# curl -X POST "http://localhost:11235/screenshot" \
#   -H "Content-Type: application/json" \
#   -H "Authorization: Bearer $C4AI_TOKEN" \
#   -d '{"url": "https://example.com/animated-chart", "screenshot_wait_for": 2.0}'
```
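When `output_path` is omitted, the client has to decode the base64 payload itself. A minimal sketch, assuming the response fields listed above (`success`, `screenshot`); `decode_screenshot` and `save_screenshot` are illustrative helpers, not part of Crawl4ai:

```python
import base64
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first eight bytes of every PNG file

def decode_screenshot(response_json: dict) -> bytes:
    """Return raw PNG bytes from a /screenshot response that carries inline data."""
    if not response_json.get("success"):
        raise RuntimeError("screenshot request did not succeed")
    encoded = response_json.get("screenshot")
    if not encoded:
        raise ValueError("no inline image; the server may have written it to 'path' instead")
    image = base64.b64decode(encoded)
    if not image.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not look like a PNG")
    return image

def save_screenshot(response_json: dict, destination: str) -> Path:
    """Decode the inline screenshot and write it to a local file."""
    out = Path(destination)
    out.write_bytes(decode_screenshot(response_json))
    return out

# Usage, given the parsed JSON body of a /screenshot response:
# save_screenshot(response.json(), "animated-chart.png")
```

The signature check is a cheap way to catch an error page or truncated payload before it is written to disk as a `.png`.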
|
|
* 3.7.3. `/pdf`: Generating PDFs
|
|
* **Purpose:** To create a PDF document from a rendered web page. Useful for printable versions, archiving, or offline reading.
|
|
* **Key parameters (`PDFRequest`):**
|
|
* `url (str)`: The URL to convert to PDF.
|
|
* `output_path (Optional[str])`: Similar to `/screenshot`, if provided, saves the PDF to this server-side path. Otherwise, returns base64 PDF data.
|
|
* **Response structure:**
|
|
* `success (bool)`
|
|
* `path (str)`: (If `output_path` was provided) Absolute path to the saved PDF on the server.
|
|
* `pdf (str)`: (If `output_path` was *not* provided) Base64 encoded PDF data.
|
|
* *Example: Generating a PDF for documentation and saving it on the server.*
|
|
```bash
# curl -X POST "http://localhost:11235/pdf" \
#   -H "Content-Type: application/json" \
#   -H "Authorization: Bearer $C4AI_TOKEN" \
#   -d '{"url": "https://crawl4ai.com/docs", "output_path": "/app_data/pdfs/crawl4ai_docs.pdf"}'
```
|
|
* 3.7.4. `/execute_js`: Running Custom JavaScript
|
|
* **Purpose:** This is a powerful endpoint for advanced page interactions. Use it when you need to:
|
|
* Click buttons, fill forms, or trigger other UI events programmatically.
|
|
* Extract data that is only available after certain JS execution (e.g., from dynamically generated DOM elements).
|
|
* Modify the page content or state before further processing or screenshotting.
|
|
* **Key parameters (`JSEndpointRequest`):**
|
|
* `url (str)`: The URL on which to execute the scripts.
|
|
* `scripts (List[str])`: A list of JavaScript code snippets to execute in order.
|
|
* **Best practices for JS snippets:**
|
|
* Each script in the list should be an **expression that returns a value**, or an IIFE (Immediately Invoked Function Expression).
|
|
* If a script is asynchronous (e.g., involves `fetch` or `setTimeout`), it **must** return a `Promise`. Crawl4ai will `await` this promise.
|
|
* Keep snippets focused. For complex logic, consider breaking it into multiple steps in the `scripts` list.
|
|
* Be mindful of the page's context. Your script runs within the browser's environment for that page.
|
|
* *Example of a good snippet:* `async () => { await new Promise(r => setTimeout(r, 1000)); return document.title; }`
|
|
* *Example of a snippet that might not work as expected if not returning a promise for async ops:* `setTimeout(() => { console.log('done'); }, 1000);` (Crawl4ai might not wait for this).
|
|
* **Response structure:** The full `CrawlResult` object (as a JSON serializable dictionary). The results of your JS executions will be in the `js_execution_result` field of the `CrawlResult`. This field will be a dictionary where keys are `script_0`, `script_1`, etc., and values are the return values of your corresponding scripts.
|
|
* **How to access:** `response_json['results'][0]['js_execution_result']['script_0']`
|
|
* *Example: Clicking a "Load More" button and then extracting the count of new items.*
|
|
```python
# Python client example
# js_scripts = [
#     "document.querySelector('#load-more-btn').click();",
#     "async () => { await new Promise(r => setTimeout(r, 2000)); return document.querySelectorAll('.item').length; }"
# ]
# payload = {"url": "https://example.com/infinite-scroll", "scripts": js_scripts}
# # ... make request to /execute_js ...
# # new_item_count = response_data['results'][0]['js_execution_result']['script_1']
```
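Because the return values are keyed `script_0`, `script_1`, and so on, it can be convenient to flatten them back into a list that lines up with the `scripts` you sent. A small sketch, assuming the response layout described above; `js_results_in_order` is an illustrative helper, not part of Crawl4ai:

```python
def js_results_in_order(response_json: dict, result_index: int = 0) -> list:
    """Return script return values ordered by their script_N index."""
    crawl_result = response_json["results"][result_index]
    executed = crawl_result.get("js_execution_result") or {}
    # Sort numerically so that script_10 comes after script_2, not after script_1
    keys = sorted(
        (k for k in executed if k.startswith("script_")),
        key=lambda k: int(k.split("_", 1)[1]),
    )
    return [executed[k] for k in keys]

# Example with a hand-written response in the documented shape
example = {
    "results": [
        {"js_execution_result": {"script_1": 12, "script_0": None, "script_10": "x"}}
    ]
}
print(js_results_in_order(example))  # [None, 12, 'x']
```

A script that returns nothing (such as the bare `.click()` call) shows up as `None`, so the list positions still match the order of the submitted scripts.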
|
|
* 3.7.5. `/ask` (or `/library-context`): Retrieving Library Context for AI
|
|
* **Purpose:** This endpoint is designed to provide contextual information about the Crawl4ai library itself. It's intended to be used by AI assistants (like code generation copilots or RAG systems) to help them understand Crawl4ai's API, features, and documentation, enabling them to generate more accurate code snippets or provide better assistance.
|
|
* **Key parameters:**
|
|
* `context_type (str)`: `"code"`, `"doc"`, or `"all"`.
|
|
* **When to use:**
|
|
* `"code"`: For fetching relevant code snippets (functions, classes).
|
|
* `"doc"`: For fetching relevant documentation sections.
|
|
* `"all"`: For fetching both.
|
|
* `query (Optional[str])`: A natural language query or keywords. The endpoint uses BM25 (a text retrieval algorithm) to find relevant chunks of code or documentation based on this query.
|
|
* **How to formulate:** Be specific. E.g., "how to set proxy in BrowserConfig", "CrawlerRunConfig screenshot options".
|
|
* `score_ratio (float, default: 0.5)`: A value between 0.0 and 1.0. It filters results based on their BM25 score relative to the maximum possible score for the query. A higher `score_ratio` means more stringent filtering (fewer, more relevant results).
|
|
* **Understanding its impact:** Start with the default. If you get too few results, lower it. If too many irrelevant results, increase it.
|
|
* `max_results (int, default: 20)`: The maximum number of code chunks or documentation sections to return.
|
|
* **Response structure:** A JSON object containing:
|
|
* `code_results (List[Dict])`: If `context_type` includes "code". Each dict has `"text"` (the code chunk) and `"score"`.
|
|
* `doc_results (List[Dict])`: If `context_type` includes "doc". Each dict has `"text"` (the documentation chunk) and `"score"`.
|
|
* *Example: Using `/ask` to get information about `BrowserConfig` for an AI assistant.*
|
|
```bash
# curl -X GET "http://localhost:11235/ask?context_type=all&query=BrowserConfig%20proxy%20settings&max_results=3" \
#   -H "Authorization: Bearer $C4AI_TOKEN"
```
|
|
This would return code snippets and documentation sections related to proxy settings in `BrowserConfig`.
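To build intuition for how `score_ratio` and `max_results` interact, the filtering step can be sketched as below. This is not the server's implementation: it assumes the reference score is the highest score among the candidate chunks, and `filter_by_score_ratio` is a hypothetical name.

```python
def filter_by_score_ratio(scored_chunks, score_ratio=0.5, max_results=20):
    """Keep chunks scoring at least score_ratio * best score, best first.

    scored_chunks: list of {"text": str, "score": float} dicts.
    ASSUMPTION: the cutoff is relative to the best score in this list.
    """
    if not scored_chunks:
        return []
    best = max(chunk["score"] for chunk in scored_chunks)
    cutoff = best * score_ratio
    kept = [chunk for chunk in scored_chunks if chunk["score"] >= cutoff]
    kept.sort(key=lambda chunk: chunk["score"], reverse=True)
    return kept[:max_results]

chunks = [
    {"text": "proxy_config parameter", "score": 10.0},
    {"text": "headless parameter", "score": 4.0},
    {"text": "proxy rotation notes", "score": 6.0},
]
strict = filter_by_score_ratio(chunks, score_ratio=0.5)                  # cutoff 5.0
loose = filter_by_score_ratio(chunks, score_ratio=0.3, max_results=2)   # cutoff 3.0, capped at 2
print([c["score"] for c in strict], [c["score"] for c in loose])
```

Raising `score_ratio` trims the low-scoring tail first, while `max_results` caps the list regardless of how many chunks pass the cutoff.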
|
|
|
|
* 3.8. MCP (Model Context Protocol) Integration
|
|
* 3.8.1. What is MCP and Why Use It?
|
|
* **Explanation:** MCP (Model Context Protocol) is a standardized way for AI models and development tools to interact with external services and fetch context. Crawl4ai's MCP support allows AI tools that understand MCP (e.g., Anthropic's Claude Code) to directly use Crawl4ai's functionalities.
|
|
* **Benefits:**
|
|
* **Seamless Tool Integration:** AI tools can discover and use Crawl4ai's capabilities without custom API integrations for each tool.
|
|
* **Contextual Awareness:** The AI model gets structured information about what a tool can do, its parameters, and how to interpret its output.
|
|
* **Enhanced AI Assistance:** Enables AI to, for example, suggest Crawl4ai code, execute crawls, or get information from web pages directly within the development environment.
|
|
* 3.8.2. Connection Endpoints: `/mcp/sse` and `/mcp/ws`
|
|
* **SSE (Server-Sent Events - `/mcp/sse`):** A unidirectional stream from server to client; in the MCP SSE transport the client sends its own messages as separate HTTP POST requests. Simpler for many MCP use cases where the tool primarily sends a request and awaits a response or stream of updates.
|
|
* **WebSockets (`/mcp/ws`):** A bidirectional, persistent connection. More suitable for highly interactive tools or when continuous two-way communication is needed.
|
|
* **When to choose:** For most current MCP integrations (like with Claude Code), SSE is often sufficient and simpler to implement on the client-tool side.
|
|
* 3.8.3. **How-to:** Integrating with Claude Code
|
|
* The `claude mcp add` command registers an MCP-compliant service with Claude Code.
|
|
* *Example:*
|
|
```bash
# Assuming your Crawl4ai server is running locally
claude mcp add -t sse c4ai-mcp-service http://localhost:11235/mcp/sse
```
|
|
*Replace `c4ai-mcp-service` with a name of your choice for this tool in Claude.*
|
|
* **Illustrative workflow:**
|
|
1. Add the Crawl4ai MCP service to Claude Code.
|
|
2. In your code editor, you might ask Claude: "@c4ai-mcp-service Get the Markdown for example.com".
|
|
3. Claude, understanding MCP, would interact with the `/mcp/sse` endpoint, invoke the appropriate Crawl4ai tool (likely the `md` or `crawl` tool), and return the result to you in the editor.
|
|
* 3.8.4. Available MCP Tools and Their Use Cases
|
|
* The tools exposed via MCP largely mirror the additional API endpoints:
|
|
* `md`: Get Markdown content for a URL. **Use case:** Quickly summarize a page for an LLM.
|
|
* `html`: Get preprocessed HTML. **Use case:** Provide cleaner HTML to an AI for parsing.
|
|
* `screenshot`: Get a screenshot. **Use case:** Visual context for an AI, or for documentation.
|
|
* `pdf`: Get a PDF. **Use case:** Archival or providing document context.
|
|
* `execute_js`: Run JavaScript on a page. **Use case:** Interact with dynamic elements before an AI processes the page.
|
|
* `crawl`: Perform a full crawl operation. **Use case:** Comprehensive data gathering directed by an AI.
|
|
* `ask`: Query library context. **Use case:** AI asks Crawl4ai about its own capabilities to generate better code.
|
|
* 3.8.5. Testing MCP Connections and Tool Usage
|
|
* **Simple methods:**
|
|
* Use `curl` with the `-N` (no-buffering) flag for SSE to see the event stream:
|
|
```bash
# Example: open the MCP SSE stream and watch the events the server sends.
# This is a simplified conceptual test; it only shows the server-to-client side.
# curl -N -H "Content-Type: application/json" http://localhost:11235/mcp/sse
# (With the SSE transport, JSON-RPC requests such as a tool listing are sent as separate
# HTTP POSTs to the message endpoint announced in the stream, not over this connection.)
```
|
|
* Use a WebSocket client (like `wscat` or a browser's developer console) to connect to `/mcp/ws` and send JSON-RPC messages.
|
|
* The best way to test is often through an MCP-compliant client such as Claude Code.
|
|
* 3.8.6. Accessing MCP Schemas (`/mcp/schema`)
|
|
* **How this helps:** This endpoint returns a JSON schema describing all available MCP tools, their methods, parameters, and return types. This is how MCP client tools (like Claude Code) discover what Crawl4ai can do via MCP. It's crucial for the self-describing nature of MCP.
|
|
|
|
* 3.9. Monitoring Your Crawl4ai Server
|
|
* 3.9.1. Health Checks with `/health`
|
|
* **Purpose:** A simple endpoint to verify that the Crawl4ai server is running and responsive. Commonly used by load balancers, container orchestrators (like Kubernetes), or uptime monitoring services.
|
|
* **Interpreting the response:**
|
|
* A `200 OK` response with JSON like `{"status": "ok", "timestamp": ..., "version": "..."}` indicates the server is healthy.
|
|
* Any other status code or an inability to connect suggests a problem.
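A deployment script can use the same interpretation to wait for the server before sending crawl requests. The sketch below is illustrative: it assumes the `{"status": "ok", ...}` body shown above, and `is_healthy`, `wait_until_healthy`, and `http_probe` are hypothetical helper names. The probe is passed in as a callable so that the retry logic can be exercised without a running server.

```python
import json
import time
import urllib.request

def is_healthy(status_code: int, body: dict) -> bool:
    """Healthy means HTTP 200 and a body whose status field is "ok"."""
    return status_code == 200 and body.get("status") == "ok"

def wait_until_healthy(probe, attempts: int = 10, delay_sec: float = 1.0) -> bool:
    """Call probe() until it reports healthy or the attempts run out."""
    for attempt in range(attempts):
        try:
            status_code, body = probe()
            if is_healthy(status_code, body):
                return True
        except (OSError, ValueError):
            # Connection refused, HTTP error status, or a non-JSON body
            pass
        if attempt < attempts - 1:
            time.sleep(delay_sec)
    return False

def http_probe(url: str = "http://localhost:11235/health"):
    """Real probe: GET the health endpoint and parse the JSON body."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.status, json.load(response)

# Usage (requires a running server):
# if not wait_until_healthy(http_probe):
#     raise SystemExit("Crawl4ai server did not become healthy in time")
```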
|
|
* 3.9.2. Prometheus Metrics with `/metrics`
|
|
* **How to integrate:** If `observability.prometheus.enabled: true` in `config.yml` (default is true), this endpoint exposes metrics in Prometheus format. Configure your Prometheus server to scrape this endpoint.
|
|
* **Overview of important metrics (inferred from `prometheus_fastapi_instrumentator` usage):**
|
|
* Request counts, latencies, and error rates for API endpoints.
|
|
* Python process information (CPU, memory - if default instrumentator collectors are active).
|
|
* Potentially custom metrics related to crawl queue length, active browser instances, etc. (though these might need explicit addition in `server.py`).
|
|
* **Why use:** Essential for understanding server load, performance bottlenecks, error trends, and for setting up alerts.
|
|
|
|
* 3.10. Understanding the Server's Inner Workings (High-Level for Users)
|
|
Understanding these components can help you configure the server optimally and troubleshoot issues.
|
|
* 3.10.1. FastAPI Application: The Core of the Server
|
|
* **Role:** FastAPI is a modern, fast web framework for building APIs with Python. It handles incoming HTTP requests, routing, request validation, and response serialization for all Crawl4ai API endpoints.
|
|
* **Why it's used:** Its performance, ease of use, and automatic data validation/serialization make it well-suited for building robust APIs like Crawl4ai's.
|
|
* 3.10.2. Managing Browser Instances with `crawler_pool`
|
|
* **Role:** The `crawler_pool` (likely an instance of `BrowserManager` or a similar custom pool) is responsible for managing a collection of `AsyncWebCrawler` instances.
|
|
* `get_crawler`: When an API request needs a browser, this function provides an available (and potentially pre-warmed) `AsyncWebCrawler` instance from the pool. If all instances are busy, it might create a new one up to a limit, or wait.
|
|
* `close_all` and `janitor`: These are crucial for resource management.
|
|
* `close_all` is typically called on server shutdown to gracefully close all browser instances.
|
|
* The `janitor` task (referenced in `lifespan`) periodically checks for idle browser instances in the pool and closes them if they've exceeded their `idle_ttl_sec` (configured in `config.yml`).
|
|
* **Impact:** Proper pool management prevents resource leaks (e.g., too many zombie browser processes) and optimizes browser startup times by reusing instances.
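The idle-TTL sweep that the janitor performs can be sketched in a few lines. This is a conceptual model, not the server's code: `expired_keys` and `sweep` are hypothetical names, and the pool is reduced to a dictionary of last-used timestamps plus a `close` callback.

```python
import time
from typing import Callable, Dict, List, Optional

def expired_keys(
    last_used: Dict[str, float],
    idle_ttl_sec: float,
    now: Optional[float] = None,
) -> List[str]:
    """Return the keys whose last use is older than idle_ttl_sec."""
    now = time.monotonic() if now is None else now
    return [key for key, stamp in last_used.items() if now - stamp > idle_ttl_sec]

def sweep(
    last_used: Dict[str, float],
    idle_ttl_sec: float,
    close: Callable[[str], None],
    now: Optional[float] = None,
) -> List[str]:
    """Close every idle entry and forget it; returns the keys that were closed."""
    closed = expired_keys(last_used, idle_ttl_sec, now)
    for key in closed:
        close(key)          # in a real pool this would shut the browser down
        del last_used[key]
    return closed

if __name__ == "__main__":
    pool = {"browser-a": 0.0, "browser-b": 90.0}
    print(sweep(pool, idle_ttl_sec=30, close=lambda k: print("closing", k), now=100.0))
```

A periodic task would call `sweep` on a timer; a longer `idle_ttl_sec` keeps more instances alive between requests at the cost of memory.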
|
|
* 3.10.3. Capping Concurrent Pages with `GLOBAL_SEM`
|
|
* **Role:** `GLOBAL_SEM` (an `asyncio.Semaphore`) acts as a server-wide gatekeeper, limiting the total number of browser pages that can be concurrently active across all `AsyncWebCrawler` instances.
|
|
* **Why this is important:** Each browser page consumes significant memory and CPU. Without a cap, a high volume of requests could easily overwhelm the server, leading to crashes or extreme slowdowns.
|
|
* **How `crawler.pool.max_pages` in `config.yml` relates:** This configuration value directly sets the limit for `GLOBAL_SEM`.
|
|
* **Decision:** Adjust `max_pages` carefully based on your server's RAM. If you see `asyncio.TimeoutError` or tasks getting stuck waiting for the semaphore, you might have too many concurrent requests for your `max_pages` setting, or individual crawls are taking too long.
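The effect of a semaphore-based page cap is easy to demonstrate in isolation. The sketch below simulates crawls with `asyncio.sleep` and records the peak number of simultaneously "open" pages; the semaphore stands in for `GLOBAL_SEM`, and `run_with_page_cap` is an illustrative name rather than server code.

```python
import asyncio

async def run_with_page_cap(num_requests: int, max_pages: int) -> int:
    """Simulate num_requests crawls; return the peak number of concurrent pages."""
    sem = asyncio.Semaphore(max_pages)  # stands in for GLOBAL_SEM
    active = 0
    peak = 0

    async def fake_crawl() -> None:
        nonlocal active, peak
        async with sem:                  # waits here once max_pages are in use
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)    # placeholder for real page work
            active -= 1

    await asyncio.gather(*(fake_crawl() for _ in range(num_requests)))
    return peak

if __name__ == "__main__":
    # Ten requests arrive at once, but never more than three pages are open
    print(asyncio.run(run_with_page_cap(num_requests=10, max_pages=3)))
```

Requests beyond the cap are not rejected; they wait for a slot, which is why slow crawls combined with a low `max_pages` show up as latency or timeouts rather than immediate errors.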
|
|
* 3.10.4. Asynchronous Task Management (Job Router - `api.py` based)
|
|
* **Role:** For operations that can be time-consuming (like a crawl involving many URLs, or an LLM extraction that requires multiple API calls), Crawl4ai often offloads these to background tasks. This is especially true for non-streaming `/crawl` or `/llm/{url_or_task_id}` endpoints.
|
|
* The "job router" (conceptually, parts of `api.py` and `job.py`) handles:
|
|
1. Receiving the initial request.
|
|
2. Assigning a unique `task_id`.
|
|
3. Storing initial task metadata (URL, status: PENDING/PROCESSING) often in Redis.
|
|
4. Adding the actual work (e.g., `process_llm_extraction` or `handle_crawl_job`) to a FastAPI `BackgroundTasks` queue or a more robust Celery/RQ queue (if integrated).
5. Returning the `task_id` to the client immediately.
6. The client then polls a status endpoint (e.g., `/task/{task_id}`) to check progress.
7. Once the background task completes, it updates the task's status and result in Redis.
* **Role of Redis:**
* Stores task state (status, result, error).
* Can act as a message broker for task queues in more advanced setups.
* **User Interaction:** You submit a job, get a task ID, and then poll for completion. This prevents HTTP timeouts for long-running operations.
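The submit-then-poll flow above can be sketched in a few lines. This is an illustration under stated assumptions, not the server's implementation: a plain dictionary (`TASKS`) stands in for Redis, `asyncio.create_task` stands in for the background queue, and `submit`, `do_work`, and `poll` are hypothetical names.

```python
import asyncio
import uuid

TASKS: dict = {}          # in-memory stand-in for the Redis task store
_BACKGROUND: set = set()  # keeps references so pending jobs are not garbage collected


async def do_work(task_id: str, url: str) -> None:
    """Background job: mark the task as processing, then store the outcome."""
    TASKS[task_id]["status"] = "PROCESSING"
    try:
        await asyncio.sleep(0.01)  # placeholder for the real crawl or LLM call
        TASKS[task_id].update(status="COMPLETED", result=f"content of {url}")
    except Exception as exc:
        TASKS[task_id].update(status="FAILED", error=str(exc))


def submit(url: str) -> str:
    """Register a task, schedule the work, and return the task_id immediately."""
    task_id = uuid.uuid4().hex
    TASKS[task_id] = {"url": url, "status": "PENDING"}
    job = asyncio.get_running_loop().create_task(do_work(task_id, url))
    _BACKGROUND.add(job)
    job.add_done_callback(_BACKGROUND.discard)
    return task_id


async def poll(task_id: str, interval: float = 0.005, timeout: float = 1.0) -> dict:
    """Client side: re-check the task record until it reaches a terminal state."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while TASKS[task_id]["status"] not in ("COMPLETED", "FAILED"):
        if loop.time() > deadline:
            raise TimeoutError(f"task {task_id} did not finish in time")
        await asyncio.sleep(interval)
    return TASKS[task_id]


async def demo() -> dict:
    task_id = submit("https://example.com")  # returns before the work starts
    return await poll(task_id)
```

The point of the design is that `submit` never waits for the work itself, so the HTTP request that triggered it can return quickly regardless of how long the crawl takes.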
* 3.10.5. Rate Limiting and Security Middleware
        * **How `config.yml` settings are applied:** FastAPI allows "middleware" to process requests before they reach your endpoint logic and to process responses before they are returned to the client.
        * **Rate Limiting:** The `slowapi` library is used. Middleware intercepts each request and checks the client's IP (or token identity) against the configured limits (e.g., "1000/minute" from `config.yml`), with request counters kept in memory or Redis. If a limit is exceeded, it returns a `429 Too Many Requests` error.
* **Security:** Middleware like `HTTPSRedirectMiddleware` and `TrustedHostMiddleware` enforce security policies (redirecting HTTP to HTTPS, validating Host headers). Security headers are added to outgoing responses.
* **Protections offered:**
* Rate limiting: Prevents abuse and server overload.
* HTTPS redirect: Enforces secure connections.
* Trusted hosts: Mitigates host header injection attacks.
* Security headers: Protect against common web vulnerabilities like XSS, clickjacking.
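To make the rate-limiting idea concrete, here is a small fixed-window counter written in plain Python. It is a sketch only and is not how `slowapi` is implemented; `parse_limit` and `FixedWindowLimiter` are hypothetical names, and the counters live in an in-process dictionary rather than Redis.

```python
import time
from collections import defaultdict

PERIOD_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def parse_limit(spec: str) -> tuple:
    """Turn a string such as '1000/minute' into (max_requests, window_seconds)."""
    count, period = spec.split("/")
    return int(count), PERIOD_SECONDS[period.strip().lower()]


class FixedWindowLimiter:
    """Count requests per client inside fixed, non-overlapping time windows."""

    def __init__(self, spec: str, clock=time.monotonic):
        self.max_requests, self.window = parse_limit(spec)
        self.clock = clock
        # (client, window index) -> hits; old windows are never pruned in this sketch
        self.counters = defaultdict(int)

    def check(self, client_ip: str) -> int:
        """Return the HTTP status a limiter would produce: 200 or 429."""
        key = (client_ip, int(self.clock() // self.window))
        self.counters[key] += 1
        return 200 if self.counters[key] <= self.max_requests else 429
```

With a `"2/minute"` limit, the third call from the same IP inside one window yields 429, while a different IP or a later window starts from a fresh count.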
* 3.10.6. Mapping API Requests to `AsyncWebCrawler`
* 1. An HTTP request hits a FastAPI endpoint (e.g., `POST /crawl`).
* 2. FastAPI, using Pydantic, validates and parses the JSON request body into a `CrawlRequest` Pydantic model. This model contains `urls`, `browser_config` (as a dict), and `crawler_config` (as a dict).
* 3. The endpoint logic uses `BrowserConfig.load(browser_config_dict)` and `CrawlerRunConfig.load(crawler_config_dict)` to convert these dictionaries back into their respective Python configuration objects.
* 4. It then calls `await crawler_pool.get_crawler(browser_config_object)` to obtain an appropriate `AsyncWebCrawler` instance. The pool might reuse an existing compatible instance or create a new one.
* 5. Finally, `await crawler_instance.arun(url=..., config=crawler_run_config_object)` or `await crawler_instance.arun_many(...)` is called to perform the actual crawl.
* **Key takeaway:** The `{"type": ..., "params": ...}` JSON structure is crucial for the server to correctly deserialize configurations passed from clients into the Python objects `AsyncWebCrawler` expects. The `.dump()` methods on config objects are the Pythonic way to generate these serializable dicts.
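The round trip between a config object and its `{"type": ..., "params": ...}` form can be illustrated with a toy example. Everything here is an assumption made for illustration: `DemoBrowserConfig`, `REGISTRY`, `dump`, and `load` are hypothetical and do not mirror the real `BrowserConfig` or its methods. The sketch only shows why the `type` field is needed to pick the class to rebuild.

```python
from dataclasses import asdict, dataclass


@dataclass
class DemoBrowserConfig:
    """Hypothetical stand-in for a config class; not the real BrowserConfig."""
    headless: bool = True
    viewport_width: int = 1080


REGISTRY = {"DemoBrowserConfig": DemoBrowserConfig}  # type name -> class


def dump(obj) -> dict:
    """Serialize a config object into the {"type": ..., "params": ...} shape."""
    return {"type": type(obj).__name__, "params": asdict(obj)}


def load(data: dict):
    """Rebuild a config object from a {"type": ..., "params": ...} dict."""
    cls = REGISTRY[data["type"]]  # an unknown type name raises KeyError
    return cls(**data["params"])
```

Because the wire format carries the class name alongside the parameters, the receiving side can reconstruct an equal object without guessing which config class a bare dictionary was meant to be.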
## 4. Understanding Crawl4ai Versioning
Crawl4ai follows Semantic Versioning (SemVer) to help you manage updates and understand the implications of new releases.
* 4.1. Semantic Versioning (`MAJOR.MINOR.PATCH`)
* **`MAJOR` (e.g., `0.x.x` -> `1.x.x`):** Incremented for **incompatible API changes** (breaking changes). You will likely need to update your code when upgrading to a new major version.
* *Why it matters:* Pay close attention when a MAJOR version changes. Read release notes carefully.
* **`MINOR` (e.g., `0.5.x` -> `0.6.x`):** Incremented for **new functionality added in a backward-compatible manner**. Your existing code should continue to work.
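        *   *Why it matters:* You can usually upgrade minor versions safely to get new features and improvements. Note that SemVer treats `0.x` releases as initial development, where a minor bump may still introduce breaking changes, so check the release notes while Crawl4ai is pre-1.0.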
* **`PATCH` (e.g., `0.6.0` -> `0.6.1`):** Incremented for **backward-compatible bug fixes**.
* *Why it matters:* It's generally safe and recommended to apply patch updates.
    *   **Why this is important for users:** SemVer provides predictability. You can configure your dependency management (e.g., in `requirements.txt` or `pyproject.toml`) to allow automatic patch updates only (e.g., `crawl4ai~=0.6.0`, equivalent to `>=0.6.0, ==0.6.*`) or patch and minor updates (e.g., `crawl4ai~=0.6`, equivalent to `>=0.6, ==0.*`), while requiring manual intervention for major updates.
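As a quick illustration of the `MAJOR.MINOR.PATCH` rules above, the following sketch classifies the jump between two plain version strings. `parse`, `bump_kind`, and `needs_manual_review` are hypothetical helper names, and the code assumes three numeric components with no pre-release suffix.

```python
def parse(version: str) -> tuple:
    """Split a plain 'MAJOR.MINOR.PATCH' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def bump_kind(old: str, new: str) -> str:
    """Name the most significant component that differs between two versions."""
    o, n = parse(old), parse(new)
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    if n[2] != o[2]:
        return "patch"
    return "none"


def needs_manual_review(old: str, new: str, review_on=("major",)) -> bool:
    """Decide whether an upgrade should wait for a human to read the release notes."""
    return bump_kind(old, new) in review_on
```

A team that wants to be stricter can pass `review_on=("major", "minor")`, so that minor bumps are also held back for a look at the release notes.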
* 4.2. Pre-release Suffixes
Crawl4ai uses standard suffixes for pre-release versions, allowing users to test upcoming features.
* `dev` (e.g., `0.7.0.dev1`): **Development versions.** These are typically built automatically from the main development branch. They are the most cutting-edge but can be unstable and are not recommended for production.
* `a` (alpha, e.g., `0.7.0a1`): **Alpha releases.** Early previews of new major or minor versions. Features might be incomplete or buggy. Use for testing and providing early feedback.
* `b` (beta, e.g., `0.7.0b1`): **Beta releases.** Feature-set is largely complete, but the release is still undergoing testing and refinement. More stable than alpha but may still contain bugs.
* `rc` (release candidate, e.g., `0.7.0rc1`): **Release candidates.** Believed to be stable and ready for final release, pending final testing. Good for testing in staging environments.
* **Guidance on when to use pre-release versions:**
* Use `dev`, `a`, or `b` if you want to experiment with upcoming features or contribute to testing, but be prepared for instability.
* Use `rc` if you want to test the very latest potentially stable version before its official release.
* For production, always stick to stable releases (no suffix).
* To install pre-releases: `pip install crawl4ai --pre`.
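The suffix conventions listed above can be checked mechanically, for example in a deployment script that refuses to ship a pre-release. The sketch below is a simplification that only recognizes the four forms shown in this section (`.devN`, `aN`, `bN`, `rcN`); `release_stage` and `safe_for_production` are hypothetical names, and the full Python version grammar allows more combinations than this pattern accepts.

```python
import re

# MAJOR.MINOR.PATCH followed by an optional aN, bN, rcN, or .devN suffix.
_PATTERN = re.compile(r"^\d+\.\d+\.\d+(?:(a|b|rc)\d+|\.(dev)\d+)?$")
_STAGES = {
    "dev": "development",
    "a": "alpha",
    "b": "beta",
    "rc": "release candidate",
}


def release_stage(version: str) -> str:
    """Classify a version string by its pre-release suffix, if any."""
    match = _PATTERN.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    suffix = match.group(1) or match.group(2)  # None for a stable release
    return _STAGES.get(suffix, "stable")


def safe_for_production(version: str) -> bool:
    """Follow the guidance above: only suffix-free releases count as stable."""
    return release_stage(version) == "stable"
```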
## 5. Troubleshooting Common Deployment Issues
Here are some common issues you might encounter and how to approach them:
* 5.1. Library Installation Problems
* **Playwright browser download failures:**
* **Symptom:** `crawl4ai-setup` or `playwright install` fails with network errors or messages about not being able to download browsers.
* **Reasoning:** Often due to network connectivity issues, firewalls, or proxies blocking the download. Playwright needs to download browser binaries which can be large.
* **Solution:**
* Ensure stable internet connection.
* If behind a proxy, configure Playwright's proxy environment variables (`HTTP_PROXY`, `HTTPS_PROXY`).
* Try running `playwright install --with-deps chromium` (or your browser of choice) manually to see more detailed error messages.
* Check Playwright's documentation for troubleshooting browser downloads.
* **Dependency conflicts:**
* **Symptom:** `pip install crawl4ai` fails with messages about conflicting package versions.
* **Reasoning:** Your existing Python environment might have packages with versions incompatible with Crawl4ai's dependencies.
* **Solution:**
* **Best Practice:** Use a virtual environment (e.g., `venv`, `conda`) for your Crawl4ai projects to isolate dependencies.
* Examine the error messages to identify the conflicting packages and try to resolve them, perhaps by upgrading/downgrading other packages or installing Crawl4ai in a fresh environment.
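When diagnosing the installation problems above, it helps to confirm which environment `pip` is actually installing into. The sketch below uses only the standard library; `in_virtual_environment`, `installed_version`, and `environment_summary` are hypothetical helper names. Note that the `sys.prefix` comparison detects `venv`-style environments, while a conda environment is reported through the `CONDA_DEFAULT_ENV` variable instead.

```python
import os
import sys
from importlib import metadata
from typing import Optional


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_summary(packages=("crawl4ai", "playwright")) -> dict:
    """Collect the facts most useful for a dependency-conflict report."""
    return {
        "python": sys.version.split()[0],
        "prefix": sys.prefix,
        "venv": in_virtual_environment(),
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "packages": {name: installed_version(name) for name in packages},
    }
```

Printing `environment_summary()` before and after creating a fresh environment makes it easy to see whether a conflicting package came from the global site-packages.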
* 5.2. Docker Deployment Issues
* **Port conflicts:**
* **Symptom:** `docker run` or `docker-compose up` fails with an error like "port is already allocated."
* **Reasoning:** The default port for Crawl4ai (11235) is already in use by another application on your host machine.
* **Solution:**
* Stop the other application using the port.
* Map Crawl4ai to a different host port: `docker run -p <new_host_port>:11235 ...` (e.g., `-p 11236:11235`). Remember to update your client to use the new host port.
* **Incorrect environment variable setup for LLM API keys:**
* **Symptom:** LLM-dependent features (like `LLMExtractionStrategy`) fail, often with authentication errors from the LLM provider.
* **Reasoning:** The Docker container doesn't have access to the necessary API keys.
* **Solution:** Ensure you are correctly passing the `.llm.env` file when running the container (`--env-file .llm.env`) or that environment variables are set through Docker Compose or your orchestration platform. Double-check the variable names in your `.llm.env` file match what `config.yml` expects (e.g., `OPENAI_API_KEY`).
* **Memory allocation issues (`--shm-size`):**
* **Symptom:** Browsers inside Docker crash, pages fail to load with cryptic errors, or the container itself becomes unresponsive, especially under load.
* **Reasoning:** Chromium-based browsers use `/dev/shm` (shared memory) extensively. The Docker default for `/dev/shm` (often 64MB) is usually too small for multiple or complex browser tabs.
* **Solution:** Always run your Crawl4ai Docker container with an increased shared memory size. Start with `--shm-size=1g`. If issues persist, try `2g`. The `docker-compose.yml` provided in the Crawl4ai repository typically includes a volume mount for `/dev/shm` which effectively does the same.
* **Problems building local Docker images:**
* **Symptom:** `docker build` or `docker-compose build` fails.
* **Reasoning:** Could be network issues during dependency downloads, incorrect build arguments, problems with the Dockerfile syntax (if modified), or insufficient disk space.
* **Solution:**
* Check your internet connection.
* Carefully review the build arguments you're passing (`INSTALL_TYPE`, `ENABLE_GPU`, etc.).
* Examine the Docker build output for specific error messages.
* Ensure you have enough disk space.
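For the port-conflict case described in this subsection, a pre-flight check can tell you whether the default port is taken before `docker run` fails. This is a sketch using only the standard library; `port_is_free` and `pick_host_port` are hypothetical helpers, and a successful bind only shows the port was free at that instant.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # typically "address already in use"
        return True


def pick_host_port(preferred: int = 11235, attempts: int = 10,
                   host: str = "0.0.0.0") -> int:
    """Return the first free port at or above `preferred`."""
    for candidate in range(preferred, preferred + attempts):
        if port_is_free(candidate, host):
            return candidate
    raise RuntimeError("no free port found in the scanned range")
```

The returned value is the host side of the mapping, as in `-p <returned_port>:11235`; the container side stays at 11235.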
* 5.3. Server Configuration (`config.yml`) Errors
* **YAML syntax errors:**
* **Symptom:** Server fails to start, with errors related to parsing `config.yml`.
* **Reasoning:** Incorrect indentation, missing colons, or other YAML syntax issues.
* **Solution:** Use a YAML linter or validator to check your `config.yml` file. Pay close attention to indentation (spaces, not tabs).
* **Misconfigured JWT settings:**
* **Symptom:** If `jwt_enabled: true`, clients might get `401 Unauthorized` or `403 Forbidden` errors even with what seems like a correct token.
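        *   **Reasoning:** The usual causes are an expired token or a token signed with a different secret key than the one the server currently uses (Crawl4ai uses a fixed default secret unless one is supplied via an environment variable). A mismatched signing algorithm is also possible, but Crawl4ai handles the algorithm internally, so this mainly affects customized servers.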
* **Solution:** Ensure clients are sending the token correctly in the `Authorization: Bearer <token>` header. Regenerate tokens if they might have expired. For complex JWT issues, you might need to debug the token generation/validation logic if you've heavily customized the server.
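When a token that looks correct is still rejected, checking its expiry is a quick first step. The sketch below decodes the payload segment of a JWT with the standard library only. It deliberately does not verify the signature, so it is a debugging aid and not a validation routine; `jwt_claims` and `seconds_until_expiry` are hypothetical helper names.

```python
import base64
import json
import time
from typing import Optional


def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    try:
        _header, payload, _signature = token.split(".")
    except ValueError:
        raise ValueError("expected three dot-separated JWT segments") from None
    padded = payload + "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def seconds_until_expiry(token: str, now: Optional[float] = None) -> Optional[float]:
    """Return seconds left before `exp` (negative if expired), or None if absent."""
    exp = jwt_claims(token).get("exp")
    if exp is None:
        return None
    return exp - (time.time() if now is None else now)
```

A negative result means the client should request a new token; a positive result with a persistent `401` points toward a secret mismatch or a malformed `Authorization` header instead.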
* 5.4. API Interaction Problems
* **Authentication failures:**
* **Symptom:** Client receives `401` or `403` errors.
* **Reasoning:** JWT is enabled on the server, but the client is not sending a valid token, or the token has expired.
* **Solution:** Ensure your client correctly obtains a token from `/token` and includes it in the `Authorization` header for subsequent requests.
* **Incorrectly structured request payloads:**
* **Symptom:** Client receives `422 Unprocessable Entity` errors.
* **Reasoning:** The JSON payload sent to endpoints like `/crawl` does not match the expected Pydantic schema (e.g., missing required fields, incorrect data types, wrong `{"type": ..., "params": ...}` structure for configs).
        *   **Solution:** Refer to the interactive `/playground`, or to FastAPI's auto-generated Swagger UI (served at `/docs` by default), for the correct request schemas. Use the `dump()` method of `BrowserConfig` and `CrawlerRunConfig` if constructing payloads in Python to ensure correct serialization.
* **Understanding error responses from the API:**
* The API usually returns JSON error responses with a `detail` field explaining the issue. Pay attention to this field.
* HTTP status codes also provide clues (400 for bad request, 401/403 for auth, 404 for not found, 422 for validation, 500 for server errors).
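A client can combine the status code and the `detail` field into one readable message. The sketch below is a hypothetical helper (`describe_error` and `HINTS` are not part of Crawl4ai). It assumes the body is either a JSON object with a `detail` key or arbitrary text, and it handles the list-of-errors shape that FastAPI uses for `422` validation failures.

```python
import json

HINTS = {
    400: "bad request: check the overall request shape",
    401: "authentication: token missing, invalid, or expired",
    403: "authentication: token missing, invalid, or expired",
    404: "not found: check the endpoint path or task ID",
    422: "validation: payload does not match the expected schema",
}


def describe_error(status: int, body: str) -> str:
    """Combine a status-code hint with the `detail` field of a JSON error body."""
    fallback = "server error: check the server logs" if status >= 500 else "unexpected status"
    hint = HINTS.get(status, fallback)
    try:
        detail = json.loads(body).get("detail")
    except (json.JSONDecodeError, AttributeError):
        detail = None  # body was not a JSON object
    if isinstance(detail, list):
        # FastAPI validation errors arrive as a list of {"loc", "msg", "type"} items.
        detail = "; ".join(
            f"{'.'.join(map(str, item.get('loc', [])))}: {item.get('msg')}"
            for item in detail
        )
    return f"{status} ({hint})" + (f": {detail}" if detail else "")
```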
* 5.5. When to Check Server Logs
* **How to access Docker container logs:**
        ```bash
        docker logs crawl4ai-server      # Replace crawl4ai-server with your container name/ID
        docker logs -f crawl4ai-server   # To follow logs in real-time
        ```
        If using Docker Compose:
        ```bash
        docker-compose logs crawl4ai     # Assuming 'crawl4ai' is the service name in docker-compose.yml
        ```
* **What to look for:**
* Python tracebacks indicating exceptions within the server code.
* Log messages from `crawl4ai` itself (often prefixed with tags like `[CRAWLER]`, `[ERROR]`, `[CONFIG]`).
* Uvicorn/FastAPI startup messages and request logs.
* Any messages related to resource limits (memory, file descriptors).
* Playwright browser errors if they are not caught and handled by the application.
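Long logs are easier to triage if the interesting lines are pulled out first. The sketch below scans captured log text for a few of the signals listed above. The pattern list is an assumption chosen for illustration (tracebacks, an `[ERROR]` tag, and two common resource-limit messages), and `flag_log_lines` is a hypothetical helper, so adjust the patterns to what your server actually emits.

```python
import re

# Illustrative patterns only; extend to match your own log format.
INTERESTING = re.compile(
    r"Traceback \(most recent call last\)|\[ERROR\]|MemoryError|Too many open files"
)


def flag_log_lines(log_text: str) -> list:
    """Return (line_number, line) pairs for lines that match a pattern of interest."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if INTERESTING.search(line)
    ]
```

The input can be any captured text, for example the output of `docker logs` redirected to a file and read back with `Path(...).read_text()`.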
## 6. Best Practices for Deployment
* 6.1. **Choosing the Right Deployment Method:**
* **Library:** For quick scripts, Python-centric projects, or when direct integration is paramount.
* **Docker (Pre-built):** For ease of use, standard deployments, and quick server setup.
* **Docker Compose:** For managing Crawl4ai with other services (like Redis) or for simplified local builds with custom arguments.
* **Docker (Manual Build):** For full customization, development, or specific CI/CD needs.
* 6.2. **Security Considerations for Server Deployment:**
* **Always enable JWT (`security.jwt_enabled: true`)** if the server is accessible beyond your local machine.
* Use strong, unique secrets for JWT if you customize it (though Crawl4ai has a default mechanism).
* Configure `security.trusted_hosts` to specific domains in production.
* Use a reverse proxy (like Nginx or Traefik) to handle SSL/TLS termination and potentially add another layer of security (WAF, IP blocking).
* Keep API keys and sensitive configurations out of version control; use `.llm.env` or environment variables.
* 6.3. **Monitoring and Scaling Your Dockerized Server:**
* Utilize the `/health` endpoint for liveness/readiness probes in orchestrators.
* Integrate `/metrics` with Prometheus and Grafana for performance monitoring and alerting.
* Scale horizontally (more container instances) behind a load balancer for high availability and increased throughput.
* Adjust `crawler.pool.max_pages` and container resources (CPU, RAM, `--shm-size`) based on observed load and performance.
* 6.4. **Managing Dependencies and Upgrades:**
* For library usage, use virtual environments.
* For Docker, pin to specific image versions (e.g., `unclecode/crawl4ai:0.6.0`) in production to avoid unexpected updates.
* Read release notes carefully before upgrading `MAJOR` or `MINOR` versions.
* 6.5. **Leveraging Configuration for Optimal Performance and Cost-Effectiveness:**
* Use appropriate `CacheMode` settings in `CrawlerRunConfig` to avoid re-crawling unchanged content.
* Fine-tune `word_count_threshold` and content filters to process only relevant data, especially before sending to costly LLMs.
* If using LLM extraction, design efficient prompts and schemas. Consider if a simpler CSS/XPath extraction can achieve the same for some fields.
* Adjust `crawler.pool.idle_ttl_sec` to balance resource usage and browser startup latency.
## 7. Next Steps & Further Learning
With a solid understanding of deployment, you're ready to explore more advanced capabilities:
* 7.1. **Exploring Advanced Crawler Configuration (`CrawlerRunConfig`):** Dive into parameters like `js_code`, `wait_for`, various filters (`word_count_threshold`, `exclude_paths`), and media handling options.
*   7.2. **Diving Deeper into Extraction Strategies:** Learn about `LLMExtractionStrategy`, `JsonCssExtractionStrategy`, and how to build custom schemas for precise data extraction.
* 7.3. **Advanced Page Interaction Techniques:** Master the use of `js_code` for complex interactions, form submissions, and handling dynamic content that simple waits can't manage.
* 7.4. **Contributing to Crawl4ai:** If you're interested in improving Crawl4ai, check out the [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) and open issues/PRs.
This deployment guide should provide a strong foundation. Remember that the best configuration often comes from understanding your specific use case, experimenting, and monitoring performance. Happy Crawling!
```