crawl4ai/docs/md_v2/assets/llmtxt/crawl4ai_deployment_examples_content.llm.txt
2025-05-24 20:37:09 +08:00
```markdown
# Examples for `crawl4ai` - Deployment Component
**Target Document Type:** Examples Collection
**Target Output Filename Suggestion:** `llm_examples_deployment.md`
**Library Version Context:** 0.5.1-d1
**Outline Generation Date:** 2025-05-24
---
This document provides runnable code examples showcasing the diverse usage patterns and configurations of the `crawl4ai` deployment component. The examples primarily focus on interacting with the API provided by a deployed Crawl4ai instance.
## I. Introduction to Crawl4ai Deployment Examples
### 1.1. Overview of the API and common interaction patterns (e.g., using `requests` library).
The Crawl4ai deployment exposes a FastAPI backend. Most examples will use the `requests` library for synchronous calls and `httpx` for asynchronous calls to interact with these API endpoints. The base URL for a local deployment is typically `http://localhost:11235`.
```python
import requests
import httpx # For async examples later
import asyncio
import json
import time
import os
import base64
# Assume the Crawl4ai API is running locally
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
API_TOKEN = os.environ.get("CRAWL4AI_API_TOKEN") # Set if your API requires auth
def get_headers():
    if API_TOKEN:
        return {"Authorization": f"Bearer {API_TOKEN}"}
    return {}

print(f"Crawl4AI API Base URL: {BASE_URL}")
if API_TOKEN:
    print("API Token will be used for authenticated requests.")
else:
    print("No API Token found in env; assuming API does not require authentication for these examples.")

# A simple synchronous GET request
try:
    response = requests.get(f"{BASE_URL}/health")
    response.raise_for_status()  # Raises an HTTPError for bad responses (4XX or 5XX)
    print(f"Health check successful: {response.json()}")
except requests.exceptions.RequestException as e:
    print(f"Error connecting to Crawl4AI API: {e}")
    print("Please ensure the Crawl4AI Docker container or server is running.")
```
### 1.2. Note on Authentication: Brief explanation of when and how to use API tokens.
If JWT authentication is enabled in `config.yml` (via `security.jwt_enabled: true`), most API endpoints will require an `Authorization: Bearer <YOUR_TOKEN>` header. You can obtain a token from the `/token` endpoint using a whitelisted email address. The `get_headers()` helper function in the examples will attempt to use `CRAWL4AI_API_TOKEN` if set.
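As a minimal sketch of that flow (the helper names here are illustrative, not part of the library; it requires a running server with `security.jwt_enabled: true`):

```python
import os

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")

def bearer_headers(token: str) -> dict:
    """Build the Authorization header expected by JWT-protected endpoints."""
    return {"Authorization": f"Bearer {token}"}

def fetch_token(email: str) -> str:
    """POST a whitelisted email to /token and return the access token."""
    import requests  # deferred so bearer_headers() works without requests installed
    response = requests.post(f"{BASE_URL}/token", json={"email": email})
    response.raise_for_status()
    return response.json()["access_token"]

# Usage (against a live server with JWT enabled):
# token = fetch_token("user@example.com")
# requests.post(f"{BASE_URL}/md", json={"url": "https://example.com"},
#               headers=bearer_headers(token))
```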
---
## II. Docker and Docker-Compose
### 2.1. Building the Docker Image
#### 2.1.1. Example: Basic `docker build` command.
This command builds the default Docker image from the root of the `crawl4ai` repository.
```bash
# Navigate to the root of the crawl4ai repository
# cd /path/to/crawl4ai
docker build -t crawl4ai:latest .
```
#### 2.1.2. Example: Building with `INSTALL_TYPE=all` build argument.
This installs all optional dependencies, including those for advanced AI/ML features.
```bash
# Navigate to the root of the crawl4ai repository
# cd /path/to/crawl4ai
docker build --build-arg INSTALL_TYPE=all -t crawl4ai:all-features .
```
#### 2.1.3. Example: Building with `ENABLE_GPU=true` build argument (conceptual, as GPU usage is complex).
This attempts to include GPU support (e.g., CUDA toolkits) if the base image and host support it.
```bash
# Navigate to the root of the crawl4ai repository
# cd /path/to/crawl4ai
# Ensure your Docker daemon and host are configured for GPU passthrough
docker build --build-arg ENABLE_GPU=true --build-arg TARGETARCH=amd64 -t crawl4ai:gpu-amd64 .
# For ARM64 with GPU (e.g., NVIDIA Jetson), you might need specific base images or configurations.
# docker build --build-arg ENABLE_GPU=true --build-arg TARGETARCH=arm64 -t crawl4ai:gpu-arm64 .
```
**Note:** Full GPU support in Docker can be complex and depends on your host system, NVIDIA drivers, and Docker version. The `Dockerfile` provides a basic attempt.
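To actually expose the GPU to a container built this way, the host needs the NVIDIA Container Toolkit installed; a hedged sketch of the run command (image tag matches the build above, port is the default deployment port):

```bash
# Requires the NVIDIA Container Toolkit on the host
docker run -d --gpus all -p 11235:11235 crawl4ai:gpu-amd64
```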
---
### 2.2. Running with Docker Compose
#### 2.2.1. Example: Basic `docker-compose up` using the provided `docker-compose.yml`.
This starts the Crawl4ai service as defined in the `docker-compose.yml` file.
```bash
# Navigate to the directory containing docker-compose.yml
# cd /path/to/crawl4ai
docker-compose up -d
```
#### 2.2.2. Example: Overriding image tag in `docker-compose` via environment variable `TAG`.
You can specify a different image tag for the `crawl4ai` service.
```bash
# Example: Using a specific version tag
TAG=0.6.0 docker-compose up -d
# Example: Using a custom built tag
# TAG=my-custom-crawl4ai-build docker-compose up -d
```
#### 2.2.3. Example: Overriding `INSTALL_TYPE` in `docker-compose` via environment variable.
If your `docker-compose.yml` is set up to use build arguments from environment variables, you can override `INSTALL_TYPE`.
```bash
# Assuming docker-compose.yml uses INSTALL_TYPE from env for the build context:
# (The provided docker-compose.yml directly passes it as a build arg)
# If you modify docker-compose.yml to pick up an env var for INSTALL_TYPE:
# INSTALL_TYPE=all docker-compose up -d --build
```
**Note:** The provided `docker-compose.yml` directly sets `INSTALL_TYPE` in the `args` section. To make it environment-variable driven like `TAG`, you would modify the `docker-compose.yml`'s `build.args` section.
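As a sketch of that modification (a hypothetical `docker-compose.yml` fragment; adapt the service name and build context to your file), Compose variable substitution lets `INSTALL_TYPE` be read from the shell environment with a fallback:

```yaml
# Hypothetical docker-compose.yml fragment
services:
  crawl4ai:
    build:
      context: .
      args:
        INSTALL_TYPE: ${INSTALL_TYPE:-default}  # falls back to "default" when unset
```

With this change, `INSTALL_TYPE=all docker-compose up -d --build` works as described above.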
---
### 2.3. Configuration via Environment Variables & `.llm.env`
#### 2.3.1. Example: Setting `OPENAI_API_KEY` using an `.llm.env` file.
Create a `.llm.env` file in the same directory as `docker-compose.yml` or where you run the server.
```text
# Contents of .llm.env
OPENAI_API_KEY="sk-your_openai_api_key_here"
```
The `docker-compose.yml` (or server if run directly) will load this file.
#### 2.3.2. Example: Showing how to pass multiple LLM API keys via `.llm.env`.
You can add keys for various supported LLM providers.
```text
# Contents of .llm.env
OPENAI_API_KEY="sk-your_openai_api_key_here"
ANTHROPIC_API_KEY="sk-ant-your_anthropic_api_key_here"
GROQ_API_KEY="gsk_your_groq_api_key_here"
# ...and other keys supported by LiteLLM
```
---
### 2.4. Accessing the Deployed Service
#### 2.4.1. Example: Python script to perform a basic health check (`/health`) on the locally deployed service.
```python
import requests
import os
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
try:
    response = requests.get(f"{BASE_URL}/health")
    response.raise_for_status()
    data = response.json()
    print(f"Service is healthy. Version: {data.get('version')}, Timestamp: {data.get('timestamp')}")
except requests.exceptions.RequestException as e:
    print(f"Failed to connect or health check failed: {e}")
```
#### 2.4.2. Example: Accessing the API playground at `/playground`.
Open your web browser and navigate to `http://localhost:11235/playground` (or your deployed URL + `/playground`). This will show the FastAPI interactive API documentation.
---
### 2.5. Understanding Shared Memory
#### 2.5.1. Explanation: Importance of `/dev/shm` for Chromium performance and how it's configured in `docker-compose.yml`.
Chromium-based browsers (like Chrome, Edge) use `/dev/shm` (shared memory) extensively. If the default Docker limit for `/dev/shm` (often 64MB) is too small, browser instances can crash or perform poorly. The `docker-compose.yml` provided with Crawl4ai typically increases this:
```yaml
# Snippet from a typical docker-compose.yml for crawl4ai
# services:
# crawl4ai:
# # ... other configurations ...
# shm_size: '1g' # Or '2g', depending on expected load
# # Alternatively, for more flexibility but less security:
# # volumes:
# # - /dev/shm:/dev/shm
```
Setting `shm_size` or mounting `/dev/shm` directly from the host provides more shared memory, preventing common browser crashes within Docker. The `Dockerfile` also sets `ENV DEBIAN_FRONTEND=noninteractive` and browser flags like `--disable-dev-shm-usage` to mitigate some issues, but adequate shared memory is still crucial.
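When running the image directly with `docker run` instead of Compose, the equivalent is the `--shm-size` flag (the image tag and port here are illustrative defaults):

```bash
# Give the container 1 GB of shared memory for Chromium
docker run -d -p 11235:11235 --shm-size=1g crawl4ai:latest
```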
---
## III. Interacting with the Crawl4ai API Endpoints
### A. Authentication (`/token`)
#### A.1. Example: Python script to obtain an API token using a valid email.
This example assumes JWT authentication is enabled and "user@example.com" is whitelisted (this is illustrative, actual whitelisting is not part of the default config).
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
# This email domain would need to be configured as allowed in your security settings
# if verify_email_domain is used.
email_to_test = "user@example.com" # Replace with a valid email if your server uses domain verification
payload = {"email": email_to_test}
try:
    response = requests.post(f"{BASE_URL}/token", json=payload)
    if response.status_code == 200:
        token_data = response.json()
        print(f"Successfully obtained token for {email_to_test}:")
        print(json.dumps(token_data, indent=2))
        # Store this token for subsequent authenticated requests
        # API_TOKEN = token_data["access_token"]
    else:
        print(f"Failed to obtain token for {email_to_test}. Status: {response.status_code}, Response: {response.text}")
except requests.exceptions.RequestException as e:
    print(f"Error connecting to /token endpoint: {e}")
```
**Note:** The default `config.yml` has `security.jwt_enabled: false`. For this example to fully work, you would need to enable JWT and potentially configure allowed email domains.
#### A.2. Example: Python script attempting to obtain a token with an invalid email domain and handling the error.
```python
import requests
import os
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
# Assuming "invalid-domain.com" is not whitelisted. The default Crawl4AI config
# does not verify email domains for /token, but if `verify_email_domain` were
# enabled in auth.py, a 400 "Invalid email domain" response would be expected here.
payload = {"email": "test@invalid-domain.com"}
try:
    response = requests.post(f"{BASE_URL}/token", json=payload)
    if response.status_code == 400 and "Invalid email domain" in response.text:
        print(f"Correctly failed to obtain token for invalid domain: {response.text}")
    elif response.status_code == 200:
        print(f"Obtained token (unexpected if domain verification is strict): {response.json()}")
    else:
        print(f"Token request status: {response.status_code}, Response: {response.text}")
except requests.exceptions.RequestException as e:
    print(f"Error connecting to /token endpoint: {e}")
```
#### A.3. Example: Python script making an authenticated request to a protected endpoint.
This example assumes an endpoint like `/md` is protected and requires a token.
```python
import requests
import os
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
# First, obtain a token (replace with actual token for a real protected setup)
# For this example, we'll use a placeholder. If API_TOKEN is set in env, it will be used.
# If not, and the endpoint is truly protected, this will fail.
# API_TOKEN = "your_manually_obtained_token_or_from_previous_step"
headers = get_headers() # Uses API_TOKEN from environment if set
md_payload = {"url": "https://example.com"}
try:
response = requests.post(f"{BASE_URL}/md", json=md_payload, headers=headers)
if response.status_code == 200:
print("Successfully accessed protected /md endpoint.")
print(json.dumps(response.json(), indent=2, ensure_ascii=False)[:500] + "...")
elif response.status_code == 401 or response.status_code == 403:
print(f"Authentication/Authorization failed for /md: {response.status_code} - {response.text}")
print("Ensure JWT is enabled and you have a valid token if this endpoint is protected.")
else:
print(f"Request to /md failed: {response.status_code} - {response.text}")
except requests.exceptions.RequestException as e:
print(f"Error connecting to /md endpoint: {e}")
```
**Note:** By default, most Crawl4ai endpoints are not protected by JWT even if `jwt_enabled` is true, unless explicitly decorated with `Depends(token_dep)`.
---
### B. Core Crawling Endpoints
#### B.1. `/crawl` (Asynchronous Job-based Crawling via Redis)
The `/crawl` endpoint submits a job to a Redis queue. You then poll the `/task/{task_id}` endpoint to get the status and results.
##### B.1.1. Example: Submitting a single URL crawl job and getting a `task_id`.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {
    "urls": ["https://example.com"],
    # browser_config and crawler_config are optional; defaults will be used
}
try:
    response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
    response.raise_for_status()
    job_data = response.json()
    task_id = job_data.get("task_id")
    if task_id:
        print(f"Crawl job submitted successfully. Task ID: {task_id}")
        print(f"Poll status at: {BASE_URL}/task/{task_id}")
    else:
        print(f"Failed to submit job or get task_id: {job_data}")
except requests.exceptions.RequestException as e:
    print(f"Error submitting crawl job: {e}")
```
##### B.1.2. Example: Submitting multiple URLs as a single crawl job.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {
    "urls": ["https://example.com", "https://www.python.org"],
}
try:
    response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
    response.raise_for_status()
    job_data = response.json()
    task_id = job_data.get("task_id")
    if task_id:
        print(f"Multi-URL crawl job submitted. Task ID: {task_id}")
    else:
        print(f"Failed to submit job: {job_data}")
except requests.exceptions.RequestException as e:
    print(f"Error submitting multi-URL crawl job: {e}")
```
##### B.1.3. Example: Submitting a crawl job with a custom `browser_config` (e.g., headless false).
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {
    "urls": ["https://example.com"],
    "browser_config": {
        "headless": False,  # Run browser in visible mode (if the server environment supports a UI)
        "viewport_width": 800,
        "viewport_height": 600
    }
}
try:
    response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
    response.raise_for_status()
    job_data = response.json()
    task_id = job_data.get("task_id")
    if task_id:
        print(f"Crawl job with custom browser_config submitted. Task ID: {task_id}")
    else:
        print(f"Failed to submit job: {job_data}")
except requests.exceptions.RequestException as e:
    print(f"Error submitting crawl job with custom browser_config: {e}")
```
##### B.1.4. Example: Submitting a crawl job with a custom `crawler_config` (e.g., specific `word_count_threshold`).
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {
    "urls": ["https://example.com"],
    "crawler_config": {
        "word_count_threshold": 50,  # Only process content blocks with more than 50 words
        "screenshot": True  # Also take a screenshot
    }
}
try:
    response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
    response.raise_for_status()
    job_data = response.json()
    task_id = job_data.get("task_id")
    if task_id:
        print(f"Crawl job with custom crawler_config submitted. Task ID: {task_id}")
    else:
        print(f"Failed to submit job: {job_data}")
except requests.exceptions.RequestException as e:
    print(f"Error submitting crawl job with custom crawler_config: {e}")
```
##### B.1.5. Example: Submitting a job that uses a specific `CacheMode` (e.g., `BYPASS`).
`CacheMode` values are typically: "DISABLED", "ENABLED", "BYPASS", "READ_ONLY", "WRITE_ONLY".
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {
    "urls": ["https://example.com"],
    "crawler_config": {
        "cache_mode": "BYPASS"  # Force a fresh crawl: ignore the existing cache and don't write to it
    }
}
try:
    response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
    response.raise_for_status()
    job_data = response.json()
    task_id = job_data.get("task_id")
    if task_id:
        print(f"Crawl job with CacheMode.BYPASS submitted. Task ID: {task_id}")
    else:
        print(f"Failed to submit job: {job_data}")
except requests.exceptions.RequestException as e:
    print(f"Error submitting crawl job with CacheMode.BYPASS: {e}")
```
##### B.1.6. Example: Submitting a job to extract PDF content from a URL.
(This assumes the URL points directly to a PDF or the page leads to a PDF download that the crawler handles).
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
# URL of a sample PDF file
pdf_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
payload = {
    "urls": [pdf_url],
    "crawler_config": {
        # Crawl4ai should auto-detect the PDF content type and use the appropriate processor
        "pdf": True  # Explicitly enable PDF processing, though it is often auto-detected
    }
}
try:
    response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
    response.raise_for_status()
    job_data = response.json()
    task_id = job_data.get("task_id")
    if task_id:
        print(f"PDF crawl job submitted for {pdf_url}. Task ID: {task_id}")
        print(f"Poll status at: {BASE_URL}/task/{task_id}")
    else:
        print(f"Failed to submit PDF crawl job: {job_data}")
except requests.exceptions.RequestException as e:
    print(f"Error submitting PDF crawl job: {e}")
```
##### B.1.7. Example: Submitting a job to take a screenshot from a URL.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {
    "urls": ["https://example.com"],
    "crawler_config": {
        "screenshot": True,
        "screenshot_wait_for": 2  # Wait 2 seconds after page load before the screenshot
    }
}
try:
    response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
    response.raise_for_status()
    job_data = response.json()
    task_id = job_data.get("task_id")
    if task_id:
        print(f"Screenshot job submitted for example.com. Task ID: {task_id}")
        print(f"Poll status at: {BASE_URL}/task/{task_id}")
    else:
        print(f"Failed to submit screenshot job: {job_data}")
except requests.exceptions.RequestException as e:
    print(f"Error submitting screenshot job: {e}")
```
---
#### B.2. `/task/{task_id}` (Job Status and Results)
##### B.2.1. Example: Python script to poll the `/task/{task_id}` endpoint for PENDING status.
```python
import requests
import time
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
# Assume task_id is obtained from a previous /crawl request
# For this example, we'll submit a quick job first
submit_payload = {"urls": ["http://example.com/nonexistent-page-for-quick-fail-or-processing"]}
task_id = None
try:
    submit_response = requests.post(f"{BASE_URL}/crawl", json=submit_payload, headers=headers)
    submit_response.raise_for_status()
    task_id = submit_response.json().get("task_id")
except requests.exceptions.RequestException as e:
    print(f"Failed to submit initial job for polling example: {e}")

if task_id:
    print(f"Polling for task: {task_id}")
    for _ in range(5):  # Poll a few times
        try:
            status_response = requests.get(f"{BASE_URL}/task/{task_id}", headers=headers)
            status_response.raise_for_status()
            status_data = status_response.json()
            print(f"Current status: {status_data.get('status')}")
            if status_data.get('status') in ["COMPLETED", "FAILED"]:
                break
            time.sleep(2)
        except requests.exceptions.RequestException as e:
            print(f"Error polling task status: {e}")
            break
else:
    print("No task ID to poll.")
```
##### B.2.2. Example: Python script to retrieve results for a COMPLETED job.
```python
import requests
import time
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
# Submit a job that should complete successfully
submit_payload = {"urls": ["https://example.com"]}
task_id = None
try:
    submit_response = requests.post(f"{BASE_URL}/crawl", json=submit_payload, headers=headers)
    submit_response.raise_for_status()
    task_id = submit_response.json().get("task_id")
except requests.exceptions.RequestException as e:
    print(f"Failed to submit job for result retrieval example: {e}")

if task_id:
    print(f"Waiting for task {task_id} to complete...")
    while True:
        try:
            status_response = requests.get(f"{BASE_URL}/task/{task_id}", headers=headers)
            status_response.raise_for_status()
            status_data = status_response.json()
            current_status = status_data.get('status')
            print(f"Task status: {current_status}")
            if current_status == "COMPLETED":
                print("\nJob COMPLETED. Results:")
                # The 'result' field returned by /task/{task_id} is already a parsed
                # dict containing a 'results' key: a list of CrawlResult dicts, one
                # per crawled URL (this mirrors how handle_crawl_job in api.py
                # stores results in Redis).
                crawl_results_wrapper = status_data.get("result")
                if crawl_results_wrapper and "results" in crawl_results_wrapper:
                    actual_results = crawl_results_wrapper["results"]
                    for i, res_item in enumerate(actual_results):
                        print(f"\n--- Result for URL {i+1} ({res_item.get('url', 'N/A')}) ---")
                        print(f"  Success: {res_item.get('success')}")
                        print(f"  Markdown (first 100 chars): {res_item.get('markdown', {}).get('raw_markdown', '')[:100]}...")
                        if res_item.get('screenshot'):
                            print("  Screenshot captured (base64 data not printed).")
                else:
                    print(f"Unexpected result structure: {crawl_results_wrapper}")
                break
            elif current_status == "FAILED":
                print(f"\nJob FAILED. Error: {status_data.get('error')}")
                break
            time.sleep(3)  # Poll every 3 seconds
        except requests.exceptions.RequestException as e:
            print(f"Error polling task status: {e}")
            break
        except KeyboardInterrupt:
            print("\nPolling interrupted.")
            break
else:
    print("No task ID to retrieve results for.")
```
##### B.2.3. Example: Python script to get error details for a FAILED job.
```python
import requests
import time
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
# Submit a job that is likely to fail (e.g., invalid URL or one that times out quickly)
submit_payload = {"urls": ["http://nonexistentdomain1234567890.com"]}
task_id = None
try:
    submit_response = requests.post(f"{BASE_URL}/crawl", json=submit_payload, headers=headers)
    submit_response.raise_for_status()
    task_id = submit_response.json().get("task_id")
except requests.exceptions.RequestException as e:
    print(f"Failed to submit job for failure example: {e}")

if task_id:
    print(f"Waiting for task {task_id} (expected to fail)...")
    while True:
        try:
            status_response = requests.get(f"{BASE_URL}/task/{task_id}", headers=headers)
            status_response.raise_for_status()
            status_data = status_response.json()
            current_status = status_data.get('status')
            print(f"Task status: {current_status}")
            if current_status == "FAILED":
                print("\nJob FAILED as expected.")
                error_message = status_data.get('error', 'No error message provided.')
                print(f"Error details: {error_message}")
                break
            elif current_status == "COMPLETED":
                print("\nJob COMPLETED unexpectedly.")
                break
            time.sleep(2)
        except requests.exceptions.RequestException as e:
            print(f"Error polling task status: {e}")
            break
        except KeyboardInterrupt:
            print("\nPolling interrupted.")
            break
else:
    print("No task ID to check for failure.")
```
##### B.2.4. Example: Full workflow - submit job, poll status, retrieve results or error.
This combines the above examples into a more complete client script.
```python
import requests
import time
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
def submit_and_poll(payload, timeout_seconds=60):
    task_id = None
    try:
        # Submit the job
        print(f"Submitting job with payload: {payload}")
        submit_response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
        submit_response.raise_for_status()
        task_id = submit_response.json().get("task_id")
        if not task_id:
            print("Error: No task_id received.")
            return None
        print(f"Job submitted. Task ID: {task_id}. Polling for completion...")
        # Poll for status
        start_time = time.time()
        while time.time() - start_time < timeout_seconds:
            status_response = requests.get(f"{BASE_URL}/task/{task_id}", headers=headers)
            status_response.raise_for_status()
            status_data = status_response.json()
            current_status = status_data.get('status')
            print(f"  Task {task_id} status: {current_status} (elapsed: {time.time() - start_time:.1f}s)")
            if current_status == "COMPLETED":
                print(f"Task {task_id} COMPLETED.")
                return status_data.get("result")  # This should be the parsed JSON result
            elif current_status == "FAILED":
                print(f"Task {task_id} FAILED.")
                print(f"Error: {status_data.get('error')}")
                return None
            time.sleep(5)  # Poll interval
        print(f"Task {task_id} timed out after {timeout_seconds} seconds.")
        return None
    except requests.exceptions.RequestException as e:
        print(f"API request error: {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None

if __name__ == "__main__":
    crawl_payload = {
        "urls": ["https://www.python.org/about/"],
        "crawler_config": {"screenshot": False}
    }
    results_data = submit_and_poll(crawl_payload)
    if results_data and "results" in results_data:
        for i, res_item in enumerate(results_data["results"]):
            print(f"\n--- Result for URL {res_item.get('url', 'N/A')} ---")
            print(f"  Success: {res_item.get('success')}")
            print(f"  Markdown (first 200 chars): {res_item.get('markdown', {}).get('raw_markdown', '')[:200]}...")
    elif results_data:  # If the result isn't in the expected wrapper structure
        print("\nReceived result data (unexpected structure):")
        print(json.dumps(results_data, indent=2, ensure_ascii=False))
```
---
#### B.3. `/crawl/stream` (Streaming Crawl Results)
##### B.3.1. Example: Python script to stream crawl results for a single URL and process NDJSON.
```python
import requests
import json
import os
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
headers['Accept'] = 'application/x-ndjson' # Important for streaming
payload = {
    "urls": ["https://example.com"],
    "crawler_config": {"stream": True}  # Ensure stream is True in the config
}
print(f"Streaming results for {payload['urls'][0]}...")
try:
    with requests.post(f"{BASE_URL}/crawl/stream", json=payload, headers=headers, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line:
                try:
                    result_chunk = json.loads(line.decode('utf-8'))
                    if "status" in result_chunk and result_chunk["status"] == "completed":
                        print("\nStream finished.")
                        break
                    print("\nReceived chunk:")
                    # Print some key info from the chunk
                    print(f"  URL: {result_chunk.get('url', 'N/A')}")
                    print(f"  Success: {result_chunk.get('success')}")
                    if 'markdown' in result_chunk and isinstance(result_chunk['markdown'], dict):
                        print(f"  Markdown (snippet): {result_chunk['markdown'].get('raw_markdown', '')[:100]}...")
                    else:
                        print(f"  Markdown (snippet): {str(result_chunk.get('markdown', ''))[:100]}...")
                    if result_chunk.get('error_message'):
                        print(f"  Error: {result_chunk.get('error_message')}")
                except json.JSONDecodeError as e:
                    print(f"Error decoding JSON line: {e} - Line: {line.decode('utf-8')}")
except requests.exceptions.RequestException as e:
    print(f"Error during streaming request: {e}")
```
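The introduction mentioned `httpx` for asynchronous calls; a minimal async variant of the same NDJSON stream could look like the sketch below. It assumes the same `/crawl/stream` endpoint and a locally running server; `parse_ndjson_line` is a small helper introduced here, not part of the library.

```python
import asyncio
import json
import os

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")

def parse_ndjson_line(line: str):
    """Decode one NDJSON line; returns None for blank or malformed lines."""
    line = line.strip()
    if not line:
        return None
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None

async def stream_crawl(urls):
    import httpx  # deferred so parse_ndjson_line() is usable without httpx installed
    payload = {"urls": urls, "crawler_config": {"stream": True}}
    headers = {"Accept": "application/x-ndjson"}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", f"{BASE_URL}/crawl/stream",
                                 json=payload, headers=headers) as response:
            response.raise_for_status()
            async for line in response.aiter_lines():
                chunk = parse_ndjson_line(line)
                if chunk is None:
                    continue
                if chunk.get("status") == "completed":
                    print("\nStream finished.")
                    break
                print(f"Chunk for {chunk.get('url', 'N/A')}: success={chunk.get('success')}")

# Usage (against a live server):
# asyncio.run(stream_crawl(["https://example.com"]))
```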
##### B.3.2. Example: Python script to stream crawl results for multiple URLs.
```python
import requests
import json
import os
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
headers['Accept'] = 'application/x-ndjson'
payload = {
    "urls": ["https://example.com", "https://www.python.org/doc/"],
    "crawler_config": {"stream": True}
}
print("Streaming results for multiple URLs...")
try:
    with requests.post(f"{BASE_URL}/crawl/stream", json=payload, headers=headers, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line:
                try:
                    result_chunk = json.loads(line.decode('utf-8'))
                    if "status" in result_chunk and result_chunk["status"] == "completed":
                        print("\nStream finished for all URLs.")
                        break
                    print(f"\nChunk for URL: {result_chunk.get('url', 'N/A')}")
                    # Process or display part of the result
                    print(f"  Success: {result_chunk.get('success')}")
                    if 'markdown' in result_chunk and isinstance(result_chunk['markdown'], dict):
                        print(f"  Markdown (snippet): {result_chunk['markdown'].get('raw_markdown', '')[:70]}...")
                    else:
                        print(f"  Markdown (snippet): {str(result_chunk.get('markdown', ''))[:70]}...")
                except json.JSONDecodeError as e:
                    print(f"Error decoding JSON line: {e} - Line: {line.decode('utf-8')}")
except requests.exceptions.RequestException as e:
    print(f"Error during streaming request: {e}")
```
##### B.3.3. Example: Streaming crawl results with custom `browser_config` and `crawler_config`.
```python
import requests
import json
import os
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
headers['Accept'] = 'application/x-ndjson'
payload = {
    "urls": ["https://example.com"],
    "browser_config": {
        "headless": True,
        "user_agent": "Crawl4AI-Stream-Tester/1.0"
    },
    "crawler_config": {
        "stream": True,
        "word_count_threshold": 10  # Lower threshold for this example
    }
}
print(f"Streaming results with custom configs for {payload['urls'][0]}...")
try:
    with requests.post(f"{BASE_URL}/crawl/stream", json=payload, headers=headers, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line:
                result_chunk = json.loads(line.decode('utf-8'))
                if "status" in result_chunk and result_chunk["status"] == "completed":
                    print("\nStream finished.")
                    break
                print("\nReceived chunk with custom config:")
                print(f"  URL: {result_chunk.get('url')}")
                print(f"  Word count threshold was: {payload['crawler_config']['word_count_threshold']}")
                if 'markdown' in result_chunk and isinstance(result_chunk['markdown'], dict):
                    print(f"  Markdown (snippet): {result_chunk['markdown'].get('raw_markdown', '')[:70]}...")
                else:
                    print(f"  Markdown (snippet): {str(result_chunk.get('markdown', ''))[:70]}...")
except requests.exceptions.RequestException as e:
    print(f"Error during streaming request: {e}")
```
##### B.3.4. Example: Handling connection closure or errors during streaming.
```python
import requests
import json
import os
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
headers['Accept'] = 'application/x-ndjson'
payload = {
    "urls": ["https://thissitedoesnotexist12345.com", "https://example.com"],  # First URL will fail
    "crawler_config": {"stream": True}
}
print("Streaming with a URL expected to fail...")
try:
    with requests.post(f"{BASE_URL}/crawl/stream", json=payload, headers=headers, stream=True) as response:
        # We might not get a non-200 status code immediately if the connection itself is established;
        # errors for individual URLs will be part of the NDJSON stream.
        for line in response.iter_lines():
            if line:
                try:
                    result_chunk = json.loads(line.decode('utf-8'))
                    print(f"\nReceived data: {result_chunk.get('url', 'N/A')}")
                    if "status" in result_chunk and result_chunk["status"] == "completed":
                        print("Stream finished.")
                        break
                    if result_chunk.get('error_message'):
                        print(f"  ERROR for {result_chunk.get('url')}: {result_chunk.get('error_message')}")
                    elif result_chunk.get('success'):
                        print(f"  SUCCESS for {result_chunk.get('url')}")
                except json.JSONDecodeError as e:
                    print(f"  Error decoding JSON line: {e}")
except requests.exceptions.ChunkedEncodingError:
    print("Connection closed unexpectedly by server during streaming (ChunkedEncodingError).")
except requests.exceptions.RequestException as e:
    print(f"General error during streaming request: {e}")
```
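The per-line parsing logic above can be factored into a small reusable generator (the function name is mine); it yields decoded result dicts and stops at the server's terminal `{"status": "completed"}` marker:

```python
import json

def iter_ndjson_results(lines):
    """Yield decoded result dicts from an NDJSON byte stream.

    `lines` is any iterable of bytes lines (e.g. response.iter_lines()).
    Stops when the server sends its terminal {"status": "completed"} marker.
    """
    for line in lines:
        if not line:
            continue  # skip keep-alive blank lines
        chunk = json.loads(line.decode("utf-8"))
        if chunk.get("status") == "completed":
            return  # end-of-stream marker from the server
        yield chunk

# Usage with requests (sketch, assuming a running server):
# with requests.post(f"{BASE_URL}/crawl/stream", json=payload, headers=headers, stream=True) as r:
#     for result in iter_ndjson_results(r.iter_lines()):
#         print(result.get("url"), result.get("success"), result.get("error_message"))
```

This keeps the error-handling loop above focused on per-URL success/failure instead of wire-format details.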
---
### C. Content Transformation & Utility Endpoints
#### C.1. `/md` (Markdown Generation)
##### C.1.1. Example: Getting raw Markdown for a URL (`RAW` filter).
If no filter (`f`) is specified, the server defaults to `FIT`; here `RAW` is requested explicitly to get the unfiltered Markdown.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {"url": "https://example.com", "f": "RAW"} # 'f' is for filter_type
try:
response = requests.post(f"{BASE_URL}/md", json=payload, headers=headers)
response.raise_for_status()
data = response.json()
print("Markdown (RAW filter - first 300 chars):")
print(data.get("markdown", "")[:300] + "...")
except requests.exceptions.RequestException as e:
print(f"Error fetching Markdown: {e}")
```
##### C.1.2. Example: Getting Markdown using the `FIT` filter type.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {"url": "https://example.com", "f": "FIT"}
try:
response = requests.post(f"{BASE_URL}/md", json=payload, headers=headers)
response.raise_for_status()
data = response.json()
print("Markdown (FIT filter - first 300 chars):")
print(data.get("markdown", "")[:300] + "...")
except requests.exceptions.RequestException as e:
print(f"Error fetching Markdown: {e}")
```
##### C.1.3. Example: Getting Markdown using the `BM25` filter type with a specific query.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {
"url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
"f": "BM25",
"q": "What are the key features of Python?" # Query for BM25 filtering
}
try:
response = requests.post(f"{BASE_URL}/md", json=payload, headers=headers)
response.raise_for_status()
data = response.json()
print(f"Markdown (BM25 filter, query='{payload['q']}' - first 300 chars):")
print(data.get("markdown", "")[:300] + "...")
except requests.exceptions.RequestException as e:
print(f"Error fetching Markdown: {e}")
```
##### C.1.4. Example: Getting Markdown using the `LLM` filter type with a query (conceptual, requires LLM setup).
This requires an LLM provider (like OpenAI) to be configured in `config.yml` or via environment variables loaded by the server.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {
"url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
"f": "LLM",
"q": "Summarize the history of Python" # Query for LLM to focus on
}
print("Attempting LLM-filtered Markdown (this may take a moment and requires LLM config)...")
try:
# LLM requests can take longer
response = requests.post(f"{BASE_URL}/md", json=payload, headers=headers, timeout=120)
response.raise_for_status()
data = response.json()
print(f"Markdown (LLM filter, query='{payload['q']}' - first 300 chars):")
print(data.get("markdown", "")[:300] + "...")
except requests.exceptions.RequestException as e:
print(f"Error fetching LLM-filtered Markdown: {e}")
print("Ensure your LLM provider (e.g., OPENAI_API_KEY) is configured for the server.")
```
##### C.1.5. Example: Demonstrating cache usage with the `/md` endpoint (`c` parameter).
The `c` parameter controls caching: `"1"` forces a fresh fetch and writes the result to the cache, while `"0"` reads from the cache when an entry is available (fetching only on a miss, without refreshing an existing entry). Other values are used for revision control and are not shown here.
```python
import requests
import os
import json
import time
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
test_url = "https://example.com"
# First call: cache miss, should fetch and write to cache
print("First call (c='1': force a fresh fetch and write to cache)")
payload1 = {"url": test_url, "f": "RAW", "c": "1"} # c="1" forces refresh and writes
start_time = time.time()
response1 = requests.post(f"{BASE_URL}/md", json=payload1, headers=headers)
duration1 = time.time() - start_time
response1.raise_for_status()
print(f"First call duration: {duration1:.2f}s. Markdown length: {len(response1.json().get('markdown', ''))}")
# Second call: should be a cache hit if c="0" or c is omitted and cache is fresh
print("\nSecond call (c='0': read from cache if an entry is available)")
payload2 = {"url": test_url, "f": "RAW", "c": "0"} # c="0" attempts to read from cache
start_time = time.time()
response2 = requests.post(f"{BASE_URL}/md", json=payload2, headers=headers)
duration2 = time.time() - start_time
response2.raise_for_status()
print(f"Second call duration: {duration2:.2f}s. Markdown length: {len(response2.json().get('markdown', ''))}")
if duration2 < duration1 / 2 and duration1 > 0.1 : # Heuristic for cache hit
print("Second call was significantly faster, likely a cache hit.")
else:
print("Cache behavior inconclusive or first call was very fast.")
```
---
#### C.2. `/html` (Preprocessed HTML)
##### C.2.1. Example: Fetching preprocessed HTML for a URL suitable for schema extraction.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {"url": "https://example.com"}
try:
response = requests.post(f"{BASE_URL}/html", json=payload, headers=headers)
response.raise_for_status()
data = response.json()
print("Preprocessed HTML (first 500 chars):")
print(data.get("html", "")[:500] + "...")
print(f"\nOriginal URL: {data.get('url')}")
except requests.exceptions.RequestException as e:
print(f"Error fetching preprocessed HTML: {e}")
```
---
#### C.3. `/screenshot`
##### C.3.1. Example: Generating a PNG screenshot for a URL and receiving base64 data.
```python
import requests
import os
import base64
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {"url": "https://example.com"}
try:
response = requests.post(f"{BASE_URL}/screenshot", json=payload, headers=headers)
response.raise_for_status()
data = response.json()
if data.get("screenshot"):
print("Screenshot received (base64 data).")
# To save the image:
# image_data = base64.b64decode(data["screenshot"])
# with open("example_screenshot.png", "wb") as f:
# f.write(image_data)
# print("Screenshot saved as example_screenshot.png")
else:
print(f"Screenshot generation failed or no data returned: {data}")
except requests.exceptions.RequestException as e:
print(f"Error generating screenshot: {e}")
```
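The commented-out decoding step can be wrapped in a small helper (the function name is mine); the same routine works for the base64 `pdf` field returned by `/pdf`:

```python
import base64

def save_base64_file(b64_data: str, path: str) -> int:
    """Decode a base64 payload (screenshot or PDF) and write it to disk.

    Returns the number of bytes written.
    """
    raw = base64.b64decode(b64_data)
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)

# Usage (sketch):
# if data.get("screenshot"):
#     n = save_base64_file(data["screenshot"], "example_screenshot.png")
#     print(f"Wrote {n} bytes to example_screenshot.png")
```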
##### C.3.2. Example: Generating a screenshot with a custom `screenshot_wait_for` delay.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {
"url": "https://example.com",
"screenshot_wait_for": 3 # Wait 3 seconds after page load
}
try:
response = requests.post(f"{BASE_URL}/screenshot", json=payload, headers=headers)
response.raise_for_status()
data = response.json()
if data.get("screenshot"):
print(f"Screenshot with {payload['screenshot_wait_for']}s delay received.")
else:
print(f"Screenshot generation failed: {data}")
except requests.exceptions.RequestException as e:
print(f"Error generating screenshot with delay: {e}")
```
##### C.3.3. Example: Saving screenshot to server-side path via `output_path`.
**Note:** This requires `output_path` to be a path accessible and writable by the server process. For Docker, this usually means a mounted volume.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
# This path needs to be valid from the server's perspective
# e.g., if running in Docker, it might be a path inside the container
# that is mapped to a host volume.
server_side_path = "/app/screenshots/example_com.png" # Example path
payload = {
"url": "https://example.com",
"output_path": server_side_path
}
try:
response = requests.post(f"{BASE_URL}/screenshot", json=payload, headers=headers)
response.raise_for_status()
data = response.json()
if data.get("success") and data.get("path"):
print(f"Screenshot successfully saved to server path: {data.get('path')}")
print("Note: This file is on the server, not the client machine unless paths are mapped.")
else:
print(f"Failed to save screenshot to server: {data}")
except requests.exceptions.RequestException as e:
print(f"Error saving screenshot to server: {e}")
```
---
#### C.4. `/pdf`
##### C.4.1. Example: Generating a PDF for a URL and receiving base64 data.
```python
import requests
import os
import base64
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {"url": "https://example.com"}
try:
response = requests.post(f"{BASE_URL}/pdf", json=payload, headers=headers)
response.raise_for_status()
data = response.json()
if data.get("pdf"):
print("PDF received (base64 data).")
# To save the PDF:
# pdf_data = base64.b64decode(data["pdf"])
# with open("example_page.pdf", "wb") as f:
# f.write(pdf_data)
# print("PDF saved as example_page.pdf")
else:
print(f"PDF generation failed or no data returned: {data}")
except requests.exceptions.RequestException as e:
print(f"Error generating PDF: {e}")
```
##### C.4.2. Example: Saving PDF to server-side path via `output_path`.
**Note:** Similar to screenshots, `output_path` must be server-accessible.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
server_side_path = "/app/pdfs/example_com.pdf" # Example path
payload = {
"url": "https://example.com",
"output_path": server_side_path
}
try:
response = requests.post(f"{BASE_URL}/pdf", json=payload, headers=headers)
response.raise_for_status()
data = response.json()
if data.get("success") and data.get("path"):
print(f"PDF successfully saved to server path: {data.get('path')}")
else:
print(f"Failed to save PDF to server: {data}")
except requests.exceptions.RequestException as e:
print(f"Error saving PDF to server: {e}")
```
---
#### C.5. `/execute_js`
##### C.5.1. Example: Executing a simple JavaScript snippet (e.g., `return document.title;`) on a page.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {
"url": "https://example.com",
"scripts": ["return document.title;"]
}
try:
response = requests.post(f"{BASE_URL}/execute_js", json=payload, headers=headers)
response.raise_for_status()
data = response.json() # This is the full CrawlResult model as JSON
print("Full CrawlResult from /execute_js:")
# print(json.dumps(data, indent=2, ensure_ascii=False)) # Can be very long
js_results = data.get("js_execution_result")
if js_results and js_results.get("script_0"):
print(f"\nResult of script 0 (document.title): {js_results['script_0']}")
else:
print(f"\nCould not find JS execution result: {js_results}")
except requests.exceptions.RequestException as e:
print(f"Error executing JS: {e}")
```
##### C.5.2. Example: Executing multiple JavaScript snippets sequentially.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {
"url": "https://example.com",
"scripts": [
"return document.title;",
"return document.querySelectorAll('p').length;",
        "(() => { const h1 = document.querySelector('h1'); return h1 ? h1.innerText : 'No H1'; })()"
]
}
try:
response = requests.post(f"{BASE_URL}/execute_js", json=payload, headers=headers)
response.raise_for_status()
data = response.json()
js_results = data.get("js_execution_result")
if js_results:
print("\nResults of JS snippets:")
print(f" Script 0 (Title): {js_results.get('script_0')}")
print(f" Script 1 (Paragraph count): {js_results.get('script_1')}")
print(f" Script 2 (H1 text): {js_results.get('script_2')}")
else:
print(f"\nCould not find JS execution results: {js_results}")
except requests.exceptions.RequestException as e:
print(f"Error executing multiple JS snippets: {e}")
```
##### C.5.3. Example: Demonstrating how the full `CrawlResult` (JSON of model) is returned.
The `/execute_js` endpoint returns the entire `CrawlResult` object, serialized to JSON. This includes HTML, Markdown, links, etc., in addition to the `js_execution_result`.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
payload = {
"url": "https://example.com",
"scripts": ["return window.location.href;"]
}
try:
response = requests.post(f"{BASE_URL}/execute_js", json=payload, headers=headers)
response.raise_for_status()
crawl_result_data = response.json()
print("Demonstrating full CrawlResult structure from /execute_js:")
print(f" URL crawled: {crawl_result_data.get('url')}")
print(f" Success: {crawl_result_data.get('success')}")
print(f" HTML (snippet): {crawl_result_data.get('html', '')[:100]}...")
if isinstance(crawl_result_data.get('markdown'), dict):
print(f" Markdown (snippet): {crawl_result_data['markdown'].get('raw_markdown', '')[:100]}...")
else:
print(f" Markdown (snippet): {str(crawl_result_data.get('markdown', ''))[:100]}...")
js_result = crawl_result_data.get("js_execution_result", {}).get("script_0")
print(f" Result of JS (window.location.href): {js_result}")
except requests.exceptions.RequestException as e:
print(f"Error demonstrating full CrawlResult: {e}")
```
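Since results come back keyed as `script_0`, `script_1`, ..., a small helper (the name is mine) can return them as an ordered list matching the `scripts` array you sent:

```python
def script_results_in_order(js_execution_result) -> list:
    """Return script results as a list ordered by their script_N index."""
    if not js_execution_result:
        return []
    results = []
    i = 0
    # Collect consecutive script_0, script_1, ... keys in order
    while f"script_{i}" in js_execution_result:
        results.append(js_execution_result[f"script_{i}"])
        i += 1
    return results

# Usage (sketch):
# js_results = crawl_result_data.get("js_execution_result") or {}
# for i, value in enumerate(script_results_in_order(js_results)):
#     print(f"Script {i} returned: {value}")
```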
---
### D. Contextual Endpoints
#### D.1. `/ask` (RAG-like Context Retrieval)
The `/ask` endpoint uses local Markdown files (`c4ai-code-context.md` and `c4ai-doc-context.md`, which should be in the same directory as `server.py`) for retrieval.
##### D.1.1. Example: Asking a general question to retrieve "code" context.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
params = {
"context_type": "code",
"query": "How to handle Playwright installation?" # General query
}
try:
response = requests.get(f"{BASE_URL}/ask", params=params, headers=headers)
response.raise_for_status()
data = response.json()
print("Retrieved 'code' context for 'How to handle Playwright installation?':")
if "code_results" in data:
for i, item in enumerate(data["code_results"][:2]): # Show first 2 results
            print(f"\n--- Code Result {i+1} (Score: {item.get('score', 0):.2f}) ---")
print(item.get("text", "")[:300] + "...")
else:
print(json.dumps(data, indent=2))
except requests.exceptions.RequestException as e:
print(f"Error asking for code context: {e}")
```
##### D.1.2. Example: Asking a general question to retrieve "doc" context.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
params = {
"context_type": "doc",
"query": "Explain Crawl4ai API endpoints"
}
try:
response = requests.get(f"{BASE_URL}/ask", params=params, headers=headers)
response.raise_for_status()
data = response.json()
print("Retrieved 'doc' context for 'Explain Crawl4ai API endpoints':")
if "doc_results" in data:
for i, item in enumerate(data["doc_results"][:2]):
            print(f"\n--- Doc Result {i+1} (Score: {item.get('score', 0):.2f}) ---")
print(item.get("text", "")[:300] + "...")
else:
print(json.dumps(data, indent=2))
except requests.exceptions.RequestException as e:
print(f"Error asking for doc context: {e}")
```
##### D.1.3. Example: Using the `query` parameter to filter context related to a specific function.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
params = {
"context_type": "all", # Search both code and docs
"query": "AsyncWebCrawler arun method"
}
try:
response = requests.get(f"{BASE_URL}/ask", params=params, headers=headers)
response.raise_for_status()
data = response.json()
print(f"Retrieved 'all' context for query: '{params['query']}'")
if "code_results" in data:
print(f"\nFound {len(data['code_results'])} code results.")
# Optionally print snippets
if "doc_results" in data:
print(f"Found {len(data['doc_results'])} doc results.")
# Optionally print snippets
# print(json.dumps(data, indent=2, ensure_ascii=False)[:1000] + "...")
except requests.exceptions.RequestException as e:
print(f"Error asking with specific query: {e}")
```
##### D.1.4. Example: Adjusting `score_ratio` to change result sensitivity.
A lower `score_ratio` (e.g., 0.1) returns more, but less relevant, results; a higher one (e.g., 0.8) is stricter. The default is 0.5.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
params_strict = {
"context_type": "code",
"query": "Playwright browser installation",
"score_ratio": 0.8 # Higher, more strict
}
params_loose = {
"context_type": "code",
"query": "Playwright browser installation",
"score_ratio": 0.2 # Lower, less strict
}
try:
response_strict = requests.get(f"{BASE_URL}/ask", params=params_strict, headers=headers)
response_strict.raise_for_status()
data_strict = response_strict.json()
print(f"Results with score_ratio=0.8: {len(data_strict.get('code_results', []))}")
response_loose = requests.get(f"{BASE_URL}/ask", params=params_loose, headers=headers)
response_loose.raise_for_status()
data_loose = response_loose.json()
print(f"Results with score_ratio=0.2: {len(data_loose.get('code_results', []))}")
except requests.exceptions.RequestException as e:
print(f"Error adjusting score_ratio: {e}")
```
##### D.1.5. Example: Limiting results with `max_results`.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
params = {
"context_type": "doc",
"query": "crawl4ai features",
"max_results": 3 # Limit to top 3 results
}
try:
response = requests.get(f"{BASE_URL}/ask", params=params, headers=headers)
response.raise_for_status()
data = response.json()
print(f"Retrieved max {params['max_results']} doc_results for 'crawl4ai features':")
if "doc_results" in data:
print(f"Actual results returned: {len(data['doc_results'])}")
for item in data["doc_results"]:
print(f" - Score: {item.get('score', 0):.2f}, Text (snippet): {item.get('text', '')[:50]}...")
else:
print("No doc_results found.")
except requests.exceptions.RequestException as e:
print(f"Error limiting results: {e}")
```
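Regardless of the server-side `score_ratio` and `max_results`, you can post-filter results on the client. A sketch using the `score`/`text` item shape shown above (the helper name is mine):

```python
def top_results(results, min_score=0.0, k=None):
    """Sort /ask result items by score (descending), drop items below
    min_score, and optionally cap the list at k items."""
    kept = [r for r in results if r.get("score", 0) >= min_score]
    kept.sort(key=lambda r: r.get("score", 0), reverse=True)
    return kept[:k] if k is not None else kept

# Usage (sketch):
# best = top_results(data.get("doc_results", []), min_score=0.6, k=3)
# for item in best:
#     print(f"{item['score']:.2f} {item['text'][:60]}")
```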
---
### E. Server & Configuration Information
#### E.1. `/config/dump`
##### E.1.1. Example: Dumping a `CrawlerRunConfig` Python object representation to its JSON equivalent via the API.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
# This is a Python-style string representation of a CrawlerRunConfig
# that the server's _safe_eval_config can parse.
config_string = "CrawlerRunConfig(word_count_threshold=50, screenshot=True, cache_mode=CacheMode.BYPASS)"
payload = {"code": config_string}
try:
response = requests.post(f"{BASE_URL}/config/dump", json=payload, headers=headers)
response.raise_for_status()
dumped_json = response.json()
print("Dumped CrawlerRunConfig JSON:")
print(json.dumps(dumped_json, indent=2))
except requests.exceptions.RequestException as e:
print(f"Error dumping CrawlerRunConfig: {e}")
```
##### E.1.2. Example: Dumping a `BrowserConfig` Python object representation to its JSON equivalent via the API.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
config_string = "BrowserConfig(headless=False, user_agent='MyTestAgent/1.0')"
payload = {"code": config_string}
try:
response = requests.post(f"{BASE_URL}/config/dump", json=payload, headers=headers)
response.raise_for_status()
dumped_json = response.json()
print("Dumped BrowserConfig JSON:")
print(json.dumps(dumped_json, indent=2))
except requests.exceptions.RequestException as e:
print(f"Error dumping BrowserConfig: {e}")
```
##### E.1.3. Example: Attempting to dump an invalid or non-serializable configuration string.
```python
import requests
import os
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
# Invalid: not a recognized Crawl4AI config class
invalid_config_string = "MyCustomClass(param=1)"
payload = {"code": invalid_config_string}
try:
response = requests.post(f"{BASE_URL}/config/dump", json=payload, headers=headers)
if response.status_code == 400:
print(f"Correctly failed to dump invalid config string. Server response: {response.json()}")
else:
print(f"Unexpected response for invalid config: {response.status_code} - {response.text}")
except requests.exceptions.RequestException as e:
print(f"Error attempting to dump invalid config: {e}")
# Invalid: nested function call (security restriction)
unsafe_config_string = "CrawlerRunConfig(word_count_threshold=__import__('os').system('echo unsafe'))"
payload_unsafe = {"code": unsafe_config_string}
try:
response_unsafe = requests.post(f"{BASE_URL}/config/dump", json=payload_unsafe, headers=headers)
if response_unsafe.status_code == 400:
print(f"Correctly failed to dump unsafe config string. Server response: {response_unsafe.json()}")
else:
print(f"Unexpected response for unsafe config: {response_unsafe.status_code} - {response_unsafe.text}")
except requests.exceptions.RequestException as e:
print(f"Error attempting to dump unsafe config: {e}")
```
---
#### E.2. `/schema`
##### E.2.1. Example: Fetching the default JSON schemas for `BrowserConfig` and `CrawlerRunConfig`.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
try:
response = requests.get(f"{BASE_URL}/schema", headers=headers)
response.raise_for_status()
schemas = response.json()
print("BrowserConfig Schema (sample):")
# print(json.dumps(schemas.get("browser"), indent=2)) # Full schema can be long
if "browser" in schemas and "properties" in schemas["browser"]:
print(f" BrowserConfig has {len(schemas['browser']['properties'])} properties.")
print(f" Example property 'headless': {schemas['browser']['properties'].get('headless')}")
print("\nCrawlerRunConfig Schema (sample):")
# print(json.dumps(schemas.get("crawler"), indent=2))
if "crawler" in schemas and "properties" in schemas["crawler"]:
print(f" CrawlerRunConfig has {len(schemas['crawler']['properties'])} properties.")
print(f" Example property 'word_count_threshold': {schemas['crawler']['properties'].get('word_count_threshold')}")
except requests.exceptions.RequestException as e:
print(f"Error fetching schemas: {e}")
```
---
#### E.3. `/health` & `/metrics`
##### E.3.1. Example: Python script to programmatically check the `/health` endpoint.
(Similar to example 2.4.1, but reiterated here for completeness of this section)
```python
import requests
import os
import json
from datetime import datetime
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
try:
response = requests.get(f"{BASE_URL}/health", headers=headers)
response.raise_for_status()
health_data = response.json()
print("Health Check:")
print(f" Status: {health_data.get('status')}")
print(f" Version: {health_data.get('version')}")
ts = health_data.get('timestamp')
if ts:
print(f" Timestamp: {ts} (UTC: {datetime.utcfromtimestamp(ts).isoformat()})")
except requests.exceptions.RequestException as e:
print(f"Error checking health: {e}")
```
##### E.3.2. Example: Accessing Prometheus metrics at `/metrics` (assuming Prometheus is enabled in `config.yml`).
This typically involves pointing a Prometheus scraper at the endpoint or manually fetching.
```python
import requests
import os
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
# Prometheus metrics are usually at /metrics, but the server.py config uses
# config["observability"]["prometheus"]["endpoint"] which defaults to "/metrics"
METRICS_ENDPOINT = "/metrics" # As per default config.yml
headers = get_headers()
try:
    # There is no dedicated endpoint that reports whether Prometheus is enabled;
    # a 404 from the metrics endpoint below simply means it is disabled in config.yml.
print(f"Attempting to fetch metrics from {BASE_URL}{METRICS_ENDPOINT}")
response = requests.get(f"{BASE_URL}{METRICS_ENDPOINT}", headers=headers)
if response.status_code == 200:
print("Prometheus metrics response (first 500 chars):")
print(response.text[:500] + "...")
elif response.status_code == 404:
print(f"Metrics endpoint {METRICS_ENDPOINT} not found. Ensure Prometheus is enabled in config.yml.")
else:
print(f"Error fetching metrics: {response.status_code} - {response.text}")
except requests.exceptions.RequestException as e:
print(f"Error connecting to metrics endpoint: {e}")
```
**Note:** For this to work, `observability.prometheus.enabled` must be `true` in the server's `config.yml`.
---
## IV. Configuring the Deployment (via `config.yml`)
### 4.1. Note: These examples primarily show snippets of `config.yml` and describe their effect, rather than Python code to modify the live configuration.
The `config.yml` file is read by the server on startup. Changes typically require a server restart.
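One common way to apply a custom `config.yml` with the official Docker image is to mount it over the default and restart the container so the server re-reads it. A sketch (the image tag and in-container path are assumptions; check your deployment):

```bash
# Mount a local config.yml over the server's default
docker run -d --name crawl4ai \
  -p 11235:11235 \
  -v "$(pwd)/config.yml:/app/config.yml" \
  unclecode/crawl4ai:latest

# After editing config.yml on the host, restart so the server picks it up
docker restart crawl4ai
```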
### 4.2. Rate Limiting Configuration
#### 4.2.1. Example `config.yml` snippet: Enabling rate limiting with a custom limit (e.g., "10/second").
```yaml
# In your config.yml
rate_limiting:
enabled: true
default_limit: "10/second" # Allows 10 requests per second per client IP
# trusted_proxies: ["127.0.0.1"] # If behind a reverse proxy
```
#### 4.2.2. Example `config.yml` snippet: Using Redis as a storage backend for rate limiting.
This is recommended for production if you have multiple server instances.
```yaml
# In your config.yml
rate_limiting:
enabled: true
default_limit: "1000/minute"
storage_uri: "redis://localhost:6379" # Or your Redis server URI
# Ensure your Redis server is running and accessible
```
---
### 4.3. Security Settings Configuration
#### 4.3.1. Example `config.yml` snippet: Enabling JWT authentication.
```yaml
# In your config.yml
security:
enabled: true
jwt_enabled: true
# jwt_secret_key: "YOUR_VERY_SECRET_KEY" # Auto-generated if not set
# jwt_algorithm: "HS256"
# jwt_access_token_expire_minutes: 30
# jwt_allowed_email_domains: ["example.com", "another.org"] # Optional: Restrict token issuance
```
**Note:** Enabling `jwt_enabled` means endpoints decorated with the token dependency will require authentication.
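With `jwt_enabled: true`, clients must first obtain a token. The default server setup exposes a `/token` endpoint that accepts an email address (verify the exact route and payload against your server version); a sketch with hypothetical helper names:

```python
import os

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")

def auth_headers(token: str) -> dict:
    """Build the Authorization header expected by protected endpoints."""
    return {"Authorization": f"Bearer {token}"}

def fetch_token(email: str) -> str:
    """Request a JWT from the server's /token endpoint (assumed default route)."""
    import requests  # deferred so the helpers have no hard top-level dependency
    resp = requests.post(f"{BASE_URL}/token", json={"email": email})
    resp.raise_for_status()
    return resp.json()["access_token"]

# Usage (sketch, requires a running server with JWT enabled):
# token = fetch_token("user@example.com")
# requests.get(f"{BASE_URL}/health", headers=auth_headers(token))
```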
#### 4.3.2. Example `config.yml` snippet: Enabling HTTPS redirect.
This is useful if your server is behind a reverse proxy that handles TLS termination.
```yaml
# In your config.yml
security:
enabled: true
https_redirect: true # Adds middleware to redirect HTTP to HTTPS
```
#### 4.3.3. Example `config.yml` snippet: Setting custom trusted hosts.
Restricts which `Host` headers are accepted. Use `["*"]` to allow all (less secure).
```yaml
# In your config.yml
security:
enabled: true
trusted_hosts: ["api.example.com", "localhost", "127.0.0.1"]
```
#### 4.3.4. Example `config.yml` snippet: Configuring custom HTTP security headers (CSP, X-Frame-Options).
```yaml
# In your config.yml
security:
enabled: true
headers:
x_content_type_options: "nosniff"
x_frame_options: "DENY"
content_security_policy: "default-src 'self'; script-src 'self' 'unsafe-inline'; object-src 'none';"
strict_transport_security: "max-age=31536000; includeSubDomains"
```
---
### 4.4. LLM Provider Configuration
#### 4.4.1. Example `config.yml` snippet: Setting the default LLM provider and API key env variable.
```yaml
# In your config.yml
llm:
provider: "openai/gpt-4o-mini" # Default provider/model
api_key_env: "OPENAI_API_KEY" # Environment variable to read the API key from
```
The server will then expect the `OPENAI_API_KEY` environment variable to be set.
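When the server runs in Docker, that environment variable must be passed into the container, e.g. via `-e` or an env file (the image tag is an assumption):

```bash
# Pass the key from the host environment into the container
docker run -d -p 11235:11235 \
  -e OPENAI_API_KEY="$OPENAI_API_KEY" \
  unclecode/crawl4ai:latest

# Or keep keys in a local env file and load it:
# docker run -d -p 11235:11235 --env-file .llm.env unclecode/crawl4ai:latest
```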
#### 4.4.2. Example `config.yml` snippet: Overriding the API key directly in the config (for testing/specific cases).
**Warning:** Not recommended for production due to security risks of hardcoding keys.
```yaml
# In your config.yml
llm:
provider: "openai/gpt-3.5-turbo"
api_key: "sk-this_is_a_test_key_do_not_use_in_prod" # Key directly in config
```
#### 4.4.3. Example `config.yml` snippet: Configuring for a different LiteLLM-supported provider (e.g., Groq).
```yaml
# In your config.yml
llm:
provider: "groq/llama3-8b-8192"
api_key_env: "GROQ_API_KEY" # Server will look for this env var
```
---
### 4.5. Default Crawler Settings
These settings in `config.yml` under the `crawler` key affect the default behavior if not overridden by specific `BrowserConfig` or `CrawlerRunConfig` in API requests.
#### 4.5.1. Example `config.yml` snippet: Modifying `crawler.base_config.simulate_user`.
```yaml
# In your config.yml
crawler:
base_config:
simulate_user: true # Enable user simulation features by default
```
#### 4.5.2. Example `config.yml` snippet: Adjusting `crawler.memory_threshold_percent`.
This is for the `MemoryAdaptiveDispatcher`.
```yaml
# In your config.yml
crawler:
memory_threshold_percent: 85.0 # Pause new tasks if system memory usage exceeds 85%
```
#### 4.5.3. Example `config.yml` snippet: Configuring default `crawler.rate_limiter` parameters.
```yaml
# In your config.yml
crawler:
rate_limiter:
enabled: true
base_delay: [0.5, 1.5] # Default delay between 0.5 and 1.5 seconds
```
#### 4.5.4. Example `config.yml` snippet: Adding default browser arguments to `crawler.browser.extra_args`.
```yaml
# In your config.yml
crawler:
browser:
# Default kwargs for BrowserConfig
# headless: true
# text_mode: false # etc.
extra_args:
- "--disable-gpu" # Already default, but shown for example
- "--window-size=1920,1080"
# Add other chromium flags as needed
```
#### 4.5.5. Example `config.yml` snippet: Changing `crawler.pool.max_pages` (global semaphore).
This controls the maximum number of concurrent browser pages globally for the server.
```yaml
# In your config.yml
crawler:
pool:
max_pages: 20 # Allow up to 20 concurrent browser pages
```
#### 4.5.6. Example `config.yml` snippet: Changing `crawler.pool.idle_ttl_sec` (janitor GC timeout).
This controls how long an idle browser instance in the pool will live before being closed.
```yaml
# In your config.yml
crawler:
pool:
idle_ttl_sec: 600 # Close idle browsers after 10 minutes (default is 30 min)
```
---
## V. Model Context Protocol (MCP) Bridge Integration
### 5.1. Overview of MCP and its purpose with Crawl4ai.
The Model Context Protocol (MCP) bridge allows AI tools and agents (like Claude Code, potentially others in the future) to interact with Crawl4ai's capabilities as "tools." Crawl4ai endpoints decorated with `@mcp_tool` become callable functions for these AI agents. This enables AIs to leverage web crawling and data extraction within their reasoning and task execution processes.
### 5.2. Accessing MCP Endpoints
#### 5.2.1. Example: Conceptual connection to the MCP WebSocket endpoint (`/mcp/ws`).
Connecting to `/mcp/ws` would typically be done by an MCP-compatible client library.
```python
# This is a conceptual Python example using a hypothetical MCP client library
# For actual MCP client usage, refer to the specific MCP tool's documentation.
# from mcp_client_library import MCPClient  # Hypothetical library
#
# async def connect_mcp_ws():
#     mcp_url = f"{BASE_URL.replace('http', 'ws')}/mcp/ws"
#     async with MCPClient(mcp_url) as client:
#         print(f"Connected to MCP WebSocket at {mcp_url}")
#         # ... send/receive MCP messages ...
#         # e.g., await client.list_tools()
#         # e.g., await client.call_tool(tool_name="crawl", arguments={"urls": ["https://example.com"]})
#
# if __name__ == "__main__":
#     asyncio.run(connect_mcp_ws())  # Uncomment if you have a client library
print("MCP WebSocket conceptual connection. Real client library needed.")
```
#### 5.2.2. Example: Conceptual connection to the MCP SSE endpoint (`/mcp/sse`).
Server-Sent Events (SSE) is another transport for MCP.
```python
# Similar to WebSocket, an MCP-compatible SSE client would be used.
# from sseclient import SSEClient  # A possible library for SSE
#
# def connect_mcp_sse():
#     mcp_sse_url = f"{BASE_URL}/mcp/sse"
#     print(f"Attempting to connect to MCP SSE at {mcp_sse_url} (conceptual)")
#     try:
#         messages = SSEClient(mcp_sse_url)  # Synchronous; an async client would be preferable
#         for msg in messages:
#             print(f"MCP SSE Message: {msg.data}")
#             if should_stop(msg):  # e.g., stop after the init message
#                 break
#     except Exception as e:
#         print(f"Error with MCP SSE: {e}")
#
# if __name__ == "__main__":
#     connect_mcp_sse()  # Uncomment if you have a client library
print("MCP SSE conceptual connection. Real client library needed.")
```
#### 5.2.3. Example: Fetching the MCP schema from `/mcp/schema` using `requests`.
This endpoint provides information about available MCP tools and resources.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
try:
response = requests.get(f"{BASE_URL}/mcp/schema", headers=headers)
response.raise_for_status()
mcp_schema = response.json()
print("MCP Schema:")
# print(json.dumps(mcp_schema, indent=2)) # Can be verbose
if "tools" in mcp_schema:
print(f"\nAvailable MCP Tools ({len(mcp_schema['tools'])}):")
for tool in mcp_schema["tools"][:3]: # Show first 3 tools
print(f" - Name: {tool.get('name')}, Description: {tool.get('description', '')[:50]}...")
if "resources" in mcp_schema:
print(f"\nAvailable MCP Resources ({len(mcp_schema['resources'])}):")
for resource in mcp_schema["resources"][:3]: # Show first 3 resources
print(f" - Name: {resource.get('name')}, Description: {resource.get('description', '')[:50]}...")
except requests.exceptions.RequestException as e:
print(f"Error fetching MCP schema: {e}")
```
### 5.3. Understanding MCP Tool Exposure
#### 5.3.1. Explanation: How endpoints decorated with `@mcp_tool` become available through the MCP bridge.
In `server.py`, FastAPI endpoints decorated with `@mcp_tool("tool_name")` are automatically registered with the MCP bridge. The MCP bridge then exposes these tools (like `/crawl`, `/md`, `/screenshot`, etc.) to connected MCP clients (e.g., AI agents). The arguments of the FastAPI endpoint function become the expected arguments for the MCP tool call.
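The registration pattern can be illustrated with a stand-in decorator. The real `@mcp_tool` lives in `server.py`; the registry and function names below are ours, purely for illustration of the idea.

```python
# Illustrative stand-in for the @mcp_tool registration pattern (not the real
# decorator): decorating an endpoint records it in a registry keyed by tool
# name, which the MCP bridge then exposes to connected clients.
MCP_TOOL_REGISTRY = {}

def mcp_tool(name: str):
    def decorator(func):
        MCP_TOOL_REGISTRY[name] = func
        return func
    return decorator

@mcp_tool("md")
def get_markdown(body: dict) -> dict:
    # In the real server this is the FastAPI /md endpoint; its parameters
    # become the MCP tool's expected arguments.
    return {"url": body["url"], "markdown": "...generated markdown..."}

print(sorted(MCP_TOOL_REGISTRY))  # → ['md']
print(MCP_TOOL_REGISTRY["md"]({"url": "https://example.com"})["url"])  # → https://example.com
```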
#### 5.3.2. Example: Invoking a Crawl4ai tool (e.g., `/md`) through a simulated MCP client request structure (if simple enough to demonstrate with `requests`).
This is a conceptual illustration. A real MCP client would handle the JSON-RPC formatting for calls via WebSocket or SSE. The `/mcp/messages` endpoint is used by the SSE client to POST messages.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
# MCP calls over SSE are POSTed to a client-specific endpoint, which requires a
# client_id established during the SSE handshake. The payload below is therefore
# a simplified, conceptual illustration of the JSON-RPC framing; a proper MCP
# client handles session IDs and message framing for you.
mcp_tool_call_payload = {
"jsonrpc": "2.0",
"method": "call_tool",
"params": {
"name": "md", # The tool name, matches @mcp_tool("md")
"arguments": { # These map to the FastAPI endpoint's Pydantic model or parameters
"body": { # Matches the 'body: MarkdownRequest' in the get_markdown endpoint
"url": "https://example.com",
"f": "RAW"
}
}
},
"id": "some_unique_request_id"
}
# The SSE transport uses a client-specific POST endpoint, e.g., /mcp/messages/<client_id>
# This example cannot fully replicate that without a client_id.
# We'll try to hit a hypothetical endpoint or illustrate the payload.
print("Conceptual MCP tool call payload (actual call needs proper client/transport):")
print(json.dumps(mcp_tool_call_payload, indent=2))
# If the server exposed a direct POST endpoint for tools (NOT how MCP works over SSE/WS):
# try:
#     response = requests.post(f"{BASE_URL}/mcp/call_tool_directly", json=mcp_tool_call_payload, headers=get_headers())
#     response.raise_for_status()
#     tool_result = response.json()
#     print("\nResult from conceptual direct tool call:")
#     print(json.dumps(tool_result, indent=2))
# except requests.exceptions.RequestException as e:
#     print(f"Error in conceptual direct tool call: {e}")
```
---
## VI. Advanced Scenarios & Client-Side Best Practices
### 6.1. Chaining API Calls for Complex Workflows
#### 6.1.1. Example: Fetch preprocessed HTML using `/html`, then use this HTML as input to a local `crawl4ai` instance or another tool (conceptual).
```python
import requests
import os
import json
import asyncio
# Assuming crawl4ai is also installed as a library for local processing
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
async def chained_workflow():
target_url = "https://example.com/article"
# Step 1: Fetch preprocessed HTML from the API
print(f"Step 1: Fetching preprocessed HTML for {target_url} via API...")
html_payload = {"url": target_url}
preprocessed_html = None
try:
response = requests.post(f"{BASE_URL}/html", json=html_payload, headers=headers)
response.raise_for_status()
data = response.json()
preprocessed_html = data.get("html")
if preprocessed_html:
print(f"Successfully fetched preprocessed HTML (length: {len(preprocessed_html)}).")
else:
print("Failed to get preprocessed HTML from API.")
return
except requests.exceptions.RequestException as e:
print(f"Error fetching preprocessed HTML: {e}")
return
# Step 2: Use this HTML with a local Crawl4AI instance for further processing
# (e.g., applying a very specific local Markdown generator or extraction)
if preprocessed_html:
print("\nStep 2: Processing fetched HTML with a local Crawl4AI instance...")
# Example: Generate Markdown using a specific local configuration
custom_md_generator = DefaultMarkdownGenerator(
# content_source="raw_html" because we are feeding it raw HTML
content_source="raw_html",
options={"body_width": 0} # No line wrapping
)
local_run_config = CrawlerRunConfig(markdown_generator=custom_md_generator)
async with AsyncWebCrawler() as local_crawler:
# Use "raw:" prefix to tell the local crawler this is direct HTML content
result = await local_crawler.arun(url=f"raw:{preprocessed_html}", config=local_run_config)
if result.success and result.markdown:
print("Markdown generated locally from API-fetched HTML (first 300 chars):")
print(result.markdown.raw_markdown[:300] + "...")
else:
print(f"Local processing failed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(chained_workflow())
```
---
### 6.2. API Error Handling
#### 6.2.1. Example: Python script showing robust error handling for common HTTP status codes (400, 401, 403, 404, 422, 500) when calling Crawl4ai API.
```python
import requests
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
# Use a token known to be invalid or expired if testing 401/403 with auth enabled
# invalid_headers = {"Authorization": "Bearer invalidtoken123"}
# For this example, we'll use the standard get_headers()
headers = get_headers()
def make_api_call(endpoint, method="GET", payload=None):
url = f"{BASE_URL}/{endpoint.lstrip('/')}"
try:
if method.upper() == "GET":
response = requests.get(url, params=payload, headers=headers, timeout=10)
elif method.upper() == "POST":
response = requests.post(url, json=payload, headers=headers, timeout=10)
else:
print(f"Unsupported method: {method}")
return
print(f"\n--- Testing {method} {url} with payload {payload} ---")
print(f"Status Code: {response.status_code}")
if response.ok: # status_code < 400
print("Response JSON:")
try:
print(json.dumps(response.json(), indent=2, ensure_ascii=False)[:500] + "...")
except json.JSONDecodeError:
print("Response is not valid JSON.")
print(f"Response Text (snippet): {response.text[:200]}...")
else:
print(f"Error Response Text: {response.text}")
# Specific error handling based on status code
if response.status_code == 400:
print("Handling Bad Request (400)... Possible malformed payload.")
elif response.status_code == 401:
print("Handling Unauthorized (401)... API token might be missing or invalid.")
elif response.status_code == 403:
print("Handling Forbidden (403)... API token might lack permissions or IP restricted.")
elif response.status_code == 404:
print("Handling Not Found (404)... Endpoint or resource does not exist.")
elif response.status_code == 422:
print("Handling Unprocessable Entity (422)... Validation error with request data.")
print(f"Details: {response.json().get('detail')}")
elif response.status_code >= 500:
print("Handling Server Error (5xx)... Problem on the server side.")
except requests.exceptions.Timeout:
print(f"Request to {url} timed out.")
except requests.exceptions.ConnectionError:
print(f"Could not connect to {url}. Is the server running?")
except requests.exceptions.RequestException as e:
print(f"An unexpected request error occurred for {url}: {e}")
if __name__ == "__main__":
# Test a valid endpoint
make_api_call("/health")
# Test a non-existent endpoint (expected 404)
make_api_call("/nonexistent_endpoint")
# Test /md with missing URL (expected 422)
make_api_call("/md", method="POST", payload={"f": "RAW"})
# Test /token with invalid payload (expected 422 if email is missing)
make_api_call("/token", method="POST", payload={"not_email": "test"})
# If JWT is enabled, an unauthenticated call to a protected endpoint would give 401/403.
# For this example, assume /admin is a hypothetical protected endpoint.
# print("\nAttempting access to hypothetical protected /admin endpoint...")
# make_api_call("/admin")  # make_api_call would need a no-auth variant to trigger 401
```
---
### 6.3. Client-Side Script for Long-Running Jobs
#### 6.3.1. Example: A Python client that submits a job to `/crawl`, polls `/task/{task_id}` with backoff, and retrieves results.
This is a more robust version of the polling mechanism shown earlier.
```python
import requests
import time
import os
import json
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
def submit_job_and_wait_with_backoff(payload, max_poll_time=300, initial_poll_interval=2, max_poll_interval=30, backoff_factor=1.5):
try:
# 1. Submit Job
submit_response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
submit_response.raise_for_status()
task_id = submit_response.json().get("task_id")
if not task_id:
print("Failed to get task_id from submission.")
return None
print(f"Job submitted. Task ID: {task_id}. Polling with backoff...")
# 2. Poll with Exponential Backoff
poll_interval = initial_poll_interval
start_time = time.time()
while time.time() - start_time < max_poll_time:
status_response = requests.get(f"{BASE_URL}/task/{task_id}", headers=headers)
status_response.raise_for_status()
status_data = status_response.json()
current_status = status_data.get("status")
print(f" Task {task_id} status: {current_status} (next poll in {poll_interval:.1f}s)")
if current_status == "COMPLETED":
print(f"Task {task_id} COMPLETED.")
return status_data.get("result")
elif current_status == "FAILED":
print(f"Task {task_id} FAILED. Error: {status_data.get('error')}")
return None
time.sleep(poll_interval)
poll_interval = min(poll_interval * backoff_factor, max_poll_interval)
print(f"Task {task_id} polling timed out after {max_poll_time} seconds.")
return None
except requests.exceptions.RequestException as e:
print(f"API Error: {e}")
return None
except Exception as e:
print(f"Unexpected error: {e}")
return None
if __name__ == "__main__":
# Example of a potentially longer job (crawling a site known for being slow or large)
long_job_payload = {
"urls": ["https://archive.org/web/"], # A site that might take a bit longer
"crawler_config": {"word_count_threshold": 500} # Higher threshold
}
print("\n--- Testing Long-Running Job Client ---")
job_result = submit_job_and_wait_with_backoff(long_job_payload, max_poll_time=120) # 2 min timeout
if job_result and "results" in job_result:
for i, res_item in enumerate(job_result["results"]):
print(f"\nResult for {res_item.get('url')}:")
print(f" Success: {res_item.get('success')}")
if res_item.get('success'):
md_length = len(res_item.get('markdown', {}).get('raw_markdown', ''))
print(f" Markdown Length: {md_length}")
elif job_result:
print(f"\nReceived result data (unexpected structure):")
print(json.dumps(job_result, indent=2, ensure_ascii=False))
else:
print("\nJob did not complete successfully or timed out.")
```
---
### 6.4. Batching Requests to `/crawl/stream` vs. `/crawl`
#### 6.4.1. Discussion: When to use streaming for many URLs vs. submitting a single job with multiple URLs.
* **`/crawl` (Job-based, polling):**
* **Pros:**
* Better for very large numbers of URLs where you don't need immediate feedback for each.
* Robust to client disconnections (job continues on server).
* Redis queue handles load and persistence of jobs.
* Server manages concurrency and resources more globally.
* **Cons:**
* Requires a polling mechanism on the client side.
* Results only become available once the whole job completes; even though the server may process the URLs of a multi-URL job concurrently, you cannot act on individual results before the final aggregation.
* **Use when:** You have hundreds or thousands of URLs, can tolerate some delay for results, and need a fire-and-forget submission style.
* **`/crawl/stream` (Streaming):**
* **Pros:**
* Real-time feedback: results for each URL are streamed back as soon as they are processed.
* Simpler client logic if immediate processing of individual results is needed.
* Good for interactive applications or dashboards.
* **Cons:**
* Client must maintain an open connection. If it drops, the stream is lost.
* Can be less efficient for very large numbers of URLs: although the server-side `handle_stream_crawl_request` processes URLs concurrently (up to server limits), a slow client consuming the stream becomes the bottleneck.
* The client needs to handle NDJSON parsing.
* **Use when:** You need results for URLs as they come in, are processing a moderate number of URLs, or building an interactive tool.
**General Guideline:**
* For a few to a few dozen URLs where you want results quickly and can process them one-by-one: `/crawl/stream`.
* For hundreds or thousands of URLs, or when you prefer to submit a batch and check back later: `/crawl` with polling.
* If using `/crawl/stream` for many URLs, ensure your client-side processing of each streamed result is fast to avoid becoming a bottleneck. The server-side uses an `AsyncGenerator` which processes URLs concurrently up to its internal limits, so the client should be ready to consume these results efficiently.
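Whichever endpoint you pick, a `/crawl/stream` client must parse NDJSON (one JSON object per line). Below is a minimal, self-contained parsing helper; the sample result fields are illustrative, and with `httpx` you would feed it `response.aiter_lines()` instead of a list.

```python
import json
from typing import Iterator

def iter_ndjson(lines) -> Iterator[dict]:
    """Yield one parsed result per non-empty NDJSON line, skipping blanks."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Simulated stream body; a real client iterates the HTTP response line by line.
sample_stream = [
    '{"url": "https://example.com", "success": true}',
    '',
    '{"url": "https://example.org", "success": false}',
]

for result in iter_ndjson(sample_stream):
    print(result["url"], result["success"])
# → https://example.com True
# → https://example.org False
```

Keeping the per-result handling inside the loop this cheap is exactly what lets the client keep up with the server's concurrent stream.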
```