# Examples for `crawl4ai` - Deployment Component

**Target Document Type:** Examples Collection
**Target Output Filename Suggestion:** `llm_examples_deployment.md`
**Library Version Context:** 0.5.1-d1
**Outline Generation Date:** 2025-05-24

---

This document provides runnable code examples showcasing the diverse usage patterns and configurations of the `crawl4ai` deployment component. The examples primarily focus on interacting with the API provided by a deployed Crawl4ai instance.

## I. Introduction to Crawl4ai Deployment Examples

### 1.1. Overview of the API and common interaction patterns (e.g., using `requests` library).

The Crawl4ai deployment exposes a FastAPI backend. Most examples use the `requests` library for synchronous calls and `httpx` for asynchronous calls to interact with these API endpoints. The base URL for a local deployment is typically `http://localhost:11235`.

```python
import requests
import httpx  # For async examples
import asyncio
import json
import time
import os
import base64

# Assume the Crawl4ai API is running locally
BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
API_TOKEN = os.environ.get("CRAWL4AI_API_TOKEN")  # Set if your API requires auth

def get_headers():
    if API_TOKEN:
        return {"Authorization": f"Bearer {API_TOKEN}"}
    return {}

print(f"Crawl4AI API Base URL: {BASE_URL}")
if API_TOKEN:
    print("API Token will be used for authenticated requests.")
else:
    print("No API Token found in env; assuming API does not require authentication for these examples.")

# A simple synchronous GET request
try:
    response = requests.get(f"{BASE_URL}/health")
    response.raise_for_status()  # Raises an HTTPError for bad responses (4XX or 5XX)
    print(f"Health check successful: {response.json()}")
except requests.exceptions.RequestException as e:
    print(f"Error connecting to Crawl4AI API: {e}")
    print("Please ensure the Crawl4AI Docker container or server is running.")
```
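The snippet above imports `httpx` for asynchronous use. Since the remaining examples are synchronous, here is a minimal async sketch of the same health check, assuming only the `/health` endpoint shown above:

```python
import asyncio
import os

import httpx

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")

async def check_health_async():
    # Same /health check as above, but using an async client
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{BASE_URL}/health")
        response.raise_for_status()
        print(f"Async health check: {response.json()}")

if __name__ == "__main__":
    asyncio.run(check_health_async())
```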
### 1.2. Note on Authentication: Brief explanation of when and how to use API tokens.

If JWT authentication is enabled in `config.yml` (via `security.jwt_enabled: true`), most API endpoints will require an `Authorization: Bearer <token>` header. You can obtain a token from the `/token` endpoint using a whitelisted email address. The `get_headers()` helper function in the examples will attempt to use `CRAWL4AI_API_TOKEN` if set.

---

## II. Docker and Docker-Compose

### 2.1. Building the Docker Image

#### 2.1.1. Example: Basic `docker build` command.

This command builds the default Docker image from the root of the `crawl4ai` repository.

```bash
# Navigate to the root of the crawl4ai repository
# cd /path/to/crawl4ai
docker build -t crawl4ai:latest .
```

#### 2.1.2. Example: Building with `INSTALL_TYPE=all` build argument.

This installs all optional dependencies, including those for advanced AI/ML features.

```bash
# Navigate to the root of the crawl4ai repository
# cd /path/to/crawl4ai
docker build --build-arg INSTALL_TYPE=all -t crawl4ai:all-features .
```

#### 2.1.3. Example: Building with `ENABLE_GPU=true` build argument (conceptual, as GPU usage is complex).

This attempts to include GPU support (e.g., CUDA toolkits) if the base image and host support it.

```bash
# Navigate to the root of the crawl4ai repository
# cd /path/to/crawl4ai
# Ensure your Docker daemon and host are configured for GPU passthrough
docker build --build-arg ENABLE_GPU=true --build-arg TARGETARCH=amd64 -t crawl4ai:gpu-amd64 .

# For ARM64 with GPU (e.g., NVIDIA Jetson), you might need specific base images or configurations.
# docker build --build-arg ENABLE_GPU=true --build-arg TARGETARCH=arm64 -t crawl4ai:gpu-arm64 .
```

**Note:** Full GPU support in Docker can be complex and depends on your host system, NVIDIA drivers, and Docker version. The `Dockerfile` provides a basic attempt.

---

### 2.2. Running with Docker Compose

#### 2.2.1. Example: Basic `docker-compose up` using the provided `docker-compose.yml`.

This starts the Crawl4ai service as defined in the `docker-compose.yml` file.

```bash
# Navigate to the directory containing docker-compose.yml
# cd /path/to/crawl4ai
docker-compose up -d
```

#### 2.2.2. Example: Overriding image tag in `docker-compose` via environment variable `TAG`.

You can specify a different image tag for the `crawl4ai` service.

```bash
# Example: Using a specific version tag
TAG=0.6.0 docker-compose up -d

# Example: Using a custom built tag
# TAG=my-custom-crawl4ai-build docker-compose up -d
```

#### 2.2.3. Example: Overriding `INSTALL_TYPE` in `docker-compose` via environment variable.

If your `docker-compose.yml` is set up to use build arguments from environment variables, you can override `INSTALL_TYPE`.

```bash
# Assuming docker-compose.yml uses INSTALL_TYPE from env for the build context:
# (The provided docker-compose.yml directly passes it as a build arg)
# If you modify docker-compose.yml to pick up an env var for INSTALL_TYPE:
# INSTALL_TYPE=all docker-compose up -d --build
```

**Note:** The provided `docker-compose.yml` directly sets `INSTALL_TYPE` in the `args` section. To make it environment-variable driven like `TAG`, you would modify the `docker-compose.yml`'s `build.args` section.

---

### 2.3. Configuration via Environment Variables & `.llm.env`

#### 2.3.1. Example: Setting `OPENAI_API_KEY` using an `.llm.env` file.

Create a `.llm.env` file in the same directory as `docker-compose.yml` or where you run the server.

```text
# Contents of .llm.env
OPENAI_API_KEY="sk-your_openai_api_key_here"
```

The `docker-compose.yml` (or server if run directly) will load this file.

#### 2.3.2. Example: Showing how to pass multiple LLM API keys via `.llm.env`.

You can add keys for various supported LLM providers.

```text
# Contents of .llm.env
OPENAI_API_KEY="sk-your_openai_api_key_here"
ANTHROPIC_API_KEY="sk-ant-your_anthropic_api_key_here"
GROQ_API_KEY="gsk_your_groq_api_key_here"
# ...and other keys supported by LiteLLM
```

---

### 2.4. Accessing the Deployed Service

#### 2.4.1. Example: Python script to perform a basic health check (`/health`) on the locally deployed service.

```python
import requests
import os

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")

try:
    response = requests.get(f"{BASE_URL}/health")
    response.raise_for_status()
    data = response.json()
    print(f"Service is healthy. Version: {data.get('version')}, Timestamp: {data.get('timestamp')}")
except requests.exceptions.RequestException as e:
    print(f"Failed to connect or health check failed: {e}")
```
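For scripts that start the container and immediately call the API, a small readiness loop avoids racing the server startup. A minimal sketch using only the `/health` endpoint shown above:

```python
import os
import time

import requests

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")

def wait_until_healthy(timeout_seconds=60, poll_interval=2):
    """Poll /health until the service responds or the timeout expires."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        try:
            response = requests.get(f"{BASE_URL}/health", timeout=5)
            if response.status_code == 200:
                print(f"Service is up: {response.json()}")
                return True
        except requests.exceptions.RequestException:
            pass  # Server not accepting connections yet
        time.sleep(poll_interval)
    return False

if __name__ == "__main__":
    if not wait_until_healthy():
        print("Service did not become healthy in time.")
```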
#### 2.4.2. Example: Accessing the API playground at `/playground`.

Open your web browser and navigate to `http://localhost:11235/playground` (or your deployed URL + `/playground`). This will show the FastAPI interactive API documentation.

---

### 2.5. Understanding Shared Memory

#### 2.5.1. Explanation: Importance of `/dev/shm` for Chromium performance and how it's configured in `docker-compose.yml`.

Chromium-based browsers (like Chrome, Edge) use `/dev/shm` (shared memory) extensively. If the default Docker limit for `/dev/shm` (often 64MB) is too small, browser instances can crash or perform poorly. The `docker-compose.yml` provided with Crawl4ai typically increases this:

```yaml
# Snippet from a typical docker-compose.yml for crawl4ai
# services:
#   crawl4ai:
#     # ... other configurations ...
#     shm_size: '1g'  # Or '2g', depending on expected load
#     # Alternatively, for more flexibility but less security:
#     # volumes:
#     #   - /dev/shm:/dev/shm
```

Setting `shm_size` or mounting `/dev/shm` directly from the host provides more shared memory, preventing common browser crashes within Docker. The `Dockerfile` also sets `ENV DEBIAN_FRONTEND=noninteractive` and browser flags like `--disable-dev-shm-usage` to mitigate some issues, but adequate shared memory is still crucial.

---

## III. Interacting with the Crawl4ai API Endpoints

### A. Authentication (`/token`)

#### A.1. Example: Python script to obtain an API token using a valid email.

This example assumes JWT authentication is enabled and "user@example.com" is whitelisted (this is illustrative; actual whitelisting is not part of the default config).

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")

# This email domain would need to be configured as allowed in your security settings
# if verify_email_domain is used.
email_to_test = "user@example.com"  # Replace with a valid email if your server uses domain verification
payload = {"email": email_to_test}

try:
    response = requests.post(f"{BASE_URL}/token", json=payload)
    if response.status_code == 200:
        token_data = response.json()
        print(f"Successfully obtained token for {email_to_test}:")
        print(json.dumps(token_data, indent=2))
        # Store this token for subsequent authenticated requests
        # API_TOKEN = token_data["access_token"]
    else:
        print(f"Failed to obtain token for {email_to_test}. Status: {response.status_code}, Response: {response.text}")
except requests.exceptions.RequestException as e:
    print(f"Error connecting to /token endpoint: {e}")
```

**Note:** The default `config.yml` has `security.jwt_enabled: false`. For this example to fully work, you would need to enable JWT and potentially configure allowed email domains.

#### A.2. Example: Python script attempting to obtain a token with an invalid email domain and handling the error.

```python
import requests
import os

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")

# Assuming "invalid-domain.com" is not whitelisted.
# The default Crawl4AI config doesn't whitelist specific domains for /token,
# but if `verify_email_domain` were true in auth.py, this would be relevant.
# For now, this will likely succeed if jwt_enabled is false, or fail if jwt_enabled is true
# and no user exists, or pass if jwt_enabled is true and any email can get a token.
payload = {"email": "test@invalid-domain.com"}

try:
    response = requests.post(f"{BASE_URL}/token", json=payload)
    if response.status_code == 400 and "Invalid email domain" in response.text:
        print(f"Correctly failed to obtain token for invalid domain: {response.text}")
    elif response.status_code == 200:
        print(f"Obtained token (unexpected if domain verification is strict): {response.json()}")
    else:
        print(f"Token request status: {response.status_code}, Response: {response.text}")
except requests.exceptions.RequestException as e:
    print(f"Error connecting to /token endpoint: {e}")
```
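Building on A.1, a small helper can fetch a token and build the auth header for the examples that follow. A sketch assuming the `access_token` field shown in A.1's response:

```python
import os

import requests

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")

def fetch_bearer_headers(email="user@example.com"):
    """Request a token from /token and return ready-to-use request headers.

    Assumes JWT auth is enabled and the email is acceptable to the server;
    falls back to empty headers otherwise.
    """
    try:
        response = requests.post(f"{BASE_URL}/token", json={"email": email})
        response.raise_for_status()
        token = response.json().get("access_token")
        if token:
            return {"Authorization": f"Bearer {token}"}
    except requests.exceptions.RequestException as e:
        print(f"Token request failed: {e}")
    return {}

# headers = fetch_bearer_headers()  # Use in place of get_headers() when JWT is enabled
```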
#### A.3. Example: Python script making an authenticated request to a protected endpoint.

This example assumes an endpoint like `/md` is protected and requires a token.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")

# First, obtain a token (replace with an actual token for a real protected setup).
# If API_TOKEN is set in the environment, get_headers() (section 1.1) will use it.
# If not, and the endpoint is truly protected, this will fail.
headers = get_headers()

md_payload = {"url": "https://example.com"}

try:
    response = requests.post(f"{BASE_URL}/md", json=md_payload, headers=headers)
    if response.status_code == 200:
        print("Successfully accessed protected /md endpoint.")
        print(json.dumps(response.json(), indent=2, ensure_ascii=False)[:500] + "...")
    elif response.status_code == 401 or response.status_code == 403:
        print(f"Authentication/Authorization failed for /md: {response.status_code} - {response.text}")
        print("Ensure JWT is enabled and you have a valid token if this endpoint is protected.")
    else:
        print(f"Request to /md failed: {response.status_code} - {response.text}")
except requests.exceptions.RequestException as e:
    print(f"Error connecting to /md endpoint: {e}")
```

**Note:** By default, most Crawl4ai endpoints are not protected by JWT even if `jwt_enabled` is true, unless explicitly decorated with `Depends(token_dep)`.

---

### B. Core Crawling Endpoints

#### B.1. `/crawl` (Asynchronous Job-based Crawling via Redis)

The `/crawl` endpoint submits a job to a Redis queue. You then poll the `/task/{task_id}` endpoint to get the status and results.

##### B.1.1. Example: Submitting a single URL crawl job and getting a `task_id`.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

payload = {
    "urls": ["https://example.com"],
    # browser_config and crawler_config are optional, defaults will be used
}

try:
    response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
    response.raise_for_status()
    job_data = response.json()
    task_id = job_data.get("task_id")
    if task_id:
        print(f"Crawl job submitted successfully. Task ID: {task_id}")
        print(f"Poll status at: {BASE_URL}/task/{task_id}")
    else:
        print(f"Failed to submit job or get task_id: {job_data}")
except requests.exceptions.RequestException as e:
    print(f"Error submitting crawl job: {e}")
```

##### B.1.2. Example: Submitting multiple URLs as a single crawl job.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

payload = {
    "urls": ["https://example.com", "https://www.python.org"],
}

try:
    response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
    response.raise_for_status()
    job_data = response.json()
    task_id = job_data.get("task_id")
    if task_id:
        print(f"Multi-URL crawl job submitted. Task ID: {task_id}")
    else:
        print(f"Failed to submit job: {job_data}")
except requests.exceptions.RequestException as e:
    print(f"Error submitting multi-URL crawl job: {e}")
```
##### B.1.3. Example: Submitting a crawl job with a custom `browser_config` (e.g., headless false).

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

payload = {
    "urls": ["https://example.com"],
    "browser_config": {
        "headless": False,  # Run browser in visible mode (if server environment supports UI)
        "viewport_width": 800,
        "viewport_height": 600
    }
}

try:
    response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
    response.raise_for_status()
    job_data = response.json()
    task_id = job_data.get("task_id")
    if task_id:
        print(f"Crawl job with custom browser_config submitted. Task ID: {task_id}")
    else:
        print(f"Failed to submit job: {job_data}")
except requests.exceptions.RequestException as e:
    print(f"Error submitting crawl job with custom browser_config: {e}")
```

##### B.1.4. Example: Submitting a crawl job with a custom `crawler_config` (e.g., specific `word_count_threshold`).

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

payload = {
    "urls": ["https://example.com"],
    "crawler_config": {
        "word_count_threshold": 50,  # Only process content blocks with more than 50 words
        "screenshot": True           # Also take a screenshot
    }
}

try:
    response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
    response.raise_for_status()
    job_data = response.json()
    task_id = job_data.get("task_id")
    if task_id:
        print(f"Crawl job with custom crawler_config submitted. Task ID: {task_id}")
    else:
        print(f"Failed to submit job: {job_data}")
except requests.exceptions.RequestException as e:
    print(f"Error submitting crawl job with custom crawler_config: {e}")
```

##### B.1.5. Example: Submitting a job that uses a specific `CacheMode` (e.g., `BYPASS`).

`CacheMode` values are typically: "DISABLED", "ENABLED", "BYPASS", "READ_ONLY", "WRITE_ONLY".

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

payload = {
    "urls": ["https://example.com"],
    "crawler_config": {
        "cache_mode": "BYPASS"  # Force a fresh crawl: ignore existing cache, don't write to cache
    }
}

try:
    response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
    response.raise_for_status()
    job_data = response.json()
    task_id = job_data.get("task_id")
    if task_id:
        print(f"Crawl job with CacheMode.BYPASS submitted. Task ID: {task_id}")
    else:
        print(f"Failed to submit job: {job_data}")
except requests.exceptions.RequestException as e:
    print(f"Error submitting crawl job with CacheMode.BYPASS: {e}")
```

##### B.1.6. Example: Submitting a job to extract PDF content from a URL.

(This assumes the URL points directly to a PDF or the page leads to a PDF download that the crawler handles.)

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

# URL of a sample PDF file
pdf_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
payload = {
    "urls": [pdf_url],
    "crawler_config": {
        # Crawl4ai should auto-detect PDF content type and use the appropriate processor
        "pdf": True  # Explicitly enabling PDF processing, though often auto-detected
    }
}

try:
    response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
    response.raise_for_status()
    job_data = response.json()
    task_id = job_data.get("task_id")
    if task_id:
        print(f"PDF crawl job submitted for {pdf_url}. Task ID: {task_id}")
        print(f"Poll status at: {BASE_URL}/task/{task_id}")
    else:
        print(f"Failed to submit PDF crawl job: {job_data}")
except requests.exceptions.RequestException as e:
    print(f"Error submitting PDF crawl job: {e}")
```
Task ID: {task_id}") print(f"Poll status at: {BASE_URL}/task/{task_id}") else: print(f"Failed to submit PDF crawl job: {job_data}") except requests.exceptions.RequestException as e: print(f"Error submitting PDF crawl job: {e}") ``` ##### B.1.7. Example: Submitting a job to take a screenshot from a URL. ```python import requests import os import json BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235") headers = get_headers() payload = { "urls": ["https://example.com"], "crawler_config": { "screenshot": True, "screenshot_wait_for": 2 # wait 2 seconds after page load before screenshot } } try: response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers) response.raise_for_status() job_data = response.json() task_id = job_data.get("task_id") if task_id: print(f"Screenshot job submitted for example.com. Task ID: {task_id}") print(f"Poll status at: {BASE_URL}/task/{task_id}") else: print(f"Failed to submit screenshot job: {job_data}") except requests.exceptions.RequestException as e: print(f"Error submitting screenshot job: {e}") ``` --- #### B.2. `/task/{task_id}` (Job Status and Results) ##### B.2.1. Example: Python script to poll the `/task/{task_id}` endpoint for PENDING status. ```python import requests import time import os import json BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235") headers = get_headers() # Assume task_id is obtained from a previous /crawl request # For this example, we'll submit a quick job first submit_payload = {"urls": ["http://example.com/nonexistent-page-for-quick-fail-or-processing"]} task_id = None try: submit_response = requests.post(f"{BASE_URL}/crawl", json=submit_payload, headers=headers) submit_response.raise_for_status() task_id = submit_response.json().get("task_id") except requests.exceptions.RequestException as e: print(f"Failed to submit initial job for polling example: {e}") if task_id: print(f"Polling for task: {task_id}") for _ in range(5): # Poll a few times try: status_response = requests.get(f"{BASE_URL}/task/{task_id}", headers=headers) status_response.raise_for_status() status_data = status_response.json() print(f"Current status: {status_data.get('status')}") if status_data.get('status') in ["COMPLETED", "FAILED"]: break time.sleep(2) except requests.exceptions.RequestException as e: print(f"Error polling task status: {e}") break else: print("No task ID to poll.") ``` ##### B.2.2. Example: Python script to retrieve results for a COMPLETED job. ```python import requests import time import os import json BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235") headers = get_headers() # Submit a job that should complete successfully submit_payload = {"urls": ["https://example.com"]} task_id = None try: submit_response = requests.post(f"{BASE_URL}/crawl", json=submit_payload, headers=headers) submit_response.raise_for_status() task_id = submit_response.json().get("task_id") except requests.exceptions.RequestException as e: print(f"Failed to submit job for result retrieval example: {e}") if task_id: print(f"Waiting for task {task_id} to complete...") while True: try: status_response = requests.get(f"{BASE_URL}/task/{task_id}", headers=headers) status_response.raise_for_status() status_data = status_response.json() current_status = status_data.get('status') print(f"Task status: {current_status}") if current_status == "COMPLETED": print("\nJob COMPLETED. 
Results:") # The 'result' field contains the JSON string of the CrawlResult model(s) # For a single URL job, it's typically a dict. For multiple, a list of dicts. # The structure from api.py suggests `result` field in Redis is a JSON string # of a dictionary which itself contains a 'results' key (list of CrawlResult dicts). # This is based on how handle_crawl_job in api.py stores results # and how the /task/{task_id} endpoint decodes it. # The 'result' from /task/{task_id} should already be a parsed dict. crawl_results_wrapper = status_data.get("result") if crawl_results_wrapper and "results" in crawl_results_wrapper: actual_results = crawl_results_wrapper["results"] for i, res_item in enumerate(actual_results): print(f"\n--- Result for URL {i+1} ({res_item.get('url', 'N/A')}) ---") print(f" Success: {res_item.get('success')}") print(f" Markdown (first 100 chars): {res_item.get('markdown', {}).get('raw_markdown', '')[:100]}...") if res_item.get('screenshot'): print(" Screenshot captured (base64 data not printed).") else: print(f"Unexpected result structure: {crawl_results_wrapper}") break elif current_status == "FAILED": print(f"\nJob FAILED. Error: {status_data.get('error')}") break time.sleep(3) # Poll every 3 seconds except requests.exceptions.RequestException as e: print(f"Error polling task status: {e}") break except KeyboardInterrupt: print("\nPolling interrupted.") break else: print("No task ID to retrieve results for.") ``` ##### B.2.3. Example: Python script to get error details for a FAILED job. ```python import requests import time import os import json BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235") headers = get_headers() # Submit a job that is likely to fail (e.g., invalid URL or one that times out quickly) submit_payload = {"urls": ["http://nonexistentdomain1234567890.com"]} task_id = None try: submit_response = requests.post(f"{BASE_URL}/crawl", json=submit_payload, headers=headers) submit_response.raise_for_status() task_id = submit_response.json().get("task_id") except requests.exceptions.RequestException as e: print(f"Failed to submit job for failure example: {e}") if task_id: print(f"Waiting for task {task_id} (expected to fail)...") while True: try: status_response = requests.get(f"{BASE_URL}/task/{task_id}", headers=headers) status_response.raise_for_status() status_data = status_response.json() current_status = status_data.get('status') print(f"Task status: {current_status}") if current_status == "FAILED": print("\nJob FAILED as expected.") error_message = status_data.get('error', 'No error message provided.') print(f"Error details: {error_message}") break elif current_status == "COMPLETED": print("\nJob COMPLETED unexpectedly.") break time.sleep(2) except requests.exceptions.RequestException as e: print(f"Error polling task status: {e}") break except KeyboardInterrupt: print("\nPolling interrupted.") break else: print("No task ID to check for failure.") ``` ##### B.2.4. Example: Full workflow - submit job, poll status, retrieve results or error. This combines the above examples into a more complete client script. 
##### B.2.4. Example: Full workflow - submit job, poll status, retrieve results or error.

This combines the above examples into a more complete client script.

```python
import requests
import time
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

def submit_and_poll(payload, timeout_seconds=60):
    task_id = None
    try:
        # Submit the job
        print(f"Submitting job with payload: {payload}")
        submit_response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
        submit_response.raise_for_status()
        task_id = submit_response.json().get("task_id")
        if not task_id:
            print("Error: No task_id received.")
            return None
        print(f"Job submitted. Task ID: {task_id}. Polling for completion...")

        # Poll for status
        start_time = time.time()
        while time.time() - start_time < timeout_seconds:
            status_response = requests.get(f"{BASE_URL}/task/{task_id}", headers=headers)
            status_response.raise_for_status()
            status_data = status_response.json()
            current_status = status_data.get('status')
            print(f"  Task {task_id} status: {current_status} (elapsed: {time.time() - start_time:.1f}s)")

            if current_status == "COMPLETED":
                print(f"Task {task_id} COMPLETED.")
                return status_data.get("result")  # This should be the parsed JSON result
            elif current_status == "FAILED":
                print(f"Task {task_id} FAILED.")
                print(f"Error: {status_data.get('error')}")
                return None
            time.sleep(5)  # Poll interval

        print(f"Task {task_id} timed out after {timeout_seconds} seconds.")
        return None
    except requests.exceptions.RequestException as e:
        print(f"API request error: {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None

if __name__ == "__main__":
    crawl_payload = {
        "urls": ["https://www.python.org/about/"],
        "crawler_config": {"screenshot": False}
    }
    results_data = submit_and_poll(crawl_payload)
    if results_data and "results" in results_data:
        for i, res_item in enumerate(results_data["results"]):
            print(f"\n--- Result for URL {res_item.get('url', 'N/A')} ---")
            print(f"  Success: {res_item.get('success')}")
            print(f"  Markdown (first 200 chars): {res_item.get('markdown', {}).get('raw_markdown', '')[:200]}...")
    elif results_data:
        # If result isn't in the expected wrapper structure
        print("\nReceived result data (unexpected structure):")
        print(json.dumps(results_data, indent=2, ensure_ascii=False))
```

---
#### B.3. `/crawl/stream` (Streaming Crawl Results)

##### B.3.1. Example: Python script to stream crawl results for a single URL and process NDJSON.

```python
import requests
import json
import os

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
headers['Accept'] = 'application/x-ndjson'  # Important for streaming

payload = {
    "urls": ["https://example.com"],
    "crawler_config": {"stream": True}  # Ensure stream is True in config
}

print(f"Streaming results for {payload['urls'][0]}...")
try:
    with requests.post(f"{BASE_URL}/crawl/stream", json=payload, headers=headers, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line:
                try:
                    result_chunk = json.loads(line.decode('utf-8'))
                    if "status" in result_chunk and result_chunk["status"] == "completed":
                        print("\nStream finished.")
                        break
                    print("\nReceived chunk:")
                    # Print some key info from the chunk
                    print(f"  URL: {result_chunk.get('url', 'N/A')}")
                    print(f"  Success: {result_chunk.get('success')}")
                    if 'markdown' in result_chunk and isinstance(result_chunk['markdown'], dict):
                        print(f"  Markdown (snippet): {result_chunk['markdown'].get('raw_markdown', '')[:100]}...")
                    else:
                        print(f"  Markdown (snippet): {str(result_chunk.get('markdown', ''))[:100]}...")
                    if result_chunk.get('error_message'):
                        print(f"  Error: {result_chunk.get('error_message')}")
                except json.JSONDecodeError as e:
                    print(f"Error decoding JSON line: {e} - Line: {line.decode('utf-8')}")
except requests.exceptions.RequestException as e:
    print(f"Error during streaming request: {e}")
```

##### B.3.2. Example: Python script to stream crawl results for multiple URLs.

```python
import requests
import json
import os

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
headers['Accept'] = 'application/x-ndjson'

payload = {
    "urls": ["https://example.com", "https://www.python.org/doc/"],
    "crawler_config": {"stream": True}
}

print("Streaming results for multiple URLs...")
try:
    with requests.post(f"{BASE_URL}/crawl/stream", json=payload, headers=headers, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line:
                try:
                    result_chunk = json.loads(line.decode('utf-8'))
                    if "status" in result_chunk and result_chunk["status"] == "completed":
                        print("\nStream finished for all URLs.")
                        break
                    print(f"\nChunk for URL: {result_chunk.get('url', 'N/A')}")
                    # Process or display part of the result
                    print(f"  Success: {result_chunk.get('success')}")
                    if 'markdown' in result_chunk and isinstance(result_chunk['markdown'], dict):
                        print(f"  Markdown (snippet): {result_chunk['markdown'].get('raw_markdown', '')[:70]}...")
                    else:
                        print(f"  Markdown (snippet): {str(result_chunk.get('markdown', ''))[:70]}...")
                except json.JSONDecodeError as e:
                    print(f"Error decoding JSON line: {e} - Line: {line.decode('utf-8')}")
except requests.exceptions.RequestException as e:
    print(f"Error during streaming request: {e}")
```
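The same NDJSON stream can be consumed asynchronously with `httpx` (imported in section 1.1). A minimal sketch assuming the endpoint behavior shown above:

```python
import asyncio
import json
import os

import httpx

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")

async def stream_crawl_async():
    payload = {"urls": ["https://example.com"], "crawler_config": {"stream": True}}
    headers = {"Accept": "application/x-ndjson"}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", f"{BASE_URL}/crawl/stream",
                                 json=payload, headers=headers) as response:
            response.raise_for_status()
            # Iterate over NDJSON lines as they arrive
            async for line in response.aiter_lines():
                if not line:
                    continue
                chunk = json.loads(line)
                if chunk.get("status") == "completed":
                    print("Stream finished.")
                    break
                print(f"Chunk for {chunk.get('url', 'N/A')}: success={chunk.get('success')}")

if __name__ == "__main__":
    asyncio.run(stream_crawl_async())
```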
##### B.3.3. Example: Streaming crawl results with custom `browser_config` and `crawler_config`.

```python
import requests
import json
import os

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
headers['Accept'] = 'application/x-ndjson'

payload = {
    "urls": ["https://example.com"],
    "browser_config": {
        "headless": True,
        "user_agent": "Crawl4AI-Stream-Tester/1.0"
    },
    "crawler_config": {
        "stream": True,
        "word_count_threshold": 10  # Lower threshold for this example
    }
}

print(f"Streaming results with custom configs for {payload['urls'][0]}...")
try:
    with requests.post(f"{BASE_URL}/crawl/stream", json=payload, headers=headers, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line:
                result_chunk = json.loads(line.decode('utf-8'))
                if "status" in result_chunk and result_chunk["status"] == "completed":
                    print("\nStream finished.")
                    break
                print("\nReceived chunk with custom config:")
                print(f"  URL: {result_chunk.get('url')}")
                print(f"  Word count threshold was: {payload['crawler_config']['word_count_threshold']}")
                if 'markdown' in result_chunk and isinstance(result_chunk['markdown'], dict):
                    print(f"  Markdown (snippet): {result_chunk['markdown'].get('raw_markdown', '')[:70]}...")
                else:
                    print(f"  Markdown (snippet): {str(result_chunk.get('markdown', ''))[:70]}...")
except requests.exceptions.RequestException as e:
    print(f"Error during streaming request: {e}")
```

##### B.3.4. Example: Handling connection closure or errors during streaming.

```python
import requests
import json
import os

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
headers['Accept'] = 'application/x-ndjson'

payload = {
    "urls": ["https://thissitedoesnotexist12345.com", "https://example.com"],  # First URL will fail
    "crawler_config": {"stream": True}
}

print("Streaming with a URL expected to fail...")
try:
    with requests.post(f"{BASE_URL}/crawl/stream", json=payload, headers=headers, stream=True) as response:
        # We might not get a non-200 status code immediately if the connection itself is established.
        # Errors for individual URLs will be part of the NDJSON stream.
        for line in response.iter_lines():
            if line:
                try:
                    result_chunk = json.loads(line.decode('utf-8'))
                    print(f"\nReceived data: {result_chunk.get('url', 'N/A')}")
                    if "status" in result_chunk and result_chunk["status"] == "completed":
                        print("Stream finished.")
                        break
                    if result_chunk.get('error_message'):
                        print(f"  ERROR for {result_chunk.get('url')}: {result_chunk.get('error_message')}")
                    elif result_chunk.get('success'):
                        print(f"  SUCCESS for {result_chunk.get('url')}")
                except json.JSONDecodeError as e:
                    print(f"  Error decoding JSON line: {e}")
except requests.exceptions.ChunkedEncodingError:
    print("Connection closed unexpectedly by server during streaming (ChunkedEncodingError).")
except requests.exceptions.RequestException as e:
    print(f"General error during streaming request: {e}")
```

---
### C. Content Transformation & Utility Endpoints

#### C.1. `/md` (Markdown Generation)

##### C.1.1. Example: Getting raw Markdown for a URL (default filter).

The default filter is `FIT` if no filter is specified; this example requests `RAW` explicitly.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

payload = {"url": "https://example.com", "f": "RAW"}  # 'f' is for filter_type

try:
    response = requests.post(f"{BASE_URL}/md", json=payload, headers=headers)
    response.raise_for_status()
    data = response.json()
    print("Markdown (RAW filter - first 300 chars):")
    print(data.get("markdown", "")[:300] + "...")
except requests.exceptions.RequestException as e:
    print(f"Error fetching Markdown: {e}")
```

##### C.1.2. Example: Getting Markdown using the `FIT` filter type.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

payload = {"url": "https://example.com", "f": "FIT"}

try:
    response = requests.post(f"{BASE_URL}/md", json=payload, headers=headers)
    response.raise_for_status()
    data = response.json()
    print("Markdown (FIT filter - first 300 chars):")
    print(data.get("markdown", "")[:300] + "...")
except requests.exceptions.RequestException as e:
    print(f"Error fetching Markdown: {e}")
```

##### C.1.3. Example: Getting Markdown using the `BM25` filter type with a specific query.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

payload = {
    "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "f": "BM25",
    "q": "What are the key features of Python?"  # Query for BM25 filtering
}

try:
    response = requests.post(f"{BASE_URL}/md", json=payload, headers=headers)
    response.raise_for_status()
    data = response.json()
    print(f"Markdown (BM25 filter, query='{payload['q']}' - first 300 chars):")
    print(data.get("markdown", "")[:300] + "...")
except requests.exceptions.RequestException as e:
    print(f"Error fetching Markdown: {e}")
```

##### C.1.4. Example: Getting Markdown using the `LLM` filter type with a query (conceptual, requires LLM setup).

This requires an LLM provider (like OpenAI) to be configured in `config.yml` or via environment variables loaded by the server.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

payload = {
    "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "f": "LLM",
    "q": "Summarize the history of Python"  # Query for LLM to focus on
}

print("Attempting LLM-filtered Markdown (this may take a moment and requires LLM config)...")
try:
    # LLM requests can take longer
    response = requests.post(f"{BASE_URL}/md", json=payload, headers=headers, timeout=120)
    response.raise_for_status()
    data = response.json()
    print(f"Markdown (LLM filter, query='{payload['q']}' - first 300 chars):")
    print(data.get("markdown", "")[:300] + "...")
except requests.exceptions.RequestException as e:
    print(f"Error fetching LLM-filtered Markdown: {e}")
    print("Ensure your LLM provider (e.g., OPENAI_API_KEY) is configured for the server.")
```
##### C.1.5. Example: Demonstrating cache usage with the `/md` endpoint (`c` parameter).

The `c` parameter controls caching: `"0"` reads from the cache when an entry is available, while `"1"` forces a fresh fetch and writes the result to the cache. Other numeric values are reserved for revision control (not shown here).

```python
import requests
import os
import json
import time

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()
test_url = "https://example.com"

# First call: cache miss, should fetch and write to cache
print("First call (c='1' forces a refresh and writes to cache)")
payload1 = {"url": test_url, "f": "RAW", "c": "1"}
start_time = time.time()
response1 = requests.post(f"{BASE_URL}/md", json=payload1, headers=headers)
duration1 = time.time() - start_time
response1.raise_for_status()
print(f"First call duration: {duration1:.2f}s. Markdown length: {len(response1.json().get('markdown', ''))}")

# Second call: should be a cache hit
print("\nSecond call (c='0' attempts to read from cache)")
payload2 = {"url": test_url, "f": "RAW", "c": "0"}
start_time = time.time()
response2 = requests.post(f"{BASE_URL}/md", json=payload2, headers=headers)
duration2 = time.time() - start_time
response2.raise_for_status()
print(f"Second call duration: {duration2:.2f}s. Markdown length: {len(response2.json().get('markdown', ''))}")

if duration2 < duration1 / 2 and duration1 > 0.1:  # Heuristic for cache hit
    print("Second call was significantly faster, likely a cache hit.")
else:
    print("Cache behavior inconclusive or first call was very fast.")
```

---

#### C.2. `/html` (Preprocessed HTML)

##### C.2.1. Example: Fetching preprocessed HTML for a URL suitable for schema extraction.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

payload = {"url": "https://example.com"}

try:
    response = requests.post(f"{BASE_URL}/html", json=payload, headers=headers)
    response.raise_for_status()
    data = response.json()
    print("Preprocessed HTML (first 500 chars):")
    print(data.get("html", "")[:500] + "...")
    print(f"\nOriginal URL: {data.get('url')}")
except requests.exceptions.RequestException as e:
    print(f"Error fetching preprocessed HTML: {e}")
```

---

#### C.3. `/screenshot`

##### C.3.1. Example: Generating a PNG screenshot for a URL and receiving base64 data.

```python
import requests
import os
import base64
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

payload = {"url": "https://example.com"}

try:
    response = requests.post(f"{BASE_URL}/screenshot", json=payload, headers=headers)
    response.raise_for_status()
    data = response.json()
    if data.get("screenshot"):
        print("Screenshot received (base64 data).")
        # To save the image:
        # image_data = base64.b64decode(data["screenshot"])
        # with open("example_screenshot.png", "wb") as f:
        #     f.write(image_data)
        # print("Screenshot saved as example_screenshot.png")
    else:
        print(f"Screenshot generation failed or no data returned: {data}")
except requests.exceptions.RequestException as e:
    print(f"Error generating screenshot: {e}")
```
##### C.3.2. Example: Generating a screenshot with a custom `screenshot_wait_for` delay.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

payload = {
    "url": "https://example.com",
    "screenshot_wait_for": 3  # Wait 3 seconds after page load
}

try:
    response = requests.post(f"{BASE_URL}/screenshot", json=payload, headers=headers)
    response.raise_for_status()
    data = response.json()
    if data.get("screenshot"):
        print(f"Screenshot with {payload['screenshot_wait_for']}s delay received.")
    else:
        print(f"Screenshot generation failed: {data}")
except requests.exceptions.RequestException as e:
    print(f"Error generating screenshot with delay: {e}")
```

##### C.3.3. Example: Saving screenshot to server-side path via `output_path`.

**Note:** This requires `output_path` to be a path accessible and writable by the server process. For Docker, this usually means a mounted volume.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

# This path needs to be valid from the server's perspective,
# e.g., if running in Docker, it might be a path inside the container
# that is mapped to a host volume.
server_side_path = "/app/screenshots/example_com.png"  # Example path

payload = {
    "url": "https://example.com",
    "output_path": server_side_path
}

try:
    response = requests.post(f"{BASE_URL}/screenshot", json=payload, headers=headers)
    response.raise_for_status()
    data = response.json()
    if data.get("success") and data.get("path"):
        print(f"Screenshot successfully saved to server path: {data.get('path')}")
        print("Note: This file is on the server, not the client machine unless paths are mapped.")
    else:
        print(f"Failed to save screenshot to server: {data}")
except requests.exceptions.RequestException as e:
    print(f"Error saving screenshot to server: {e}")
```

---

#### C.4. `/pdf`

##### C.4.1. Example: Generating a PDF for a URL and receiving base64 data.

```python
import requests
import os
import base64
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

payload = {"url": "https://example.com"}

try:
    response = requests.post(f"{BASE_URL}/pdf", json=payload, headers=headers)
    response.raise_for_status()
    data = response.json()
    if data.get("pdf"):
        print("PDF received (base64 data).")
        # To save the PDF:
        # pdf_data = base64.b64decode(data["pdf"])
        # with open("example_page.pdf", "wb") as f:
        #     f.write(pdf_data)
        # print("PDF saved as example_page.pdf")
    else:
        print(f"PDF generation failed or no data returned: {data}")
except requests.exceptions.RequestException as e:
    print(f"Error generating PDF: {e}")
```

##### C.4.2. Example: Saving PDF to server-side path via `output_path`.

**Note:** Similar to screenshots, `output_path` must be server-accessible.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

server_side_path = "/app/pdfs/example_com.pdf"  # Example path

payload = {
    "url": "https://example.com",
    "output_path": server_side_path
}

try:
    response = requests.post(f"{BASE_URL}/pdf", json=payload, headers=headers)
    response.raise_for_status()
    data = response.json()
    if data.get("success") and data.get("path"):
        print(f"PDF successfully saved to server path: {data.get('path')}")
    else:
        print(f"Failed to save PDF to server: {data}")
except requests.exceptions.RequestException as e:
    print(f"Error saving PDF to server: {e}")
```
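Both the screenshot and PDF endpoints return base64 payloads; the commented-out save code above can be factored into a small helper. A sketch (the helper name is illustrative):

```python
import base64

def save_b64_payload(b64_data: str, output_file: str) -> None:
    """Decode a base64 string returned by /screenshot or /pdf and write it to disk."""
    with open(output_file, "wb") as f:
        f.write(base64.b64decode(b64_data))

# Example usage with the responses from the examples above:
# save_b64_payload(data["screenshot"], "example_screenshot.png")
# save_b64_payload(data["pdf"], "example_page.pdf")
```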
---

#### C.5. `/execute_js`

##### C.5.1. Example: Executing a simple JavaScript snippet (e.g., `return document.title;`) on a page.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

payload = {
    "url": "https://example.com",
    "scripts": ["return document.title;"]
}

try:
    response = requests.post(f"{BASE_URL}/execute_js", json=payload, headers=headers)
    response.raise_for_status()
    data = response.json()  # This is the full CrawlResult model as JSON
    print("Full CrawlResult from /execute_js:")
    # print(json.dumps(data, indent=2, ensure_ascii=False))  # Can be very long
    js_results = data.get("js_execution_result")
    if js_results and js_results.get("script_0"):
        print(f"\nResult of script 0 (document.title): {js_results['script_0']}")
    else:
        print(f"\nCould not find JS execution result: {js_results}")
except requests.exceptions.RequestException as e:
    print(f"Error executing JS: {e}")
```

##### C.5.2. Example: Executing multiple JavaScript snippets sequentially.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

payload = {
    "url": "https://example.com",
    "scripts": [
        "return document.title;",
        "return document.querySelectorAll('p').length;",
        "(() => { const h1 = document.querySelector('h1'); return h1 ? h1.innerText : 'No H1'; })()"
    ]
}

try:
    response = requests.post(f"{BASE_URL}/execute_js", json=payload, headers=headers)
    response.raise_for_status()
    data = response.json()
    js_results = data.get("js_execution_result")
    if js_results:
        print("\nResults of JS snippets:")
        print(f"  Script 0 (Title): {js_results.get('script_0')}")
        print(f"  Script 1 (Paragraph count): {js_results.get('script_1')}")
        print(f"  Script 2 (H1 text): {js_results.get('script_2')}")
    else:
        print(f"\nCould not find JS execution results: {js_results}")
except requests.exceptions.RequestException as e:
    print(f"Error executing multiple JS snippets: {e}")
```

##### C.5.3. Example: Demonstrating how the full `CrawlResult` (JSON of model) is returned.

The `/execute_js` endpoint returns the entire `CrawlResult` object, serialized to JSON. This includes HTML, Markdown, links, etc., in addition to the `js_execution_result`.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

payload = {
    "url": "https://example.com",
    "scripts": ["return window.location.href;"]
}

try:
    response = requests.post(f"{BASE_URL}/execute_js", json=payload, headers=headers)
    response.raise_for_status()
    crawl_result_data = response.json()
    print("Demonstrating full CrawlResult structure from /execute_js:")
    print(f"  URL crawled: {crawl_result_data.get('url')}")
    print(f"  Success: {crawl_result_data.get('success')}")
    print(f"  HTML (snippet): {crawl_result_data.get('html', '')[:100]}...")
    if isinstance(crawl_result_data.get('markdown'), dict):
        print(f"  Markdown (snippet): {crawl_result_data['markdown'].get('raw_markdown', '')[:100]}...")
    else:
        print(f"  Markdown (snippet): {str(crawl_result_data.get('markdown', ''))[:100]}...")
    js_result = crawl_result_data.get("js_execution_result", {}).get("script_0")
    print(f"  Result of JS (window.location.href): {js_result}")
except requests.exceptions.RequestException as e:
    print(f"Error demonstrating full CrawlResult: {e}")
```

---
### D. Contextual Endpoints

#### D.1. `/ask` (RAG-like Context Retrieval)

The `/ask` endpoint uses local Markdown files (`c4ai-code-context.md` and `c4ai-doc-context.md`, which should be in the same directory as `server.py`) for retrieval.

##### D.1.1. Example: Asking a general question to retrieve "code" context.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

params = {
    "context_type": "code",
    "query": "How to handle Playwright installation?"  # General query
}

try:
    response = requests.get(f"{BASE_URL}/ask", params=params, headers=headers)
    response.raise_for_status()
    data = response.json()
    print("Retrieved 'code' context for 'How to handle Playwright installation?':")
    if "code_results" in data:
        for i, item in enumerate(data["code_results"][:2]):  # Show first 2 results
            print(f"\n--- Code Result {i+1} (Score: {item.get('score', 0):.2f}) ---")
            print(item.get("text", "")[:300] + "...")
    else:
        print(json.dumps(data, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error asking for code context: {e}")
```

##### D.1.2. Example: Asking a general question to retrieve "doc" context.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

params = {
    "context_type": "doc",
    "query": "Explain Crawl4ai API endpoints"
}

try:
    response = requests.get(f"{BASE_URL}/ask", params=params, headers=headers)
    response.raise_for_status()
    data = response.json()
    print("Retrieved 'doc' context for 'Explain Crawl4ai API endpoints':")
    if "doc_results" in data:
        for i, item in enumerate(data["doc_results"][:2]):
            print(f"\n--- Doc Result {i+1} (Score: {item.get('score', 0):.2f}) ---")
            print(item.get("text", "")[:300] + "...")
    else:
        print(json.dumps(data, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error asking for doc context: {e}")
```

##### D.1.3. Example: Using the `query` parameter to filter context related to a specific function.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

params = {
    "context_type": "all",  # Search both code and docs
    "query": "AsyncWebCrawler arun method"
}

try:
    response = requests.get(f"{BASE_URL}/ask", params=params, headers=headers)
    response.raise_for_status()
    data = response.json()
    print(f"Retrieved 'all' context for query: '{params['query']}'")
    if "code_results" in data:
        print(f"\nFound {len(data['code_results'])} code results.")
        # Optionally print snippets
    if "doc_results" in data:
        print(f"Found {len(data['doc_results'])} doc results.")
        # Optionally print snippets
    # print(json.dumps(data, indent=2, ensure_ascii=False)[:1000] + "...")
except requests.exceptions.RequestException as e:
    print(f"Error asking with specific query: {e}")
```
##### D.1.4. Example: Adjusting `score_ratio` to change result sensitivity.

A lower `score_ratio` (e.g., 0.1) returns more, but less relevant, results; a higher one (e.g., 0.8) is stricter. The default is 0.5.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

params_strict = {
    "context_type": "code",
    "query": "Playwright browser installation",
    "score_ratio": 0.8  # Higher, more strict
}
params_loose = {
    "context_type": "code",
    "query": "Playwright browser installation",
    "score_ratio": 0.2  # Lower, less strict
}

try:
    response_strict = requests.get(f"{BASE_URL}/ask", params=params_strict, headers=headers)
    response_strict.raise_for_status()
    data_strict = response_strict.json()
    print(f"Results with score_ratio=0.8: {len(data_strict.get('code_results', []))}")

    response_loose = requests.get(f"{BASE_URL}/ask", params=params_loose, headers=headers)
    response_loose.raise_for_status()
    data_loose = response_loose.json()
    print(f"Results with score_ratio=0.2: {len(data_loose.get('code_results', []))}")
except requests.exceptions.RequestException as e:
    print(f"Error adjusting score_ratio: {e}")
```

##### D.1.5. Example: Limiting results with `max_results`.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

params = {
    "context_type": "doc",
    "query": "crawl4ai features",
    "max_results": 3  # Limit to top 3 results
}

try:
    response = requests.get(f"{BASE_URL}/ask", params=params, headers=headers)
    response.raise_for_status()
    data = response.json()
    print(f"Retrieved max {params['max_results']} doc_results for 'crawl4ai features':")
    if "doc_results" in data:
        print(f"Actual results returned: {len(data['doc_results'])}")
        for item in data["doc_results"]:
            print(f"  - Score: {item.get('score', 0):.2f}, Text (snippet): {item.get('text', '')[:50]}...")
    else:
        print("No doc_results found.")
except requests.exceptions.RequestException as e:
    print(f"Error limiting results: {e}")
```

---

### E. Server & Configuration Information

#### E.1. `/config/dump`

##### E.1.1. Example: Dumping a `CrawlerRunConfig` Python object representation to its JSON equivalent via the API.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

# This is a Python-style string representation of a CrawlerRunConfig
# that the server's _safe_eval_config can parse.
config_string = "CrawlerRunConfig(word_count_threshold=50, screenshot=True, cache_mode=CacheMode.BYPASS)"
payload = {"code": config_string}

try:
    response = requests.post(f"{BASE_URL}/config/dump", json=payload, headers=headers)
    response.raise_for_status()
    dumped_json = response.json()
    print("Dumped CrawlerRunConfig JSON:")
    print(json.dumps(dumped_json, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error dumping CrawlerRunConfig: {e}")
```

##### E.1.2. Example: Dumping a `BrowserConfig` Python object representation to its JSON equivalent via the API.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

config_string = "BrowserConfig(headless=False, user_agent='MyTestAgent/1.0')"
payload = {"code": config_string}

try:
    response = requests.post(f"{BASE_URL}/config/dump", json=payload, headers=headers)
    response.raise_for_status()
    dumped_json = response.json()
    print("Dumped BrowserConfig JSON:")
    print(json.dumps(dumped_json, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error dumping BrowserConfig: {e}")
```
##### E.1.3. Example: Attempting to dump an invalid or non-serializable configuration string.

```python
import requests
import os

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

# Invalid: not a recognized Crawl4AI config class
invalid_config_string = "MyCustomClass(param=1)"
payload = {"code": invalid_config_string}

try:
    response = requests.post(f"{BASE_URL}/config/dump", json=payload, headers=headers)
    if response.status_code == 400:
        print(f"Correctly failed to dump invalid config string. Server response: {response.json()}")
    else:
        print(f"Unexpected response for invalid config: {response.status_code} - {response.text}")
except requests.exceptions.RequestException as e:
    print(f"Error attempting to dump invalid config: {e}")

# Invalid: nested function call (security restriction)
unsafe_config_string = "CrawlerRunConfig(word_count_threshold=__import__('os').system('echo unsafe'))"
payload_unsafe = {"code": unsafe_config_string}

try:
    response_unsafe = requests.post(f"{BASE_URL}/config/dump", json=payload_unsafe, headers=headers)
    if response_unsafe.status_code == 400:
        print(f"Correctly failed to dump unsafe config string. Server response: {response_unsafe.json()}")
    else:
        print(f"Unexpected response for unsafe config: {response_unsafe.status_code} - {response_unsafe.text}")
except requests.exceptions.RequestException as e:
    print(f"Error attempting to dump unsafe config: {e}")
```

---

#### E.2. `/schema`

##### E.2.1. Example: Fetching the default JSON schemas for `BrowserConfig` and `CrawlerRunConfig`.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

try:
    response = requests.get(f"{BASE_URL}/schema", headers=headers)
    response.raise_for_status()
    schemas = response.json()

    print("BrowserConfig Schema (sample):")
    # print(json.dumps(schemas.get("browser"), indent=2))  # Full schema can be long
    if "browser" in schemas and "properties" in schemas["browser"]:
        print(f"  BrowserConfig has {len(schemas['browser']['properties'])} properties.")
        print(f"  Example property 'headless': {schemas['browser']['properties'].get('headless')}")

    print("\nCrawlerRunConfig Schema (sample):")
    # print(json.dumps(schemas.get("crawler"), indent=2))
    if "crawler" in schemas and "properties" in schemas["crawler"]:
        print(f"  CrawlerRunConfig has {len(schemas['crawler']['properties'])} properties.")
        print(f"  Example property 'word_count_threshold': {schemas['crawler']['properties'].get('word_count_threshold')}")
except requests.exceptions.RequestException as e:
    print(f"Error fetching schemas: {e}")
```

---

#### E.3. `/health` & `/metrics`

##### E.3.1. Example: Python script to programmatically check the `/health` endpoint.

(Similar to example 2.4.1, but reiterated here for completeness of this section.)

```python
import requests
import os
import json
from datetime import datetime

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

try:
    response = requests.get(f"{BASE_URL}/health", headers=headers)
    response.raise_for_status()
    health_data = response.json()
    print("Health Check:")
    print(f"  Status: {health_data.get('status')}")
    print(f"  Version: {health_data.get('version')}")
    ts = health_data.get('timestamp')
    if ts:
        print(f"  Timestamp: {ts} (UTC: {datetime.utcfromtimestamp(ts).isoformat()})")
except requests.exceptions.RequestException as e:
    print(f"Error checking health: {e}")
```
##### E.3.2. Example: Accessing Prometheus metrics at `/metrics` (assuming Prometheus is enabled in `config.yml`).

This typically involves pointing a Prometheus scraper at the endpoint or manually fetching it.

```python
import requests
import os

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
# Prometheus metrics are usually at /metrics; server.py reads
# config["observability"]["prometheus"]["endpoint"], which defaults to "/metrics".
METRICS_ENDPOINT = "/metrics"  # As per default config.yml
headers = get_headers()

try:
    # There is no public endpoint that reports whether Prometheus is enabled;
    # we simply try the metrics endpoint and interpret the response.
    print(f"Attempting to fetch metrics from {BASE_URL}{METRICS_ENDPOINT}")
    response = requests.get(f"{BASE_URL}{METRICS_ENDPOINT}", headers=headers)
    if response.status_code == 200:
        print("Prometheus metrics response (first 500 chars):")
        print(response.text[:500] + "...")
    elif response.status_code == 404:
        print(f"Metrics endpoint {METRICS_ENDPOINT} not found. Ensure Prometheus is enabled in config.yml.")
    else:
        print(f"Error fetching metrics: {response.status_code} - {response.text}")
except requests.exceptions.RequestException as e:
    print(f"Error connecting to metrics endpoint: {e}")
```

**Note:** For this to work, `observability.prometheus.enabled` must be `true` in the server's `config.yml`.

---

## IV. Configuring the Deployment (via `config.yml`)

### 4.1. Note: These examples primarily show snippets of `config.yml` and describe their effect, rather than Python code to modify the live configuration.

The `config.yml` file is read by the server on startup. Changes typically require a server restart.

### 4.2. Rate Limiting Configuration

#### 4.2.1. Example `config.yml` snippet: Enabling rate limiting with a custom limit (e.g., "10/second").

```yaml
# In your config.yml
rate_limiting:
  enabled: true
  default_limit: "10/second"  # Allows 10 requests per second per client IP
  # trusted_proxies: ["127.0.0.1"]  # If behind a reverse proxy
```

#### 4.2.2. Example `config.yml` snippet: Using Redis as a storage backend for rate limiting.

This is recommended for production if you have multiple server instances.

```yaml
# In your config.yml
rate_limiting:
  enabled: true
  default_limit: "1000/minute"
  storage_uri: "redis://localhost:6379"  # Or your Redis server URI
  # Ensure your Redis server is running and accessible
```
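With rate limiting enabled, exceeding the limit typically produces HTTP 429 responses. A client-side probe can confirm the behavior; a sketch, assuming the configured limiter covers the endpoint being probed and responds with 429:

```python
import os

import requests

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")

# Fire a burst of requests and count how many get throttled.
# Assumes rate_limiting.enabled: true with a low default_limit (e.g., "10/second").
throttled = 0
for _ in range(30):
    response = requests.get(f"{BASE_URL}/health")
    if response.status_code == 429:
        throttled += 1
print(f"Requests throttled with HTTP 429: {throttled}/30")
```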
Use `["*"]` to allow all (less secure). ```yaml # In your config.yml security: enabled: true trusted_hosts: ["api.example.com", "localhost", "127.0.0.1"] ``` #### 4.3.4. Example `config.yml` snippet: Configuring custom HTTP security headers (CSP, X-Frame-Options). ```yaml # In your config.yml security: enabled: true headers: x_content_type_options: "nosniff" x_frame_options: "DENY" content_security_policy: "default-src 'self'; script-src 'self' 'unsafe-inline'; object-src 'none';" strict_transport_security: "max-age=31536000; includeSubDomains" ``` --- ### 4.4. LLM Provider Configuration #### 4.4.1. Example `config.yml` snippet: Setting the default LLM provider and API key env variable. ```yaml # In your config.yml llm: provider: "openai/gpt-4o-mini" # Default provider/model api_key_env: "OPENAI_API_KEY" # Environment variable to read the API key from ``` The server will then expect the `OPENAI_API_KEY` environment variable to be set. #### 4.4.2. Example `config.yml` snippet: Overriding the API key directly in the config (for testing/specific cases). **Warning:** Not recommended for production due to security risks of hardcoding keys. ```yaml # In your config.yml llm: provider: "openai/gpt-3.5-turbo" api_key: "sk-this_is_a_test_key_do_not_use_in_prod" # Key directly in config ``` #### 4.4.3. Example `config.yml` snippet: Configuring for a different LiteLLM-supported provider (e.g., Groq). ```yaml # In your config.yml llm: provider: "groq/llama3-8b-8192" api_key_env: "GROQ_API_KEY" # Server will look for this env var ``` --- ### 4.5. Default Crawler Settings These settings in `config.yml` under the `crawler` key affect the default behavior if not overridden by specific `BrowserConfig` or `CrawlerRunConfig` in API requests. #### 4.5.1. Example `config.yml` snippet: Modifying `crawler.base_config.simulate_user`. ```yaml # In your config.yml crawler: base_config: simulate_user: true # Enable user simulation features by default ``` #### 4.5.2. Example `config.yml` snippet: Adjusting `crawler.memory_threshold_percent`. This is for the `MemoryAdaptiveDispatcher`. ```yaml # In your config.yml crawler: memory_threshold_percent: 85.0 # Pause new tasks if system memory usage exceeds 85% ``` #### 4.5.3. Example `config.yml` snippet: Configuring default `crawler.rate_limiter` parameters. ```yaml # In your config.yml crawler: rate_limiter: enabled: true base_delay: [0.5, 1.5] # Default delay between 0.5 and 1.5 seconds ``` #### 4.5.4. Example `config.yml` snippet: Adding default browser arguments to `crawler.browser.extra_args`. ```yaml # In your config.yml crawler: browser: # Default kwargs for BrowserConfig # headless: true # text_mode: false # etc. extra_args: - "--disable-gpu" # Already default, but shown for example - "--window-size=1920,1080" # Add other chromium flags as needed ``` #### 4.5.5. Example `config.yml` snippet: Changing `crawler.pool.max_pages` (global semaphore). This controls the maximum number of concurrent browser pages globally for the server. ```yaml # In your config.yml crawler: pool: max_pages: 20 # Allow up to 20 concurrent browser pages ``` #### 4.5.6. Example `config.yml` snippet: Changing `crawler.pool.idle_ttl_sec` (janitor GC timeout). This controls how long an idle browser instance in the pool will live before being closed. ```yaml # In your config.yml crawler: pool: idle_ttl_sec: 600 # Close idle browsers after 10 minutes (default is 30 min) ``` --- ## V. Model-Controller-Presenter (MCP) Bridge Integration ### 5.1. Overview of MCP and its purpose with Crawl4ai. 
---

## V. Model Context Protocol (MCP) Bridge Integration

### 5.1. Overview of MCP and its purpose with Crawl4ai.

The Model Context Protocol (MCP) bridge allows AI tools and agents (like Claude Code, and potentially others in the future) to interact with Crawl4ai's capabilities as "tools." Crawl4ai endpoints decorated with `@mcp_tool` become callable functions for these AI agents. This enables AIs to leverage web crawling and data extraction within their reasoning and task-execution processes.

### 5.2. Accessing MCP Endpoints

#### 5.2.1. Example: Conceptual connection to the MCP WebSocket endpoint (`/mcp/ws`).

Connecting to `/mcp/ws` would typically be done by an MCP-compatible client library; a lower-level sketch with a generic WebSocket client follows after example 5.2.3.

```python
# This is a conceptual Python example using a hypothetical MCP client library.
# For actual MCP client usage, refer to the specific MCP tool's documentation.

# from mcp_client_library import MCPClient  # Hypothetical library

# async def connect_mcp_ws():
#     mcp_url = f"{BASE_URL.replace('http', 'ws')}/mcp/ws"
#     async with MCPClient(mcp_url) as client:
#         print(f"Connected to MCP WebSocket at {mcp_url}")
#         # ... send/receive MCP messages ...
#         # e.g., await client.list_tools()
#         # e.g., await client.call_tool(tool_name="crawl", arguments={"urls": ["https://example.com"]})

# if __name__ == "__main__":
#     asyncio.run(connect_mcp_ws())  # Uncomment if you have a client library

print("MCP WebSocket conceptual connection. Real client library needed.")
```

#### 5.2.2. Example: Conceptual connection to the MCP SSE endpoint (`/mcp/sse`).

Server-Sent Events (SSE) is another transport for MCP.

```python
# Similar to WebSocket, an MCP-compatible SSE client would be used.
# from sseclient import SSEClient  # A possible library for SSE

# def connect_mcp_sse():
#     mcp_sse_url = f"{BASE_URL}/mcp/sse"
#     print(f"Attempting to connect to MCP SSE at {mcp_sse_url} (conceptual)")
#     try:
#         messages = SSEClient(mcp_sse_url)  # Synchronous; an async client would be preferable
#         for msg in messages:
#             print(f"MCP SSE Message: {msg.data}")
#             if is_init_complete(msg):  # Hypothetical stop condition, e.g., after the init message
#                 break
#     except Exception as e:
#         print(f"Error with MCP SSE: {e}")

# if __name__ == "__main__":
#     connect_mcp_sse()  # Uncomment if you have a client library

print("MCP SSE conceptual connection. Real client library needed.")
```

#### 5.2.3. Example: Fetching the MCP schema from `/mcp/schema` using `requests`.

This endpoint provides information about available MCP tools and resources.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

try:
    response = requests.get(f"{BASE_URL}/mcp/schema", headers=headers)
    response.raise_for_status()
    mcp_schema = response.json()

    print("MCP Schema:")
    # print(json.dumps(mcp_schema, indent=2))  # Can be verbose
    if "tools" in mcp_schema:
        print(f"\nAvailable MCP Tools ({len(mcp_schema['tools'])}):")
        for tool in mcp_schema["tools"][:3]:  # Show first 3 tools
            print(f"  - Name: {tool.get('name')}, Description: {tool.get('description', '')[:50]}...")
    if "resources" in mcp_schema:
        print(f"\nAvailable MCP Resources ({len(mcp_schema['resources'])}):")
        for resource in mcp_schema["resources"][:3]:  # Show first 3 resources
            print(f"  - Name: {resource.get('name')}, Description: {resource.get('description', '')[:50]}...")
except requests.exceptions.RequestException as e:
    print(f"Error fetching MCP schema: {e}")
```
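If you want to probe the WebSocket transport without a full MCP client, the generic `websockets` package can open the connection and exchange raw JSON-RPC frames. This is a minimal sketch under stated assumptions: the `list_tools` method name and framing mirror the JSON-RPC structure shown in example 5.3.2 and may not match your server's MCP implementation exactly.

```python
import asyncio
import json
import os

import websockets  # pip install websockets

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")

async def probe_mcp_ws():
    mcp_url = f"{BASE_URL.replace('http', 'ws')}/mcp/ws"
    async with websockets.connect(mcp_url) as ws:
        # Send a raw JSON-RPC request; the "list_tools" method is assumed for illustration.
        await ws.send(json.dumps({
            "jsonrpc": "2.0",
            "method": "list_tools",
            "params": {},
            "id": "probe-1",
        }))
        reply = await ws.recv()
        print(f"Raw MCP reply (first 300 chars): {reply[:300]}")

if __name__ == "__main__":
    asyncio.run(probe_mcp_ws())
```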
### 5.3. Understanding MCP Tool Exposure

#### 5.3.1. Explanation: How endpoints decorated with `@mcp_tool` become available through the MCP bridge.

In `server.py`, FastAPI endpoints decorated with `@mcp_tool("tool_name")` are automatically registered with the MCP bridge. The bridge then exposes these tools (like `/crawl`, `/md`, `/screenshot`, etc.) to connected MCP clients (e.g., AI agents). The arguments of the FastAPI endpoint function become the expected arguments for the MCP tool call; a schematic sketch of this registration pattern follows after example 5.3.2.

#### 5.3.2. Example: Invoking a Crawl4ai tool (e.g., `/md`) through a simulated MCP client request structure.

This is a conceptual illustration. A real MCP client would handle the JSON-RPC formatting for calls via WebSocket or SSE; the SSE client POSTs its messages to a client-specific endpoint (e.g., `/mcp/messages/` plus a client ID established during the SSE handshake), so the payload below cannot be sent without first establishing a session.

```python
import json

# A proper MCP client handles session IDs and JSON-RPC framing; this example
# only illustrates the payload shape for a tool call.
mcp_tool_call_payload = {
    "jsonrpc": "2.0",
    "method": "call_tool",
    "params": {
        "name": "md",  # The tool name, matches @mcp_tool("md")
        "arguments": {
            # These map to the FastAPI endpoint's Pydantic model or parameters
            "body": {  # Matches the 'body: MarkdownRequest' in the get_markdown endpoint
                "url": "https://example.com",
                "f": "RAW"
            }
        }
    },
    "id": "some_unique_request_id"
}

# A direct POST of this payload to the server (outside an established SSE/WS session)
# is not standard MCP for these transports and will generally not work.
print("Conceptual MCP tool call payload (actual call needs proper client/transport):")
print(json.dumps(mcp_tool_call_payload, indent=2))
```
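For orientation, here is a schematic sketch (not the actual `server.py` source) of the registration pattern described in 5.3.1: a FastAPI endpoint wrapped by an `@mcp_tool` decorator whose name becomes the MCP tool name. The decorator below is a stand-in written for this example; the real decorator's signature and internals live in the Crawl4ai server code.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
_MCP_TOOL_REGISTRY = {}  # Illustrative stand-in for the bridge's tool registry

def mcp_tool(name: str):
    """Schematic decorator: record the endpoint so an MCP bridge could expose it."""
    def decorator(func):
        _MCP_TOOL_REGISTRY[name] = func  # A real bridge would also capture the argument schema
        return func
    return decorator

class MarkdownRequest(BaseModel):
    url: str
    f: str = "RAW"

@app.post("/md")
@mcp_tool("md")  # Exposed to MCP clients as the "md" tool
async def get_markdown(body: MarkdownRequest):
    # The endpoint's parameters (here, MarkdownRequest) define the tool's arguments.
    return {"url": body.url, "filter": body.f}

print(f"Registered MCP tools: {list(_MCP_TOOL_REGISTRY)}")
```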
---

## VI. Advanced Scenarios & Client-Side Best Practices

### 6.1. Chaining API Calls for Complex Workflows

#### 6.1.1. Example: Fetch preprocessed HTML using `/html`, then use this HTML as input to a local `crawl4ai` instance or another tool (conceptual).

```python
import requests
import os
import asyncio

# Assuming crawl4ai is also installed as a library for local processing
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

async def chained_workflow():
    target_url = "https://example.com/article"

    # Step 1: Fetch preprocessed HTML from the API
    print(f"Step 1: Fetching preprocessed HTML for {target_url} via API...")
    html_payload = {"url": target_url}
    preprocessed_html = None
    try:
        response = requests.post(f"{BASE_URL}/html", json=html_payload, headers=headers)
        response.raise_for_status()
        data = response.json()
        preprocessed_html = data.get("html")
        if preprocessed_html:
            print(f"Successfully fetched preprocessed HTML (length: {len(preprocessed_html)}).")
        else:
            print("Failed to get preprocessed HTML from API.")
            return
    except requests.exceptions.RequestException as e:
        print(f"Error fetching preprocessed HTML: {e}")
        return

    # Step 2: Use this HTML with a local Crawl4AI instance for further processing
    # (e.g., applying a very specific local Markdown generator or extraction)
    if preprocessed_html:
        print("\nStep 2: Processing fetched HTML with a local Crawl4AI instance...")
        # Example: generate Markdown using a specific local configuration
        custom_md_generator = DefaultMarkdownGenerator(
            content_source="raw_html",   # We are feeding it raw HTML
            options={"body_width": 0}    # No line wrapping
        )
        local_run_config = CrawlerRunConfig(markdown_generator=custom_md_generator)

        async with AsyncWebCrawler() as local_crawler:
            # Use the "raw:" prefix to tell the local crawler this is direct HTML content
            result = await local_crawler.arun(url=f"raw:{preprocessed_html}", config=local_run_config)
            if result.success and result.markdown:
                print("Markdown generated locally from API-fetched HTML (first 300 chars):")
                print(result.markdown.raw_markdown[:300] + "...")
            else:
                print(f"Local processing failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(chained_workflow())
```

---

### 6.2. API Error Handling

#### 6.2.1. Example: Python script showing robust error handling for common HTTP status codes (400, 401, 403, 404, 422, 500) when calling the Crawl4ai API.

```python
import requests
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
# Use a token known to be invalid or expired if testing 401/403 with auth enabled:
# invalid_headers = {"Authorization": "Bearer invalidtoken123"}
# For this example, we'll use the standard get_headers().
headers = get_headers()

def make_api_call(endpoint, method="GET", payload=None):
    url = f"{BASE_URL}/{endpoint.lstrip('/')}"
    try:
        if method.upper() == "GET":
            response = requests.get(url, params=payload, headers=headers, timeout=10)
        elif method.upper() == "POST":
            response = requests.post(url, json=payload, headers=headers, timeout=10)
        else:
            print(f"Unsupported method: {method}")
            return

        print(f"\n--- Testing {method} {url} with payload {payload} ---")
        print(f"Status Code: {response.status_code}")

        if response.ok:  # status_code < 400
            print("Response JSON:")
            try:
                print(json.dumps(response.json(), indent=2, ensure_ascii=False)[:500] + "...")
            except json.JSONDecodeError:
                print("Response is not valid JSON.")
                print(f"Response Text (snippet): {response.text[:200]}...")
        else:
            print(f"Error Response Text: {response.text}")
            # Specific handling based on status code
            if response.status_code == 400:
                print("Handling Bad Request (400)... Possible malformed payload.")
            elif response.status_code == 401:
                print("Handling Unauthorized (401)... API token might be missing or invalid.")
            elif response.status_code == 403:
                print("Handling Forbidden (403)... API token might lack permissions or be IP-restricted.")
            elif response.status_code == 404:
                print("Handling Not Found (404)... Endpoint or resource does not exist.")
            elif response.status_code == 422:
                print("Handling Unprocessable Entity (422)... Validation error with request data.")
                print(f"Details: {response.json().get('detail')}")
            elif response.status_code >= 500:
                print("Handling Server Error (5xx)... Problem on the server side.")

    except requests.exceptions.Timeout:
        print(f"Request to {url} timed out.")
    except requests.exceptions.ConnectionError:
        print(f"Could not connect to {url}. Is the server running?")
    except requests.exceptions.RequestException as e:
        print(f"An unexpected request error occurred for {url}: {e}")

if __name__ == "__main__":
    # Test a valid endpoint
    make_api_call("/health")
    # Test a non-existent endpoint (expected 404)
    make_api_call("/nonexistent_endpoint")
    # Test /md with a missing URL (expected 422)
    make_api_call("/md", method="POST", payload={"f": "RAW"})
    # Test /token with an invalid payload (expected 422 if email is missing)
    make_api_call("/token", method="POST", payload={"not_email": "test"})

    # If JWT is enabled, an unauthenticated call to a protected endpoint returns 401/403.
    # For this example, assume /admin is a hypothetical protected endpoint:
    # make_api_call("/admin")  # Called without a valid Authorization header
```

---

### 6.3. Client-Side Script for Long-Running Jobs

#### 6.3.1. Example: A Python client that submits a job to `/crawl`, polls `/task/{task_id}` with backoff, and retrieves results.

This is a more robust version of the polling mechanism shown earlier.

```python
import requests
import time
import os
import json

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")
headers = get_headers()

def submit_job_and_wait_with_backoff(payload, max_poll_time=300, initial_poll_interval=2,
                                     max_poll_interval=30, backoff_factor=1.5):
    try:
        # 1. Submit the job
        submit_response = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
        submit_response.raise_for_status()
        task_id = submit_response.json().get("task_id")
        if not task_id:
            print("Failed to get task_id from submission.")
            return None
        print(f"Job submitted. Task ID: {task_id}. Polling with backoff...")

        # 2. Poll with exponential backoff
        poll_interval = initial_poll_interval
        start_time = time.time()
        while time.time() - start_time < max_poll_time:
            status_response = requests.get(f"{BASE_URL}/task/{task_id}", headers=headers)
            status_response.raise_for_status()
            status_data = status_response.json()
            current_status = status_data.get("status")
            print(f"  Task {task_id} status: {current_status} (next poll in {poll_interval:.1f}s)")

            if current_status == "COMPLETED":
                print(f"Task {task_id} COMPLETED.")
                return status_data.get("result")
            elif current_status == "FAILED":
                print(f"Task {task_id} FAILED. Error: {status_data.get('error')}")
                return None

            time.sleep(poll_interval)
            poll_interval = min(poll_interval * backoff_factor, max_poll_interval)

        print(f"Task {task_id} polling timed out after {max_poll_time} seconds.")
        return None
    except requests.exceptions.RequestException as e:
        print(f"API Error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None

if __name__ == "__main__":
    # Example of a potentially longer job (crawling a site known for being slow or large)
    long_job_payload = {
        "urls": ["https://archive.org/web/"],            # A site that might take a bit longer
        "crawler_config": {"word_count_threshold": 500}  # Higher threshold
    }
    print("\n--- Testing Long-Running Job Client ---")
    job_result = submit_job_and_wait_with_backoff(long_job_payload, max_poll_time=120)  # 2-minute timeout

    if job_result and "results" in job_result:
        for res_item in job_result["results"]:
            print(f"\nResult for {res_item.get('url')}:")
            print(f"  Success: {res_item.get('success')}")
            if res_item.get('success'):
                md_length = len(res_item.get('markdown', {}).get('raw_markdown', ''))
                print(f"  Markdown Length: {md_length}")
    elif job_result:
        print("\nReceived result data (unexpected structure):")
        print(json.dumps(job_result, indent=2, ensure_ascii=False))
    else:
        print("\nJob did not complete successfully or timed out.")
```

---

### 6.4. Batching Requests to `/crawl/stream` vs. `/crawl`

#### 6.4.1. Discussion: When to use streaming for many URLs vs. submitting a single job with multiple URLs.

*   **`/crawl` (Job-based, polling):**
    *   **Pros:**
        *   Better for very large numbers of URLs where you don't need immediate feedback for each.
        *   Robust to client disconnections (the job continues on the server).
        *   The Redis queue handles load and persistence of jobs.
        *   The server manages concurrency and resources more globally.
    *   **Cons:**
        *   Requires a polling mechanism on the client side.
        *   Results are only available once the entire batch is complete (or, for a multi-URL job, once the server has finished processing and aggregating the individual URLs).
    *   **Use when:** You have hundreds or thousands of URLs, can tolerate some delay for results, and want a fire-and-forget submission style.

*   **`/crawl/stream` (Streaming):**
    *   **Pros:**
        *   Real-time feedback: results for each URL are streamed back as soon as they are processed.
        *   Simpler client logic if immediate processing of individual results is needed.
        *   Good for interactive applications or dashboards.
    *   **Cons:**
        *   The client must maintain an open connection; if it drops, the stream is lost.
        *   Can be less efficient for very large numbers of URLs if each URL is processed sequentially within the stream handler on the server (though `handle_stream_crawl_request` does process them concurrently up to server limits).
        *   The client needs to handle NDJSON parsing; a client sketch follows below.
    *   **Use when:** You need results for URLs as they come in, are processing a moderate number of URLs, or are building an interactive tool.

**General Guideline:**

*   For a few to a few dozen URLs where you want results quickly and can process them one by one: `/crawl/stream`.
*   For hundreds or thousands of URLs, or when you prefer to submit a batch and check back later: `/crawl` with polling.
*   If using `/crawl/stream` for many URLs, make sure your client-side processing of each streamed result is fast, so the client does not become a bottleneck. The server side uses an `AsyncGenerator` that processes URLs concurrently up to its internal limits, so the client should be ready to consume results efficiently.
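To make the NDJSON handling concrete, here is a minimal async client sketch for `/crawl/stream` using `httpx`. It assumes the endpoint accepts the same `urls`/`crawler_config` payload shape as `/crawl`, that streaming is requested via a `stream: true` flag in `crawler_config` (an assumption to verify against your server), and that each line of the response body is one standalone JSON result.

```python
import asyncio
import json
import os

import httpx

BASE_URL = os.environ.get("CRAWL4AI_BASE_URL", "http://localhost:11235")

async def consume_crawl_stream():
    payload = {
        "urls": ["https://example.com", "https://example.org"],
        "crawler_config": {"stream": True},  # Assumed flag; match your server's config
    }
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST", f"{BASE_URL}/crawl/stream", json=payload, headers=get_headers()
        ) as response:
            response.raise_for_status()
            # NDJSON: each non-empty line is a standalone JSON document for one result
            async for line in response.aiter_lines():
                if not line.strip():
                    continue
                result = json.loads(line)
                print(f"Streamed result: url={result.get('url')} success={result.get('success')}")

if __name__ == "__main__":
    asyncio.run(consume_crawl_stream())
```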