* Fix: Use correct URL variable for raw HTML extraction (#1116)
  - Prevents full HTML content from being passed as URL to extraction strategies
  - Added unit tests to verify raw HTML and regular URL processing
* Fix #1181: Preserve whitespace in code blocks during HTML scraping
  The remove_empty_elements_fast() method was removing whitespace-only span elements inside <pre> and <code> tags, causing import statements like "import torch" to become "importtorch". Now skips elements inside code blocks where whitespace is significant.
* Refactor Pydantic model configuration to use ConfigDict for arbitrary types
* Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621
* Fix: Permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638
* fix: Ensure BrowserConfig.to_dict serializes proxy_config
* feat: Make LLM backoff configurable end-to-end
  - Extend LLMConfig with backoff delay/attempt/factor fields and thread them through LLMExtractionStrategy, LLMContentFilter, table extraction, and the Docker API handlers
  - Expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff and document them in the md_v2 guides
* Reproduced AttributeError from #1642
* Pass timeout parameter to Docker client request
* Added missing deep crawling objects to init
* Generalized query in ContentRelevanceFilter to accept a str or list
* Import modules from enhanceable deserialization
* Parameterized tests
* Fix: Capture current page URL to reflect JavaScript navigation, and add a test for delayed redirects. ref #1268
* refactor: Replace PyPDF2 with pypdf across the codebase. ref #1412
* Add browser_context_id and target_id parameters to BrowserConfig
  Enable Crawl4AI to connect to pre-created CDP browser contexts, which is essential for cloud browser services that pre-create isolated contexts.
  Changes:
  - Add browser_context_id and target_id parameters to BrowserConfig
  - Update from_kwargs() and to_dict() methods
  - Modify BrowserManager.start() to use an existing context when provided
  - Add _get_page_by_target_id() helper method
  - Update get_page() to handle pre-existing targets
  - Add test for browser_context_id functionality
  This enables cloud services to:
  1. Create isolated CDP contexts before Crawl4AI connects
  2. Pass context/target IDs to BrowserConfig
  3. Have Crawl4AI reuse existing contexts instead of creating new ones
* Add cdp_cleanup_on_close flag to prevent memory leaks in cloud/server scenarios
* Fix: Add cdp_cleanup_on_close to from_kwargs
* Fix: Find context by target_id for concurrent CDP connections
* Fix: Use target_id to find the correct page in get_page
* Fix: Use CDP to find the context by browserContextId for concurrent sessions
* Revert context matching attempts - Playwright cannot see CDP-created contexts
* Add create_isolated_context flag for concurrent CDP crawls
  When True, forces creation of a new browser context instead of reusing the default context. Essential for concurrent crawls on the same browser to prevent navigation conflicts.
* Add context caching to the create_isolated_context branch
  Uses the contexts_by_config cache (same as non-CDP mode) to reuse contexts for multiple URLs with the same config. Still creates a new page per crawl for navigation isolation. Benefits batch/deep crawls.
* Add init_scripts support to BrowserConfig for pre-page-load JS injection
  This adds the ability to inject JavaScript that runs before any page loads, useful for stealth evasions (canvas/audio fingerprinting, userAgentData).
  - Add init_scripts parameter to BrowserConfig (list of JS strings)
  - Apply init_scripts in setup_context() via context.add_init_script()
  - Update from_kwargs() and to_dict() for serialization
* Fix CDP connection handling: support WS URLs and proper cleanup
  Changes to browser_manager.py:
  1.
     _verify_cdp_ready(): Support multiple URL formats
     - WebSocket URLs (ws://, wss://): skip HTTP verification; Playwright handles them directly
     - HTTP URLs with query params: parse properly with urlparse to preserve the query string
     - Fixes an issue where the naive f"{cdp_url}/json/version" broke WS URLs and query params
  2. close(): Proper cleanup when cdp_cleanup_on_close=True
     - Close all sessions (pages)
     - Close all contexts
     - Call browser.close() to disconnect (doesn't terminate the browser, just releases the connection)
     - Wait 1 second for the CDP connection to fully release
     - Stop the Playwright instance to prevent memory leaks
  This enables:
  - Connecting to specific browsers via WS URL
  - Reusing the same browser with multiple sequential connections
  - No user wait needed between connections (the internal 1s delay handles it)
  Added tests/browser/test_cdp_cleanup_reuse.py with comprehensive tests.
* Update gitignore
* Some debugging for caching
* Add _generate_screenshot_from_html for raw: and file:// URLs
  Implements the missing method that was being called but never defined. Now raw: and file:// URLs can generate screenshots by:
  1. Loading HTML into a browser page via page.set_content()
  2. Taking a screenshot using the existing take_screenshot() method
  3. Cleaning up the page afterward
  This enables cached HTML to be rendered with screenshots in crawl4ai-cloud.
* Add PDF and MHTML support for raw: and file:// URLs
  - Replace _generate_screenshot_from_html with _generate_media_from_html
  - New method handles screenshot, PDF, and MHTML in one browser session
  - Update raw: and file:// URL handlers to use the new method
  - Enables cached HTML to generate all media types
* Add crash recovery for deep crawl strategies
  Add optional resume_state and on_state_change parameters to all deep crawl strategies (BFS, DFS, Best-First) for cloud deployment crash recovery.
  Features:
  - resume_state: pass saved state to resume from a checkpoint
  - on_state_change: async callback fired after each URL for real-time state persistence to external storage (Redis, DB, etc.)
  - export_state(): get the last captured state manually
  - Zero overhead when the features are disabled (None defaults)
  State includes visited URLs, pending queue/stack, depths, and the pages_crawled count. All state is JSON-serializable.
* Fix: HTTP strategy raw: URL parsing truncates at the # character
  The AsyncHTTPCrawlerStrategy.crawl() method used urlparse() to extract content from raw: URLs. This caused HTML with CSS color codes like #eee to be truncated, because # is treated as a URL fragment delimiter.
  Before: raw:body{background:#eee} -> parsed.path = 'body{background:'
  After:  raw:body{background:#eee} -> raw_content = 'body{background:#eee'
  Fix: Strip the raw: or raw:// prefix directly instead of using urlparse, matching how the browser strategy handles it.
* Add base_url parameter to CrawlerRunConfig for raw HTML processing
  When processing raw: HTML (e.g., from cache), the URL parameter is meaningless for markdown link resolution. This adds a base_url parameter that can be set explicitly to provide proper URL resolution context.
  Changes:
  - Add base_url parameter to CrawlerRunConfig.__init__
  - Add base_url to CrawlerRunConfig.from_kwargs
  - Update aprocess_html to use base_url for markdown generation
  Usage:
    config = CrawlerRunConfig(base_url='https://example.com')
    result = await crawler.arun(url='raw:{html}', config=config)
* Add prefetch mode for two-phase deep crawling
  - Add `prefetch` parameter to CrawlerRunConfig
  - Add `quick_extract_links()` function for fast link extraction
  - Add a short-circuit in aprocess_html() for prefetch mode
  - Add 42 tests (unit, integration, regression)
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Updates on proxy rotation and proxy configuration
* Add proxy support to the HTTP crawler strategy
* Add browser pipeline support for raw:/file:// URLs
  - Add process_in_browser parameter to CrawlerRunConfig
  - Route raw:/file:// URLs through _crawl_web() when browser operations are needed
  - Use page.set_content() instead of goto() for local content
  - Fix cookie handling for non-HTTP URLs in browser_manager
  - Auto-detect browser requirements: js_code, wait_for, screenshot, etc.
  - Maintain the fast path for raw:/file:// without browser params
  Fixes #310
* Add smart TTL cache for the sitemap URL seeder
  - Add cache_ttl_hours and validate_sitemap_lastmod params to SeedingConfig
  - New JSON cache format with metadata (version, created_at, lastmod, url_count)
  - Cache validation by TTL expiry and sitemap lastmod comparison
  - Auto-migration from the old .jsonl to the new .json format
  - Fixes a bug where an incomplete cache was used indefinitely
* Update URL seeder docs with smart TTL cache parameters
  - Add cache_ttl_hours and validate_sitemap_lastmod to the parameter table
  - Document smart TTL cache validation with examples
  - Add cache-related troubleshooting entries
  - Update the key features summary
* Add MEMORY.md to gitignore
* Docs: Add multi-sample schema generation section
  Add documentation explaining how to pass multiple HTML samples to generate_schema() for stable selectors that work across pages with varying DOM structures. Includes:
  - Problem explanation (fragile nth-child selectors)
  - Solution with code example
  - Key points for multi-sample queries
  - Comparison table of fragile vs stable selectors
* Fix critical RCE and LFI vulnerabilities in the Docker API deployment
  Security fixes for vulnerabilities reported by ProjectDiscovery:
  1. Remote Code Execution via hooks (CVE pending)
     - Remove __import__ from allowed_builtins in hook_manager.py
     - Prevents arbitrary module imports (os, subprocess, etc.)
     - Hooks are now disabled by default via the CRAWL4AI_HOOKS_ENABLED env var
  2. Local File Inclusion via file:// URLs (CVE pending)
     - Add URL scheme validation to /execute_js, /screenshot, /pdf, /html
     - Block file://, javascript:, data: and other dangerous schemes
     - Only allow http://, https://, and raw: (where appropriate)
  3.
     Security hardening
     - Add CRAWL4AI_HOOKS_ENABLED=false as the default (opt-in for hooks)
     - Add security warning comments in config.yml
     - Add a validate_url_scheme() helper for consistent validation
  Testing:
  - Add unit tests (test_security_fixes.py) - 16 tests
  - Add integration tests (run_security_tests.py) for a live server
  Affected endpoints:
  - POST /crawl (hooks disabled by default)
  - POST /crawl/stream (hooks disabled by default)
  - POST /execute_js (URL validation added)
  - POST /screenshot (URL validation added)
  - POST /pdf (URL validation added)
  - POST /html (URL validation added)
  Breaking changes:
  - Hooks require CRAWL4AI_HOOKS_ENABLED=true to function
  - file:// URLs no longer work on API endpoints (use the library directly)
* Enhance the authentication flow by implementing JWT token retrieval and adding authorization headers to API requests
* Add release notes for v0.7.9, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates
* Add release notes for v0.8.0, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates
  Documentation for the v0.8.0 release:
  - SECURITY.md: security policy and vulnerability reporting guidelines
  - RELEASE_NOTES_v0.8.0.md: comprehensive release notes
  - migration/v0.8.0-upgrade-guide.md: step-by-step migration guide
  - security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts
  - CHANGELOG.md: updated with v0.8.0 changes
  Breaking changes documented:
  - Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED)
  - file:// URLs blocked on Docker API endpoints
  Security fixes credited to Neo by ProjectDiscovery
* Add examples for deep crawl crash recovery and prefetch mode in the documentation
* Release v0.8.0: The v0.8.0 Update
  - Updated version to 0.8.0
  - Added comprehensive demo and release notes
  - Updated all documentation
* Update the security researcher acknowledgment with a hyperlink for Neo by ProjectDiscovery
* Add async agenerate_schema method for schema generation
  - Extract
    prompt building to shared _build_schema_prompt() method
  - Add agenerate_schema() async version using aperform_completion_with_backoff
  - Refactor generate_schema() to use the shared prompt builder
  - Fixes Gemini/Vertex AI compatibility in async contexts (FastAPI)
* Fix: Enable litellm.drop_params for O-series/GPT-5 model compatibility
  O-series (o1, o3) and GPT-5 models only support temperature=1. Setting litellm.drop_params=True auto-drops unsupported parameters instead of throwing UnsupportedParamsError. Fixes the temperature=0.01 error for these models in LLM extraction.
---------
Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: unclecode <unclecode@kidocode.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
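To make the raw: URL truncation fix concrete: urlparse() treats `#` as a fragment delimiter, so CSS color codes get cut off. A minimal sketch of the prefix-stripping approach the changelog describes (the helper name here is illustrative, not Crawl4AI's actual function):

```python
from urllib.parse import urlparse

def extract_raw_content(url: str) -> str:
    # Strip the raw: / raw:// prefix directly; urlparse() would cut
    # the content at the first '#', mangling CSS color codes.
    # Check the longer prefix first, since "raw:" is a prefix of "raw://".
    for prefix in ("raw://", "raw:"):
        if url.startswith(prefix):
            return url[len(prefix):]
    return url

# The bug: '#eee' becomes a URL fragment and the path is truncated.
truncated = urlparse("raw:body{background:#eee}").path   # 'body{background:'
# The fix keeps the content intact.
intact = extract_raw_content("raw:body{background:#eee}")  # 'body{background:#eee}'
```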
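The #1181 whitespace fix can be illustrated with a stdlib sketch (this is not Crawl4AI's actual remove_empty_elements_fast() implementation): whitespace-only `<span>` elements are dropped everywhere except inside `<pre>`/`<code>`, where removing them would fuse tokens like "import torch" into "importtorch".

```python
import xml.etree.ElementTree as ET

def strip_empty_spans(root: ET.Element) -> ET.Element:
    """Drop whitespace-only <span> elements, except inside <pre>/<code>."""
    def walk(el: ET.Element, in_code: bool) -> None:
        in_code = in_code or el.tag in ("pre", "code")
        prev = None
        for child in list(el):
            walk(child, in_code)
            removable = (
                child.tag == "span"
                and len(child) == 0                 # no child elements
                and not (child.text or "").strip()  # whitespace-only
                and not in_code                     # whitespace matters here
            )
            if removable:
                # Re-attach the removed element's tail text so no
                # surrounding text is lost.
                tail = child.tail or ""
                if prev is None:
                    el.text = (el.text or "") + tail
                else:
                    prev.tail = (prev.tail or "") + tail
                el.remove(child)
            else:
                prev = child
    walk(root, False)
    return root

doc = ET.fromstring(
    "<div>"
    "<pre><code><span>import</span><span> </span><span>torch</span></code></pre>"
    "<p><span> </span>x</p>"
    "</div>"
)
strip_empty_spans(doc)
print("".join(doc.find("pre/code").itertext()))  # import torch
```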
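The _verify_cdp_ready() URL-format handling can be sketched in isolation (the helper name below is ours; it only mirrors the behavior the changelog describes): ws:// and wss:// endpoints skip HTTP verification entirely, while http(s) URLs get /json/version appended to the *path*, keeping any query string intact instead of naively appending to the full URL string.

```python
from typing import Optional
from urllib.parse import urlparse, urlunparse

def cdp_version_url(cdp_url: str) -> Optional[str]:
    # WebSocket endpoints are handed to Playwright directly; there is
    # no HTTP /json/version probe to build for them.
    parts = urlparse(cdp_url)
    if parts.scheme in ("ws", "wss"):
        return None
    # Append to the path component so an existing query string survives,
    # unlike the naive f"{cdp_url}/json/version".
    path = parts.path.rstrip("/") + "/json/version"
    return urlunparse(parts._replace(path=path))

cdp_version_url("http://host:9222?token=x")
# -> 'http://host:9222/json/version?token=x'
```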
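The smart TTL cache validation rule for the sitemap seeder reduces to two checks. A hedged sketch under assumed conventions (field names follow the metadata listed in the changelog; timestamps are taken to be Unix epoch seconds, which Crawl4AI's actual format may not use):

```python
import time
from typing import Optional

def cache_is_valid(meta: dict, cache_ttl_hours: float,
                   sitemap_lastmod: Optional[float] = None) -> bool:
    # Reuse a cached sitemap snapshot only if (a) it is younger than the
    # TTL and (b) the live sitemap's lastmod is not newer than the cached
    # one. Either failing check forces a re-fetch.
    age_hours = (time.time() - meta["created_at"]) / 3600.0
    if age_hours > cache_ttl_hours:
        return False
    cached_lastmod = meta.get("lastmod")
    if sitemap_lastmod is not None and cached_lastmod is not None:
        return sitemap_lastmod <= cached_lastmod
    return True
```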
845 lines
32 KiB
Python
# ───────────────────────── server.py ─────────────────────────
"""
Crawl4AI FastAPI entry‑point

• Browser pool + global page cap
• Rate‑limiting, security, metrics
• /crawl, /crawl/stream, /md, /llm endpoints
"""

# ── stdlib & 3rd‑party imports ───────────────────────────────
from crawler_pool import get_crawler, close_all, janitor
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from auth import create_access_token, get_token_dependency, TokenRequest
from typing import Optional, List, Dict
from fastapi import Request, Depends
from fastapi.responses import FileResponse
import base64
import re
import logging
from api import (
    handle_markdown_request, handle_llm_qa,
    handle_stream_crawl_request, handle_crawl_request,
    stream_results
)
from schemas import (
    CrawlRequestWithHooks,
    MarkdownRequest,
    RawCode,
    HTMLRequest,
    ScreenshotRequest,
    PDFRequest,
    JSEndpointRequest,
)

from utils import (
    FilterType, load_config, setup_logging, verify_email_domain
)
import os
import sys
import time
import asyncio
from contextlib import asynccontextmanager
import pathlib

from fastapi import (
    FastAPI, HTTPException, Request, Path, Query, Depends
)
from rank_bm25 import BM25Okapi
from fastapi.responses import (
    StreamingResponse, RedirectResponse, PlainTextResponse, JSONResponse
)
from fastapi.middleware.httpsredirect import HTTPSRedirectMiddleware
from fastapi.middleware.trustedhost import TrustedHostMiddleware
from fastapi.staticfiles import StaticFiles
from job import init_job_router

from mcp_bridge import attach_mcp, mcp_resource, mcp_template, mcp_tool

import ast
import crawl4ai as _c4
from pydantic import BaseModel, Field
from slowapi import Limiter
from slowapi.util import get_remote_address
from prometheus_fastapi_instrumentator import Instrumentator
from redis import asyncio as aioredis

# ── internal imports (after sys.path append) ─────────────────
sys.path.append(os.path.dirname(os.path.realpath(__file__)))

# ────────────────── configuration / logging ──────────────────
config = load_config()
setup_logging(config)

__version__ = "0.5.1-d1"

# ── global page semaphore (hard cap) ─────────────────────────
MAX_PAGES = config["crawler"]["pool"].get("max_pages", 30)
GLOBAL_SEM = asyncio.Semaphore(MAX_PAGES)

# ── security feature flags ───────────────────────────────────
# Hooks are disabled by default for security (RCE risk). Set to "true" to enable.
HOOKS_ENABLED = os.environ.get("CRAWL4AI_HOOKS_ENABLED", "false").lower() == "true"


# ── default browser config helper ─────────────────────────────
def get_default_browser_config() -> BrowserConfig:
    """Get default BrowserConfig from config.yml."""
    return BrowserConfig(
        extra_args=config["crawler"]["browser"].get("extra_args", []),
        **config["crawler"]["browser"].get("kwargs", {}),
    )


# An earlier, logging-heavy version of the page cap, kept for reference:
# page_log = logging.getLogger("page_cap")
# orig_arun = AsyncWebCrawler.arun
# async def capped_arun(self, *a, **kw):
#     await GLOBAL_SEM.acquire()                      # take a slot
#     try:
#         in_flight = MAX_PAGES - GLOBAL_SEM._value   # used permits
#         page_log.info("🕸️ pages_in_flight=%s / %s", in_flight, MAX_PAGES)
#         return await orig_arun(self, *a, **kw)
#     finally:
#         GLOBAL_SEM.release()                        # free the slot

orig_arun = AsyncWebCrawler.arun


async def capped_arun(self, *a, **kw):
    async with GLOBAL_SEM:
        return await orig_arun(self, *a, **kw)

AsyncWebCrawler.arun = capped_arun
# ───────────────────── FastAPI lifespan ──────────────────────
@asynccontextmanager
async def lifespan(_: FastAPI):
    from crawler_pool import init_permanent
    from monitor import MonitorStats
    import monitor as monitor_module

    # Initialize monitor
    monitor_module.monitor_stats = MonitorStats(redis)
    await monitor_module.monitor_stats.load_from_redis()
    monitor_module.monitor_stats.start_persistence_worker()

    # Initialize browser pool
    await init_permanent(BrowserConfig(
        extra_args=config["crawler"]["browser"].get("extra_args", []),
        **config["crawler"]["browser"].get("kwargs", {}),
    ))

    # Start background tasks
    app.state.janitor = asyncio.create_task(janitor())
    app.state.timeline_updater = asyncio.create_task(_timeline_updater())

    yield

    # Cleanup
    app.state.janitor.cancel()
    app.state.timeline_updater.cancel()

    # Monitor cleanup (persist stats and stop workers)
    from monitor import get_monitor
    try:
        await get_monitor().cleanup()
    except Exception as e:
        logger.error(f"Monitor cleanup failed: {e}")

    await close_all()


async def _timeline_updater():
    """Update timeline data every 5 seconds."""
    from monitor import get_monitor
    while True:
        await asyncio.sleep(5)
        try:
            await asyncio.wait_for(get_monitor().update_timeline(), timeout=4.0)
        except asyncio.TimeoutError:
            logger.warning("Timeline update timeout after 4s")
        except Exception as e:
            logger.warning(f"Timeline update error: {e}")
# ───────────────────── FastAPI instance ──────────────────────
app = FastAPI(
    title=config["app"]["title"],
    version=config["app"]["version"],
    lifespan=lifespan,
)

# ── static playground ──────────────────────────────────────
STATIC_DIR = pathlib.Path(__file__).parent / "static" / "playground"
if not STATIC_DIR.exists():
    raise RuntimeError(f"Playground assets not found at {STATIC_DIR}")
app.mount(
    "/playground",
    StaticFiles(directory=STATIC_DIR, html=True),
    name="play",
)

# ── static monitor dashboard ────────────────────────────────
MONITOR_DIR = pathlib.Path(__file__).parent / "static" / "monitor"
if not MONITOR_DIR.exists():
    raise RuntimeError(f"Monitor assets not found at {MONITOR_DIR}")
app.mount(
    "/dashboard",
    StaticFiles(directory=MONITOR_DIR, html=True),
    name="monitor_ui",
)

# ── static assets (logo, etc.) ───────────────────────────────
ASSETS_DIR = pathlib.Path(__file__).parent / "static" / "assets"
if ASSETS_DIR.exists():
    app.mount(
        "/static/assets",
        StaticFiles(directory=ASSETS_DIR),
        name="assets",
    )


@app.get("/")
async def root():
    return RedirectResponse("/playground")
# ─────────────────── infra / middleware ─────────────────────
redis = aioredis.from_url(config["redis"].get("uri", "redis://localhost"))

limiter = Limiter(
    key_func=get_remote_address,
    default_limits=[config["rate_limiting"]["default_limit"]],
    storage_uri=config["rate_limiting"]["storage_uri"],
)


def _setup_security(app_: FastAPI):
    sec = config["security"]
    if not sec["enabled"]:
        return
    if sec.get("https_redirect"):
        app_.add_middleware(HTTPSRedirectMiddleware)
    if sec.get("trusted_hosts", []) != ["*"]:
        app_.add_middleware(
            TrustedHostMiddleware, allowed_hosts=sec["trusted_hosts"]
        )


_setup_security(app)

if config["observability"]["prometheus"]["enabled"]:
    Instrumentator().instrument(app).expose(app)

token_dep = get_token_dependency(config)


@app.middleware("http")
async def add_security_headers(request: Request, call_next):
    resp = await call_next(request)
    if config["security"]["enabled"]:
        resp.headers.update(config["security"]["headers"])
    return resp
# ───────────────── URL validation helper ─────────────────
ALLOWED_URL_SCHEMES = ("http://", "https://")
ALLOWED_URL_SCHEMES_WITH_RAW = ("http://", "https://", "raw:", "raw://")


def validate_url_scheme(url: str, allow_raw: bool = False) -> None:
    """Validate the URL scheme to prevent file:// LFI attacks."""
    allowed = ALLOWED_URL_SCHEMES_WITH_RAW if allow_raw else ALLOWED_URL_SCHEMES
    if not url.startswith(allowed):
        schemes = ", ".join(allowed)
        raise HTTPException(400, f"URL must start with {schemes}")
# ───────────────── safe config‑dump helper ─────────────────
ALLOWED_TYPES = {
    "CrawlerRunConfig": CrawlerRunConfig,
    "BrowserConfig": BrowserConfig,
}


def _safe_eval_config(expr: str) -> dict:
    """
    Accept exactly one top‑level call to CrawlerRunConfig(...) or BrowserConfig(...).
    Whatever is inside the parentheses is fine *except* further function calls
    (so no __import__('os') stuff). All public names from crawl4ai are available
    when we eval.
    """
    tree = ast.parse(expr, mode="eval")

    # must be a single call
    if not isinstance(tree.body, ast.Call):
        raise ValueError("Expression must be a single constructor call")

    call = tree.body
    if not (isinstance(call.func, ast.Name) and call.func.id in {"CrawlerRunConfig", "BrowserConfig"}):
        raise ValueError(
            "Only CrawlerRunConfig(...) or BrowserConfig(...) are allowed")

    # forbid nested calls to keep the surface tiny
    for node in ast.walk(call):
        if isinstance(node, ast.Call) and node is not call:
            raise ValueError("Nested function calls are not permitted")

    # expose everything that crawl4ai exports, nothing else
    safe_env = {name: getattr(_c4, name)
                for name in dir(_c4) if not name.startswith("_")}
    obj = eval(compile(tree, "<config>", "eval"),
               {"__builtins__": {}}, safe_env)
    return obj.dump()
# ── job router ──────────────────────────────────────────────
app.include_router(init_job_router(redis, config, token_dep))

# ── monitor router ──────────────────────────────────────────
from monitor_routes import router as monitor_router
app.include_router(monitor_router)

logger = logging.getLogger(__name__)
# ──────────────────────── Endpoints ──────────────────────────
@app.post("/token")
async def get_token(req: TokenRequest):
    if not verify_email_domain(req.email):
        raise HTTPException(400, "Invalid email domain")
    token = create_access_token({"sub": req.email})
    return {"email": req.email, "access_token": token, "token_type": "bearer"}


@app.post("/config/dump")
async def config_dump(raw: RawCode):
    try:
        return JSONResponse(_safe_eval_config(raw.code.strip()))
    except Exception as e:
        raise HTTPException(400, str(e))
@app.post("/md")
@limiter.limit(config["rate_limiting"]["default_limit"])
@mcp_tool("md")
async def get_markdown(
    request: Request,
    body: MarkdownRequest,
    _td: Dict = Depends(token_dep),
):
    if not body.url.startswith(("http://", "https://", "raw:", "raw://")):
        raise HTTPException(
            400, "Invalid URL format. Must start with http://, https://, or raw:/raw:// for raw HTML")
    markdown = await handle_markdown_request(
        body.url, body.f, body.q, body.c, config, body.provider,
        body.temperature, body.base_url
    )
    return JSONResponse({
        "url": body.url,
        "filter": body.f,
        "query": body.q,
        "cache": body.c,
        "markdown": markdown,
        "success": True
    })
@app.post("/html")
@limiter.limit(config["rate_limiting"]["default_limit"])
@mcp_tool("html")
async def generate_html(
    request: Request,
    body: HTMLRequest,
    _td: Dict = Depends(token_dep),
):
    """
    Crawls the URL, preprocesses the raw HTML for schema extraction, and returns the processed HTML.
    Use when you need sanitized HTML structures for building schemas or further processing.
    """
    validate_url_scheme(body.url, allow_raw=True)
    from crawler_pool import get_crawler
    cfg = CrawlerRunConfig()
    try:
        crawler = await get_crawler(get_default_browser_config())
        results = await crawler.arun(url=body.url, config=cfg)
        if not results[0].success:
            raise HTTPException(500, detail=results[0].error_message or "Crawl failed")

        raw_html = results[0].html
        from crawl4ai.utils import preprocess_html_for_schema
        processed_html = preprocess_html_for_schema(raw_html)
        return JSONResponse({"html": processed_html, "url": body.url, "success": True})
    except Exception as e:
        raise HTTPException(500, detail=str(e))


# Screenshot endpoint
@app.post("/screenshot")
@limiter.limit(config["rate_limiting"]["default_limit"])
@mcp_tool("screenshot")
async def generate_screenshot(
    request: Request,
    body: ScreenshotRequest,
    _td: Dict = Depends(token_dep),
):
    """
    Capture a full-page PNG screenshot of the specified URL, waiting an optional delay before capture.
    Use when you need an image snapshot of the rendered page. It is recommended to provide an
    output path to save the screenshot; the result will then contain the path to the saved file
    instead of the screenshot data.
    """
    validate_url_scheme(body.url)
    from crawler_pool import get_crawler
    try:
        cfg = CrawlerRunConfig(screenshot=True, screenshot_wait_for=body.screenshot_wait_for)
        crawler = await get_crawler(get_default_browser_config())
        results = await crawler.arun(url=body.url, config=cfg)
        if not results[0].success:
            raise HTTPException(500, detail=results[0].error_message or "Crawl failed")
        screenshot_data = results[0].screenshot
        if body.output_path:
            abs_path = os.path.abspath(body.output_path)
            os.makedirs(os.path.dirname(abs_path), exist_ok=True)
            with open(abs_path, "wb") as f:
                f.write(base64.b64decode(screenshot_data))
            return {"success": True, "path": abs_path}
        return {"success": True, "screenshot": screenshot_data}
    except Exception as e:
        raise HTTPException(500, detail=str(e))


# PDF endpoint
@app.post("/pdf")
@limiter.limit(config["rate_limiting"]["default_limit"])
@mcp_tool("pdf")
async def generate_pdf(
    request: Request,
    body: PDFRequest,
    _td: Dict = Depends(token_dep),
):
    """
    Generate a PDF document of the specified URL.
    Use when you need a printable or archivable snapshot of the page. It is recommended to provide
    an output path to save the PDF; the result will then contain the path to the saved file
    instead of the PDF data.
    """
    validate_url_scheme(body.url)
    from crawler_pool import get_crawler
    try:
        cfg = CrawlerRunConfig(pdf=True)
        crawler = await get_crawler(get_default_browser_config())
        results = await crawler.arun(url=body.url, config=cfg)
        if not results[0].success:
            raise HTTPException(500, detail=results[0].error_message or "Crawl failed")
        pdf_data = results[0].pdf
        if body.output_path:
            abs_path = os.path.abspath(body.output_path)
            os.makedirs(os.path.dirname(abs_path), exist_ok=True)
            with open(abs_path, "wb") as f:
                f.write(pdf_data)
            return {"success": True, "path": abs_path}
        return {"success": True, "pdf": base64.b64encode(pdf_data).decode()}
    except Exception as e:
        raise HTTPException(500, detail=str(e))
@app.post("/execute_js")
@limiter.limit(config["rate_limiting"]["default_limit"])
@mcp_tool("execute_js")
async def execute_js(
    request: Request,
    body: JSEndpointRequest,
    _td: Dict = Depends(token_dep),
):
    """
    Execute a sequence of JavaScript snippets on the specified URL.
    Returns the full CrawlResult JSON (first result).
    Use this when you need to interact with dynamic pages using JS.
    REMEMBER: `scripts` accepts a list of separate JS snippets and executes them in order.
    IMPORTANT: Each script should be an expression that returns a value. It can be an IIFE
    or an async function. Your script will replace '{script}' and execute in the browser
    context, so provide either an IIFE or a sync/async function that returns a value.

    Return format:
    - The result is an instance of CrawlResult, so you have access to markdown, links,
      and more. If that is enough, you don't need to call the other endpoints.

    ```python
    class CrawlResult(BaseModel):
        url: str
        html: str
        success: bool
        cleaned_html: Optional[str] = None
        media: Dict[str, List[Dict]] = {}
        links: Dict[str, List[Dict]] = {}
        downloaded_files: Optional[List[str]] = None
        js_execution_result: Optional[Dict[str, Any]] = None
        screenshot: Optional[str] = None
        pdf: Optional[bytes] = None
        mhtml: Optional[str] = None
        _markdown: Optional[MarkdownGenerationResult] = PrivateAttr(default=None)
        extracted_content: Optional[str] = None
        metadata: Optional[dict] = None
        error_message: Optional[str] = None
        session_id: Optional[str] = None
        response_headers: Optional[dict] = None
        status_code: Optional[int] = None
        ssl_certificate: Optional[SSLCertificate] = None
        dispatch_result: Optional[DispatchResult] = None
        redirected_url: Optional[str] = None
        network_requests: Optional[List[Dict[str, Any]]] = None
        console_messages: Optional[List[Dict[str, Any]]] = None

    class MarkdownGenerationResult(BaseModel):
        raw_markdown: str
        markdown_with_citations: str
        references_markdown: str
        fit_markdown: Optional[str] = None
        fit_html: Optional[str] = None
    ```
    """
    validate_url_scheme(body.url)
    from crawler_pool import get_crawler
    try:
        cfg = CrawlerRunConfig(js_code=body.scripts)
        crawler = await get_crawler(get_default_browser_config())
        results = await crawler.arun(url=body.url, config=cfg)
        if not results[0].success:
            raise HTTPException(500, detail=results[0].error_message or "Crawl failed")
        data = results[0].model_dump()
        return JSONResponse(data)
    except Exception as e:
        raise HTTPException(500, detail=str(e))
@app.get("/llm/{url:path}")
async def llm_endpoint(
    request: Request,
    url: str = Path(...),
    q: str = Query(...),
    _td: Dict = Depends(token_dep),
):
    if not q:
        raise HTTPException(400, "Query parameter 'q' is required")
    if not url.startswith(("http://", "https://", "raw:", "raw://")):
        url = "https://" + url
    answer = await handle_llm_qa(url, q, config)
    return JSONResponse({"answer": answer})


@app.get("/schema")
async def get_schema():
    from crawl4ai import BrowserConfig, CrawlerRunConfig
    return {"browser": BrowserConfig().dump(),
            "crawler": CrawlerRunConfig().dump()}
@app.get("/hooks/info")
async def get_hooks_info():
    """Get information about available hook points and their signatures."""
    from hook_manager import UserHookManager

    hook_info = {}
    for hook_point, params in UserHookManager.HOOK_SIGNATURES.items():
        hook_info[hook_point] = {
            "parameters": params,
            "description": get_hook_description(hook_point),
            "example": get_hook_example(hook_point),
        }

    return JSONResponse({
        "available_hooks": hook_info,
        "timeout_limits": {
            "min": 1,
            "max": 120,
            "default": 30,
        },
    })


def get_hook_description(hook_point: str) -> str:
    """Get description for each hook point."""
    descriptions = {
        "on_browser_created": "Called after browser instance is created",
        "on_page_context_created": "Called after page and context are created - ideal for authentication",
        "before_goto": "Called before navigating to the target URL",
        "after_goto": "Called after navigation is complete",
        "on_user_agent_updated": "Called when user agent is updated",
        "on_execution_started": "Called when custom JavaScript execution begins",
        "before_retrieve_html": "Called before retrieving the final HTML - ideal for scrolling",
        "before_return_html": "Called just before returning the HTML content",
    }
    return descriptions.get(hook_point, "")


def get_hook_example(hook_point: str) -> str:
    """Get example code for each hook point."""
    examples = {
        "on_page_context_created": """async def hook(page, context, **kwargs):
    # Add authentication cookie
    await context.add_cookies([{
        'name': 'session',
        'value': 'my-session-id',
        'domain': '.example.com'
    }])
    return page""",

        "before_retrieve_html": """async def hook(page, context, **kwargs):
    # Scroll to load lazy content
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    await page.wait_for_timeout(2000)
    return page""",

        "before_goto": """async def hook(page, context, url, **kwargs):
    # Set custom headers
    await page.set_extra_http_headers({
        'X-Custom-Header': 'value'
    })
    return page""",
    }
    return examples.get(hook_point, "# Implement your hook logic here\nreturn page")


@app.get(config["observability"]["health_check"]["endpoint"])
async def health():
    return {"status": "ok", "timestamp": time.time(), "version": __version__}


@app.get(config["observability"]["prometheus"]["endpoint"])
async def metrics():
    return RedirectResponse(config["observability"]["prometheus"]["endpoint"])


@app.post("/crawl")
@limiter.limit(config["rate_limiting"]["default_limit"])
@mcp_tool("crawl")
async def crawl(
    request: Request,
    crawl_request: CrawlRequestWithHooks,
    _td: Dict = Depends(token_dep),
):
    """
    Crawl a list of URLs and return the results as JSON.
    For streaming responses, use the /crawl/stream endpoint.
    Supports optional user-provided hook functions for customization.
    """
    if not crawl_request.urls:
        raise HTTPException(400, "At least one URL required")
    if crawl_request.hooks and not HOOKS_ENABLED:
        raise HTTPException(403, "Hooks are disabled. Set CRAWL4AI_HOOKS_ENABLED=true to enable.")

    # If the config asks for streaming, hand off to the streaming handler
    crawler_config = CrawlerRunConfig.load(crawl_request.crawler_config)
    if crawler_config.stream:
        return await stream_process(crawl_request=crawl_request)

    # Prepare hooks config if provided
    hooks_config = None
    if crawl_request.hooks:
        hooks_config = {
            'code': crawl_request.hooks.code,
            'timeout': crawl_request.hooks.timeout,
        }

    results = await handle_crawl_request(
        urls=crawl_request.urls,
        browser_config=crawl_request.browser_config,
        crawler_config=crawl_request.crawler_config,
        config=config,
        hooks_config=hooks_config,
    )
    # Fail with 500 if none of the results succeeded
    if all(not result["success"] for result in results["results"]):
        raise HTTPException(500, f"Crawl request failed: {results['results'][0]['error_message']}")
    return JSONResponse(results)


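The hooks branch above reads `crawl_request.hooks.code` and `crawl_request.hooks.timeout`. A hedged sketch of what a `/crawl` request body carrying hooks might look like; beyond those two fields, the exact `CrawlRequestWithHooks` schema and the hook-point keys are assumptions:

```python
# Hypothetical /crawl payload with user hooks. Only 'code' and 'timeout'
# are confirmed by the handler above; the nesting of per-hook-point code
# strings under 'code' is an assumption for illustration.
payload = {
    "urls": ["https://example.com"],
    "hooks": {
        "code": {
            "before_goto": (
                "async def hook(page, context, url, **kwargs):\n"
                "    return page"
            ),
        },
        "timeout": 30,  # seconds; server clamps to its advertised limits
    },
}
```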
@app.post("/crawl/stream")
@limiter.limit(config["rate_limiting"]["default_limit"])
async def crawl_stream(
    request: Request,
    crawl_request: CrawlRequestWithHooks,
    _td: Dict = Depends(token_dep),
):
    if not crawl_request.urls:
        raise HTTPException(400, "At least one URL required")
    if crawl_request.hooks and not HOOKS_ENABLED:
        raise HTTPException(403, "Hooks are disabled. Set CRAWL4AI_HOOKS_ENABLED=true to enable.")

    return await stream_process(crawl_request=crawl_request)


async def stream_process(crawl_request: CrawlRequestWithHooks):
    # Prepare hooks config if provided
    hooks_config = None
    if crawl_request.hooks:
        hooks_config = {
            'code': crawl_request.hooks.code,
            'timeout': crawl_request.hooks.timeout,
        }

    crawler, gen, hooks_info = await handle_stream_crawl_request(
        urls=crawl_request.urls,
        browser_config=crawl_request.browser_config,
        crawler_config=crawl_request.crawler_config,
        config=config,
        hooks_config=hooks_config,
    )

    # Add hooks info to response headers if available
    headers = {
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "X-Stream-Status": "active",
    }
    if hooks_info:
        import json
        headers["X-Hooks-Status"] = json.dumps(hooks_info['status']['status'])

    return StreamingResponse(
        stream_results(crawler, gen),
        media_type="application/x-ndjson",
        headers=headers,
    )


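`stream_process` responds with `application/x-ndjson`: one JSON object per line. A minimal client-side sketch of parsing such a stream; the payload lines below are illustrative, not the exact result schema:

```python
import json

# Two newline-delimited JSON records, as a streaming client would buffer them.
# The field names are made-up examples, not the server's actual schema.
ndjson_body = (
    '{"url": "https://example.com", "success": true}\n'
    '{"status": "completed"}\n'
)

# Parse each non-empty line independently - the defining property of NDJSON.
records = [json.loads(line) for line in ndjson_body.splitlines() if line.strip()]
```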
def chunk_code_functions(code_md: str) -> List[str]:
    """Extract each function/class from markdown code blocks per file."""
    pattern = re.compile(
        # match "## File: <path>" then a ```py fence, then capture until the closing ```
        r'##\s*File:\s*(?P<path>.+?)\s*?\r?\n'  # file header
        r'```py\s*?\r?\n'                       # opening fence
        r'(?P<code>.*?)(?=\r?\n```)',           # code block
        re.DOTALL
    )
    chunks: List[str] = []
    for m in pattern.finditer(code_md):
        file_path = m.group("path").strip()
        code_blk = m.group("code")
        tree = ast.parse(code_blk)
        lines = code_blk.splitlines()
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                start = node.lineno - 1
                end = getattr(node, "end_lineno", start + 1)
                snippet = "\n".join(lines[start:end])
                chunks.append(f"# File: {file_path}\n{snippet}")
    return chunks


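A small self-contained sketch of what `chunk_code_functions` does, reusing its regex and AST walk on a one-file markdown dump; the input string is made up for illustration:

```python
import ast
import re

# A minimal "## File: ..." markdown dump with one fenced code block.
md = (
    "## File: demo.py\n"
    "```py\n"
    "def hello():\n"
    "    return 1\n"
    "```\n"
)

# Same pattern as chunk_code_functions: header, opening fence, lazy capture
# up to (but not including) the closing fence.
pattern = re.compile(
    r'##\s*File:\s*(?P<path>.+?)\s*?\r?\n'
    r'```py\s*?\r?\n'
    r'(?P<code>.*?)(?=\r?\n```)',
    re.DOTALL,
)

m = pattern.search(md)
tree = ast.parse(m.group("code"))
# Collect top-level function/class names, mirroring the isinstance filter above.
names = [
    n.name for n in tree.body
    if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
]
```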
def chunk_doc_sections(doc: str) -> List[str]:
    lines = doc.splitlines(keepends=True)
    sections = []
    current: List[str] = []
    for line in lines:
        if re.match(r"^#{1,6}\s", line):
            if current:
                sections.append("".join(current))
            current = [line]
        else:
            current.append(line)
    if current:
        sections.append("".join(current))
    return sections


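`chunk_doc_sections` splits a markdown document at every ATX heading (`#` through `######`), keeping each heading with the body that follows it. A self-contained re-implementation of the same loop on a tiny document:

```python
import re

def split_sections(doc: str) -> list:
    """Split markdown into heading-delimited sections (mirrors chunk_doc_sections)."""
    sections, current = [], []
    for line in doc.splitlines(keepends=True):
        if re.match(r"^#{1,6}\s", line):
            # A new heading flushes the accumulated section, if any.
            if current:
                sections.append("".join(current))
            current = [line]
        else:
            current.append(line)
    if current:
        sections.append("".join(current))
    return sections

sections = split_sections("# A\nalpha\n## B\nbeta\n")
```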
@app.get("/ask")
@limiter.limit(config["rate_limiting"]["default_limit"])
@mcp_tool("ask")
async def get_context(
    request: Request,
    _td: Dict = Depends(token_dep),
    context_type: str = Query("all", regex="^(code|doc|all)$"),
    query: Optional[str] = Query(
        None, description="search query to filter chunks"),
    score_ratio: float = Query(
        0.5, ge=0.0, le=1.0, description="min score as fraction of max_score"),
    max_results: int = Query(
        20, ge=1, description="absolute cap on returned chunks"),
):
    """
    This endpoint is designed to answer any question about the Crawl4ai library.
    It returns plain-text markdown with extensive information about Crawl4ai,
    which can serve as context for any AI assistant during decision making or
    code generation tasks. It is always best practice to provide a query to
    filter the context; otherwise the response will be very long.

    Parameters:
    - context_type: Specify "code" for code context, "doc" for documentation context, or "all" for both.
    - query: RECOMMENDED search query used to filter chunks via BM25. Leave empty to get the full context.
    - score_ratio: Minimum score as a fraction of the maximum score for filtering results.
    - max_results: Maximum number of results to return. Default is 20.

    Returns:
    - JSON response with the requested context.
    - If "code" is specified, returns the code context.
    - If "doc" is specified, returns the documentation context.
    - If "all" is specified, returns both code and documentation contexts.
    """
    # load contexts
    base = os.path.dirname(__file__)
    code_path = os.path.join(base, "c4ai-code-context.md")
    doc_path = os.path.join(base, "c4ai-doc-context.md")
    if not os.path.exists(code_path) or not os.path.exists(doc_path):
        raise HTTPException(404, "Context files not found")

    with open(code_path, "r") as f:
        code_content = f.read()
    with open(doc_path, "r") as f:
        doc_content = f.read()

    # if no query, just return the raw contexts
    if not query:
        if context_type == "code":
            return JSONResponse({"code_context": code_content})
        if context_type == "doc":
            return JSONResponse({"doc_context": doc_content})
        return JSONResponse({
            "code_context": code_content,
            "doc_context": doc_content,
        })

    tokens = query.split()
    results: Dict[str, List[Dict[str, float]]] = {}

    # code BM25 over functions/classes
    if context_type in ("code", "all"):
        code_chunks = chunk_code_functions(code_content)
        bm25 = BM25Okapi([c.split() for c in code_chunks])
        scores = bm25.get_scores(tokens)
        max_sc = float(scores.max()) if scores.size > 0 else 0.0
        cutoff = max_sc * score_ratio
        picked = [(c, s) for c, s in zip(code_chunks, scores) if s >= cutoff]
        picked = sorted(picked, key=lambda x: x[1], reverse=True)[:max_results]
        results["code_results"] = [{"text": c, "score": s} for c, s in picked]

    # doc BM25 over markdown sections
    if context_type in ("doc", "all"):
        sections = chunk_doc_sections(doc_content)
        bm25d = BM25Okapi([sec.split() for sec in sections])
        scores_d = bm25d.get_scores(tokens)
        max_sd = float(scores_d.max()) if scores_d.size > 0 else 0.0
        cutoff_d = max_sd * score_ratio
        idxs = [i for i, s in enumerate(scores_d) if s >= cutoff_d]
        neighbors = set(i for idx in idxs for i in (idx - 1, idx, idx + 1))
        valid = [i for i in sorted(neighbors) if 0 <= i < len(sections)]
        valid = valid[:max_results]
        results["doc_results"] = [
            {"text": sections[i], "score": scores_d[i]} for i in valid
        ]

    return JSONResponse(results)


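The doc branch above expands each BM25 hit to include its neighboring sections (`idx - 1`, `idx`, `idx + 1`) so that returned chunks keep their surrounding context. A sketch of that cutoff-and-neighbor logic in isolation, with made-up scores standing in for BM25 output:

```python
# Five sections with illustrative relevance scores (not real BM25 values).
scores = [0.1, 0.9, 0.2, 0.0, 0.8]
score_ratio = 0.5

# Keep sections scoring at least score_ratio of the best score.
cutoff = max(scores) * score_ratio  # 0.45
idxs = [i for i, s in enumerate(scores) if s >= cutoff]

# Pull in each hit's neighbors, then clamp to valid indices.
neighbors = {i for idx in idxs for i in (idx - 1, idx, idx + 1)}
valid = [i for i in sorted(neighbors) if 0 <= i < len(scores)]
```

With two hits at indices 1 and 4, the neighbor expansion ends up covering every section, which is exactly the "context bleeding" the endpoint relies on for short documents.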
# attach MCP layer (adds /mcp/ws, /mcp/sse, /mcp/schema)
print(f"MCP server running on {config['app']['host']}:{config['app']['port']}")
attach_mcp(
    app,
    base_url=f"http://{config['app']['host']}:{config['app']['port']}"
)

# ────────────────────────── cli ──────────────────────────────
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(
        "server:app",
        host=config["app"]["host"],
        port=config["app"]["port"],
        reload=config["app"]["reload"],
        timeout_keep_alive=config["app"]["timeout_keep_alive"],
    )
# ─────────────────────────────────────────────────────────────