crawl4ai/docs/md_v2/assets/llm.txt/diagrams/docker.txt
UncleCode 40640badad feat: add Script Builder to Chrome Extension and reorganize LLM context files
2025-06-08 22:02:12 +08:00

## Docker Deployment Architecture and Workflows
Visual representations of Crawl4AI Docker deployment, API architecture, configuration management, and service interactions.
### Docker Deployment Decision Flow
```mermaid
flowchart TD
A[Start Docker Deployment] --> B{Deployment Type?}
B -->|Quick Start| C[Pre-built Image]
B -->|Development| D[Docker Compose]
B -->|Custom Build| E[Manual Build]
B -->|Production| F[Production Setup]
C --> C1[docker pull unclecode/crawl4ai]
C1 --> C2{Need LLM Support?}
C2 -->|Yes| C3[Setup .llm.env]
C2 -->|No| C4[Basic run]
C3 --> C5[docker run with --env-file]
C4 --> C6[docker run basic]
D --> D1[git clone repository]
D1 --> D2[cp .llm.env.example .llm.env]
D2 --> D3{Build Type?}
D3 -->|Pre-built| D4[IMAGE=latest docker compose up]
D3 -->|Local Build| D5[docker compose up --build]
D3 -->|All Features| D6[INSTALL_TYPE=all docker compose up]
E --> E1[docker buildx build]
E1 --> E2{Architecture?}
E2 -->|Single| E3[--platform linux/amd64]
E2 -->|Multi| E4[--platform linux/amd64,linux/arm64]
E3 --> E5[Build complete]
E4 --> E5
F --> F1[Production configuration]
F1 --> F2[Custom config.yml]
F2 --> F3[Resource limits]
F3 --> F4[Health monitoring]
F4 --> F5[Production ready]
C5 --> G[Service running on :11235]
C6 --> G
D4 --> G
D5 --> G
D6 --> G
E5 --> H[docker run custom image]
H --> G
F5 --> I[Production deployment]
G --> J[Access playground at /playground]
G --> K[Health check at /health]
I --> L[Production monitoring]
style A fill:#e1f5fe
style G fill:#c8e6c9
style I fill:#c8e6c9
style J fill:#fff3e0
style K fill:#fff3e0
style L fill:#e8f5e8
```
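The Quick Start branch above can be sketched as a small command builder. This is a minimal sketch, not the project's own tooling: the function name `build_run_command` and the container name are hypothetical, while the image name, port `11235`, the `.llm.env` file, and the shared-memory requirement come from the diagrams in this file.

```python
from typing import List, Optional


def build_run_command(
    image: str = "unclecode/crawl4ai",
    port: int = 11235,
    shm_size: str = "1g",
    env_file: Optional[str] = None,
    name: str = "crawl4ai",  # hypothetical container name
) -> List[str]:
    """Return the argv for `docker run`, following the Quick Start branch."""
    cmd = [
        "docker", "run", "-d",
        "-p", f"{port}:{port}",
        "--name", name,
        f"--shm-size={shm_size}",
    ]
    # "Need LLM Support?" -> Yes: pass API keys through an env file.
    if env_file:
        cmd += ["--env-file", env_file]
    cmd.append(image)
    return cmd


# Print the command instead of running it, so the sketch has no side effects.
print(" ".join(build_run_command(env_file=".llm.env")))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would start the container; returning a list rather than a string avoids shell-quoting problems.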
### Docker Container Architecture
```mermaid
graph TB
subgraph "Host Environment"
A[Docker Engine] --> B[Crawl4AI Container]
C[.llm.env] --> B
D[Custom config.yml] --> B
E[Port 11235] --> B
F[Shared Memory 1GB+] --> B
end
subgraph "Container Services"
B --> G[FastAPI Server :8020]
B --> H[Gunicorn + Uvicorn Workers]
B --> I[Supervisord Process Manager]
B --> J[Redis Cache :6379]
G --> K[REST API Endpoints]
G --> L[WebSocket Connections]
G --> M[MCP Protocol]
H --> N[Worker Processes]
I --> O[Service Monitoring]
J --> P[Request Caching]
end
subgraph "Browser Management"
B --> Q[Playwright Framework]
Q --> R[Chromium Browser]
Q --> S[Firefox Browser]
Q --> T[WebKit Browser]
R --> U[Browser Pool]
S --> U
T --> U
U --> V[Page Sessions]
U --> W[Context Management]
end
subgraph "External Services"
X[OpenAI API] -.-> K
Y[Anthropic Claude] -.-> K
Z[Local Ollama] -.-> K
AA[Groq API] -.-> K
BB[Google Gemini] -.-> K
end
subgraph "Client Interactions"
CC[Python SDK] --> K
DD[REST API Calls] --> K
EE[MCP Clients] --> M
FF[Web Browser] --> G
GG[Monitoring Tools] --> K
end
style B fill:#e3f2fd
style G fill:#f3e5f5
style Q fill:#e8f5e8
style K fill:#fff3e0
```
### API Endpoints Architecture
```mermaid
graph LR
subgraph "Core Endpoints"
A["/crawl"] --> A1[Single URL crawl]
A2["/crawl/stream"] --> A3[Streaming multi-URL]
A4["/crawl/job"] --> A5[Async job submission]
A6["/crawl/job/{id}"] --> A7[Job status check]
end
subgraph "Specialized Endpoints"
B["/html"] --> B1[Preprocessed HTML]
B2["/screenshot"] --> B3[PNG capture]
B4["/pdf"] --> B5[PDF generation]
B6["/execute_js"] --> B7[JavaScript execution]
B8["/md"] --> B9[Markdown extraction]
end
subgraph "Utility Endpoints"
C["/health"] --> C1[Service status]
C2["/metrics"] --> C3[Prometheus metrics]
C4["/schema"] --> C5[API documentation]
C6["/playground"] --> C7[Interactive testing]
end
subgraph "LLM Integration"
D["/llm/{url}"] --> D1[Q&A over URL]
D2["/ask"] --> D3[Library context search]
D4["/config/dump"] --> D5[Config validation]
end
subgraph "MCP Protocol"
E["/mcp/sse"] --> E1[Server-Sent Events]
E2["/mcp/ws"] --> E3[WebSocket connection]
E4["/mcp/schema"] --> E5[MCP tool definitions]
end
style A fill:#e3f2fd
style B fill:#f3e5f5
style C fill:#e8f5e8
style D fill:#fff3e0
style E fill:#fce4ec
```
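As an illustration of how a client might address the `/crawl` endpoint, the sketch below builds (but does not send) an HTTP request using only the standard library. The host `localhost`, the helper name `build_crawl_request`, and the `urls` field name are assumptions; `browser_config` and `crawler_config` are the configuration names used in the sequence diagrams of this file.

```python
import json
from typing import Any, Dict, Iterable, Optional
from urllib.request import Request

BASE_URL = "http://localhost:11235"  # port from the diagrams; host is an assumption


def build_crawl_request(
    urls: Iterable[str],
    browser_config: Optional[Dict[str, Any]] = None,
    crawler_config: Optional[Dict[str, Any]] = None,
    base_url: str = BASE_URL,
) -> Request:
    """Build a POST request for /crawl without sending it."""
    payload = {
        "urls": list(urls),  # assumed field name
        "browser_config": browser_config or {},
        "crawler_config": crawler_config or {},
    }
    return Request(
        url=f"{base_url}/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_crawl_request(["https://example.com"])
print(req.get_method(), req.full_url)
```

Sending the request would be a matter of `urllib.request.urlopen(req)` against a running container; the exact payload schema should be checked against the server's `/schema` or `/playground` output.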
### Request Processing Flow
```mermaid
sequenceDiagram
participant Client
participant FastAPI
participant RequestValidator
participant BrowserPool
participant Playwright
participant ExtractionEngine
participant LLMProvider
Client->>FastAPI: POST /crawl with config
FastAPI->>RequestValidator: Validate JSON structure
alt Valid Request
RequestValidator-->>FastAPI: ✓ Validated
FastAPI->>BrowserPool: Request browser instance
BrowserPool->>Playwright: Launch browser/reuse session
Playwright-->>BrowserPool: Browser ready
BrowserPool-->>FastAPI: Browser allocated
FastAPI->>Playwright: Navigate to URL
Playwright->>Playwright: Execute JS, wait conditions
Playwright-->>FastAPI: Page content ready
FastAPI->>ExtractionEngine: Process content
alt LLM Extraction
ExtractionEngine->>LLMProvider: Send content + schema
LLMProvider-->>ExtractionEngine: Structured data
else CSS Extraction
ExtractionEngine->>ExtractionEngine: Apply CSS selectors
end
ExtractionEngine-->>FastAPI: Extraction complete
FastAPI->>BrowserPool: Release browser
FastAPI-->>Client: CrawlResult response
else Invalid Request
RequestValidator-->>FastAPI: ✗ Validation error
FastAPI-->>Client: 400 Bad Request
end
```
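The allocate/release steps in the sequence above suggest a lease pattern in which the browser is returned to the pool even when the crawl fails. The following is a minimal sketch under that assumption: `BrowserPool`, `lease`, and `handle_crawl` are hypothetical names, and plain strings stand in for real Playwright browser handles.

```python
import asyncio
from contextlib import asynccontextmanager


class BrowserPool:
    """Hypothetical pool; strings stand in for real browser handles."""

    def __init__(self, size: int) -> None:
        self._free = asyncio.Queue()
        for i in range(size):
            self._free.put_nowait(f"browser-{i}")

    @property
    def available(self) -> int:
        return self._free.qsize()

    @asynccontextmanager
    async def lease(self):
        # "Request browser instance": wait until a browser is free.
        browser = await self._free.get()
        try:
            yield browser
        finally:
            # "Release browser": runs on success and on error alike.
            self._free.put_nowait(browser)


async def handle_crawl(pool: BrowserPool, url: str) -> str:
    async with pool.lease() as browser:
        await asyncio.sleep(0)  # placeholder for navigate / extract
        return f"{browser} crawled {url}"


async def main() -> None:
    pool = BrowserPool(size=2)
    print(await handle_crawl(pool, "https://example.com"))


asyncio.run(main())
```

Putting the release in a `finally` clause is the design point: a failed navigation or extraction cannot leak a browser from the pool.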
### Configuration Management Flow
```mermaid
stateDiagram-v2
[*] --> ConfigLoading
ConfigLoading --> DefaultConfig: Load default config.yml
ConfigLoading --> CustomConfig: Custom config mounted
ConfigLoading --> EnvOverrides: Environment variables
DefaultConfig --> ConfigMerging
CustomConfig --> ConfigMerging
EnvOverrides --> ConfigMerging
ConfigMerging --> ConfigValidation
ConfigValidation --> Valid: Schema validation passes
ConfigValidation --> Invalid: Validation errors
Invalid --> ConfigError: Log errors and exit
ConfigError --> [*]
Valid --> ServiceInitialization
ServiceInitialization --> FastAPISetup
ServiceInitialization --> BrowserPoolInit
ServiceInitialization --> CacheSetup
FastAPISetup --> Running
BrowserPoolInit --> Running
CacheSetup --> Running
Running --> ConfigReload: Config change detected
ConfigReload --> ConfigValidation
Running --> [*]: Service shutdown
note right of ConfigMerging : Priority: ENV > Custom > Default
note right of ServiceInitialization : All services must initialize successfully
```
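The priority rule in the note above (ENV > Custom > Default) can be illustrated with a recursive dictionary merge. This is a sketch of the general technique, not the server's actual loader; the function names and the example keys (`app`, `port`, `log_level`) are hypothetical.

```python
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where values in `override` win over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(
    default: Dict[str, Any], custom: Dict[str, Any], env: Dict[str, Any]
) -> Dict[str, Any]:
    """Apply the layers lowest priority first: Default, then Custom, then ENV."""
    return deep_merge(deep_merge(default, custom), env)


# Hypothetical keys, for illustration only.
default = {"app": {"port": 11235, "log_level": "info"}}
custom = {"app": {"log_level": "debug"}}
env = {"app": {"port": 8080}}
print(resolve_config(default, custom, env))
```

Each layer only needs to list the keys it changes; untouched keys fall through from the lower-priority layer.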
### Multi-Architecture Build Process
```mermaid
flowchart TD
A[Developer Push] --> B[GitHub Repository]
B --> C[Docker Buildx]
C --> D{Build Strategy}
D -->|Multi-arch| E[Parallel Builds]
D -->|Single-arch| F[Platform-specific Build]
E --> G[AMD64 Build]
E --> H[ARM64 Build]
F --> I[Target Platform Build]
subgraph "AMD64 Build Process"
G --> G1[Ubuntu base image]
G1 --> G2[Python 3.11 install]
G2 --> G3[System dependencies]
G3 --> G4[Crawl4AI installation]
G4 --> G5[Playwright setup]
G5 --> G6[FastAPI configuration]
G6 --> G7[AMD64 image ready]
end
subgraph "ARM64 Build Process"
H --> H1[Ubuntu ARM64 base]
H1 --> H2[Python 3.11 install]
H2 --> H3[ARM-specific deps]
H3 --> H4[Crawl4AI installation]
H4 --> H5[Playwright setup]
H5 --> H6[FastAPI configuration]
H6 --> H7[ARM64 image ready]
end
subgraph "Single Architecture"
I --> I1[Base image selection]
I1 --> I2[Platform dependencies]
I2 --> I3[Application setup]
I3 --> I4[Platform image ready]
end
G7 --> J[Multi-arch Manifest]
H7 --> J
I4 --> K[Platform Image]
J --> L[Docker Hub Registry]
K --> L
L --> M[docker pull Auto-selects Architecture]
style A fill:#e1f5fe
style J fill:#c8e6c9
style K fill:#c8e6c9
style L fill:#f3e5f5
style M fill:#e8f5e8
```
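The single-arch versus multi-arch choice above comes down to the value passed to `--platform`. The helper below is a sketch that assembles the `docker buildx build` argv; the function name and the tag `crawl4ai-local:latest` are hypothetical, and the platform strings are the ones shown in the deployment decision flow.

```python
from typing import List, Sequence


def build_buildx_command(
    tag: str,
    platforms: Sequence[str],
    context: str = ".",
    push: bool = False,
) -> List[str]:
    """Return the argv for a buildx build targeting one or more platforms."""
    cmd = [
        "docker", "buildx", "build",
        "--platform", ",".join(platforms),
        "-t", tag,
    ]
    if push:
        # Publishing is how a multi-arch manifest reaches the registry.
        cmd.append("--push")
    cmd.append(context)
    return cmd


multi = build_buildx_command(
    "crawl4ai-local:latest",  # hypothetical tag
    ["linux/amd64", "linux/arm64"],
)
print(" ".join(multi))
```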
### MCP Integration Architecture
```mermaid
graph TB
subgraph "MCP Client Applications"
A[Claude Code] --> B[MCP Protocol]
C[Cursor IDE] --> B
D[Windsurf] --> B
E[Custom MCP Client] --> B
end
subgraph "Crawl4AI MCP Server"
B --> F[MCP Endpoint Router]
F --> G[SSE Transport /mcp/sse]
F --> H[WebSocket Transport /mcp/ws]
F --> I[Schema Endpoint /mcp/schema]
G --> J[MCP Tool Handler]
H --> J
J --> K[Tool: md]
J --> L[Tool: html]
J --> M[Tool: screenshot]
J --> N[Tool: pdf]
J --> O[Tool: execute_js]
J --> P[Tool: crawl]
J --> Q[Tool: ask]
end
subgraph "Crawl4AI Core Services"
K --> R[Markdown Generator]
L --> S[HTML Preprocessor]
M --> T[Screenshot Service]
N --> U[PDF Generator]
O --> V[JavaScript Executor]
P --> W[Batch Crawler]
Q --> X[Context Search]
R --> Y[Browser Pool]
S --> Y
T --> Y
U --> Y
V --> Y
W --> Y
X --> Z[Knowledge Base]
end
subgraph "External Resources"
Y --> AA[Playwright Browsers]
Z --> BB[Library Documentation]
Z --> CC[Code Examples]
AA --> DD[Web Pages]
end
style B fill:#e3f2fd
style J fill:#f3e5f5
style Y fill:#e8f5e8
style Z fill:#fff3e0
```
### API Request/Response Flow Patterns
```mermaid
sequenceDiagram
participant Client
participant LoadBalancer
participant FastAPI
participant ConfigValidator
participant BrowserManager
participant CrawlEngine
participant ResponseBuilder
Note over Client,ResponseBuilder: Basic Crawl Request
Client->>LoadBalancer: POST /crawl
LoadBalancer->>FastAPI: Route request
FastAPI->>ConfigValidator: Validate browser_config
ConfigValidator-->>FastAPI: ✓ Valid BrowserConfig
FastAPI->>ConfigValidator: Validate crawler_config
ConfigValidator-->>FastAPI: ✓ Valid CrawlerRunConfig
FastAPI->>BrowserManager: Allocate browser
BrowserManager-->>FastAPI: Browser instance
FastAPI->>CrawlEngine: Execute crawl
Note over CrawlEngine: Page processing
CrawlEngine->>CrawlEngine: Navigate & wait
CrawlEngine->>CrawlEngine: Extract content
CrawlEngine->>CrawlEngine: Apply strategies
CrawlEngine-->>FastAPI: CrawlResult
FastAPI->>ResponseBuilder: Format response
ResponseBuilder-->>FastAPI: JSON response
FastAPI->>BrowserManager: Release browser
FastAPI-->>LoadBalancer: Response ready
LoadBalancer-->>Client: 200 OK + CrawlResult
Note over Client,ResponseBuilder: Streaming Request
Client->>FastAPI: POST /crawl/stream
FastAPI-->>Client: 200 OK (stream start)
loop For each URL
FastAPI->>CrawlEngine: Process URL
CrawlEngine-->>FastAPI: Result ready
FastAPI-->>Client: NDJSON line
end
FastAPI-->>Client: Stream completed
```
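The streaming pattern above delivers one NDJSON line per result, so a client can decode results as they arrive rather than waiting for the full response. A minimal sketch of the client-side parsing, assuming each line is a complete JSON object; the helper name `iter_ndjson` and the sample field names are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_ndjson(lines: Iterable[bytes]) -> Iterator[Dict[str, Any]]:
    """Yield one decoded object per non-empty NDJSON line."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        yield json.loads(line)


# Simulated stream; the field names are illustrative, not the server's schema.
sample = [
    b'{"url": "https://a.example", "success": true}\n',
    b"\n",
    b'{"url": "https://b.example", "success": false}\n',
]
for result in iter_ndjson(sample):
    print(result["url"], result["success"])
```

Because the response object returned by `urllib.request.urlopen` can be iterated line by line, it could be passed to `iter_ndjson` directly when talking to a live server.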
### Configuration Validation Workflow
```mermaid
flowchart TD
A[Client Request] --> B[JSON Payload]
B --> C{Pre-validation}
C -->|✓ Valid JSON| D[Extract Configurations]
C -->|✗ Invalid JSON| E[Return 400 Bad Request]
D --> F[BrowserConfig Validation]
D --> G[CrawlerRunConfig Validation]
F --> H{BrowserConfig Valid?}
G --> I{CrawlerRunConfig Valid?}
H -->|✓ Valid| J[Browser Setup]
H -->|✗ Invalid| K[Log Browser Config Errors]
I -->|✓ Valid| L[Crawler Setup]
I -->|✗ Invalid| M[Log Crawler Config Errors]
K --> N[Collect All Errors]
M --> N
N --> O[Return 422 Validation Error]
J --> P{Both Configs Valid?}
L --> P
P -->|✓ Yes| Q[Proceed to Crawling]
P -->|✗ No| O
Q --> R[Execute Crawl Pipeline]
R --> S[Return CrawlResult]
E --> T[Client Error Response]
O --> T
S --> U[Client Success Response]
style A fill:#e1f5fe
style Q fill:#c8e6c9
style S fill:#c8e6c9
style U fill:#c8e6c9
style E fill:#ffcdd2
style O fill:#ffcdd2
style T fill:#ffcdd2
```
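The workflow above distinguishes malformed JSON (400) from well-formed JSON with invalid configurations (422), and collects errors from both configs before responding. The sketch below mirrors that control flow; the function name and the validation rule (each config must be a JSON object) are placeholder assumptions, not the server's real schema checks.

```python
import json
from typing import List, Tuple


def validate_request(raw_body: str) -> Tuple[int, List[str]]:
    """Return (status_code, errors) following the validation workflow."""
    # Pre-validation: is the body JSON at all?
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        return 400, [f"invalid JSON: {exc.msg}"]

    if not isinstance(payload, dict):
        return 422, ["payload: expected a JSON object"]

    # Validate both configs and collect every error before answering.
    errors: List[str] = []
    for field in ("browser_config", "crawler_config"):
        value = payload.get(field, {})
        if not isinstance(value, dict):  # placeholder rule
            errors.append(
                f"{field}: expected an object, got {type(value).__name__}"
            )

    return (422, errors) if errors else (200, [])


print(validate_request('{"browser_config": 1, "crawler_config": []}'))
```

Collecting all errors in one pass lets a client fix both configurations in a single round trip instead of discovering the problems one at a time.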
### Production Deployment Architecture
```mermaid
graph TB
subgraph "Load Balancer Layer"
A[NGINX/HAProxy] --> B[Health Check]
A --> C[Request Routing]
A --> D[SSL Termination]
end
subgraph "Application Layer"
C --> E[Crawl4AI Instance 1]
C --> F[Crawl4AI Instance 2]
C --> G[Crawl4AI Instance N]
E --> H[FastAPI Server]
F --> I[FastAPI Server]
G --> J[FastAPI Server]
H --> K[Browser Pool 1]
I --> L[Browser Pool 2]
J --> M[Browser Pool N]
end
subgraph "Shared Services"
N[Redis Cluster] --> E
N --> F
N --> G
O[Monitoring Stack] --> P[Prometheus]
O --> Q[Grafana]
O --> R[AlertManager]
P --> E
P --> F
P --> G
end
subgraph "External Dependencies"
S[OpenAI API] -.-> H
T[Anthropic API] -.-> I
U[Local LLM Cluster] -.-> J
end
subgraph "Persistent Storage"
V[Configuration Volume] --> E
V --> F
V --> G
W[Cache Volume] --> N
X[Logs Volume] --> O
end
style A fill:#e3f2fd
style E fill:#f3e5f5
style F fill:#f3e5f5
style G fill:#f3e5f5
style N fill:#e8f5e8
style O fill:#fff3e0
```
### Docker Resource Management
```mermaid
graph TD
subgraph "Resource Allocation"
A[Host Resources] --> B[CPU Cores]
A --> C[Memory GB]
A --> D[Disk Space]
A --> E[Network Bandwidth]
B --> F[Container Limits]
C --> F
D --> F
E --> F
end
subgraph "Container Configuration"
F --> G[--cpus=4]
F --> H[--memory=8g]
F --> I[--shm-size=2g]
F --> J[Volume Mounts]
G --> K[Browser Processes]
H --> L[Browser Memory]
I --> M[Shared Memory for Browsers]
J --> N[Config & Cache Storage]
end
subgraph "Monitoring & Scaling"
O[Resource Monitor] --> P[CPU Usage %]
O --> Q[Memory Usage %]
O --> R[Request Queue Length]
P --> S{CPU > 80%?}
Q --> T{Memory > 90%?}
R --> U{Queue > 100?}
S -->|Yes| V[Scale Up]
T -->|Yes| V
U -->|Yes| V
V --> W[Add Container Instance]
W --> X[Update Load Balancer]
end
subgraph "Performance Optimization"
Y[Browser Pool Tuning] --> Z[Max Pages: 40]
Y --> AA[Idle TTL: 30min]
Y --> BB[Concurrency Limits]
Z --> CC[Memory Efficiency]
AA --> DD[Resource Cleanup]
BB --> EE[Throughput Control]
end
style A fill:#e1f5fe
style F fill:#f3e5f5
style O fill:#e8f5e8
style Y fill:#fff3e0
```
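The scale-up conditions in the Monitoring & Scaling subgraph reduce to three threshold checks joined by a logical OR. A small sketch of that decision, with the thresholds taken from the diagram; the `Metrics` class and the function name are hypothetical, and a real autoscaler would normally add smoothing and cooldown periods on top.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    cpu_percent: float
    memory_percent: float
    queue_length: int


def should_scale_up(
    m: Metrics,
    cpu_limit: float = 80.0,     # CPU > 80%?
    memory_limit: float = 90.0,  # Memory > 90%?
    queue_limit: int = 100,      # Queue > 100?
) -> bool:
    """Return True when any single threshold from the diagram is exceeded."""
    return (
        m.cpu_percent > cpu_limit
        or m.memory_percent > memory_limit
        or m.queue_length > queue_limit
    )


print(should_scale_up(Metrics(cpu_percent=85.0, memory_percent=40.0, queue_length=3)))
```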
**📖 Learn more:** [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/), [API Reference](https://docs.crawl4ai.com/api/), [MCP Integration](https://docs.crawl4ai.com/core/docker-deployment/#mcp-model-context-protocol-support), [Production Configuration](https://docs.crawl4ai.com/core/docker-deployment/#production-deployment)