Files
crawl4ai/.github/workflows/docs/ARCHITECTURE.md
unclecode 6d1a398419 feat(ci): split release pipeline and add Docker caching
- Split release.yml into PyPI/GitHub release and Docker workflows
- Add GitHub Actions cache for Docker builds (10-15x faster rebuilds)
- Implement dual-trigger for docker-release.yml (auto + manual)
- Add comprehensive workflow documentation in .github/workflows/docs/
- Backup original workflow as release.yml.backup
2025-10-21 10:53:12 +08:00

30 KiB
Raw Blame History

Workflow Architecture Documentation

Overview

This document describes the technical architecture of the split release pipeline for Crawl4AI.


Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                         Developer                                │
│                              │                                   │
│                              ▼                                   │
│                    git tag v1.2.3                               │
│                    git push --tags                              │
└──────────────────────────────┬──────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                      GitHub Repository                           │
│                                                                  │
│  ┌────────────────────────────────────────────────────────┐   │
│  │                  Tag Event: v1.2.3                      │   │
│  └────────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ▼                                   │
│  ┌────────────────────────────────────────────────────────┐   │
│  │           release.yml (Release Pipeline)               │   │
│  │  ┌──────────────────────────────────────────────┐     │   │
│  │  │ 1. Extract Version                            │     │   │
│  │  │    v1.2.3 → 1.2.3                            │     │   │
│  │  └──────────────────────────────────────────────┘     │   │
│  │  ┌──────────────────────────────────────────────┐     │   │
│  │  │ 2. Validate Version                           │     │   │
│  │  │    Tag == __version__.py                      │     │   │
│  │  └──────────────────────────────────────────────┘     │   │
│  │  ┌──────────────────────────────────────────────┐     │   │
│  │  │ 3. Build Python Package                       │     │   │
│  │  │    - Source dist (.tar.gz)                    │     │   │
│  │  │    - Wheel (.whl)                             │     │   │
│  │  └──────────────────────────────────────────────┘     │   │
│  │  ┌──────────────────────────────────────────────┐     │   │
│  │  │ 4. Upload to PyPI                             │     │   │
│  │  │    - Authenticate with token                  │     │   │
│  │  │    - Upload dist/*                            │     │   │
│  │  └──────────────────────────────────────────────┘     │   │
│  │  ┌──────────────────────────────────────────────┐     │   │
│  │  │ 5. Create GitHub Release                      │     │   │
│  │  │    - Tag: v1.2.3                              │     │   │
│  │  │    - Body: Install instructions               │     │   │
│  │  │    - Status: Published                        │     │   │
│  │  └──────────────────────────────────────────────┘     │   │
│  └────────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ▼                                   │
│  ┌────────────────────────────────────────────────────────┐   │
│  │         Release Event: published (v1.2.3)              │   │
│  └────────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ▼                                   │
│  ┌────────────────────────────────────────────────────────┐   │
│  │         docker-release.yml (Docker Pipeline)           │   │
│  │  ┌──────────────────────────────────────────────┐     │   │
│  │  │ 1. Extract Version from Release               │     │   │
│  │  │    github.event.release.tag_name → 1.2.3     │     │   │
│  │  └──────────────────────────────────────────────┘     │   │
│  │  ┌──────────────────────────────────────────────┐     │   │
│  │  │ 2. Parse Semantic Versions                    │     │   │
│  │  │    1.2.3 → Major: 1, Minor: 1.2              │     │   │
│  │  └──────────────────────────────────────────────┘     │   │
│  │  ┌──────────────────────────────────────────────┐     │   │
│  │  │ 3. Setup Multi-Arch Build                     │     │   │
│  │  │    - Docker Buildx                            │     │   │
│  │  │    - QEMU emulation                           │     │   │
│  │  └──────────────────────────────────────────────┘     │   │
│  │  ┌──────────────────────────────────────────────┐     │   │
│  │  │ 4. Authenticate Docker Hub                    │     │   │
│  │  │    - Username: DOCKER_USERNAME                │     │   │
│  │  │    - Token: DOCKER_TOKEN                      │     │   │
│  │  └──────────────────────────────────────────────┘     │   │
│  │  ┌──────────────────────────────────────────────┐     │   │
│  │  │ 5. Build Multi-Arch Images                    │     │   │
│  │  │    ┌────────────────┬────────────────┐       │     │   │
│  │  │    │  linux/amd64   │  linux/arm64   │       │     │   │
│  │  │    └────────────────┴────────────────┘       │     │   │
│  │  │    Cache: GitHub Actions (type=gha)          │     │   │
│  │  └──────────────────────────────────────────────┘     │   │
│  │  ┌──────────────────────────────────────────────┐     │   │
│  │  │ 6. Push to Docker Hub                         │     │   │
│  │  │    Tags:                                      │     │   │
│  │  │    - unclecode/crawl4ai:1.2.3                │     │   │
│  │  │    - unclecode/crawl4ai:1.2                  │     │   │
│  │  │    - unclecode/crawl4ai:1                    │     │   │
│  │  │    - unclecode/crawl4ai:latest               │     │   │
│  │  └──────────────────────────────────────────────┘     │   │
│  └────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                     External Services                            │
│                                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐         │
│  │    PyPI      │  │  Docker Hub  │  │   GitHub     │         │
│  │              │  │              │  │              │         │
│  │  crawl4ai    │  │ unclecode/   │  │  Releases    │         │
│  │  1.2.3       │  │ crawl4ai     │  │  v1.2.3      │         │
│  └──────────────┘  └──────────────┘  └──────────────┘         │
└─────────────────────────────────────────────────────────────────┘

Component Details

1. Release Pipeline (release.yml)

Purpose

Fast publication of Python package and GitHub release.

Input

  • Trigger: Git tag matching v* (excluding test-v*)
  • Example: v1.2.3

Processing Stages

Stage 1: Version Extraction
Input:  refs/tags/v1.2.3
Output: VERSION=1.2.3

Implementation:

TAG_VERSION=${GITHUB_REF#refs/tags/v}  # Remove 'refs/tags/v' prefix
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
Stage 2: Version Validation
Input:  TAG_VERSION=1.2.3
Check:  crawl4ai/__version__.py contains __version__ = "1.2.3"
Output: Pass/Fail

Implementation:

PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
  exit 1
fi
Stage 3: Package Build
Input:  Source code + pyproject.toml
Output: dist/crawl4ai-1.2.3.tar.gz
        dist/crawl4ai-1.2.3-py3-none-any.whl

Implementation:

python -m build
# Uses build backend defined in pyproject.toml
Stage 4: PyPI Upload
Input:  dist/*.{tar.gz,whl}
Auth:   PYPI_TOKEN
Output: Package published to PyPI

Implementation:

twine upload dist/*
# Environment:
#   TWINE_USERNAME: __token__
#   TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
Stage 5: GitHub Release Creation
Input:  Tag: v1.2.3
        Body: Markdown content
Output: Published GitHub release

Implementation:

uses: softprops/action-gh-release@v2
with:
  tag_name: v1.2.3
  name: Release v1.2.3
  body: |
    Installation instructions and changelog
  draft: false
  prerelease: false

Output

Timeline

0:00 - Tag pushed
0:01 - Checkout + Python setup
0:02 - Version validation
0:03 - Package build
0:04 - PyPI upload starts
0:06 - PyPI upload complete
0:07 - GitHub release created
0:08 - Workflow complete

2. Docker Release Pipeline (docker-release.yml)

Purpose

Build and publish multi-architecture Docker images.

Inputs

Input 1: Release Event (Automatic)
Event: release.published
Data:  github.event.release.tag_name = "v1.2.3"
Input 2: Docker Rebuild Tag (Manual)
Tag: docker-rebuild-v1.2.3

Processing Stages

Stage 1: Version Detection
# From release event:
VERSION = github.event.release.tag_name.strip("v")
# Result: "1.2.3"

# From rebuild tag:
VERSION = GITHUB_REF.replace("refs/tags/docker-rebuild-v", "")
# Result: "1.2.3"
Stage 2: Semantic Version Parsing
Input:  VERSION=1.2.3
Output: MAJOR=1
        MINOR=1.2
        PATCH=3 (implicit)

Implementation:

MAJOR=$(echo $VERSION | cut -d. -f1)    # Extract first component
MINOR=$(echo $VERSION | cut -d. -f1-2)  # Extract first two components
Stage 3: Multi-Architecture Setup
Setup:
  - Docker Buildx (multi-platform builder)
  - QEMU (ARM emulation on x86)

Platforms:
  - linux/amd64 (x86_64)
  - linux/arm64 (aarch64)

Architecture:

GitHub Runner (linux/amd64)
  ├─ Buildx Builder
  │   ├─ Native: Build linux/amd64 image
  │   └─ QEMU: Emulate ARM to build linux/arm64 image
  └─ Generate manifest list (points to both images)
Stage 4: Docker Hub Authentication
Input:  DOCKER_USERNAME
        DOCKER_TOKEN
Output: Authenticated Docker client
Stage 5: Build with Cache
Cache Configuration:
  cache-from: type=gha           # Read from GitHub Actions cache
  cache-to: type=gha,mode=max    # Write all layers

Cache Key Components:
  - Workflow file path
  - Branch name
  - Architecture (amd64/arm64)

Cache Hierarchy:

Cache Entry: main/docker-release.yml/linux-amd64
  ├─ Layer: sha256:abc123... (FROM python:3.12)
  ├─ Layer: sha256:def456... (RUN apt-get update)
  ├─ Layer: sha256:ghi789... (COPY requirements.txt)
  ├─ Layer: sha256:jkl012... (RUN pip install)
  └─ Layer: sha256:mno345... (COPY . /app)

Cache Hit/Miss Logic:
  - If layer input unchanged → cache hit → skip build
  - If layer input changed → cache miss → rebuild + all subsequent layers
Stage 6: Tag Generation
Input:  VERSION=1.2.3, MAJOR=1, MINOR=1.2

Output Tags:
  - unclecode/crawl4ai:1.2.3    (exact version)
  - unclecode/crawl4ai:1.2      (minor version)
  - unclecode/crawl4ai:1        (major version)
  - unclecode/crawl4ai:latest   (latest stable)

Tag Strategy:

  • All tags point to same image SHA
  • Users can pin to desired stability level
  • Pushing new version updates 1, 1.2, and latest automatically
Stage 7: Push to Registry
For each tag:
  For each platform (amd64, arm64):
    Push image to Docker Hub

Create manifest list:
  Manifest: unclecode/crawl4ai:1.2.3
    ├─ linux/amd64: sha256:abc...
    └─ linux/arm64: sha256:def...

Docker CLI automatically selects correct platform on pull

Output

Timeline

Cold Cache (First Build):

0:00 - Release event received
0:01 - Checkout + Buildx setup
0:02 - Docker Hub auth
0:03 - Start build (amd64)
0:08 - Complete amd64 build
0:09 - Start build (arm64)
0:14 - Complete arm64 build
0:15 - Generate manifests
0:16 - Push all tags
0:17 - Workflow complete

Warm Cache (Code Change Only):

0:00 - Release event received
0:01 - Checkout + Buildx setup
0:02 - Docker Hub auth
0:03 - Start build (amd64) - cache hit for layers 1-4
0:04 - Complete amd64 build (only layer 5 rebuilt)
0:05 - Start build (arm64) - cache hit for layers 1-4
0:06 - Complete arm64 build (only layer 5 rebuilt)
0:07 - Generate manifests
0:08 - Push all tags
0:09 - Workflow complete

Data Flow

Version Information Flow

Developer
  │
  ▼
crawl4ai/__version__.py
  __version__ = "1.2.3"
  │
  ├─► Git Tag
  │     v1.2.3
  │       │
  │       ▼
  │     release.yml
  │       │
  │       ├─► Validation
  │       │     ✓ Match
  │       │
  │       ├─► PyPI Package
  │       │     crawl4ai==1.2.3
  │       │
  │       └─► GitHub Release
  │             v1.2.3
  │               │
  │               ▼
  │           docker-release.yml
  │               │
  │               └─► Docker Tags
  │                     1.2.3, 1.2, 1, latest
  │
  └─► Package Metadata
        pyproject.toml
          version = "1.2.3"

Secrets Flow

GitHub Secrets (Encrypted at Rest)
  │
  ├─► PYPI_TOKEN
  │     │
  │     ▼
  │   release.yml
  │     │
  │     ▼
  │   TWINE_PASSWORD env var (masked in logs)
  │     │
  │     ▼
  │   PyPI API (HTTPS)
  │
  ├─► DOCKER_USERNAME
  │     │
  │     ▼
  │   docker-release.yml
  │     │
  │     ▼
  │   docker/login-action (masked in logs)
  │     │
  │     ▼
  │   Docker Hub API (HTTPS)
  │
  └─► DOCKER_TOKEN
        │
        ▼
      docker-release.yml
        │
        ▼
      docker/login-action (masked in logs)
        │
        ▼
      Docker Hub API (HTTPS)

Artifact Flow

Source Code
  │
  ├─► release.yml
  │     │
  │     ▼
  │   python -m build
  │     │
  │     ├─► crawl4ai-1.2.3.tar.gz
  │     │     │
  │     │     ▼
  │     │   PyPI Storage
  │     │     │
  │     │     ▼
  │     │   pip install crawl4ai
  │     │
  │     └─► crawl4ai-1.2.3-py3-none-any.whl
  │           │
  │           ▼
  │         PyPI Storage
  │           │
  │           ▼
  │         pip install crawl4ai
  │
  └─► docker-release.yml
        │
        ▼
      docker build
        │
        ├─► Image: linux/amd64
        │     │
        │     └─► Docker Hub
        │           unclecode/crawl4ai:1.2.3-amd64
        │
        └─► Image: linux/arm64
              │
              └─► Docker Hub
                    unclecode/crawl4ai:1.2.3-arm64

State Machines

Release Pipeline State Machine

┌─────────┐
│  START  │
└────┬────┘
     │
     ▼
┌──────────────┐
│ Extract      │
│ Version      │
└──────┬───────┘
       │
       ▼
┌──────────────┐      ┌─────────┐
│ Validate     │─────►│ FAILED  │
│ Version      │ No   │ (Exit 1)│
└──────┬───────┘      └─────────┘
       │ Yes
       ▼
┌──────────────┐
│ Build        │
│ Package      │
└──────┬───────┘
       │
       ▼
┌──────────────┐      ┌─────────┐
│ Upload       │─────►│ FAILED  │
│ to PyPI      │ Error│ (Exit 1)│
└──────┬───────┘      └─────────┘
       │ Success
       ▼
┌──────────────┐
│ Create       │
│ GH Release   │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  SUCCESS     │
│ (Emit Event) │
└──────────────┘

Docker Pipeline State Machine

┌─────────┐
│  START  │
│ (Event) │
└────┬────┘
     │
     ▼
┌──────────────┐
│ Detect       │
│ Version      │
│ Source       │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Parse        │
│ Semantic     │
│ Versions     │
└──────┬───────┘
       │
       ▼
┌──────────────┐      ┌─────────┐
│ Authenticate │─────►│ FAILED  │
│ Docker Hub   │ Error│ (Exit 1)│
└──────┬───────┘      └─────────┘
       │ Success
       ▼
┌──────────────┐
│ Build        │
│ amd64        │
└──────┬───────┘
       │
       ▼
┌──────────────┐      ┌─────────┐
│ Build        │─────►│ FAILED  │
│ arm64        │ Error│ (Exit 1)│
└──────┬───────┘      └─────────┘
       │ Success
       ▼
┌──────────────┐
│ Push All     │
│ Tags         │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  SUCCESS     │
└──────────────┘

Security Architecture

Threat Model

Threats Mitigated

  1. Secret Exposure

    • Mitigation: GitHub Actions secret masking
    • Evidence: Secrets never appear in logs
  2. Unauthorized Package Upload

    • Mitigation: Scoped PyPI tokens
    • Evidence: Token limited to crawl4ai project
  3. Man-in-the-Middle

    • Mitigation: HTTPS for all API calls
    • Evidence: PyPI, Docker Hub, GitHub all use TLS
  4. Supply Chain Tampering

    • Mitigation: Immutable artifacts, content checksums
    • Evidence: PyPI stores SHA256, Docker uses content-addressable storage

Trust Boundaries

┌─────────────────────────────────────────┐
│         Trusted Zone                     │
│  ┌────────────────────────────────┐    │
│  │  GitHub Actions Runner         │    │
│  │  - Ephemeral VM                │    │
│  │  - Isolated environment        │    │
│  │  - Access to secrets           │    │
│  └────────────────────────────────┘    │
│                │                         │
│                │ HTTPS (TLS 1.2+)       │
│                ▼                         │
└─────────────────────────────────────────┘
                 │
    ┌────────────┼────────────┐
    │            │            │
    ▼            ▼            ▼
┌────────┐  ┌─────────┐  ┌──────────┐
│  PyPI  │  │  Docker │  │  GitHub  │
│  API   │  │  Hub    │  │  API     │
└────────┘  └─────────┘  └──────────┘
 External     External     External
  Service      Service      Service

Secret Management

Secret Lifecycle

Creation (Developer)
  │
  ├─► PyPI: Create API token (scoped to project)
  ├─► Docker Hub: Create access token (read/write)
  │
  ▼
Storage (GitHub)
  │
  ├─► Encrypted at rest (AES-256)
  ├─► Access controlled (repo-scoped)
  │
  ▼
Usage (Workflow)
  │
  ├─► Injected as env vars
  ├─► Masked in logs (GitHub redacts on output)
  ├─► Never persisted to disk (in-memory only)
  │
  ▼
Transmission (API Call)
  │
  ├─► HTTPS only
  ├─► TLS 1.2+ with strong ciphers
  │
  ▼
Rotation (Manual)
  │
  └─► Regenerate on PyPI/Docker Hub
      Update GitHub secret

Performance Characteristics

Release Pipeline Performance

Metric Value Notes
Cold start ~2-3 min First run on new runner
Warm start ~2-3 min Minimal caching benefit
PyPI upload ~30-60 sec Network-bound
Package build ~30 sec CPU-bound
Parallelization None Sequential by design

Docker Pipeline Performance

Metric Cold Cache Warm Cache (code) Warm Cache (deps)
Total time 10-15 min 1-2 min 3-5 min
amd64 build 5-7 min 30-60 sec 1-2 min
arm64 build 5-7 min 30-60 sec 1-2 min
Push time 1-2 min 30 sec 30 sec
Cache hit rate 0% 85% 60%

Cache Performance Model

def estimate_build_time(changes):
    base_time = 60  # seconds (setup + push)

    if "Dockerfile" in changes:
        return base_time + (10 * 60)  # Full rebuild: ~11 min
    elif "requirements.txt" in changes:
        return base_time + (3 * 60)   # Deps rebuild: ~4 min
    elif any(f.endswith(".py") for f in changes):
        return base_time + 60          # Code only: ~2 min
    else:
        return base_time               # No changes: ~1 min

Scalability Considerations

Current Limits

Resource Limit Impact
Workflow concurrency 20 (default) Max 20 releases in parallel
Artifact storage 500 MB/artifact PyPI packages small (<10 MB)
Cache storage 10 GB/repo Docker layers fit comfortably
Workflow run time 6 hours Plenty of headroom

Scaling Strategies

Horizontal Scaling (Multiple Repos)

crawl4ai (main)
  ├─ release.yml
  └─ docker-release.yml

crawl4ai-plugins (separate)
  ├─ release.yml
  └─ docker-release.yml

Each repo has independent:
  - Secrets
  - Cache (10 GB each)
  - Concurrency limits (20 each)

Vertical Scaling (Larger Runners)

jobs:
  docker:
    runs-on: ubuntu-latest-8-cores  # GitHub-hosted larger runner
    # 4x faster builds for CPU-bound layers

Disaster Recovery

Failure Scenarios

Scenario 1: Release Pipeline Fails

Failure Point: PyPI upload fails (network error)

State:

  • ✓ Version validated
  • ✓ Package built
  • ✗ PyPI upload
  • ✗ GitHub release

Recovery:

# Manual upload
twine upload dist/*

# Retry workflow (re-run from GitHub Actions UI)

Prevention: Add retry logic to PyPI upload

Scenario 2: Docker Pipeline Fails

Failure Point: ARM build fails (dependency issue)

State:

  • ✓ PyPI published
  • ✓ GitHub release created
  • ✓ amd64 image built
  • ✗ arm64 image build

Recovery:

# Fix Dockerfile
git commit -am "fix: ARM build dependency"

# Trigger rebuild
git tag docker-rebuild-v1.2.3
git push origin docker-rebuild-v1.2.3

Impact: PyPI package available, only Docker ARM users affected

Scenario 3: Partial Release

Failure Point: GitHub release creation fails

State:

  • ✓ PyPI published
  • ✗ GitHub release
  • ✗ Docker images

Recovery:

# Create release manually
gh release create v1.2.3 \
  --title "Release v1.2.3" \
  --notes "..."

# This triggers docker-release.yml automatically

Monitoring and Observability

Metrics to Track

Release Pipeline

  • Success rate (target: >99%)
  • Duration (target: <3 min)
  • PyPI upload time (target: <60 sec)

Docker Pipeline

  • Success rate (target: >95%)
  • Duration (target: <15 min cold, <2 min warm)
  • Cache hit rate (target: >80% for code changes)

Alerting

Critical Alerts:

  • Release pipeline failure (blocks release)
  • PyPI authentication failure (expired token)

Warning Alerts:

  • Docker build >15 min (performance degradation)
  • Cache hit rate <50% (cache issue)

Logging

GitHub Actions Logs:

  • Retention: 90 days
  • Downloadable: Yes
  • Searchable: Limited

Recommended External Logging:

- name: Send logs to external service
  if: failure()
  run: |
    curl -X POST https://logs.example.com/api/v1/logs \
      -H "Content-Type: application/json" \
      -d "{\"workflow\": \"${{ github.workflow }}\", \"status\": \"failed\"}"

Future Enhancements

Planned Improvements

  1. Automated Changelog Generation

    • Use conventional commits
    • Generate CHANGELOG.md automatically
  2. Pre-release Testing

    • Test builds on test-v* tags
    • Upload to TestPyPI
  3. Notification System

    • Slack/Discord notifications on release
    • Email on failure
  4. Performance Optimization

    • Parallel Docker builds (amd64 + arm64 simultaneously)
    • Persistent runners for better caching
  5. Enhanced Validation

    • Smoke tests after PyPI upload
    • Container security scanning

References


Last Updated: 2025-01-21 Version: 2.0