- Split release.yml into PyPI/GitHub release and Docker workflows - Add GitHub Actions cache for Docker builds (10-15x faster rebuilds) - Implement dual-trigger for docker-release.yml (auto + manual) - Add comprehensive workflow documentation in .github/workflows/docs/ - Backup original workflow as release.yml.backup
30 KiB
Workflow Architecture Documentation
Overview
This document describes the technical architecture of the split release pipeline for Crawl4AI.
Architecture Diagram
┌─────────────────────────────────────────────────────────────────┐
│ Developer │
│ │ │
│ ▼ │
│ git tag v1.2.3 │
│ git push --tags │
└──────────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ GitHub Repository │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Tag Event: v1.2.3 │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ release.yml (Release Pipeline) │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 1. Extract Version │ │ │
│ │ │ v1.2.3 → 1.2.3 │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 2. Validate Version │ │ │
│ │ │ Tag == __version__.py │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 3. Build Python Package │ │ │
│ │ │ - Source dist (.tar.gz) │ │ │
│ │ │ - Wheel (.whl) │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 4. Upload to PyPI │ │ │
│ │ │ - Authenticate with token │ │ │
│ │ │ - Upload dist/* │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 5. Create GitHub Release │ │ │
│ │ │ - Tag: v1.2.3 │ │ │
│ │ │ - Body: Install instructions │ │ │
│ │ │ - Status: Published │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Release Event: published (v1.2.3) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ docker-release.yml (Docker Pipeline) │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 1. Extract Version from Release │ │ │
│ │ │ github.event.release.tag_name → 1.2.3 │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 2. Parse Semantic Versions │ │ │
│ │ │ 1.2.3 → Major: 1, Minor: 1.2 │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 3. Setup Multi-Arch Build │ │ │
│ │ │ - Docker Buildx │ │ │
│ │ │ - QEMU emulation │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 4. Authenticate Docker Hub │ │ │
│ │ │ - Username: DOCKER_USERNAME │ │ │
│ │ │ - Token: DOCKER_TOKEN │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 5. Build Multi-Arch Images │ │ │
│ │ │ ┌────────────────┬────────────────┐ │ │ │
│ │ │ │ linux/amd64 │ linux/arm64 │ │ │ │
│ │ │ └────────────────┴────────────────┘ │ │ │
│ │ │ Cache: GitHub Actions (type=gha) │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 6. Push to Docker Hub │ │ │
│ │ │ Tags: │ │ │
│ │ │ - unclecode/crawl4ai:1.2.3 │ │ │
│ │ │ - unclecode/crawl4ai:1.2 │ │ │
│ │ │ - unclecode/crawl4ai:1 │ │ │
│ │ │ - unclecode/crawl4ai:latest │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ External Services │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ PyPI │ │ Docker Hub │ │ GitHub │ │
│ │ │ │ │ │ │ │
│ │ crawl4ai │ │ unclecode/ │ │ Releases │ │
│ │ 1.2.3 │ │ crawl4ai │ │ v1.2.3 │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Component Details
1. Release Pipeline (release.yml)
Purpose
Fast publication of Python package and GitHub release.
Input
- Trigger: Git tag matching
v*(excludingtest-v*) - Example:
v1.2.3
Processing Stages
Stage 1: Version Extraction
Input: refs/tags/v1.2.3
Output: VERSION=1.2.3
Implementation:
TAG_VERSION=${GITHUB_REF#refs/tags/v} # Remove 'refs/tags/v' prefix
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
Stage 2: Version Validation
Input: TAG_VERSION=1.2.3
Check: crawl4ai/__version__.py contains __version__ = "1.2.3"
Output: Pass/Fail
Implementation:
PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
exit 1
fi
Stage 3: Package Build
Input: Source code + pyproject.toml
Output: dist/crawl4ai-1.2.3.tar.gz
dist/crawl4ai-1.2.3-py3-none-any.whl
Implementation:
python -m build
# Uses build backend defined in pyproject.toml
Stage 4: PyPI Upload
Input: dist/*.{tar.gz,whl}
Auth: PYPI_TOKEN
Output: Package published to PyPI
Implementation:
twine upload dist/*
# Environment:
# TWINE_USERNAME: __token__
# TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
Stage 5: GitHub Release Creation
Input: Tag: v1.2.3
Body: Markdown content
Output: Published GitHub release
Implementation:
uses: softprops/action-gh-release@v2
with:
tag_name: v1.2.3
name: Release v1.2.3
body: |
Installation instructions and changelog
draft: false
prerelease: false
Output
- PyPI Package: https://pypi.org/project/crawl4ai/1.2.3/
- GitHub Release: Published release on repository
- Event:
release.published(triggers Docker workflow)
Timeline
0:00 - Tag pushed
0:01 - Checkout + Python setup
0:02 - Version validation
0:03 - Package build
0:04 - PyPI upload starts
0:06 - PyPI upload complete
0:07 - GitHub release created
0:08 - Workflow complete
2. Docker Release Pipeline (docker-release.yml)
Purpose
Build and publish multi-architecture Docker images.
Inputs
Input 1: Release Event (Automatic)
Event: release.published
Data: github.event.release.tag_name = "v1.2.3"
Input 2: Docker Rebuild Tag (Manual)
Tag: docker-rebuild-v1.2.3
Processing Stages
Stage 1: Version Detection
# From release event:
VERSION = github.event.release.tag_name.strip("v")
# Result: "1.2.3"
# From rebuild tag:
VERSION = GITHUB_REF.replace("refs/tags/docker-rebuild-v", "")
# Result: "1.2.3"
Stage 2: Semantic Version Parsing
Input: VERSION=1.2.3
Output: MAJOR=1
MINOR=1.2
PATCH=3 (implicit)
Implementation:
MAJOR=$(echo $VERSION | cut -d. -f1) # Extract first component
MINOR=$(echo $VERSION | cut -d. -f1-2) # Extract first two components
Stage 3: Multi-Architecture Setup
Setup:
- Docker Buildx (multi-platform builder)
- QEMU (ARM emulation on x86)
Platforms:
- linux/amd64 (x86_64)
- linux/arm64 (aarch64)
Architecture:
GitHub Runner (linux/amd64)
├─ Buildx Builder
│ ├─ Native: Build linux/amd64 image
│ └─ QEMU: Emulate ARM to build linux/arm64 image
└─ Generate manifest list (points to both images)
Stage 4: Docker Hub Authentication
Input: DOCKER_USERNAME
DOCKER_TOKEN
Output: Authenticated Docker client
Stage 5: Build with Cache
Cache Configuration:
cache-from: type=gha # Read from GitHub Actions cache
cache-to: type=gha,mode=max # Write all layers
Cache Key Components:
- Workflow file path
- Branch name
- Architecture (amd64/arm64)
Cache Hierarchy:
Cache Entry: main/docker-release.yml/linux-amd64
├─ Layer: sha256:abc123... (FROM python:3.12)
├─ Layer: sha256:def456... (RUN apt-get update)
├─ Layer: sha256:ghi789... (COPY requirements.txt)
├─ Layer: sha256:jkl012... (RUN pip install)
└─ Layer: sha256:mno345... (COPY . /app)
Cache Hit/Miss Logic:
- If layer input unchanged → cache hit → skip build
- If layer input changed → cache miss → rebuild + all subsequent layers
Stage 6: Tag Generation
Input: VERSION=1.2.3, MAJOR=1, MINOR=1.2
Output Tags:
- unclecode/crawl4ai:1.2.3 (exact version)
- unclecode/crawl4ai:1.2 (minor version)
- unclecode/crawl4ai:1 (major version)
- unclecode/crawl4ai:latest (latest stable)
Tag Strategy:
- All tags point to same image SHA
- Users can pin to desired stability level
- Pushing new version updates
1,1.2, andlatestautomatically
Stage 7: Push to Registry
For each tag:
For each platform (amd64, arm64):
Push image to Docker Hub
Create manifest list:
Manifest: unclecode/crawl4ai:1.2.3
├─ linux/amd64: sha256:abc...
└─ linux/arm64: sha256:def...
Docker CLI automatically selects correct platform on pull
Output
- Docker Images: 4 tags × 2 platforms = 8 image variants + 4 manifests
- Docker Hub: https://hub.docker.com/r/unclecode/crawl4ai/tags
Timeline
Cold Cache (First Build):
0:00 - Release event received
0:01 - Checkout + Buildx setup
0:02 - Docker Hub auth
0:03 - Start build (amd64)
0:08 - Complete amd64 build
0:09 - Start build (arm64)
0:14 - Complete arm64 build
0:15 - Generate manifests
0:16 - Push all tags
0:17 - Workflow complete
Warm Cache (Code Change Only):
0:00 - Release event received
0:01 - Checkout + Buildx setup
0:02 - Docker Hub auth
0:03 - Start build (amd64) - cache hit for layers 1-4
0:04 - Complete amd64 build (only layer 5 rebuilt)
0:05 - Start build (arm64) - cache hit for layers 1-4
0:06 - Complete arm64 build (only layer 5 rebuilt)
0:07 - Generate manifests
0:08 - Push all tags
0:09 - Workflow complete
Data Flow
Version Information Flow
Developer
│
▼
crawl4ai/__version__.py
__version__ = "1.2.3"
│
├─► Git Tag
│ v1.2.3
│ │
│ ▼
│ release.yml
│ │
│ ├─► Validation
│ │ ✓ Match
│ │
│ ├─► PyPI Package
│ │ crawl4ai==1.2.3
│ │
│ └─► GitHub Release
│ v1.2.3
│ │
│ ▼
│ docker-release.yml
│ │
│ └─► Docker Tags
│ 1.2.3, 1.2, 1, latest
│
└─► Package Metadata
pyproject.toml
version = "1.2.3"
Secrets Flow
GitHub Secrets (Encrypted at Rest)
│
├─► PYPI_TOKEN
│ │
│ ▼
│ release.yml
│ │
│ ▼
│ TWINE_PASSWORD env var (masked in logs)
│ │
│ ▼
│ PyPI API (HTTPS)
│
├─► DOCKER_USERNAME
│ │
│ ▼
│ docker-release.yml
│ │
│ ▼
│ docker/login-action (masked in logs)
│ │
│ ▼
│ Docker Hub API (HTTPS)
│
└─► DOCKER_TOKEN
│
▼
docker-release.yml
│
▼
docker/login-action (masked in logs)
│
▼
Docker Hub API (HTTPS)
Artifact Flow
Source Code
│
├─► release.yml
│ │
│ ▼
│ python -m build
│ │
│ ├─► crawl4ai-1.2.3.tar.gz
│ │ │
│ │ ▼
│ │ PyPI Storage
│ │ │
│ │ ▼
│ │ pip install crawl4ai
│ │
│ └─► crawl4ai-1.2.3-py3-none-any.whl
│ │
│ ▼
│ PyPI Storage
│ │
│ ▼
│ pip install crawl4ai
│
└─► docker-release.yml
│
▼
docker build
│
├─► Image: linux/amd64
│ │
│ └─► Docker Hub
│ unclecode/crawl4ai:1.2.3-amd64
│
└─► Image: linux/arm64
│
└─► Docker Hub
unclecode/crawl4ai:1.2.3-arm64
State Machines
Release Pipeline State Machine
┌─────────┐
│ START │
└────┬────┘
│
▼
┌──────────────┐
│ Extract │
│ Version │
└──────┬───────┘
│
▼
┌──────────────┐ ┌─────────┐
│ Validate │─────►│ FAILED │
│ Version │ No │ (Exit 1)│
└──────┬───────┘ └─────────┘
│ Yes
▼
┌──────────────┐
│ Build │
│ Package │
└──────┬───────┘
│
▼
┌──────────────┐ ┌─────────┐
│ Upload │─────►│ FAILED │
│ to PyPI │ Error│ (Exit 1)│
└──────┬───────┘ └─────────┘
│ Success
▼
┌──────────────┐
│ Create │
│ GH Release │
└──────┬───────┘
│
▼
┌──────────────┐
│ SUCCESS │
│ (Emit Event) │
└──────────────┘
Docker Pipeline State Machine
┌─────────┐
│ START │
│ (Event) │
└────┬────┘
│
▼
┌──────────────┐
│ Detect │
│ Version │
│ Source │
└──────┬───────┘
│
▼
┌──────────────┐
│ Parse │
│ Semantic │
│ Versions │
└──────┬───────┘
│
▼
┌──────────────┐ ┌─────────┐
│ Authenticate │─────►│ FAILED │
│ Docker Hub │ Error│ (Exit 1)│
└──────┬───────┘ └─────────┘
│ Success
▼
┌──────────────┐
│ Build │
│ amd64 │
└──────┬───────┘
│
▼
┌──────────────┐ ┌─────────┐
│ Build │─────►│ FAILED │
│ arm64 │ Error│ (Exit 1)│
└──────┬───────┘ └─────────┘
│ Success
▼
┌──────────────┐
│ Push All │
│ Tags │
└──────┬───────┘
│
▼
┌──────────────┐
│ SUCCESS │
└──────────────┘
Security Architecture
Threat Model
Threats Mitigated
-
Secret Exposure
- Mitigation: GitHub Actions secret masking
- Evidence: Secrets never appear in logs
-
Unauthorized Package Upload
- Mitigation: Scoped PyPI tokens
- Evidence: Token limited to
crawl4aiproject
-
Man-in-the-Middle
- Mitigation: HTTPS for all API calls
- Evidence: PyPI, Docker Hub, GitHub all use TLS
-
Supply Chain Tampering
- Mitigation: Immutable artifacts, content checksums
- Evidence: PyPI stores SHA256, Docker uses content-addressable storage
Trust Boundaries
┌─────────────────────────────────────────┐
│ Trusted Zone │
│ ┌────────────────────────────────┐ │
│ │ GitHub Actions Runner │ │
│ │ - Ephemeral VM │ │
│ │ - Isolated environment │ │
│ │ - Access to secrets │ │
│ └────────────────────────────────┘ │
│ │ │
│ │ HTTPS (TLS 1.2+) │
│ ▼ │
└─────────────────────────────────────────┘
│
┌────────────┼────────────┐
│ │ │
▼ ▼ ▼
┌────────┐ ┌─────────┐ ┌──────────┐
│ PyPI │ │ Docker │ │ GitHub │
│ API │ │ Hub │ │ API │
└────────┘ └─────────┘ └──────────┘
External External External
Service Service Service
Secret Management
Secret Lifecycle
Creation (Developer)
│
├─► PyPI: Create API token (scoped to project)
├─► Docker Hub: Create access token (read/write)
│
▼
Storage (GitHub)
│
├─► Encrypted at rest (AES-256)
├─► Access controlled (repo-scoped)
│
▼
Usage (Workflow)
│
├─► Injected as env vars
├─► Masked in logs (GitHub redacts on output)
├─► Never persisted to disk (in-memory only)
│
▼
Transmission (API Call)
│
├─► HTTPS only
├─► TLS 1.2+ with strong ciphers
│
▼
Rotation (Manual)
│
└─► Regenerate on PyPI/Docker Hub
Update GitHub secret
Performance Characteristics
Release Pipeline Performance
| Metric | Value | Notes |
|---|---|---|
| Cold start | ~2-3 min | First run on new runner |
| Warm start | ~2-3 min | Minimal caching benefit |
| PyPI upload | ~30-60 sec | Network-bound |
| Package build | ~30 sec | CPU-bound |
| Parallelization | None | Sequential by design |
Docker Pipeline Performance
| Metric | Cold Cache | Warm Cache (code) | Warm Cache (deps) |
|---|---|---|---|
| Total time | 10-15 min | 1-2 min | 3-5 min |
| amd64 build | 5-7 min | 30-60 sec | 1-2 min |
| arm64 build | 5-7 min | 30-60 sec | 1-2 min |
| Push time | 1-2 min | 30 sec | 30 sec |
| Cache hit rate | 0% | 85% | 60% |
Cache Performance Model
def estimate_build_time(changes):
base_time = 60 # seconds (setup + push)
if "Dockerfile" in changes:
return base_time + (10 * 60) # Full rebuild: ~11 min
elif "requirements.txt" in changes:
return base_time + (3 * 60) # Deps rebuild: ~4 min
elif any(f.endswith(".py") for f in changes):
return base_time + 60 # Code only: ~2 min
else:
return base_time # No changes: ~1 min
Scalability Considerations
Current Limits
| Resource | Limit | Impact |
|---|---|---|
| Workflow concurrency | 20 (default) | Max 20 releases in parallel |
| Artifact storage | 500 MB/artifact | PyPI packages small (<10 MB) |
| Cache storage | 10 GB/repo | Docker layers fit comfortably |
| Workflow run time | 6 hours | Plenty of headroom |
Scaling Strategies
Horizontal Scaling (Multiple Repos)
crawl4ai (main)
├─ release.yml
└─ docker-release.yml
crawl4ai-plugins (separate)
├─ release.yml
└─ docker-release.yml
Each repo has independent:
- Secrets
- Cache (10 GB each)
- Concurrency limits (20 each)
Vertical Scaling (Larger Runners)
jobs:
docker:
runs-on: ubuntu-latest-8-cores # GitHub-hosted larger runner
# 4x faster builds for CPU-bound layers
Disaster Recovery
Failure Scenarios
Scenario 1: Release Pipeline Fails
Failure Point: PyPI upload fails (network error)
State:
- ✓ Version validated
- ✓ Package built
- ✗ PyPI upload
- ✗ GitHub release
Recovery:
# Manual upload
twine upload dist/*
# Retry workflow (re-run from GitHub Actions UI)
Prevention: Add retry logic to PyPI upload
Scenario 2: Docker Pipeline Fails
Failure Point: ARM build fails (dependency issue)
State:
- ✓ PyPI published
- ✓ GitHub release created
- ✓ amd64 image built
- ✗ arm64 image build
Recovery:
# Fix Dockerfile
git commit -am "fix: ARM build dependency"
# Trigger rebuild
git tag docker-rebuild-v1.2.3
git push origin docker-rebuild-v1.2.3
Impact: PyPI package available, only Docker ARM users affected
Scenario 3: Partial Release
Failure Point: GitHub release creation fails
State:
- ✓ PyPI published
- ✗ GitHub release
- ✗ Docker images
Recovery:
# Create release manually
gh release create v1.2.3 \
--title "Release v1.2.3" \
--notes "..."
# This triggers docker-release.yml automatically
Monitoring and Observability
Metrics to Track
Release Pipeline
- Success rate (target: >99%)
- Duration (target: <3 min)
- PyPI upload time (target: <60 sec)
Docker Pipeline
- Success rate (target: >95%)
- Duration (target: <15 min cold, <2 min warm)
- Cache hit rate (target: >80% for code changes)
Alerting
Critical Alerts:
- Release pipeline failure (blocks release)
- PyPI authentication failure (expired token)
Warning Alerts:
- Docker build >15 min (performance degradation)
- Cache hit rate <50% (cache issue)
Logging
GitHub Actions Logs:
- Retention: 90 days
- Downloadable: Yes
- Searchable: Limited
Recommended External Logging:
- name: Send logs to external service
if: failure()
run: |
curl -X POST https://logs.example.com/api/v1/logs \
-H "Content-Type: application/json" \
-d "{\"workflow\": \"${{ github.workflow }}\", \"status\": \"failed\"}"
Future Enhancements
Planned Improvements
-
Automated Changelog Generation
- Use conventional commits
- Generate CHANGELOG.md automatically
-
Pre-release Testing
- Test builds on
test-v*tags - Upload to TestPyPI
- Test builds on
-
Notification System
- Slack/Discord notifications on release
- Email on failure
-
Performance Optimization
- Parallel Docker builds (amd64 + arm64 simultaneously)
- Persistent runners for better caching
-
Enhanced Validation
- Smoke tests after PyPI upload
- Container security scanning
References
Last Updated: 2025-01-21 Version: 2.0