# Workflow Architecture Documentation ## Overview This document describes the technical architecture of the split release pipeline for Crawl4AI. --- ## Architecture Diagram ``` ┌─────────────────────────────────────────────────────────────────┐ │ Developer │ │ │ │ │ ▼ │ │ git tag v1.2.3 │ │ git push --tags │ └──────────────────────────────┬──────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────┐ │ GitHub Repository │ │ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │ Tag Event: v1.2.3 │ │ │ └────────────────────────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │ release.yml (Release Pipeline) │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ │ │ 1. Extract Version │ │ │ │ │ │ v1.2.3 → 1.2.3 │ │ │ │ │ └──────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ │ │ 2. Validate Version │ │ │ │ │ │ Tag == __version__.py │ │ │ │ │ └──────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ │ │ 3. Build Python Package │ │ │ │ │ │ - Source dist (.tar.gz) │ │ │ │ │ │ - Wheel (.whl) │ │ │ │ │ └──────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ │ │ 4. Upload to PyPI │ │ │ │ │ │ - Authenticate with token │ │ │ │ │ │ - Upload dist/* │ │ │ │ │ └──────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ │ │ 5. Create GitHub Release │ │ │ │ │ │ - Tag: v1.2.3 │ │ │ │ │ │ - Body: Install instructions │ │ │ │ │ │ - Status: Published │ │ │ │ │ └──────────────────────────────────────────────┘ │ │ │ └────────────────────────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │ Release Event: published (v1.2.3) │ │ │ └────────────────────────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │ docker-release.yml (Docker Pipeline) │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ │ │ 1. Extract Version from Release │ │ │ │ │ │ github.event.release.tag_name → 1.2.3 │ │ │ │ │ └──────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ │ │ 2. Parse Semantic Versions │ │ │ │ │ │ 1.2.3 → Major: 1, Minor: 1.2 │ │ │ │ │ └──────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ │ │ 3. Setup Multi-Arch Build │ │ │ │ │ │ - Docker Buildx │ │ │ │ │ │ - QEMU emulation │ │ │ │ │ └──────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ │ │ 4. Authenticate Docker Hub │ │ │ │ │ │ - Username: DOCKER_USERNAME │ │ │ │ │ │ - Token: DOCKER_TOKEN │ │ │ │ │ └──────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ │ │ 5. Build Multi-Arch Images │ │ │ │ │ │ ┌────────────────┬────────────────┐ │ │ │ │ │ │ │ linux/amd64 │ linux/arm64 │ │ │ │ │ │ │ └────────────────┴────────────────┘ │ │ │ │ │ │ Cache: GitHub Actions (type=gha) │ │ │ │ │ └──────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ │ │ 6. Push to Docker Hub │ │ │ │ │ │ Tags: │ │ │ │ │ │ - unclecode/crawl4ai:1.2.3 │ │ │ │ │ │ - unclecode/crawl4ai:1.2 │ │ │ │ │ │ - unclecode/crawl4ai:1 │ │ │ │ │ │ - unclecode/crawl4ai:latest │ │ │ │ │ └──────────────────────────────────────────────┘ │ │ │ └────────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────┐ │ External Services │ │ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ PyPI │ │ Docker Hub │ │ GitHub │ │ │ │ │ │ │ │ │ │ │ │ crawl4ai │ │ unclecode/ │ │ Releases │ │ │ │ 1.2.3 │ │ crawl4ai │ │ v1.2.3 │ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ └─────────────────────────────────────────────────────────────────┘ ``` --- ## Component Details ### 1. Release Pipeline (release.yml) #### Purpose Fast publication of Python package and GitHub release. #### Input - **Trigger**: Git tag matching `v*` (excluding `test-v*`) - **Example**: `v1.2.3` #### Processing Stages ##### Stage 1: Version Extraction ```bash Input: refs/tags/v1.2.3 Output: VERSION=1.2.3 ``` **Implementation**: ```bash TAG_VERSION=${GITHUB_REF#refs/tags/v} # Remove 'refs/tags/v' prefix echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT ``` ##### Stage 2: Version Validation ```bash Input: TAG_VERSION=1.2.3 Check: crawl4ai/__version__.py contains __version__ = "1.2.3" Output: Pass/Fail ``` **Implementation**: ```bash PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)") if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then exit 1 fi ``` ##### Stage 3: Package Build ```bash Input: Source code + pyproject.toml Output: dist/crawl4ai-1.2.3.tar.gz dist/crawl4ai-1.2.3-py3-none-any.whl ``` **Implementation**: ```bash python -m build # Uses build backend defined in pyproject.toml ``` ##### Stage 4: PyPI Upload ```bash Input: dist/*.{tar.gz,whl} Auth: PYPI_TOKEN Output: Package published to PyPI ``` **Implementation**: ```bash twine upload dist/* # Environment: # TWINE_USERNAME: __token__ # TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }} ``` ##### Stage 5: GitHub Release Creation ```bash Input: Tag: v1.2.3 Body: Markdown content Output: Published GitHub release ``` **Implementation**: ```yaml uses: softprops/action-gh-release@v2 with: tag_name: v1.2.3 name: Release v1.2.3 body: | Installation instructions and changelog draft: false prerelease: false ``` #### Output - **PyPI Package**: https://pypi.org/project/crawl4ai/1.2.3/ - **GitHub Release**: Published release on repository - **Event**: `release.published` (triggers Docker workflow) #### Timeline ``` 0:00 - Tag pushed 0:01 - Checkout + Python setup 0:02 - Version validation 0:03 - Package build 0:04 - PyPI upload starts 0:06 - PyPI upload complete 0:07 - GitHub release created 0:08 - Workflow complete ``` --- ### 2. Docker Release Pipeline (docker-release.yml) #### Purpose Build and publish multi-architecture Docker images. #### Inputs ##### Input 1: Release Event (Automatic) ```yaml Event: release.published Data: github.event.release.tag_name = "v1.2.3" ``` ##### Input 2: Docker Rebuild Tag (Manual) ```yaml Tag: docker-rebuild-v1.2.3 ``` #### Processing Stages ##### Stage 1: Version Detection ```bash # From release event: VERSION = github.event.release.tag_name.strip("v") # Result: "1.2.3" # From rebuild tag: VERSION = GITHUB_REF.replace("refs/tags/docker-rebuild-v", "") # Result: "1.2.3" ``` ##### Stage 2: Semantic Version Parsing ```bash Input: VERSION=1.2.3 Output: MAJOR=1 MINOR=1.2 PATCH=3 (implicit) ``` **Implementation**: ```bash MAJOR=$(echo $VERSION | cut -d. -f1) # Extract first component MINOR=$(echo $VERSION | cut -d. -f1-2) # Extract first two components ``` ##### Stage 3: Multi-Architecture Setup ```yaml Setup: - Docker Buildx (multi-platform builder) - QEMU (ARM emulation on x86) Platforms: - linux/amd64 (x86_64) - linux/arm64 (aarch64) ``` **Architecture**: ``` GitHub Runner (linux/amd64) ├─ Buildx Builder │ ├─ Native: Build linux/amd64 image │ └─ QEMU: Emulate ARM to build linux/arm64 image └─ Generate manifest list (points to both images) ``` ##### Stage 4: Docker Hub Authentication ```bash Input: DOCKER_USERNAME DOCKER_TOKEN Output: Authenticated Docker client ``` ##### Stage 5: Build with Cache ```yaml Cache Configuration: cache-from: type=gha # Read from GitHub Actions cache cache-to: type=gha,mode=max # Write all layers Cache Key Components: - Workflow file path - Branch name - Architecture (amd64/arm64) ``` **Cache Hierarchy**: ``` Cache Entry: main/docker-release.yml/linux-amd64 ├─ Layer: sha256:abc123... (FROM python:3.12) ├─ Layer: sha256:def456... (RUN apt-get update) ├─ Layer: sha256:ghi789... (COPY requirements.txt) ├─ Layer: sha256:jkl012... (RUN pip install) └─ Layer: sha256:mno345... (COPY . /app) Cache Hit/Miss Logic: - If layer input unchanged → cache hit → skip build - If layer input changed → cache miss → rebuild + all subsequent layers ``` ##### Stage 6: Tag Generation ```bash Input: VERSION=1.2.3, MAJOR=1, MINOR=1.2 Output Tags: - unclecode/crawl4ai:1.2.3 (exact version) - unclecode/crawl4ai:1.2 (minor version) - unclecode/crawl4ai:1 (major version) - unclecode/crawl4ai:latest (latest stable) ``` **Tag Strategy**: - All tags point to same image SHA - Users can pin to desired stability level - Pushing new version updates `1`, `1.2`, and `latest` automatically ##### Stage 7: Push to Registry ```bash For each tag: For each platform (amd64, arm64): Push image to Docker Hub Create manifest list: Manifest: unclecode/crawl4ai:1.2.3 ├─ linux/amd64: sha256:abc... └─ linux/arm64: sha256:def... Docker CLI automatically selects correct platform on pull ``` #### Output - **Docker Images**: 4 tags × 2 platforms = 8 image variants + 4 manifests - **Docker Hub**: https://hub.docker.com/r/unclecode/crawl4ai/tags #### Timeline **Cold Cache (First Build)**: ``` 0:00 - Release event received 0:01 - Checkout + Buildx setup 0:02 - Docker Hub auth 0:03 - Start build (amd64) 0:08 - Complete amd64 build 0:09 - Start build (arm64) 0:14 - Complete arm64 build 0:15 - Generate manifests 0:16 - Push all tags 0:17 - Workflow complete ``` **Warm Cache (Code Change Only)**: ``` 0:00 - Release event received 0:01 - Checkout + Buildx setup 0:02 - Docker Hub auth 0:03 - Start build (amd64) - cache hit for layers 1-4 0:04 - Complete amd64 build (only layer 5 rebuilt) 0:05 - Start build (arm64) - cache hit for layers 1-4 0:06 - Complete arm64 build (only layer 5 rebuilt) 0:07 - Generate manifests 0:08 - Push all tags 0:09 - Workflow complete ``` --- ## Data Flow ### Version Information Flow ``` Developer │ ▼ crawl4ai/__version__.py __version__ = "1.2.3" │ ├─► Git Tag │ v1.2.3 │ │ │ ▼ │ release.yml │ │ │ ├─► Validation │ │ ✓ Match │ │ │ ├─► PyPI Package │ │ crawl4ai==1.2.3 │ │ │ └─► GitHub Release │ v1.2.3 │ │ │ ▼ │ docker-release.yml │ │ │ └─► Docker Tags │ 1.2.3, 1.2, 1, latest │ └─► Package Metadata pyproject.toml version = "1.2.3" ``` ### Secrets Flow ``` GitHub Secrets (Encrypted at Rest) │ ├─► PYPI_TOKEN │ │ │ ▼ │ release.yml │ │ │ ▼ │ TWINE_PASSWORD env var (masked in logs) │ │ │ ▼ │ PyPI API (HTTPS) │ ├─► DOCKER_USERNAME │ │ │ ▼ │ docker-release.yml │ │ │ ▼ │ docker/login-action (masked in logs) │ │ │ ▼ │ Docker Hub API (HTTPS) │ └─► DOCKER_TOKEN │ ▼ docker-release.yml │ ▼ docker/login-action (masked in logs) │ ▼ Docker Hub API (HTTPS) ``` ### Artifact Flow ``` Source Code │ ├─► release.yml │ │ │ ▼ │ python -m build │ │ │ ├─► crawl4ai-1.2.3.tar.gz │ │ │ │ │ ▼ │ │ PyPI Storage │ │ │ │ │ ▼ │ │ pip install crawl4ai │ │ │ └─► crawl4ai-1.2.3-py3-none-any.whl │ │ │ ▼ │ PyPI Storage │ │ │ ▼ │ pip install crawl4ai │ └─► docker-release.yml │ ▼ docker build │ ├─► Image: linux/amd64 │ │ │ └─► Docker Hub │ unclecode/crawl4ai:1.2.3-amd64 │ └─► Image: linux/arm64 │ └─► Docker Hub unclecode/crawl4ai:1.2.3-arm64 ``` --- ## State Machines ### Release Pipeline State Machine ``` ┌─────────┐ │ START │ └────┬────┘ │ ▼ ┌──────────────┐ │ Extract │ │ Version │ └──────┬───────┘ │ ▼ ┌──────────────┐ ┌─────────┐ │ Validate │─────►│ FAILED │ │ Version │ No │ (Exit 1)│ └──────┬───────┘ └─────────┘ │ Yes ▼ ┌──────────────┐ │ Build │ │ Package │ └──────┬───────┘ │ ▼ ┌──────────────┐ ┌─────────┐ │ Upload │─────►│ FAILED │ │ to PyPI │ Error│ (Exit 1)│ └──────┬───────┘ └─────────┘ │ Success ▼ ┌──────────────┐ │ Create │ │ GH Release │ └──────┬───────┘ │ ▼ ┌──────────────┐ │ SUCCESS │ │ (Emit Event) │ └──────────────┘ ``` ### Docker Pipeline State Machine ``` ┌─────────┐ │ START │ │ (Event) │ └────┬────┘ │ ▼ ┌──────────────┐ │ Detect │ │ Version │ │ Source │ └──────┬───────┘ │ ▼ ┌──────────────┐ │ Parse │ │ Semantic │ │ Versions │ └──────┬───────┘ │ ▼ ┌──────────────┐ ┌─────────┐ │ Authenticate │─────►│ FAILED │ │ Docker Hub │ Error│ (Exit 1)│ └──────┬───────┘ └─────────┘ │ Success ▼ ┌──────────────┐ │ Build │ │ amd64 │ └──────┬───────┘ │ ▼ ┌──────────────┐ ┌─────────┐ │ Build │─────►│ FAILED │ │ arm64 │ Error│ (Exit 1)│ └──────┬───────┘ └─────────┘ │ Success ▼ ┌──────────────┐ │ Push All │ │ Tags │ └──────┬───────┘ │ ▼ ┌──────────────┐ │ SUCCESS │ └──────────────┘ ``` --- ## Security Architecture ### Threat Model #### Threats Mitigated 1. **Secret Exposure** - Mitigation: GitHub Actions secret masking - Evidence: Secrets never appear in logs 2. **Unauthorized Package Upload** - Mitigation: Scoped PyPI tokens - Evidence: Token limited to `crawl4ai` project 3. **Man-in-the-Middle** - Mitigation: HTTPS for all API calls - Evidence: PyPI, Docker Hub, GitHub all use TLS 4. **Supply Chain Tampering** - Mitigation: Immutable artifacts, content checksums - Evidence: PyPI stores SHA256, Docker uses content-addressable storage #### Trust Boundaries ``` ┌─────────────────────────────────────────┐ │ Trusted Zone │ │ ┌────────────────────────────────┐ │ │ │ GitHub Actions Runner │ │ │ │ - Ephemeral VM │ │ │ │ - Isolated environment │ │ │ │ - Access to secrets │ │ │ └────────────────────────────────┘ │ │ │ │ │ │ HTTPS (TLS 1.2+) │ │ ▼ │ └─────────────────────────────────────────┘ │ ┌────────────┼────────────┐ │ │ │ ▼ ▼ ▼ ┌────────┐ ┌─────────┐ ┌──────────┐ │ PyPI │ │ Docker │ │ GitHub │ │ API │ │ Hub │ │ API │ └────────┘ └─────────┘ └──────────┘ External External External Service Service Service ``` ### Secret Management #### Secret Lifecycle ``` Creation (Developer) │ ├─► PyPI: Create API token (scoped to project) ├─► Docker Hub: Create access token (read/write) │ ▼ Storage (GitHub) │ ├─► Encrypted at rest (AES-256) ├─► Access controlled (repo-scoped) │ ▼ Usage (Workflow) │ ├─► Injected as env vars ├─► Masked in logs (GitHub redacts on output) ├─► Never persisted to disk (in-memory only) │ ▼ Transmission (API Call) │ ├─► HTTPS only ├─► TLS 1.2+ with strong ciphers │ ▼ Rotation (Manual) │ └─► Regenerate on PyPI/Docker Hub Update GitHub secret ``` --- ## Performance Characteristics ### Release Pipeline Performance | Metric | Value | Notes | |--------|-------|-------| | Cold start | ~2-3 min | First run on new runner | | Warm start | ~2-3 min | Minimal caching benefit | | PyPI upload | ~30-60 sec | Network-bound | | Package build | ~30 sec | CPU-bound | | Parallelization | None | Sequential by design | ### Docker Pipeline Performance | Metric | Cold Cache | Warm Cache (code) | Warm Cache (deps) | |--------|-----------|-------------------|-------------------| | Total time | 10-15 min | 1-2 min | 3-5 min | | amd64 build | 5-7 min | 30-60 sec | 1-2 min | | arm64 build | 5-7 min | 30-60 sec | 1-2 min | | Push time | 1-2 min | 30 sec | 30 sec | | Cache hit rate | 0% | 85% | 60% | ### Cache Performance Model ```python def estimate_build_time(changes): base_time = 60 # seconds (setup + push) if "Dockerfile" in changes: return base_time + (10 * 60) # Full rebuild: ~11 min elif "requirements.txt" in changes: return base_time + (3 * 60) # Deps rebuild: ~4 min elif any(f.endswith(".py") for f in changes): return base_time + 60 # Code only: ~2 min else: return base_time # No changes: ~1 min ``` --- ## Scalability Considerations ### Current Limits | Resource | Limit | Impact | |----------|-------|--------| | Workflow concurrency | 20 (default) | Max 20 releases in parallel | | Artifact storage | 500 MB/artifact | PyPI packages small (<10 MB) | | Cache storage | 10 GB/repo | Docker layers fit comfortably | | Workflow run time | 6 hours | Plenty of headroom | ### Scaling Strategies #### Horizontal Scaling (Multiple Repos) ``` crawl4ai (main) ├─ release.yml └─ docker-release.yml crawl4ai-plugins (separate) ├─ release.yml └─ docker-release.yml Each repo has independent: - Secrets - Cache (10 GB each) - Concurrency limits (20 each) ``` #### Vertical Scaling (Larger Runners) ```yaml jobs: docker: runs-on: ubuntu-latest-8-cores # GitHub-hosted larger runner # 4x faster builds for CPU-bound layers ``` --- ## Disaster Recovery ### Failure Scenarios #### Scenario 1: Release Pipeline Fails **Failure Point**: PyPI upload fails (network error) **State**: - ✓ Version validated - ✓ Package built - ✗ PyPI upload - ✗ GitHub release **Recovery**: ```bash # Manual upload twine upload dist/* # Retry workflow (re-run from GitHub Actions UI) ``` **Prevention**: Add retry logic to PyPI upload #### Scenario 2: Docker Pipeline Fails **Failure Point**: ARM build fails (dependency issue) **State**: - ✓ PyPI published - ✓ GitHub release created - ✓ amd64 image built - ✗ arm64 image build **Recovery**: ```bash # Fix Dockerfile git commit -am "fix: ARM build dependency" # Trigger rebuild git tag docker-rebuild-v1.2.3 git push origin docker-rebuild-v1.2.3 ``` **Impact**: PyPI package available, only Docker ARM users affected #### Scenario 3: Partial Release **Failure Point**: GitHub release creation fails **State**: - ✓ PyPI published - ✗ GitHub release - ✗ Docker images **Recovery**: ```bash # Create release manually gh release create v1.2.3 \ --title "Release v1.2.3" \ --notes "..." # This triggers docker-release.yml automatically ``` --- ## Monitoring and Observability ### Metrics to Track #### Release Pipeline - Success rate (target: >99%) - Duration (target: <3 min) - PyPI upload time (target: <60 sec) #### Docker Pipeline - Success rate (target: >95%) - Duration (target: <15 min cold, <2 min warm) - Cache hit rate (target: >80% for code changes) ### Alerting **Critical Alerts**: - Release pipeline failure (blocks release) - PyPI authentication failure (expired token) **Warning Alerts**: - Docker build >15 min (performance degradation) - Cache hit rate <50% (cache issue) ### Logging **GitHub Actions Logs**: - Retention: 90 days - Downloadable: Yes - Searchable: Limited **Recommended External Logging**: ```yaml - name: Send logs to external service if: failure() run: | curl -X POST https://logs.example.com/api/v1/logs \ -H "Content-Type: application/json" \ -d "{\"workflow\": \"${{ github.workflow }}\", \"status\": \"failed\"}" ``` --- ## Future Enhancements ### Planned Improvements 1. **Automated Changelog Generation** - Use conventional commits - Generate CHANGELOG.md automatically 2. **Pre-release Testing** - Test builds on `test-v*` tags - Upload to TestPyPI 3. **Notification System** - Slack/Discord notifications on release - Email on failure 4. **Performance Optimization** - Parallel Docker builds (amd64 + arm64 simultaneously) - Persistent runners for better caching 5. **Enhanced Validation** - Smoke tests after PyPI upload - Container security scanning --- ## References - [GitHub Actions Architecture](https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions) - [Docker Build Cache](https://docs.docker.com/build/cache/) - [PyPI API Documentation](https://warehouse.pypa.io/api-reference/) --- **Last Updated**: 2025-01-21 **Version**: 2.0