Compare commits

...

9 Commits

Author SHA1 Message Date
ntohidi
13e116610d fix(marketplace): improve app detail page content rendering and UX
Fixed multiple issues with app detail page content display and formatting
2025-10-23 16:12:30 +02:00
ntohidi
97c92c4f62 fix(marketplace): replace hardcoded app detail content with database-driven fields.
The app detail page was displaying hardcoded/templated content instead of
using actual data from the database. This prevented admins from controlling
the content shown in Overview, Integration, and Documentation tabs.
2025-10-21 15:39:04 +02:00
unclecode
6d1a398419 feat(ci): split release pipeline and add Docker caching
- Split release.yml into PyPI/GitHub release and Docker workflows
- Add GitHub Actions cache for Docker builds (10-15x faster rebuilds)
- Implement dual-trigger for docker-release.yml (auto + manual)
- Add comprehensive workflow documentation in .github/workflows/docs/
- Backup original workflow as release.yml.backup
2025-10-21 10:53:12 +08:00
unclecode
c107617920 fix: thoroughly verify and fix all Crawl4AI skill examples
- Cross-checked every section against actual docs
- Fixed BM25ContentFilter parameters (user_query, bm25_threshold)
- Removed incorrect wait_for selector from basic example
- Added comprehensive test suite (4 test files)
- All examples now tested and verified working
- Tests validate: basic crawling, markdown generation, data extraction, advanced patterns
- Package size: 76.6 KB (includes tests for future validation)
2025-10-19 17:08:04 +08:00
unclecode
69d0ef89dd fix: update Crawl4AI skill with corrected parameters and examples
- Fixed CrawlerConfig → CrawlerRunConfig throughout
- Fixed parameter names (timeout → page_timeout, store_html removed)
- Fixed schema format (selector → baseSelector)
- Corrected proxy configuration (in BrowserConfig, not CrawlerRunConfig)
- Fixed fit_markdown usage with content filters
- Added comprehensive references to docs/examples/ directory
- Created safe packaging script to avoid root directory pollution
- All scripts tested and verified working
2025-10-19 16:16:20 +08:00
unclecode
1bf85bcb1a fix: remove non-existent wiki link and clarify skill usage instructions 2025-10-19 13:19:14 +08:00
unclecode
749232ba1a feat: add AI assistant skill package for Crawl4AI
- Create comprehensive skill package for AI coding assistants
- Include complete SDK reference (23K words, v0.7.4)
- Add three extraction scripts (basic, batch, pipeline)
- Implement version tracking in skill and scripts
- Add prominent download section on homepage
- Place skill in docs/assets for web distribution

The skill enables AI assistants like Claude, Cursor, and Windsurf
to effectively use Crawl4AI with optimized workflows for markdown
generation and data extraction.
2025-10-19 13:19:14 +08:00
unclecode
c7288dd2f1 docs: add complete SDK reference documentation
Add comprehensive single-page SDK reference combining:
- Installation & Setup
- Quick Start
- Core API (AsyncWebCrawler, arun, arun_many, CrawlResult)
- Configuration (BrowserConfig, CrawlerConfig, Parameters)
- Crawling Patterns
- Content Processing (Markdown, Fit Markdown, Selection, Interaction, Link & Media)
- Extraction Strategies (LLM and No-LLM)
- Advanced Features (Session Management, Hooks & Auth)

Generated using scripts/generate_sdk_docs.py in ultra-dense mode
optimized for AI assistant consumption.

Stats: 23K words, 185 code blocks, 220KB
2025-10-19 13:19:14 +08:00
ntohidi
a3f057e19f feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377
Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)
2025-10-13 12:34:08 +08:00
21 changed files with 9054 additions and 286 deletions

81
.github/workflows/docker-release.yml vendored Normal file
View File

@@ -0,0 +1,81 @@
name: Docker Release
on:
release:
types: [published]
push:
tags:
- 'docker-rebuild-v*' # Allow manual Docker rebuilds via tags
jobs:
docker:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Extract version from release or tag
id: get_version
run: |
if [ "${{ github.event_name }}" == "release" ]; then
# Triggered by release event
VERSION="${{ github.event.release.tag_name }}"
VERSION=${VERSION#v} # Remove 'v' prefix
else
# Triggered by docker-rebuild-v* tag
VERSION=${GITHUB_REF#refs/tags/docker-rebuild-v}
fi
echo "VERSION=$VERSION" >> $GITHUB_OUTPUT
echo "Building Docker images for version: $VERSION"
- name: Extract major and minor versions
id: versions
run: |
VERSION=${{ steps.get_version.outputs.VERSION }}
MAJOR=$(echo $VERSION | cut -d. -f1)
MINOR=$(echo $VERSION | cut -d. -f1-2)
echo "MAJOR=$MAJOR" >> $GITHUB_OUTPUT
echo "MINOR=$MINOR" >> $GITHUB_OUTPUT
echo "Semantic versions - Major: $MAJOR, Minor: $MINOR"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Build and push Docker images
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: |
unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}
unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}
unclecode/crawl4ai:latest
platforms: linux/amd64,linux/arm64
cache-from: type=gha
cache-to: type=gha,mode=max
- name: Summary
run: |
echo "## 🐳 Docker Release Complete!" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### Published Images" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:latest\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### Platforms" >> $GITHUB_STEP_SUMMARY
echo "- linux/amd64" >> $GITHUB_STEP_SUMMARY
echo "- linux/arm64" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🚀 Pull Command" >> $GITHUB_STEP_SUMMARY
echo "\`\`\`bash" >> $GITHUB_STEP_SUMMARY
echo "docker pull unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "\`\`\`" >> $GITHUB_STEP_SUMMARY

917
.github/workflows/docs/ARCHITECTURE.md vendored Normal file
View File

@@ -0,0 +1,917 @@
# Workflow Architecture Documentation
## Overview
This document describes the technical architecture of the split release pipeline for Crawl4AI.
---
## Architecture Diagram
```
┌─────────────────────────────────────────────────────────────────┐
│ Developer │
│ │ │
│ ▼ │
│ git tag v1.2.3 │
│ git push --tags │
└──────────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ GitHub Repository │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Tag Event: v1.2.3 │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ release.yml (Release Pipeline) │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 1. Extract Version │ │ │
│ │ │ v1.2.3 → 1.2.3 │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 2. Validate Version │ │ │
│ │ │ Tag == __version__.py │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 3. Build Python Package │ │ │
│ │ │ - Source dist (.tar.gz) │ │ │
│ │ │ - Wheel (.whl) │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 4. Upload to PyPI │ │ │
│ │ │ - Authenticate with token │ │ │
│ │ │ - Upload dist/* │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 5. Create GitHub Release │ │ │
│ │ │ - Tag: v1.2.3 │ │ │
│ │ │ - Body: Install instructions │ │ │
│ │ │ - Status: Published │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Release Event: published (v1.2.3) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ docker-release.yml (Docker Pipeline) │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 1. Extract Version from Release │ │ │
│ │ │ github.event.release.tag_name → 1.2.3 │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 2. Parse Semantic Versions │ │ │
│ │ │ 1.2.3 → Major: 1, Minor: 1.2 │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 3. Setup Multi-Arch Build │ │ │
│ │ │ - Docker Buildx │ │ │
│ │ │ - QEMU emulation │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 4. Authenticate Docker Hub │ │ │
│ │ │ - Username: DOCKER_USERNAME │ │ │
│ │ │ - Token: DOCKER_TOKEN │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 5. Build Multi-Arch Images │ │ │
│ │ │ ┌────────────────┬────────────────┐ │ │ │
│ │ │ │ linux/amd64 │ linux/arm64 │ │ │ │
│ │ │ └────────────────┴────────────────┘ │ │ │
│ │ │ Cache: GitHub Actions (type=gha) │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 6. Push to Docker Hub │ │ │
│ │ │ Tags: │ │ │
│ │ │ - unclecode/crawl4ai:1.2.3 │ │ │
│ │ │ - unclecode/crawl4ai:1.2 │ │ │
│ │ │ - unclecode/crawl4ai:1 │ │ │
│ │ │ - unclecode/crawl4ai:latest │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ External Services │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ PyPI │ │ Docker Hub │ │ GitHub │ │
│ │ │ │ │ │ │ │
│ │ crawl4ai │ │ unclecode/ │ │ Releases │ │
│ │ 1.2.3 │ │ crawl4ai │ │ v1.2.3 │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
---
## Component Details
### 1. Release Pipeline (release.yml)
#### Purpose
Fast publication of Python package and GitHub release.
#### Input
- **Trigger**: Git tag matching `v*` (excluding `test-v*`)
- **Example**: `v1.2.3`
#### Processing Stages
##### Stage 1: Version Extraction
```bash
Input: refs/tags/v1.2.3
Output: VERSION=1.2.3
```
**Implementation**:
```bash
TAG_VERSION=${GITHUB_REF#refs/tags/v} # Remove 'refs/tags/v' prefix
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
```
##### Stage 2: Version Validation
```bash
Input: TAG_VERSION=1.2.3
Check: crawl4ai/__version__.py contains __version__ = "1.2.3"
Output: Pass/Fail
```
**Implementation**:
```bash
PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
exit 1
fi
```
##### Stage 3: Package Build
```bash
Input: Source code + pyproject.toml
Output: dist/crawl4ai-1.2.3.tar.gz
dist/crawl4ai-1.2.3-py3-none-any.whl
```
**Implementation**:
```bash
python -m build
# Uses build backend defined in pyproject.toml
```
##### Stage 4: PyPI Upload
```bash
Input: dist/*.{tar.gz,whl}
Auth: PYPI_TOKEN
Output: Package published to PyPI
```
**Implementation**:
```bash
twine upload dist/*
# Environment:
# TWINE_USERNAME: __token__
# TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
```
##### Stage 5: GitHub Release Creation
```bash
Input: Tag: v1.2.3
Body: Markdown content
Output: Published GitHub release
```
**Implementation**:
```yaml
uses: softprops/action-gh-release@v2
with:
tag_name: v1.2.3
name: Release v1.2.3
body: |
Installation instructions and changelog
draft: false
prerelease: false
```
#### Output
- **PyPI Package**: https://pypi.org/project/crawl4ai/1.2.3/
- **GitHub Release**: Published release on repository
- **Event**: `release.published` (triggers Docker workflow)
#### Timeline
```
0:00 - Tag pushed
0:01 - Checkout + Python setup
0:02 - Version validation
0:03 - Package build
0:04 - PyPI upload starts
0:06 - PyPI upload complete
0:07 - GitHub release created
0:08 - Workflow complete
```
---
### 2. Docker Release Pipeline (docker-release.yml)
#### Purpose
Build and publish multi-architecture Docker images.
#### Inputs
##### Input 1: Release Event (Automatic)
```yaml
Event: release.published
Data: github.event.release.tag_name = "v1.2.3"
```
##### Input 2: Docker Rebuild Tag (Manual)
```yaml
Tag: docker-rebuild-v1.2.3
```
#### Processing Stages
##### Stage 1: Version Detection
```bash
# From release event:
VERSION = github.event.release.tag_name.strip("v")
# Result: "1.2.3"
# From rebuild tag:
VERSION = GITHUB_REF.replace("refs/tags/docker-rebuild-v", "")
# Result: "1.2.3"
```
##### Stage 2: Semantic Version Parsing
```bash
Input: VERSION=1.2.3
Output: MAJOR=1
MINOR=1.2
PATCH=3 (implicit)
```
**Implementation**:
```bash
MAJOR=$(echo $VERSION | cut -d. -f1) # Extract first component
MINOR=$(echo $VERSION | cut -d. -f1-2) # Extract first two components
```
##### Stage 3: Multi-Architecture Setup
```yaml
Setup:
- Docker Buildx (multi-platform builder)
- QEMU (ARM emulation on x86)
Platforms:
- linux/amd64 (x86_64)
- linux/arm64 (aarch64)
```
**Architecture**:
```
GitHub Runner (linux/amd64)
├─ Buildx Builder
│ ├─ Native: Build linux/amd64 image
│ └─ QEMU: Emulate ARM to build linux/arm64 image
└─ Generate manifest list (points to both images)
```
##### Stage 4: Docker Hub Authentication
```bash
Input: DOCKER_USERNAME
DOCKER_TOKEN
Output: Authenticated Docker client
```
##### Stage 5: Build with Cache
```yaml
Cache Configuration:
cache-from: type=gha # Read from GitHub Actions cache
cache-to: type=gha,mode=max # Write all layers
Cache Key Components:
- Workflow file path
- Branch name
- Architecture (amd64/arm64)
```
**Cache Hierarchy**:
```
Cache Entry: main/docker-release.yml/linux-amd64
├─ Layer: sha256:abc123... (FROM python:3.12)
├─ Layer: sha256:def456... (RUN apt-get update)
├─ Layer: sha256:ghi789... (COPY requirements.txt)
├─ Layer: sha256:jkl012... (RUN pip install)
└─ Layer: sha256:mno345... (COPY . /app)
Cache Hit/Miss Logic:
- If layer input unchanged → cache hit → skip build
- If layer input changed → cache miss → rebuild + all subsequent layers
```
##### Stage 6: Tag Generation
```bash
Input: VERSION=1.2.3, MAJOR=1, MINOR=1.2
Output Tags:
- unclecode/crawl4ai:1.2.3 (exact version)
- unclecode/crawl4ai:1.2 (minor version)
- unclecode/crawl4ai:1 (major version)
- unclecode/crawl4ai:latest (latest stable)
```
**Tag Strategy**:
- All tags point to same image SHA
- Users can pin to desired stability level
- Pushing new version updates `1`, `1.2`, and `latest` automatically
##### Stage 7: Push to Registry
```bash
For each tag:
For each platform (amd64, arm64):
Push image to Docker Hub
Create manifest list:
Manifest: unclecode/crawl4ai:1.2.3
├─ linux/amd64: sha256:abc...
└─ linux/arm64: sha256:def...
Docker CLI automatically selects correct platform on pull
```
#### Output
- **Docker Images**: 4 tags × 2 platforms = 8 image variants + 4 manifests
- **Docker Hub**: https://hub.docker.com/r/unclecode/crawl4ai/tags
#### Timeline
**Cold Cache (First Build)**:
```
0:00 - Release event received
0:01 - Checkout + Buildx setup
0:02 - Docker Hub auth
0:03 - Start build (amd64)
0:08 - Complete amd64 build
0:09 - Start build (arm64)
0:14 - Complete arm64 build
0:15 - Generate manifests
0:16 - Push all tags
0:17 - Workflow complete
```
**Warm Cache (Code Change Only)**:
```
0:00 - Release event received
0:01 - Checkout + Buildx setup
0:02 - Docker Hub auth
0:03 - Start build (amd64) - cache hit for layers 1-4
0:04 - Complete amd64 build (only layer 5 rebuilt)
0:05 - Start build (arm64) - cache hit for layers 1-4
0:06 - Complete arm64 build (only layer 5 rebuilt)
0:07 - Generate manifests
0:08 - Push all tags
0:09 - Workflow complete
```
---
## Data Flow
### Version Information Flow
```
Developer
crawl4ai/__version__.py
__version__ = "1.2.3"
├─► Git Tag
│ v1.2.3
│ │
│ ▼
│ release.yml
│ │
│ ├─► Validation
│ │ ✓ Match
│ │
│ ├─► PyPI Package
│ │ crawl4ai==1.2.3
│ │
│ └─► GitHub Release
│ v1.2.3
│ │
│ ▼
│ docker-release.yml
│ │
│ └─► Docker Tags
│ 1.2.3, 1.2, 1, latest
└─► Package Metadata
pyproject.toml
version = "1.2.3"
```
### Secrets Flow
```
GitHub Secrets (Encrypted at Rest)
├─► PYPI_TOKEN
│ │
│ ▼
│ release.yml
│ │
│ ▼
│ TWINE_PASSWORD env var (masked in logs)
│ │
│ ▼
│ PyPI API (HTTPS)
├─► DOCKER_USERNAME
│ │
│ ▼
│ docker-release.yml
│ │
│ ▼
│ docker/login-action (masked in logs)
│ │
│ ▼
│ Docker Hub API (HTTPS)
└─► DOCKER_TOKEN
docker-release.yml
docker/login-action (masked in logs)
Docker Hub API (HTTPS)
```
### Artifact Flow
```
Source Code
├─► release.yml
│ │
│ ▼
│ python -m build
│ │
│ ├─► crawl4ai-1.2.3.tar.gz
│ │ │
│ │ ▼
│ │ PyPI Storage
│ │ │
│ │ ▼
│ │ pip install crawl4ai
│ │
│ └─► crawl4ai-1.2.3-py3-none-any.whl
│ │
│ ▼
│ PyPI Storage
│ │
│ ▼
│ pip install crawl4ai
└─► docker-release.yml
docker build
├─► Image: linux/amd64
│ │
│ └─► Docker Hub
│ unclecode/crawl4ai:1.2.3-amd64
└─► Image: linux/arm64
└─► Docker Hub
unclecode/crawl4ai:1.2.3-arm64
```
---
## State Machines
### Release Pipeline State Machine
```
┌─────────┐
│ START │
└────┬────┘
┌──────────────┐
│ Extract │
│ Version │
└──────┬───────┘
┌──────────────┐ ┌─────────┐
│ Validate │─────►│ FAILED │
│ Version │ No │ (Exit 1)│
└──────┬───────┘ └─────────┘
│ Yes
┌──────────────┐
│ Build │
│ Package │
└──────┬───────┘
┌──────────────┐ ┌─────────┐
│ Upload │─────►│ FAILED │
│ to PyPI │ Error│ (Exit 1)│
└──────┬───────┘ └─────────┘
│ Success
┌──────────────┐
│ Create │
│ GH Release │
└──────┬───────┘
┌──────────────┐
│ SUCCESS │
│ (Emit Event) │
└──────────────┘
```
### Docker Pipeline State Machine
```
┌─────────┐
│ START │
│ (Event) │
└────┬────┘
┌──────────────┐
│ Detect │
│ Version │
│ Source │
└──────┬───────┘
┌──────────────┐
│ Parse │
│ Semantic │
│ Versions │
└──────┬───────┘
┌──────────────┐ ┌─────────┐
│ Authenticate │─────►│ FAILED │
│ Docker Hub │ Error│ (Exit 1)│
└──────┬───────┘ └─────────┘
│ Success
┌──────────────┐
│ Build │
│ amd64 │
└──────┬───────┘
┌──────────────┐ ┌─────────┐
│ Build │─────►│ FAILED │
│ arm64 │ Error│ (Exit 1)│
└──────┬───────┘ └─────────┘
│ Success
┌──────────────┐
│ Push All │
│ Tags │
└──────┬───────┘
┌──────────────┐
│ SUCCESS │
└──────────────┘
```
---
## Security Architecture
### Threat Model
#### Threats Mitigated
1. **Secret Exposure**
- Mitigation: GitHub Actions secret masking
- Evidence: Secrets never appear in logs
2. **Unauthorized Package Upload**
- Mitigation: Scoped PyPI tokens
- Evidence: Token limited to `crawl4ai` project
3. **Man-in-the-Middle**
- Mitigation: HTTPS for all API calls
- Evidence: PyPI, Docker Hub, GitHub all use TLS
4. **Supply Chain Tampering**
- Mitigation: Immutable artifacts, content checksums
- Evidence: PyPI stores SHA256, Docker uses content-addressable storage
#### Trust Boundaries
```
┌─────────────────────────────────────────┐
│ Trusted Zone │
│ ┌────────────────────────────────┐ │
│ │ GitHub Actions Runner │ │
│ │ - Ephemeral VM │ │
│ │ - Isolated environment │ │
│ │ - Access to secrets │ │
│ └────────────────────────────────┘ │
│ │ │
│ │ HTTPS (TLS 1.2+) │
│ ▼ │
└─────────────────────────────────────────┘
┌────────────┼────────────┐
│ │ │
▼ ▼ ▼
┌────────┐ ┌─────────┐ ┌──────────┐
│ PyPI │ │ Docker │ │ GitHub │
│ API │ │ Hub │ │ API │
└────────┘ └─────────┘ └──────────┘
External External External
Service Service Service
```
### Secret Management
#### Secret Lifecycle
```
Creation (Developer)
├─► PyPI: Create API token (scoped to project)
├─► Docker Hub: Create access token (read/write)
Storage (GitHub)
├─► Encrypted at rest (AES-256)
├─► Access controlled (repo-scoped)
Usage (Workflow)
├─► Injected as env vars
├─► Masked in logs (GitHub redacts on output)
├─► Never persisted to disk (in-memory only)
Transmission (API Call)
├─► HTTPS only
├─► TLS 1.2+ with strong ciphers
Rotation (Manual)
└─► Regenerate on PyPI/Docker Hub
Update GitHub secret
```
---
## Performance Characteristics
### Release Pipeline Performance
| Metric | Value | Notes |
|--------|-------|-------|
| Cold start | ~2-3 min | First run on new runner |
| Warm start | ~2-3 min | Minimal caching benefit |
| PyPI upload | ~30-60 sec | Network-bound |
| Package build | ~30 sec | CPU-bound |
| Parallelization | None | Sequential by design |
### Docker Pipeline Performance
| Metric | Cold Cache | Warm Cache (code) | Warm Cache (deps) |
|--------|-----------|-------------------|-------------------|
| Total time | 10-15 min | 1-2 min | 3-5 min |
| amd64 build | 5-7 min | 30-60 sec | 1-2 min |
| arm64 build | 5-7 min | 30-60 sec | 1-2 min |
| Push time | 1-2 min | 30 sec | 30 sec |
| Cache hit rate | 0% | 85% | 60% |
### Cache Performance Model
```python
def estimate_build_time(changes):
base_time = 60 # seconds (setup + push)
if "Dockerfile" in changes:
return base_time + (10 * 60) # Full rebuild: ~11 min
elif "requirements.txt" in changes:
return base_time + (3 * 60) # Deps rebuild: ~4 min
elif any(f.endswith(".py") for f in changes):
return base_time + 60 # Code only: ~2 min
else:
return base_time # No changes: ~1 min
```
---
## Scalability Considerations
### Current Limits
| Resource | Limit | Impact |
|----------|-------|--------|
| Workflow concurrency | 20 (default) | Max 20 releases in parallel |
| Artifact storage | 500 MB/artifact | PyPI packages small (<10 MB) |
| Cache storage | 10 GB/repo | Docker layers fit comfortably |
| Workflow run time | 6 hours | Plenty of headroom |
### Scaling Strategies
#### Horizontal Scaling (Multiple Repos)
```
crawl4ai (main)
├─ release.yml
└─ docker-release.yml
crawl4ai-plugins (separate)
├─ release.yml
└─ docker-release.yml
Each repo has independent:
- Secrets
- Cache (10 GB each)
- Concurrency limits (20 each)
```
#### Vertical Scaling (Larger Runners)
```yaml
jobs:
docker:
runs-on: ubuntu-latest-8-cores # GitHub-hosted larger runner
# 4x faster builds for CPU-bound layers
```
---
## Disaster Recovery
### Failure Scenarios
#### Scenario 1: Release Pipeline Fails
**Failure Point**: PyPI upload fails (network error)
**State**:
- ✓ Version validated
- ✓ Package built
- ✗ PyPI upload
- ✗ GitHub release
**Recovery**:
```bash
# Manual upload
twine upload dist/*
# Retry workflow (re-run from GitHub Actions UI)
```
**Prevention**: Add retry logic to PyPI upload
#### Scenario 2: Docker Pipeline Fails
**Failure Point**: ARM build fails (dependency issue)
**State**:
- ✓ PyPI published
- ✓ GitHub release created
- ✓ amd64 image built
- ✗ arm64 image build
**Recovery**:
```bash
# Fix Dockerfile
git commit -am "fix: ARM build dependency"
# Trigger rebuild
git tag docker-rebuild-v1.2.3
git push origin docker-rebuild-v1.2.3
```
**Impact**: PyPI package available, only Docker ARM users affected
#### Scenario 3: Partial Release
**Failure Point**: GitHub release creation fails
**State**:
- ✓ PyPI published
- ✗ GitHub release
- ✗ Docker images
**Recovery**:
```bash
# Create release manually
gh release create v1.2.3 \
--title "Release v1.2.3" \
--notes "..."
# This triggers docker-release.yml automatically
```
---
## Monitoring and Observability
### Metrics to Track
#### Release Pipeline
- Success rate (target: >99%)
- Duration (target: <3 min)
- PyPI upload time (target: <60 sec)
#### Docker Pipeline
- Success rate (target: >95%)
- Duration (target: <15 min cold, <2 min warm)
- Cache hit rate (target: >80% for code changes)
### Alerting
**Critical Alerts**:
- Release pipeline failure (blocks release)
- PyPI authentication failure (expired token)
**Warning Alerts**:
- Docker build >15 min (performance degradation)
- Cache hit rate <50% (cache issue)
### Logging
**GitHub Actions Logs**:
- Retention: 90 days
- Downloadable: Yes
- Searchable: Limited
**Recommended External Logging**:
```yaml
- name: Send logs to external service
if: failure()
run: |
curl -X POST https://logs.example.com/api/v1/logs \
-H "Content-Type: application/json" \
-d "{\"workflow\": \"${{ github.workflow }}\", \"status\": \"failed\"}"
```
---
## Future Enhancements
### Planned Improvements
1. **Automated Changelog Generation**
- Use conventional commits
- Generate CHANGELOG.md automatically
2. **Pre-release Testing**
- Test builds on `test-v*` tags
- Upload to TestPyPI
3. **Notification System**
- Slack/Discord notifications on release
- Email on failure
4. **Performance Optimization**
- Parallel Docker builds (amd64 + arm64 simultaneously)
- Persistent runners for better caching
5. **Enhanced Validation**
- Smoke tests after PyPI upload
- Container security scanning
---
## References
- [GitHub Actions Architecture](https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions)
- [Docker Build Cache](https://docs.docker.com/build/cache/)
- [PyPI API Documentation](https://warehouse.pypa.io/api-reference/)
---
**Last Updated**: 2025-01-21
**Version**: 2.0

1029
.github/workflows/docs/README.md vendored Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,287 @@
# Workflow Quick Reference
## Quick Commands
### Standard Release
```bash
# 1. Update version
vim crawl4ai/__version__.py # Set to "1.2.3"
# 2. Commit and tag
git add crawl4ai/__version__.py
git commit -m "chore: bump version to 1.2.3"
git tag v1.2.3
git push origin main
git push origin v1.2.3
# 3. Monitor
# - PyPI: ~2-3 minutes
# - Docker: ~1-15 minutes
```
### Docker Rebuild Only
```bash
git tag docker-rebuild-v1.2.3
git push origin docker-rebuild-v1.2.3
```
### Delete Tag (Undo Release)
```bash
# Local
git tag -d v1.2.3
# Remote
git push --delete origin v1.2.3
# GitHub Release
gh release delete v1.2.3
```
---
## Workflow Triggers
### release.yml
| Event | Pattern | Example |
|-------|---------|---------|
| Tag push | `v*` | `v1.2.3` |
| Excludes | `test-v*` | `test-v1.2.3` |
### docker-release.yml
| Event | Pattern | Example |
|-------|---------|---------|
| Release published | `release.published` | Automatic |
| Tag push | `docker-rebuild-v*` | `docker-rebuild-v1.2.3` |
---
## Environment Variables
### release.yml
| Variable | Source | Example |
|----------|--------|---------|
| `VERSION` | Git tag | `1.2.3` |
| `TWINE_USERNAME` | Static | `__token__` |
| `TWINE_PASSWORD` | Secret | `pypi-Ag...` |
| `GITHUB_TOKEN` | Auto | `ghp_...` |
### docker-release.yml
| Variable | Source | Example |
|----------|--------|---------|
| `VERSION` | Release/Tag | `1.2.3` |
| `MAJOR` | Computed | `1` |
| `MINOR` | Computed | `1.2` |
| `DOCKER_USERNAME` | Secret | `unclecode` |
| `DOCKER_TOKEN` | Secret | `dckr_pat_...` |
---
## Docker Tags Generated
| Version | Tags Created |
|---------|-------------|
| v1.0.0 | `1.0.0`, `1.0`, `1`, `latest` |
| v1.1.0 | `1.1.0`, `1.1`, `1`, `latest` |
| v1.2.3 | `1.2.3`, `1.2`, `1`, `latest` |
| v2.0.0 | `2.0.0`, `2.0`, `2`, `latest` |
---
## Workflow Outputs
### release.yml
| Output | Location | Time |
|--------|----------|------|
| PyPI Package | https://pypi.org/project/crawl4ai/ | ~2-3 min |
| GitHub Release | Repository → Releases | ~2-3 min |
| Workflow Summary | Actions → Run → Summary | Immediate |
### docker-release.yml
| Output | Location | Time |
|--------|----------|------|
| Docker Images | https://hub.docker.com/r/unclecode/crawl4ai | ~1-15 min |
| Workflow Summary | Actions → Run → Summary | Immediate |
---
## Common Issues
| Issue | Solution |
|-------|----------|
| Version mismatch | Update `crawl4ai/__version__.py` to match tag |
| PyPI 403 Forbidden | Check `PYPI_TOKEN` secret |
| PyPI 400 File exists | Version already published, increment version |
| Docker auth failed | Regenerate `DOCKER_TOKEN` |
| Docker build timeout | Check Dockerfile, review build logs |
| Cache not working | First build on branch always cold |
---
## Secrets Checklist
- [ ] `PYPI_TOKEN` - PyPI API token (project or account scope)
- [ ] `DOCKER_USERNAME` - Docker Hub username
- [ ] `DOCKER_TOKEN` - Docker Hub access token (read/write)
- [ ] `GITHUB_TOKEN` - Auto-provided (no action needed)
---
## Workflow Dependencies
### release.yml Dependencies
```yaml
Python: 3.12
Actions:
- actions/checkout@v4
- actions/setup-python@v5
- softprops/action-gh-release@v2
PyPI Packages:
- build
- twine
```
### docker-release.yml Dependencies
```yaml
Actions:
- actions/checkout@v4
- docker/setup-buildx-action@v3
- docker/login-action@v3
- docker/build-push-action@v5
Docker:
- Buildx
- QEMU (for multi-arch)
```
---
## Cache Information
### Type
- GitHub Actions Cache (`type=gha`)
### Storage
- **Limit**: 10GB per repository
- **Retention**: 7 days for unused entries
- **Cleanup**: Automatic LRU eviction
### Performance
| Scenario | Cache Hit | Build Time |
|----------|-----------|------------|
| First build | 0% | 10-15 min |
| Code change only | 85% | 1-2 min |
| Dependency update | 60% | 3-5 min |
| No changes | 100% | 30-60 sec |
---
## Build Platforms
| Platform | Architecture | Devices |
|----------|--------------|---------|
| linux/amd64 | x86_64 | Intel/AMD servers, AWS EC2, GCP |
| linux/arm64 | aarch64 | Apple Silicon, AWS Graviton, Raspberry Pi |
---
## Version Validation
### Pre-Tag Checklist
```bash
# Check current version
python -c "from crawl4ai.__version__ import __version__; print(__version__)"
# Verify it matches intended tag
# If tag is v1.2.3, version should be "1.2.3"
```
### Post-Release Verification
```bash
# PyPI
pip install crawl4ai==1.2.3
python -c "import crawl4ai; print(crawl4ai.__version__)"
# Docker
docker pull unclecode/crawl4ai:1.2.3
docker run unclecode/crawl4ai:1.2.3 python -c "import crawl4ai; print(crawl4ai.__version__)"
```
---
## Monitoring URLs
| Service | URL |
|---------|-----|
| GitHub Actions | `https://github.com/{owner}/{repo}/actions` |
| PyPI Project | `https://pypi.org/project/crawl4ai/` |
| Docker Hub | `https://hub.docker.com/r/unclecode/crawl4ai` |
| GitHub Releases | `https://github.com/{owner}/{repo}/releases` |
---
## Rollback Strategy
### PyPI (Cannot Delete)
```bash
# Increment patch version
git tag v1.2.4
git push origin v1.2.4
```
### Docker (Can Overwrite)
```bash
# Rebuild with fix
git tag docker-rebuild-v1.2.3
git push origin docker-rebuild-v1.2.3
```
### GitHub Release
```bash
# Delete release
gh release delete v1.2.3
# Delete tag
git push --delete origin v1.2.3
```
---
## Status Badge Markdown
```markdown
[![Release Pipeline](https://github.com/{owner}/{repo}/actions/workflows/release.yml/badge.svg)](https://github.com/{owner}/{repo}/actions/workflows/release.yml)
[![Docker Release](https://github.com/{owner}/{repo}/actions/workflows/docker-release.yml/badge.svg)](https://github.com/{owner}/{repo}/actions/workflows/docker-release.yml)
```
---
## Timeline Example
```
0:00 - Push tag v1.2.3
0:01 - release.yml starts
0:02 - Version validation passes
0:03 - Package built
0:04 - PyPI upload starts
0:06 - PyPI upload complete ✓
0:07 - GitHub release created ✓
0:08 - release.yml complete
0:08 - docker-release.yml triggered
0:10 - Docker build starts
0:12 - amd64 image built (cache hit)
0:14 - arm64 image built (cache hit)
0:15 - Images pushed to Docker Hub ✓
0:16 - docker-release.yml complete
Total: ~16 minutes
Critical path (PyPI + GitHub): ~8 minutes
```
---
## Contact
For workflow issues:
1. Check Actions tab for logs
2. Review this reference
3. See [README.md](./README.md) for detailed docs

View File

@@ -10,53 +10,53 @@ jobs:
runs-on: ubuntu-latest
permissions:
contents: write # Required for creating releases
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Extract version from tag
id: get_version
run: |
TAG_VERSION=${GITHUB_REF#refs/tags/v}
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
echo "Releasing version: $TAG_VERSION"
- name: Install package dependencies
run: |
pip install -e .
- name: Check version consistency
run: |
TAG_VERSION=${{ steps.get_version.outputs.VERSION }}
PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
echo "Tag version: $TAG_VERSION"
echo "Package version: $PACKAGE_VERSION"
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
echo "❌ Version mismatch! Tag: $TAG_VERSION, Package: $PACKAGE_VERSION"
echo "Please update crawl4ai/__version__.py to match the tag version"
exit 1
fi
echo "✅ Version check passed: $TAG_VERSION"
- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install build twine
- name: Build package
run: python -m build
- name: Check package
run: twine check dist/*
- name: Upload to PyPI
env:
TWINE_USERNAME: __token__
@@ -65,37 +65,7 @@ jobs:
echo "📦 Uploading to PyPI..."
twine upload dist/*
echo "✅ Package uploaded to https://pypi.org/project/crawl4ai/"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Extract major and minor versions
id: versions
run: |
VERSION=${{ steps.get_version.outputs.VERSION }}
MAJOR=$(echo $VERSION | cut -d. -f1)
MINOR=$(echo $VERSION | cut -d. -f1-2)
echo "MAJOR=$MAJOR" >> $GITHUB_OUTPUT
echo "MINOR=$MINOR" >> $GITHUB_OUTPUT
- name: Build and push Docker images
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: |
unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}
unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}
unclecode/crawl4ai:latest
platforms: linux/amd64,linux/arm64
- name: Create GitHub Release
uses: softprops/action-gh-release@v2
with:
@@ -103,26 +73,29 @@ jobs:
name: Release v${{ steps.get_version.outputs.VERSION }}
body: |
## 🎉 Crawl4AI v${{ steps.get_version.outputs.VERSION }} Released!
### 📦 Installation
**PyPI:**
```bash
pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}
```
**Docker:**
```bash
docker pull unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
docker pull unclecode/crawl4ai:latest
```
**Note:** Docker images are being built and will be available shortly.
Check the [Docker Release workflow](https://github.com/${{ github.repository }}/actions/workflows/docker-release.yml) for build status.
### 📝 What's Changed
See [CHANGELOG.md](https://github.com/${{ github.repository }}/blob/main/CHANGELOG.md) for details.
draft: false
prerelease: false
token: ${{ secrets.GITHUB_TOKEN }}
- name: Summary
run: |
echo "## 🚀 Release Complete!" >> $GITHUB_STEP_SUMMARY
@@ -132,11 +105,9 @@ jobs:
echo "- URL: https://pypi.org/project/crawl4ai/" >> $GITHUB_STEP_SUMMARY
echo "- Install: \`pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🐳 Docker Images" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:latest\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📋 GitHub Release" >> $GITHUB_STEP_SUMMARY
echo "https://github.com/${{ github.repository }}/releases/tag/v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "- https://github.com/${{ github.repository }}/releases/tag/v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🐳 Docker Images" >> $GITHUB_STEP_SUMMARY
echo "Docker images are being built in a separate workflow." >> $GITHUB_STEP_SUMMARY
echo "Check: https://github.com/${{ github.repository }}/actions/workflows/docker-release.yml" >> $GITHUB_STEP_SUMMARY

142
.github/workflows/release.yml.backup vendored Normal file
View File

@@ -0,0 +1,142 @@
name: Release Pipeline
on:
push:
tags:
- 'v*'
- '!test-v*' # Exclude test tags
jobs:
release:
runs-on: ubuntu-latest
permissions:
contents: write # Required for creating releases
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Extract version from tag
id: get_version
run: |
TAG_VERSION=${GITHUB_REF#refs/tags/v}
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
echo "Releasing version: $TAG_VERSION"
- name: Install package dependencies
run: |
pip install -e .
- name: Check version consistency
run: |
TAG_VERSION=${{ steps.get_version.outputs.VERSION }}
PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
echo "Tag version: $TAG_VERSION"
echo "Package version: $PACKAGE_VERSION"
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
echo "❌ Version mismatch! Tag: $TAG_VERSION, Package: $PACKAGE_VERSION"
echo "Please update crawl4ai/__version__.py to match the tag version"
exit 1
fi
echo "✅ Version check passed: $TAG_VERSION"
- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install build twine
- name: Build package
run: python -m build
- name: Check package
run: twine check dist/*
- name: Upload to PyPI
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
run: |
echo "📦 Uploading to PyPI..."
twine upload dist/*
echo "✅ Package uploaded to https://pypi.org/project/crawl4ai/"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Extract major and minor versions
id: versions
run: |
VERSION=${{ steps.get_version.outputs.VERSION }}
MAJOR=$(echo $VERSION | cut -d. -f1)
MINOR=$(echo $VERSION | cut -d. -f1-2)
echo "MAJOR=$MAJOR" >> $GITHUB_OUTPUT
echo "MINOR=$MINOR" >> $GITHUB_OUTPUT
- name: Build and push Docker images
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: |
unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}
unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}
unclecode/crawl4ai:latest
platforms: linux/amd64,linux/arm64
- name: Create GitHub Release
uses: softprops/action-gh-release@v2
with:
tag_name: v${{ steps.get_version.outputs.VERSION }}
name: Release v${{ steps.get_version.outputs.VERSION }}
body: |
## 🎉 Crawl4AI v${{ steps.get_version.outputs.VERSION }} Released!
### 📦 Installation
**PyPI:**
```bash
pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}
```
**Docker:**
```bash
docker pull unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
docker pull unclecode/crawl4ai:latest
```
### 📝 What's Changed
See [CHANGELOG.md](https://github.com/${{ github.repository }}/blob/main/CHANGELOG.md) for details.
draft: false
prerelease: false
token: ${{ secrets.GITHUB_TOKEN }}
- name: Summary
run: |
echo "## 🚀 Release Complete!" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📦 PyPI Package" >> $GITHUB_STEP_SUMMARY
echo "- Version: ${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "- URL: https://pypi.org/project/crawl4ai/" >> $GITHUB_STEP_SUMMARY
echo "- Install: \`pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🐳 Docker Images" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:latest\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📋 GitHub Release" >> $GITHUB_STEP_SUMMARY
echo "https://github.com/${{ github.repository }}/releases/tag/v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY

2
.gitignore vendored
View File

@@ -266,6 +266,8 @@ continue_config.json
.llm.env
.private/
.claude/
CLAUDE_MONITOR.md
CLAUDE.md

View File

@@ -103,7 +103,8 @@ from .browser_adapter import (
from .utils import (
start_colab_display_server,
setup_colab_environment
setup_colab_environment,
hooks_to_string
)
__all__ = [
@@ -183,6 +184,7 @@ __all__ = [
"ProxyConfig",
"start_colab_display_server",
"setup_colab_environment",
"hooks_to_string",
# C4A Script additions
"c4a_compile",
"c4a_validate",

View File

@@ -1,4 +1,4 @@
from typing import List, Optional, Union, AsyncGenerator, Dict, Any
from typing import List, Optional, Union, AsyncGenerator, Dict, Any, Callable
import httpx
import json
from urllib.parse import urljoin
@@ -7,6 +7,7 @@ import asyncio
from .async_configs import BrowserConfig, CrawlerRunConfig
from .models import CrawlResult
from .async_logger import AsyncLogger, LogLevel
from .utils import hooks_to_string
class Crawl4aiClientError(Exception):
@@ -70,17 +71,41 @@ class Crawl4aiDockerClient:
self.logger.error(f"Server unreachable: {str(e)}", tag="ERROR")
raise ConnectionError(f"Cannot connect to server: {str(e)}")
def _prepare_request(self, urls: List[str], browser_config: Optional[BrowserConfig] = None,
crawler_config: Optional[CrawlerRunConfig] = None) -> Dict[str, Any]:
def _prepare_request(
self,
urls: List[str],
browser_config: Optional[BrowserConfig] = None,
crawler_config: Optional[CrawlerRunConfig] = None,
hooks: Optional[Union[Dict[str, Callable], Dict[str, str]]] = None,
hooks_timeout: int = 30
) -> Dict[str, Any]:
"""Prepare request data from configs."""
if self._token:
self._http_client.headers["Authorization"] = f"Bearer {self._token}"
return {
request_data = {
"urls": urls,
"browser_config": browser_config.dump() if browser_config else {},
"crawler_config": crawler_config.dump() if crawler_config else {}
}
# Handle hooks if provided
if hooks:
# Check if hooks are already strings or need conversion
if any(callable(v) for v in hooks.values()):
# Convert function objects to strings
hooks_code = hooks_to_string(hooks)
else:
# Already in string format
hooks_code = hooks
request_data["hooks"] = {
"code": hooks_code,
"timeout": hooks_timeout
}
return request_data
async def _request(self, method: str, endpoint: str, **kwargs) -> httpx.Response:
"""Make an HTTP request with error handling."""
url = urljoin(self.base_url, endpoint)
@@ -102,16 +127,42 @@ class Crawl4aiDockerClient:
self,
urls: List[str],
browser_config: Optional[BrowserConfig] = None,
crawler_config: Optional[CrawlerRunConfig] = None
crawler_config: Optional[CrawlerRunConfig] = None,
hooks: Optional[Union[Dict[str, Callable], Dict[str, str]]] = None,
hooks_timeout: int = 30
) -> Union[CrawlResult, List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
"""Execute a crawl operation."""
"""
Execute a crawl operation.
Args:
urls: List of URLs to crawl
browser_config: Browser configuration
crawler_config: Crawler configuration
hooks: Optional hooks - can be either:
- Dict[str, Callable]: Function objects that will be converted to strings
- Dict[str, str]: Already stringified hook code
hooks_timeout: Timeout in seconds for each hook execution (1-120)
Returns:
Single CrawlResult, list of results, or async generator for streaming
Example with function hooks:
>>> async def my_hook(page, context, **kwargs):
... await page.set_viewport_size({"width": 1920, "height": 1080})
... return page
>>>
>>> result = await client.crawl(
... ["https://example.com"],
... hooks={"on_page_context_created": my_hook}
... )
"""
await self._check_server()
data = self._prepare_request(urls, browser_config, crawler_config)
data = self._prepare_request(urls, browser_config, crawler_config, hooks, hooks_timeout)
is_streaming = crawler_config and crawler_config.stream
self.logger.info(f"Crawling {len(urls)} URLs {'(streaming)' if is_streaming else ''}", tag="CRAWL")
if is_streaming:
async def stream_results() -> AsyncGenerator[CrawlResult, None]:
async with self._http_client.stream("POST", f"{self.base_url}/crawl/stream", json=data) as response:
@@ -128,12 +179,12 @@ class Crawl4aiDockerClient:
else:
yield CrawlResult(**result)
return stream_results()
response = await self._request("POST", "/crawl", json=data)
result_data = response.json()
if not result_data.get("success", False):
raise RequestError(f"Crawl failed: {result_data.get('msg', 'Unknown error')}")
results = [CrawlResult(**r) for r in result_data.get("results", [])]
self.logger.success(f"Crawl completed with {len(results)} results", tag="CRAWL")
return results[0] if len(results) == 1 else results

View File

@@ -47,6 +47,7 @@ from urllib.parse import (
urljoin, urlparse, urlunparse,
parse_qsl, urlencode, quote, unquote
)
import inspect
# Monkey patch to fix wildcard handling in urllib.robotparser
@@ -3529,4 +3530,52 @@ def get_memory_stats() -> Tuple[float, float, float]:
available_gb = get_true_available_memory_gb()
used_percent = get_true_memory_usage_percent()
return used_percent, available_gb, total_gb
return used_percent, available_gb, total_gb
# Hook utilities for Docker API
def hooks_to_string(hooks: Dict[str, Callable]) -> Dict[str, str]:
"""
Convert hook function objects to string representations for Docker API.
This utility simplifies the process of using hooks with the Docker API by converting
Python function objects into the string format required by the API.
Args:
hooks: Dictionary mapping hook point names to Python function objects.
Functions should be async and follow hook signature requirements.
Returns:
Dictionary mapping hook point names to string representations of the functions.
Example:
>>> async def my_hook(page, context, **kwargs):
... await page.set_viewport_size({"width": 1920, "height": 1080})
... return page
>>>
>>> hooks_dict = {"on_page_context_created": my_hook}
>>> api_hooks = hooks_to_string(hooks_dict)
>>> # api_hooks is now ready to use with Docker API
Raises:
ValueError: If a hook is not callable or source cannot be extracted
"""
result = {}
for hook_name, hook_func in hooks.items():
if not callable(hook_func):
raise ValueError(f"Hook '{hook_name}' must be a callable function, got {type(hook_func)}")
try:
# Get the source code of the function
source = inspect.getsource(hook_func)
# Remove any leading indentation to get clean source
source = textwrap.dedent(source)
result[hook_name] = source
except (OSError, TypeError) as e:
raise ValueError(
f"Cannot extract source code for hook '{hook_name}'. "
f"Make sure the function is defined in a file (not interactively). Error: {e}"
)
return result

View File

@@ -0,0 +1,522 @@
#!/usr/bin/env python3
"""
Comprehensive hooks examples using Docker Client with function objects.
This approach is recommended because:
- Write hooks as regular Python functions
- Full IDE support (autocomplete, type checking)
- Automatic conversion to API format
- Reusable and testable code
- Clean, readable syntax
"""
import asyncio
from crawl4ai import Crawl4aiDockerClient
# API_BASE_URL = "http://localhost:11235"
API_BASE_URL = "http://localhost:11234"
# ============================================================================
# Hook Function Definitions
# ============================================================================
# --- All Hooks Demo ---
async def browser_created_hook(browser, **kwargs):
"""Called after browser is created"""
print("[HOOK] Browser created and ready")
return browser
async def page_context_hook(page, context, **kwargs):
"""Setup page environment"""
print("[HOOK] Setting up page environment")
# Set viewport
await page.set_viewport_size({"width": 1920, "height": 1080})
# Add cookies
await context.add_cookies([{
"name": "test_session",
"value": "abc123xyz",
"domain": ".httpbin.org",
"path": "/"
}])
# Block resources
await context.route("**/*.{png,jpg,jpeg,gif}", lambda route: route.abort())
await context.route("**/analytics/*", lambda route: route.abort())
print("[HOOK] Environment configured")
return page
async def user_agent_hook(page, context, user_agent, **kwargs):
"""Called when user agent is updated"""
print(f"[HOOK] User agent: {user_agent[:50]}...")
return page
async def before_goto_hook(page, context, url, **kwargs):
"""Called before navigating to URL"""
print(f"[HOOK] Navigating to: {url}")
await page.set_extra_http_headers({
"X-Custom-Header": "crawl4ai-test",
"Accept-Language": "en-US"
})
return page
async def after_goto_hook(page, context, url, response, **kwargs):
"""Called after page loads"""
print(f"[HOOK] Page loaded: {url}")
await page.wait_for_timeout(1000)
try:
await page.wait_for_selector("body", timeout=2000)
print("[HOOK] Body element ready")
except:
print("[HOOK] Timeout, continuing")
return page
async def execution_started_hook(page, context, **kwargs):
"""Called when custom JS execution starts"""
print("[HOOK] JS execution started")
await page.evaluate("console.log('[HOOK] Custom JS');")
return page
async def before_retrieve_hook(page, context, **kwargs):
"""Called before retrieving HTML"""
print("[HOOK] Preparing HTML retrieval")
# Scroll for lazy content
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
await page.wait_for_timeout(500)
await page.evaluate("window.scrollTo(0, 0);")
print("[HOOK] Scrolling complete")
return page
async def before_return_hook(page, context, html, **kwargs):
"""Called before returning HTML"""
print(f"[HOOK] HTML ready: {len(html)} chars")
metrics = await page.evaluate('''() => ({
images: document.images.length,
links: document.links.length,
scripts: document.scripts.length
})''')
print(f"[HOOK] Metrics - Images: {metrics['images']}, Links: {metrics['links']}")
return page
# --- Authentication Hooks ---
async def auth_context_hook(page, context, **kwargs):
"""Setup authentication context"""
print("[HOOK] Setting up authentication")
# Add auth cookies
await context.add_cookies([{
"name": "auth_token",
"value": "fake_jwt_token",
"domain": ".httpbin.org",
"path": "/",
"httpOnly": True
}])
# Set localStorage
await page.evaluate('''
localStorage.setItem('user_id', '12345');
localStorage.setItem('auth_time', new Date().toISOString());
''')
print("[HOOK] Auth context ready")
return page
async def auth_headers_hook(page, context, url, **kwargs):
"""Add authentication headers"""
print(f"[HOOK] Adding auth headers for {url}")
import base64
credentials = base64.b64encode(b"user:passwd").decode('ascii')
await page.set_extra_http_headers({
'Authorization': f'Basic {credentials}',
'X-API-Key': 'test-key-123'
})
return page
# --- Performance Optimization Hooks ---
async def performance_hook(page, context, **kwargs):
"""Optimize page for performance"""
print("[HOOK] Optimizing for performance")
# Block resource-heavy content
await context.route("**/*.{png,jpg,jpeg,gif,webp,svg}", lambda r: r.abort())
await context.route("**/*.{woff,woff2,ttf}", lambda r: r.abort())
await context.route("**/*.{mp4,webm,ogg}", lambda r: r.abort())
await context.route("**/googletagmanager.com/*", lambda r: r.abort())
await context.route("**/google-analytics.com/*", lambda r: r.abort())
await context.route("**/facebook.com/*", lambda r: r.abort())
# Disable animations
await page.add_style_tag(content='''
*, *::before, *::after {
animation-duration: 0s !important;
transition-duration: 0s !important;
}
''')
print("[HOOK] Optimizations applied")
return page
async def cleanup_hook(page, context, **kwargs):
"""Clean page before extraction"""
print("[HOOK] Cleaning page")
await page.evaluate('''() => {
const selectors = [
'.ad', '.ads', '.advertisement',
'.popup', '.modal', '.overlay',
'.cookie-banner', '.newsletter'
];
selectors.forEach(sel => {
document.querySelectorAll(sel).forEach(el => el.remove());
});
document.querySelectorAll('script, style').forEach(el => el.remove());
}''')
print("[HOOK] Page cleaned")
return page
# --- Content Extraction Hooks ---
async def wait_dynamic_content_hook(page, context, url, response, **kwargs):
"""Wait for dynamic content to load"""
print(f"[HOOK] Waiting for dynamic content on {url}")
await page.wait_for_timeout(2000)
# Click "Load More" if exists
try:
load_more = await page.query_selector('[class*="load-more"], button:has-text("Load More")')
if load_more:
await load_more.click()
await page.wait_for_timeout(1000)
print("[HOOK] Clicked 'Load More'")
except:
pass
return page
async def extract_metadata_hook(page, context, **kwargs):
"""Extract page metadata"""
print("[HOOK] Extracting metadata")
metadata = await page.evaluate('''() => {
const getMeta = (name) => {
const el = document.querySelector(`meta[name="${name}"], meta[property="${name}"]`);
return el ? el.getAttribute('content') : null;
};
return {
title: document.title,
description: getMeta('description'),
author: getMeta('author'),
keywords: getMeta('keywords'),
};
}''')
print(f"[HOOK] Metadata: {metadata}")
# Infinite scroll
for i in range(3):
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
await page.wait_for_timeout(1000)
print(f"[HOOK] Scroll {i+1}/3")
return page
# --- Multi-URL Hooks ---
async def url_specific_hook(page, context, url, **kwargs):
"""Apply URL-specific logic"""
print(f"[HOOK] Processing URL: {url}")
# URL-specific headers
if 'html' in url:
await page.set_extra_http_headers({"X-Type": "HTML"})
elif 'json' in url:
await page.set_extra_http_headers({"X-Type": "JSON"})
return page
async def track_progress_hook(page, context, url, response, **kwargs):
"""Track crawl progress"""
status = response.status if response else 'unknown'
print(f"[HOOK] Loaded {url} - Status: {status}")
return page
# ============================================================================
# Test Functions
# ============================================================================
async def test_all_hooks_comprehensive():
"""Test all 8 hook types"""
print("=" * 70)
print("Test 1: All Hooks Comprehensive Demo (Docker Client)")
print("=" * 70)
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nCrawling with all 8 hooks...")
# Define hooks with function objects
hooks = {
"on_browser_created": browser_created_hook,
"on_page_context_created": page_context_hook,
"on_user_agent_updated": user_agent_hook,
"before_goto": before_goto_hook,
"after_goto": after_goto_hook,
"on_execution_started": execution_started_hook,
"before_retrieve_html": before_retrieve_hook,
"before_return_html": before_return_hook
}
result = await client.crawl(
["https://httpbin.org/html"],
hooks=hooks,
hooks_timeout=30
)
print("\n✅ Success!")
print(f" URL: {result.url}")
print(f" Success: {result.success}")
print(f" HTML: {len(result.html)} chars")
async def test_authentication_workflow():
"""Test authentication with hooks"""
print("\n" + "=" * 70)
print("Test 2: Authentication Workflow (Docker Client)")
print("=" * 70)
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nTesting authentication...")
hooks = {
"on_page_context_created": auth_context_hook,
"before_goto": auth_headers_hook
}
result = await client.crawl(
["https://httpbin.org/basic-auth/user/passwd"],
hooks=hooks,
hooks_timeout=15
)
print("\n✅ Authentication completed")
if result.success:
if '"authenticated"' in result.html and 'true' in result.html:
print(" ✅ Basic auth successful!")
else:
print(" ⚠️ Auth status unclear")
else:
print(f" ❌ Failed: {result.error_message}")
async def test_performance_optimization():
"""Test performance optimization"""
print("\n" + "=" * 70)
print("Test 3: Performance Optimization (Docker Client)")
print("=" * 70)
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nTesting performance hooks...")
hooks = {
"on_page_context_created": performance_hook,
"before_retrieve_html": cleanup_hook
}
result = await client.crawl(
["https://httpbin.org/html"],
hooks=hooks,
hooks_timeout=10
)
print("\n✅ Optimization completed")
print(f" HTML size: {len(result.html):,} chars")
print(" Resources blocked, ads removed")
async def test_content_extraction():
"""Test content extraction"""
print("\n" + "=" * 70)
print("Test 4: Content Extraction (Docker Client)")
print("=" * 70)
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nTesting extraction hooks...")
hooks = {
"after_goto": wait_dynamic_content_hook,
"before_retrieve_html": extract_metadata_hook
}
result = await client.crawl(
["https://www.kidocode.com/"],
hooks=hooks,
hooks_timeout=20
)
print("\n✅ Extraction completed")
print(f" URL: {result.url}")
print(f" Success: {result.success}")
print(f" Metadata: {result.metadata}")
async def test_multi_url_crawl():
"""Test hooks with multiple URLs"""
print("\n" + "=" * 70)
print("Test 5: Multi-URL Crawl (Docker Client)")
print("=" * 70)
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nCrawling multiple URLs...")
hooks = {
"before_goto": url_specific_hook,
"after_goto": track_progress_hook
}
results = await client.crawl(
[
"https://httpbin.org/html",
"https://httpbin.org/json",
"https://httpbin.org/xml"
],
hooks=hooks,
hooks_timeout=15
)
print("\n✅ Multi-URL crawl completed")
print(f"\n Crawled {len(results)} URLs:")
for i, result in enumerate(results, 1):
status = "" if result.success else ""
print(f" {status} {i}. {result.url}")
async def test_reusable_hook_library():
"""Test using reusable hook library"""
print("\n" + "=" * 70)
print("Test 6: Reusable Hook Library (Docker Client)")
print("=" * 70)
# Create a library of reusable hooks
class HookLibrary:
@staticmethod
async def block_images(page, context, **kwargs):
"""Block all images"""
await context.route("**/*.{png,jpg,jpeg,gif}", lambda r: r.abort())
print("[LIBRARY] Images blocked")
return page
@staticmethod
async def block_analytics(page, context, **kwargs):
"""Block analytics"""
await context.route("**/analytics/*", lambda r: r.abort())
await context.route("**/google-analytics.com/*", lambda r: r.abort())
print("[LIBRARY] Analytics blocked")
return page
@staticmethod
async def scroll_infinite(page, context, **kwargs):
"""Handle infinite scroll"""
for i in range(5):
prev = await page.evaluate("document.body.scrollHeight")
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
await page.wait_for_timeout(1000)
curr = await page.evaluate("document.body.scrollHeight")
if curr == prev:
break
print("[LIBRARY] Infinite scroll complete")
return page
async with Crawl4aiDockerClient(base_url=API_BASE_URL, verbose=False) as client:
print("\nUsing hook library...")
hooks = {
"on_page_context_created": HookLibrary.block_images,
"before_retrieve_html": HookLibrary.scroll_infinite
}
result = await client.crawl(
["https://www.kidocode.com/"],
hooks=hooks,
hooks_timeout=20
)
print("\n✅ Library hooks completed")
print(f" Success: {result.success}")
# ============================================================================
# Main
# ============================================================================
async def main():
"""Run all Docker client hook examples"""
print("🔧 Crawl4AI Docker Client - Hooks Examples (Function-Based)")
print("Using Python function objects with automatic conversion")
print("=" * 70)
tests = [
("All Hooks Demo", test_all_hooks_comprehensive),
("Authentication", test_authentication_workflow),
("Performance", test_performance_optimization),
("Extraction", test_content_extraction),
("Multi-URL", test_multi_url_crawl),
("Hook Library", test_reusable_hook_library)
]
for i, (name, test_func) in enumerate(tests, 1):
try:
await test_func()
print(f"\n✅ Test {i}/{len(tests)}: {name} completed\n")
except Exception as e:
print(f"\n❌ Test {i}/{len(tests)}: {name} failed: {e}\n")
import traceback
traceback.print_exc()
print("=" * 70)
print("🎉 All Docker client hook examples completed!")
print("\n💡 Key Benefits of Function-Based Hooks:")
print(" • Write as regular Python functions")
print(" • Full IDE support (autocomplete, types)")
print(" • Automatic conversion to API format")
print(" • Reusable across projects")
print(" • Clean, readable code")
print(" • Easy to test and debug")
print("=" * 70)
if __name__ == "__main__":
asyncio.run(main())

Binary file not shown.

File diff suppressed because it is too large Load Diff

View File

@@ -6,18 +6,6 @@
- [Option 1: Using Pre-built Docker Hub Images (Recommended)](#option-1-using-pre-built-docker-hub-images-recommended)
- [Option 2: Using Docker Compose](#option-2-using-docker-compose)
- [Option 3: Manual Local Build & Run](#option-3-manual-local-build--run)
- [Dockerfile Parameters](#dockerfile-parameters)
- [Using the API](#using-the-api)
- [Playground Interface](#playground-interface)
- [Python SDK](#python-sdk)
- [Understanding Request Schema](#understanding-request-schema)
- [REST API Examples](#rest-api-examples)
- [Additional API Endpoints](#additional-api-endpoints)
- [HTML Extraction Endpoint](#html-extraction-endpoint)
- [Screenshot Endpoint](#screenshot-endpoint)
- [PDF Export Endpoint](#pdf-export-endpoint)
- [JavaScript Execution Endpoint](#javascript-execution-endpoint)
- [Library Context Endpoint](#library-context-endpoint)
- [MCP (Model Context Protocol) Support](#mcp-model-context-protocol-support)
- [What is MCP?](#what-is-mcp)
- [Connecting via MCP](#connecting-via-mcp)
@@ -25,9 +13,28 @@
- [Available MCP Tools](#available-mcp-tools)
- [Testing MCP Connections](#testing-mcp-connections)
- [MCP Schemas](#mcp-schemas)
- [Additional API Endpoints](#additional-api-endpoints)
- [HTML Extraction Endpoint](#html-extraction-endpoint)
- [Screenshot Endpoint](#screenshot-endpoint)
- [PDF Export Endpoint](#pdf-export-endpoint)
- [JavaScript Execution Endpoint](#javascript-execution-endpoint)
- [User-Provided Hooks API](#user-provided-hooks-api)
- [Hook Information Endpoint](#hook-information-endpoint)
- [Available Hook Points](#available-hook-points)
- [Using Hooks in Requests](#using-hooks-in-requests)
- [Hook Examples with Real URLs](#hook-examples-with-real-urls)
- [Security Best Practices](#security-best-practices)
- [Hook Response Information](#hook-response-information)
- [Error Handling](#error-handling)
- [Hooks Utility: Function-Based Approach (Python)](#hooks-utility-function-based-approach-python)
- [Dockerfile Parameters](#dockerfile-parameters)
- [Using the API](#using-the-api)
- [Playground Interface](#playground-interface)
- [Python SDK](#python-sdk)
- [Understanding Request Schema](#understanding-request-schema)
- [REST API Examples](#rest-api-examples)
- [LLM Configuration Examples](#llm-configuration-examples)
- [Metrics & Monitoring](#metrics--monitoring)
- [Deployment Scenarios](#deployment-scenarios)
- [Complete Examples](#complete-examples)
- [Server Configuration](#server-configuration)
- [Understanding config.yml](#understanding-configyml)
- [JWT Authentication](#jwt-authentication)
@@ -832,6 +839,275 @@ else:
> 💡 **Remember**: Always test your hooks on safe, known websites first before using them on production sites. Never crawl sites that you don't have permission to access or that might be malicious.
### Hooks Utility: Function-Based Approach (Python)
For Python developers, Crawl4AI provides a more convenient way to work with hooks using the `hooks_to_string()` utility function and Docker client integration.
#### Why Use Function-Based Hooks?
**String-Based Approach (shown above)**:
```python
hooks_code = {
"on_page_context_created": """
async def hook(page, context, **kwargs):
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
"""
}
```
**Function-Based Approach (recommended for Python)**:
```python
from crawl4ai import Crawl4aiDockerClient
async def my_hook(page, context, **kwargs):
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
async with Crawl4aiDockerClient(base_url="http://localhost:11235") as client:
result = await client.crawl(
["https://example.com"],
hooks={"on_page_context_created": my_hook}
)
```
**Benefits**:
- ✅ Write hooks as regular Python functions
- ✅ Full IDE support (autocomplete, syntax highlighting, type checking)
- ✅ Easy to test and debug
- ✅ Reusable hook libraries
- ✅ Automatic conversion to API format
#### Using the Hooks Utility
The `hooks_to_string()` utility converts Python function objects to the string format required by the API:
```python
from crawl4ai import hooks_to_string
# Define your hooks as functions
async def setup_hook(page, context, **kwargs):
await page.set_viewport_size({"width": 1920, "height": 1080})
await context.add_cookies([{
"name": "session",
"value": "token",
"domain": ".example.com"
}])
return page
async def scroll_hook(page, context, **kwargs):
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
return page
# Convert to string format
hooks_dict = {
"on_page_context_created": setup_hook,
"before_retrieve_html": scroll_hook
}
hooks_string = hooks_to_string(hooks_dict)
# Now use with REST API or Docker client
# hooks_string contains the string representations
```
#### Docker Client with Automatic Conversion
The Docker client automatically detects and converts function objects:
```python
from crawl4ai import Crawl4aiDockerClient
async def auth_hook(page, context, **kwargs):
"""Add authentication cookies"""
await context.add_cookies([{
"name": "auth_token",
"value": "your_token",
"domain": ".example.com"
}])
return page
async def performance_hook(page, context, **kwargs):
"""Block unnecessary resources"""
await context.route("**/*.{png,jpg,gif}", lambda r: r.abort())
await context.route("**/analytics/*", lambda r: r.abort())
return page
async with Crawl4aiDockerClient(base_url="http://localhost:11235") as client:
# Pass functions directly - automatic conversion!
result = await client.crawl(
["https://example.com"],
hooks={
"on_page_context_created": performance_hook,
"before_goto": auth_hook
},
hooks_timeout=30 # Optional timeout in seconds (1-120)
)
print(f"Success: {result.success}")
print(f"HTML: {len(result.html)} chars")
```
#### Creating Reusable Hook Libraries
Build collections of reusable hooks:
```python
# hooks_library.py
class CrawlHooks:
"""Reusable hook collection for common crawling tasks"""
@staticmethod
async def block_images(page, context, **kwargs):
"""Block all images to speed up crawling"""
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda r: r.abort())
return page
@staticmethod
async def block_analytics(page, context, **kwargs):
"""Block analytics and tracking scripts"""
tracking_domains = [
"**/google-analytics.com/*",
"**/googletagmanager.com/*",
"**/facebook.com/tr/*",
"**/doubleclick.net/*"
]
for domain in tracking_domains:
await context.route(domain, lambda r: r.abort())
return page
@staticmethod
async def scroll_infinite(page, context, **kwargs):
"""Handle infinite scroll to load more content"""
previous_height = 0
for i in range(5): # Max 5 scrolls
current_height = await page.evaluate("document.body.scrollHeight")
if current_height == previous_height:
break
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
previous_height = current_height
return page
@staticmethod
async def wait_for_dynamic_content(page, context, url, response, **kwargs):
"""Wait for dynamic content to load"""
await page.wait_for_timeout(2000)
try:
# Click "Load More" if present
load_more = await page.query_selector('[class*="load-more"]')
if load_more:
await load_more.click()
await page.wait_for_timeout(1000)
except:
pass
return page
# Use in your application
from hooks_library import CrawlHooks
from crawl4ai import Crawl4aiDockerClient
async def crawl_with_optimizations(url):
async with Crawl4aiDockerClient() as client:
result = await client.crawl(
[url],
hooks={
"on_page_context_created": CrawlHooks.block_images,
"before_retrieve_html": CrawlHooks.scroll_infinite
}
)
return result
```
#### Choosing the Right Approach
| Approach | Best For | IDE Support | Language |
|----------|----------|-------------|----------|
| **String-based** | Non-Python clients, REST APIs, other languages | ❌ None | Any |
| **Function-based** | Python applications, local development | ✅ Full | Python only |
| **Docker Client** | Python apps with automatic conversion | ✅ Full | Python only |
**Recommendation**:
- **Python applications**: Use Docker client with function objects (easiest)
- **Non-Python or REST API**: Use string-based hooks (most flexible)
- **Manual control**: Use `hooks_to_string()` utility (middle ground)
#### Complete Example with Function Hooks
```python
from crawl4ai import Crawl4aiDockerClient, BrowserConfig, CrawlerRunConfig, CacheMode
# Define hooks as regular Python functions
async def setup_environment(page, context, **kwargs):
"""Setup crawling environment"""
# Set viewport
await page.set_viewport_size({"width": 1920, "height": 1080})
# Block resources for speed
await context.route("**/*.{png,jpg,gif}", lambda r: r.abort())
# Add custom headers
await page.set_extra_http_headers({
"Accept-Language": "en-US",
"X-Custom-Header": "Crawl4AI"
})
print("[HOOK] Environment configured")
return page
async def extract_content(page, context, **kwargs):
"""Extract and prepare content"""
# Scroll to load lazy content
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
# Extract metadata
metadata = await page.evaluate('''() => ({
title: document.title,
links: document.links.length,
images: document.images.length
})''')
print(f"[HOOK] Page metadata: {metadata}")
return page
async def main():
async with Crawl4aiDockerClient(base_url="http://localhost:11235", verbose=True) as client:
# Configure crawl
browser_config = BrowserConfig(headless=True)
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
# Crawl with hooks
result = await client.crawl(
["https://httpbin.org/html"],
browser_config=browser_config,
crawler_config=crawler_config,
hooks={
"on_page_context_created": setup_environment,
"before_retrieve_html": extract_content
},
hooks_timeout=30
)
if result.success:
print(f"✅ Crawl successful!")
print(f" URL: {result.url}")
print(f" HTML: {len(result.html)} chars")
print(f" Markdown: {len(result.markdown)} chars")
else:
print(f"❌ Crawl failed: {result.error_message}")
if __name__ == "__main__":
import asyncio
asyncio.run(main())
```
#### Additional Resources
- **Comprehensive Examples**: See `/docs/examples/hooks_docker_client_example.py` for Python function-based examples
- **REST API Examples**: See `/docs/examples/hooks_rest_api_example.py` for string-based examples
- **Comparison Guide**: See `/docs/examples/README_HOOKS.md` for detailed comparison
- **Utility Documentation**: See `/docs/hooks-utility-guide.md` for complete guide
---
## Dockerfile Parameters
@@ -892,10 +1168,12 @@ This is the easiest way to translate Python configuration to JSON requests when
Install the SDK: `pip install crawl4ai`
The Python SDK provides a convenient way to interact with the Docker API, including **automatic hook conversion** when using function objects.
```python
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode # Assuming you have crawl4ai installed
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode
async def main():
# Point to the correct server port
@@ -907,23 +1185,22 @@ async def main():
print("--- Running Non-Streaming Crawl ---")
results = await client.crawl(
["https://httpbin.org/html"],
browser_config=BrowserConfig(headless=True), # Use library classes for config aid
browser_config=BrowserConfig(headless=True),
crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
if results: # client.crawl returns None on failure
print(f"Non-streaming results success: {results.success}")
if results.success:
for result in results: # Iterate through the CrawlResultContainer
print(f"URL: {result.url}, Success: {result.success}")
if results:
print(f"Non-streaming results success: {results.success}")
if results.success:
for result in results:
print(f"URL: {result.url}, Success: {result.success}")
else:
print("Non-streaming crawl failed.")
# Example Streaming crawl
print("\n--- Running Streaming Crawl ---")
stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)
try:
async for result in await client.crawl( # client.crawl returns an async generator for streaming
async for result in await client.crawl(
["https://httpbin.org/html", "https://httpbin.org/links/5/0"],
browser_config=BrowserConfig(headless=True),
crawler_config=stream_config
@@ -932,17 +1209,56 @@ async def main():
except Exception as e:
print(f"Streaming crawl failed: {e}")
# Example with hooks (Python function objects)
print("\n--- Crawl with Hooks ---")
async def my_hook(page, context, **kwargs):
"""Custom hook to optimize performance"""
await page.set_viewport_size({"width": 1920, "height": 1080})
await context.route("**/*.{png,jpg}", lambda r: r.abort())
print("[HOOK] Page optimized")
return page
result = await client.crawl(
["https://httpbin.org/html"],
browser_config=BrowserConfig(headless=True),
crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
hooks={"on_page_context_created": my_hook}, # Pass function directly!
hooks_timeout=30
)
print(f"Crawl with hooks success: {result.success}")
# Example Get schema
print("\n--- Getting Schema ---")
schema = await client.get_schema()
print(f"Schema received: {bool(schema)}") # Print whether schema was received
print(f"Schema received: {bool(schema)}")
if __name__ == "__main__":
asyncio.run(main())
```
*(SDK parameters like timeout, verify_ssl etc. remain the same)*
#### SDK Parameters
The Docker client supports the following parameters:
**Client Initialization**:
- `base_url` (str): URL of the Docker server (default: `http://localhost:8000`)
- `timeout` (float): Request timeout in seconds (default: 30.0)
- `verify_ssl` (bool): Verify SSL certificates (default: True)
- `verbose` (bool): Enable verbose logging (default: True)
- `log_file` (Optional[str]): Path to log file (default: None)
**crawl() Method**:
- `urls` (List[str]): List of URLs to crawl
- `browser_config` (Optional[BrowserConfig]): Browser configuration
- `crawler_config` (Optional[CrawlerRunConfig]): Crawler configuration
- `hooks` (Optional[Dict]): Hook functions or strings - **automatically converts function objects!**
- `hooks_timeout` (int): Timeout for each hook execution in seconds (default: 30)
**Returns**:
- Single URL: `CrawlResult` object
- Multiple URLs: `List[CrawlResult]`
- Streaming: `AsyncGenerator[CrawlResult]`
### Second Approach: Direct API Calls
@@ -1352,19 +1668,40 @@ We're here to help you succeed with Crawl4AI! Here's how to get support:
In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:
- Building and running the Docker container
- Configuring the environment
- Configuring the environment
- Using the interactive playground for testing
- Making API requests with proper typing
- Using the Python SDK
- Using the Python SDK with **automatic hook conversion**
- **Working with hooks** - both string-based (REST API) and function-based (Python SDK)
- Leveraging specialized endpoints for screenshots, PDFs, and JavaScript execution
- Connecting via the Model Context Protocol (MCP)
- Monitoring your deployment
The new playground interface at `http://localhost:11235/playground` makes it much easier to test configurations and generate the corresponding JSON for API requests.
### Key Features
For AI application developers, the MCP integration allows tools like Claude Code to directly access Crawl4AI's capabilities without complex API handling.
**Hooks Support**: Crawl4AI offers two approaches for working with hooks:
- **String-based** (REST API): Works with any language, requires manual string formatting
- **Function-based** (Python SDK): Write hooks as regular Python functions with full IDE support and automatic conversion
Remember, the examples in the `examples` folder are your friends - they show real-world usage patterns that you can adapt for your needs.
**Playground Interface**: The built-in playground at `http://localhost:11235/playground` makes it easy to test configurations and generate corresponding JSON for API requests.
**MCP Integration**: For AI application developers, the MCP integration allows tools like Claude Code to directly access Crawl4AI's capabilities without complex API handling.
### Next Steps
1. **Explore Examples**: Check out the comprehensive examples in:
- `/docs/examples/hooks_docker_client_example.py` - Python function-based hooks
- `/docs/examples/hooks_rest_api_example.py` - REST API string-based hooks
- `/docs/examples/README_HOOKS.md` - Comparison and guide
2. **Read Documentation**:
- `/docs/hooks-utility-guide.md` - Complete hooks utility guide
- API documentation for detailed configuration options
3. **Join the Community**:
- GitHub: Report issues and contribute
- Discord: Get help and share your experiences
- Documentation: Comprehensive guides and tutorials
Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀

View File

@@ -59,6 +59,27 @@ Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant
> **Note**: If you're looking for the old documentation, you can access it [here](https://old.docs.crawl4ai.com).
## 🆕 AI Assistant Skill Now Available!
<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 20px; border-radius: 10px; margin: 20px 0; box-shadow: 0 4px 6px rgba(0,0,0,0.1);">
<h3 style="color: white; margin: 0 0 10px 0;">🤖 Crawl4AI Skill for Claude & AI Assistants</h3>
<p style="color: white; margin: 10px 0;">Supercharge your AI coding assistant with complete Crawl4AI knowledge! Download our comprehensive skill package that includes:</p>
<ul style="color: white; margin: 10px 0;">
<li>📚 Complete SDK reference (23K+ words)</li>
<li>🚀 Ready-to-use extraction scripts</li>
<li>⚡ Schema generation for efficient scraping</li>
<li>🔧 Version 0.7.4 compatible</li>
</ul>
<div style="text-align: center; margin-top: 15px;">
<a href="assets/crawl4ai-skill.zip" download style="background: white; color: #667eea; padding: 12px 30px; border-radius: 5px; text-decoration: none; font-weight: bold; display: inline-block; transition: transform 0.2s;">
📦 Download Skill Package
</a>
</div>
<p style="color: white; margin: 15px 0 0 0; font-size: 0.9em; text-align: center;">
Works with Claude, Cursor, Windsurf, and other AI coding assistants. Import the .zip file into your AI assistant's skill/knowledge system.
</p>
</div>
## 🎯 New: Adaptive Web Crawling
Crawl4AI now features intelligent adaptive crawling that knows when to stop! Using advanced information foraging algorithms, it determines when sufficient information has been gathered to answer your query.

View File

@@ -529,8 +529,19 @@ class AdminDashboard {
</label>
</div>
<div class="form-group full-width">
<label>Integration Guide</label>
<textarea id="form-integration" rows="10">${app?.integration_guide || ''}</textarea>
<label>Long Description (Markdown - Overview tab)</label>
<textarea id="form-long-description" rows="10" placeholder="Enter detailed description with markdown formatting...">${app?.long_description || ''}</textarea>
<small>Markdown support: **bold**, *italic*, [links](url), # headers, code blocks, lists</small>
</div>
<div class="form-group full-width">
<label>Integration Guide (Markdown - Integration tab)</label>
<textarea id="form-integration" rows="20" placeholder="Enter integration guide with installation, examples, and code snippets using markdown...">${app?.integration_guide || ''}</textarea>
<small>Single markdown field with installation, examples, and complete guide. Code blocks get auto copy buttons.</small>
</div>
<div class="form-group full-width">
<label>Documentation (Markdown - Documentation tab)</label>
<textarea id="form-documentation" rows="20" placeholder="Enter documentation with API reference, examples, and best practices using markdown...">${app?.documentation || ''}</textarea>
<small>Full documentation with API reference, examples, best practices, etc.</small>
</div>
</div>
`;
@@ -712,7 +723,9 @@ class AdminDashboard {
data.contact_email = document.getElementById('form-email').value;
data.featured = document.getElementById('form-featured').checked ? 1 : 0;
data.sponsored = document.getElementById('form-sponsored').checked ? 1 : 0;
data.long_description = document.getElementById('form-long-description').value;
data.integration_guide = document.getElementById('form-integration').value;
data.documentation = document.getElementById('form-documentation').value;
} else if (type === 'articles') {
data.title = document.getElementById('form-title').value;
data.slug = this.generateSlug(data.title);

View File

@@ -510,6 +510,31 @@
line-height: 1.5;
}
/* Markdown rendered code blocks */
.integration-content pre,
.docs-content pre {
background: var(--bg-dark);
border: 1px solid var(--border-color);
margin: 1rem 0;
padding: 1rem;
padding-top: 2.5rem; /* Space for copy button */
overflow-x: auto;
position: relative;
max-height: none; /* Remove any height restrictions */
height: auto; /* Allow content to expand */
}
.integration-content pre code,
.docs-content pre code {
background: transparent;
padding: 0;
color: var(--text-secondary);
font-size: 0.875rem;
line-height: 1.5;
white-space: pre; /* Preserve whitespace and line breaks */
display: block;
}
/* Feature Grid */
.feature-grid {
display: grid;

View File

@@ -80,20 +80,7 @@
<section id="overview-tab" class="tab-content active">
<div class="overview-columns">
<div class="overview-main">
<h2>Overview</h2>
<div id="app-overview">Overview content goes here.</div>
<h3>Key Features</h3>
<ul id="app-features" class="features-list">
<li>Feature 1</li>
<li>Feature 2</li>
<li>Feature 3</li>
</ul>
<h3>Use Cases</h3>
<div id="app-use-cases" class="use-cases">
<p>Describe how this app can help your workflow.</p>
</div>
</div>
<aside class="sidebar">
@@ -142,33 +129,14 @@
</section>
<section id="integration-tab" class="tab-content">
<div class="integration-content">
<h2>Integration Guide</h2>
<h3>Installation</h3>
<div class="code-block">
<pre><code id="install-code"># Installation instructions will appear here</code></pre>
</div>
<h3>Basic Usage</h3>
<div class="code-block">
<pre><code id="usage-code"># Usage example will appear here</code></pre>
</div>
<h3>Complete Integration Example</h3>
<div class="code-block">
<button class="copy-btn" id="copy-integration">Copy</button>
<pre><code id="integration-code"># Complete integration guide will appear here</code></pre>
</div>
<div class="integration-content" id="app-integration">
<!-- Integration guide markdown content will be rendered here -->
</div>
</section>
<section id="docs-tab" class="tab-content">
<div class="docs-content">
<h2>Documentation</h2>
<div id="app-docs" class="doc-sections">
<p>Documentation coming soon.</p>
</div>
<div class="docs-content" id="app-docs">
<!-- Documentation markdown content will be rendered here -->
</div>
</section>

View File

@@ -123,144 +123,132 @@ class AppDetailPage {
document.getElementById('sidebar-pricing').textContent = this.appData.pricing || 'Free';
document.getElementById('sidebar-contact').textContent = this.appData.contact_email || 'contact@example.com';
// Integration guide
this.renderIntegrationGuide();
// Render tab contents from database fields
this.renderTabContents();
}
renderIntegrationGuide() {
// Installation code
const installCode = document.getElementById('install-code');
if (installCode) {
if (this.appData.type === 'Open Source' && this.appData.github_url) {
installCode.textContent = `# Clone from GitHub
git clone ${this.appData.github_url}
# Install dependencies
pip install -r requirements.txt`;
} else if (this.appData.name.toLowerCase().includes('api')) {
installCode.textContent = `# Install via pip
pip install ${this.appData.slug}
# Or install from source
pip install git+${this.appData.github_url || 'https://github.com/example/repo'}`;
renderTabContents() {
// Overview tab - use long_description from database
const overviewDiv = document.getElementById('app-overview');
if (overviewDiv) {
if (this.appData.long_description) {
overviewDiv.innerHTML = this.renderMarkdown(this.appData.long_description);
} else {
overviewDiv.innerHTML = `<p>${this.appData.description || 'No overview available.'}</p>`;
}
}
// Usage code - customize based on category
const usageCode = document.getElementById('usage-code');
if (usageCode) {
if (this.appData.category === 'Browser Automation') {
usageCode.textContent = `from crawl4ai import AsyncWebCrawler
from ${this.appData.slug.replace(/-/g, '_')} import ${this.appData.name.replace(/\s+/g, '')}
async def main():
# Initialize ${this.appData.name}
automation = ${this.appData.name.replace(/\s+/g, '')}()
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
browser_config=automation.config,
wait_for="css:body"
)
print(result.markdown)`;
} else if (this.appData.category === 'Proxy Services') {
usageCode.textContent = `from crawl4ai import AsyncWebCrawler
import ${this.appData.slug.replace(/-/g, '_')}
# Configure proxy
proxy_config = {
"server": "${this.appData.website_url || 'https://proxy.example.com'}",
"username": "your_username",
"password": "your_password"
}
async with AsyncWebCrawler(proxy=proxy_config) as crawler:
result = await crawler.arun(
url="https://example.com",
bypass_cache=True
)
print(result.status_code)`;
} else if (this.appData.category === 'LLM Integration') {
usageCode.textContent = `from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
# Configure LLM extraction
strategy = LLMExtractionStrategy(
provider="${this.appData.name.toLowerCase().includes('gpt') ? 'openai' : 'anthropic'}",
api_key="your-api-key",
model="${this.appData.name.toLowerCase().includes('gpt') ? 'gpt-4' : 'claude-3'}",
instruction="Extract structured data"
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
extraction_strategy=strategy
)
print(result.extracted_content)`;
// Integration tab - use integration_guide field from database
const integrationDiv = document.getElementById('app-integration');
if (integrationDiv) {
if (this.appData.integration_guide) {
integrationDiv.innerHTML = this.renderMarkdown(this.appData.integration_guide);
// Add copy buttons to all code blocks
this.addCopyButtonsToCodeBlocks(integrationDiv);
} else {
integrationDiv.innerHTML = '<p>Integration guide not yet available. Please check the official website for details.</p>';
}
}
// Integration example
const integrationCode = document.getElementById('integration-code');
if (integrationCode) {
integrationCode.textContent = this.appData.integration_guide ||
`# Complete ${this.appData.name} Integration Example
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json
async def crawl_with_${this.appData.slug.replace(/-/g, '_')}():
"""
Complete example showing how to use ${this.appData.name}
with Crawl4AI for production web scraping
"""
# Define extraction schema
schema = {
"name": "ProductList",
"baseSelector": "div.product",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"},
{"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
]
// Documentation tab - use documentation field from database
const docsDiv = document.getElementById('app-docs');
if (docsDiv) {
if (this.appData.documentation) {
docsDiv.innerHTML = this.renderMarkdown(this.appData.documentation);
// Add copy buttons to all code blocks
this.addCopyButtonsToCodeBlocks(docsDiv);
} else {
docsDiv.innerHTML = '<p>Documentation coming soon.</p>';
}
}
}
# Initialize crawler with ${this.appData.name}
async with AsyncWebCrawler(
browser_type="chromium",
headless=True,
verbose=True
) as crawler:
addCopyButtonsToCodeBlocks(container) {
// Find all code blocks and add copy buttons
const codeBlocks = container.querySelectorAll('pre code');
codeBlocks.forEach(codeBlock => {
const pre = codeBlock.parentElement;
# Crawl with extraction
result = await crawler.arun(
url="https://example.com/products",
extraction_strategy=JsonCssExtractionStrategy(schema),
cache_mode="bypass",
wait_for="css:.product",
screenshot=True
)
// Skip if already has a copy button
if (pre.querySelector('.copy-btn')) return;
# Process results
if result.success:
products = json.loads(result.extracted_content)
print(f"Found {len(products)} products")
// Create copy button
const copyBtn = document.createElement('button');
copyBtn.className = 'copy-btn';
copyBtn.textContent = 'Copy';
copyBtn.onclick = () => {
navigator.clipboard.writeText(codeBlock.textContent).then(() => {
copyBtn.textContent = '✓ Copied!';
setTimeout(() => {
copyBtn.textContent = 'Copy';
}, 2000);
});
};
for product in products[:5]:
print(f"- {product['title']}: {product['price']}")
// Add button to pre element
pre.style.position = 'relative';
pre.insertBefore(copyBtn, codeBlock);
});
}
return products
renderMarkdown(text) {
if (!text) return '';
# Run the crawler
if __name__ == "__main__":
import asyncio
asyncio.run(crawl_with_${this.appData.slug.replace(/-/g, '_')}())`;
}
// Store code blocks temporarily to protect them from processing
const codeBlocks = [];
let processed = text.replace(/```(\w+)?\n([\s\S]*?)```/g, (match, lang, code) => {
const placeholder = `___CODE_BLOCK_${codeBlocks.length}___`;
codeBlocks.push(`<pre><code class="language-${lang || ''}">${this.escapeHtml(code)}</code></pre>`);
return placeholder;
});
// Store inline code temporarily
const inlineCodes = [];
processed = processed.replace(/`([^`]+)`/g, (match, code) => {
const placeholder = `___INLINE_CODE_${inlineCodes.length}___`;
inlineCodes.push(`<code>${this.escapeHtml(code)}</code>`);
return placeholder;
});
// Now process the rest of the markdown
processed = processed
// Headers
.replace(/^### (.*$)/gim, '<h3>$1</h3>')
.replace(/^## (.*$)/gim, '<h2>$1</h2>')
.replace(/^# (.*$)/gim, '<h1>$1</h1>')
// Bold
.replace(/\*\*(.*?)\*\*/g, '<strong>$1</strong>')
// Italic
.replace(/\*(.*?)\*/g, '<em>$1</em>')
// Links
.replace(/\[([^\]]+)\]\(([^)]+)\)/g, '<a href="$2" target="_blank">$1</a>')
// Line breaks
.replace(/\n\n/g, '</p><p>')
.replace(/\n/g, '<br>')
// Lists
.replace(/^\* (.*)$/gim, '<li>$1</li>')
.replace(/^- (.*)$/gim, '<li>$1</li>')
// Wrap in paragraphs
.replace(/^(?!<[h|p|pre|ul|ol|li])/gim, '<p>')
.replace(/(?<![>])$/gim, '</p>');
// Restore inline code
inlineCodes.forEach((code, i) => {
processed = processed.replace(`___INLINE_CODE_${i}___`, code);
});
// Restore code blocks
codeBlocks.forEach((block, i) => {
processed = processed.replace(`___CODE_BLOCK_${i}___`, block);
});
return processed;
}
escapeHtml(text) {
const div = document.createElement('div');
div.textContent = text;
return div.innerHTML;
}
formatNumber(num) {
@@ -289,33 +277,6 @@ if __name__ == "__main__":
document.getElementById(`${tabName}-tab`).classList.add('active');
});
});
// Copy integration code
document.getElementById('copy-integration').addEventListener('click', () => {
const code = document.getElementById('integration-code').textContent;
navigator.clipboard.writeText(code).then(() => {
const btn = document.getElementById('copy-integration');
const originalText = btn.innerHTML;
btn.innerHTML = '<span>✓</span> Copied!';
setTimeout(() => {
btn.innerHTML = originalText;
}, 2000);
});
});
// Copy code buttons
document.querySelectorAll('.copy-btn').forEach(btn => {
btn.addEventListener('click', (e) => {
const codeBlock = e.target.closest('.code-block');
const code = codeBlock.querySelector('code').textContent;
navigator.clipboard.writeText(code).then(() => {
btn.textContent = 'Copied!';
setTimeout(() => {
btn.textContent = 'Copy';
}, 2000);
});
});
});
}
async loadRelatedApps() {

View File

@@ -7,6 +7,7 @@ docs_dir: docs/md_v2
nav:
- Home: 'index.md'
- "📚 Complete SDK Reference": "complete-sdk-reference.md"
- "Ask AI": "core/ask-ai.md"
- "Quick Start": "core/quickstart.md"
- "Code Examples": "core/examples.md"

View File

@@ -0,0 +1,193 @@
"""
Test script demonstrating the hooks_to_string utility and Docker client integration.
"""
import asyncio
from crawl4ai import Crawl4aiDockerClient, hooks_to_string
# Define hook functions as regular Python functions
async def auth_hook(page, context, **kwargs):
"""Add authentication cookies."""
await context.add_cookies([{
'name': 'test_cookie',
'value': 'test_value',
'domain': '.httpbin.org',
'path': '/'
}])
return page
async def scroll_hook(page, context, **kwargs):
"""Scroll to load lazy content."""
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)
return page
async def viewport_hook(page, context, **kwargs):
"""Set custom viewport."""
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
async def test_hooks_utility():
"""Test the hooks_to_string utility function."""
print("=" * 60)
print("Testing hooks_to_string utility")
print("=" * 60)
# Create hooks dictionary with function objects
hooks_dict = {
"on_page_context_created": auth_hook,
"before_retrieve_html": scroll_hook
}
# Convert to string format
hooks_string = hooks_to_string(hooks_dict)
print("\n✓ Successfully converted function objects to strings")
print(f"\n✓ Converted {len(hooks_string)} hooks:")
for hook_name in hooks_string.keys():
print(f" - {hook_name}")
print("\n✓ Preview of converted hook:")
print("-" * 60)
print(hooks_string["on_page_context_created"][:200] + "...")
print("-" * 60)
return hooks_string
async def test_docker_client_with_functions():
"""Test Docker client with function objects (automatic conversion)."""
print("\n" + "=" * 60)
print("Testing Docker Client with Function Objects")
print("=" * 60)
# Note: This requires a running Crawl4AI Docker server
# Uncomment the following to test with actual server:
async with Crawl4aiDockerClient(base_url="http://localhost:11234", verbose=True) as client:
# Pass function objects directly - they'll be converted automatically
result = await client.crawl(
["https://httpbin.org/html"],
hooks={
"on_page_context_created": auth_hook,
"before_retrieve_html": scroll_hook
},
hooks_timeout=30
)
print(f"\n✓ Crawl successful: {result.success}")
print(f"✓ URL: {result.url}")
print("\n✓ Docker client accepts function objects directly")
print("✓ Automatic conversion happens internally")
print("✓ No manual string formatting needed!")
async def test_docker_client_with_strings():
"""Test Docker client with pre-converted strings."""
print("\n" + "=" * 60)
print("Testing Docker Client with String Hooks")
print("=" * 60)
# Convert hooks to strings first
hooks_dict = {
"on_page_context_created": viewport_hook,
"before_retrieve_html": scroll_hook
}
hooks_string = hooks_to_string(hooks_dict)
# Note: This requires a running Crawl4AI Docker server
# Uncomment the following to test with actual server:
async with Crawl4aiDockerClient(base_url="http://localhost:11234", verbose=True) as client:
# Pass string hooks - they'll be used as-is
result = await client.crawl(
["https://httpbin.org/html"],
hooks=hooks_string,
hooks_timeout=30
)
print(f"\n✓ Crawl successful: {result.success}")
print("\n✓ Docker client also accepts pre-converted strings")
print("✓ Backward compatible with existing code")
async def show_usage_patterns():
"""Show different usage patterns."""
print("\n" + "=" * 60)
print("Usage Patterns")
print("=" * 60)
print("\n1. Direct function usage (simplest):")
print("-" * 60)
print("""
async def my_hook(page, context, **kwargs):
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
result = await client.crawl(
["https://example.com"],
hooks={"on_page_context_created": my_hook}
)
""")
print("\n2. Convert then use:")
print("-" * 60)
print("""
hooks_dict = {"on_page_context_created": my_hook}
hooks_string = hooks_to_string(hooks_dict)
result = await client.crawl(
["https://example.com"],
hooks=hooks_string
)
""")
print("\n3. Manual string (backward compatible):")
print("-" * 60)
print("""
hooks_string = {
"on_page_context_created": '''
async def hook(page, context, **kwargs):
await page.set_viewport_size({"width": 1920, "height": 1080})
return page
'''
}
result = await client.crawl(
["https://example.com"],
hooks=hooks_string
)
""")
async def main():
"""Run all tests."""
print("\n🚀 Crawl4AI Hooks Utility Test Suite\n")
# Test the utility function
# await test_hooks_utility()
# Show usage with Docker client
# await test_docker_client_with_functions()
await test_docker_client_with_strings()
# Show different patterns
# await show_usage_patterns()
# print("\n" + "=" * 60)
# print("✓ All tests completed successfully!")
# print("=" * 60)
# print("\nKey Benefits:")
# print(" • Write hooks as regular Python functions")
# print(" • IDE support with autocomplete and type checking")
# print(" • Automatic conversion to API format")
# print(" • Backward compatible with string hooks")
# print(" • Same utility used everywhere")
# print("\n")
if __name__ == "__main__":
asyncio.run(main())