---
name: hugging-face-jobs
description: "This skill should be used when users want to run any workload on Hugging Face Jobs infrastructure. Covers UV scripts, Docker-based jobs, hardware selection, cost estimation, authentication with tokens, secrets management, timeout configuration, and result persistence. Designed for general-purpose compute workloads including data processing, inference, experiments, batch jobs, and any Python-based tasks. Should be invoked for tasks involving cloud compute, GPU workloads, or when users mention running jobs on Hugging Face infrastructure without local setup."
license: "Complete terms in LICENSE.txt"
source: "https://github.com/huggingface/skills/tree/main/skills/hugging-face-jobs"
risk: safe
---

# Running Workloads on Hugging Face Jobs

## Overview

Run any workload on fully managed Hugging Face infrastructure. No local setup required—jobs run on cloud CPUs, GPUs, or TPUs and can persist results to the Hugging Face Hub.

**Common use cases:**
- **Data Processing** - Transform, filter, or analyze large datasets
- **Batch Inference** - Run inference on thousands of samples
- **Experiments & Benchmarks** - Reproducible ML experiments
- **Model Training** - Fine-tune models (see `model-trainer` skill for TRL-specific training)
- **Synthetic Data Generation** - Generate datasets using LLMs
- **Development & Testing** - Test code without local GPU setup
- **Scheduled Jobs** - Automate recurring tasks

**For model training specifically:** See the `model-trainer` skill for TRL-based training workflows.

## When to Use This Skill

Use this skill when users want to:
- Run Python workloads on cloud infrastructure
- Execute jobs without local GPU/TPU setup
- Process data at scale
- Run batch inference or experiments
- Schedule recurring tasks
- Use GPUs/TPUs for any workload
- Persist results to the Hugging Face Hub

## Key Directives

When assisting with jobs:

1. **ALWAYS use the `hf_jobs()` MCP tool** - Submit jobs using `hf_jobs("uv", {...})` or `hf_jobs("run", {...})`. The `script` parameter accepts Python code directly. Do NOT save to local files unless the user explicitly requests it. Pass the script content as a string to `hf_jobs()`.

2. **Always handle authentication** - Jobs that interact with the Hub require `HF_TOKEN` via secrets. See the Token Usage Guide below.

3. **Provide job details after submission** - After submitting, provide the job ID, monitoring URL, and estimated time, and note that the user can request status checks later.

4. **Set appropriate timeouts** - The default of 30 minutes may be insufficient for long-running tasks.

## Prerequisites Checklist

Before starting any job, verify:

### ✅ **Account & Authentication**
- Hugging Face account with a [Pro](https://hf.co/pro), [Team](https://hf.co/enterprise), or [Enterprise](https://hf.co/enterprise) plan (Jobs require a paid plan)
- Authenticated login: check with `hf_whoami()`
- **HF_TOKEN for Hub access** ⚠️ CRITICAL - Required for any Hub operations (push models/datasets, download private repos, etc.)
- Token must have appropriate permissions (read for downloads, write for uploads)

### ✅ **Token Usage** (see the Token Usage Guide below for details)

**When tokens are required:**
- Pushing models/datasets to the Hub
- Accessing private repositories
- Using Hub APIs in scripts
- Any authenticated Hub operations

**How to provide tokens:**
```python
{
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Recommended: automatic token
}
```

**⚠️ CRITICAL:** The `$HF_TOKEN` placeholder is automatically replaced with your logged-in token. Never hardcode tokens in scripts.

## Token Usage Guide

### Understanding Tokens

**What are HF Tokens?**
- Authentication credentials for the Hugging Face Hub
- Required for authenticated operations (push, private repos, API access)
- Stored securely on your machine after `hf auth login`

**Token Types:**
- **Read Token** - Can download models/datasets, read private repos
- **Write Token** - Can push models/datasets, create repos, modify content
- **Organization Token** - Can act on behalf of an organization

### When Tokens Are Required

**Always Required:**
- Pushing models/datasets to the Hub
- Accessing private repositories
- Creating new repositories
- Modifying existing repositories
- Using Hub APIs programmatically

**Not Required:**
- Downloading public models/datasets
- Running jobs that don't interact with the Hub
- Reading public repository information

### How to Provide Tokens to Jobs

#### Method 1: Automatic Token (Recommended)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Automatic replacement
})
```

**How it works:**
- `$HF_TOKEN` is a placeholder that gets replaced with your actual token
- Uses the token from your logged-in session (`hf auth login`)
- Most secure and convenient method
- Token is encrypted server-side when passed as a secret

**Benefits:**
- No token exposure in code
- Uses your current login session
- Automatically updated if you re-login
- Works seamlessly with MCP tools

#### Method 2: Explicit Token (Not Recommended)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "secrets": {"HF_TOKEN": "hf_abc123..."}  # ⚠️ Hardcoded token
})
```

**When to use:**
- Only if the automatic token doesn't work
- Testing with a specific token
- Organization tokens (use with caution)

**Security concerns:**
- Token visible in code/logs
- Must manually update if the token rotates
- Risk of token exposure

#### Method 3: Environment Variable (Less Secure)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "env": {"HF_TOKEN": "hf_abc123..."}  # ⚠️ Less secure than secrets
})
```

**Difference from secrets:**
- `env` variables are visible in job logs
- `secrets` are encrypted server-side
- Always prefer `secrets` for tokens

### Using Tokens in Scripts

**In your Python script, tokens are available as environment variables:**

```python
# /// script
# dependencies = ["huggingface-hub"]
# ///

import os
from huggingface_hub import HfApi

# Token is automatically available if passed via secrets
token = os.environ.get("HF_TOKEN")

# Use with the Hub API
api = HfApi(token=token)

# Or let huggingface_hub auto-detect it
api = HfApi()  # Automatically uses the HF_TOKEN env var
```

**Best practices:**
- Don't hardcode tokens in scripts
- Use `os.environ.get("HF_TOKEN")` to access the token
- Let `huggingface_hub` auto-detect it when possible
- Verify the token exists before Hub operations

### Token Verification

**Check if you're logged in:**
```python
from huggingface_hub import whoami
user_info = whoami()  # Returns your user info if authenticated
```

**Verify the token in a job:**
```python
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN not found!"
token = os.environ["HF_TOKEN"]
print(f"Token starts with: {token[:7]}...")  # Should start with "hf_"
```

### Common Token Issues

**Error: 401 Unauthorized**
- **Cause:** Token missing or invalid
- **Fix:** Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to the job config
- **Verify:** Check that `hf_whoami()` works locally

**Error: 403 Forbidden**
- **Cause:** Token lacks required permissions
- **Fix:** Ensure the token has write permissions for push operations
- **Check:** Token type at https://huggingface.co/settings/tokens

**Error: Token not found in environment**
- **Cause:** `secrets` not passed or wrong key name
- **Fix:** Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
- **Verify:** Script checks `os.environ.get("HF_TOKEN")`

**Error: Repository access denied**
- **Cause:** Token doesn't have access to the private repo
- **Fix:** Use a token from an account with access
- **Check:** Verify repo visibility and your permissions

### Token Security Best Practices

1. **Never commit tokens** - Use the `$HF_TOKEN` placeholder or environment variables
2. **Use secrets, not env** - Secrets are encrypted server-side
3. **Rotate tokens regularly** - Generate new tokens periodically
4. **Use minimal permissions** - Create tokens with only the permissions you need
5. **Don't share tokens** - Each user should use their own token
6. **Monitor token usage** - Check token activity in Hub settings

### Complete Token Example

```python
# Example: Push results to the Hub
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["huggingface-hub", "datasets"]
# ///

import os
from huggingface_hub import HfApi
from datasets import Dataset

# Verify the token is available
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"

# Use the token for Hub operations
api = HfApi(token=os.environ["HF_TOKEN"])

# Create and push a dataset
data = {"text": ["Hello", "World"]}
dataset = Dataset.from_dict(data)
dataset.push_to_hub("username/my-dataset", token=os.environ["HF_TOKEN"])

print("✅ Dataset pushed successfully!")
""",
    "flavor": "cpu-basic",
    "timeout": "30m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Token provided securely
})
```

## Quick Start: Two Approaches

### Approach 1: UV Scripts (Recommended)

UV scripts use PEP 723 inline dependencies for clean, self-contained workloads.

**MCP Tool:**
```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["transformers", "torch"]
# ///

from transformers import pipeline
import torch

# Your workload here
classifier = pipeline("sentiment-analysis")
result = classifier("I love Hugging Face!")
print(result)
""",
    "flavor": "cpu-basic",
    "timeout": "30m"
})
```

**CLI Equivalent:**
```bash
hf jobs uv run my_script.py --flavor cpu-basic --timeout 30m
```

**Python API:**
```python
from huggingface_hub import run_uv_job
run_uv_job("my_script.py", flavor="cpu-basic", timeout="30m")
```

**Benefits:** Direct MCP tool usage, clean code, dependencies declared inline, no file saving required

**When to use:** Default choice for all workloads, custom logic, any scenario requiring `hf_jobs()`

#### Custom Docker Images for UV Scripts

By default, UV scripts use `ghcr.io/astral-sh/uv:python3.12-bookworm-slim`. For ML workloads with complex dependencies, use pre-built images:

```python
hf_jobs("uv", {
    "script": "inference.py",
    "image": "vllm/vllm-openai:latest",  # Pre-built image with vLLM
    "flavor": "a10g-large"
})
```

**CLI:**
```bash
hf jobs uv run --image vllm/vllm-openai:latest --flavor a10g-large inference.py
```

**Benefits:** Faster startup, pre-installed dependencies, optimized for specific frameworks

#### Python Version

By default, UV scripts use Python 3.12. Specify a different version:

```python
hf_jobs("uv", {
    "script": "my_script.py",
    "python": "3.11",  # Use Python 3.11
    "flavor": "cpu-basic"
})
```

**Python API:**
```python
from huggingface_hub import run_uv_job
run_uv_job("my_script.py", python="3.11")
```

#### Working with Scripts

⚠️ **Important:** There are *two* "script path" stories depending on how you run Jobs:

- **Using the `hf_jobs()` MCP tool (recommended in this repo):** the `script` value must be **inline code** (a string) or a **URL**. A local filesystem path (like `"./scripts/foo.py"`) won't exist inside the remote container.
- **Using the `hf jobs uv run` CLI:** local file paths **do work** (the CLI uploads your script).

**Common mistake with the `hf_jobs()` MCP tool:**

```python
# ❌ Will fail (the remote container can't see your local path)
hf_jobs("uv", {"script": "./scripts/foo.py"})
```

**Correct patterns with the `hf_jobs()` MCP tool:**

```python
# ✅ Inline: read the local script file and pass its *contents*
from pathlib import Path
script = Path("hf-jobs/scripts/foo.py").read_text()
hf_jobs("uv", {"script": script})

# ✅ URL: host the script somewhere reachable
hf_jobs("uv", {"script": "https://huggingface.co/datasets/uv-scripts/.../raw/main/foo.py"})

# ✅ URL from GitHub
hf_jobs("uv", {"script": "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py"})
```

**CLI equivalent (local paths supported):**

```bash
hf jobs uv run ./scripts/foo.py -- --your --args
```

#### Adding Dependencies at Runtime

Add extra dependencies beyond what's in the PEP 723 header:

```python
hf_jobs("uv", {
    "script": "inference.py",
    "dependencies": ["transformers", "torch>=2.0"],  # Extra deps
    "flavor": "a10g-small"
})
```

**Python API:**
```python
from huggingface_hub import run_uv_job
run_uv_job("inference.py", dependencies=["transformers", "torch>=2.0"])
```

### Approach 2: Docker-Based Jobs

Run jobs with custom Docker images and commands.

**MCP Tool:**
```python
hf_jobs("run", {
    "image": "python:3.12",
    "command": ["python", "-c", "print('Hello from HF Jobs!')"],
    "flavor": "cpu-basic",
    "timeout": "30m"
})
```

**CLI Equivalent:**
```bash
hf jobs run python:3.12 python -c "print('Hello from HF Jobs!')"
```

**Python API:**
```python
from huggingface_hub import run_job
run_job(image="python:3.12", command=["python", "-c", "print('Hello!')"], flavor="cpu-basic")
```

**Benefits:** Full Docker control, use pre-built images, run any command

**When to use:** Need specific Docker images, non-Python workloads, complex environments

**Example with GPU:**
```python
hf_jobs("run", {
    "image": "pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
    "command": ["python", "-c", "import torch; print(torch.cuda.get_device_name())"],
    "flavor": "a10g-small",
    "timeout": "1h"
})
```

**Using Hugging Face Spaces as Images:**

You can use Docker images from HF Spaces:
```python
hf_jobs("run", {
    "image": "hf.co/spaces/lhoestq/duckdb",  # Space as Docker image
    "command": ["duckdb", "-c", "SELECT 'Hello from DuckDB!'"],
    "flavor": "cpu-basic"
})
```

**CLI:**
```bash
hf jobs run hf.co/spaces/lhoestq/duckdb duckdb -c "SELECT 'Hello!'"
```

### Finding More UV Scripts on the Hub

The `uv-scripts` organization provides ready-to-use UV scripts stored as datasets on the Hugging Face Hub:

```python
# Discover available UV script collections
dataset_search({"author": "uv-scripts", "sort": "downloads", "limit": 20})

# Explore a specific collection
hub_repo_details(["uv-scripts/classification"], repo_type="dataset", include_readme=True)
```

**Popular collections:** OCR, classification, synthetic-data, vLLM, dataset-creation

## Hardware Selection

> **Reference:** [HF Jobs Hardware Docs](https://huggingface.co/docs/hub/en/spaces-config-reference) (updated 07/2025)

| Workload Type | Recommended Hardware | Use Case |
|---------------|----------------------|----------|
| Data processing, testing | `cpu-basic`, `cpu-upgrade` | Lightweight tasks |
| Small models, demos | `t4-small` | <1B models, quick tests |
| Medium models | `t4-medium`, `l4x1` | 1-7B models |
| Large models, production | `a10g-small`, `a10g-large` | 7-13B models |
| Very large models | `a100-large` | 13B+ models |
| Batch inference | `a10g-large`, `a100-large` | High throughput |
| Multi-GPU workloads | `l4x4`, `a10g-largex2`, `a10g-largex4` | Parallel/large models |
| TPU workloads | `v5e-1x1`, `v5e-2x2`, `v5e-2x4` | JAX/Flax, TPU-optimized |

**All Available Flavors:**
- **CPU:** `cpu-basic`, `cpu-upgrade`
- **GPU:** `t4-small`, `t4-medium`, `l4x1`, `l4x4`, `a10g-small`, `a10g-large`, `a10g-largex2`, `a10g-largex4`, `a100-large`
- **TPU:** `v5e-1x1`, `v5e-2x2`, `v5e-2x4`

**Guidelines:**
- Start with smaller hardware for testing
- Scale up based on actual needs
- Use multi-GPU for parallel workloads or large models
- Use TPUs for JAX/Flax workloads
- See `references/hardware_guide.md` for detailed specifications

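The hardware table above can be condensed into a small heuristic helper. This is a sketch for illustration only — `suggest_flavor` is a hypothetical function, not part of any Hugging Face API, and the thresholds simply restate the table:

```python
def suggest_flavor(model_size_b=None, gpu=True):
    """Rough flavor suggestion from the sizing table above (a heuristic
    sketch, not an official sizing tool). model_size_b is billions of params."""
    if not gpu or model_size_b is None:
        return "cpu-basic"       # Data processing, testing
    if model_size_b < 1:
        return "t4-small"        # Small models, demos
    if model_size_b <= 7:
        return "l4x1"            # Medium models
    if model_size_b <= 13:
        return "a10g-large"      # Large models
    return "a100-large"          # Very large models
```

For real workloads, still start small and scale up after checking actual memory use.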
## Critical: Saving Results

**⚠️ EPHEMERAL ENVIRONMENT—MUST PERSIST RESULTS**

The Jobs environment is temporary. All files are deleted when the job ends. If results aren't persisted, **ALL WORK IS LOST**.

### Persistence Options

**1. Push to Hugging Face Hub (Recommended)**

```python
# Push models
model.push_to_hub("username/model-name", token=os.environ["HF_TOKEN"])

# Push datasets
dataset.push_to_hub("username/dataset-name", token=os.environ["HF_TOKEN"])

# Push artifacts
api.upload_file(
    path_or_fileobj="results.json",
    path_in_repo="results.json",
    repo_id="username/results",
    token=os.environ["HF_TOKEN"]
)
```

**2. Use External Storage**

```python
# Upload to S3, GCS, etc.
import boto3
s3 = boto3.client('s3')
s3.upload_file('results.json', 'my-bucket', 'results.json')
```

**3. Send Results via API**

```python
# POST results to your API
import requests
requests.post("https://your-api.com/results", json=results)
```

### Required Configuration for Hub Push

**In job submission:**
```python
{
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Enables authentication
}
```

**In script:**
```python
import os
from huggingface_hub import HfApi

# Token automatically available from secrets
api = HfApi(token=os.environ.get("HF_TOKEN"))

# Push your results
api.upload_file(...)
```

### Verification Checklist

Before submitting:
- [ ] Results persistence method chosen
- [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` set if using the Hub
- [ ] Script handles a missing token gracefully
- [ ] Test that the persistence path works

**See:** `references/hub_saving.md` for a detailed Hub persistence guide

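"Handles a missing token gracefully" means failing fast with an actionable message instead of a bare `KeyError` deep inside a Hub call. A minimal sketch (the `require_hf_token` helper is hypothetical, not part of `huggingface_hub`):

```python
import os
import sys

def require_hf_token():
    """Return HF_TOKEN, or exit early with a clear, actionable message."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        sys.exit(
            "HF_TOKEN is not set. Pass it via "
            'secrets={"HF_TOKEN": "$HF_TOKEN"} in the job config.'
        )
    if not token.startswith("hf_"):
        print("Warning: HF_TOKEN does not look like a Hugging Face token.")
    return token
```

Calling this at the top of a script stops the job before any compute is wasted on work that can't be persisted.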
## Timeout Management

**⚠️ DEFAULT: 30 MINUTES**

Jobs automatically stop after the timeout. For long-running tasks like training, always set a custom timeout.

### Setting Timeouts

**MCP Tool:**
```python
{
    "timeout": "2h"  # 2 hours
}
```

**Supported formats:**
- Integer/float: seconds (e.g., `300` = 5 minutes)
- String with suffix: `"5m"` (minutes), `"2h"` (hours), `"1d"` (days)
- Examples: `"90m"`, `"2h"`, `"1.5h"`, `300`, `"1d"`

**Python API:**
```python
from huggingface_hub import run_job, run_uv_job

run_job(image="python:3.12", command=[...], timeout="2h")
run_uv_job("script.py", timeout=7200)  # 2 hours in seconds
```

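The accepted formats can be illustrated with a small local converter — a hypothetical helper for sanity-checking values before submission, not part of `huggingface_hub`:

```python
def timeout_to_seconds(value):
    """Convert a timeout like 300, "5m", "2h", "1.5h", or "1d" to seconds."""
    if isinstance(value, (int, float)):
        return float(value)  # Bare numbers are seconds
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    suffix = value[-1].lower()
    if suffix in units:
        return float(value[:-1]) * units[suffix]
    return float(value)  # Numeric string, still seconds

# e.g. timeout_to_seconds("90m") == 5400.0
```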
### Timeout Guidelines

| Scenario | Recommended | Notes |
|----------|-------------|-------|
| Quick test | 10-30 min | Verify setup |
| Data processing | 1-2 hours | Depends on data size |
| Batch inference | 2-4 hours | Large batches |
| Experiments | 4-8 hours | Multiple runs |
| Long-running | 8-24 hours | Production workloads |

**Always add a 20-30% buffer** for setup, network delays, and cleanup.

**On timeout:** The job is killed immediately and all unsaved progress is lost.

## Cost Estimation

**General guidelines:**

```
Total Cost = (Hours of runtime) × (Cost per hour)
```

**Example calculations:**

**Quick test:**
- Hardware: cpu-basic ($0.10/hour)
- Time: 15 minutes (0.25 hours)
- Cost: ~$0.03 ($0.10 × 0.25 = $0.025)

**Data processing:**
- Hardware: l4x1 ($2.50/hour)
- Time: 2 hours
- Cost: $5.00

**Batch inference:**
- Hardware: a10g-large ($5/hour)
- Time: 4 hours
- Cost: $20.00

**Cost optimization tips:**
1. Start small - Test on cpu-basic or t4-small
2. Monitor runtime - Set appropriate timeouts
3. Use checkpoints - Resume if a job fails
4. Optimize code - Reduce unnecessary compute
5. Choose the right hardware - Don't over-provision

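The arithmetic above can be wrapped in a tiny helper. The rates are the illustrative ones from the examples, not official pricing — check current rates before relying on them:

```python
# Illustrative hourly rates taken from the examples above (not a price list).
RATES_PER_HOUR = {"cpu-basic": 0.10, "l4x1": 2.50, "a10g-large": 5.00}

def estimate_cost(flavor, hours):
    """Total cost = hours of runtime × cost per hour, rounded to cents."""
    return round(RATES_PER_HOUR[flavor] * hours, 2)

# e.g. estimate_cost("l4x1", 2) == 5.0
```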
## Monitoring and Tracking

### Check Job Status

**MCP Tool:**
```python
# List all jobs
hf_jobs("ps")

# Inspect a specific job
hf_jobs("inspect", {"job_id": "your-job-id"})

# View logs
hf_jobs("logs", {"job_id": "your-job-id"})

# Cancel a job
hf_jobs("cancel", {"job_id": "your-job-id"})
```

**Python API:**
```python
from huggingface_hub import list_jobs, inspect_job, fetch_job_logs, cancel_job

# List your jobs
jobs = list_jobs()

# List running jobs only
running = [j for j in list_jobs() if j.status.stage == "RUNNING"]

# Inspect a specific job
job_info = inspect_job(job_id="your-job-id")

# View logs
for log in fetch_job_logs(job_id="your-job-id"):
    print(log)

# Cancel a job
cancel_job(job_id="your-job-id")
```

**CLI:**
```bash
hf jobs ps                 # List jobs
hf jobs logs <job-id>      # View logs
hf jobs cancel <job-id>    # Cancel job
```

**Remember:** Wait for the user to request status checks. Avoid polling repeatedly.

### Job URLs

After submission, jobs have monitoring URLs:
```
https://huggingface.co/jobs/username/job-id
```

View logs, status, and details in the browser.

### Wait for Multiple Jobs

```python
import time
from huggingface_hub import inspect_job, run_job

# Run multiple jobs
jobs = [run_job(image=img, command=cmd) for img, cmd in workloads]

# Wait for all to complete
for job in jobs:
    while inspect_job(job_id=job.id).status.stage not in ("COMPLETED", "ERROR"):
        time.sleep(10)
```

## Scheduled Jobs

Run jobs on a schedule using CRON expressions or predefined schedules.

**MCP Tool:**
```python
# Schedule a UV script that runs every hour
hf_jobs("scheduled uv", {
    "script": "your_script.py",
    "schedule": "@hourly",
    "flavor": "cpu-basic"
})

# Schedule with CRON syntax
hf_jobs("scheduled uv", {
    "script": "your_script.py",
    "schedule": "0 9 * * 1",  # 9 AM every Monday
    "flavor": "cpu-basic"
})

# Schedule a Docker-based job
hf_jobs("scheduled run", {
    "image": "python:3.12",
    "command": ["python", "-c", "print('Scheduled!')"],
    "schedule": "@daily",
    "flavor": "cpu-basic"
})
```

**Python API:**
```python
from huggingface_hub import create_scheduled_job, create_scheduled_uv_job

# Schedule a Docker job
create_scheduled_job(
    image="python:3.12",
    command=["python", "-c", "print('Running on schedule!')"],
    schedule="@hourly"
)

# Schedule a UV script
create_scheduled_uv_job("my_script.py", schedule="@daily", flavor="cpu-basic")

# Schedule with GPU
create_scheduled_uv_job(
    "ml_inference.py",
    schedule="0 */6 * * *",  # Every 6 hours
    flavor="a10g-small"
)
```

**Available schedules:**
- `@annually`, `@yearly` - Once per year
- `@monthly` - Once per month
- `@weekly` - Once per week
- `@daily` - Once per day
- `@hourly` - Once per hour
- CRON expression - Custom schedule (e.g., `"*/5 * * * *"` for every 5 minutes)

**Manage scheduled jobs:**
```python
# MCP Tool
hf_jobs("scheduled ps")                          # List scheduled jobs
hf_jobs("scheduled inspect", {"job_id": "..."})  # Inspect details
hf_jobs("scheduled suspend", {"job_id": "..."})  # Pause
hf_jobs("scheduled resume", {"job_id": "..."})   # Resume
hf_jobs("scheduled delete", {"job_id": "..."})   # Delete
```

**Python API for management:**
```python
from huggingface_hub import (
    list_scheduled_jobs,
    inspect_scheduled_job,
    suspend_scheduled_job,
    resume_scheduled_job,
    delete_scheduled_job
)

# List all scheduled jobs
scheduled = list_scheduled_jobs()

# Inspect a scheduled job
info = inspect_scheduled_job(scheduled_job_id)

# Suspend (pause) a scheduled job
suspend_scheduled_job(scheduled_job_id)

# Resume a scheduled job
resume_scheduled_job(scheduled_job_id)

# Delete a scheduled job
delete_scheduled_job(scheduled_job_id)
```

## Webhooks: Trigger Jobs on Events

Trigger jobs automatically when changes happen in Hugging Face repositories.

**Python API:**
```python
from huggingface_hub import create_webhook

# Create a webhook that triggers a job when a repo changes
webhook = create_webhook(
    job_id=job.id,
    watched=[
        {"type": "user", "name": "your-username"},
        {"type": "org", "name": "your-org-name"}
    ],
    domains=["repo", "discussion"],
    secret="your-secret"
)
```

**How it works:**
1. The webhook listens for changes in watched repositories
2. When triggered, the job runs with a `WEBHOOK_PAYLOAD` environment variable
3. Your script can parse the payload to understand what changed

**Use cases:**
- Auto-process new datasets when uploaded
- Trigger inference when models are updated
- Run tests when code changes
- Generate reports on repository activity

**Access the webhook payload in a script:**
```python
import os
import json

payload = json.loads(os.environ.get("WEBHOOK_PAYLOAD", "{}"))
print(f"Event type: {payload.get('event', {}).get('action')}")
```

See the [Webhooks Documentation](https://huggingface.co/docs/huggingface_hub/guides/webhooks) for more details.

## Common Workload Patterns

This repository ships ready-to-run UV scripts in `hf-jobs/scripts/`. Prefer using them instead of inventing new templates.

### Pattern 1: Dataset → Model Responses (vLLM) — `scripts/generate-responses.py`

**What it does:** loads a Hub dataset (chat `messages` or a `prompt` column), applies a model chat template, generates responses with vLLM, and **pushes** the output dataset + dataset card back to the Hub.

**Requires:** GPU + **write** token (it pushes a dataset).

```python
from pathlib import Path

script = Path("hf-jobs/scripts/generate-responses.py").read_text()
hf_jobs("uv", {
    "script": script,
    "script_args": [
        "username/input-dataset",
        "username/output-dataset",
        "--messages-column", "messages",
        "--model-id", "Qwen/Qwen3-30B-A3B-Instruct-2507",
        "--temperature", "0.7",
        "--top-p", "0.8",
        "--max-tokens", "2048",
    ],
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```

### Pattern 2: CoT Self-Instruct Synthetic Data — `scripts/cot-self-instruct.py`

**What it does:** generates synthetic prompts/answers via CoT Self-Instruct, optionally filters outputs (answer-consistency / RIP), then **pushes** the generated dataset + dataset card to the Hub.

**Requires:** GPU + **write** token (it pushes a dataset).

```python
from pathlib import Path

script = Path("hf-jobs/scripts/cot-self-instruct.py").read_text()
hf_jobs("uv", {
    "script": script,
    "script_args": [
        "--seed-dataset", "davanstrien/s1k-reasoning",
        "--output-dataset", "username/synthetic-math",
        "--task-type", "reasoning",
        "--num-samples", "5000",
        "--filter-method", "answer-consistency",
    ],
    "flavor": "l4x4",
    "timeout": "8h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```

### Pattern 3: Streaming Dataset Stats (Polars + HF Hub) — `scripts/finepdfs-stats.py`

**What it does:** scans Parquet directly from the Hub (no 300 GB download), computes temporal stats, and (optionally) uploads results to a Hub dataset repo.

**Requires:** CPU is often enough; a token is needed **only** if you pass `--output-repo` (upload).

```python
from pathlib import Path

script = Path("hf-jobs/scripts/finepdfs-stats.py").read_text()
hf_jobs("uv", {
    "script": script,
    "script_args": [
        "--limit", "10000",
        "--show-plan",
        "--output-repo", "username/finepdfs-temporal-stats",
    ],
    "flavor": "cpu-upgrade",
    "timeout": "2h",
    "env": {"HF_XET_HIGH_PERFORMANCE": "1"},
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```

## Common Failure Modes

### Out of Memory (OOM)

**Fix:**
1. Reduce batch size or data chunk size
2. Process data in smaller batches
3. Upgrade hardware: cpu → t4 → a10g → a100
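Fixes 1-2 usually come down to iterating over the data in fixed-size chunks so that only one chunk is in memory at a time — a generic sketch:

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive fixed-size batches from any iterable."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch

# Process (and persist) one batch at a time instead of the whole dataset
for batch in batched(range(10), 4):
    print(len(batch))  # prints 4, 4, 2
```

Combined with periodic Hub pushes per batch, this also limits how much work a crash or timeout can lose.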
### Job Timeout

**Fix:**
1. Check logs for actual runtime
2. Increase timeout with buffer: `"timeout": "3h"`
3. Optimize code for faster execution
4. Process data in chunks

### Hub Push Failures

**Fix:**
1. Add to job: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
2. Verify token in script: `assert "HF_TOKEN" in os.environ`
3. Check token permissions
4. Verify repo exists or can be created

### Missing Dependencies

**Fix:**
Add to the PEP 723 header:
```python
# /// script
# dependencies = ["package1", "package2>=1.0.0"]
# ///
```

### Authentication Errors

**Fix:**
1. Check `hf_whoami()` works locally
2. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
3. Re-login: `hf auth login`
4. Check token has required permissions

## Troubleshooting

**Common issues:**
- Job times out → Increase timeout, optimize code
- Results not saved → Check persistence method, verify HF_TOKEN
- Out of memory → Reduce batch size, upgrade hardware
- Import errors → Add dependencies to PEP 723 header
- Authentication errors → Check token, verify secrets parameter

**See:** `references/troubleshooting.md` for the complete troubleshooting guide

## Resources

### References (In This Skill)
- `references/token_usage.md` - Complete token usage guide
- `references/hardware_guide.md` - Hardware specs and selection
- `references/hub_saving.md` - Hub persistence guide
- `references/troubleshooting.md` - Common issues and solutions

### Scripts (In This Skill)
- `scripts/generate-responses.py` - vLLM batch generation: dataset → responses → push to Hub
- `scripts/cot-self-instruct.py` - CoT Self-Instruct synthetic data generation + filtering → push to Hub
- `scripts/finepdfs-stats.py` - Polars streaming stats over `finepdfs-edu` Parquet on the Hub (optional push)

### External Links

**Official Documentation:**
- [HF Jobs Guide](https://huggingface.co/docs/huggingface_hub/guides/jobs) - Main documentation
- [HF Jobs CLI Reference](https://huggingface.co/docs/huggingface_hub/guides/cli#hf-jobs) - Command-line interface
- [HF Jobs API Reference](https://huggingface.co/docs/huggingface_hub/package_reference/hf_api) - Python API details
- [Hardware Flavors Reference](https://huggingface.co/docs/hub/en/spaces-config-reference) - Available hardware

**Related Tools:**
- [UV Scripts Guide](https://docs.astral.sh/uv/guides/scripts/) - PEP 723 inline dependencies
- [UV Scripts Organization](https://huggingface.co/uv-scripts) - Community UV script collection
- [HF Hub Authentication](https://huggingface.co/docs/huggingface_hub/quick-start#authentication) - Token setup
- [Webhooks Documentation](https://huggingface.co/docs/huggingface_hub/guides/webhooks) - Event triggers

## Key Takeaways

1. **Submit scripts inline** - The `script` parameter accepts Python code directly; no file saving required unless the user requests it
2. **Jobs are asynchronous** - Don't wait/poll; let the user check when ready
3. **Always set a timeout** - The default of 30 minutes may be insufficient
4. **Always persist results** - The environment is ephemeral; without persistence, all work is lost
5. **Use tokens securely** - Always use `secrets={"HF_TOKEN": "$HF_TOKEN"}` for Hub operations
6. **Choose appropriate hardware** - Start small, scale up based on needs (see the hardware guide)
7. **Use UV scripts** - Default to `hf_jobs("uv", {...})` with inline scripts for Python workloads
8. **Handle authentication** - Verify tokens are available before Hub operations
9. **Monitor jobs** - Provide job URLs and status check commands
10. **Optimize costs** - Choose the right hardware and set appropriate timeouts

## Quick Reference: MCP Tool vs CLI vs Python API

| Operation | MCP Tool | CLI | Python API |
|-----------|----------|-----|------------|
| Run UV script | `hf_jobs("uv", {...})` | `hf jobs uv run script.py` | `run_uv_job("script.py")` |
| Run Docker job | `hf_jobs("run", {...})` | `hf jobs run image cmd` | `run_job(image, command)` |
| List jobs | `hf_jobs("ps")` | `hf jobs ps` | `list_jobs()` |
| View logs | `hf_jobs("logs", {...})` | `hf jobs logs <id>` | `fetch_job_logs(job_id)` |
| Cancel job | `hf_jobs("cancel", {...})` | `hf jobs cancel <id>` | `cancel_job(job_id)` |
| Schedule UV | `hf_jobs("scheduled uv", {...})` | - | `create_scheduled_uv_job()` |
| Schedule Docker | `hf_jobs("scheduled run", {...})` | - | `create_scheduled_job()` |