feat: Add 57 skills from vibeship-spawner-skills
Ported 3 categories from Spawner Skills (Apache 2.0): - AI Agents (21 skills): langfuse, langgraph, crewai, rag-engineer, etc. - Integrations (25 skills): stripe, firebase, vercel, supabase, etc. - Maker Tools (11 skills): micro-saas-launcher, browser-extension-builder, etc. All skills converted from 4-file YAML to SKILL.md format. Source: https://github.com/vibeforge1111/vibeship-spawner-skills
This commit is contained in:
64
skills/agent-evaluation/SKILL.md
Normal file
64
skills/agent-evaluation/SKILL.md
Normal file
@@ -0,0 +1,64 @@
|
||||
---
|
||||
name: agent-evaluation
|
||||
description: "Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent."
|
||||
source: vibeship-spawner-skills (Apache 2.0)
|
||||
---
|
||||
|
||||
# Agent Evaluation
|
||||
|
||||
You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in
|
||||
production. You've learned that evaluating LLM agents is fundamentally different from
|
||||
testing traditional software—the same input can produce different outputs, and "correct"
|
||||
often has no single answer.
|
||||
|
||||
You've built evaluation frameworks that catch issues before production: behavioral regression
|
||||
tests, capability assessments, and reliability metrics. You understand that the goal isn't
|
||||
100% test pass rate—it
|
||||
|
||||
## Capabilities
|
||||
|
||||
- agent-testing
|
||||
- benchmark-design
|
||||
- capability-assessment
|
||||
- reliability-metrics
|
||||
- regression-testing
|
||||
|
||||
## Requirements
|
||||
|
||||
- testing-fundamentals
|
||||
- llm-fundamentals
|
||||
|
||||
## Patterns
|
||||
|
||||
### Statistical Test Evaluation
|
||||
|
||||
Run tests multiple times and analyze result distributions
|
||||
|
||||
### Behavioral Contract Testing
|
||||
|
||||
Define and test agent behavioral invariants
|
||||
|
||||
### Adversarial Testing
|
||||
|
||||
Actively try to break agent behavior
|
||||
|
||||
## Anti-Patterns
|
||||
|
||||
### ❌ Single-Run Testing
|
||||
|
||||
### ❌ Only Happy Path Tests
|
||||
|
||||
### ❌ Output String Matching
|
||||
|
||||
## ⚠️ Sharp Edges
|
||||
|
||||
| Issue | Severity | Solution |
|
||||
|-------|----------|----------|
|
||||
| Agent scores well on benchmarks but fails in production | high | // Bridge benchmark and production evaluation |
|
||||
| Same test passes sometimes, fails other times | high | // Handle flaky tests in LLM agent evaluation |
|
||||
| Agent optimized for metric, not actual task | medium | // Multi-dimensional evaluation to prevent gaming |
|
||||
| Test data accidentally used in training or prompts | critical | // Prevent data leakage in agent evaluation |
|
||||
|
||||
## Related Skills
|
||||
|
||||
Works well with: `multi-agent-orchestration`, `agent-communication`, `autonomous-agents`
|
||||
Reference in New Issue
Block a user