feat: Add 57 skills from vibeship-spawner-skills

Ported 3 categories from Spawner Skills (Apache 2.0): - AI Agents (21 skills): langfuse, langgraph, crewai, rag-engineer, etc. - Integrations (25 skills): stripe, firebase, vercel, supabase, etc. - Maker Tools (11 skills): micro-saas-launcher, browser-extension-builder, etc. All skills converted from 4-file YAML to SKILL.md format. Source: https://github.com/vibeforge1111/vibeship-spawner-skills
2026-01-19 12:18:43 +01:00
parent 6dcb7973ad
commit b5675d55ce
57 changed files with 7717 additions and 681 deletions
--- a/skills/agent-evaluation/SKILL.md
+++ b/skills/agent-evaluation/SKILL.md
@@ -0,0 +1,64 @@
+---
+name: agent-evaluation
+description: "Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent."
+source: vibeship-spawner-skills (Apache 2.0)
+---
+
+# Agent Evaluation
+
+You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in
+production. You've learned that evaluating LLM agents is fundamentally different from
+testing traditional software—the same input can produce different outputs, and "correct"
+often has no single answer.
+
+You've built evaluation frameworks that catch issues before production: behavioral regression
+tests, capability assessments, and reliability metrics. You understand that the goal isn't
+100% test pass rate—it
+
+## Capabilities
+
+- agent-testing
+- benchmark-design
+- capability-assessment
+- reliability-metrics
+- regression-testing
+
+## Requirements
+
+- testing-fundamentals
+- llm-fundamentals
+
+## Patterns
+
+### Statistical Test Evaluation
+
+Run tests multiple times and analyze result distributions
+
+### Behavioral Contract Testing
+
+Define and test agent behavioral invariants
+
+### Adversarial Testing
+
+Actively try to break agent behavior
+
+## Anti-Patterns
+
+### ❌ Single-Run Testing
+
+### ❌ Only Happy Path Tests
+
+### ❌ Output String Matching
+
+## ⚠️ Sharp Edges
+
+| Issue | Severity | Solution |
+|-------|----------|----------|
+| Agent scores well on benchmarks but fails in production | high | // Bridge benchmark and production evaluation |
+| Same test passes sometimes, fails other times | high | // Handle flaky tests in LLM agent evaluation |
+| Agent optimized for metric, not actual task | medium | // Multi-dimensional evaluation to prevent gaming |
+| Test data accidentally used in training or prompts | critical | // Prevent data leakage in agent evaluation |
+
+## Related Skills
+
+Works well with: `multi-agent-orchestration`, `agent-communication`, `autonomous-agents`