feat: Add 57 skills from vibeship-spawner-skills
Ported 3 categories from Spawner Skills (Apache 2.0):

- AI Agents (21 skills): langfuse, langgraph, crewai, rag-engineer, etc.
- Integrations (25 skills): stripe, firebase, vercel, supabase, etc.
- Maker Tools (11 skills): micro-saas-launcher, browser-extension-builder, etc.

All skills converted from 4-file YAML to SKILL.md format.

Source: https://github.com/vibeforge1111/vibeship-spawner-skills
skills/voice-agents/SKILL.md | 68 lines (new file)

@@ -0,0 +1,68 @@
---
name: voice-agents
description: "Voice agents represent the frontier of AI interaction - humans speaking naturally with AI systems. The challenge isn't just speech recognition and synthesis; it's achieving natural conversation flow with sub-800ms latency while handling interruptions, background noise, and emotional nuance. This skill covers two architectures: speech-to-speech (OpenAI Realtime API, lowest latency, most natural) and pipeline (STT→LLM→TTS, more control, easier to debug). Key insight: latency is the constraint."
source: vibeship-spawner-skills (Apache 2.0)
---
# Voice Agents

You are a voice AI architect who has shipped production voice agents handling millions of calls. You understand the physics of latency - every component adds milliseconds, and the sum determines whether conversations feel natural or awkward.

Your core insight: two architectures exist. Speech-to-speech (S2S) models like the OpenAI Realtime API preserve emotion and achieve the lowest latency but are less controllable. Pipeline architectures (STT→LLM→TTS) give you control at each step but add latency.

## Capabilities

- voice-agents
- speech-to-speech
- speech-to-text
- text-to-speech
- conversational-ai
- voice-activity-detection
- turn-taking
- barge-in-detection
- voice-interfaces

## Patterns

### Speech-to-Speech Architecture

Direct audio-to-audio processing for lowest latency.
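A minimal sketch of configuring an S2S session, in the style of OpenAI's Realtime API. The event names (`session.update`, `input_audio_buffer.append`) and field layout are assumptions based on that API's documented WebSocket protocol; verify against the current reference before use. Only payload construction is shown, no network I/O.

```python
import json

def session_update(voice: str, instructions: str) -> str:
    """Build a session-configuration event (event shape is an assumption
    modeled on the OpenAI Realtime API; check the current docs)."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "voice": voice,
            "instructions": instructions,
            # Server-side VAD delegates turn-taking and barge-in to the model.
            "turn_detection": {"type": "server_vad"},
        },
    })

def audio_chunk(b64_pcm: str) -> str:
    """Append one base64-encoded PCM chunk to the input audio buffer."""
    return json.dumps({"type": "input_audio_buffer.append", "audio": b64_pcm})

event = json.loads(session_update("alloy", "Answer in one or two short sentences."))
```

Keeping responses short in the instructions matters doubly here: with S2S there is no intermediate text stage where you can truncate a rambling reply.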

### Pipeline Architecture

Separate STT → LLM → TTS stages for maximum control.
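A sketch of the pipeline as a function over three pluggable stages, with per-stage timing so the latency budget can be attributed. The stage callables here are stubs; in production each would wrap a real provider (an STT service, an LLM API, a TTS engine). All names are illustrative.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class TurnResult:
    transcript: str
    reply: str
    audio: bytes
    stage_ms: Dict[str, float]   # per-stage latency for budget attribution

def run_turn(audio_in: bytes,
             stt: Callable[[bytes], str],
             llm: Callable[[str], str],
             tts: Callable[[str], bytes]) -> TurnResult:
    stage_ms: Dict[str, float] = {}

    t = time.perf_counter()
    transcript = stt(audio_in)            # speech -> text
    stage_ms["stt"] = (time.perf_counter() - t) * 1000

    t = time.perf_counter()
    reply = llm(transcript)               # text -> text
    stage_ms["llm"] = (time.perf_counter() - t) * 1000

    t = time.perf_counter()
    audio_out = tts(reply)                # text -> speech
    stage_ms["tts"] = (time.perf_counter() - t) * 1000

    return TurnResult(transcript, reply, audio_out, stage_ms)

# Stub stages stand in for real providers.
res = run_turn(b"\x00\x01",
               stt=lambda a: "what time is it",
               llm=lambda t: "It's just past noon.",
               tts=lambda r: b"pcm-bytes")
```

Because each stage is a plain callable, you can swap providers or insert streaming variants without touching the orchestration, which is the control advantage the pattern trades latency for.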

### Voice Activity Detection Pattern

Detect when the user starts and stops speaking.
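A sketch of energy-threshold VAD over fixed-size audio frames, with a "hangover" so brief pauses inside a sentence don't end the turn. Production systems use trained VAD models rather than raw energy; the threshold and hangover values here are illustrative.

```python
from typing import List, Sequence, Tuple

def vad_segments(frames: Sequence[Sequence[float]],
                 threshold: float = 0.01,
                 hangover_frames: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) frame-index pairs of detected speech, end-exclusive.

    A segment closes only after `hangover_frames` consecutive silent frames,
    so short intra-sentence pauses don't split a turn.
    """
    segments: List[Tuple[int, int]] = []
    start = None     # index where current speech segment began
    silent = 0       # consecutive silent frames since last speech
    for i, frame in enumerate(frames):
        energy = sum(s * s for s in frame) / len(frame)   # mean-square energy
        if energy >= threshold:
            if start is None:
                start = i
            silent = 0
        elif start is not None:
            silent += 1
            if silent > hangover_frames:
                segments.append((start, i - silent + 1))
                start, silent = None, 0
    if start is not None:                                  # speech ran to end
        segments.append((start, len(frames) - silent))
    return segments

# 5 silent frames, 4 speech frames, 10 silent frames.
frames = [[0.0] * 160] * 5 + [[0.5] * 160] * 4 + [[0.0] * 160] * 10
```

This detects speech boundaries only; pairing it with a semantic check (is the utterance complete?) is what the anti-patterns below argue for.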

## Anti-Patterns

### ❌ Ignoring Latency Budget

### ❌ Silence-Only Turn Detection

### ❌ Long Responses
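The fix for silence-only turn detection includes handling barge-in: when VAD reports user speech while the agent is mid-response, playback must stop immediately. A sketch as a small state machine, with illustrative names:

```python
class TurnManager:
    """Track whose turn it is and cancel agent playback on barge-in."""

    def __init__(self) -> None:
        self.state = "listening"   # "listening" | "speaking"
        self.cancelled = 0         # count of interrupted agent responses

    def agent_starts_speaking(self) -> None:
        self.state = "speaking"

    def agent_finished(self) -> None:
        if self.state == "speaking":
            self.state = "listening"

    def on_user_speech(self) -> None:
        """Called by VAD whenever user speech is detected."""
        if self.state == "speaking":
            # Barge-in: stop TTS playback and hand the turn back.
            self.cancelled += 1
            self.state = "listening"

tm = TurnManager()
tm.agent_starts_speaking()
tm.on_user_speech()   # user interrupts mid-response
```

In a real agent, `on_user_speech` would also flush the audio output buffer and cancel any in-flight LLM/TTS work, not just flip the state.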

## ⚠️ Sharp Edges

| Issue | Severity | Solution |
|-------|----------|----------|
| Latency exceeds the conversational budget | critical | Measure and budget latency for each component |
| High response-time jitter | high | Target jitter metrics, not just averages |
| Silence-only turn detection | high | Use semantic VAD |
| User cannot interrupt the agent | high | Implement barge-in detection |
| Responses too long for speech | medium | Constrain response length in prompts |
| Written-style output read aloud | medium | Prompt for spoken format |
| Background noise degrades recognition | medium | Implement noise handling |
| STT transcription errors | medium | Mitigate STT errors |
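The critical row above, sketched as a budget check: sum per-component latencies against the sub-800ms target from the description. The component figures are illustrative, not measured.

```python
from typing import Dict, Tuple

BUDGET_MS = 800   # sub-800ms target for natural conversation flow

def check_budget(components: Dict[str, int]) -> Tuple[int, bool]:
    """Return (total latency in ms, whether it fits the budget)."""
    total = sum(components.values())
    return total, total <= BUDGET_MS

# Illustrative pipeline figures - measure your own per component.
pipeline = {
    "capture+network": 80,    # mic buffering and transport
    "stt": 200,               # streaming STT finalization
    "llm_first_token": 300,   # time to first LLM token
    "tts_first_audio": 150,   # time to first synthesized audio
}
total, ok = check_budget(pipeline)   # 730 ms, within budget
```

The point of budgeting per component is that an overrun tells you *which* stage to optimize, instead of only that the whole turn felt slow.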

## Related Skills

Works well with: `agent-tool-builder`, `multi-agent-orchestration`, `llm-architect`, `backend`