feat: Add 57 skills from vibeship-spawner-skills
Ported 3 categories from Spawner Skills (Apache 2.0):
- AI Agents (21 skills): langfuse, langgraph, crewai, rag-engineer, etc.
- Integrations (25 skills): stripe, firebase, vercel, supabase, etc.
- Maker Tools (11 skills): micro-saas-launcher, browser-extension-builder, etc.

All skills converted from 4-file YAML to SKILL.md format.
Source: https://github.com/vibeforge1111/vibeship-spawner-skills
302 skills/voice-ai-development/SKILL.md Normal file

@@ -0,0 +1,302 @@
---
name: voice-ai-development
description: "Expert in building voice AI applications - from real-time voice agents to voice-enabled apps. Covers OpenAI Realtime API, Vapi for voice agents, Deepgram for transcription, ElevenLabs for synthesis, LiveKit for real-time infrastructure, and WebRTC fundamentals. Knows how to build low-latency, production-ready voice experiences. Use when: voice ai, voice agent, speech to text, text to speech, realtime voice."
source: vibeship-spawner-skills (Apache 2.0)
---

# Voice AI Development

**Role**: Voice AI Architect

You are an expert in building real-time voice applications. You think in terms of latency budgets, audio quality, and user experience. You know that voice apps feel magical when fast and broken when slow. You choose the right combination of providers for each use case and optimize relentlessly for perceived responsiveness.
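For intuition, a per-turn latency budget can be written down and checked against a target. The figures below are illustrative assumptions for planning, not measurements.

```python
# Illustrative per-stage latency targets (ms) for one streamed voice turn.
# These numbers are assumptions, not benchmarks.
LATENCY_BUDGET_MS = {
    "vad_end_of_speech": 300,   # silence needed to decide the user stopped talking
    "stt_finalization": 150,    # finalizing the transcript after end of speech
    "llm_first_token": 350,     # time to the first streamed LLM token
    "tts_first_audio": 200,     # time to the first audio chunk from TTS
}

def perceived_delay_ms(budget: dict) -> int:
    """The user's perceived delay is roughly the sum of the serial stages."""
    return sum(budget.values())

assert perceived_delay_ms(LATENCY_BUDGET_MS) <= 1000, "aim for ~1s or less before speech starts"
```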

## Capabilities

- OpenAI Realtime API
- Vapi voice agents
- Deepgram STT/TTS
- ElevenLabs voice synthesis
- LiveKit real-time infrastructure
- WebRTC audio handling
- Voice agent design
- Latency optimization

## Requirements

- Python or Node.js
- API keys for providers
- Audio handling knowledge

## Patterns

### OpenAI Realtime API

Native voice-to-voice with GPT-4o.

**When to use**: When you want integrated voice AI without separate STT/TTS.

```python
import asyncio
import websockets
import json
import base64

OPENAI_API_KEY = "sk-..."

async def voice_session():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1"
    }

    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",  # alloy, echo, fable, onyx, nova, shimmer
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {
                    "model": "whisper-1"
                },
                "turn_detection": {
                    "type": "server_vad",  # Voice activity detection
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 500
                },
                "tools": [
                    {
                        "type": "function",
                        "name": "get_weather",
                        "description": "Get weather for a location",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "location": {"type": "string"}
                            }
                        }
                    }
                ]
            }
        }))

        # Send audio (PCM16, 24kHz, mono)
        async def send_audio(audio_bytes):
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_bytes).decode()
            }))

        # Receive events
        async for message in ws:
            event = json.loads(message)

            if event["type"] == "response.audio.delta":
                # Base64-encoded PCM16 audio chunk from the model
                audio_chunk = base64.b64decode(event["delta"])
                # Play or buffer audio_chunk here
            elif event["type"] == "error":
                print(event["error"])

asyncio.run(voice_session())
```
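As a companion to the snippet above, here is a minimal sketch of capturing microphone audio in the format the session expects (PCM16, 24 kHz, mono) and feeding it to `send_audio`. It assumes the `sounddevice` package; the blocking read and fixed chunk size are simplifications to adapt to your event loop.

```python
import asyncio
import sounddevice as sd

async def stream_microphone(send_audio, chunk_seconds: float = 0.1):
    """Read raw PCM16 mono audio at 24 kHz and forward it to the session."""
    samplerate = 24000
    frames_per_chunk = int(samplerate * chunk_seconds)
    with sd.RawInputStream(samplerate=samplerate, channels=1, dtype="int16") as mic:
        while True:
            data, _overflowed = mic.read(frames_per_chunk)  # blocking read (sketch)
            await send_audio(bytes(data))
            await asyncio.sleep(0)  # yield control back to the event loop
```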

### Vapi Voice Agent

Build voice agents with the Vapi platform.

**When to use**: Phone-based agents, quick deployment.

```python
# Vapi provides hosted voice agents with webhooks

from flask import Flask, request, jsonify
import vapi

app = Flask(__name__)
client = vapi.Vapi(api_key="...")

# Create an assistant
assistant = client.assistants.create(
    name="Support Agent",
    model={
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful support agent..."
            }
        ]
    },
    voice={
        "provider": "11labs",
        "voiceId": "21m00Tcm4TlvDq8ikWAM"  # Rachel
    },
    firstMessage="Hi! How can I help you today?",
    transcriber={
        "provider": "deepgram",
        "model": "nova-2"
    }
)

# Webhook for conversation events
@app.route("/vapi/webhook", methods=["POST"])
def vapi_webhook():
    event = request.json

    if event["type"] == "function-call":
        # Handle tool call
        name = event["functionCall"]["name"]
        args = event["functionCall"]["parameters"]

        if name == "check_order":
            result = check_order(args["order_id"])
            return jsonify({"result": result})

    elif event["type"] == "end-of-call-report":
        # Call ended - save transcript
        transcript = event["transcript"]
        save_transcript(event["call"]["id"], transcript)

    return jsonify({"ok": True})

# Start outbound call
call = client.calls.create(
    assistant_id=assistant.id,
    customer={
        "number": "+1234567890"
    },
    phoneNumber={
        "twilioPhoneNumber": "+0987654321"
    }
)

# Or create a web call
web_call = client.calls.create(
    assistant_id=assistant.id,
    type="web"
)
# Returns a URL for the WebRTC connection
```
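The webhook above calls `check_order` and `save_transcript`, which are not defined in the snippet. Hypothetical stubs like the following, continuing the Flask app above, make it runnable end to end; swap in your own order lookup and storage.

```python
def check_order(order_id: str) -> str:
    # Hypothetical stand-in: look up the order and return a short, speakable summary.
    return f"Order {order_id} is out for delivery."

def save_transcript(call_id: str, transcript: str) -> None:
    # Hypothetical stand-in: persist wherever you keep call logs.
    with open(f"transcript-{call_id}.txt", "w") as f:
        f.write(transcript)

if __name__ == "__main__":
    # Vapi must reach the webhook over a public URL (use a tunnel in development).
    app.run(port=5000)
```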

### Deepgram STT + ElevenLabs TTS

Best-in-class transcription and synthesis.

**When to use**: High-quality voice, custom pipeline.

```python
import asyncio
from deepgram import DeepgramClient, LiveTranscriptionEvents
from elevenlabs import ElevenLabs

# Deepgram real-time transcription
deepgram = DeepgramClient(api_key="...")

async def transcribe_stream(audio_stream):
    connection = deepgram.listen.live.v("1")

    async def on_transcript(result):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"Heard: {transcript}")
            if result.is_final:
                # Process final transcript
                await handle_user_input(transcript)

    connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

    await connection.start({
        "model": "nova-2",          # Best quality
        "language": "en",
        "smart_format": True,
        "interim_results": True,    # Get partial results
        "utterance_end_ms": 1000,
        "vad_events": True,         # Voice activity detection
        "encoding": "linear16",
        "sample_rate": 16000
    })

    # Stream audio
    async for chunk in audio_stream:
        await connection.send(chunk)

    await connection.finish()

# ElevenLabs streaming synthesis
eleven = ElevenLabs(api_key="...")

def text_to_speech_stream(text: str):
    """Stream TTS audio chunks."""
    audio_stream = eleven.text_to_speech.convert_as_stream(
        voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
        model_id="eleven_turbo_v2_5",     # Fastest
        text=text,
        output_format="pcm_24000"         # Raw PCM for low latency
    )

    for chunk in audio_stream:
        yield chunk

# Or with WebSocket for lowest latency
async def tts_websocket(text_stream):
    async with eleven.text_to_speech.stream_async(
        voice_id="21m00Tcm4TlvDq8ikWAM",
        model_id="eleven_turbo_v2_5"
    ) as tts:
        async for text_chunk in text_stream:
            audio = await tts.send(text_chunk)
            yield audio

        # Flush remaining audio
        final_audio = await tts.flush()
        yield final_audio
```

## Anti-Patterns

### ❌ Non-streaming Pipeline

**Why bad**: Adds seconds of latency. The user perceives the app as slow, and conversation flow is lost.

**Instead**: Stream everything:
- STT: interim results
- LLM: token streaming
- TTS: chunk streaming

Start TTS before the LLM finishes (see the sketch below).
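A minimal sketch of that shape, assuming the `openai` package and reusing `text_to_speech_stream` from the Deepgram + ElevenLabs pattern above; `play_audio` is whatever plays or enqueues PCM chunks in your app.

```python
from openai import OpenAI

client = OpenAI()

def speak_streaming_reply(user_text: str, play_audio):
    """Flush LLM output to TTS at sentence boundaries so speech starts early."""
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_text}],
        stream=True,
    )
    buffer = ""
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        # Synthesize as soon as a sentence (or long clause) is complete
        if buffer.endswith((".", "!", "?")) or len(buffer) > 120:
            for audio_chunk in text_to_speech_stream(buffer):
                play_audio(audio_chunk)
            buffer = ""
    if buffer.strip():  # flush whatever is left
        for audio_chunk in text_to_speech_stream(buffer):
            play_audio(audio_chunk)
```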

### ❌ Ignoring Interruptions

**Why bad**: Frustrating user experience. It feels like talking to a machine and wastes the user's time.

**Instead**: Implement barge-in detection: use VAD to detect user speech, stop TTS immediately, and clear the audio queue (see the sketch below).
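One way to wire that up is a small playback controller whose interrupt flag is set from the VAD speech-start event; the names here are illustrative, not from any particular SDK.

```python
import asyncio

class PlaybackController:
    """Queues TTS audio and supports barge-in: interrupt, then drop queued audio."""

    def __init__(self):
        self.queue: "asyncio.Queue[bytes]" = asyncio.Queue()
        self.interrupted = asyncio.Event()

    async def play_loop(self, play_chunk):
        # play_chunk is whatever sends PCM to the speaker / WebRTC track
        while True:
            chunk = await self.queue.get()
            if self.interrupted.is_set():
                continue  # drain without playing stale audio
            await play_chunk(chunk)

    def on_user_speech_started(self):
        """Call from the VAD / STT speech-start event to stop TTS immediately."""
        self.interrupted.set()
        while not self.queue.empty():
            self.queue.get_nowait()  # clear the audio queue

    def on_agent_response_started(self):
        self.interrupted.clear()
```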

### ❌ Single Provider Lock-in

**Why bad**: A single provider may not give the best quality at every stage, becomes a single point of failure, and is harder to optimize.

**Instead**: Mix the best providers:
- Deepgram for STT (speed + accuracy)
- ElevenLabs for TTS (voice quality)
- OpenAI/Anthropic for the LLM

## Limitations

- Latency varies by provider
- Cost per minute adds up
- Quality depends on the network
- Debugging is complex

## Related Skills

Works well with: `langgraph`, `structured-output`, `langfuse`