feat: add voice-ai-engine-development skill for building real-time conversational AI

2026-01-27 07:24:06 +02:00
parent e9783892c1
commit d972c4fa3a
9 changed files with 3360 additions and 0 deletions
--- a/skills/voice-ai-engine-development/references/provider_comparison.md
+++ b/skills/voice-ai-engine-development/references/provider_comparison.md
@@ -0,0 +1,515 @@
+# Provider Comparison Guide
+
+This guide compares providers of transcription, LLM, and text-to-speech (TTS) services to help you choose the right combination for your voice AI engine.
+
+## Transcription Providers
+
+### Deepgram
+
+**Strengths:**
+- ✅ Fastest transcription speed (< 300ms latency)
+- ✅ Excellent streaming support
+- ✅ High accuracy (95%+ on clear audio)
+- ✅ Good pricing ($0.0043/minute)
+- ✅ Nova-2 model optimized for real-time
+- ✅ Excellent documentation
+
+**Weaknesses:**
+- ❌ Less accurate with heavy accents
+- ❌ Smaller company (potential reliability concerns)
+
+**Best For:**
+- Real-time voice conversations
+- Low-latency applications
+- English-language applications
+- Startups and small businesses
+
+**Configuration:**
+```python
+{
+    "transcriberProvider": "deepgram",
+    "deepgramApiKey": "your-api-key",
+    "deepgramModel": "nova-2",
+    "language": "en-US"
+}
+```
+
+---
+
+### AssemblyAI
+
+**Strengths:**
+- ✅ Very high accuracy (96%+ on clear audio)
+- ✅ Excellent with accents and dialects
+- ✅ Good speaker diarization
+- ✅ Competitive pricing ($0.00025/second, i.e. $0.015/minute)
+- ✅ Strong customer support
+
+**Weaknesses:**
+- ❌ Slightly higher latency than Deepgram
+- ❌ Streaming support is newer
+
+**Best For:**
+- Applications requiring highest accuracy
+- Multi-speaker scenarios
+- Diverse user base with accents
+- Enterprise applications
+
+**Configuration:**
+```python
+{
+    "transcriberProvider": "assemblyai",
+    "assemblyaiApiKey": "your-api-key",
+    "language": "en"
+}
+```
+
+---
+
+### Azure Speech
+
+**Strengths:**
+- ✅ Enterprise-grade reliability
+- ✅ Excellent multi-language support (100+ languages)
+- ✅ Strong security and compliance
+- ✅ Integration with Azure ecosystem
+- ✅ Custom model training available
+
+**Weaknesses:**
+- ❌ Higher cost ($1/hour, about $0.017/minute)
+- ❌ More complex setup
+- ❌ Slower than specialized providers
+
+**Best For:**
+- Enterprise applications
+- Multi-language requirements
+- Azure-based infrastructure
+- Compliance-sensitive applications
+
+**Configuration:**
+```python
+{
+    "transcriberProvider": "azure",
+    "azureSpeechKey": "your-key",
+    "azureSpeechRegion": "eastus",
+    "language": "en-US"
+}
+```
+
+---
+
+### Google Cloud Speech
+
+**Strengths:**
+- ✅ Excellent multi-language support (125+ languages)
+- ✅ Good accuracy
+- ✅ Integration with Google Cloud
+- ✅ Automatic punctuation
+- ✅ Speaker diarization
+
+**Weaknesses:**
+- ❌ Higher latency for streaming
+- ❌ Complex pricing model
+- ❌ Requires Google Cloud account
+
+**Best For:**
+- Multi-language applications
+- Google Cloud infrastructure
+- Applications needing speaker diarization
+
+**Configuration:**
+```python
+{
+    "transcriberProvider": "google",
+    "googleCredentials": "path/to/credentials.json",
+    "language": "en-US"
+}
+```
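+
+The four transcriber configurations above differ mainly in which credential keys they expect. As a minimal sketch, a config can be checked for missing keys before it is handed to the engine. The `REQUIRED_TRANSCRIBER_KEYS` table and the `missing_transcriber_keys` helper are hypothetical names introduced here for illustration, not part of the engine; the key names themselves are copied from the examples in this guide.
+
+```python
+# Hypothetical helper, not part of the engine. Key names are copied from
+# the configuration examples in this guide.
+REQUIRED_TRANSCRIBER_KEYS = {
+    "deepgram": ["deepgramApiKey"],
+    "assemblyai": ["assemblyaiApiKey"],
+    "azure": ["azureSpeechKey", "azureSpeechRegion"],
+    "google": ["googleCredentials"],
+}
+
+
+def missing_transcriber_keys(config):
+    """Return the credential keys that are absent or empty for the chosen provider."""
+    provider = config.get("transcriberProvider")
+    if provider not in REQUIRED_TRANSCRIBER_KEYS:
+        raise ValueError(f"Unknown transcriberProvider: {provider!r}")
+    return [key for key in REQUIRED_TRANSCRIBER_KEYS[provider] if not config.get(key)]
+
+
+# The region is missing here, so the helper reports it.
+config = {"transcriberProvider": "azure", "azureSpeechKey": "your-key", "language": "en-US"}
+print(missing_transcriber_keys(config))
+```
+
+Failing fast on a missing key is usually easier to debug than an authentication error raised in the middle of a live call.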
+
+---
+
+## LLM Providers
+
+### OpenAI (GPT-4, GPT-3.5)
+
+**Strengths:**
+- ✅ Highest quality responses
+- ✅ Excellent instruction following
+- ✅ Fast streaming
+- ✅ Large context window (128k tokens for GPT-4 Turbo)
+- ✅ Best-in-class reasoning
+
+**Weaknesses:**
+- ❌ Higher cost ($0.01-0.03/1k tokens for GPT-4 Turbo)
+- ❌ Rate limits can be restrictive
+- ❌ No free tier
+
+**Best For:**
+- High-quality conversational AI
+- Complex reasoning tasks
+- Production applications
+- Enterprise use cases
+
+**Configuration:**
+```python
+{
+    "llmProvider": "openai",
+    "openaiApiKey": "your-api-key",
+    "openaiModel": "gpt-4-turbo",
+    "prompt": "You are a helpful AI assistant."
+}
+```
+
+**Pricing:**
+- GPT-4 Turbo: $0.01/1k input tokens, $0.03/1k output tokens
+- GPT-3.5 Turbo: $0.0005/1k input tokens, $0.0015/1k output tokens
+
+---
+
+### Google Gemini
+
+**Strengths:**
+- ✅ Excellent cost-effectiveness (free tier available)
+- ✅ Multimodal capabilities
+- ✅ Good streaming support
+- ✅ Large context window (up to 1M tokens for Gemini 1.5 Pro)
+- ✅ Fast response times
+
+**Weaknesses:**
+- ❌ Slightly lower quality than GPT-4
+- ❌ Less predictable behavior
+- ❌ Newer, less battle-tested
+
+**Best For:**
+- Cost-sensitive applications
+- Multimodal applications
+- Startups and prototypes
+- High-volume applications
+
+**Configuration:**
+```python
+{
+    "llmProvider": "gemini",
+    "geminiApiKey": "your-api-key",
+    "geminiModel": "gemini-pro",
+    "prompt": "You are a helpful AI assistant."
+}
+```
+
+**Pricing:**
+- Gemini Pro: Free up to 60 requests/minute
+- Gemini Pro (paid): $0.00025/1k input tokens, $0.0005/1k output tokens
+
+---
+
+### Anthropic Claude
+
+**Strengths:**
+- ✅ Excellent safety and alignment
+- ✅ Very long context window (200k tokens)
+- ✅ High-quality responses
+- ✅ Good at following complex instructions
+- ✅ Strong reasoning capabilities
+
+**Weaknesses:**
+- ❌ Higher cost than Gemini
+- ❌ Slower streaming than OpenAI
+- ❌ More conservative responses
+
+**Best For:**
+- Safety-critical applications
+- Long-context applications
+- Nuanced conversations
+- Enterprise applications
+
+**Configuration:**
+```python
+{
+    "llmProvider": "claude",
+    "claudeApiKey": "your-api-key",
+    "claudeModel": "claude-3-opus",
+    "prompt": "You are a helpful AI assistant."
+}
+```
+
+**Pricing:**
+- Claude 3 Opus: $0.015/1k input tokens, $0.075/1k output tokens
+- Claude 3 Sonnet: $0.003/1k input tokens, $0.015/1k output tokens
+
+---
+
+## TTS Providers
+
+### ElevenLabs
+
+**Strengths:**
+- ✅ Most natural-sounding voices
+- ✅ Excellent emotional range
+- ✅ Voice cloning capabilities
+- ✅ Good streaming support
+- ✅ Multiple languages
+
+**Weaknesses:**
+- ❌ Higher cost (roughly $0.17-0.30/1k characters, depending on plan)
+- ❌ Rate limits on lower tiers
+- ❌ Occasional pronunciation errors
+
+**Best For:**
+- Premium voice experiences
+- Customer-facing applications
+- Voice cloning needs
+- High-quality audio requirements
+
+**Configuration:**
+```python
+{
+    "voiceProvider": "elevenlabs",
+    "elevenlabsApiKey": "your-api-key",
+    "elevenlabsVoiceId": "voice-id",
+    "elevenlabsModel": "eleven_monolingual_v1"
+}
+```
+
+**Pricing:**
+- Free: 10k characters/month
+- Starter: $5/month, 30k characters
+- Creator: $22/month, 100k characters
+
+---
+
+### Azure TTS
+
+**Strengths:**
+- ✅ Enterprise-grade reliability
+- ✅ Many languages (100+)
+- ✅ Neural voices available
+- ✅ SSML support for fine control
+- ✅ Good pricing (standard voices $4/1M characters, neural voices $16/1M)
+
+**Weaknesses:**
+- ❌ Less natural than ElevenLabs
+- ❌ More complex setup
+- ❌ Requires Azure account
+
+**Best For:**
+- Enterprise applications
+- Multi-language requirements
+- Azure-based infrastructure
+- Cost-sensitive high-volume applications
+
+**Configuration:**
+```python
+{
+    "voiceProvider": "azure",
+    "azureSpeechKey": "your-key",
+    "azureSpeechRegion": "eastus",
+    "azureVoiceName": "en-US-JennyNeural"
+}
+```
+
+**Pricing:**
+- Neural voices: $16/1M characters
+- Standard voices: $4/1M characters
+
+---
+
+### Google Cloud TTS
+
+**Strengths:**
+- ✅ Good quality neural voices
+- ✅ Many languages (40+)
+- ✅ WaveNet voices available
+- ✅ Competitive pricing (standard voices $4/1M characters, WaveNet and Neural2 $16/1M)
+- ✅ SSML support
+
+**Weaknesses:**
+- ❌ Less natural than ElevenLabs
+- ❌ Requires Google Cloud account
+- ❌ Complex setup
+
+**Best For:**
+- Multi-language applications
+- Google Cloud infrastructure
+- Cost-effective neural voices
+
+**Configuration:**
+```python
+{
+    "voiceProvider": "google",
+    "googleCredentials": "path/to/credentials.json",
+    "googleVoiceName": "en-US-Neural2-F"
+}
+```
+
+**Pricing:**
+- WaveNet voices: $16/1M characters
+- Neural2 voices: $16/1M characters
+- Standard voices: $4/1M characters
+
+---
+
+### Amazon Polly
+
+**Strengths:**
+- ✅ AWS integration
+- ✅ Good pricing (standard voices $4/1M characters, neural voices $16/1M)
+- ✅ Neural voices available
+- ✅ SSML support
+- ✅ Reliable service
+
+**Weaknesses:**
+- ❌ Less natural than ElevenLabs
+- ❌ Fewer voice options
+- ❌ Requires AWS account
+
+**Best For:**
+- AWS-based infrastructure
+- Cost-effective neural voices
+- Enterprise applications
+
+**Configuration:**
+```python
+{
+    "voiceProvider": "polly",
+    "awsAccessKey": "your-access-key",
+    "awsSecretKey": "your-secret-key",
+    "awsRegion": "us-east-1",
+    "pollyVoiceId": "Joanna"
+}
+```
+
+**Pricing:**
+- Neural voices: $16/1M characters
+- Standard voices: $4/1M characters
+
+---
+
+### Play.ht
+
+**Strengths:**
+- ✅ Voice cloning capabilities
+- ✅ Natural-sounding voices
+- ✅ Good streaming support
+- ✅ Easy to use API
+- ✅ Multiple languages
+
+**Weaknesses:**
+- ❌ Higher cost than cloud providers
+- ❌ Smaller company
+- ❌ Less documentation
+
+**Best For:**
+- Voice cloning applications
+- Premium voice experiences
+- Startups and small businesses
+
+**Configuration:**
+```python
+{
+    "voiceProvider": "playht",
+    "playhtApiKey": "your-api-key",
+    "playhtUserId": "your-user-id",
+    "playhtVoiceId": "voice-id"
+}
+```
+
+**Pricing:**
+- Free: 2.5k characters
+- Creator: $31/month, 50k characters
+- Pro: $79/month, 150k characters
+
+---
+
+## Recommended Combinations
+
+### Budget-Conscious Startup
+```python
+{
+    "transcriberProvider": "deepgram",  # Fast and affordable
+    "llmProvider": "gemini",            # Free tier available
+    "voiceProvider": "google"           # Cost-effective neural voices
+}
+```
+**Estimated cost:** ~$0.01 per minute of conversation
+
+---
+
+### Premium Experience
+```python
+{
+    "transcriberProvider": "assemblyai",  # Highest accuracy
+    "llmProvider": "openai",              # Best quality responses
+    "voiceProvider": "elevenlabs"         # Most natural voices
+}
+```
+**Estimated cost:** ~$0.05 per minute of conversation
+
+---
+
+### Enterprise Application
+```python
+{
+    "transcriberProvider": "azure",  # Enterprise reliability
+    "llmProvider": "openai",         # Best quality
+    "voiceProvider": "azure"         # Enterprise reliability
+}
+```
+**Estimated cost:** ~$0.03 per minute of conversation
+
+---
+
+### Multi-Language Application
+```python
+{
+    "transcriberProvider": "google",  # 125+ languages
+    "llmProvider": "gemini",          # Good multi-language support
+    "voiceProvider": "google"         # 40+ languages
+}
+```
+**Estimated cost:** ~$0.02 per minute of conversation
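+
+The per-minute estimates in this section can be sanity-checked against the unit prices listed earlier in this guide. The sketch below is one way to do that, not the method used to produce the figures above. The usage defaults (`input_tokens`, `output_tokens`, `tts_chars` per conversation minute) are assumptions rather than measurements, and `cost_per_minute` is a hypothetical helper. Google Cloud Speech is left out because this guide lists no unit price for it.
+
+```python
+# Prices (USD) are copied from this guide and will drift over time;
+# check each provider's pricing page before relying on them.
+TRANSCRIBER_PER_MINUTE = {
+    "deepgram": 0.0043,
+    "assemblyai": 0.00025 * 60,  # quoted per second
+    "azure": 1.0 / 60,           # quoted per hour
+}
+LLM_PER_1K_TOKENS = {  # (input price, output price)
+    "openai": (0.01, 0.03),       # GPT-4 Turbo
+    "gemini": (0.00025, 0.0005),  # Gemini Pro (paid)
+    "claude": (0.015, 0.075),     # Claude 3 Opus
+}
+TTS_PER_1K_CHARS = {
+    "google": 0.016,     # Neural2/WaveNet at $16/1M characters
+    "azure": 0.016,      # neural voices
+    "polly": 0.016,      # neural voices
+    "elevenlabs": 0.30,  # the $0.30/1k characters figure listed in this guide
+}
+
+
+def cost_per_minute(config, input_tokens=1000, output_tokens=150, tts_chars=450):
+    """Estimate USD per conversation minute.
+
+    The three usage defaults are assumptions about one minute of dialogue;
+    replace them with numbers measured from your own traffic.
+    """
+    stt = TRANSCRIBER_PER_MINUTE[config["transcriberProvider"]]
+    in_price, out_price = LLM_PER_1K_TOKENS[config["llmProvider"]]
+    llm = input_tokens / 1000 * in_price + output_tokens / 1000 * out_price
+    tts = tts_chars / 1000 * TTS_PER_1K_CHARS[config["voiceProvider"]]
+    return stt + llm + tts
+
+
+budget = {"transcriberProvider": "deepgram", "llmProvider": "gemini", "voiceProvider": "google"}
+premium = {"transcriberProvider": "assemblyai", "llmProvider": "openai", "voiceProvider": "elevenlabs"}
+print(f"budget:  ${cost_per_minute(budget):.4f}/min")
+print(f"premium: ${cost_per_minute(premium):.4f}/min")
+```
+
+With these assumed defaults the budget stack lands near $0.012 per minute, while the premium stack comes out near $0.16 because TTS characters dominate the total. The result is very sensitive to how much the agent speaks, so treat every per-minute figure as a rough guide until you have measured your own usage.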
+
+---
+
+## Decision Matrix
+
+| Priority | Transcriber | LLM | TTS |
+|----------|-------------|-----|-----|
+| **Lowest Cost** | Deepgram | Gemini | Google |
+| **Highest Quality** | AssemblyAI | OpenAI | ElevenLabs |
+| **Fastest Speed** | Deepgram | OpenAI | ElevenLabs |
+| **Enterprise** | Azure | OpenAI | Azure |
+| **Multi-Language** | Google | Gemini | Google |
+| **Voice Cloning** | N/A | N/A | ElevenLabs/Play.ht |
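+
+The matrix can also be kept as data, so that a priority name selects a provider combination. This is a minimal sketch: `DECISION_MATRIX` and `config_for` are hypothetical names, the provider ids are the ones used in the configuration examples in this guide, and the voice-cloning row only sets the TTS provider because the other two columns are N/A.
+
+```python
+from typing import Optional
+
+# Rows of the decision matrix, keyed by a hypothetical priority name.
+DECISION_MATRIX = {
+    "lowest_cost": {"transcriberProvider": "deepgram", "llmProvider": "gemini", "voiceProvider": "google"},
+    "highest_quality": {"transcriberProvider": "assemblyai", "llmProvider": "openai", "voiceProvider": "elevenlabs"},
+    "fastest_speed": {"transcriberProvider": "deepgram", "llmProvider": "openai", "voiceProvider": "elevenlabs"},
+    "enterprise": {"transcriberProvider": "azure", "llmProvider": "openai", "voiceProvider": "azure"},
+    "multi_language": {"transcriberProvider": "google", "llmProvider": "gemini", "voiceProvider": "google"},
+    "voice_cloning": {"voiceProvider": "elevenlabs"},  # "playht" is the listed alternative
+}
+
+
+def config_for(priority: str, base: Optional[dict] = None) -> dict:
+    """Overlay the matrix row for `priority` onto an optional base config."""
+    if priority not in DECISION_MATRIX:
+        raise KeyError(f"Unknown priority: {priority!r}")
+    return {**(base or {}), **DECISION_MATRIX[priority]}
+
+
+# Start from the budget stack, then switch only the voice for cloning support.
+config = config_for("voice_cloning", base=config_for("lowest_cost"))
+print(config)
+```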
+
+---
+
+## Testing Recommendations
+
+Before committing to providers, test with your specific use case:
+
+1. **Create test conversations** with representative audio
+2. **Measure latency** end-to-end
+3. **Evaluate quality** with real users
+4. **Calculate costs** based on expected volume
+5. **Test edge cases** (accents, background noise, interruptions)
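+
+For step 2, one simple approach is to time each pipeline stage and the whole turn with a small context manager. The sketch below uses only the Python standard library. `LatencyLog` is a hypothetical helper, and the `time.sleep` calls are stand-ins for real transcriber, LLM, and synthesizer calls.
+
+```python
+import time
+from contextlib import contextmanager
+from statistics import median
+
+
+class LatencyLog:
+    """Collects wall-clock durations (in milliseconds) per named pipeline stage."""
+
+    def __init__(self):
+        self.samples = {}
+
+    @contextmanager
+    def stage(self, name):
+        start = time.perf_counter()
+        try:
+            yield
+        finally:
+            # Record the duration even if the stage raised.
+            elapsed_ms = (time.perf_counter() - start) * 1000.0
+            self.samples.setdefault(name, []).append(elapsed_ms)
+
+    def summary(self):
+        return {
+            name: {"count": len(values), "median_ms": median(values), "max_ms": max(values)}
+            for name, values in self.samples.items()
+        }
+
+
+log = LatencyLog()
+for _ in range(3):  # three simulated conversation turns
+    with log.stage("end_to_end"):
+        with log.stage("transcribe"):
+            time.sleep(0.01)  # stand-in for the transcriber call
+        with log.stage("llm"):
+            time.sleep(0.02)  # stand-in for the LLM call
+        with log.stage("tts"):
+            time.sleep(0.01)  # stand-in for the synthesizer call
+
+for name, stats in log.summary().items():
+    print(name, stats)
+```
+
+Comparing the per-stage medians with the end-to-end median shows which provider contributes most to the delay a user actually hears.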
+
+---
+
+## Switching Providers
+
+Components are created through the multi-provider factory, so switching providers only requires a configuration change:
+
+```python
+# Just change the configuration
+config = {
+    "transcriberProvider": "deepgram",  # Change to "assemblyai"
+    "llmProvider": "gemini",            # Change to "openai"
+    "voiceProvider": "google"           # Change to "elevenlabs"
+}
+
+# No code changes needed!
+factory = VoiceComponentFactory()
+transcriber = factory.create_transcriber(config)
+agent = factory.create_agent(config)
+synthesizer = factory.create_synthesizer(config)
+```
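+
+For readers unfamiliar with the pattern, the sketch below shows one way a factory of this kind can be structured: a registry maps each provider id to a builder, and the `create_*` methods look up the builder named in the config. This is an illustration under assumptions, not the actual `VoiceComponentFactory`, which is defined elsewhere in this skill and may differ. `SketchVoiceFactory`, `StubComponent`, and `register` are hypothetical names.
+
+```python
+class StubComponent:
+    """Placeholder for a real transcriber, agent, or synthesizer."""
+
+    def __init__(self, provider, config):
+        self.provider = provider
+        self.config = config
+
+
+class SketchVoiceFactory:
+    """Illustrative registry-based factory (hypothetical, not the skill's class)."""
+
+    def __init__(self):
+        # One registry per config key: provider id -> builder(config).
+        self._registries = {"transcriberProvider": {}, "llmProvider": {}, "voiceProvider": {}}
+
+    def register(self, kind, provider, builder):
+        self._registries[kind][provider] = builder
+
+    def _create(self, kind, config):
+        provider = config[kind]
+        try:
+            builder = self._registries[kind][provider]
+        except KeyError:
+            raise ValueError(f"No {kind} registered for {provider!r}") from None
+        return builder(config)
+
+    def create_transcriber(self, config):
+        return self._create("transcriberProvider", config)
+
+    def create_agent(self, config):
+        return self._create("llmProvider", config)
+
+    def create_synthesizer(self, config):
+        return self._create("voiceProvider", config)
+
+
+factory = SketchVoiceFactory()
+factory.register("transcriberProvider", "deepgram", lambda cfg: StubComponent("deepgram", cfg))
+factory.register("llmProvider", "gemini", lambda cfg: StubComponent("gemini", cfg))
+factory.register("voiceProvider", "google", lambda cfg: StubComponent("google", cfg))
+
+config = {"transcriberProvider": "deepgram", "llmProvider": "gemini", "voiceProvider": "google"}
+transcriber = factory.create_transcriber(config)
+print(transcriber.provider)
+```
+
+Because callers depend only on the `create_*` methods, adding a provider means registering one more builder rather than touching the calling code.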