feat: add voice-ai-engine-development skill for building real-time conversational AI
skills/voice-ai-engine-development/references/common_pitfalls.md
@@ -0,0 +1,471 @@

# Common Pitfalls and Solutions

This document covers common issues encountered when building voice AI engines and their solutions.

## 1. Audio Jumping/Cutting Off

### Problem
The bot's audio jumps or cuts off mid-response, creating a jarring user experience.

### Symptoms
- Audio plays in fragments
- Sentences are incomplete
- Multiple audio streams overlap
- Unnatural pauses or gaps

### Root Cause
Sending text to the synthesizer in small chunks (sentence-by-sentence or word-by-word) causes multiple TTS API calls. Each call generates a separate audio stream, resulting in:
- Multiple audio files being played sequentially
- Timing issues between chunks
- Potential overlapping audio
- Inconsistent voice characteristics between chunks

### Solution
Buffer the entire LLM response before sending it to the synthesizer:

**❌ Bad: Yields sentence-by-sentence**
```python
async def generate_response(self, prompt):
    async for sentence in llm_stream:
        # This creates multiple TTS calls!
        yield GeneratedResponse(message=BaseMessage(text=sentence))
```

**✅ Good: Buffer entire response**
```python
async def generate_response(self, prompt):
    # Buffer the entire response
    full_response = ""
    async for chunk in llm_stream:
        full_response += chunk

    # Yield once with complete response
    yield GeneratedResponse(message=BaseMessage(text=full_response))
```
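
To make the root cause concrete, the following is a minimal, runnable sketch that counts synthesizer requests for both approaches. `CountingSynthesizer` and `fake_llm_stream` are hypothetical stand-ins invented for this illustration; they are not part of any real TTS or LLM client.

```python
import asyncio


class CountingSynthesizer:
    """Hypothetical TTS stub that records every request it receives."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    async def synthesize(self, text: str) -> None:
        self.calls.append(text)


async def fake_llm_stream():
    """Hypothetical LLM stream that yields one sentence at a time."""
    for sentence in ("Hi there. ", "Your order shipped. ", "Anything else?"):
        yield sentence


async def speak_per_sentence(synth: CountingSynthesizer) -> None:
    # One TTS request per streamed sentence.
    async for sentence in fake_llm_stream():
        await synth.synthesize(sentence)


async def speak_buffered(synth: CountingSynthesizer) -> None:
    # Collect the whole response, then issue a single TTS request.
    parts = []
    async for chunk in fake_llm_stream():
        parts.append(chunk)
    await synth.synthesize("".join(parts))


async def main():
    per_sentence, buffered = CountingSynthesizer(), CountingSynthesizer()
    await speak_per_sentence(per_sentence)
    await speak_buffered(buffered)
    return len(per_sentence.calls), len(buffered.calls), buffered.calls[0]


per_sentence_calls, buffered_calls, buffered_text = asyncio.run(main())
```

In this sketch the per-sentence path issues three requests and the buffered path issues one, which is the difference the synthesizer sees in practice.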

### Why This Works
- Single TTS call for the entire response
- Consistent voice characteristics
- Proper timing and pacing
- No gaps or overlaps

---

## 2. Echo/Feedback Loop

### Problem
The bot hears itself speaking and responds to its own audio, creating an infinite loop.

### Symptoms
- Bot responds to its own speech
- Conversation becomes nonsensical
- Transcriptions include bot's own words
- System becomes unresponsive

### Root Cause
The transcriber continues to process audio while the bot is speaking. If the bot's audio is being played through speakers and captured by the microphone, the transcriber will transcribe the bot's own speech.

### Solution
Mute the transcriber when the bot starts speaking:

```python
# Before sending audio to output
self.transcriber.mute()

# Send audio...
await self.send_speech_to_output(synthesis_result)

# After audio playback complete
self.transcriber.unmute()
```

### Implementation in Transcriber
```python
import asyncio


class BaseTranscriber:
    def __init__(self):
        self.is_muted = False
        # Audio chunks waiting to be transcribed (assumed to be an asyncio.Queue)
        self.input_queue = asyncio.Queue()

    def send_audio(self, chunk: bytes):
        """Client calls this to send audio"""
        if not self.is_muted:
            self.input_queue.put_nowait(chunk)
        else:
            # Send silence instead (prevents echo)
            self.input_queue.put_nowait(self.create_silent_chunk(len(chunk)))

    def mute(self):
        """Called when bot starts speaking"""
        self.is_muted = True

    def unmute(self):
        """Called when bot stops speaking"""
        self.is_muted = False

    def create_silent_chunk(self, size: int) -> bytes:
        """Create a silent audio chunk (zero bytes are silence in signed 16-bit PCM)"""
        return b'\x00' * size
```
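
One detail the snippets above leave open is what happens if playback fails or is cancelled between `mute()` and `unmute()`: the transcriber would stay muted. A minimal sketch of one way to guard against that, assuming a transcriber with the same `mute`/`unmute` surface (`DemoTranscriber`, `speak_with_mute`, and `failing_playback` are hypothetical names used only here):

```python
import asyncio


class DemoTranscriber:
    """Hypothetical stand-in exposing the same mute/unmute surface as above."""

    def __init__(self) -> None:
        self.is_muted = False

    def mute(self) -> None:
        self.is_muted = True

    def unmute(self) -> None:
        self.is_muted = False


async def speak_with_mute(transcriber, play_audio) -> None:
    """Mute for the duration of playback and always unmute afterwards."""
    transcriber.mute()
    try:
        await play_audio()
    finally:
        # Runs on success, on error, and on task cancellation.
        transcriber.unmute()


async def failing_playback() -> None:
    # Simulates the output device failing mid-playback.
    raise RuntimeError("output device closed")


async def main() -> bool:
    transcriber = DemoTranscriber()
    try:
        await speak_with_mute(transcriber, failing_playback)
    except RuntimeError:
        pass
    return transcriber.is_muted


still_muted = asyncio.run(main())
```

Here `still_muted` ends up `False` even though playback raised, so the next user utterance is still transcribed.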

### Why This Works
- Transcriber receives silence while bot speaks
- No transcription of bot's own speech
- Prevents feedback loop
- Maintains audio stream continuity

---

## 3. Interrupts Not Working

### Problem
Users cannot interrupt the bot mid-sentence. The bot continues speaking even when the user starts talking.

### Symptoms
- Bot speaks over user
- User must wait for bot to finish
- Unnatural conversation flow
- Poor user experience

### Root Cause
All audio chunks are sent to the client immediately, buffering the entire message on the client side. By the time an interrupt is detected, all audio has already been sent and is queued for playback.

### Solution
Rate-limit audio chunks to match real-time playback:

**❌ Bad: Send all chunks immediately**
```python
async for chunk in synthesis_result.chunk_generator:
    # Sends all chunks as fast as possible
    output_device.consume_nonblocking(chunk)
```

**✅ Good: Rate-limit chunks**
```python
import asyncio
import time

# Fragment from inside the async method that plays one synthesis result
chunk_idx = 0
async for chunk in synthesis_result.chunk_generator:
    # Check for interrupt
    if stop_event.is_set():
        # Calculate partial message
        partial_message = synthesis_result.get_message_up_to(
            chunk_idx * seconds_per_chunk
        )
        return partial_message, True  # cut_off = True

    start_time = time.time()

    # Send chunk
    output_device.consume_nonblocking(chunk)

    # CRITICAL: Wait for chunk duration before sending next
    processing_time = time.time() - start_time
    await asyncio.sleep(max(seconds_per_chunk - processing_time, 0))

    chunk_idx += 1
```

### Why This Works
- Only one chunk is buffered on client at a time
- Interrupts can stop mid-sentence
- Natural conversation flow
- Real-time playback maintained

### Calculating `seconds_per_chunk`
```python
# For mono LINEAR16 PCM audio at 16kHz
sample_rate = 16000   # Hz
chunk_size = 1024     # bytes
bytes_per_sample = 2  # 16-bit = 2 bytes

samples_per_chunk = chunk_size / bytes_per_sample
seconds_per_chunk = samples_per_chunk / sample_rate
# = 1024 / 2 / 16000 = 0.032 seconds
```
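
The same arithmetic can be wrapped in a small helper so other audio formats do not need a hand calculation. This is a sketch only: the function and its `channels` parameter are additions for illustration, and it assumes uncompressed PCM where every sample has a fixed byte width.

```python
def seconds_per_chunk(chunk_size: int, sample_rate: int = 16000,
                      bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Playback duration in seconds of one chunk of raw PCM audio."""
    if min(chunk_size, sample_rate, bytes_per_sample, channels) <= 0:
        raise ValueError("all arguments must be positive")
    # One frame holds one sample per channel.
    frames_per_chunk = chunk_size / (bytes_per_sample * channels)
    return frames_per_chunk / sample_rate


small = seconds_per_chunk(1024)  # 16 kHz, 16-bit, mono
large = seconds_per_chunk(8192)  # same format, larger chunk
narrowband = seconds_per_chunk(160, sample_rate=8000, bytes_per_sample=1)
```

With the defaults, 1024-byte chunks last 0.032 s and 8192-byte chunks last 0.256 s, so the chunk size directly sets how much audio is committed to the client before an interrupt can take effect.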

---

## 4. Memory Leaks from Unclosed Streams

### Problem
Memory usage grows over time, eventually causing the application to crash.

### Symptoms
- Increasing memory usage
- Slow performance over time
- WebSocket connections not closing
- Resource exhaustion

### Root Cause
WebSocket connections, API streams, or async tasks are not properly closed when conversations end or errors occur.

### Solution
Always release resources in a `finally` block (or a context manager), so cleanup runs on both normal exit and errors:

**❌ Bad: No cleanup**
```python
async def handle_conversation(websocket):
    conversation = create_conversation()
    await conversation.start()

    async for message in websocket.iter_bytes():
        conversation.receive_audio(message)
    # No cleanup! Resources leak
```

**✅ Good: Proper cleanup**
```python
import logging

# Assumes FastAPI/Starlette, as in the endpoint example in section 6
from fastapi import WebSocketDisconnect

logger = logging.getLogger(__name__)


async def handle_conversation(websocket):
    conversation = None
    try:
        conversation = create_conversation()
        await conversation.start()

        async for message in websocket.iter_bytes():
            conversation.receive_audio(message)

    except WebSocketDisconnect:
        logger.info("Client disconnected")
    except Exception as e:
        logger.error(f"Error: {e}", exc_info=True)
    finally:
        # Always cleanup
        if conversation:
            await conversation.terminate()
```

### Proper Termination
```python
async def terminate(self):
    """Gracefully shut down all workers"""
    self.active = False

    # Stop all workers
    self.transcriber.terminate()
    self.agent.terminate()
    self.synthesizer.terminate()

    # Wait for queues to drain
    await asyncio.sleep(0.5)

    # Close connections
    if self.websocket:
        await self.websocket.close()

    # Cancel tasks
    for task in self.tasks:
        if not task.done():
            task.cancel()

    # Wait for cancelled tasks to finish so none outlive the conversation
    await asyncio.gather(*self.tasks, return_exceptions=True)
```
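
The same start/terminate pairing can also be expressed as an async context manager, so callers cannot forget the cleanup step. The sketch below uses `contextlib.asynccontextmanager` from the standard library; `DemoConversation` and `managed_conversation` are hypothetical names, and the real conversation class may expose a different interface.

```python
import asyncio
from contextlib import asynccontextmanager


class DemoConversation:
    """Hypothetical conversation with the start/terminate surface used above."""

    def __init__(self) -> None:
        self.started = False
        self.terminated = False

    async def start(self) -> None:
        self.started = True

    async def terminate(self) -> None:
        self.terminated = True


@asynccontextmanager
async def managed_conversation(factory):
    """Start a conversation and guarantee terminate() runs on exit."""
    conversation = factory()
    try:
        await conversation.start()
        yield conversation
    finally:
        # Runs on normal exit, on exceptions, and if start() itself fails.
        await conversation.terminate()


async def main() -> DemoConversation:
    try:
        async with managed_conversation(DemoConversation) as conversation:
            # Simulate the client disappearing mid-conversation.
            raise ConnectionError("client went away")
    except ConnectionError:
        pass
    return conversation


conversation = asyncio.run(main())
```

After the simulated failure, `conversation.terminated` is `True`, which is the property the `finally` block in the handler above is meant to guarantee.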

---

## 5. Conversation History Not Updating

### Problem
The agent doesn't remember previous messages or context is lost.

### Symptoms
- Agent repeats itself
- No context from previous messages
- Each response is independent
- Poor conversation quality

### Root Cause
Conversation history is not being maintained or updated correctly.

### Solution
Maintain conversation history in the agent:

```python
class Agent:
    def __init__(self):
        self.conversation_history = []

    async def generate_response(self, user_input):
        # Add user message to history
        self.conversation_history.append({
            "role": "user",
            "content": user_input
        })

        # Generate response with full history
        response = await self.llm.generate(self.conversation_history)

        # Add bot response to history
        self.conversation_history.append({
            "role": "assistant",
            "content": response
        })

        return response
```

### Handling Interrupts
When the bot is interrupted, update history with partial message:

```python
def update_last_bot_message_on_cut_off(self, partial_message):
    """Update history when bot is interrupted"""
    if self.conversation_history and \
            self.conversation_history[-1]["role"] == "assistant":
        # Update with what was actually spoken
        self.conversation_history[-1]["content"] = partial_message
```
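
Putting the two pieces together, the sketch below runs one turn and then applies a cut-off. `FixedReplyLLM` is a hypothetical stub that stands in for a real LLM client, and the `Agent` here takes the client as a constructor argument purely so the example is self-contained.

```python
import asyncio


class FixedReplyLLM:
    """Hypothetical LLM stub that always returns the same reply."""

    async def generate(self, history):
        return "Your order ships tomorrow and arrives on Friday."


class Agent:
    def __init__(self, llm):
        self.llm = llm
        self.conversation_history = []

    async def generate_response(self, user_input):
        self.conversation_history.append({"role": "user", "content": user_input})
        response = await self.llm.generate(self.conversation_history)
        self.conversation_history.append({"role": "assistant", "content": response})
        return response

    def update_last_bot_message_on_cut_off(self, partial_message):
        # Only rewrite the last entry if it is the bot's own message.
        if self.conversation_history and \
                self.conversation_history[-1]["role"] == "assistant":
            self.conversation_history[-1]["content"] = partial_message


async def main():
    agent = Agent(FixedReplyLLM())
    await agent.generate_response("When will my order arrive?")
    # The user interrupted after only the first clause was played.
    agent.update_last_bot_message_on_cut_off("Your order ships tomorrow")
    return agent.conversation_history


history = asyncio.run(main())
```

The history now records only what the user actually heard, so the next LLM call does not assume the user knows about the Friday delivery.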

---

## 6. WebSocket Connection Drops

### Problem
WebSocket connections drop unexpectedly, interrupting conversations.

### Symptoms
- Frequent disconnections
- Connection timeouts
- "Connection closed" errors
- Unstable conversations

### Root Cause
- No heartbeat/ping mechanism
- Idle timeout
- Network issues
- Server overload

### Solution
Send a periodic heartbeat from the server so idle connections are not timed out, and have the client reconnect when the connection does drop:

```python
import asyncio

# Assumes a FastAPI application
from fastapi import FastAPI, WebSocket

app = FastAPI()


@app.websocket("/conversation")
async def conversation_endpoint(websocket: WebSocket):
    await websocket.accept()

    # Start heartbeat
    async def heartbeat():
        while True:
            try:
                await websocket.send_json({"type": "ping"})
                await asyncio.sleep(30)  # Ping every 30 seconds
            except Exception:
                # Sending failed, so the connection is gone; stop pinging
                break

    heartbeat_task = asyncio.create_task(heartbeat())

    try:
        async for message in websocket.iter_bytes():
            # Process message
            pass
    finally:
        heartbeat_task.cancel()
```
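
The heartbeat covers idle timeouts; network failures still need a retry on the client side. A minimal, transport-agnostic sketch of reconnecting with exponential backoff follows. `connect_with_backoff` and its parameters are hypothetical, and `connect` stands for whatever coroutine opens the WebSocket in your client.

```python
import asyncio
import random


async def connect_with_backoff(connect, max_attempts=5, base_delay=0.5,
                               max_delay=10.0, sleep=asyncio.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return await connect()
        except OSError:  # includes ConnectionError and its subclasses
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so many clients do not retry in lockstep.
            await sleep(delay + random.uniform(0, delay / 10))


async def main():
    attempts = 0
    delays = []

    async def flaky_connect():
        # Fails twice, then succeeds.
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise ConnectionError("server unavailable")
        return "connected"

    async def record_sleep(delay):
        delays.append(delay)  # record instead of really waiting in the demo

    result = await connect_with_backoff(flaky_connect, sleep=record_sleep)
    return result, attempts, delays


result, attempts, delays = asyncio.run(main())
```

The `sleep` parameter exists only so the demo can run without waiting; production code would leave it at the default.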

---

## 7. High Latency / Slow Responses

### Problem
Long delays between user speech and bot response.

### Symptoms
- Noticeable lag
- Poor user experience
- Conversation feels unnatural
- Users repeat themselves

### Root Causes & Solutions

**1. Not using streaming**
```python
# ❌ Bad: Wait for entire response
response = await llm.complete(prompt)

# ✅ Good: Stream response
async for chunk in llm.complete(prompt, stream=True):
    yield chunk
```

**2. Sequential processing**
```python
# ❌ Bad: Sequential
transcription = await transcriber.transcribe(audio)
response = await agent.generate(transcription)
audio = await synthesizer.synthesize(response)

# ✅ Good: Concurrent with queues
# All workers run simultaneously
```
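
The "concurrent with queues" variant is only named above, so here is a minimal sketch of what it could look like with `asyncio.Queue`. The three stage functions are hypothetical stubs that sleep briefly in place of real provider calls, and the `STOP` sentinel is one of several possible shutdown conventions.

```python
import asyncio

STOP = object()  # sentinel that tells each worker to shut down


async def worker(transform, inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Pull items from inbox, transform them, and push results to outbox."""
    while True:
        item = await inbox.get()
        if item is STOP:
            await outbox.put(STOP)  # propagate shutdown downstream
            return
        await outbox.put(await transform(item))


async def transcribe(audio: bytes) -> str:
    await asyncio.sleep(0.01)  # stands in for a speech-to-text call
    return audio.decode()


async def respond(text: str) -> str:
    await asyncio.sleep(0.01)  # stands in for an LLM call
    return f"reply to {text}"


async def synthesize(text: str) -> bytes:
    await asyncio.sleep(0.01)  # stands in for a TTS call
    return text.encode()


async def main() -> list:
    audio_in, transcripts, replies, audio_out = (asyncio.Queue() for _ in range(4))
    workers = [
        asyncio.create_task(worker(transcribe, audio_in, transcripts)),
        asyncio.create_task(worker(respond, transcripts, replies)),
        asyncio.create_task(worker(synthesize, replies, audio_out)),
    ]

    for utterance in (b"hello", b"order status"):
        await audio_in.put(utterance)
    await audio_in.put(STOP)

    results = []
    while (item := await audio_out.get()) is not STOP:
        results.append(item)
    await asyncio.gather(*workers)
    return results


results = asyncio.run(main())
```

Because each stage runs as its own task, the transcriber can already work on the second utterance while the first is still being answered or synthesized.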

**3. Large chunk sizes**
```python
# ❌ Bad: Large chunks (high latency)
chunk_size = 8192  # ~0.256 seconds at 16 kHz, 16-bit mono

# ✅ Good: Small chunks (low latency)
chunk_size = 1024  # 0.032 seconds at 16 kHz, 16-bit mono
```

---

## 8. Audio Quality Issues

### Problem
Poor audio quality, distortion, or artifacts.

### Symptoms
- Robotic voice
- Crackling or popping
- Distorted audio
- Inconsistent volume

### Root Causes & Solutions

**1. Wrong audio format**
```python
# ✅ Use LINEAR16 PCM at 16kHz
audio_encoding = AudioEncoding.LINEAR16
sample_rate = 16000
```

**2. Incorrect format conversion**
```python
# ✅ Proper MP3 to PCM conversion
import io

from pydub import AudioSegment


def mp3_to_pcm(mp3_bytes):
    audio = AudioSegment.from_mp3(io.BytesIO(mp3_bytes))
    audio = audio.set_frame_rate(16000)
    audio = audio.set_channels(1)
    audio = audio.set_sample_width(2)  # 16-bit
    return audio.raw_data
```

**3. Buffer underruns**
```python
# ✅ Ensure consistent chunk timing
await asyncio.sleep(max(seconds_per_chunk - processing_time, 0))
```

---

## Summary

| Problem | Root Cause | Solution |
|---------|-----------|----------|
| Audio jumping | Multiple TTS calls | Buffer entire response |
| Echo/feedback | Transcriber active during bot speech | Mute transcriber |
| Interrupts not working | All chunks sent immediately | Rate-limit chunks |
| Memory leaks | Unclosed streams | Proper cleanup |
| Lost context | History not maintained | Update conversation history |
| Connection drops | No heartbeat | Implement ping/pong |
| High latency | Sequential processing | Use streaming + queues |
| Poor audio quality | Wrong format/conversion | Use LINEAR16 PCM 16kHz |

---

## Best Practices

1. **Always buffer LLM responses** before sending to synthesizer
2. **Always mute transcriber** when bot is speaking
3. **Always rate-limit audio chunks** to enable interrupts
4. **Always cleanup resources** in finally blocks
5. **Always maintain conversation history** for context
6. **Always use streaming** for low latency
7. **Always use LINEAR16 PCM** at 16kHz for audio
8. **Always implement error handling** in worker loops

@@ -0,0 +1,515 @@

# Provider Comparison Guide

This guide compares different providers for transcription, LLM, and TTS services to help you choose the best option for your voice AI engine.

## Transcription Providers

### Deepgram

**Strengths:**
- ✅ Fastest transcription speed (< 300ms latency)
- ✅ Excellent streaming support
- ✅ High accuracy (95%+ on clear audio)
- ✅ Good pricing ($0.0043/minute)
- ✅ Nova-2 model optimized for real-time
- ✅ Excellent documentation

**Weaknesses:**
- ❌ Less accurate with heavy accents
- ❌ Smaller company (potential reliability concerns)

**Best For:**
- Real-time voice conversations
- Low-latency applications
- English-language applications
- Startups and small businesses

**Configuration:**
```python
{
    "transcriberProvider": "deepgram",
    "deepgramApiKey": "your-api-key",
    "deepgramModel": "nova-2",
    "language": "en-US"
}
```

---

### AssemblyAI

**Strengths:**
- ✅ Very high accuracy (96%+ on clear audio)
- ✅ Excellent with accents and dialects
- ✅ Good speaker diarization
- ✅ Competitive pricing ($0.00025/second, i.e. $0.015/minute)
- ✅ Strong customer support

**Weaknesses:**
- ❌ Slightly higher latency than Deepgram
- ❌ Streaming support is newer

**Best For:**
- Applications requiring highest accuracy
- Multi-speaker scenarios
- Diverse user base with accents
- Enterprise applications

**Configuration:**
```python
{
    "transcriberProvider": "assemblyai",
    "assemblyaiApiKey": "your-api-key",
    "language": "en"
}
```

---

### Azure Speech

**Strengths:**
- ✅ Enterprise-grade reliability
- ✅ Excellent multi-language support (100+ languages)
- ✅ Strong security and compliance
- ✅ Integration with Azure ecosystem
- ✅ Custom model training available

**Weaknesses:**
- ❌ Higher cost ($1/hour, i.e. about $0.017/minute)
- ❌ More complex setup
- ❌ Slower than specialized providers

**Best For:**
- Enterprise applications
- Multi-language requirements
- Azure-based infrastructure
- Compliance-sensitive applications

**Configuration:**
```python
{
    "transcriberProvider": "azure",
    "azureSpeechKey": "your-key",
    "azureSpeechRegion": "eastus",
    "language": "en-US"
}
```

---

### Google Cloud Speech

**Strengths:**
- ✅ Excellent multi-language support (125+ languages)
- ✅ Good accuracy
- ✅ Integration with Google Cloud
- ✅ Automatic punctuation
- ✅ Speaker diarization

**Weaknesses:**
- ❌ Higher latency for streaming
- ❌ Complex pricing model
- ❌ Requires Google Cloud account

**Best For:**
- Multi-language applications
- Google Cloud infrastructure
- Applications needing speaker diarization

**Configuration:**
```python
{
    "transcriberProvider": "google",
    "googleCredentials": "path/to/credentials.json",
    "language": "en-US"
}
```

---

## LLM Providers

### OpenAI (GPT-4, GPT-3.5)

**Strengths:**
- ✅ Highest quality responses
- ✅ Excellent instruction following
- ✅ Fast streaming
- ✅ Large context window (128k for GPT-4)
- ✅ Best-in-class reasoning

**Weaknesses:**
- ❌ Higher cost ($0.01-0.03/1k tokens)
- ❌ Rate limits can be restrictive
- ❌ No free tier

**Best For:**
- High-quality conversational AI
- Complex reasoning tasks
- Production applications
- Enterprise use cases

**Configuration:**
```python
{
    "llmProvider": "openai",
    "openaiApiKey": "your-api-key",
    "openaiModel": "gpt-4-turbo",
    "prompt": "You are a helpful AI assistant."
}
```

**Pricing:**
- GPT-4 Turbo: $0.01/1k input tokens, $0.03/1k output tokens
- GPT-3.5 Turbo: $0.0005/1k input tokens, $0.0015/1k output tokens

---

### Google Gemini

**Strengths:**
- ✅ Excellent cost-effectiveness (free tier available)
- ✅ Multimodal capabilities
- ✅ Good streaming support
- ✅ Large context window (1M tokens for Pro)
- ✅ Fast response times

**Weaknesses:**
- ❌ Slightly lower quality than GPT-4
- ❌ Less predictable behavior
- ❌ Newer, less battle-tested

**Best For:**
- Cost-sensitive applications
- Multimodal applications
- Startups and prototypes
- High-volume applications

**Configuration:**
```python
{
    "llmProvider": "gemini",
    "geminiApiKey": "your-api-key",
    "geminiModel": "gemini-pro",
    "prompt": "You are a helpful AI assistant."
}
```

**Pricing:**
- Gemini Pro: Free up to 60 requests/minute
- Gemini Pro (paid): $0.00025/1k input tokens, $0.0005/1k output tokens

---

### Anthropic Claude

**Strengths:**
- ✅ Excellent safety and alignment
- ✅ Very long context window (200k tokens)
- ✅ High-quality responses
- ✅ Good at following complex instructions
- ✅ Strong reasoning capabilities

**Weaknesses:**
- ❌ Higher cost than Gemini
- ❌ Slower streaming than OpenAI
- ❌ More conservative responses

**Best For:**
- Safety-critical applications
- Long-context applications
- Nuanced conversations
- Enterprise applications

**Configuration:**
```python
{
    "llmProvider": "claude",
    "claudeApiKey": "your-api-key",
    "claudeModel": "claude-3-opus",
    "prompt": "You are a helpful AI assistant."
}
```

**Pricing:**
- Claude 3 Opus: $0.015/1k input tokens, $0.075/1k output tokens
- Claude 3 Sonnet: $0.003/1k input tokens, $0.015/1k output tokens

---

## TTS Providers

### ElevenLabs

**Strengths:**
- ✅ Most natural-sounding voices
- ✅ Excellent emotional range
- ✅ Voice cloning capabilities
- ✅ Good streaming support
- ✅ Multiple languages

**Weaknesses:**
- ❌ Higher cost ($0.30/1k characters)
- ❌ Rate limits on lower tiers
- ❌ Occasional pronunciation errors

**Best For:**
- Premium voice experiences
- Customer-facing applications
- Voice cloning needs
- High-quality audio requirements

**Configuration:**
```python
{
    "voiceProvider": "elevenlabs",
    "elevenlabsApiKey": "your-api-key",
    "elevenlabsVoiceId": "voice-id",
    "elevenlabsModel": "eleven_monolingual_v1"
}
```

**Pricing:**
- Free: 10k characters/month
- Starter: $5/month, 30k characters
- Creator: $22/month, 100k characters

---

### Azure TTS

**Strengths:**
- ✅ Enterprise-grade reliability
- ✅ Many languages (100+)
- ✅ Neural voices available
- ✅ SSML support for fine control
- ✅ Good pricing (from $4/1M characters for standard voices)

**Weaknesses:**
- ❌ Less natural than ElevenLabs
- ❌ More complex setup
- ❌ Requires Azure account

**Best For:**
- Enterprise applications
- Multi-language requirements
- Azure-based infrastructure
- Cost-sensitive high-volume applications

**Configuration:**
```python
{
    "voiceProvider": "azure",
    "azureSpeechKey": "your-key",
    "azureSpeechRegion": "eastus",
    "azureVoiceName": "en-US-JennyNeural"
}
```

**Pricing:**
- Neural voices: $16/1M characters
- Standard voices: $4/1M characters

---

### Google Cloud TTS

**Strengths:**
- ✅ Good quality neural voices
- ✅ Many languages (40+)
- ✅ WaveNet voices available
- ✅ Competitive pricing (from $4/1M characters for standard voices)
- ✅ SSML support

**Weaknesses:**
- ❌ Less natural than ElevenLabs
- ❌ Requires Google Cloud account
- ❌ Complex setup

**Best For:**
- Multi-language applications
- Google Cloud infrastructure
- Cost-effective neural voices

**Configuration:**
```python
{
    "voiceProvider": "google",
    "googleCredentials": "path/to/credentials.json",
    "googleVoiceName": "en-US-Neural2-F"
}
```

**Pricing:**
- WaveNet voices: $16/1M characters
- Neural2 voices: $16/1M characters
- Standard voices: $4/1M characters

---

### Amazon Polly

**Strengths:**
- ✅ AWS integration
- ✅ Good pricing (from $4/1M characters for standard voices)
- ✅ Neural voices available
- ✅ SSML support
- ✅ Reliable service

**Weaknesses:**
- ❌ Less natural than ElevenLabs
- ❌ Fewer voice options
- ❌ Requires AWS account

**Best For:**
- AWS-based infrastructure
- Cost-effective neural voices
- Enterprise applications

**Configuration:**
```python
{
    "voiceProvider": "polly",
    "awsAccessKey": "your-access-key",
    "awsSecretKey": "your-secret-key",
    "awsRegion": "us-east-1",
    "pollyVoiceId": "Joanna"
}
```

**Pricing:**
- Neural voices: $16/1M characters
- Standard voices: $4/1M characters

---

### Play.ht

**Strengths:**
- ✅ Voice cloning capabilities
- ✅ Natural-sounding voices
- ✅ Good streaming support
- ✅ Easy to use API
- ✅ Multiple languages

**Weaknesses:**
- ❌ Higher cost than cloud providers
- ❌ Smaller company
- ❌ Less documentation

**Best For:**
- Voice cloning applications
- Premium voice experiences
- Startups and small businesses

**Configuration:**
```python
{
    "voiceProvider": "playht",
    "playhtApiKey": "your-api-key",
    "playhtUserId": "your-user-id",
    "playhtVoiceId": "voice-id"
}
```

**Pricing:**
- Free: 2.5k characters
- Creator: $31/month, 50k characters
- Pro: $79/month, 150k characters

---

## Recommended Combinations

### Budget-Conscious Startup
```python
{
    "transcriberProvider": "deepgram",  # Fast and affordable
    "llmProvider": "gemini",            # Free tier available
    "voiceProvider": "google"           # Cost-effective neural voices
}
```
**Estimated cost:** ~$0.01 per minute of conversation

---

### Premium Experience
```python
{
    "transcriberProvider": "assemblyai",  # Highest accuracy
    "llmProvider": "openai",              # Best quality responses
    "voiceProvider": "elevenlabs"         # Most natural voices
}
```
**Estimated cost:** ~$0.05 per minute of conversation

---

### Enterprise Application
```python
{
    "transcriberProvider": "azure",  # Enterprise reliability
    "llmProvider": "openai",         # Best quality
    "voiceProvider": "azure"         # Enterprise reliability
}
```
**Estimated cost:** ~$0.03 per minute of conversation

---

### Multi-Language Application
```python
{
    "transcriberProvider": "google",  # 125+ languages
    "llmProvider": "gemini",          # Good multi-language support
    "voiceProvider": "google"         # 40+ languages
}
```
**Estimated cost:** ~$0.02 per minute of conversation

---

## Decision Matrix

| Priority | Transcriber | LLM | TTS |
|----------|-------------|-----|-----|
| **Lowest Cost** | Deepgram | Gemini | Google |
| **Highest Quality** | AssemblyAI | OpenAI | ElevenLabs |
| **Fastest Speed** | Deepgram | OpenAI | ElevenLabs |
| **Enterprise** | Azure | OpenAI | Azure |
| **Multi-Language** | Google | Gemini | Google |
| **Voice Cloning** | N/A | N/A | ElevenLabs/Play.ht |

---

## Testing Recommendations

Before committing to providers, test with your specific use case:

1. **Create test conversations** with representative audio
2. **Measure latency** end-to-end
3. **Evaluate quality** with real users
4. **Calculate costs** based on expected volume
5. **Test edge cases** (accents, background noise, interrupts)
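
For the cost step, a rough per-minute estimate can be computed from the unit prices quoted in this guide. The sketch below is illustrative only: the usage figures (tokens and synthesized characters per conversation minute) are assumptions that should be replaced with measurements from your own test conversations, and the function and class names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsagePerMinute:
    """Assumed average usage for one minute of conversation."""
    input_tokens: int
    output_tokens: int
    tts_characters: int


def cost_per_minute(usage: UsagePerMinute, *, stt_per_minute: float,
                    llm_input_per_1k: float, llm_output_per_1k: float,
                    tts_per_million_chars: float) -> float:
    """Sum transcription, LLM, and TTS cost for one minute of conversation."""
    llm = (usage.input_tokens / 1000 * llm_input_per_1k
           + usage.output_tokens / 1000 * llm_output_per_1k)
    tts = usage.tts_characters / 1_000_000 * tts_per_million_chars
    return stt_per_minute + llm + tts


# Hypothetical usage; measure your own traffic before relying on these numbers.
usage = UsagePerMinute(input_tokens=1000, output_tokens=200, tts_characters=750)

# Budget stack, using prices quoted above:
# Deepgram, Gemini Pro (paid), Google standard voices.
budget = cost_per_minute(
    usage,
    stt_per_minute=0.0043,
    llm_input_per_1k=0.00025,
    llm_output_per_1k=0.0005,
    tts_per_million_chars=4.0,
)
```

Under these assumed usage figures the budget stack comes to $0.00765 per minute, in line with the ~$0.01 estimate given for that combination. Input tokens tend to grow as conversation history accumulates, so longer conversations may cost more per minute than short tests suggest.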

---

## Switching Providers

The multi-provider factory pattern makes switching easy:

```python
# Just change the configuration
config = {
    "transcriberProvider": "deepgram",  # Change to "assemblyai"
    "llmProvider": "gemini",            # Change to "openai"
    "voiceProvider": "google"           # Change to "elevenlabs"
}

# No code changes needed!
factory = VoiceComponentFactory()
transcriber = factory.create_transcriber(config)
agent = factory.create_agent(config)
synthesizer = factory.create_synthesizer(config)
```
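
How `VoiceComponentFactory` is implemented is not shown here. One plausible shape is a registry that maps provider names to builder callables, sketched below for transcribers only; agents and synthesizers would follow the same pattern. The `register_transcriber` method and both transcriber classes are hypothetical and exist only for this illustration.

```python
class DeepgramTranscriber:
    """Hypothetical provider wrapper."""

    def __init__(self, config: dict) -> None:
        self.model = config.get("deepgramModel", "nova-2")


class AssemblyAITranscriber:
    """Hypothetical provider wrapper."""

    def __init__(self, config: dict) -> None:
        self.language = config.get("language", "en")


class VoiceComponentFactory:
    """Sketch of a registry-based factory; the real class may differ."""

    def __init__(self) -> None:
        self._transcribers = {}

    def register_transcriber(self, name: str, builder) -> None:
        self._transcribers[name] = builder

    def create_transcriber(self, config: dict):
        name = config["transcriberProvider"]
        try:
            builder = self._transcribers[name]
        except KeyError:
            raise ValueError(f"Unknown transcriber provider: {name!r}") from None
        return builder(config)


factory = VoiceComponentFactory()
factory.register_transcriber("deepgram", DeepgramTranscriber)
factory.register_transcriber("assemblyai", AssemblyAITranscriber)

first = factory.create_transcriber({"transcriberProvider": "deepgram"})
second = factory.create_transcriber({"transcriberProvider": "assemblyai"})
```

Failing fast with a `ValueError` on an unknown provider name surfaces configuration typos at startup rather than in the middle of a call.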