OpenAI Realtime Voice
The OpenAIRealtimeVoice class provides real-time voice interaction capabilities using OpenAI's WebSocket-based API. It supports real-time speech-to-speech, voice activity detection, and event-based audio streaming.
Usage Example
```typescript
import { OpenAIRealtimeVoice } from "@mastra/voice-openai-realtime";

// Initialize with default configuration using environment variables
const voice = new OpenAIRealtimeVoice();

// Or initialize with specific configuration
const voiceWithConfig = new OpenAIRealtimeVoice({
  chatModel: {
    apiKey: 'your-openai-api-key',
    model: 'gpt-4o-mini-realtime-preview-2024-12-17',
    options: {
      sessionConfig: {
        turn_detection: {
          type: 'server_vad',
          threshold: 0.6,
          silence_duration_ms: 1200
        }
      }
    }
  },
  speaker: 'alloy' // Default voice
});

// Establish connection
await voice.connect();

// Set up event listeners
voice.on('speaking', ({ audio }) => {
  // Handle audio data (Int16Array, PCM format by default)
  playAudio(audio);
});

voice.on('writing', ({ text, role }) => {
  // Handle transcribed text
  console.log(`${role}: ${text}`);
});

// Convert text to speech
await voice.speak('Hello, how can I help you today?', {
  speaker: 'echo' // Override default voice
});

// Process audio input
const microphoneStream = getMicrophoneStream();
await voice.send(microphoneStream);

// When done, disconnect
voice.close();
```
Configuration
Constructor Options
chatModel?: Configuration for the OpenAI realtime model (see below).
speaker?: Default voice ID used for speech synthesis (e.g. 'alloy').
chatModel
model?: The realtime model ID to use (e.g. 'gpt-4o-mini-realtime-preview-2024-12-17').
apiKey?: OpenAI API key. Falls back to the OPENAI_API_KEY environment variable.
tools?: Tools to make available to the model.
options?: Additional model options (see below).
options
sessionConfig?: Realtime session configuration, including voice and turn detection settings.
url?: Custom WebSocket endpoint URL.
dangerouslyAllowAPIKeyInBrowser?: Allows the API key to be used in a browser environment. Not recommended for production use.
debug?: Enables debug logging.
Voice Activity Detection (VAD) Configuration
type?: Turn detection mode; 'server_vad' enables server-side voice activity detection.
threshold?: Speech detection sensitivity, from 0.0 to 1.0. Higher values require louder audio to trigger.
prefix_padding_ms?: Milliseconds of audio to include from before detected speech begins.
silence_duration_ms?: Milliseconds of silence required before a turn is considered complete.
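As a sketch, a configuration tuned for slower, more deliberate speakers might look like this; the specific values are illustrative:

```typescript
import { OpenAIRealtimeVoice } from "@mastra/voice-openai-realtime";

const patientVoice = new OpenAIRealtimeVoice({
  chatModel: {
    options: {
      sessionConfig: {
        turn_detection: {
          type: 'server_vad',
          threshold: 0.7, // less sensitive: ignores quieter background noise
          prefix_padding_ms: 300, // keep 300ms of audio from before speech started
          silence_duration_ms: 1500 // wait 1.5s of silence before ending the turn
        }
      }
    }
  }
});
```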
Methods
connect()
Establishes a connection to the OpenAI realtime service. Must be called before using speak(), listen(), or send().
Returns: Promise<void>
speak()
Emits a speaking event using the configured voice model. Can accept either a string or a readable stream as input.
input: The text to speak, as a string or a readable stream.
options.speaker?: Voice ID to use for this request, overriding the default speaker.
Returns: Promise<void>
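For example (the readable stream form assumes a Node.js environment):

```typescript
import { Readable } from "stream";

// Plain string input, overriding the default voice for this call
await voice.speak('Your order has shipped.', { speaker: 'shimmer' });

// Readable stream input, e.g. text produced incrementally elsewhere
await voice.speak(Readable.from(['Thanks for ', 'calling. ', 'Goodbye!']));
```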
listen()
Processes audio input for speech recognition. Takes a readable stream of audio data and emits a 'writing' event with the transcribed text.
audioData: A readable stream of audio to transcribe.
Returns: Promise<void>
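A minimal sketch, reusing the hypothetical getMicrophoneStream helper from the usage example; the transcription arrives through an event rather than a return value:

```typescript
voice.on('writing', ({ text, role }) => {
  console.log(`${role}: ${text}`); // transcribed text arrives here
});

const audioStream = getMicrophoneStream(); // hypothetical helper, as above
await voice.listen(audioStream);
```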
send()
Streams audio data in real-time to the OpenAI service for continuous audio streaming scenarios like live microphone input.
audioData: A readable stream of audio data to forward to the service.
Returns: Promise<void>
updateConfig()
Updates the session configuration for the voice instance. This can be used to modify voice settings, turn detection, and other parameters.
sessionConfig: The new session configuration to apply.
Returns: void
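For example, switching the session's voice and relaxing turn detection mid-conversation (this assumes the session accepts the same fields as sessionConfig above):

```typescript
voice.updateConfig({
  voice: 'sage', // assumes the session config accepts a voice field
  turn_detection: {
    type: 'server_vad',
    threshold: 0.5,
    silence_duration_ms: 800
  }
});
```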
addTools()
Adds a set of tools to the voice instance. Tools allow the model to perform additional actions during conversations. When OpenAIRealtimeVoice is added to an Agent, any tools configured for the Agent will automatically be available to the voice interface.
tools?: The set of tools to make available to the model.
Returns: void
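A sketch using Mastra's createTool helper; the weather tool and its lookup are hypothetical:

```typescript
import { createTool } from "@mastra/core/tools";
import { z } from "zod";

// Hypothetical tool for illustration
const weatherTool = createTool({
  id: "getWeather",
  description: "Fetch the current weather for a city",
  inputSchema: z.object({ city: z.string() }),
  execute: async ({ context }) => {
    // Replace with a real weather lookup
    return { forecast: `Sunny in ${context.city}` };
  },
});

voice.addTools({ weatherTool });
```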
close()
Disconnects from the OpenAI realtime session and cleans up resources. Should be called when you’re done with the voice instance.
Returns: void
getSpeakers()
Returns a list of available voice speakers.
Returns: Promise<Array<{ voiceId: string; [key: string]: any }>>
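For example, to list the available voice IDs:

```typescript
const speakers = await voice.getSpeakers();
for (const { voiceId } of speakers) {
  console.log(voiceId); // e.g. 'alloy', 'ash', 'ballad', ...
}
```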
on()
Registers an event listener for voice events.
event: Name of the event to listen for.
callback: Function invoked when the event is emitted.
Returns: void
off()
Removes a previously registered event listener.
event: Name of the event to stop listening to.
callback: The callback function that was previously registered.
Returns: void
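Because off() needs the same function reference that was passed to on(), keep a handle to the callback:

```typescript
const onWriting = ({ text, role }: { text: string; role: string }) => {
  console.log(`${role}: ${text}`);
};

voice.on('writing', onWriting);

// ...later, stop receiving transcription events
voice.off('writing', onWriting);
```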
Events
The OpenAIRealtimeVoice class emits the following events:
speaking: Emitted when the model produces audio. The payload contains the audio data as an Int16Array.
writing: Emitted when transcribed text is available. The payload contains the text and the speaker role.
error: Emitted when an error occurs.
OpenAI Realtime Events
You can also listen to OpenAI Realtime utility events by prefixing with ‘openAIRealtime:’:
openAIRealtime:conversation.created: A new conversation has been created.
openAIRealtime:conversation.interrupted: The current conversation was interrupted.
openAIRealtime:conversation.updated: The conversation state has been updated.
openAIRealtime:conversation.item.appended: An item was appended to the conversation.
openAIRealtime:conversation.item.completed: A conversation item has completed.
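For example (the event payload shape is assumed here):

```typescript
voice.on('openAIRealtime:conversation.interrupted', (event) => {
  // Assumed payload: the raw event object from the Realtime API
  console.log('Conversation interrupted', event);
});
```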
Available Voices
The following voice options are available:
- alloy: Neutral and balanced
- ash: Clear and precise
- ballad: Melodic and smooth
- coral: Warm and friendly
- echo: Resonant and deep
- sage: Calm and thoughtful
- shimmer: Bright and energetic
- verse: Versatile and expressive
Notes
- API keys can be provided via constructor options or the OPENAI_API_KEY environment variable
- The OpenAI Realtime Voice API uses WebSockets for real-time communication
- Server-side Voice Activity Detection (VAD) provides better accuracy for speech detection
- All audio data is processed as Int16Array format
- The voice instance must be connected with connect() before using other methods
- Always call close() when done to properly clean up resources
- Memory management is handled by the OpenAI Realtime API