Voice-to-Voice Capabilities in Mastra
Introduction
Voice-to-Voice in Mastra provides a standardized interface for real-time speech-to-speech interactions across multiple service providers. This section covers configuration, event-driven architecture, and implementation methods for creating conversational voice experiences. For integrating Voice-to-Voice capabilities with agents, refer to the Adding Voice to Agents documentation.
Real-time Voice Interactions
Mastra’s real-time voice system enables continuous bidirectional audio communication through an event-driven architecture. Unlike separate TTS and STT operations, real-time voice maintains an open connection that processes speech continuously in both directions.
Example Implementation
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
import { OpenAIRealtimeVoice } from "@mastra/voice-openai-realtime";
import { playAudio, getMicrophoneStream } from "@mastra/node-audio";

const agent = new Agent({
  name: 'Agent',
  instructions: `You are a helpful assistant with real-time voice capabilities.`,
  model: openai('gpt-4o'),
  voice: new OpenAIRealtimeVoice(),
});

// Connect to the voice service
await agent.voice.connect();

// Listen for agent audio responses
agent.voice.on('speaking', ({ audio }) => {
  playAudio(audio);
});

// Initiate the conversation
await agent.voice.speak('How can I help you today?');

// Send continuous audio from the microphone
const micStream = getMicrophoneStream();
await agent.voice.send(micStream);
Event-Driven Architecture
Mastra’s voice-to-voice implementation is built on an event-driven architecture. Developers register event listeners to handle incoming audio progressively, allowing for more responsive interactions than waiting for complete audio responses.
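For example, listeners can be attached as soon as the provider is created, so audio and transcripts are handled the moment they arrive; a minimal sketch, reusing the playAudio helper from the example above with a standalone voice instance like the one configured below:
voice.on('speaking', ({ audio }) => {
  playAudio(audio); // play each audio chunk as it is generated
});

voice.on('writing', ({ text, role }) => {
  console.log(`${role}: ${text}`); // surface partial transcripts immediately
});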
Configuration
When initializing a voice-to-voice provider, you can provide configuration options to customize its behavior:
Constructor Options
- chatModel: Configuration for the OpenAI realtime model.
  - apiKey: Your OpenAI API key. Falls back to the OPENAI_API_KEY environment variable.
  - model: The model ID to use for real-time voice interactions (e.g., gpt-4o-mini-realtime).
  - options: Additional options for the realtime client, such as session configuration.
- speaker: The default voice ID for speech synthesis. This allows you to specify which voice to use for the speech output.
Example Configuration
const voice = new OpenAIRealtimeVoice({
  chatModel: {
    apiKey: 'your-openai-api-key',
    model: 'gpt-4o-mini-realtime',
    options: {
      sessionConfig: {
        turn_detection: {
          type: 'server_vad',
          threshold: 0.6,
          silence_duration_ms: 1200,
        },
      },
    },
  },
  speaker: 'alloy', // Default voice
});
// With default settings, the configuration can be simplified to:
const voice = new OpenAIRealtimeVoice();
Core Methods
The OpenAIRealtimeVoice class provides the following core methods for voice interactions:
connect()
Establishes a connection to the OpenAI realtime service.
Usage:
await voice.connect();
Notes:
- Must be called before using any other interaction methods
- Returns a Promise that resolves when the connection is established
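A hedged sketch of typical startup, wrapping connect() so connection failures surface clearly:
const voice = new OpenAIRealtimeVoice();

try {
  // No other interaction methods may be called until this resolves
  await voice.connect();
} catch (err) {
  console.error('Failed to connect to the realtime service:', err);
}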
speak(text, options?)
Emits a speaking event using the configured voice model.
Parameters:
- text: String content to be spoken
- options: Optional configuration object
  - speaker: Voice ID to use (overrides default)
  - properties: Additional provider-specific properties
Usage:
voice.speak('Hello, how can I help you today?', {
  speaker: 'alloy'
});
Notes:
- Emits ‘speaking’ events rather than returning an audio stream
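Since the audio is delivered through events rather than a return value, register a 'speaking' listener before calling speak(); a minimal sketch, reusing the playAudio helper from the earlier example:
voice.on('speaking', ({ audio }) => {
  playAudio(audio); // audio arrives here, not as a return value
});

voice.speak('Hello, how can I help you today?', { speaker: 'alloy' });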
listen(audioInput, options?)
Processes audio input for speech recognition.
Parameters:
- audioInput: Readable stream of audio data
- options: Optional configuration object
  - filetype: Audio format (default: ‘mp3’)
  - Additional provider-specific options
Usage:
const audioData = getMicrophoneStream();
voice.listen(audioData, {
  filetype: 'wav'
});
Notes:
- Emits ‘writing’ events with transcribed text
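Transcriptions likewise arrive through events, so pair listen() with a 'writing' listener; a sketch, with getMicrophoneStream() standing in for any readable audio source:
voice.on('writing', ({ text, role }) => {
  console.log(`${role}: ${text}`); // transcribed text is emitted here
});

const audioData = getMicrophoneStream();
voice.listen(audioData, { filetype: 'wav' });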
send(audioStream)
Streams audio data in real-time for continuous processing.
Parameters:
- audioStream: Readable stream of audio data
Usage:
const micStream = getMicrophoneStream();
await voice.send(micStream);
Notes:
- Used for continuous audio streaming scenarios like live microphone input
- Returns a Promise that resolves when the stream is accepted
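If a live microphone is not available, any Node.js Readable stream of audio can stand in; a sketch, assuming input.wav exists locally in a format the session accepts:
import { createReadStream } from "node:fs";

// A file stream as a stand-in for live microphone input
const audioStream = createReadStream("input.wav");
await voice.send(audioStream);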
answer(params)
Sends a response to the OpenAI Realtime API.
Parameters:
- params: The parameters object
  - options: Configuration options for the response
    - content: Text content of the response
    - voice: Voice ID to use for the response
Usage:
await voice.answer({
  options: {
    content: "Hello, how can I help you today?",
    voice: "alloy"
  }
});
Notes:
- Triggers a response to the real-time session
- Returns a Promise that resolves when the response has been sent
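A common flow is to stream the user's audio with send() and then trigger the session's spoken reply with answer(); a hedged sketch combining the two:
// Stream user audio, then trigger a spoken response in the session
const micStream = getMicrophoneStream();
await voice.send(micStream);

await voice.answer({
  options: {
    content: "Thanks for your question! Let me look into that.",
    voice: "alloy"
  }
});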
Utility Methods
updateConfig(config)
Updates the session configuration for the voice instance.
Parameters:
- config: New session configuration object
Usage:
voice.updateConfig({
  turn_detection: {
    type: 'server_vad',
    threshold: 0.6,
    silence_duration_ms: 1200,
  }
});
addTools(tools)
Adds a set of tools to the voice instance.
Parameters:
- tools: Array of tool objects that the model can call
Usage:
import { createTool } from "@mastra/core/tools";
import { z } from "zod";

voice.addTools([
  createTool({
    id: "Get Weather Information",
    inputSchema: z.object({
      city: z.string(),
    }),
    description: `Fetches the current weather information for a given city`,
    execute: async ({ city }) => { /* ... */ },
  }),
]);
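For a fuller picture, here is a hedged, self-contained sketch in which the execute body is filled in; fetchWeather is a hypothetical stand-in for your own data source:
import { createTool } from "@mastra/core/tools";
import { z } from "zod";

// Hypothetical helper; replace with a real weather API call
async function fetchWeather(city: string): Promise<string> {
  return `Sunny and 22°C in ${city}`;
}

const weatherTool = createTool({
  id: "Get Weather Information",
  inputSchema: z.object({ city: z.string() }),
  description: "Fetches the current weather information for a given city",
  execute: async ({ city }) => {
    const report = await fetchWeather(city);
    return { city, report };
  },
});

voice.addTools([weatherTool]);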
close()
Disconnects from the OpenAI realtime session and cleans up resources.
Usage:
voice.close();
Notes:
- Should be called when you’re done with the voice instance to free resources
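A hedged pattern is to pair connect() with close() in a try/finally so the realtime session is released even if an error occurs mid-conversation:
await voice.connect();
try {
  await voice.speak('Hello!');
  // ... rest of the conversation
} finally {
  voice.close(); // always release the session and its resources
}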
on(event, callback)
Registers an event listener for voice events.
Parameters:
- event: Event name (‘speaking’, ‘writing’, or ‘error’)
- callback: Function to call when the event occurs
Usage:
voice.on('speaking', ({ audio }) => {
  playAudio(audio);
});
off(event, callback)
Removes a previously registered event listener.
Parameters:
- event: Event name
- callback: The callback function to remove
Usage:
voice.off('speaking', callbackFunction);
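Because off() needs the same function reference that was passed to on(), keep a named handler for any listener you plan to remove:
const onSpeaking = ({ audio }) => playAudio(audio);

voice.on('speaking', onSpeaking);
// ...later, detach the listener to stop handling audio
voice.off('speaking', onSpeaking);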
Events
The OpenAIRealtimeVoice class emits the following events:
speaking
Emitted when audio data is received from the model.
Event payload:
- audio: A chunk of audio data as a readable stream
agent.voice.on('speaking', ({ audio }) => {
  playAudio(audio); // Handle audio chunks as they're generated
});
writing
Emitted when transcribed text is available.
Event payload:
- text: The transcribed text
- role: The role of the speaker (user or assistant)
- done: Boolean indicating if this is the final transcription
agent.voice.on('writing', ({ text, role }) => {
  console.log(`${role}: ${text}`); // Log who said what
});
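The done flag lets you distinguish interim results from the final transcription; a minimal sketch, assuming each ‘writing’ event carries an incremental text delta:
let transcript = '';

agent.voice.on('writing', ({ text, done }) => {
  transcript += text;
  if (done) {
    console.log('Final transcript:', transcript);
    transcript = ''; // reset for the next utterance
  }
});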
error
Emitted when an error occurs.
Event payload:
- Error object with details about what went wrong
agent.voice.on('error', (error) => {
  console.error('Voice error:', error);
});