Today we're expanding Mastra Voice with real-time speech-to-speech capabilities for Agents. We also made a few other updates supporting this feature.
What's New
- Support for speech-to-speech providers, with the OpenAI Realtime API as our first.
- We now support WebSocket connections to establish a persistent connection to voice providers (like OpenAI) instead of separate HTTP requests. This enables bidirectional audio streaming without the request-wait-response pattern of traditional HTTP.
- We've introduced an event-driven architecture for voice interactions. Instead of `const text = await agent.voice.listen(audio)`, you now use `agent.voice.on('writing', ({ text }) => { ... })`, creating a more responsive experience without managing any WebSocket complexity.
Here's an example of setting up a real-time voice agent that can participate in continuous conversations:
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
import { OpenAIRealtimeVoice } from "@mastra/voice-openai-realtime";

const agent = new Agent({
  name: 'Agent',
  instructions: `You are a helpful assistant with real-time voice capabilities.`,
  model: openai('gpt-4o'),
  voice: new OpenAIRealtimeVoice(),
});

// Connect to the voice service
await agent.voice.connect();

// Listen for agent audio responses
agent.voice.on('speaker', (stream) => {
  // 'speaker' can be any audio output implementation that accepts streams, for instance node-speaker
  stream.pipe(speaker);
});

// Initiate the conversation; the response audio arrives via the 'speaker' event
await agent.voice.speak('How can I help you today?');

// Send continuous audio from the microphone ('mic' is any microphone source, e.g. node-mic)
const micStream = mic.getAudioStream();
await agent.voice.send(micStream);
Why Speech-to-Speech Matters
Traditional voice systems require multiple steps: converting speech to text, processing that text, and then converting text back to speech. Each transition introduces latency and can lose important aspects of communication like tone, emphasis, and natural pauses.
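To make the contrast concrete, here's a rough sketch of that multi-step flow using Mastra's standard voice API (a sketch only; it assumes an agent configured with speech-to-text and text-to-speech providers, a microphone `audioStream`, and a `speaker` output):

// Traditional pipeline: each stage must finish before the next begins
const transcript = await agent.voice.listen(audioStream);  // speech → text
const response = await agent.generate(transcript);         // text → text (LLM)
const replyAudio = await agent.voice.speak(response.text); // text → speech
replyAudio.pipe(speaker); // only now does the user hear anything

Every stage adds latency, and the transcription step discards tone and pacing before the model ever sees the input.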
Speech-to-speech streamlines this process by maintaining audio as the primary medium throughout the interaction. This approach reduces latency and preserves more of the expressiveness in communication, resulting in more natural-feeling conversations.
For applications like language learning, customer service, or health coaching, this improved experience can make a significant difference in user engagement and effectiveness.
An Event-Based Approach
For our speech-to-speech implementation, we've introduced an event-driven approach alongside our existing voice API. While our standard voice methods continue to work as before for text-to-speech and speech-to-text, speech-to-speech providers use events for a more streaming-oriented experience:
// Standard voice API (still fully supported)
const text = await agent.voice.listen(audioStream); // Returns transcribed text
const audio = await agent.voice.speak("Hello");     // Returns an audio stream

// New event-based approach for speech-to-speech
// Set up event listeners
agent.voice.on('speaker', (stream) => {
  stream.pipe(speaker);
});

agent.voice.on('writing', ({ text, role }) => {
  if (role === 'user') {
    process.stdout.write(chalk.green(text));
  } else {
    process.stdout.write(chalk.blue(text));
  }
});

agent.voice.on('error', (error) => {
  console.error('Voice error:', error);
});

await agent.voice.send(microphoneStream); // Sends audio to the provider

// Trigger events
await agent.voice.speak("Hello");      // Triggers 'speaking' events
await agent.voice.listen(audioStream); // Triggers 'writing' events
This event-based architecture offers two key advantages:
- Continuous streaming - Audio can be processed in chunks as it becomes available
- Real-time feedback - Speech is recognized and processed as it happens
The event system is particularly powerful for web and mobile applications where you need to update the interface in real-time or handle audio playback as it's being generated.
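For instance, a browser client might append partial transcripts to the page as 'writing' events arrive. A minimal sketch, assuming the same event payload shown above and a page element with id "transcript":

// Hypothetical browser-side handler: render text chunks as they stream in
const transcriptEl = document.getElementById('transcript')!;

agent.voice.on('writing', ({ text, role }) => {
  const span = document.createElement('span');
  span.className = role === 'user' ? 'user-text' : 'agent-text';
  span.textContent = text;
  transcriptEl.appendChild(span); // the UI updates per chunk, not after the full response
});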
Adding Tools to Speech-to-Speech Conversations
While each speech-to-speech provider implements tools in their own way, Mastra abstracts these differences away with a consistent interface. Any tools configured on your agent are automatically available to your voice provider:
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
import { OpenAIRealtimeVoice } from "@mastra/voice-openai-realtime";
import { search, getWeather, fetchHoroscope } from "./tools";

const agent = new Agent({
  name: 'Agent',
  instructions: `You are a helpful assistant with speech-to-speech capabilities.`,
  model: openai('gpt-4o'),
  tools: {
    search,
    getWeather,
    fetchHoroscope
  },
  voice: new OpenAIRealtimeVoice()
});

// Tools are automatically available to the voice system
await agent.voice.connect();
await agent.voice.speak("How can I help you today?");
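The tools themselves are ordinary Mastra tools. As a sketch, the `getWeather` tool imported above might look like this (the weather lookup is a hypothetical placeholder):

// tools.ts — a sketch of one tool definition
import { createTool } from "@mastra/core/tools";
import { z } from "zod";

export const getWeather = createTool({
  id: "get-weather",
  description: "Get the current weather for a given city",
  inputSchema: z.object({
    city: z.string().describe("The city to look up"),
  }),
  execute: async ({ context }) => {
    // Replace with a real weather API call
    return { city: context.city, forecast: "Sunny, 22°C" };
  },
});

When the model calls a tool mid-conversation, the provider runs it and the result flows back into the conversation, with no extra wiring on your part.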
Speech-to-Speech in Action
We're starting with OpenAI's Realtime API as our first speech-to-speech provider in Mastra, with plans to expand our provider options as they become available.
Here's an example of how you might build a voice agent that handles microphone input and audio playback:
import Speaker from "@mastra/node-speaker";
import NodeMic from "node-mic";
import chalk from "chalk";
import { mastra } from "./mastra";

const agent = mastra.getAgent("dane");
if (!agent.voice) {
  throw new Error("Agent does not have voice capabilities");
}

let speaker: Speaker | undefined;
const makeSpeaker = () =>
  new Speaker({
    sampleRate: 24100, // Audio sample rate in Hz
    channels: 1,       // Mono audio output (stereo would be 2)
    bitDepth: 16,      // 16-bit resolution, the CD-quality standard
  });

const mic = new NodeMic({
  rate: 24100, // Matches the speaker sample rate for consistent audio processing
});

// Print partial transcripts as they stream in
agent.voice.on("writing", (ev) => {
  if (ev.role === "user") {
    process.stdout.write(chalk.green(ev.text));
  } else {
    process.stdout.write(chalk.blue(ev.text));
  }
});

// Play agent audio, pausing the mic so the agent doesn't hear itself
agent.voice.on("speaker", (stream) => {
  if (speaker) {
    speaker.removeAllListeners();
    speaker.close(true);
  }
  mic.pause();
  speaker = makeSpeaker();
  stream.pipe(speaker);
  speaker.on("close", () => {
    console.log("Speaker finished, resuming mic");
    mic.resume();
  });
});

// Errors from the voice provider
agent.voice.on("error", (error) => {
  console.error("Voice error:", error);
});

await agent.voice.connect();

mic.start();
const microphoneStream = mic.getAudioStream();
await agent.voice.send(microphoneStream);

await agent.voice.speak("Hello, how can I help you today?");
This example demonstrates a command-line voice assistant that listens to microphone input, processes it through a speech-to-speech agent, and plays the agent's responses.
You can find a complete demo in our GitHub repository and try it out for yourself.
We're actively developing Mastra! If you encounter any issues or have suggestions for improvements, please open an issue on our GitHub repository or contribute directly with a pull request.
Get started with speech-to-speech for Mastra agents today by installing the latest version of our packages:
npm install @mastra/core @mastra/voice-openai-realtime
The full documentation, with additional examples and configuration options, is available in the Mastra docs.