Today we're expanding Mastra Voice with real-time speech-to-speech capabilities for Agents. We also made a few other updates supporting this feature.
What's New
- Support for speech-to-speech providers, with the OpenAI Realtime API as our first.
- We now support WebSocket connections, which establish a persistent connection to voice providers (like OpenAI) instead of making separate HTTP requests. This enables bidirectional audio streaming without the request-wait-response pattern of traditional HTTP; see the lifecycle sketch after this list.
- We've introduced an event-driven architecture for voice interactions. Instead of `const text = await agent.voice.listen(audio)`, you now use `agent.voice.on('writing', ({ text }) => { ... })`, creating a more responsive experience without managing any WebSocket complexity.
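As a minimal sketch of that connection lifecycle (assuming an `agent` configured with a real-time voice provider, as in the examples below):

```typescript
// One persistent WebSocket session per conversation: connect once,
// stream audio in both directions, then close when you're done.
await agent.voice.connect(); // opens the WebSocket to the provider

// ...stream microphone audio with send() and handle 'speaker'/'writing'
// events here -- no per-turn HTTP round trips...

agent.voice.close(); // ends the session and tears down the connection
```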
Here's an example of setting up a real-time voice agent that can participate in continuous conversations:
```typescript
import { openai } from "@ai-sdk/openai";
import { Agent } from "@mastra/core/agent";
import { OpenAIRealtimeVoice } from "@mastra/voice-openai-realtime";

const agent = new Agent({
  name: 'Agent',
  instructions: `You are a helpful assistant with real-time voice capabilities.`,
  model: openai('gpt-4o'),
  voice: new OpenAIRealtimeVoice(),
});

// Connect to the voice service
await agent.voice.connect();

// Listen for agent audio responses
agent.voice.on('speaker', (stream) => {
  // 'speaker' can be any audio output implementation that accepts streams, e.g. node-speaker
  stream.pipe(speaker);
});

// Initiate the conversation by emitting a 'speaking' event
await agent.voice.speak('How can I help you today?');

// Send continuous audio from the microphone ('mic' is your microphone input, e.g. node-mic)
const micStream = mic.getAudioStream();
await agent.voice.send(micStream);
```
Why speech-to-speech matters
Traditional voice systems require multiple steps: converting speech to text, processing that text, and then converting text back to speech. Each transition introduces latency and can lose important aspects of communication like tone, emphasis, and natural pauses.
Speech-to-speech streamlines this process by maintaining audio as the primary medium throughout the interaction. This approach reduces latency and preserves more of the expressiveness in communication, resulting in more natural-feeling conversations.
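To make the contrast concrete, here's a minimal sketch of that cascaded pipeline using Mastra's standard voice methods; each `await` is a full round trip before the user hears anything (`micStream` and `speaker` are assumed stand-ins for real audio I/O):

```typescript
// The traditional cascaded pipeline, sketched with the standard voice API.
// Latency accumulates at each hop, and vocal nuance is lost at step 1.
const transcript = await agent.voice.listen(micStream); // 1. speech -> text
const response = await agent.generate(transcript);      // 2. text -> text
const audio = await agent.voice.speak(response.text);   // 3. text -> speech
audio.pipe(speaker);                                    // user hears the reply only now
```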
For applications like language learning, customer service, or health coaching, this improved experience can make a significant difference in user engagement and effectiveness.
An Event-Based Approach
For our speech-to-speech implementation, we've introduced an event-driven approach alongside our existing voice API. While our standard voice methods continue to work as before for text-to-speech and speech-to-text, speech-to-speech providers use events for a more streaming-oriented experience:
```typescript
import chalk from "chalk";

// Standard voice API (still fully supported)
const text = await agent.voice.listen(audioStream); // Returns transcribed text
const responseAudio = await agent.voice.speak("Hello"); // Returns an audio stream

// New event-based approach for speech-to-speech
// Set up event listeners
agent.voice.on('speaker', (stream) => {
  stream.pipe(speaker);
});

agent.voice.on('writing', ({ text, role }) => {
  if (role === 'user') {
    process.stdout.write(chalk.green(text));
  } else {
    process.stdout.write(chalk.blue(text));
  }
});

agent.voice.on('error', (error) => {
  console.error('Voice error:', error);
});

await agent.voice.send(microphoneStream); // Sends audio

// Trigger events
await agent.voice.speak("Hello"); // Triggers 'speaking' events
await agent.voice.listen(microphoneStream); // Triggers 'writing' events
```
This event-based architecture offers two key advantages:
- Continuous streaming - Audio can be processed in chunks as it becomes available
- Real-time feedback - Speech is recognized and processed as it happens
The event system is particularly powerful for web and mobile applications where you need to update the interface in real time or handle audio playback as it's generated.
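As a hedged sketch of that pattern, a browser UI might append each recognized chunk to a live transcript as it arrives (the `#transcript` element and CSS classes here are assumptions for illustration):

```typescript
// Stream partial transcripts into the page as 'writing' events arrive.
// The #transcript element is hypothetical.
const transcriptEl = document.querySelector("#transcript")!;

agent.voice.on("writing", ({ text, role }) => {
  const span = document.createElement("span");
  span.className = role === "user" ? "user-text" : "agent-text";
  span.textContent = text; // each chunk renders immediately, not after the full utterance
  transcriptEl.appendChild(span);
});
```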
Adding Tools to Speech-to-Speech Conversations
While each speech-to-speech provider implements tools in their own way, Mastra abstracts these differences away with a consistent interface. Any tools configured on your agent are automatically available to your voice provider:
```typescript
import { openai } from "@ai-sdk/openai";
import { Agent } from "@mastra/core/agent";
import { OpenAIRealtimeVoice } from "@mastra/voice-openai-realtime";
import { search, getWeather, fetchHoroscope } from "./tools";

const agent = new Agent({
  name: 'Agent',
  instructions: `You are a helpful assistant with speech-to-speech capabilities.`,
  model: openai('gpt-4o'),
  tools: {
    search,
    getWeather,
    fetchHoroscope
  },
  voice: new OpenAIRealtimeVoice()
});

// Tools are automatically available to the voice system
await agent.voice.connect();
await agent.voice.speak("How can I help you today?");
```
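For reference, here's roughly what one of the tools imported above could look like, sketched with Mastra's `createTool` helper; the weather endpoint URL is a placeholder assumption:

```typescript
import { createTool } from "@mastra/core/tools";
import { z } from "zod";

// A sketch of the getWeather tool imported above. The fetch URL is a
// placeholder; any async implementation is exposed to the voice provider
// the same way.
export const getWeather = createTool({
  id: "getWeather",
  description: "Get the current weather for a city",
  inputSchema: z.object({
    city: z.string().describe("The city to look up"),
  }),
  execute: async ({ context }) => {
    const res = await fetch(
      `https://api.example.com/weather?city=${encodeURIComponent(context.city)}`,
    );
    return await res.json();
  },
});
```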
Speech-to-Speech in Action
We're starting with OpenAI's Realtime API as our first speech-to-speech provider in Mastra, with plans to expand our provider options as they become available.
Here's an example of how you might build a voice agent that handles microphone input and audio playback:
```typescript
import Speaker from "@mastra/node-speaker";
import NodeMic from "node-mic";
import { mastra } from "./mastra";
import chalk from "chalk";

const agent = mastra.getAgent("dane");

if (!agent.voice) {
  throw new Error("Agent does not have voice capabilities");
}

let speaker: Speaker | undefined;

const makeSpeaker = () =>
  new Speaker({
    sampleRate: 24000, // 24 kHz matches the PCM audio the Realtime API streams back
    channels: 1, // Mono audio output (stereo would be 2)
    bitDepth: 16, // 16-bit resolution, the CD-quality standard
  });

const mic = new NodeMic({
  rate: 24000, // Match the speaker's sample rate for consistent audio processing
});

agent.voice.on("writing", (ev) => {
  if (ev.role === 'user') {
    process.stdout.write(chalk.green(ev.text));
  } else {
    process.stdout.write(chalk.blue(ev.text));
  }
});

agent.voice.on("speaker", (stream) => {
  // Tear down the previous speaker before playing a new response
  if (speaker) {
    speaker.removeAllListeners();
    speaker.close(true);
  }

  // Pause the mic so the agent doesn't hear (and respond to) itself
  mic.pause();
  speaker = makeSpeaker();

  stream.pipe(speaker);

  speaker.on('close', () => {
    console.log("Speaker finished, resuming mic");
    mic.resume();
  });
});

// Errors from the voice provider
agent.voice.on("error", (error) => {
  console.error("Voice error:", error);
});

await agent.voice.connect();

mic.start();

const microphoneStream = mic.getAudioStream();
await agent.voice.send(microphoneStream);

await agent.voice.speak('Hello, how can I help you today?');
```
This example demonstrates a command-line assistant that listens to microphone input, processes it through a speech-to-speech agent, and plays the agent's responses through the system speaker.
You can find a complete demo in our GitHub repository and try it out for yourself.
We're actively developing Mastra! If you encounter any issues or have suggestions for improvements, please open an issue on our GitHub repository or contribute directly with a pull request.
Get started with speech-to-speech for Mastra agents today by installing the latest version of our packages:
```bash
npm install @mastra/core @mastra/voice-openai-realtime
```
The full documentation is available here with additional examples and configuration options.