Skip to Content
DocsVoiceOverview

Voice in Mastra

Mastra’s Voice system provides a unified interface for voice interactions, enabling text-to-speech (TTS), speech-to-text (STT), and real-time speech-to-speech (STS) capabilities in your applications.

Adding Voice to Agents

To learn how to integrate voice capabilities into your agents, check out the Adding Voice to Agents documentation. This section covers how to use both single and multiple voice providers, as well as real-time interactions.

import { Agent } from '@mastra/core/agent'; import { openai } from '@ai-sdk/openai'; import { OpenAIVoice } from "@mastra/voice-openai"; // Initialize OpenAI voice for TTS const voiceAgent = new Agent({ name: "Voice Agent", instructions: "You are a voice assistant that can help users with their tasks.", model: openai("gpt-4o"), voice: new OpenAIVoice(), });

You can then use the following voice capabilities:

Text to Speech (TTS)

Turn your agent’s responses into natural-sounding speech using Mastra’s TTS capabilities. Choose from multiple providers like OpenAI, ElevenLabs, and more.

For detailed configuration options and advanced features, check out our Text-to-Speech guide.

import { Agent } from '@mastra/core/agent'; import { openai } from '@ai-sdk/openai'; import { OpenAIVoice } from "@mastra/voice-openai"; import { playAudio } from "@mastra/node-audio"; const voiceAgent = new Agent({ name: "Voice Agent", instructions: "You are a voice assistant that can help users with their tasks.", model: openai("gpt-4o"), voice: new OpenAIVoice(), }); const { text } = await voiceAgent.generate('What color is the sky?'); // Convert text to speech to an Audio Stream const audioStream = await voiceAgent.voice.speak(text, { speaker: "default", // Optional: specify a speaker responseFormat: "wav", // Optional: specify a response format }); playAudio(audioStream);

Visit the OpenAI Voice Reference for more information on the OpenAI voice provider.

Speech to Text (STT)

Transcribe spoken content using various providers like OpenAI, ElevenLabs, and more. For detailed configuration options and more, check out Speech to Text.

You can download a sample audio file from here.


import { Agent } from '@mastra/core/agent'; import { openai } from '@ai-sdk/openai'; import { OpenAIVoice } from "@mastra/voice-openai"; import { createReadStream } from 'fs'; const voiceAgent = new Agent({ name: "Voice Agent", instructions: "You are a voice assistant that can help users with their tasks.", model: openai("gpt-4o"), voice: new OpenAIVoice(), }); // Use an audio file from a URL const audioStream = await createReadStream("./how_can_i_help_you.mp3"); // Convert audio to text const transcript = await voiceAgent.voice.listen(audioStream); console.log(`User said: ${transcript}`); // Generate a response based on the transcript const { text } = await voiceAgent.generate(transcript);

Visit the OpenAI Voice Reference for more information on the OpenAI voice provider.

Speech to Speech (STS)

Create conversational experiences with speech-to-speech capabilities. The unified API enables real-time voice interactions between users and AI agents. For detailed configuration options and advanced features, check out Speech to Speech.

import { Agent } from '@mastra/core/agent'; import { openai } from '@ai-sdk/openai'; import { playAudio, getMicrophoneStream } from '@mastra/node-audio'; import { OpenAIRealtimeVoice } from "@mastra/voice-openai-realtime"; const voiceAgent = new Agent({ name: "Voice Agent", instructions: "You are a voice assistant that can help users with their tasks.", model: openai("gpt-4o"), voice: new OpenAIRealtimeVoice(), }); // Listen for agent audio responses voiceAgent.voice.on('speaker', ({ audio }) => { playAudio(audio); }); // Initiate the conversation await voiceAgent.voice.speak('How can I help you today?'); // Send continuous audio from the microphone const micStream = getMicrophoneStream(); await voiceAgent.voice.send(micStream);

Visit the OpenAI Voice Reference for more information on the OpenAI voice provider.

Voice Configuration

Each voice provider can be configured with different models and options. Below are the detailed configuration options for all supported providers:

// OpenAI Voice Configuration const voice = new OpenAIVoice({ speechModel: { name: "gpt-3.5-turbo", // Example model name apiKey: process.env.OPENAI_API_KEY, language: "en-US", // Language code voiceType: "neural", // Type of voice model }, listeningModel: { name: "whisper-1", // Example model name apiKey: process.env.OPENAI_API_KEY, language: "en-US", // Language code format: "wav", // Audio format }, speaker: "alloy", // Example speaker name });

Visit the OpenAI Voice Reference for more information on the OpenAI voice provider.

Using Multiple Voice Providers

This example demonstrates how to create and use two different voice providers in Mastra: OpenAI for speech-to-text (STT) and PlayAI for text-to-speech (TTS).

Start by creating instances of the voice providers with any necessary configuration.

import { OpenAIVoice } from "@mastra/voice-openai"; import { PlayAIVoice } from "@mastra/voice-playai"; import { CompositeVoice } from "@mastra/core/voice"; import { playAudio, getMicrophoneStream } from "@mastra/node-audio"; // Initialize OpenAI voice for STT const input = new OpenAIVoice({ listeningModel: { name: "whisper-1", apiKey: process.env.OPENAI_API_KEY, }, }); // Initialize PlayAI voice for TTS const output = new PlayAIVoice({ speechModel: { name: "playai-voice", apiKey: process.env.PLAYAI_API_KEY, }, }); // Combine the providers using CompositeVoice const voice = new CompositeVoice({ input, output, }); // Implement voice interactions using the combined voice provider const audioStream = getMicrophoneStream(); // Assume this function gets audio input const transcript = await voice.listen(audioStream); // Log the transcribed text console.log("Transcribed text:", transcript); // Convert text to speech const responseAudio = await voice.speak(`You said: ${transcript}`, { speaker: "default", // Optional: specify a speaker, responseFormat: "wav", // Optional: specify a response format }); // Play the audio response playAudio(responseAudio);

For more information on the CompositeVoice, refer to the CompositeVoice Reference.

More Resources