
Giving your Agent a Voice

This example demonstrates how to add voice capabilities to Mastra agents, enabling them to speak and listen using different voice providers. We’ll create two agents with different voice configurations and show how they can interact using speech.

The example showcases:

  1. Using CompositeVoice to combine different providers for speaking and listening
  2. Using a single provider for both capabilities
  3. Basic voice interactions between agents

First, let’s import the required dependencies and set up our agents:

// Import required dependencies
import { openai } from '@ai-sdk/openai';
import { Agent } from '@mastra/core/agent';
import { CompositeVoice } from '@mastra/core/voice';
import { OpenAIVoice } from '@mastra/voice-openai';
import { createReadStream, createWriteStream } from 'fs';
import { PlayAIVoice } from '@mastra/voice-playai';
import path from 'path';
 
// Initialize Agent 1 with both listening and speaking capabilities
const agent1 = new Agent({
  name: 'Agent1',
  instructions: `You are an agent with both STT and TTS capabilities.`,
  model: openai('gpt-4o'),
  voice: new CompositeVoice({
    listenProvider: new OpenAIVoice(), // For converting speech to text
    speakProvider: new PlayAIVoice(), // For converting text to speech
  }),
});
 
// Initialize Agent 2 with just OpenAI for both listening and speaking capabilities
const agent2 = new Agent({
  name: 'Agent2',
  instructions: `You are an agent with both STT and TTS capabilities.`,
  model: openai('gpt-4o'),
  voice: new OpenAIVoice(),
});

In this setup:

  • Agent1 uses a CompositeVoice that combines OpenAI for speech-to-text and PlayAI for text-to-speech
  • Agent2 uses OpenAI’s voice capabilities for both functions
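
Both providers are expected to read their API keys from environment variables. Below is a minimal pre-flight sketch; the PlayAI variable names are assumptions, so confirm the exact names against the @mastra/voice-openai and @mastra/voice-playai documentation:

// Pre-flight check (sketch): verify the credentials the providers are expected to
// read from the environment before constructing the agents. The PlayAI variable
// names are assumptions - check the provider docs for the exact names.
const requiredEnvVars = ['OPENAI_API_KEY', 'PLAYAI_API_KEY', 'PLAYAI_USER_ID'];
const missing = requiredEnvVars.filter(name => !process.env[name]);
if (missing.length > 0) {
  throw new Error(`Missing environment variables: ${missing.join(', ')}`);
}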

Now let’s demonstrate a basic interaction between the agents:

// Step 1: Agent 1 speaks a question and saves it to a file
const audio1 = await agent1.speak('What is the meaning of life in one sentence?');
await saveAudioToFile(audio1, 'agent1-question.mp3');
 
// Step 2: Agent 2 listens to Agent 1's question
const audioFilePath = path.join(process.cwd(), 'agent1-question.mp3');
const audioStream = createReadStream(audioFilePath);
const transcription = await agent2.listen(audioStream);
const text = await convertToText(transcription);
 
// Step 3: Agent 2 generates and speaks a response
const agent2Response = await agent2.generate(text);
const agent2ResponseAudio = await agent2.speak(agent2Response.text);
await saveAudioToFile(agent2ResponseAudio, 'agent2-response.mp3');

Here’s what’s happening in the interaction:

  1. Agent1 converts text to speech using PlayAI and saves it to a file (we save the audio so you can hear the interaction)
  2. Agent2 listens to the audio file using OpenAI’s speech-to-text
  3. Agent2 generates a response and converts it to speech (the sketch just below shows how Agent1 could continue the loop)
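
The exchange does not have to stop there. As a sketch that reuses only the calls shown above and the helper functions defined below, Agent1 can listen to Agent2's saved reply and keep the conversation going (the variable names here are illustrative):

// Step 4 (sketch): Agent 1 listens to Agent 2's saved reply, generates a
// follow-up, and speaks it back - the same pattern as steps 1-3 in reverse.
const replyStream = createReadStream(path.join(process.cwd(), 'agent2-response.mp3'));
const replyText = await convertToText(await agent1.listen(replyStream));
 
const followUp = await agent1.generate(replyText);
const followUpAudio = await agent1.speak(followUp.text);
await saveAudioToFile(followUpAudio, 'agent1-follow-up.mp3');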

The example includes helper functions for handling audio files:

/**
 * Saves an audio stream to a file
 */
async function saveAudioToFile(audio: NodeJS.ReadableStream, filename: string): Promise<void> {
  const filePath = path.join(process.cwd(), filename);
  const writer = createWriteStream(filePath);
  audio.pipe(writer);
  return new Promise<void>((resolve, reject) => {
    writer.on('finish', resolve);
    writer.on('error', reject);
  });
}
 
/**
 * Converts either a string or a readable stream to text
 */
async function convertToText(input: string | NodeJS.ReadableStream): Promise<string> {
  if (typeof input === 'string') {
    return input;
  }
 
  const chunks: Buffer[] = [];
  return new Promise<string>((resolve, reject) => {
    input.on('data', chunk => chunks.push(Buffer.from(chunk)));
    input.on('error', err => reject(err));
    input.on('end', () => resolve(Buffer.concat(chunks).toString('utf-8')));
  });
}
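
To run the whole flow end to end, you might wrap the steps above in a single async function. This is a sketch (the runVoiceInteraction wrapper is a hypothetical name) that only reuses the code already shown in this example, adding basic error handling:

// Sketch: run steps 1-3 as one async flow with basic error handling.
async function runVoiceInteraction(): Promise<void> {
  try {
    // Agent 1 asks the question
    const questionAudio = await agent1.speak('What is the meaning of life in one sentence?');
    await saveAudioToFile(questionAudio, 'agent1-question.mp3');
 
    // Agent 2 transcribes it
    const questionStream = createReadStream(path.join(process.cwd(), 'agent1-question.mp3'));
    const questionText = await convertToText(await agent2.listen(questionStream));
 
    // Agent 2 answers out loud
    const response = await agent2.generate(questionText);
    const responseAudio = await agent2.speak(response.text);
    await saveAudioToFile(responseAudio, 'agent2-response.mp3');
  } catch (error) {
    console.error('Voice interaction failed:', error);
  }
}
 
runVoiceInteraction();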

Key Points

  1. The voice property in the Agent configuration accepts any implementation of MastraVoice
  2. CompositeVoice allows using different providers for speaking and listening
  3. Audio can be handled as streams, making it efficient for real-time processing (see the sketch after this list)
  4. Voice capabilities can be combined with the agent’s natural language processing
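
On the third point: since speak() returns a readable stream and listen() accepts one (as the signatures above show), audio can be passed between agents without touching the filesystem. A minimal sketch:

// Sketch: pass audio directly between agents as streams - no intermediate files.
const liveQuestion = await agent1.speak('What is the meaning of life in one sentence?');
const liveText = await convertToText(await agent2.listen(liveQuestion));
console.log('Transcribed without writing to disk:', liveText);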





View Example on GitHub