# Giving your Agent a Voice
Mastra agents can be enhanced with voice capabilities, enabling them to speak and listen. This example demonstrates two ways to configure voice functionality:

- Using a composite voice setup that separates input and output streams.
- Using a unified voice provider that handles both.

Both examples use the `OpenAIVoice` provider for demonstration purposes.
## Prerequisites
This example uses the `openai` model. Make sure to add `OPENAI_API_KEY` to your `.env` file:

```env
OPENAI_API_KEY=<your-api-key>
```
## Installation

```bash
npm install @mastra/voice-openai
```
## Hybrid voice agent
This agent uses a composite voice setup that separates speech-to-text and text-to-speech functionality. `CompositeVoice` lets you configure different providers for listening (input) and speaking (output). In this example both are handled by the same provider, `OpenAIVoice`, but you can just as easily mix providers (see the sketch below).
```typescript
import { Agent } from "@mastra/core/agent";
import { CompositeVoice } from "@mastra/core/voice";
import { OpenAIVoice } from "@mastra/voice-openai";
import { openai } from "@ai-sdk/openai";

export const hybridVoiceAgent = new Agent({
  name: "hybrid-voice-agent",
  model: openai("gpt-4o"),
  instructions: "You can speak and listen using different providers.",
  voice: new CompositeVoice({
    input: new OpenAIVoice(),
    output: new OpenAIVoice()
  })
});
```
See Agent for a full list of configuration options.
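Because `CompositeVoice` keeps listening and speaking independent, you can swap either side for a different provider. The following is a minimal sketch of that idea, assuming you have installed the `@mastra/voice-elevenlabs` package and set its `ELEVENLABS_API_KEY` environment variable; it uses `OpenAIVoice` for listening and `ElevenLabsVoice` for speaking:

```typescript
import { Agent } from "@mastra/core/agent";
import { CompositeVoice } from "@mastra/core/voice";
import { OpenAIVoice } from "@mastra/voice-openai";
// Assumes `npm install @mastra/voice-elevenlabs` and ELEVENLABS_API_KEY in .env
import { ElevenLabsVoice } from "@mastra/voice-elevenlabs";
import { openai } from "@ai-sdk/openai";

export const mixedVoiceAgent = new Agent({
  name: "mixed-voice-agent",
  model: openai("gpt-4o"),
  instructions: "You listen with one provider and speak with another.",
  voice: new CompositeVoice({
    input: new OpenAIVoice(), // speech-to-text
    output: new ElevenLabsVoice() // text-to-speech
  })
});
```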
## Unified voice agent
This agent uses a single voice provider for both speech-to-text and text-to-speech. If you plan to use the same provider for both listening and speaking, this is the simpler setup. In this example, the `OpenAIVoice` provider handles both functions.
```typescript
import { openai } from "@ai-sdk/openai";
import { Agent } from "@mastra/core/agent";
import { OpenAIVoice } from "@mastra/voice-openai";

export const unifiedVoiceAgent = new Agent({
  name: "unified-voice-agent",
  instructions: "You are an agent with both STT and TTS capabilities.",
  model: openai("gpt-4o"),
  voice: new OpenAIVoice()
});
```
See Agent for a full list of configuration options.
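Constructed with no arguments, `OpenAIVoice` falls back to its default models and speaker. If you need more control, the sketch below shows the general shape of the configuration; the `speechModel`, `listeningModel`, and `speaker` option names are assumptions taken from the voice-openai reference, so verify them there before relying on this:

```typescript
import { OpenAIVoice } from "@mastra/voice-openai";

// Option names assumed from the @mastra/voice-openai reference -- verify before use
const voice = new OpenAIVoice({
  speechModel: { name: "tts-1" }, // text-to-speech model
  listeningModel: { name: "whisper-1" }, // speech-to-text model
  speaker: "alloy" // default voice used by speak()
});
```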
## Registering agents
To use these agents, register them in your main Mastra instance.
```typescript
import { Mastra } from "@mastra/core/mastra";

import { hybridVoiceAgent } from "./agents/example-hybrid-voice-agent";
import { unifiedVoiceAgent } from "./agents/example-unified-voice-agent";

export const mastra = new Mastra({
  // ...
  agents: { hybridVoiceAgent, unifiedVoiceAgent }
});
```
## Functions
These helper functions handle audio file operations and text conversion for the voice interaction example.
### saveAudioToFile
This function saves an audio stream to a file in the `audio` directory, creating the directory if it doesn't exist.
```typescript
import fs, { createWriteStream } from "fs";
import path from "path";

export const saveAudioToFile = async (audio: NodeJS.ReadableStream, filename: string): Promise<void> => {
  const audioDir = path.join(process.cwd(), "audio");
  const filePath = path.join(audioDir, filename);

  // Create the audio directory if it doesn't exist
  await fs.promises.mkdir(audioDir, { recursive: true });

  // Pipe the audio stream to disk and resolve once the write finishes
  const writer = createWriteStream(filePath);
  audio.pipe(writer);

  return new Promise<void>((resolve, reject) => {
    writer.on("finish", () => resolve());
    writer.on("error", reject);
  });
};
```
### convertToText
This function converts either a string or a readable stream to text, handling both input types for voice processing.
```typescript
export const convertToText = async (input: string | NodeJS.ReadableStream): Promise<string> => {
  // Plain strings pass through unchanged
  if (typeof input === "string") {
    return input;
  }

  // Collect the stream's chunks, then decode them as UTF-8 text
  const chunks: Buffer[] = [];
  return new Promise<string>((resolve, reject) => {
    input.on("data", (chunk) => chunks.push(Buffer.from(chunk)));
    input.on("error", reject);
    input.on("end", () => resolve(Buffer.concat(chunks).toString("utf-8")));
  });
};
```
## Example usage
This example demonstrates a voice interaction between two agents. The hybrid voice agent speaks a question, which is saved as an audio file. The unified voice agent listens to that file, processes the question, generates a response, and speaks it back. Both audio outputs are saved to the `audio` directory.

The following files are created:

- `hybrid-question.mp3` – the hybrid agent's spoken question.
- `unified-response.mp3` – the unified agent's spoken response.
import "dotenv/config";
import path from "path";
import { createReadStream } from "fs";
import { mastra } from "./mastra";
import { saveAudioToFile } from "./mastra/utils/save-audio-to-file";
import { convertToText } from "./mastra/utils/convert-to-text";
const hybridVoiceAgent = mastra.getAgent("hybridVoiceAgent");
const unifiedVoiceAgent = mastra.getAgent("unifiedVoiceAgent");
const question = "What is the meaning of life in one sentence?";
const hybridSpoken = await hybridVoiceAgent.voice.speak(question);
await saveAudioToFile(hybridSpoken!, "hybrid-question.mp3");
const audioStream = createReadStream(path.join(process.cwd(), "audio", "hybrid-question.mp3"));
const unifiedHeard = await unifiedVoiceAgent.voice.listen(audioStream);
const inputText = await convertToText(unifiedHeard!);
const unifiedResponse = await unifiedVoiceAgent.generate(inputText);
const unifiedSpoken = await unifiedVoiceAgent.voice.speak(unifiedResponse.text);
await saveAudioToFile(unifiedSpoken!, "unified-response.mp3");
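To try it out, run the script with a TypeScript runner such as `tsx`. The command below is a sketch that assumes the snippet above is saved as `index.ts` next to your `mastra` directory, so the relative imports resolve:

```bash
npx tsx index.ts
```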