Giving your Agent a Voice

Mastra agents can be enhanced with voice capabilities, enabling them to speak and listen. This example demonstrates two ways to configure voice functionality:

  1. Using a composite voice setup that separates input and output streams,
  2. Using a unified voice provider that handles both.

Both examples use the OpenAIVoice provider for demonstration purposes.

Prerequisites

This example uses OpenAI models. Add OPENAI_API_KEY to your .env file.

OPENAI_API_KEY=<your-api-key>
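
A missing key can be easier to diagnose when it is checked at startup rather than on the first voice call. The following is a minimal sketch of such a check; `requireEnv` is a hypothetical helper for illustration and is not part of Mastra.

```typescript
// Hypothetical helper (not a Mastra API): fail fast when a required
// environment variable is missing or blank.
export function requireEnv(
  name: string,
  env: Record<string, string | undefined> = process.env,
): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```

Calling `requireEnv("OPENAI_API_KEY")` after `import "dotenv/config"` would stop the script with a clear message if the key was not loaded.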

Installation

npm install @mastra/voice-openai

Hybrid voice agent

This agent uses a composite voice setup that separates speech-to-text and text-to-speech functionality. The CompositeVoice allows you to configure different providers for listening (input) and speaking (output). However, in this example, both are handled by the same provider: OpenAIVoice.

import { Agent } from "@mastra/core/agent";
import { CompositeVoice } from "@mastra/core/voice";
import { OpenAIVoice } from "@mastra/voice-openai";
import { openai } from "@ai-sdk/openai";

export const hybridVoiceAgent = new Agent({
  name: "hybrid-voice-agent",
  model: openai("gpt-4o"),
  instructions: "You can speak and listen using different providers.",
  voice: new CompositeVoice({
    input: new OpenAIVoice(),
    output: new OpenAIVoice(),
  }),
});

See Agent for a full list of configuration options.
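
To make the idea of separate input and output providers concrete, here is a simplified sketch of the delegation pattern. It does not use or reproduce Mastra's CompositeVoice; the `Listener` and `Speaker` interfaces, the `SplitVoice` class, and the stub providers are all hypothetical names invented for illustration.

```typescript
import { Readable } from "node:stream";

// Hypothetical, simplified interfaces (not Mastra's types).
interface Listener {
  listen(audio: NodeJS.ReadableStream): Promise<string>;
}
interface Speaker {
  speak(text: string): Promise<NodeJS.ReadableStream>;
}

// Routes listen() to the input provider and speak() to the output provider.
class SplitVoice implements Listener, Speaker {
  constructor(
    private readonly parts: { input: Listener; output: Speaker },
  ) {}

  listen(audio: NodeJS.ReadableStream): Promise<string> {
    return this.parts.input.listen(audio);
  }

  speak(text: string): Promise<NodeJS.ReadableStream> {
    return this.parts.output.speak(text);
  }
}

// Stub providers standing in for two different real services.
const stubListener: Listener = {
  listen: async () => "transcribed text",
};
const stubSpeaker: Speaker = {
  speak: async (text) => Readable.from([`audio:${text}`]),
};

export const splitVoice = new SplitVoice({
  input: stubListener,
  output: stubSpeaker,
});
```

Because each side only has to satisfy its own interface, swapping the listening provider would not affect the speaking provider, which is the motivation for a composite setup.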

Unified voice agent

This agent uses a single voice provider for both speech-to-text and text-to-speech. If you plan to use the same provider for both listening and speaking, this is a simpler setup. In this example, the OpenAIVoice provider handles both functions.

import { openai } from "@ai-sdk/openai";
import { Agent } from "@mastra/core/agent";
import { OpenAIVoice } from "@mastra/voice-openai";

export const unifiedVoiceAgent = new Agent({
  name: "unified-voice-agent",
  instructions: "You are an agent with both STT and TTS capabilities.",
  model: openai("gpt-4o"),
  voice: new OpenAIVoice(),
});

See Agent for a full list of configuration options.

Registering agents

To use these agents, register them in your main Mastra instance.

import { Mastra } from "@mastra/core/mastra";

import { hybridVoiceAgent } from "./agents/example-hybrid-voice-agent";
import { unifiedVoiceAgent } from "./agents/example-unified-voice-agent";

export const mastra = new Mastra({
  // ...
  agents: { hybridVoiceAgent, unifiedVoiceAgent },
});

Functions

These helper functions handle audio file operations and text conversion for the voice interaction example.

saveAudioToFile

This function saves an audio stream to a file in the audio directory, creating the directory if it doesn't exist.

import fs, { createWriteStream } from "fs";
import path from "path";

export const saveAudioToFile = async (
  audio: NodeJS.ReadableStream,
  filename: string,
): Promise<void> => {
  const audioDir = path.join(process.cwd(), "audio");
  const filePath = path.join(audioDir, filename);

  // Create the audio directory if it does not exist yet.
  await fs.promises.mkdir(audioDir, { recursive: true });

  const writer = createWriteStream(filePath);
  audio.pipe(writer);
  return new Promise((resolve, reject) => {
    writer.on("finish", resolve);
    writer.on("error", reject);
    // pipe() does not forward source errors, so reject on those as well.
    audio.on("error", reject);
  });
};
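
As an alternative sketch, the same save step can be written with `pipeline` from `node:stream/promises`, which rejects if either the source or the destination stream fails and destroys both streams on error. The name `saveAudioWithPipeline` and the optional `baseDir` parameter are assumptions added here for illustration; they are not part of the original example.

```typescript
import { createWriteStream } from "node:fs";
import { mkdir } from "node:fs/promises";
import { join } from "node:path";
import { pipeline } from "node:stream/promises";

// Hypothetical variant of saveAudioToFile. `baseDir` is an added parameter
// that defaults to the same ./audio directory used in the example.
export const saveAudioWithPipeline = async (
  audio: NodeJS.ReadableStream,
  filename: string,
  baseDir: string = join(process.cwd(), "audio"),
): Promise<string> => {
  // Create the target directory if it does not exist yet.
  await mkdir(baseDir, { recursive: true });

  const filePath = join(baseDir, filename);

  // Resolves once the file is fully written; rejects on any stream error.
  await pipeline(audio, createWriteStream(filePath));

  return filePath;
};
```

Returning the file path is a design choice that lets the caller open the saved file without rebuilding the path.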

convertToText

This function converts either a string or a readable stream to text, handling both input types for voice processing.

export const convertToText = async (
  input: string | NodeJS.ReadableStream,
): Promise<string> => {
  if (typeof input === "string") {
    return input;
  }

  const chunks: Buffer[] = [];
  return new Promise((resolve, reject) => {
    input.on("data", (chunk) => chunks.push(Buffer.from(chunk)));
    input.on("error", reject);
    input.on("end", () => resolve(Buffer.concat(chunks).toString("utf-8")));
  });
};
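
The same conversion can also be sketched with async iteration instead of event listeners, which lets stream errors surface as ordinary rejected promises. `streamToText` is a hypothetical name used here to avoid confusion with the helper above.

```typescript
// Hypothetical alternative to convertToText using `for await`.
export const streamToText = async (
  input: string | NodeJS.ReadableStream,
): Promise<string> => {
  // Strings are already text, so return them unchanged.
  if (typeof input === "string") {
    return input;
  }

  const chunks: Buffer[] = [];
  for await (const chunk of input) {
    // Streams may yield strings or Buffers; normalize both to Buffer.
    chunks.push(
      typeof chunk === "string" ? Buffer.from(chunk, "utf-8") : Buffer.from(chunk),
    );
  }
  return Buffer.concat(chunks).toString("utf-8");
};
```

Collecting Buffers and decoding once at the end avoids splitting a multi-byte UTF-8 character across chunk boundaries.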

Example usage

This example demonstrates a voice interaction between two agents. The hybrid voice agent speaks a question, which is saved as an audio file. The unified voice agent listens to that file, processes the question, generates a response, and speaks it back. Both audio outputs are saved to the audio directory.

The following files are created:

  • hybrid-question.mp3 – Hybrid agent's spoken question.
  • unified-response.mp3 – Unified agent's spoken response.

import "dotenv/config";

import path from "path";
import { createReadStream } from "fs";
import { mastra } from "./mastra";

import { saveAudioToFile } from "./mastra/utils/save-audio-to-file";
import { convertToText } from "./mastra/utils/convert-to-text";

const hybridVoiceAgent = mastra.getAgent("hybridVoiceAgent");
const unifiedVoiceAgent = mastra.getAgent("unifiedVoiceAgent");

const question = "What is the meaning of life in one sentence?";

const hybridSpoken = await hybridVoiceAgent.voice.speak(question);

await saveAudioToFile(hybridSpoken!, "hybrid-question.mp3");

const audioStream = createReadStream(
  path.join(process.cwd(), "audio", "hybrid-question.mp3"),
);
const unifiedHeard = await unifiedVoiceAgent.voice.listen(audioStream);

const inputText = await convertToText(unifiedHeard!);

const unifiedResponse = await unifiedVoiceAgent.generate(inputText);
const unifiedSpoken = await unifiedVoiceAgent.voice.speak(unifiedResponse.text);

await saveAudioToFile(unifiedSpoken!, "unified-response.mp3");
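
The non-null assertions (`!`) in the example suggest that `speak` and `listen` may resolve without a value. A small guard can turn that case into a descriptive error instead of a failure further down the line. This is a sketch only; `requireValue` is a hypothetical helper, not a Mastra API.

```typescript
// Hypothetical helper: throw a descriptive error when a call produced no value.
export function requireValue<T>(
  value: T | void | null | undefined,
  label: string,
): T {
  if (value === undefined || value === null) {
    throw new Error(`${label} returned no value`);
  }
  return value as T;
}
```

In the script above, this could replace an assertion such as `hybridSpoken!` with `requireValue(hybridSpoken, "hybridVoiceAgent.voice.speak")`.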