Adding Voice to Agents
Mastra agents can be enhanced with voice capabilities, allowing them to speak responses and listen to user input. You can configure an agent either to use a single voice provider or to combine multiple providers for different operations.
Using a Single Provider
The simplest way to add voice to an agent is to use a single provider for both speaking and listening:
import { createReadStream } from "fs";
import path from "path";
import { Agent } from "@mastra/core/agent";
import { OpenAIVoice } from "@mastra/voice-openai";
import { openai } from "@ai-sdk/openai";
// Initialize the voice provider with default settings
const voice = new OpenAIVoice();
// Create an agent with voice capabilities
export const agent = new Agent({
  name: "Agent",
  instructions: `You are a helpful assistant with both STT and TTS capabilities.`,
  model: openai("gpt-4o"),
  voice,
});
// The agent can now use voice for interaction
await agent.voice.speak("Hello, I'm your AI assistant!");
// Read audio file and transcribe
const audioFilePath = path.join(process.cwd(), "audio.m4a");
const audioStream = createReadStream(audioFilePath);
try {
  const transcription = await agent.voice.listen(audioStream, { filetype: "m4a" });
  console.log("Transcription:", transcription);
} catch (error) {
  console.error("Error transcribing audio:", error);
}
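By default, OpenAIVoice reads its API key from the OPENAI_API_KEY environment variable. If you need different models or a specific voice, the provider also accepts explicit settings. The option names below are a sketch based on @mastra/voice-openai and are worth verifying against the provider reference for your installed version:
// A configured provider instead of the defaults (option names assumed
// from @mastra/voice-openai; verify against your installed version)
const configuredVoice = new OpenAIVoice({
  speechModel: {
    name: "tts-1-hd",
    apiKey: process.env.OPENAI_API_KEY,
  },
  listeningModel: {
    name: "whisper-1",
    apiKey: process.env.OPENAI_API_KEY,
  },
  speaker: "alloy",
});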
Using Multiple Providers
For more flexibility, you can use different providers for speaking and listening using the CompositeVoice class:
import { Agent } from "@mastra/core/agent";
import { CompositeVoice } from "@mastra/core/voice";
import { OpenAIVoice } from "@mastra/voice-openai";
import { PlayAIVoice } from "@mastra/voice-playai";
import { openai } from "@ai-sdk/openai";
export const agent = new Agent({
  name: "Agent",
  instructions: `You are a helpful assistant with both STT and TTS capabilities.`,
  model: openai("gpt-4o"),
  // Create a composite voice using OpenAI for listening and PlayAI for speaking
  voice: new CompositeVoice({
    listenProvider: new OpenAIVoice(),
    speakProvider: new PlayAIVoice(),
  }),
});
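With this setup, each call is routed to the matching provider: listen() goes to OpenAIVoice and speak() goes to PlayAIVoice, so the calling code looks the same as in the single-provider case. A quick sketch:
import { createReadStream } from "fs";
// listen() is handled by OpenAIVoice, speak() by PlayAIVoice
const audioStream = createReadStream("question.m4a");
const question = await agent.voice.listen(audioStream, { filetype: "m4a" });
await agent.voice.speak("This reply is synthesized by PlayAI.");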
Working with Audio Streams
The speak() and listen() methods work with Node.js streams. Here’s how to save and load audio files:
Saving Speech Output
import { createWriteStream } from "fs";
import path from "path";
// Generate speech and save to file
const audio = await agent.voice.speak("Hello, World!");
const filePath = path.join(process.cwd(), "agent.mp3");
const writer = createWriteStream(filePath);
audio.pipe(writer);
await new Promise<void>((resolve, reject) => {
  writer.on("finish", () => resolve());
  writer.on("error", reject);
});
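On Node 15 and later, the promise-based pipeline helper from stream/promises collapses the manual Promise wiring above into a single call. A minimal equivalent sketch:
import { createWriteStream } from "fs";
import path from "path";
import { pipeline } from "stream/promises";
// Pipe the generated speech straight to disk and await completion
const speech = await agent.voice.speak("Hello, World!");
const outPath = path.join(process.cwd(), "agent.mp3");
await pipeline(speech, createWriteStream(outPath));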
Transcribing Audio Input
import { createReadStream } from "fs";
import path from "path";
// Read audio file and transcribe
const audioFilePath = path.join(process.cwd(), "agent.m4a");
const audioStream = createReadStream(audioFilePath);
try {
  console.log("Transcribing audio file...");
  const transcription = await agent.voice.listen(audioStream, { filetype: "m4a" });
  console.log("Transcription:", transcription);
} catch (error) {
  console.error("Error transcribing audio:", error);
}
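Transcription pairs naturally with text generation. A minimal sketch of a full voice round trip, assuming the single-provider agent from earlier: transcribe the question, generate an answer, and speak it back:
// Transcribe, generate, and respond with the same agent
const question = await agent.voice.listen(audioStream, { filetype: "m4a" });
const response = await agent.generate(question);
await agent.voice.speak(response.text);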
Real-time Voice Interactions
For more dynamic and interactive voice experiences, you can use real-time voice providers that support speech-to-speech capabilities:
import { Agent } from "@mastra/core/agent";
import { OpenAIRealtimeVoice } from "@mastra/voice-openai-realtime";
import { openai } from "@ai-sdk/openai";
import { search, calculate } from "../tools";
// Initialize the realtime voice provider
const voice = new OpenAIRealtimeVoice({
  chatModel: {
    apiKey: process.env.OPENAI_API_KEY,
    model: "gpt-4o-mini-realtime",
  },
  speaker: "alloy",
});
// Create an agent with speech-to-speech voice capabilities
export const agent = new Agent({
  name: "Agent",
  instructions: `You are a helpful assistant with speech-to-speech capabilities.`,
  model: openai("gpt-4o"),
  // Tools configured on the Agent are passed through to the voice provider
  tools: {
    search,
    calculate,
  },
  voice,
});
// Establish a WebSocket connection
await agent.voice.connect();
// Start a conversation
agent.voice.speak("Hello, I'm your AI assistant!");
// Stream audio from a microphone; getMicrophoneStream() is a placeholder
// for your platform-specific audio capture (e.g. @mastra/node-audio)
const microphoneStream = getMicrophoneStream();
agent.voice.send(microphoneStream);
// When done with the conversation
agent.voice.close();
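Because connect() opens a long-lived WebSocket, it's worth making sure close() runs even if the conversation throws. One way to structure the session:
await agent.voice.connect();
try {
  await agent.voice.speak("Hello, I'm your AI assistant!");
  agent.voice.send(getMicrophoneStream());
  // ... run the conversation ...
} finally {
  // Always release the WebSocket connection
  agent.voice.close();
}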
Event System
The realtime voice provider emits several events you can listen for:
// Listen for speech audio data sent from the voice provider
agent.voice.on("speaking", ({ audio }) => {
  // audio contains a ReadableStream or Int16Array of audio data
});
// Listen for transcribed text from both the voice provider and the user
agent.voice.on("writing", ({ text, role }) => {
  console.log(`${role} said: ${text}`);
});
// Listen for errors
agent.voice.on("error", (error) => {
  console.error("Voice error:", error);
});
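Building on the events above, a small sketch that accumulates a running transcript of the conversation (the payload shape matches the "writing" example):
type TranscriptEntry = { role: string; text: string };
const transcript: TranscriptEntry[] = [];
// Record every transcribed utterance from both user and assistant
agent.voice.on("writing", ({ text, role }) => {
  transcript.push({ role, text });
});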