
Giving your Agent a Voice

This example demonstrates how to add voice capabilities to Mastra agents, enabling them to speak and listen using different voice providers. We’ll create two agents with different voice configurations and show how they can interact using speech.

The example showcases:

  1. Using CompositeVoice to combine different providers for speaking and listening
  2. Using a single provider for both capabilities
  3. Basic voice interactions between agents

First, let’s import the required dependencies and set up our agents:

// Import required dependencies
import { openai } from '@ai-sdk/openai';
import { Agent } from '@mastra/core/agent';
import { CompositeVoice } from '@mastra/core/voice';
import { OpenAIVoice } from '@mastra/voice-openai';
import { createReadStream, createWriteStream } from 'fs';
import { PlayAIVoice } from '@mastra/voice-playai';
import path from 'path';
 
// Initialize Agent 1 with both listening and speaking capabilities
const agent1 = new Agent({
  name: 'Agent1',
  instructions: `You are an agent with both STT and TTS capabilities.`,
  model: openai('gpt-4o'),
  voice: new CompositeVoice({
    listenProvider: new OpenAIVoice(), // For converting speech to text
    speakProvider: new PlayAIVoice(), // For converting text to speech
  }),
});
 
// Initialize Agent 2 with just OpenAI for both listening and speaking capabilities
const agent2 = new Agent({
  name: 'Agent2',
  instructions: `You are an agent with both STT and TTS capabilities.`,
  model: openai('gpt-4o'),
  voice: new OpenAIVoice(),
});

In this setup:

  • Agent1 uses a CompositeVoice that combines OpenAI for speech-to-text and PlayAI for text-to-speech
  • Agent2 uses OpenAI’s voice capabilities for both functions
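
Both providers are expected to read their API keys from environment variables. Below is a minimal pre-flight sketch; the PlayAI variable names are assumptions, so confirm the exact names against the @mastra/voice-openai and @mastra/voice-playai documentation:

// Pre-flight check (sketch): verify the credentials the providers are expected to
// read from the environment before constructing the agents. The PlayAI variable
// names are assumptions - check the provider docs for the exact names.
const requiredEnvVars = ['OPENAI_API_KEY', 'PLAYAI_API_KEY', 'PLAYAI_USER_ID'];
const missing = requiredEnvVars.filter(name => !process.env[name]);
if (missing.length > 0) {
  throw new Error(`Missing environment variables: ${missing.join(', ')}`);
}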

Now let’s demonstrate a basic interaction between the agents:

// Step 1: Agent 1 speaks a question and saves it to a file
const audio1 = await agent1.speak('What is the meaning of life in one sentence?');
await saveAudioToFile(audio1, 'agent1-question.mp3');
 
// Step 2: Agent 2 listens to Agent 1's question
const audioFilePath = path.join(process.cwd(), 'agent1-question.mp3');
const audioStream = createReadStream(audioFilePath);
const transcription = await agent2.listen(audioStream);
const text = await convertToText(transcription);
 
// Step 3: Agent 2 generates and speaks a response
const agent2Response = await agent2.generate(text);
const agent2ResponseAudio = await agent2.speak(agent2Response.text);
await saveAudioToFile(agent2ResponseAudio, 'agent2-response.mp3');

Here’s what’s happening in the interaction:

  1. Agent1 converts text to speech using PlayAI and saves it to a file (we save the audio so you can hear the interaction)
  2. Agent2 listens to the audio file using OpenAI’s speech-to-text
  3. Agent2 generates a response and converts it to speech (the sketch just below shows how Agent1 could continue the loop)
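
The exchange does not have to stop there. As a sketch that reuses only the calls shown above and the helper functions defined below, Agent1 can listen to Agent2's saved reply and keep the conversation going (the variable names here are illustrative):

// Step 4 (sketch): Agent 1 listens to Agent 2's saved reply, generates a
// follow-up, and speaks it back - the same pattern as steps 1-3 in reverse.
const replyStream = createReadStream(path.join(process.cwd(), 'agent2-response.mp3'));
const replyText = await convertToText(await agent1.listen(replyStream));
 
const followUp = await agent1.generate(replyText);
const followUpAudio = await agent1.speak(followUp.text);
await saveAudioToFile(followUpAudio, 'agent1-follow-up.mp3');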

The example includes helper functions for handling audio files:

/**
 * Saves an audio stream to a file
 */
async function saveAudioToFile(audio: NodeJS.ReadableStream, filename: string): Promise<void> {
  const filePath = path.join(process.cwd(), filename);
  const writer = createWriteStream(filePath);
  audio.pipe(writer);
  return new Promise<void>((resolve, reject) => {
    writer.on('finish', resolve);
    writer.on('error', reject);
  });
}
 
/**
 * Converts either a string or a readable stream to text
 */
async function convertToText(input: string | NodeJS.ReadableStream): Promise<string> {
  if (typeof input === 'string') {
    return input;
  }
 
  const chunks: Buffer[] = [];
  return new Promise<string>((resolve, reject) => {
    input.on('data', chunk => chunks.push(Buffer.from(chunk)));
    input.on('error', err => reject(err));
    input.on('end', () => resolve(Buffer.concat(chunks).toString('utf-8')));
  });
}
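
To run the whole flow end to end, you might wrap the steps above in a single async function. This is a sketch (the runVoiceInteraction wrapper is a hypothetical name) that only reuses the code already shown in this example, adding basic error handling:

// Sketch: run steps 1-3 as one async flow with basic error handling.
async function runVoiceInteraction(): Promise<void> {
  try {
    // Agent 1 asks the question
    const questionAudio = await agent1.speak('What is the meaning of life in one sentence?');
    await saveAudioToFile(questionAudio, 'agent1-question.mp3');
 
    // Agent 2 transcribes it
    const questionStream = createReadStream(path.join(process.cwd(), 'agent1-question.mp3'));
    const questionText = await convertToText(await agent2.listen(questionStream));
 
    // Agent 2 answers out loud
    const response = await agent2.generate(questionText);
    const responseAudio = await agent2.speak(response.text);
    await saveAudioToFile(responseAudio, 'agent2-response.mp3');
  } catch (error) {
    console.error('Voice interaction failed:', error);
  }
}
 
runVoiceInteraction();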

Key Points

  1. The voice property in the Agent configuration accepts any implementation of MastraVoice
  2. CompositeVoice allows using different providers for speaking and listening
  3. Audio can be handled as streams, making it efficient for real-time processing (see the sketch after this list)
  4. Voice capabilities can be combined with the agent’s natural language processing
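
On the third point: since speak() returns a readable stream and listen() accepts one (as the signatures above show), audio can be passed between agents without touching the filesystem. A minimal sketch:

// Sketch: pass audio directly between agents as streams - no intermediate files.
const liveQuestion = await agent1.speak('What is the meaning of life in one sentence?');
const liveText = await convertToText(await agent2.listen(liveQuestion));
console.log('Transcribed without writing to disk:', liveText);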





View Example on GitHub