Voice-to-Voice Capabilities in Mastra

Introduction

Voice-to-Voice in Mastra provides a standardized interface for real-time speech-to-speech interactions across multiple service providers. This section covers configuration, event-driven architecture, and implementation methods for creating conversational voice experiences. For integrating Voice-to-Voice capabilities with agents, refer to the Adding Voice to Agents documentation.

Real-time Voice Interactions

Mastra’s real-time voice system enables continuous bidirectional audio communication through an event-driven architecture. Unlike separate TTS and STT operations, real-time voice maintains an open connection that processes speech continuously in both directions.

Example Implementation

import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
import { OpenAIRealtimeVoice } from "@mastra/voice-openai-realtime";
// playAudio and getMicrophoneStream are assumed audio helpers (e.g. from @mastra/node-audio)
 
const agent = new Agent({
  name: 'Agent',
  instructions: `You are a helpful assistant with real-time voice capabilities.`,
  model: openai('gpt-4o'),
  voice: new OpenAIRealtimeVoice(),
});
 
// Connect to the voice service
await agent.voice.connect();
 
// Listen for agent audio responses
agent.voice.on('speaking', ({ audio }) => {
  playAudio(audio);
});
 
// Initiate the conversation
await agent.voice.speak('How can I help you today?');
 
// Send continuous audio from the microphone
const micStream = getMicrophoneStream();
await agent.voice.send(micStream);

Event-Driven Architecture

Mastra’s voice-to-voice implementation is built on an event-driven architecture. Developers register event listeners to handle incoming audio progressively, allowing for more responsive interactions than waiting for complete audio responses.
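For example, the listeners below (a minimal sketch reusing the agent from the example above) handle each of the event types documented in the Events section:

// Play audio chunks as the assistant generates them
agent.voice.on('speaking', ({ audio }) => {
  playAudio(audio);
});

// Log transcribed text from both the user and the assistant
agent.voice.on('writing', ({ text, role }) => {
  console.log(`${role}: ${text}`);
});

// Surface connection or processing errors
agent.voice.on('error', (error) => {
  console.error('Voice error:', error);
});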

Configuration

When initializing a voice-to-voice provider, you can provide configuration options to customize its behavior:

Constructor Options

  • chatModel: Configuration for the OpenAI realtime model.

    • apiKey: Your OpenAI API key. Falls back to the OPENAI_API_KEY environment variable.
    • model: The model ID to use for real-time voice interactions (e.g., gpt-4o-mini-realtime).
    • options: Additional options for the realtime client, such as session configuration.
  • speaker: The default voice ID to use for speech output.

Example Configuration

const voice = new OpenAIRealtimeVoice({
  chatModel: {
    apiKey: 'your-openai-api-key',
    model: 'gpt-4o-mini-realtime',
    options: {
      sessionConfig: {
        turn_detection: {
          type: 'server_vad',
          threshold: 0.6,
          silence_duration_ms: 1200,
        },
      },
    },
  },
  speaker: 'alloy', // Default voice
});
 
// If using default settings, the configuration can be simplified to:
const voice = new OpenAIRealtimeVoice();

Core Methods

The OpenAIRealtimeVoice class provides the following core methods for voice interactions:

connect()

Establishes a connection to the OpenAI realtime service.

Usage:

await voice.connect();

Notes:

  • Must be called before using any other interaction methods
  • Returns a Promise that resolves when the connection is established

speak(text, options?)

Emits a speaking event using the configured voice model.

Parameters:

  • text: String content to be spoken
  • options: Optional configuration object
    • speaker: Voice ID to use (overrides default)
    • properties: Additional provider-specific properties

Usage:

voice.speak('Hello, how can I help you today?', {
  speaker: 'alloy'
});

Notes:

  • Emits ‘speaking’ events rather than returning an audio stream
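
Because the audio arrives through events rather than a return value, register a speaking listener before calling speak; a minimal sketch:

voice.on('speaking', ({ audio }) => {
  playAudio(audio); // Audio chunks arrive here as they are generated
});

await voice.speak('Hello, how can I help you today?', { speaker: 'alloy' });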

listen(audioInput, options?)

Processes audio input for speech recognition.

Parameters:

  • audioInput: Readable stream of audio data
  • options: Optional configuration object
    • filetype: Audio format (default: ‘mp3’)
    • Additional provider-specific options

Usage:

const audioData = getMicrophoneStream();
voice.listen(audioData, {
  filetype: 'wav'
});

Notes:

  • Emits ‘writing’ events with transcribed text
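
Likewise, the transcription is delivered through writing events rather than a return value; a minimal sketch:

voice.on('writing', ({ text, role }) => {
  console.log(`${role}: ${text}`); // Transcribed text arrives here
});

const audioData = getMicrophoneStream();
voice.listen(audioData, { filetype: 'wav' });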

send(audioStream)

Streams audio data in real-time for continuous processing.

Parameters:

  • audioStream: Readable stream of audio data

Usage:

const micStream = getMicrophoneStream();
await voice.send(micStream);

Notes:

  • Used for continuous audio streaming scenarios like live microphone input
  • Returns a Promise that resolves when the stream is accepted

answer(params)

Sends a response to the OpenAI Realtime API.

Parameters:

  • params: The parameters object
    • options: Configuration options for the response
      • content: Text content of the response
      • voice: Voice ID to use for the response

Usage:

await voice.answer({
  options: {
    content: "Hello, how can I help you today?",
    voice: "alloy"
  }
});

Notes:

  • Triggers a response to the real-time session
  • Returns a Promise that resolves when the response has been sent
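
A common pattern is to trigger answer once the user's transcription is complete. The sketch below echoes the user's words back; the writing payload fields it relies on are described in the Events section:

// Trigger a spoken reply after the final user transcription
voice.on('writing', ({ text, role, done }) => {
  if (role === 'user' && done) {
    voice.answer({
      options: {
        content: `You said: ${text}`,
        voice: 'alloy',
      },
    });
  }
});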

Utility Methods

updateConfig(config)

Updates the session configuration for the voice instance.

Parameters:

  • config: New session configuration object

Usage:

voice.updateConfig({
  turn_detection: {
    type: 'server_vad',
    threshold: 0.6,
    silence_duration_ms: 1200,
  }
});

addTools(tools)

Adds a set of tools to the voice instance.

Parameters:

  • tools: Array of tool objects that the model can call

Usage:

import { createTool } from "@mastra/core/tools";
import { z } from "zod";

voice.addTools([
  createTool({
    id: "Get Weather Information",
    inputSchema: z.object({
      city: z.string(),
    }),
    description: `Fetches the current weather information for a given city`,
    execute: async ({ city }) => {...},
  })
]);

close()

Disconnects from the OpenAI realtime session and cleans up resources.

Usage:

voice.close();

Notes:

  • Should be called when you’re done with the voice instance to free resources
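
A simple way to guarantee cleanup is to pair connect and close; a minimal sketch:

await voice.connect();
try {
  await voice.speak('Hello, how can I help you today?');
  // ... continue the conversation
} finally {
  voice.close(); // Always release the realtime connection
}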

on(event, callback)

Registers an event listener for voice events.

Parameters:

  • event: Event name (‘speaking’, ‘writing’, or ‘error’)
  • callback: Function to call when the event occurs

Usage:

voice.on('speaking', ({ audio }) => {
  playAudio(audio);
});

off(event, callback)

Removes a previously registered event listener.

Parameters:

  • event: Event name
  • callback: The callback function to remove

Usage:

voice.off('speaking', callbackFunction);

Events

The OpenAIRealtimeVoice class emits the following events:

speaking

Emitted when audio data is received from the model.

Event payload:

  • audio: A chunk of audio data as a readable stream

agent.voice.on('speaking', ({ audio }) => {
  playAudio(audio); // Handle audio chunks as they're generated
});

writing

Emitted when transcribed text is available.

Event payload:

  • text: The transcribed text
  • role: The role of the speaker (user or assistant)
  • done: Boolean indicating if this is the final transcription

agent.voice.on('writing', ({ text, role }) => {
  console.log(`${role}: ${text}`); // Log who said what
});
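
The done flag can be used to keep only completed utterances, for example when building a running transcript (a sketch, assuming the text on a final event contains the complete utterance):

const transcript: string[] = [];

agent.voice.on('writing', ({ text, role, done }) => {
  if (done) {
    transcript.push(`${role}: ${text}`); // Keep only final transcriptions
  }
});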

error

Emitted when an error occurs.

Event payload:

  • Error object with details about what went wrong

agent.voice.on('error', (error) => {
  console.error('Voice error:', error);
});