
Voice-to-Voice Capabilities in Mastra

Introduction

Voice-to-Voice in Mastra provides a standardized interface for real-time speech-to-speech interactions across multiple service providers. This section covers configuration, event-driven architecture, and implementation methods for creating conversational voice experiences. For integrating Voice-to-Voice capabilities with agents, refer to the Adding Voice to Agents documentation.

Real-time Voice Interactions

Mastra’s real-time voice system enables continuous bidirectional audio communication through an event-driven architecture. Unlike separate TTS and STT operations, real-time voice maintains an open connection that processes speech continuously in both directions.

Example Implementation

import { Agent } from "@mastra/core/agent";
import { OpenAIRealtimeVoice } from "@mastra/voice-openai-realtime";
import { openai } from "@ai-sdk/openai";

// getMicrophoneStream() and playAudio() are environment-specific helpers
// (for example, the utilities provided by @mastra/node-audio)

const agent = new Agent({
  name: 'Agent',
  instructions: `You are a helpful assistant with real-time voice capabilities.`,
  model: openai('gpt-4o'),
  voice: new OpenAIRealtimeVoice(),
});

// Connect to the voice service
await agent.voice.connect();

// Listen for agent audio responses
agent.voice.on('speaker', (stream) => {
  playAudio(stream);
});

// Initiate the conversation
await agent.voice.speak('How can I help you today?');

// Send continuous audio from the microphone
const micStream = getMicrophoneStream();
await agent.voice.send(micStream);

Event-Driven Architecture

Mastra’s voice-to-voice implementation is built on an event-driven architecture. Developers register event listeners to handle incoming audio progressively, allowing for more responsive interactions than waiting for complete audio responses.
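
Continuing the example above, listeners for the events described later on this page are typically registered before any audio is exchanged so nothing is missed (a minimal sketch; playAudio is a placeholder for your audio output):

// Play assistant audio as it streams in
agent.voice.on('speaker', (stream) => {
  playAudio(stream); // placeholder: pipe the stream to your audio output
});

// Handle transcriptions incrementally as they become available
agent.voice.on('writing', ({ text, role }) => {
  console.log(`${role}: ${text}`);
});

// Surface provider errors
agent.voice.on('error', (error) => {
  console.error('Voice error:', error);
});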

Configuration

When initializing a voice-to-voice provider, you can provide configuration options to customize its behavior:

Constructor Options

  • chatModel: Configuration for the OpenAI realtime model.

    • apiKey: Your OpenAI API key. Falls back to the OPENAI_API_KEY environment variable.
    • model: The model ID to use for real-time voice interactions (e.g., gpt-4o-mini-realtime).
    • options: Additional options for the realtime client, such as session configuration.
  • speaker: The default voice ID to use for speech output (e.g., alloy).

Example Configuration

const voice = new OpenAIRealtimeVoice({
  chatModel: {
    apiKey: 'your-openai-api-key',
    model: 'gpt-4o-mini-realtime',
    options: {
      sessionConfig: {
        turn_detection: {
          type: 'server_vad',
          threshold: 0.6,
          silence_duration_ms: 1200,
        },
      },
    },
  },
  speaker: 'alloy', // Default voice
});

// If using default settings the configuration can be simplified to:
const voice = new OpenAIRealtimeVoice();

Core Methods

The OpenAIRealtimeVoice class provides the following core methods for voice interactions:

connect()

Establishes a connection to the OpenAI realtime service.

Usage:

await voice.connect();

Notes:

  • Must be called before using any other interaction methods
  • Returns a Promise that resolves when the connection is established
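
A common pattern, continuing from the configured voice instance above, is to register the error listener first, then connect, then begin interacting (a minimal sketch):

// Register the error listener before connecting so connection failures are surfaced
voice.on('error', (error) => {
  console.error('Voice error:', error);
});

// connect() must complete before speak(), send(), or answer() are called
await voice.connect();
await voice.speak('Connected and ready.');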

speak(text, options?)

Emits a speaker event using the configured voice model.

Parameters:

  • text: String content to be spoken
  • options: Optional configuration object
    • speaker: Voice ID to use (overrides default)
    • properties: Additional provider-specific properties

Usage:

voice.speak('Hello, how can I help you today?', { speaker: 'alloy' });

Notes:

  • Emits a ‘speaker’ event rather than returning an audio stream (see the sketch below)
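
Because the audio arrives through the event system, a listener is usually registered before calling speak (a minimal sketch; playAudio is a placeholder for your audio output):

// The audio for the spoken text arrives via the 'speaker' event
voice.on('speaker', (stream) => {
  playAudio(stream); // placeholder: pipe or play the stream in your environment
});

await voice.speak('Hello, how can I help you today?', { speaker: 'alloy' });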

listen(audioInput, options?)

Processes audio input for speech recognition.

Parameters:

  • audioInput: Readable stream of audio data
  • options: Optional configuration object
    • filetype: Audio format (default: ‘mp3’)
    • Additional provider-specific options

Usage:

const audioData = getMicrophoneStream();
voice.listen(audioData, { filetype: 'wav' });

Notes:

  • Emits ‘writing’ events with transcribed text
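
Transcriptions produced by listen are delivered through ‘writing’ events, so a listener is usually registered first (a minimal sketch; getMicrophoneStream is a placeholder for your audio source):

// Transcribed text is delivered via 'writing' events rather than a return value
voice.on('writing', ({ text, role }) => {
  console.log(`${role}: ${text}`);
});

const audioData = getMicrophoneStream(); // placeholder audio source
voice.listen(audioData, { filetype: 'wav' });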

send(audioStream)

Streams audio data in real-time for continuous processing.

Parameters:

  • audioStream: Readable stream of audio data

Usage:

const micStream = getMicrophoneStream();
await voice.send(micStream);

Notes:

  • Used for continuous audio streaming scenarios like live microphone input
  • Returns a Promise that resolves when the stream is accepted

answer(params)

Sends a response to the OpenAI Realtime API.

Parameters:

  • params: The parameters object
    • options: Configuration options for the response
      • content: Text content of the response
      • voice: Voice ID to use for the response

Usage:

await voice.answer({
  options: {
    content: "Hello, how can I help you today?",
    voice: "alloy",
  },
});

Notes:

  • Triggers a response to the real-time session
  • Returns a Promise that resolves when the response has been sent
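
One possible flow, sketched under the assumption that a response is requested explicitly after user audio has been streamed (the microphone helper is a placeholder):

// Stream the user's audio into the session
const micStream = getMicrophoneStream(); // placeholder audio source
await voice.send(micStream);

// Explicitly trigger a response for the current turn
await voice.answer({
  options: {
    content: "Thanks! Give me a moment to look into that.",
    voice: "alloy",
  },
});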

Utility Methods

updateConfig(config)

Updates the session configuration for the voice instance.

Parameters:

  • config: New session configuration object

Usage:

voice.updateConfig({
  turn_detection: {
    type: 'server_vad',
    threshold: 0.6,
    silence_duration_ms: 1200,
  },
});

addTools(tools)

Adds a set of tools to the voice instance.

Parameters:

  • tools: Array of tool objects that the model can call

Usage:

voice.addTools([
  createTool({
    id: "Get Weather Information",
    inputSchema: z.object({
      city: z.string(),
    }),
    description: `Fetches the current weather information for a given city`,
    execute: async ({ city }) => {...},
  }),
]);

close()

Disconnects from the OpenAI realtime session and cleans up resources.

Usage:

voice.close();

Notes:

  • Should be called when you’re done with the voice instance to free resources
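
To make sure the session is always released, close is often paired with connect in a try/finally block (a minimal sketch, continuing from the voice instance above):

try {
  await voice.connect();
  await voice.speak('Hello!');
  // ... further interaction ...
} finally {
  // Always release the realtime session, even if an error was thrown
  voice.close();
}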

on(event, callback)

Registers an event listener for voice events.

Parameters:

  • event: Event name (‘speaker’, ‘writing’, or ‘error’)
  • callback: Function to call when the event occurs

Usage:

voice.on('speaker', (stream) => { stream.pipe(speaker) });
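
The speaker variable in the snippet above is whatever writable audio sink your environment provides. One common option in Node.js is the community speaker package (an assumption about your setup, not something the Mastra API requires):

import Speaker from 'speaker';

// A writable stream that plays raw PCM audio on the default output device
const speaker = new Speaker({
  sampleRate: 24000, // should match the audio format emitted by the provider
  channels: 1,
  bitDepth: 16,
});

voice.on('speaker', (stream) => {
  stream.pipe(speaker);
});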

off(event, callback)

Removes a previously registered event listener.

Parameters:

  • event: Event name
  • callback: The callback function to remove

Usage:

voice.off('speaker', callbackFunction);

Events

The OpenAIRealtimeVoice class emits the following events:

speaker

Emitted when audio data is received from the model.

Event payload:

  • stream: A stream of audio data as a readable stream

agent.voice.on('speaker', (stream) => {
  stream.pipe(speaker);
});

writing

Emitted when transcribed text is available.

Event payload:

  • text: The transcribed text
  • role: The role of the speaker (user or assistant)
  • done: Boolean indicating if this is the final transcription

agent.voice.on('writing', ({ text, role }) => {
  console.log(`${role}: ${text}`); // Log who said what
});
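
The done flag can be used to distinguish intermediate updates from the final transcription for a turn (a minimal sketch; depending on the provider, intermediate events may carry partial text):

agent.voice.on('writing', ({ text, role, done }) => {
  if (!done) return; // skip intermediate updates
  console.log(`[final] ${role}: ${text}`);
});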

error

Emitted when an error occurs.

Event payload:

  • Error object with details about what went wrong

agent.voice.on('error', (error) => {
  console.error('Voice error:', error);
});