voice.listen()

The listen() method is available on all Mastra voice providers and converts speech to text. It takes an audio stream as input and returns the transcribed text.

Usage Example

import { OpenAIVoice } from "@mastra/voice-openai"; import { createReadStream } from "fs"; import path from "path"; // Initialize a voice provider const voice = new OpenAIVoice({ listeningModel: { name: "whisper-1", apiKey: process.env.OPENAI_API_KEY, }, }); // Basic usage with a file stream const audioFilePath = path.join(process.cwd(), "audio.mp3"); const audioStream = createReadStream(audioFilePath); const transcript = await voice.listen(audioStream, { filetype: "mp3", }); console.log("Transcribed text:", transcript); // Using a microphone stream const microphoneStream = getMicrophoneStream(); // Assume this function gets audio input const transcription = await voice.listen(microphoneStream); // With provider-specific options const transcriptWithOptions = await voice.listen(audioStream, { language: "en", prompt: "This is a conversation about artificial intelligence.", });

Parameters

audioStream: NodeJS.ReadableStream
  Audio stream to transcribe. This can be a file stream or a microphone stream.

options?: object
  Provider-specific options for speech recognition.

Return Value

Returns one of the following (a short handling sketch follows the list):

  • Promise<string>: A promise that resolves to the transcribed text
  • Promise<NodeJS.ReadableStream>: A promise that resolves to a stream of transcribed text (for streaming transcription)
  • Promise<void>: For real-time providers that emit ‘writing’ events instead of returning text directly
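Because the resolved value differs by provider, code that needs to work across providers has to branch on the shape it gets back. A rough sketch (the voice parameter stands for any configured provider):

// Sketch: handle all three possible results of listen()
async function transcribe(
  voice: {
    listen(
      stream: NodeJS.ReadableStream,
      opts?: object,
    ): Promise<string | NodeJS.ReadableStream | void>;
  },
  audioStream: NodeJS.ReadableStream,
) {
  const result = await voice.listen(audioStream);

  if (typeof result === "string") {
    // Batch providers resolve with the full transcript
    console.log("Transcript:", result);
  } else if (result) {
    // Streaming providers resolve with a readable stream of text chunks
    result.on("data", (chunk) => process.stdout.write(chunk.toString()));
  } else {
    // Realtime providers resolve with void and emit 'writing' events instead
    // (see "Realtime Voice Providers" below)
  }
}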

Provider-Specific Options

Each voice provider may support additional options specific to their implementation. Here are some examples:

OpenAI

options.filetype?: string = 'mp3'
  Audio file format (e.g., 'mp3', 'wav', 'm4a')

options.prompt?: string
  Text to guide the model's transcription

options.language?: string
  Language code (e.g., 'en', 'fr', 'de')

Google

options.stream?: boolean = false
  Whether to use streaming recognition

options.config?: object = { encoding: 'LINEAR16', languageCode: 'en-US' }
  Recognition configuration from the Google Cloud Speech-to-Text API
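For example, streaming recognition with the Google provider might look like the sketch below. It assumes a GoogleVoice class exported from @mastra/voice-google; check that package for the exact import and defaults:

import { GoogleVoice } from "@mastra/voice-google"; // assumed export — verify in the package docs
import { createReadStream } from "fs";

const voice = new GoogleVoice();
const audioStream = createReadStream("audio.wav");

// stream: true requests streaming recognition; config is passed through to
// the Google Cloud Speech-to-Text API
const textStream = await voice.listen(audioStream, {
  stream: true,
  config: { encoding: "LINEAR16", languageCode: "en-US" },
});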

Deepgram

options.model?: string = 'nova-2'
  Deepgram model to use for transcription

options.language?: string = 'en'
  Language code for transcription
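Similarly, a Deepgram transcription call might look like this sketch, assuming a DeepgramVoice class exported from @mastra/voice-deepgram (verify the import against that package):

import { DeepgramVoice } from "@mastra/voice-deepgram"; // assumed export — verify in the package docs
import { createReadStream } from "fs";

const voice = new DeepgramVoice();
const audioStream = createReadStream("audio.mp3");

// Deepgram-specific options are passed alongside the audio stream
const transcript = await voice.listen(audioStream, {
  model: "nova-2",
  language: "en",
});
console.log(transcript);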

Realtime Voice Providers

When using realtime voice providers like OpenAIRealtimeVoice, the listen() method behaves differently:

  • Instead of returning transcribed text, it emits ‘writing’ events with the transcribed text
  • You need to register an event listener to receive the transcription
import { OpenAIRealtimeVoice } from "@mastra/voice-openai-realtime";

const voice = new OpenAIRealtimeVoice();
await voice.connect();

// Register event listener for transcription
voice.on("writing", ({ text, role }) => {
  console.log(`${role}: ${text}`);
});

// This will emit 'writing' events instead of returning text
const microphoneStream = getMicrophoneStream();
await voice.listen(microphoneStream);

Using with CompositeVoice

When using CompositeVoice, the listen() method delegates to the configured listening provider:

import { CompositeVoice } from "@mastra/core/voice";
import { OpenAIVoice } from "@mastra/voice-openai";
import { PlayAIVoice } from "@mastra/voice-playai";

const voice = new CompositeVoice({
  listenProvider: new OpenAIVoice(),
  speakProvider: new PlayAIVoice(),
});

// This will use the OpenAIVoice provider
const transcript = await voice.listen(audioStream);

Notes

  • Not all voice providers support speech-to-text functionality (e.g., PlayAI, Speechify)
  • The behavior of listen() may vary slightly between providers, but all implementations follow the same basic interface
  • When using a realtime voice provider, the method might not return text directly but instead emit a ‘writing’ event
  • The audio format supported depends on the provider. Common formats include MP3, WAV, and M4A
  • Some providers support streaming transcription, where text is returned as it’s transcribed
  • For best performance, consider closing or ending the audio stream when you’re done with it (a short sketch follows)
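As a rough illustration of that last note, one way to make sure the audio stream is released after transcription, using the OpenAI provider from the usage example above:

import { OpenAIVoice } from "@mastra/voice-openai";
import { createReadStream } from "fs";

const voice = new OpenAIVoice({
  listeningModel: {
    name: "whisper-1",
    apiKey: process.env.OPENAI_API_KEY,
  },
});
const audioStream = createReadStream("audio.mp3");

try {
  const transcript = await voice.listen(audioStream, { filetype: "mp3" });
  console.log("Transcribed text:", transcript);
} finally {
  // Release the underlying file handle once transcription is done
  audioStream.destroy();
}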