voice.listen()

The listen() method is available on all Mastra voice providers and converts speech to text. It takes an audio stream as input and returns the transcribed text.

Usage Example

import { OpenAIVoice } from "@mastra/voice-openai"; import { createReadStream } from "fs"; import path from "path"; // Initialize a voice provider const voice = new OpenAIVoice({ listeningModel: { name: "whisper-1", apiKey: process.env.OPENAI_API_KEY, }, }); // Basic usage with a file stream const audioFilePath = path.join(process.cwd(), "audio.mp3"); const audioStream = createReadStream(audioFilePath); const transcript = await voice.listen(audioStream, { filetype: "mp3", }); console.log("Transcribed text:", transcript); // Using a microphone stream const microphoneStream = getMicrophoneStream(); // Assume this function gets audio input const transcription = await voice.listen(microphoneStream); // With provider-specific options const transcriptWithOptions = await voice.listen(audioStream, { language: "en", prompt: "This is a conversation about artificial intelligence.", });

Parameters

audioStream: NodeJS.ReadableStream
  Audio stream to transcribe. This can be a file stream or a microphone stream.

options?: object
  Provider-specific options for speech recognition.

Return Value

Returns one of the following (a short handling sketch follows the list):

  • Promise<string>: A promise that resolves to the transcribed text
  • Promise<NodeJS.ReadableStream>: A promise that resolves to a stream of transcribed text (for streaming transcription)
  • Promise<void>: For real-time providers that emit ‘writing’ events instead of returning text directly
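Because the resolved value differs by provider, code that needs to work across providers has to branch on the shape it gets back. A rough sketch (the voice parameter stands for any configured provider):

// Sketch: handle all three possible results of listen()
async function transcribe(
  voice: {
    listen(
      stream: NodeJS.ReadableStream,
      opts?: object,
    ): Promise<string | NodeJS.ReadableStream | void>;
  },
  audioStream: NodeJS.ReadableStream,
) {
  const result = await voice.listen(audioStream);

  if (typeof result === "string") {
    // Batch providers resolve with the full transcript
    console.log("Transcript:", result);
  } else if (result) {
    // Streaming providers resolve with a readable stream of text chunks
    result.on("data", (chunk) => process.stdout.write(chunk.toString()));
  } else {
    // Realtime providers resolve with void and emit 'writing' events instead
    // (see "Realtime Voice Providers" below)
  }
}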

Provider-Specific Options

Each voice provider may support additional options specific to their implementation. Here are some examples:

OpenAI

options.filetype?: string = 'mp3'
  Audio file format (e.g., 'mp3', 'wav', 'm4a')

options.prompt?: string
  Text to guide the model's transcription

options.language?: string
  Language code (e.g., 'en', 'fr', 'de')

Google

options.stream?: boolean = false
  Whether to use streaming recognition

options.config?: object = { encoding: 'LINEAR16', languageCode: 'en-US' }
  Recognition configuration from the Google Cloud Speech-to-Text API
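For example, streaming recognition with the Google provider might look like the sketch below. It assumes a GoogleVoice class exported from @mastra/voice-google; check that package for the exact import and defaults:

import { GoogleVoice } from "@mastra/voice-google"; // assumed export — verify in the package docs
import { createReadStream } from "fs";

const voice = new GoogleVoice();
const audioStream = createReadStream("audio.wav");

// stream: true requests streaming recognition; config is passed through to
// the Google Cloud Speech-to-Text API
const textStream = await voice.listen(audioStream, {
  stream: true,
  config: { encoding: "LINEAR16", languageCode: "en-US" },
});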

Deepgram

options.model?: string = 'nova-2'
  Deepgram model to use for transcription

options.language?: string = 'en'
  Language code for transcription
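Similarly, a Deepgram transcription call might look like this sketch, assuming a DeepgramVoice class exported from @mastra/voice-deepgram (verify the import against that package):

import { DeepgramVoice } from "@mastra/voice-deepgram"; // assumed export — verify in the package docs
import { createReadStream } from "fs";

const voice = new DeepgramVoice();
const audioStream = createReadStream("audio.mp3");

// Deepgram-specific options are passed alongside the audio stream
const transcript = await voice.listen(audioStream, {
  model: "nova-2",
  language: "en",
});
console.log(transcript);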

Realtime Voice Providers

When using realtime voice providers like OpenAIRealtimeVoice, the listen() method behaves differently:

  • Instead of returning transcribed text, it emits ‘writing’ events with the transcribed text
  • You need to register an event listener to receive the transcription
import { OpenAIRealtimeVoice } from "@mastra/voice-openai-realtime";

const voice = new OpenAIRealtimeVoice();
await voice.connect();

// Register event listener for transcription
voice.on("writing", ({ text, role }) => {
  console.log(`${role}: ${text}`);
});

// This will emit 'writing' events instead of returning text
const microphoneStream = getMicrophoneStream();
await voice.listen(microphoneStream);

Using with CompositeVoice

When using CompositeVoice, the listen() method delegates to the configured listening provider:

import { CompositeVoice } from "@mastra/core/voice";
import { OpenAIVoice } from "@mastra/voice-openai";
import { PlayAIVoice } from "@mastra/voice-playai";

const voice = new CompositeVoice({
  listenProvider: new OpenAIVoice(),
  speakProvider: new PlayAIVoice(),
});

// This will use the OpenAIVoice provider
const transcript = await voice.listen(audioStream);

Notes

  • Not all voice providers support speech-to-text functionality (e.g., PlayAI, Speechify)
  • The behavior of listen() may vary slightly between providers, but all implementations follow the same basic interface
  • When using a realtime voice provider, the method might not return text directly but instead emit a ‘writing’ event
  • The audio format supported depends on the provider. Common formats include MP3, WAV, and M4A
  • Some providers support streaming transcription, where text is returned as it’s transcribed
  • For best performance, consider closing or ending the audio stream when you’re done with it (a short sketch follows)
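As a rough illustration of that last note, one way to make sure the audio stream is released after transcription, using the OpenAI provider from the usage example above:

import { OpenAIVoice } from "@mastra/voice-openai";
import { createReadStream } from "fs";

const voice = new OpenAIVoice({
  listeningModel: {
    name: "whisper-1",
    apiKey: process.env.OPENAI_API_KEY,
  },
});
const audioStream = createReadStream("audio.mp3");

try {
  const transcript = await voice.listen(audioStream, { filetype: "mp3" });
  console.log("Transcribed text:", transcript);
} finally {
  // Release the underlying file handle once transcription is done
  audioStream.destroy();
}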