voice.listen()

The listen() method is a core method available on Mastra voice providers that support speech-to-text. It takes an audio stream as input and returns the transcribed text.

Usage Example

import { OpenAIVoice } from "@mastra/voice-openai";
import { getMicrophoneStream } from "@mastra/node-audio";
import { createReadStream } from "fs";
import path from "path";

// Initialize a voice provider
const voice = new OpenAIVoice({
  listeningModel: {
    name: "whisper-1",
    apiKey: process.env.OPENAI_API_KEY,
  },
});

// Basic usage with a file stream
const audioFilePath = path.join(process.cwd(), "audio.mp3");
const audioStream = createReadStream(audioFilePath);
const transcript = await voice.listen(audioStream, {
  filetype: "mp3",
});
console.log("Transcribed text:", transcript);

// Using a microphone stream
const microphoneStream = getMicrophoneStream(); // Assume this function gets audio input
const transcription = await voice.listen(microphoneStream);

// With provider-specific options
const transcriptWithOptions = await voice.listen(audioStream, {
  language: "en",
  prompt: "This is a conversation about artificial intelligence.",
});

Parameters

audioStream: NodeJS.ReadableStream
Audio stream to transcribe. This can be a file stream or a microphone stream.

options?: object
Provider-specific options for speech recognition

Return Value

Returns one of the following:

  • Promise<string>: A promise that resolves to the transcribed text
  • Promise<NodeJS.ReadableStream>: A promise that resolves to a stream of transcribed text (for streaming transcription)
  • Promise<void>: For real-time providers that emit 'writing' events instead of returning text directly

Provider-Specific Options

Each voice provider may support additional options specific to their implementation. Here are some examples:

OpenAI

options.filetype?: string = 'mp3'
Audio file format (e.g., 'mp3', 'wav', 'm4a')

options.prompt?: string
Text to guide the model's transcription

options.language?: string
Language code (e.g., 'en', 'fr', 'de')

Google

options.stream?: boolean = false
Whether to use streaming recognition

options.config?: object = { encoding: 'LINEAR16', languageCode: 'en-US' }
Recognition configuration from the Google Cloud Speech-to-Text API
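
As a sketch of how these options might be assembled, the `config` object follows Google Cloud Speech-to-Text's RecognitionConfig shape; the `sampleRateHertz` field here is an illustrative extra, not a Mastra default:

```typescript
// Recognition options for the Google provider. `encoding` and `languageCode`
// match the documented defaults; `sampleRateHertz` is an illustrative
// addition from Google's RecognitionConfig, not a Mastra default.
const googleListenOptions = {
  stream: false,
  config: {
    encoding: "LINEAR16",
    languageCode: "en-US",
    sampleRateHertz: 16000,
  },
};

// const transcript = await voice.listen(audioStream, googleListenOptions);
```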

Deepgram

options.model?: string = 'nova-2'
Deepgram model to use for transcription

options.language?: string = 'en'
Language code for transcription

Realtime Voice Providers

When using realtime voice providers like OpenAIRealtimeVoice, the listen() method behaves differently:

  • Instead of returning transcribed text, it emits 'writing' events with the transcribed text
  • You need to register an event listener to receive the transcription

import { OpenAIRealtimeVoice } from "@mastra/voice-openai-realtime";
import { getMicrophoneStream } from "@mastra/node-audio";

const voice = new OpenAIRealtimeVoice();
await voice.connect();

// Register event listener for transcription
voice.on("writing", ({ text, role }) => {
  console.log(`${role}: ${text}`);
});

// This will emit 'writing' events instead of returning text
const microphoneStream = getMicrophoneStream();
await voice.listen(microphoneStream);

Using with CompositeVoice

When using CompositeVoice, the listen() method delegates to the configured listening provider:

import { CompositeVoice } from "@mastra/core/voice";
import { OpenAIVoice } from "@mastra/voice-openai";
import { PlayAIVoice } from "@mastra/voice-playai";

const voice = new CompositeVoice({
  listenProvider: new OpenAIVoice(),
  speakProvider: new PlayAIVoice(),
});

// This will use the OpenAIVoice provider
const transcript = await voice.listen(audioStream);

Notes

  • Not all voice providers support speech-to-text functionality (e.g., PlayAI, Speechify)
  • The behavior of listen() may vary slightly between providers, but all implementations follow the same basic interface
  • When using a realtime voice provider, the method might not return text directly but instead emit a 'writing' event
  • The audio format supported depends on the provider. Common formats include MP3, WAV, and M4A
  • Some providers support streaming transcription, where text is returned as it's transcribed
  • For best performance, consider closing or ending the audio stream when you're done with it
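
The last point can be sketched with a small wrapper; listenAndClose is a hypothetical helper, not part of Mastra, that guarantees the audio stream is destroyed once transcription settles:

```typescript
import { Readable } from "stream";

// Hypothetical helper: run a listen-style function and always release the
// underlying audio resource (file handle, microphone) afterwards, even if
// transcription fails.
async function listenAndClose(
  listen: (audio: Readable) => Promise<string>,
  audio: Readable,
): Promise<string> {
  try {
    return await listen(audio);
  } finally {
    audio.destroy(); // close the stream even on error
  }
}
```

Usage would look like `await listenAndClose((s) => voice.listen(s), createReadStream(audioFilePath))`.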