
Speech-to-Text (STT)

Speech-to-Text (STT) in Mastra provides a standardized interface for converting audio input into text across multiple service providers. This section covers STT configuration and usage. Check out the Adding Voice to Agents documentation to learn how to use STT in an agent.

Speech Configuration

To use STT in Mastra, you need to provide a listeningModel configuration when initializing the voice provider. This configuration includes parameters such as:

  • name: The specific STT model to use.
  • apiKey: Your API key for authentication.
  • Provider-specific options: Additional options that may be required or supported by the specific voice provider.

Note: All of these parameters are optional. If you omit them, the voice provider falls back to its own defaults, which vary by provider.

Example Configuration

import { OpenAIVoice } from "@mastra/voice-openai";

const voice = new OpenAIVoice({
  listeningModel: {
    name: "whisper-1",
    apiKey: process.env.OPENAI_API_KEY,
  },
});
 
// If using the default settings, the configuration can be simplified to:
const defaultVoice = new OpenAIVoice();

Using the Listen Method

The primary method for STT is the listen() method, which converts spoken audio into text. Here’s how to use it:

const audioStream = getMicrophoneStream(); // Assume this function gets audio input
const transcript = await voice.listen(audioStream, {
  filetype: "m4a", // Optional: specify the audio file type
});

Note: If you are using a voice-to-voice provider, such as OpenAIRealtimeVoice, the listen() method will emit a “writing” event instead of returning a transcript directly.