
Speech-to-Text (STT)

Speech-to-Text (STT) in Mastra provides a standardized interface for converting audio input into text across multiple service providers. STT helps create voice-enabled applications that can respond to human speech, enabling hands-free interaction, accessibility for users with disabilities, and more natural human-computer interfaces.

Configuration

To use STT in Mastra, provide a listeningModel when initializing the voice provider. This configuration object includes parameters such as:

  • name: The specific STT model to use.
  • apiKey: Your API key for authentication.
  • Provider-specific options: Additional options that may be required or supported by the chosen voice provider.

Note: All of these parameters are optional. If omitted, the voice provider's default settings are used; the defaults vary by provider.

const voice = new OpenAIVoice({
  listeningModel: {
    name: "whisper-1",
    apiKey: process.env.OPENAI_API_KEY,
  },
});

// If using default settings the configuration can be simplified to:
const voice = new OpenAIVoice();

Available Providers

Mastra supports several Speech-to-Text providers, each with its own capabilities and strengths:

  • OpenAI - High-accuracy transcription with Whisper models
  • Azure - Microsoft’s speech recognition with enterprise-grade reliability
  • ElevenLabs - Advanced speech recognition with support for multiple languages
  • Google - Google’s speech recognition with extensive language support
  • Cloudflare - Edge-optimized speech recognition for low-latency applications
  • Deepgram - AI-powered speech recognition with high accuracy for various accents
  • Sarvam - Specialized in Indic languages and accents

Each provider is implemented as a separate package that you can install as needed:

pnpm add @mastra/voice-openai # Example for OpenAI
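
Because all providers implement the same voice interface, switching providers only changes the import and the constructor. As a minimal sketch, assuming the @mastra/voice-deepgram package exports a DeepgramVoice class that accepts the same listeningModel shape (the model name shown is illustrative):

import { DeepgramVoice } from "@mastra/voice-deepgram";

const voice = new DeepgramVoice({
  listeningModel: {
    name: "nova-2", // illustrative Deepgram model name
    apiKey: process.env.DEEPGRAM_API_KEY,
  },
});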

Using the Listen Method

The primary method for STT is the listen() method, which converts spoken audio into text. Here’s how to use it:

import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
import { OpenAIVoice } from "@mastra/voice-openai";
import { getMicrophoneStream } from "@mastra/node-audio";

const voice = new OpenAIVoice();

const agent = new Agent({
  name: "Voice Agent",
  instructions: "You are a voice assistant that provides recommendations based on user input.",
  model: openai("gpt-4o"),
  voice,
});

const audioStream = getMicrophoneStream(); // Assume this function gets audio input

const transcript = await agent.voice.listen(audioStream, {
  filetype: "m4a", // Optional: specify the audio file type
});
console.log(`User said: ${transcript}`);

const { text } = await agent.generate(
  `Based on what the user said, provide them a recommendation: ${transcript}`,
);
console.log(`Recommendation: ${text}`);
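
To transcribe a pre-recorded file instead of live microphone input, you can pass a Node.js readable stream to listen(). A minimal sketch, assuming your provider accepts a standard readable audio stream and using a hypothetical file path:

import { createReadStream } from "fs";
import path from "path";

// Stream a pre-recorded audio file from disk (hypothetical path)
const audioFile = createReadStream(path.join(process.cwd(), "audio", "recording.m4a"));

const transcript = await agent.voice.listen(audioFile, {
  filetype: "m4a",
});
console.log(`Transcript: ${transcript}`);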

Check out the Adding Voice to Agents documentation to learn how to use STT in an agent.