Building Text-to-Speech Applications with Mastra
We recently shipped a TTS module that integrates with OpenAI and ElevenLabs speech models.
Let's explore how to use it.
Basic Setup
First, install the required package:
```bash
npm install @mastra/tts
```

Configure your environment:
```
OPENAI_API_KEY=your_api_key_here
```

Basic TTS Usage
Initialize the TTS client:
```ts
import { OpenAITTS } from "@mastra/tts";

const tts = new OpenAITTS({
  model: {
    name: "tts-1",
  },
});
```
Voice Selection
OpenAI provides several voices to choose from:
```ts
const voices = await tts.voices();
// Available voices: alloy, echo, fable, onyx, nova, shimmer
```
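That list makes it easy to validate a voice before generating. A hypothetical helper, assuming `voices()` resolves to an array of voice ID strings (check the actual return shape in your version):

```ts
// Hypothetical helper (not part of @mastra/tts): fall back to a default
// when the requested voice isn't available. Assumes voices() resolves to
// an array of voice ID strings.
async function pickVoice(requested: string, fallback = "alloy") {
  const available = await tts.voices();
  return available.includes(requested) ? requested : fallback;
}

const voice = await pickVoice("nova");
```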
Generate Speech
Generate audio with your chosen voice:
```ts
const { audioResult } = await tts.generate({
  text: "Hello, world!",
  voice: "nova",
  speed: 1.0,
});
```
Streaming Audio
For real-time audio streaming:
```ts
const { audioResult } = await tts.stream({
  text: "This is a streaming response",
  voice: "alloy",
  speed: 1.2,
});

// audioResult is a PassThrough stream
```
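Since `audioResult` is a PassThrough stream, you can pipe it straight to an HTTP response and start sending audio before generation finishes. A sketch (the `audio/mpeg` content type is an assumption; use whatever format your model emits):

```ts
import http from "node:http";

http
  .createServer(async (req, res) => {
    const { audioResult } = await tts.stream({
      text: "This is a streaming response",
      voice: "alloy",
    });
    // Content type is an assumption; match it to your model's output format.
    res.writeHead(200, { "Content-Type": "audio/mpeg" });
    audioResult.pipe(res); // bytes reach the client as they're generated
  })
  .listen(3000);
```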
Error Handling and Telemetry
The TTS system includes built-in telemetry and error tracing, so you can use your favorite tracing tools to get visibility into your TTS usage.
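Even so, you'll usually want to handle failures at the call site. A minimal sketch:

```ts
try {
  const { audioResult } = await tts.generate({
    text: "Hello, world!",
    voice: "nova",
  });
  // ...use audioResult
} catch (err) {
  // Failures (rate limits, bad voice names, network errors) surface here,
  // and the built-in tracing records them as well.
  console.error("TTS generation failed:", err);
  throw err;
}
```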
Usage with Mastra
Integrate TTS with your Mastra application:
```ts
import { Mastra } from "@mastra/core";
import { OpenAITTS } from "@mastra/tts";

const tts = new OpenAITTS({
  model: {
    name: "tts-1",
  },
});

const mastra = new Mastra({
  tts,
});

// Generate speech
const audio = await mastra.tts.generate({
  text: "Welcome to Mastra",
  voice: "nova",
});
```

The Mastra TTS system provides type-safe speech generation with telemetry and error handling. Start with basic generation and add streaming as needed.
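When you do add streaming, the same instance should work; a sketch, assuming `mastra.tts` exposes the standalone client's `stream()` method:

```ts
// Assumes mastra.tts mirrors the standalone OpenAITTS client's API.
const { audioResult } = await mastra.tts.stream({
  text: "Welcome to Mastra",
  voice: "nova",
});
```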
Next Steps: Exposing TTS to Agents
One thing we're thinking about is how to expose TTS to agents.
Our current thinking is to optionally let agents be configured with a TTS model. `agent.tts.generate()` and `agent.tts.stream()` would then be available, along with `/agents/$AGENT_ID/tts/generate` and `/agents/$AGENT_ID/tts/stream` endpoints.
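To make the proposal concrete, here's a purely hypothetical sketch; neither the `tts` option on `Agent` nor `agent.tts.*` exists yet:

```ts
// Hypothetical sketch of the proposal above. None of this is shipped:
// the `tts` option on Agent and agent.tts.* are proposed, not real APIs.
import { Agent } from "@mastra/core";
import { OpenAITTS } from "@mastra/tts";

const agent = new Agent({
  // ...usual agent configuration omitted
  tts: new OpenAITTS({ model: { name: "tts-1" } }),
});

const { audioResult } = await agent.tts.generate({
  text: "Here's what I found.",
  voice: "nova",
});
```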
Some other questions:
- How should we expose this functionality in the mastra devUI? We figured we'd embed a sound clip in the chat UI for agents that have a TTS model configured.
- How should we expose this functionality in agent memory? We figured we'd probably add a new `tts` field to items in agent memory and store the TTS model name there.
