Building Text-to-Speech Applications with Mastra
We recently shipped a TTS module that integrates with OpenAI and ElevenLabs speech models.
Let's explore how to use it.
Basic Setup
First, install the required package:
```bash
npm install @mastra/tts
```
Configure your environment:
```bash
OPENAI_API_KEY=your_api_key_here
```
Basic TTS Usage
Initialize the TTS client:
```typescript
import { OpenAITTS } from "@mastra/tts";

const tts = new OpenAITTS({
  model: {
    name: "tts-1",
  },
});
```
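As mentioned above, the module also integrates with ElevenLabs. Assuming the package exposes a parallel client class with the same constructor shape (the ElevenLabsTTS name and model name below are guesses, so check the package exports), setup would look similar:

```typescript
import { ElevenLabsTTS } from "@mastra/tts"; // assumed export; verify against the package

// Hypothetical: assumes the ElevenLabs client mirrors the OpenAITTS constructor
const elevenLabsTts = new ElevenLabsTTS({
  model: {
    name: "eleven_multilingual_v2", // assumed model name
  },
});
```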
Voice Selection
OpenAI provides several voices to choose from:
```typescript
const voices = await tts.voices();
// Available voices: alloy, echo, fable, onyx, nova, shimmer
```
Generate Speech
Generate audio with your chosen voice:
```typescript
const { audioResult } = await tts.generate({
  text: "Hello, world!",
  voice: "nova",
  speed: 1.0,
});
```
Streaming Audio
For real-time audio streaming:
```typescript
const { audioResult } = await tts.stream({
  text: "This is a streaming response",
  voice: "alloy",
  speed: 1.2,
});

// audioResult is a PassThrough stream
```
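Because the result is a standard Node.js stream, you can pipe it anywhere streams go. Here's a minimal sketch that writes the audio to disk (the output.mp3 filename and format are assumptions; check what format your configured model returns):

```typescript
import { createWriteStream } from "node:fs";

const { audioResult } = await tts.stream({
  text: "This is a streaming response",
  voice: "alloy",
});

// The PassThrough stream is a standard Readable, so it pipes straight to a file
audioResult.pipe(createWriteStream("output.mp3"));
```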
Error Handling and Telemetry
The TTS system includes built-in telemetry and error tracing, so you can use your favorite tracing tools to get visibility into your TTS usage.
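At the application level, generate and stream are ordinary async calls, so failures surface as rejected promises. A minimal handling sketch (the error here is treated as a generic Error; the module may attach richer tracing metadata):

```typescript
try {
  const { audioResult } = await tts.generate({
    text: "Hello, world!",
    voice: "nova",
  });
} catch (err) {
  // Failed requests (bad API key, rate limits, invalid voice) reject the promise
  console.error("TTS generation failed:", err);
}
```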
Usage with Mastra
Integrate TTS with your Mastra application:
```typescript
import { Mastra } from "@mastra/core";
import { OpenAITTS } from "@mastra/tts";

const tts = new OpenAITTS({
  model: {
    name: "tts-1",
  },
});

const mastra = new Mastra({
  tts,
});

// Generate speech
const audio = await mastra.tts.generate({
  text: "Welcome to Mastra",
  voice: "nova",
});
```
The Mastra TTS system provides type-safe speech generation with telemetry and error handling. Start with basic generation and add streaming as needed.
Next Steps: Exposing TTS to Agents
One thing we're thinking about is how to expose TTS to agents. Currently, our thought is to optionally let agents be configured with a TTS model; agent.tts.generate() and agent.tts.stream() would then be available, as well as /agents/$AGENT_ID/tts/generate and /agents/$AGENT_ID/tts/stream endpoints.
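To make the proposal concrete, here's a purely hypothetical sketch. None of this is shipped; the tts agent option and agent.tts.* methods shown below are the proposal itself, not an existing API:

```typescript
import { Agent } from "@mastra/core";
import { OpenAITTS } from "@mastra/tts";

// Hypothetical: an agent configured with an optional TTS model (proposed, not shipped)
const agent = new Agent({
  name: "narrator",
  instructions: "You narrate short stories.",
  tts: new OpenAITTS({ model: { name: "tts-1" } }),
});

// Proposed methods, mirroring the standalone TTS client
const { audioResult } = await agent.tts.generate({
  text: "Once upon a time...",
  voice: "nova",
});
```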
Some other questions:
- How should we expose this functionality in the mastra dev UI? We figured we would embed a sound clip in the chat UI for agents that have a TTS model configured.
- How should we expose this functionality in agent memory? We figured we would probably add a new tts field to items in agent memory, and then we could store the TTS model name there (a sketch follows below).
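For illustration only, here's a hypothetical shape for such a memory item. The source only mentions storing the model name; every field name here is a guess at what the proposal might look like:

```typescript
// Hypothetical: a memory item annotated with TTS metadata (proposed, not shipped)
interface MemoryItem {
  role: "user" | "assistant";
  content: string;
  tts?: {
    model: string; // e.g. "tts-1"
    voice?: string; // extra guess; not mentioned in the proposal
  };
}
```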