# Building Text-to-Speech Applications with Mastra
We recently shipped a TTS module that integrates with OpenAI and ElevenLabs speech models.
Let's explore how to use it.
## Basic Setup
First, install the required package:
```bash
npm install @mastra/tts
```
Configure your environment:
```bash
OPENAI_API_KEY=your_api_key_here
```
## Basic TTS Usage
Initialize the TTS client:
```typescript
import { OpenAITTS } from "@mastra/tts";

const tts = new OpenAITTS({
  model: {
    name: "tts-1",
  },
});
```
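The module also integrates with ElevenLabs. Assuming the ElevenLabs client mirrors the OpenAI constructor (the `ElevenLabsTTS` class name and model name below are assumptions, so check the package exports), setup would look similar:

```typescript
import { ElevenLabsTTS } from "@mastra/tts";

// Assumed to mirror OpenAITTS; the class name and model name are illustrative.
const tts = new ElevenLabsTTS({
  model: {
    name: "eleven_multilingual_v2",
  },
});
```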
## Voice Selection
OpenAI provides several voices to choose from:
```typescript
const voices = await tts.voices();
// Available voices: alloy, echo, fable, onyx, nova, shimmer
```
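If you accept a voice name from user input, you can validate it against that list before generating. A minimal sketch, assuming `voices()` resolves to an array of plain voice-name strings (the exact return shape may differ):

```typescript
const voices = await tts.voices();

// Fall back to a default if the requested voice isn't available.
// Assumes voices() returns plain name strings like "alloy" or "nova".
function pickVoice(requested: string, fallback = "alloy"): string {
  return voices.includes(requested) ? requested : fallback;
}
```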
## Generate Speech
Generate audio with your chosen voice:
```typescript
const { audioResult } = await tts.generate({
  text: "Hello, world!",
  voice: "nova",
  speed: 1.0,
});
```
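To persist the output, you can pipe `audioResult` to disk. A minimal sketch, assuming `audioResult` from `generate()` is a Node.js readable stream (we only note this explicitly for `stream()` below; the filename and format are arbitrary):

```typescript
import { createWriteStream } from "node:fs";
import { pipeline } from "node:stream/promises";

// Assumes audioResult is a readable stream of encoded audio (e.g. mp3).
await pipeline(audioResult, createWriteStream("hello.mp3"));
```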
## Streaming Audio
For real-time audio streaming:
```typescript
const { audioResult } = await tts.stream({
  text: "This is a streaming response",
  voice: "alloy",
  speed: 1.2,
});

// audioResult is a PassThrough stream
```
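Because `audioResult` is a standard Node.js `PassThrough` stream, it pipes anywhere a readable stream is accepted. For example, here's a sketch of streaming speech straight to an HTTP client (the Express server and `/speak` route are illustrative, not part of `@mastra/tts`):

```typescript
import express from "express";
import { OpenAITTS } from "@mastra/tts";

const app = express();
const tts = new OpenAITTS({ model: { name: "tts-1" } });

app.get("/speak", async (req, res) => {
  const { audioResult } = await tts.stream({
    text: String(req.query.text ?? "Hello from Mastra"),
    voice: "alloy",
  });

  // Content type assumes mp3 output; adjust to match your model's format.
  res.setHeader("Content-Type", "audio/mpeg");
  audioResult.pipe(res);
});

app.listen(3000);
```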
## Error Handling and Telemetry
The TTS system includes built-in telemetry and error tracing, so you can use your favorite tracing tools to get visibility into your TTS usage.
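In application code you'll typically pair the tracing with a standard try/catch. Assuming failures surface as thrown errors (the usual pattern for async client libraries), a sketch:

```typescript
try {
  const { audioResult } = await tts.generate({
    text: "Hello, world!",
    voice: "nova",
  });
  // ...use audioResult...
} catch (err) {
  // The same failure also shows up in your configured tracing backend.
  console.error("TTS generation failed:", err);
}
```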
## Usage with Mastra
Integrate TTS with your Mastra application:
```typescript
import { Mastra } from "@mastra/core";
import { OpenAITTS } from "@mastra/tts";

const tts = new OpenAITTS({
  model: {
    name: "tts-1",
  },
});

const mastra = new Mastra({
  tts,
});

// Generate speech
const audio = await mastra.tts.generate({
  text: "Welcome to Mastra",
  voice: "nova",
});
```
The Mastra TTS system provides type-safe speech generation with telemetry and error handling. Start with basic generation and add streaming as needed.
## Next Steps: Exposing TTS to Agents
One thing we're thinking about is how to expose TTS to agents. Our current thought is to optionally let agents be configured with a TTS model; `agent.tts.generate()` and `agent.tts.stream()` would then be available, along with `/agents/$AGENT_ID/tts/generate` and `/agents/$AGENT_ID/tts/stream` endpoints.
Some other questions:
- How should we expose this functionality in the `mastra dev` UI? We figured we would embed a sound clip in the chat UI for agents that have a TTS model configured.
- How should we expose this functionality in agent memory? We figured we would probably add a new `tts` field to items in agent memory, and then store the TTS model name there.