# Azure
The AzureVoice class in Mastra provides text-to-speech and speech-to-text capabilities using Microsoft Azure Cognitive Services.
## Usage Example
```typescript
import { AzureVoice } from "@mastra/voice-azure";

// Initialize with configuration
const voice = new AzureVoice({
  speechModel: {
    name: "neural",
    apiKey: "your-azure-speech-api-key",
    region: "eastus",
  },
  listeningModel: {
    name: "whisper",
    apiKey: "your-azure-speech-api-key",
    region: "eastus",
  },
  speaker: "en-US-JennyNeural", // Default voice
});

// Convert text to speech
const audioStream = await voice.speak("Hello, how can I help you?", {
  speaker: "en-US-GuyNeural", // Override default voice
  style: "cheerful", // Voice style
});

// Convert speech to text
const text = await voice.listen(audioStream, {
  filetype: "wav",
  language: "en-US",
});
```
## Configuration

### Constructor Options
- **`speechModel?`** (`AzureSpeechConfig`): Configuration for text-to-speech synthesis.
- **`listeningModel?`** (`AzureSpeechConfig`): Configuration for speech-to-text recognition.
- **`speaker?`** (`string`): Default voice ID for speech synthesis.
### AzureSpeechConfig

- **`name?`** (`'neural' | 'standard' | 'whisper'`): Model type to use: `'neural'` for TTS, `'whisper'` for STT.
- **`apiKey?`** (`string`): Azure Speech Services API key. Falls back to the `AZURE_SPEECH_KEY` environment variable.
- **`region?`** (`string`): Azure region (e.g., `'eastus'`, `'westeurope'`). Falls back to the `AZURE_SPEECH_REGION` environment variable.
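Both `apiKey` and `region` fall back to environment variables when omitted. The resolution order can be sketched as follows (`resolveAzureConfig` is a hypothetical helper illustrating the documented fallback behavior, not the library's internal code):

```typescript
interface AzureSpeechConfig {
  name?: "neural" | "standard" | "whisper";
  apiKey?: string;
  region?: string;
}

// Hypothetical helper: explicit config values win, then the
// AZURE_SPEECH_KEY / AZURE_SPEECH_REGION environment variables.
function resolveAzureConfig(config: AzureSpeechConfig = {}): AzureSpeechConfig {
  return {
    ...config,
    apiKey: config.apiKey ?? process.env.AZURE_SPEECH_KEY,
    region: config.region ?? process.env.AZURE_SPEECH_REGION,
  };
}
```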
## Methods

### speak()

Converts text to speech using Azure’s neural text-to-speech service.
- **`input`** (`string | NodeJS.ReadableStream`): Text or text stream to convert to speech.
- **`options.speaker?`** (`string`, default: the constructor's `speaker` value): Voice ID to use for speech synthesis.
- **`options.style?`** (`string`): Speaking style (e.g., `'cheerful'`, `'sad'`, `'angry'`).
- **`options.rate?`** (`string`): Speaking rate (e.g., `'slow'`, `'medium'`, `'fast'`).
- **`options.pitch?`** (`string`): Voice pitch (e.g., `'low'`, `'medium'`, `'high'`).

Returns: `Promise<NodeJS.ReadableStream>`
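Because `speak()` resolves to a Node readable stream, the audio usually needs to be collected or piped somewhere before use. A generic helper like the one below (not part of the package) gathers a stream into a single `Buffer` that can then be written to a file; the example uses a synthetic stream, but a stream returned by `voice.speak()` can be collected the same way:

```typescript
import { Readable } from "node:stream";

// Collect any Node readable stream into a single Buffer.
async function streamToBuffer(stream: NodeJS.ReadableStream): Promise<Buffer> {
  const chunks: Buffer[] = [];
  for await (const chunk of stream) {
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk as string));
  }
  return Buffer.concat(chunks);
}

// Demonstration with a synthetic stream (no Azure credentials required):
const audio = await streamToBuffer(Readable.from([Buffer.from("fake-audio")]));
```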
### listen()

Transcribes audio using Azure’s speech-to-text service.
- **`audioStream`** (`NodeJS.ReadableStream`): Audio stream to transcribe.
- **`options.filetype?`** (`string`, default: `'wav'`): Audio format of the input stream.
- **`options.language?`** (`string`, default: `'en-US'`): Language code for transcription.

Returns: `Promise<string>`
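A common pattern is transcribing a local audio file by streaming it into `listen()`. The wrapper below is a sketch (`transcribeFile` is a hypothetical name); it assumes an initialized `AzureVoice` instance with valid credentials, and types the `voice` parameter structurally so the sketch stays self-contained:

```typescript
import fs from "node:fs";

// Stream a local audio file into listen() and return the transcript.
// The structural type below mirrors the listen() signature documented above.
async function transcribeFile(
  voice: {
    listen: (
      audio: NodeJS.ReadableStream,
      options?: { filetype?: string; language?: string },
    ) => Promise<string>;
  },
  path: string,
): Promise<string> {
  return voice.listen(fs.createReadStream(path), {
    filetype: "wav",
    language: "en-US",
  });
}
```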
### getSpeakers()

Returns an array of available voice options, where each entry contains:
- **`voiceId`** (`string`): Unique identifier for the voice (e.g., `'en-US-JennyNeural'`).
- **`name`** (`string`): Human-readable name of the voice.
- **`locale`** (`string`): Language locale of the voice (e.g., `'en-US'`).
- **`gender`** (`string`): Gender of the voice (`'Male'` or `'Female'`).
- **`styles?`** (`string[]`): Available speaking styles for the voice.
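The full voice list is long, so filtering by locale or style support is common. A small sketch (the `VoiceInfo` shape mirrors the fields above; `filterVoices` is an illustrative helper, not a package export):

```typescript
interface VoiceInfo {
  voiceId: string;
  name: string;
  locale: string;
  gender: string;
  styles?: string[];
}

// Keep only voices matching a locale, optionally requiring a speaking style.
function filterVoices(
  voices: VoiceInfo[],
  locale: string,
  style?: string,
): VoiceInfo[] {
  return voices.filter(
    (v) => v.locale === locale && (!style || (v.styles ?? []).includes(style)),
  );
}
```

With a live instance this could be applied as `filterVoices(await voice.getSpeakers(), "en-US", "cheerful")`.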
## Notes
- API keys can be provided via constructor options or the `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` environment variables
- Azure offers a wide range of high-quality, natural-sounding neural voices across many languages
- Some voices support speaking styles such as `cheerful`, `sad`, and `angry`
- Speech recognition supports multiple audio formats and languages