# Azure
The AzureVoice class in Mastra provides text-to-speech and speech-to-text capabilities using Microsoft Azure Cognitive Services.
## Usage Example

```typescript
import { AzureVoice } from '@mastra/voice-azure';

// Initialize with configuration
const voice = new AzureVoice({
  speechModel: {
    name: 'neural',
    apiKey: 'your-azure-speech-api-key',
    region: 'eastus',
  },
  listeningModel: {
    name: 'whisper',
    apiKey: 'your-azure-speech-api-key',
    region: 'eastus',
  },
  speaker: 'en-US-JennyNeural', // Default voice
});

// Convert text to speech
const audioStream = await voice.speak('Hello, how can I help you?', {
  speaker: 'en-US-GuyNeural', // Override default voice
  style: 'cheerful', // Voice style
});

// Convert speech to text
const text = await voice.listen(audioStream, {
  filetype: 'wav',
  language: 'en-US',
});
```
## Configuration

### Constructor Options

| Option | Type | Description |
| --- | --- | --- |
| `speechModel?` | `AzureSpeechConfig` | Configuration for text-to-speech synthesis. |
| `listeningModel?` | `AzureSpeechConfig` | Configuration for speech-to-text recognition. |
| `speaker?` | `string` | Default voice ID for speech synthesis. |
### AzureSpeechConfig

| Field | Type | Description |
| --- | --- | --- |
| `name?` | `'neural' \| 'standard' \| 'whisper'` | Model type to use: `'neural'` for TTS, `'whisper'` for STT. |
| `apiKey?` | `string` | Azure Speech Services API key. Falls back to the `AZURE_SPEECH_KEY` environment variable. |
| `region?` | `string` | Azure region (e.g., `'eastus'`, `'westeurope'`). Falls back to the `AZURE_SPEECH_REGION` environment variable. |
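Because `apiKey` and `region` fall back to environment variables, the configuration can be kept out of source code entirely. A minimal sketch, assuming `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` are set in the environment (the model names and speaker ID mirror the usage example above):

```typescript
import { AzureVoice } from '@mastra/voice-azure';

// apiKey and region are omitted here; the client reads
// AZURE_SPEECH_KEY and AZURE_SPEECH_REGION from the environment.
const voice = new AzureVoice({
  speechModel: { name: 'neural' },
  listeningModel: { name: 'whisper' },
  speaker: 'en-US-JennyNeural',
});
```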
## Methods

### speak()

Converts text to speech using Azure's neural text-to-speech service.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `input` | `string \| NodeJS.ReadableStream` | | Text or text stream to convert to speech. |
| `options.speaker?` | `string` | Constructor's `speaker` value | Voice ID to use for speech synthesis. |
| `options.style?` | `string` | | Speaking style (e.g., `'cheerful'`, `'sad'`, `'angry'`). |
| `options.rate?` | `string` | | Speaking rate (e.g., `'slow'`, `'medium'`, `'fast'`). |
| `options.pitch?` | `string` | | Voice pitch (e.g., `'low'`, `'medium'`, `'high'`). |

Returns: `Promise<NodeJS.ReadableStream>`
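Since `speak()` resolves to a readable stream, the audio typically needs to be buffered or piped somewhere before use. A sketch of collecting the stream into a `Buffer` and writing it to disk; the `streamToBuffer` helper and the output filename are illustrative, not part of the Mastra API:

```typescript
import { writeFile } from 'node:fs/promises';

// Collect a readable stream into a single Buffer
// (generic Node helper, not part of @mastra/voice-azure).
async function streamToBuffer(stream: NodeJS.ReadableStream): Promise<Buffer> {
  const chunks: Buffer[] = [];
  for await (const chunk of stream) {
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}

// Usage, assuming `voice` is an initialized AzureVoice instance:
// const audioStream = await voice.speak('Hello!', { style: 'cheerful' });
// await writeFile('greeting.wav', await streamToBuffer(audioStream));
```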
### listen()

Transcribes audio using Azure's speech-to-text service.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `audioStream` | `NodeJS.ReadableStream` | | Audio stream to transcribe. |
| `options.filetype?` | `string` | `'wav'` | Audio format of the input stream. |
| `options.language?` | `string` | `'en-US'` | Language code for transcription. |

Returns: `Promise<string>`
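To transcribe a local recording, a file stream can be passed to `listen()`. The `filetypeFromPath` helper below is illustrative (not part of the Mastra API); it simply derives the `filetype` option from the file extension, falling back to the documented `'wav'` default:

```typescript
import { createReadStream } from 'node:fs';
import { extname } from 'node:path';

// Derive the `filetype` option from a file extension (illustrative helper).
function filetypeFromPath(path: string): string {
  const ext = extname(path).slice(1).toLowerCase();
  return ext || 'wav'; // fall back to the documented default
}

// Usage, assuming `voice` is an initialized AzureVoice instance:
// const path = 'meeting.wav';
// const text = await voice.listen(createReadStream(path), {
//   filetype: filetypeFromPath(path),
//   language: 'en-US',
// });
```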
### getSpeakers()

Returns an array of available voice options, where each entry contains:

| Field | Type | Description |
| --- | --- | --- |
| `voiceId` | `string` | Unique identifier for the voice (e.g., `'en-US-JennyNeural'`). |
| `name` | `string` | Human-readable name of the voice. |
| `locale` | `string` | Language locale of the voice (e.g., `'en-US'`). |
| `gender` | `string` | Gender of the voice (`'Male'` or `'Female'`). |
| `styles?` | `string[]` | Available speaking styles for the voice. |
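The returned metadata makes it easy to narrow the list client-side, for example to voices in a given locale that support a particular style. A sketch; the `VoiceInfo` interface mirrors the fields above, and the `voicesWithStyle` helper is illustrative rather than part of the Mastra API:

```typescript
interface VoiceInfo {
  voiceId: string;
  name: string;
  locale: string;
  gender: string;
  styles?: string[];
}

// Pick voices in a locale that support a given speaking style (illustrative helper).
function voicesWithStyle(voices: VoiceInfo[], locale: string, style: string): VoiceInfo[] {
  return voices.filter(
    (v) => v.locale === locale && (v.styles ?? []).includes(style),
  );
}

// Usage, assuming `voice` is an initialized AzureVoice instance:
// const all = await voice.getSpeakers();
// const cheerfulUS = voicesWithStyle(all, 'en-US', 'cheerful');
```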
## Notes

- API keys and regions can be provided via constructor options or the `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` environment variables
- Azure offers a wide range of neural voices across many languages
- Some voices support speaking styles such as cheerful, sad, and angry
- Speech recognition supports multiple audio formats and languages
- Azure's speech services provide high-quality neural voices with natural-sounding speech