
Azure

The AzureVoice class in Mastra provides text-to-speech and speech-to-text capabilities using Microsoft Azure Cognitive Services.

Usage Example

```typescript
import { AzureVoice } from '@mastra/voice-azure';

// Initialize with configuration
const voice = new AzureVoice({
  speechModel: {
    name: 'neural',
    apiKey: 'your-azure-speech-api-key',
    region: 'eastus',
  },
  listeningModel: {
    name: 'whisper',
    apiKey: 'your-azure-speech-api-key',
    region: 'eastus',
  },
  speaker: 'en-US-JennyNeural', // Default voice
});

// Convert text to speech
const audioStream = await voice.speak('Hello, how can I help you?', {
  speaker: 'en-US-GuyNeural', // Override default voice
  style: 'cheerful', // Voice style
});

// Convert speech to text
const text = await voice.listen(audioStream, {
  filetype: 'wav',
  language: 'en-US',
});
```

Configuration

Constructor Options

speechModel?: AzureSpeechConfig
Configuration for text-to-speech synthesis.

listeningModel?: AzureSpeechConfig
Configuration for speech-to-text recognition.

speaker?: string
Default voice ID for speech synthesis.

AzureSpeechConfig

name?: 'neural' | 'standard' | 'whisper'
Model type to use: 'neural' for text-to-speech, 'whisper' for speech-to-text.

apiKey?: string
Azure Speech Services API key. Falls back to the AZURE_SPEECH_KEY environment variable.

region?: string
Azure region (e.g., 'eastus', 'westeurope'). Falls back to the AZURE_SPEECH_REGION environment variable.
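Because both apiKey and region fall back to environment variables, the credential fields can be omitted when AZURE_SPEECH_KEY and AZURE_SPEECH_REGION are set. A minimal sketch of that configuration:

```typescript
import { AzureVoice } from '@mastra/voice-azure';

// Assumes AZURE_SPEECH_KEY and AZURE_SPEECH_REGION are set in the
// environment; apiKey and region then resolve from those variables.
const voice = new AzureVoice({
  speechModel: { name: 'neural' },
  listeningModel: { name: 'whisper' },
  speaker: 'en-US-JennyNeural',
});
```

This keeps credentials out of source code, which is usually preferable in deployed services.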

Methods

speak()

Converts text to speech using Azure’s neural text-to-speech service.

input: string | NodeJS.ReadableStream
Text or text stream to convert to speech.

options.speaker?: string = constructor's speaker value
Voice ID to use for speech synthesis.

options.style?: string
Speaking style (e.g., 'cheerful', 'sad', 'angry').

options.rate?: string
Speaking rate (e.g., 'slow', 'medium', 'fast').

options.pitch?: string
Voice pitch (e.g., 'low', 'medium', 'high').

Returns: Promise<NodeJS.ReadableStream>
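Since speak() resolves to a readable stream, a common pattern is to pipe the audio to disk. A sketch, assuming valid Azure credentials in the environment and a writable local path (the file name here is illustrative):

```typescript
import { createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream/promises';
import { AzureVoice } from '@mastra/voice-azure';

const voice = new AzureVoice({ speaker: 'en-US-JennyNeural' });

// speak() resolves to a readable audio stream; pipeline() handles
// backpressure and closes the file when the stream ends.
const audio = await voice.speak('Your order has been confirmed.', {
  style: 'cheerful',
  rate: 'medium',
});
await pipeline(audio, createWriteStream('confirmation.wav'));
```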

listen()

Transcribes audio using Azure’s speech-to-text service.

audioStream: NodeJS.ReadableStream
Audio stream to transcribe.

options.filetype?: string = 'wav'
Audio format of the input stream.

options.language?: string = 'en-US'
Language code for transcription.

Returns: Promise<string>
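Because listen() accepts any readable stream, a local recording can be transcribed by passing a file stream. A sketch, assuming credentials in the environment and an existing recording (the file name is illustrative):

```typescript
import { createReadStream } from 'node:fs';
import { AzureVoice } from '@mastra/voice-azure';

const voice = new AzureVoice({
  listeningModel: { name: 'whisper' },
});

// Both options shown match their defaults ('wav', 'en-US') and could
// be omitted; they are spelled out here for clarity.
const transcript = await voice.listen(createReadStream('meeting.wav'), {
  filetype: 'wav',
  language: 'en-US',
});
console.log(transcript);
```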

getSpeakers()

Returns an array of available voice options, where each entry contains:

voiceId: string
Unique identifier for the voice (e.g., 'en-US-JennyNeural')

name: string
Human-readable name of the voice

locale: string
Language locale of the voice (e.g., 'en-US')

gender: string
Gender of the voice ('Male' or 'Female')

styles?: string[]
Available speaking styles for the voice

Notes

  • API keys can be provided via constructor options or environment variables (AZURE_SPEECH_KEY and AZURE_SPEECH_REGION)
  • Azure offers a wide range of neural voices across many languages
  • Some voices support speaking styles like cheerful, sad, angry, etc.
  • Speech recognition supports multiple audio formats and languages
  • Azure’s speech services provide high-quality neural voices with natural-sounding speech