Azure

The AzureVoice class in Mastra provides text-to-speech and speech-to-text capabilities using Microsoft Azure Cognitive Services.

Usage Example

```typescript
import { AzureVoice } from "@mastra/voice-azure";

// Initialize with configuration
const voice = new AzureVoice({
  speechModel: {
    name: "neural",
    apiKey: "your-azure-speech-api-key",
    region: "eastus",
  },
  listeningModel: {
    name: "whisper",
    apiKey: "your-azure-speech-api-key",
    region: "eastus",
  },
  speaker: "en-US-JennyNeural", // Default voice
});

// Convert text to speech
const audioStream = await voice.speak("Hello, how can I help you?", {
  speaker: "en-US-GuyNeural", // Override default voice
  style: "cheerful", // Voice style
});

// Convert speech to text
const text = await voice.listen(audioStream, {
  filetype: "wav",
  language: "en-US",
});
```

Configuration

Constructor Options

speechModel?: AzureSpeechConfig
Configuration for text-to-speech synthesis.

listeningModel?: AzureSpeechConfig
Configuration for speech-to-text recognition.

speaker?: string
Default voice ID for speech synthesis.

AzureSpeechConfig

name?: 'neural' | 'standard' | 'whisper'
Model type to use: 'neural' or 'standard' for text-to-speech, 'whisper' for speech-to-text.

apiKey?: string
Azure Speech Services API key. Falls back to the AZURE_SPEECH_KEY environment variable.

region?: string
Azure region (e.g., 'eastus', 'westeurope'). Falls back to the AZURE_SPEECH_REGION environment variable.
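The apiKey/region fallback described above can be mirrored in application code when you want to fail fast on missing credentials. A minimal sketch, assuming nothing beyond the documented fallback order (resolveCredential and its error message are illustrative helpers, not part of the Mastra API):

```typescript
// Mirrors the documented fallback order: an explicit option wins,
// otherwise the named environment variable is used.
function resolveCredential(explicit: string | undefined, envVar: string): string {
  const value = explicit ?? process.env[envVar];
  if (!value) {
    throw new Error(`Missing credential: pass it explicitly or set ${envVar}`);
  }
  return value;
}

// Example: resolveCredential(undefined, "AZURE_SPEECH_KEY") reads the env var.
```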

Methods

speak()

Converts text to speech using Azure’s neural text-to-speech service.

input: string | NodeJS.ReadableStream
Text or text stream to convert to speech.

options.speaker?: string (default: the constructor's speaker value)
Voice ID to use for speech synthesis.

options.style?: string
Speaking style (e.g., 'cheerful', 'sad', 'angry').

options.rate?: string
Speaking rate (e.g., 'slow', 'medium', 'fast').

options.pitch?: string
Voice pitch (e.g., 'low', 'medium', 'high').

Returns: Promise<NodeJS.ReadableStream>
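Because speak() resolves to a NodeJS.ReadableStream, the audio is typically piped somewhere rather than held in memory. A sketch of saving the stream to a file (saveStream is a local helper, not part of the AzureVoice API):

```typescript
import { createWriteStream } from "node:fs";
import { pipeline } from "node:stream/promises";

// Write any readable stream, such as the result of voice.speak(), to disk.
async function saveStream(stream: NodeJS.ReadableStream, path: string): Promise<void> {
  await pipeline(stream, createWriteStream(path));
}

// Hypothetical usage, assuming `voice` is a configured AzureVoice instance:
// await saveStream(await voice.speak("Hello"), "hello.wav");
```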

listen()

Transcribes audio using Azure’s speech-to-text service.

audioStream: NodeJS.ReadableStream
Audio stream to transcribe.

options.filetype?: string (default: 'wav')
Audio format of the input stream.

options.language?: string (default: 'en-US')
Language code for transcription.

Returns: Promise<string>
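listen() consumes a readable stream; when the audio comes from elsewhere it can be useful to buffer it first, for example to check the payload size before sending it to Azure. A small generic helper (streamToBuffer is illustrative, not part of the AzureVoice API):

```typescript
// Collect a readable stream into a single Buffer, e.g. to inspect an audio
// payload before passing it to voice.listen().
async function streamToBuffer(stream: NodeJS.ReadableStream): Promise<Buffer> {
  const chunks: Buffer[] = [];
  for await (const chunk of stream) {
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}
```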

getSpeakers()

Returns an array of available voice options, where each entry contains:

voiceId: string
Unique identifier for the voice (e.g., 'en-US-JennyNeural').

name: string
Human-readable name of the voice.

locale: string
Language locale of the voice (e.g., 'en-US').

gender: string
Gender of the voice ('Male' or 'Female').

styles?: string[]
Available speaking styles for the voice.
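Since the entries returned by getSpeakers() are plain objects, picking a voice is ordinary array filtering. A sketch, where the VoiceInfo shape matches the fields listed above and voicesForLocale is a hypothetical helper:

```typescript
interface VoiceInfo {
  voiceId: string;
  name: string;
  locale: string;
  gender: string;
  styles?: string[];
}

// Filter voices by locale, optionally requiring support for a speaking style.
function voicesForLocale(voices: VoiceInfo[], locale: string, style?: string): VoiceInfo[] {
  return voices.filter(
    (v) => v.locale === locale && (style === undefined || (v.styles ?? []).includes(style)),
  );
}
```

A voice found this way can then be passed as the speaker option to speak().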

Notes

  • API keys can be provided via constructor options or environment variables (AZURE_SPEECH_KEY and AZURE_SPEECH_REGION)
  • Azure offers a wide range of neural voices across many languages
  • Some voices support speaking styles like cheerful, sad, angry, etc.
  • Speech recognition supports multiple audio formats and languages
  • Azure’s speech services provide high-quality neural voices with natural-sounding speech