Azure

The AzureVoice class in Mastra provides text-to-speech and speech-to-text capabilities using Microsoft Azure Cognitive Services.

Usage Example

AzureVoice requires Azure Speech Services credentials, which can be supplied through environment variables or directly in the configuration:

import { AzureVoice } from "@mastra/voice-azure";

// Initialize with configuration
const voice = new AzureVoice({
  speechModel: {
    apiKey: "your-azure-speech-api-key", // Or use AZURE_API_KEY env var
    region: "eastus", // Or use AZURE_REGION env var
    voiceName: "en-US-AriaNeural", // Optional: specific voice for TTS
  },
  listeningModel: {
    apiKey: "your-azure-speech-api-key", // Or use AZURE_API_KEY env var
    region: "eastus", // Or use AZURE_REGION env var
    language: "en-US", // Optional: recognition language for STT
  },
  speaker: "en-US-JennyNeural", // Optional: default voice
});

// Convert text to speech
const audioStream = await voice.speak("Hello, how can I help you?", {
  speaker: "en-US-GuyNeural", // Optional: override default voice
});

// Convert speech to text
const text = await voice.listen(audioStream);

Configuration

Constructor Options

speechModel?:

AzureSpeechConfig

Configuration for text-to-speech synthesis.

listeningModel?:

AzureSpeechConfig

Configuration for speech-to-text recognition.

speaker?:

string

Default voice ID for speech synthesis.

AzureSpeechConfig

Configuration object for speech synthesis (speechModel) and recognition (listeningModel).

apiKey?:

string

Azure Speech Services API key (NOT Azure OpenAI key). Falls back to AZURE_API_KEY environment variable.

region?:

string

Azure region (e.g., 'eastus', 'westeurope'). Falls back to AZURE_REGION environment variable.

voiceName?:

string

Voice ID for speech synthesis (e.g., 'en-US-AriaNeural', 'en-US-JennyNeural'). Only used in speechModel. See voice list below.

language?:

string

Recognition language code (e.g., 'en-US', 'fr-FR'). Only used in listeningModel.

Methods

speak()

Converts text to speech using Azure's neural text-to-speech service.

input:

string | NodeJS.ReadableStream

Text or text stream to convert to speech.

options.speaker?:

string

Default: the constructor's speaker value

Voice ID to use for speech synthesis (e.g., 'en-US-JennyNeural'). Overrides the default voice.

Returns: Promise<NodeJS.ReadableStream> - Audio stream in WAV format
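
Because speak() resolves to a stream rather than a finished buffer, callers often need to gather the audio before saving or forwarding it. The sketch below shows one way to do that. `collectAudio` is a hypothetical helper, not part of @mastra/voice-azure, and it assumes the stream yields binary chunks.

```typescript
// Hypothetical helper (not part of @mastra/voice-azure): gathers every binary
// chunk from an async-iterable stream into one contiguous Uint8Array.
async function collectAudio(
  stream: AsyncIterable<Uint8Array | string>,
): Promise<Uint8Array> {
  const chunks: Uint8Array[] = [];
  let total = 0;
  for await (const chunk of stream) {
    // Assumption: audio arrives as binary data; a string chunk means this is a text stream.
    if (typeof chunk === "string") {
      throw new TypeError("Expected binary audio chunks, received a string");
    }
    chunks.push(chunk);
    total += chunk.length;
  }
  // Copy the chunks back to back into a single array.
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}
```

Node.js readable streams are async iterable, so passing the stream returned by voice.speak() to this helper should work, after which the bytes can be written to a .wav file.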

listen()

Transcribes audio using Azure's speech-to-text service.

audioStream:

NodeJS.ReadableStream

Audio stream to transcribe. Must be in WAV format.

Returns: Promise<string> - The recognized text from the audio

Note: Language and recognition settings are configured in the listeningModel configuration during initialization, not passed as options to this method.
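
Since listen() expects WAV input, it can help to check the audio before sending it. The sketch below inspects the standard RIFF/WAVE header bytes. `looksLikeWav` is a hypothetical helper, not part of @mastra/voice-azure, and it only checks the container signature, not the sample rate or encoding.

```typescript
// Hypothetical pre-flight check (not part of @mastra/voice-azure).
// A RIFF/WAVE file starts with ASCII "RIFF" at bytes 0-3 and "WAVE" at bytes 8-11.
function looksLikeWav(bytes: Uint8Array): boolean {
  if (bytes.length < 12) return false;
  // Compare a run of bytes against the character codes of an ASCII tag.
  const matches = (offset: number, ascii: string): boolean => {
    for (let i = 0; i < ascii.length; i++) {
      if (bytes[offset + i] !== ascii.charCodeAt(i)) return false;
    }
    return true;
  };
  return matches(0, "RIFF") && matches(8, "WAVE");
}
```

Only the first 12 bytes are needed, so the check can run on the first chunk of a stream before the rest is read.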

getSpeakers()

Returns an array of available voice options (200+ voices), where each element contains:

voiceId:

string

Unique identifier for the voice (e.g., 'en-US-JennyNeural', 'fr-FR-DeniseNeural')

language:

string

Language code extracted from voice ID (e.g., 'en', 'fr')

region:

string

Region code extracted from voice ID (e.g., 'US', 'GB', 'FR')

Returns: Promise<Array<{ voiceId: string; language: string; region: string; }>>
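
With 200+ entries, the result is usually narrowed before it is shown to a user. The sketch below filters entries of the documented shape by language and, optionally, region. `AzureVoiceInfo` and `filterVoices` are hypothetical names introduced for illustration, and the sample array stands in for the real result of getSpeakers().

```typescript
// Shape of each entry returned by getSpeakers(), as documented above.
interface AzureVoiceInfo {
  voiceId: string;
  language: string;
  region: string;
}

// Hypothetical helper: keep voices for one language, optionally limited to one region.
function filterVoices(
  voices: AzureVoiceInfo[],
  language: string,
  region?: string,
): AzureVoiceInfo[] {
  return voices.filter(
    (v) => v.language === language && (region === undefined || v.region === region),
  );
}

// Sample data in the documented shape; real code would use `await voice.getSpeakers()`.
const sampleVoices: AzureVoiceInfo[] = [
  { voiceId: "en-US-JennyNeural", language: "en", region: "US" },
  { voiceId: "en-GB-SoniaNeural", language: "en", region: "GB" },
  { voiceId: "fr-FR-DeniseNeural", language: "fr", region: "FR" },
];

// Holds only the en-GB entry from the sample data.
const british = filterVoices(sampleVoices, "en", "GB");
```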

Important Notes

Azure Speech Services vs Azure OpenAI

⚠️ Critical: This package uses Azure Speech Services, which is different from Azure OpenAI Services.

- DO NOT use your AZURE_OPENAI_API_KEY for this package
- DO use an Azure Speech Services subscription key (obtain from Azure Portal under "Speech Services")
- These are separate Azure resources with different API keys and endpoints

Environment Variables

API keys and regions can be provided via constructor options or environment variables:

- AZURE_API_KEY: Your Azure Speech Services subscription key
- AZURE_REGION: Your Azure region (e.g., 'eastus', 'westeurope')
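
The fallback order described here (constructor option first, then environment variable) can be sketched as follows. `resolveCredentials` is a hypothetical illustration, not the package's actual code. Throwing on a missing value is this sketch's own choice, and the environment is passed in explicitly so the function stays easy to test.

```typescript
// Hypothetical illustration of the documented fallback order; the package's
// real resolution logic may differ.
interface SpeechCredentials {
  apiKey?: string;
  region?: string;
}

type Env = Record<string, string | undefined>;

function resolveCredentials(
  options: SpeechCredentials,
  env: Env,
): Required<SpeechCredentials> {
  // An explicit option wins; otherwise fall back to the environment variable.
  const apiKey = options.apiKey ?? env["AZURE_API_KEY"];
  const region = options.region ?? env["AZURE_REGION"];
  if (!apiKey) throw new Error("Missing API key: set apiKey or AZURE_API_KEY");
  if (!region) throw new Error("Missing region: set region or AZURE_REGION");
  return { apiKey, region };
}
```

In a Node.js program the second argument would typically be `process.env`.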

Voice Capabilities

- Azure offers 200+ neural voices across 50+ languages
- Each voice ID follows the format: {language}-{region}-{name}Neural (e.g., 'en-US-JennyNeural')
- Some voices include multilingual support or HD quality variants
- Audio output is in WAV format
- Audio input for recognition must be in WAV format
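
The voice ID format above can be split into its parts with a small parser. This is a minimal sketch: `parseVoiceId` is a hypothetical helper, and it assumes a lowercase language code followed by a two-letter uppercase region. IDs that do not fit that pattern return null rather than a guess.

```typescript
interface ParsedVoiceId {
  voiceId: string;
  language: string;
  region: string;
  name: string;
}

// Hypothetical parser for IDs shaped like {language}-{region}-{name}Neural.
function parseVoiceId(voiceId: string): ParsedVoiceId | null {
  const match = /^([a-z]{2,3})-([A-Z]{2})-(.+)$/.exec(voiceId);
  if (!match) return null;
  const [, language, region, name] = match;
  if (!language || !region || !name) return null;
  return { voiceId, language, region, name };
}

// Holds { voiceId: "en-US-JennyNeural", language: "en", region: "US", name: "JennyNeural" }.
const parsedJenny = parseVoiceId("en-US-JennyNeural");
```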

Available Voices

Azure provides 200+ neural voices across many languages. Some popular English voices include:

US English:
- en-US-AriaNeural (Female, default)
- en-US-JennyNeural (Female)
- en-US-GuyNeural (Male)
- en-US-DavisNeural (Male)
- en-US-AvaNeural (Female)
- en-US-AndrewNeural (Male)

British English:
- en-GB-SoniaNeural (Female)
- en-GB-RyanNeural (Male)
- en-GB-LibbyNeural (Female)

Australian English:
- en-AU-NatashaNeural (Female)
- en-AU-WilliamNeural (Male)

To get a complete list of all 200+ voices:

const voices = await voice.getSpeakers();
console.log(voices); // Array of { voiceId, language, region }

For more information, see the Azure Neural TTS documentation.
