Skip to main content

Azure

The AzureVoice class in Mastra provides text-to-speech and speech-to-text capabilities using Microsoft Azure Cognitive Services.

Usage Example

This requires Azure Speech Services credentials that can be provided through environment variables or directly in the configuration:

import { AzureVoice } from "@mastra/voice-azure";

// Initialize with configuration
const voice = new AzureVoice({
speechModel: {
apiKey: "your-azure-speech-api-key", // Or use AZURE_API_KEY env var
region: "eastus", // Or use AZURE_REGION env var
voiceName: "en-US-AriaNeural", // Optional: specific voice for TTS
},
listeningModel: {
apiKey: "your-azure-speech-api-key", // Or use AZURE_API_KEY env var
region: "eastus", // Or use AZURE_REGION env var
language: "en-US", // Optional: recognition language for STT
},
speaker: "en-US-JennyNeural", // Optional: default voice
});

// Convert text to speech
const audioStream = await voice.speak("Hello, how can I help you?", {
speaker: "en-US-GuyNeural", // Optional: override default voice
});

// Convert speech to text
const text = await voice.listen(audioStream);

Configuration

Constructor Options

speechModel?:

AzureSpeechConfig
Configuration for text-to-speech synthesis.

listeningModel?:

AzureSpeechConfig
Configuration for speech-to-text recognition.

speaker?:

string
Default voice ID for speech synthesis.

AzureSpeechConfig

Configuration object for speech synthesis (speechModel) and recognition (listeningModel).

apiKey?:

string
Azure Speech Services API key (NOT Azure OpenAI key). Falls back to AZURE_API_KEY environment variable.

region?:

string
Azure region (e.g., 'eastus', 'westeurope'). Falls back to AZURE_REGION environment variable.

voiceName?:

string
Voice ID for speech synthesis (e.g., 'en-US-AriaNeural', 'en-US-JennyNeural'). Only used in speechModel. See voice list below.

language?:

string
Recognition language code (e.g., 'en-US', 'fr-FR'). Only used in listeningModel.

Methods

speak()

Converts text to speech using Azure's neural text-to-speech service.

input:

string | NodeJS.ReadableStream
Text or text stream to convert to speech.

options.speaker?:

string
= Constructor's speaker value
Voice ID to use for speech synthesis (e.g., 'en-US-JennyNeural'). Overrides the default voice.

Returns: Promise<NodeJS.ReadableStream> - Audio stream in WAV format

listen()

Transcribes audio using Azure's speech-to-text service.

audioStream:

NodeJS.ReadableStream
Audio stream to transcribe. Must be in WAV format.

Returns: Promise<string> - The recognized text from the audio

Note: Language and recognition settings are configured in the listeningModel configuration during initialization, not passed as options to this method.

getSpeakers()

Returns an array of available voice options (200+ voices), where each node contains:

voiceId:

string
Unique identifier for the voice (e.g., 'en-US-JennyNeural', 'fr-FR-DeniseNeural')

language:

string
Language code extracted from voice ID (e.g., 'en', 'fr')

region:

string
Region code extracted from voice ID (e.g., 'US', 'GB', 'FR')

Returns: Promise<Array<{ voiceId: string; language: string; region: string; }>>

Important Notes

Azure Speech Services vs Azure OpenAI

⚠️ Critical: This package uses Azure Speech Services, which is different from Azure OpenAI Services.

  • DO NOT use your AZURE_OPENAI_API_KEY for this package
  • DO use an Azure Speech Services subscription key (obtain from Azure Portal under "Speech Services")
  • These are separate Azure resources with different API keys and endpoints

Environment Variables

API keys and regions can be provided via constructor options or environment variables:

  • AZURE_API_KEY - Your Azure Speech Services subscription key
  • AZURE_REGION - Your Azure region (e.g., 'eastus', 'westeurope')

Voice Capabilities

  • Azure offers 200+ neural voices across 50+ languages
  • Each voice ID follows the format: {language}-{region}-{name}Neural (e.g., 'en-US-JennyNeural')
  • Some voices include multilingual support or HD quality variants
  • Audio output is in WAV format
  • Audio input for recognition must be in WAV format

Available Voices

Azure provides 200+ neural voices across many languages. Some popular English voices include:

  • US English:

    • en-US-AriaNeural (Female, default)
    • en-US-JennyNeural (Female)
    • en-US-GuyNeural (Male)
    • en-US-DavisNeural (Male)
    • en-US-AvaNeural (Female)
    • en-US-AndrewNeural (Male)
  • British English:

    • en-GB-SoniaNeural (Female)
    • en-GB-RyanNeural (Male)
    • en-GB-LibbyNeural (Female)
  • Australian English:

    • en-AU-NatashaNeural (Female)
    • en-AU-WilliamNeural (Male)

To get a complete list of all 200+ voices:

const voices = await voice.getSpeakers();
console.log(voices); // Array of { voiceId, language, region }

For more information, see the Azure Neural TTS documentation.