
Azure

The AzureVoice class in Mastra provides text-to-speech and speech-to-text capabilities using Microsoft Azure Cognitive Services.

Usage Example

```typescript
import { AzureVoice } from '@mastra/voice-azure';

// Initialize with configuration
const voice = new AzureVoice({
  speechModel: {
    name: 'neural',
    apiKey: 'your-azure-speech-api-key',
    region: 'eastus',
  },
  listeningModel: {
    name: 'whisper',
    apiKey: 'your-azure-speech-api-key',
    region: 'eastus',
  },
  speaker: 'en-US-JennyNeural', // Default voice
});

// Convert text to speech
const audioStream = await voice.speak('Hello, how can I help you?', {
  speaker: 'en-US-GuyNeural', // Override default voice
  style: 'cheerful', // Voice style
});

// Convert speech to text
const text = await voice.listen(audioStream, {
  filetype: 'wav',
  language: 'en-US',
});
```

Configuration

Constructor Options

speechModel?: AzureSpeechConfig
Configuration for text-to-speech synthesis.

listeningModel?: AzureSpeechConfig
Configuration for speech-to-text recognition.

speaker?: string
Default voice ID for speech synthesis.

AzureSpeechConfig

name?: 'neural' | 'standard' | 'whisper'
Model type to use: 'neural' for text-to-speech, 'whisper' for speech-to-text.

apiKey?: string
Azure Speech Services API key. Falls back to the AZURE_SPEECH_KEY environment variable.

region?: string
Azure region (e.g., 'eastus', 'westeurope'). Falls back to the AZURE_SPEECH_REGION environment variable.
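Because both apiKey and region fall back to environment variables, the credential fields can be omitted when AZURE_SPEECH_KEY and AZURE_SPEECH_REGION are set. A minimal sketch of that configuration:

```typescript
import { AzureVoice } from '@mastra/voice-azure';

// Assumes AZURE_SPEECH_KEY and AZURE_SPEECH_REGION are set in the
// environment; apiKey and region then resolve from those variables.
const voice = new AzureVoice({
  speechModel: { name: 'neural' },
  listeningModel: { name: 'whisper' },
  speaker: 'en-US-JennyNeural',
});
```

This keeps credentials out of source code, which is usually preferable in deployed services.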

Methods

speak()

Converts text to speech using Azure’s neural text-to-speech service.

input: string | NodeJS.ReadableStream
Text or text stream to convert to speech.

options.speaker?: string = constructor's speaker value
Voice ID to use for speech synthesis.

options.style?: string
Speaking style (e.g., 'cheerful', 'sad', 'angry').

options.rate?: string
Speaking rate (e.g., 'slow', 'medium', 'fast').

options.pitch?: string
Voice pitch (e.g., 'low', 'medium', 'high').

Returns: Promise<NodeJS.ReadableStream>
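Since speak() resolves to a readable stream, a common pattern is to pipe the audio to disk. A sketch, assuming valid Azure credentials in the environment and a writable local path (the file name here is illustrative):

```typescript
import { createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream/promises';
import { AzureVoice } from '@mastra/voice-azure';

const voice = new AzureVoice({ speaker: 'en-US-JennyNeural' });

// speak() resolves to a readable audio stream; pipeline() handles
// backpressure and closes the file when the stream ends.
const audio = await voice.speak('Your order has been confirmed.', {
  style: 'cheerful',
  rate: 'medium',
});
await pipeline(audio, createWriteStream('confirmation.wav'));
```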

listen()

Transcribes audio using Azure’s speech-to-text service.

audioStream: NodeJS.ReadableStream
Audio stream to transcribe.

options.filetype?: string = 'wav'
Audio format of the input stream.

options.language?: string = 'en-US'
Language code for transcription.

Returns: Promise<string>
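Because listen() accepts any readable stream, a local recording can be transcribed by passing a file stream. A sketch, assuming credentials in the environment and an existing recording (the file name is illustrative):

```typescript
import { createReadStream } from 'node:fs';
import { AzureVoice } from '@mastra/voice-azure';

const voice = new AzureVoice({
  listeningModel: { name: 'whisper' },
});

// Both options shown match their defaults ('wav', 'en-US') and could
// be omitted; they are spelled out here for clarity.
const transcript = await voice.listen(createReadStream('meeting.wav'), {
  filetype: 'wav',
  language: 'en-US',
});
console.log(transcript);
```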

getSpeakers()

Returns an array of available voice options, where each entry contains:

voiceId: string
Unique identifier for the voice (e.g., 'en-US-JennyNeural')

name: string
Human-readable name of the voice

locale: string
Language locale of the voice (e.g., 'en-US')

gender: string
Gender of the voice ('Male' or 'Female')

styles?: string[]
Available speaking styles for the voice

Notes

  • API keys can be provided via constructor options or environment variables (AZURE_SPEECH_KEY and AZURE_SPEECH_REGION)
  • Azure offers a wide range of neural voices across many languages
  • Some voices support speaking styles like cheerful, sad, angry, etc.
  • Speech recognition supports multiple audio formats and languages
  • Azure’s speech services provide high-quality neural voices with natural-sounding speech