Inworld

The Inworld voice implementation in Mastra provides streaming text-to-speech (TTS) and batch speech-to-text (STT) capabilities using Inworld AI's API. It supports multiple TTS and STT models, configurable audio encodings, and progressive audio streaming.

Usage example

import { InworldVoice } from '@mastra/voice-inworld'

// Initialize with default configuration (uses INWORLD_API_KEY environment variable)
const defaultVoice = new InworldVoice()

// Initialize with custom configuration
const voice = new InworldVoice({
  speechModel: {
    name: 'inworld-tts-2',
    apiKey: 'your-api-key',
  },
  listeningModel: {
    name: 'groq/whisper-large-v3',
    apiKey: 'your-api-key',
  },
  speaker: 'Dennis',
})

// Text-to-Speech (streaming)
const audioStream = await voice.speak('Hello, world!')

// Speech-to-Text
const transcript = await voice.listen(audioStream)

Constructor parameters

speechModel?:

InworldVoiceConfig
= { name: 'inworld-tts-2' }
Configuration for text-to-speech functionality.
InworldVoiceConfig

name?:

'inworld-tts-2' | 'inworld-tts-1.5-max' | 'inworld-tts-1.5-mini'
The Inworld TTS model to use.

apiKey?:

string
Inworld API key. Falls back to INWORLD_API_KEY environment variable.

listeningModel?:

InworldListeningConfig
= { name: 'groq/whisper-large-v3' }
Configuration for speech-to-text functionality.
InworldListeningConfig

name?:

'groq/whisper-large-v3'
The Inworld STT model to use.

apiKey?:

string
Inworld API key. Falls back to INWORLD_API_KEY environment variable.

speaker?:

string
= 'Dennis'
Default voice ID to use for text-to-speech.

audioEncoding?:

'LINEAR16' | 'MP3' | 'OGG_OPUS' | 'ALAW' | 'MULAW' | 'FLAC' | 'PCM' | 'WAV'
= 'MP3'
Default audio encoding for TTS output.

sampleRateHertz?:

number
= 48000
Default sample rate for TTS output.

language?:

string
= 'en-US'
Default BCP-47 language code for STT.

Methods

speak(input, options?)

Converts text to speech using Inworld's streaming TTS endpoint. Returns a readable stream that emits audio chunks progressively as they arrive.

const audioStream = await voice.speak('Hello, world!', {
  speaker: 'Olivia',
  audioEncoding: 'WAV',
  sampleRateHertz: 24000,
  speakingRate: 1.2,
  temperature: 0.8,
})

input:

string | NodeJS.ReadableStream
Text to convert to speech. If a stream is provided, it will be converted to text first.

options?:

InworldSpeakOptions
Additional options for speech synthesis.
InworldSpeakOptions

speaker?:

string
Override the default speaker for this request.

audioEncoding?:

AudioEncoding
Override the default audio encoding.

sampleRateHertz?:

number
Override the default sample rate.

speakingRate?:

number
Adjust the speaking rate.

temperature?:

number
Controls voice variability. Honored on `inworld-tts-1.5-*` models; ignored by `inworld-tts-2`.

deliveryMode?:

'STABLE' | 'BALANCED' | 'CREATIVE'
Steering control for delivery style. Only honored by `inworld-tts-2`.

language?:

string
BCP-47 language code for this request. Auto-detected when omitted.

Returns: Promise<NodeJS.ReadableStream>
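
Because speak() resolves to a readable stream that emits chunks as they arrive, a caller can start handling audio before synthesis finishes. The following is a minimal sketch of one way to consume such a stream. The helper name collectAudio and its onChunk callback are hypothetical and not part of @mastra/voice-inworld; the sketch only assumes the NodeJS.ReadableStream return type documented above.

```typescript
// Hypothetical helper (not part of @mastra/voice-inworld): reads a Node.js
// readable stream chunk by chunk, reports each chunk as soon as it arrives,
// and returns the complete audio once the stream ends.
async function collectAudio(
  audio: NodeJS.ReadableStream,
  onChunk: (chunk: Buffer) => void,
): Promise<Buffer> {
  const chunks: Buffer[] = []
  for await (const chunk of audio) {
    // Readable streams may yield strings or Buffers; normalize to Buffer.
    const buffer = typeof chunk === 'string' ? Buffer.from(chunk) : chunk
    onChunk(buffer) // e.g. forward to an audio player or an HTTP response
    chunks.push(buffer)
  }
  return Buffer.concat(chunks)
}
```

Under these assumptions, a call such as collectAudio(await voice.speak('Hello, world!'), chunk => player.write(chunk)) would hand each chunk to a (hypothetical) player while also keeping the full audio for later use.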

listen(input, options?)

Converts speech to text using Inworld's batch STT endpoint.

const transcript = await voice.listen(audioStream, {
  audioEncoding: 'MP3',
  sampleRateHertz: 44100,
  language: 'ja-JP',
})

input:

NodeJS.ReadableStream
Audio stream to transcribe.

options?:

InworldListenOptions
Additional options for transcription.
InworldListenOptions

audioEncoding?:

'LINEAR16' | 'MP3' | 'OGG_OPUS' | 'FLAC' | 'AUTO_DETECT'
Audio encoding of the input stream.

sampleRateHertz?:

number
Sample rate of the input audio.

language?:

string
BCP-47 language code for transcription.

numberOfChannels?:

number
Number of audio channels in the input.

Returns: Promise<string>
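
To show how the listen() options fit together, here is a hedged sketch that transcribes an audio file from disk. The helpers encodingForFile and transcribeFile, the Listener interface, and the extension-to-encoding mapping are all illustrative assumptions; only the option names and encoding values come from the reference above. Unknown extensions fall back to 'AUTO_DETECT'.

```typescript
import { createReadStream } from 'node:fs'
import { extname } from 'node:path'

// Encoding values as documented for InworldListenOptions.audioEncoding.
type ListenEncoding = 'LINEAR16' | 'MP3' | 'OGG_OPUS' | 'FLAC' | 'AUTO_DETECT'

interface ListenOptions {
  audioEncoding?: ListenEncoding
  sampleRateHertz?: number
  language?: string
  numberOfChannels?: number
}

// Structural stand-in for an InworldVoice instance (assumption: only listen() is needed).
interface Listener {
  listen(input: NodeJS.ReadableStream, options?: ListenOptions): Promise<string>
}

// Hypothetical mapping from file extension to encoding; anything unrecognized
// is left to the service's auto-detection.
function encodingForFile(filePath: string): ListenEncoding {
  switch (extname(filePath).toLowerCase()) {
    case '.mp3':
      return 'MP3'
    case '.flac':
      return 'FLAC'
    case '.ogg':
    case '.opus':
      return 'OGG_OPUS'
    default:
      return 'AUTO_DETECT'
  }
}

// Streams the file into listen() with an encoding inferred from its name.
async function transcribeFile(voice: Listener, filePath: string, language?: string): Promise<string> {
  const options: ListenOptions = { audioEncoding: encodingForFile(filePath) }
  if (language !== undefined) options.language = language
  return voice.listen(createReadStream(filePath), options)
}
```

With a real instance this might be called as transcribeFile(voice, './clip.mp3', 'en-US'), where the file path is a placeholder.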

getSpeakers()

Returns a list of available voices from the Inworld API.

const speakers = await voice.getSpeakers()
// [{ voiceId: 'Dennis', name: 'Dennis', language: 'en', description: '...', tags: ['friendly'], source: 'SYSTEM' }, ...]

Returns: Promise<Array<{ voiceId: string; name: string; language: string; description: string; tags: string[]; source: string }>>
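
The returned list can be filtered locally to pick a voice for the speaker option. The sketch below assumes only the element shape documented above; the findSpeakers helper and its matching rules (primary language subtag, exact tag match) are hypothetical choices, not library behavior.

```typescript
// Element shape as documented for getSpeakers().
interface Speaker {
  voiceId: string
  name: string
  language: string
  description: string
  tags: string[]
  source: string
}

// Returns the primary subtag of a language code, e.g. 'en-US' -> 'en'.
function primaryLanguage(code: string): string {
  return code.toLowerCase().split('-')[0] ?? ''
}

// Hypothetical helper: keeps speakers that match every criterion that is set.
function findSpeakers(
  speakers: Speaker[],
  criteria: { language?: string; tag?: string },
): Speaker[] {
  return speakers.filter((speaker) => {
    if (
      criteria.language !== undefined &&
      primaryLanguage(speaker.language) !== primaryLanguage(criteria.language)
    ) {
      return false
    }
    if (criteria.tag !== undefined && !speaker.tags.includes(criteria.tag)) {
      return false
    }
    return true
  })
}
```

For instance, findSpeakers(await voice.getSpeakers(), { language: 'en-US', tag: 'friendly' }) would keep English voices tagged 'friendly', and the voiceId of a match could then be passed as speaker.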

Notes

  • The TTS endpoint uses progressive NDJSON streaming, so audio playback can begin before the full response is received.
  • An API key can be supplied through speechModel.apiKey, listeningModel.apiKey, or the INWORLD_API_KEY environment variable. TTS and STT keys are resolved independently, so passing distinct speechModel.apiKey and listeningModel.apiKey values lets each service use its own credential. If only one of the two is set, it is reused for the other service before falling back to the environment variable.
  • inworld-tts-2 is the default flagship model. Use deliveryMode (STABLE | BALANCED | CREATIVE) to steer delivery style on this model. The temperature option is ignored on inworld-tts-2.
  • The inworld-tts-1.5-mini model offers lower latency at the cost of reduced voice quality compared to inworld-tts-1.5-max.
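
The key-resolution order and the model-specific style options described in these notes can be sketched as two small helpers. This is an illustration of the documented behavior, not the library's actual code: resolveApiKeys, honoredStyleOptions, and the interface names are hypothetical.

```typescript
interface ModelConfig {
  name?: string
  apiKey?: string
}

// Each service prefers its own key, then the other service's key,
// then the environment variable (resolution order as described in the notes).
function resolveApiKeys(
  speechModel: ModelConfig = {},
  listeningModel: ModelConfig = {},
  envKey: string | undefined = process.env.INWORLD_API_KEY,
): { ttsKey: string | undefined; sttKey: string | undefined } {
  return {
    ttsKey: speechModel.apiKey ?? listeningModel.apiKey ?? envKey,
    sttKey: listeningModel.apiKey ?? speechModel.apiKey ?? envKey,
  }
}

type TtsModel = 'inworld-tts-2' | 'inworld-tts-1.5-max' | 'inworld-tts-1.5-mini'

interface StyleOptions {
  temperature?: number
  deliveryMode?: 'STABLE' | 'BALANCED' | 'CREATIVE'
}

// Keeps only the style option the given model honors:
// deliveryMode on inworld-tts-2, temperature on the inworld-tts-1.5-* models.
function honoredStyleOptions(model: TtsModel, options: StyleOptions): StyleOptions {
  if (model === 'inworld-tts-2') {
    return options.deliveryMode === undefined ? {} : { deliveryMode: options.deliveryMode }
  }
  return options.temperature === undefined ? {} : { temperature: options.temperature }
}
```

Filtering options this way is optional, since the notes state that unsupported options are ignored, but it makes explicit which setting actually affects the output for a given model.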