Inworld
The Inworld voice implementation in Mastra provides streaming text-to-speech (TTS) and batch speech-to-text (STT) capabilities using Inworld AI's API. It supports multiple TTS and STT models, configurable audio encodings, and progressive audio streaming.
Usage example
import { InworldVoice } from '@mastra/voice-inworld'

// Initialize with default configuration (uses INWORLD_API_KEY environment variable)
const defaultVoice = new InworldVoice()

// Initialize with custom configuration
const voice = new InworldVoice({
  speechModel: {
    name: 'inworld-tts-2',
    apiKey: 'your-api-key',
  },
  listeningModel: {
    name: 'groq/whisper-large-v3',
    apiKey: 'your-api-key',
  },
  speaker: 'Dennis',
})

// Text-to-Speech (streaming)
const audioStream = await voice.speak('Hello, world!')

// Speech-to-Text
const transcript = await voice.listen(audioStream)
Constructor parameters
speechModel?:
InworldVoiceConfig
= { name: 'inworld-tts-2' }
Configuration for text-to-speech functionality.
InworldVoiceConfig
name?:
'inworld-tts-2' | 'inworld-tts-1.5-max' | 'inworld-tts-1.5-mini'
The Inworld TTS model to use.
apiKey?:
string
Inworld API key. Falls back to INWORLD_API_KEY environment variable.
listeningModel?:
InworldListeningConfig
= { name: 'groq/whisper-large-v3' }
Configuration for speech-to-text functionality.
InworldListeningConfig
name?:
'groq/whisper-large-v3'
The Inworld STT model to use.
apiKey?:
string
Inworld API key. Falls back to INWORLD_API_KEY environment variable.
speaker?:
string
= 'Dennis'
Default voice ID to use for text-to-speech.
audioEncoding?:
'LINEAR16' | 'MP3' | 'OGG_OPUS' | 'ALAW' | 'MULAW' | 'FLAC' | 'PCM' | 'WAV'
= 'MP3'
Default audio encoding for TTS output.
sampleRateHertz?:
number
= 48000
Default sample rate for TTS output.
language?:
string
= 'en-US'
Default BCP-47 language code for STT.
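The Notes section describes the order in which API keys are resolved for the two services. The sketch below illustrates that order as a standalone function. `resolveApiKeys` and its parameter shapes are hypothetical names used for illustration only; they are not part of `@mastra/voice-inworld`, and the library's actual implementation may differ.

```typescript
// Hypothetical helper illustrating the documented key fallback order.
// It is not part of @mastra/voice-inworld; all names here are assumptions.
interface KeyConfig {
  speechModel?: { apiKey?: string }
  listeningModel?: { apiKey?: string }
}

interface ResolvedKeys {
  tts: string | undefined
  stt: string | undefined
}

function resolveApiKeys(config: KeyConfig, envKey?: string): ResolvedKeys {
  const speechKey = config.speechModel?.apiKey
  const listeningKey = config.listeningModel?.apiKey
  return {
    // Each service prefers its own key, then the other service's key,
    // and only then the value of the INWORLD_API_KEY environment variable.
    tts: speechKey ?? listeningKey ?? envKey,
    stt: listeningKey ?? speechKey ?? envKey,
  }
}

// Distinct keys: each service keeps its own credential.
const both = resolveApiKeys(
  { speechModel: { apiKey: 'tts-key' }, listeningModel: { apiKey: 'stt-key' } },
  'env-key',
)

// Only one key in the config: it is reused for both services before the env var.
const shared = resolveApiKeys({ speechModel: { apiKey: 'tts-key' } }, 'env-key')

// No keys in the config: both services fall back to the env var.
const fromEnv = resolveApiKeys({}, 'env-key')
```

In a real application the second argument would be `process.env.INWORLD_API_KEY`; it is passed explicitly here so the example does not depend on the environment.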
Methods
speak(input, options?)
Converts text to speech using Inworld's streaming TTS endpoint. Returns a readable stream that emits audio chunks progressively as they arrive.
const audioStream = await voice.speak('Hello, world!', {
  speaker: 'Olivia',
  audioEncoding: 'WAV',
  sampleRateHertz: 24000,
  speakingRate: 1.2,
  temperature: 0.8, // honored on inworld-tts-1.5-* models; ignored by inworld-tts-2
})
input:
string | NodeJS.ReadableStream
Text to convert to speech. If a stream is provided, its contents are read as text before synthesis.
options?:
InworldSpeakOptions
Additional options for speech synthesis.
InworldSpeakOptions
speaker?:
string
Override the default speaker for this request.
audioEncoding?:
AudioEncoding
Override the default audio encoding.
sampleRateHertz?:
number
Override the default sample rate.
speakingRate?:
number
Adjust the speaking rate.
temperature?:
number
Controls voice variability. Honored on `inworld-tts-1.5-*` models; ignored by `inworld-tts-2`.
deliveryMode?:
'STABLE' | 'BALANCED' | 'CREATIVE'
Steering control for delivery style. Only honored by `inworld-tts-2`.
language?:
string
BCP-47 language code for this request. Auto-detected when omitted.
Returns: Promise<NodeJS.ReadableStream>
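Because `speak()` resolves to a readable stream, callers usually either pipe it to a destination or buffer it. The sketch below buffers a stream into a single byte array. `collectAudio` is a hypothetical helper, and it relies only on the fact that Node.js readable streams are async iterable. A stand-in async generator replaces the real `speak()` result so the example runs without an API key.

```typescript
// Hypothetical helper: drain an audio stream into one contiguous byte array.
// Node.js readable streams are async iterable, so the stream returned by
// voice.speak() could be consumed the same way (assuming binary chunks).
async function collectAudio(stream: AsyncIterable<Uint8Array>): Promise<Uint8Array> {
  const chunks: Uint8Array[] = []
  let total = 0
  for await (const chunk of stream) {
    chunks.push(chunk)
    total += chunk.length
  }
  // Copy every chunk into a single array, preserving arrival order.
  const audio = new Uint8Array(total)
  let offset = 0
  for (const chunk of chunks) {
    audio.set(chunk, offset)
    offset += chunk.length
  }
  return audio
}

// Stand-in for a real speak() result: three chunks arriving in order.
async function* fakeAudioStream(): AsyncGenerator<Uint8Array> {
  yield new Uint8Array([1, 2])
  yield new Uint8Array([3])
  yield new Uint8Array([4, 5, 6])
}
```

Buffering waits for the whole response, so it suits saving audio to a file. For playback, piping the stream to the output preserves the progressive delivery.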
listen(input, options?)
Converts speech to text using Inworld's batch STT endpoint.
const transcript = await voice.listen(audioStream, {
  audioEncoding: 'MP3',
  sampleRateHertz: 44100,
  language: 'ja-JP',
})
input:
NodeJS.ReadableStream
Audio stream to transcribe.
options?:
InworldListenOptions
Additional options for transcription.
InworldListenOptions
audioEncoding?:
'LINEAR16' | 'MP3' | 'OGG_OPUS' | 'FLAC' | 'AUTO_DETECT'
Audio encoding of the input stream.
sampleRateHertz?:
number
Sample rate of the input audio.
language?:
string
BCP-47 language code for transcription.
numberOfChannels?:
number
Number of audio channels in the input.
Returns: Promise<string>
getSpeakers()
Returns a list of available voices from the Inworld API.
const speakers = await voice.getSpeakers()
// [{ voiceId: 'Dennis', name: 'Dennis', language: 'en', description: '...', tags: ['friendly'], source: 'SYSTEM' }, ...]
Returns: Promise<Array<{ voiceId: string; name: string; language: string; description: string; tags: string[]; source: string }>>
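The returned array can be filtered locally to choose a `speaker` value. This is a minimal sketch using the return shape documented above. `pickSpeakers`, the `InworldSpeaker` interface name, and the sample entries are hypothetical; the sample data does not reflect the actual Inworld voice catalog.

```typescript
// Shape documented for getSpeakers() results (the interface name is an assumption).
interface InworldSpeaker {
  voiceId: string
  name: string
  language: string
  description: string
  tags: string[]
  source: string
}

// Hypothetical helper: voice IDs matching a language and, optionally, a tag.
function pickSpeakers(speakers: InworldSpeaker[], language: string, tag?: string): string[] {
  return speakers
    .filter((s) => s.language === language)
    .filter((s) => tag === undefined || s.tags.includes(tag))
    .map((s) => s.voiceId)
}

// Made-up sample standing in for `await voice.getSpeakers()`.
const sample: InworldSpeaker[] = [
  { voiceId: 'Dennis', name: 'Dennis', language: 'en', description: '', tags: ['friendly'], source: 'SYSTEM' },
  { voiceId: 'VoiceB', name: 'Voice B', language: 'en', description: '', tags: ['calm'], source: 'SYSTEM' },
  { voiceId: 'VoiceC', name: 'Voice C', language: 'ja', description: '', tags: ['friendly'], source: 'SYSTEM' },
]

const englishFriendly = pickSpeakers(sample, 'en', 'friendly')
const english = pickSpeakers(sample, 'en')
```

A resulting `voiceId` can then be passed as the `speaker` option of `speak()` or as the constructor default.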
Notes
- The TTS endpoint uses progressive NDJSON streaming, so audio playback can begin before the full response is received.
- An API key can be provided via the `speechModel` or `listeningModel` config, or the `INWORLD_API_KEY` environment variable. TTS and STT keys are resolved independently: passing distinct `speechModel.apiKey` and `listeningModel.apiKey` values lets each service use its own credential. If only one is provided, it is reused for both services as a fallback before the env var.
- `inworld-tts-2` is the default flagship model. Use `deliveryMode` (`STABLE` | `BALANCED` | `CREATIVE`) to steer delivery style on this model. The `temperature` option is ignored on `inworld-tts-2`.
- The `inworld-tts-1.5-mini` model offers lower latency at the cost of reduced voice quality compared to `inworld-tts-1.5-max`.
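To illustrate the first note, the sketch below shows how NDJSON can be parsed progressively: text is buffered until a newline arrives, and each complete line is parsed as one JSON object. This is a generic illustration of the technique, not the library's implementation. `NdjsonLineParser` and the `audioContent` field name are assumptions.

```typescript
// Generic incremental NDJSON parser (an illustration, not Mastra's code).
class NdjsonLineParser {
  private buffer = ''

  // Feed a text chunk; returns every JSON object completed by this chunk.
  push(chunk: string): unknown[] {
    this.buffer += chunk
    const lines = this.buffer.split('\n')
    // The last element is an incomplete line (or ''), so keep it buffered.
    this.buffer = lines.pop() ?? ''
    return lines.filter((line) => line.trim() !== '').map((line) => JSON.parse(line))
  }

  // Call at end of stream to parse a final line that lacked a trailing newline.
  flush(): unknown[] {
    const rest = this.buffer.trim()
    this.buffer = ''
    return rest === '' ? [] : [JSON.parse(rest)]
  }
}

// Chunk boundaries need not align with line boundaries.
// The `audioContent` field name is hypothetical.
const parser = new NdjsonLineParser()
const first = parser.push('{"audioContent":"AAA"}\n{"audioCon')
const second = parser.push('tent":"BBB"}\n')
const remainder = parser.flush()
```

Each call to `push` returns only the objects completed so far, which is what allows a consumer to start handling audio before the response has finished.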