Inworld Realtime voice
The InworldRealtimeVoice class provides real-time, full-duplex voice interaction using Inworld AI's Realtime API over WebSockets. It supports speech-to-speech, tool calling, and Inworld-specific session knobs such as semantic voice activity detection, MCP tool routing, and playback speed.
Inworld's wire protocol follows the OpenAI Realtime GA spec, so client and server event names match those of @mastra/voice-openai-realtime. The provider-level differences are: the endpoint, which carries a client-generated session key in the URL; the Authorization: Basic <key> header; the typed session constructor field for Inworld-specific knobs; and a typed providerData object for Inworld extensions (STT, TTS, memory, back-channel, responsiveness), which is sent under session.providerData.
For batch text-to-speech and speech-to-text, see @mastra/voice-inworld.
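The endpoint and header differences described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the helper name buildRealtimeEndpoint and the base URL are hypothetical placeholders, while the key and protocol query parameters and the Basic header follow this page.

```typescript
// Sketch only: the base URL passed in below is a placeholder, not Inworld's real endpoint.
interface RealtimeEndpoint {
  url: string
  headers: Record<string, string>
}

function buildRealtimeEndpoint(baseUrl: string, sessionId: string, apiKey: string): RealtimeEndpoint {
  const url = new URL(baseUrl)
  // The client-generated session key travels in the query string.
  url.searchParams.set('key', sessionId)
  url.searchParams.set('protocol', 'realtime')
  return {
    url: url.toString(),
    // The key is already Basic-encoded, so it is passed through verbatim.
    headers: { Authorization: `Basic ${apiKey}` },
  }
}

const endpoint = buildRealtimeEndpoint('wss://realtime.example.invalid/session', 'session-123', 'pre-encoded-key')
console.log(endpoint.url)
```

Note that the model does not appear in the URL; it is configured through the initial session.update.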
Usage example
import { InworldRealtimeVoice } from '@mastra/voice-inworld'
import { playAudio, getMicrophoneStream } from '@mastra/node-audio'

// Initialize with INWORLD_API_KEY from the environment
const voice = new InworldRealtimeVoice()

// Or initialize with explicit configuration
const voiceWithConfig = new InworldRealtimeVoice({
  apiKey: 'your-inworld-api-key',
  model: 'inworld/models/gemma-4-26b-a4b-it',
  speaker: 'Sarah',
  instructions: 'You are a helpful voice assistant.',
  session: {
    audio: {
      output: { speed: 1.1 },
      input: { turn_detection: { type: 'semantic_vad', eagerness: 'high' } },
    },
  },
})

// Establish connection
await voice.connect()

// Listen for audio output (PCM16 @ 24 kHz by default)
voice.on('speaker', stream => {
  playAudio(stream)
})

voice.on('writing', ({ text, role }) => {
  console.log(`${role}: ${text}`)
})

// Convert text to speech
await voice.speak('Hello, how can I help you today?', {
  speaker: 'Hades',
})

// Stream microphone audio to the model
const microphoneStream = getMicrophoneStream()
await voice.send(microphoneStream)

// Clean up
voice.close()
Inworld API keys ship pre-Basic-encoded. Paste them verbatim into
INWORLD_API_KEY; the package does not re-encode them.
Constructor parameters
apiKey?:
url?:
model?:
speaker?:
sessionId?:
instructions?:
session?:
debug?:
providerData?:
connectTimeoutMs?:
session (typed knobs)
Use the typed session field for documented Inworld realtime options. Fields compose with the connect-time defaults (e.g. audio.output.voice set from speaker):
output_modalities?:
audio.output.voice?:
audio.output.speed?:
audio.output.model?:
audio.output.format?:
audio.input.format?:
audio.input.noise_reduction?:
audio.input.transcription?:
audio.input.turn_detection?:
tool_choice?:
temperature?:
max_output_tokens?:
truncation?:
tracing?:
include?:
prompt?:
providerData (Inworld extensions)
providerData is a typed object for Inworld-specific realtime extensions. It is sent under session.providerData on every session.update, and composes with any session.providerData you set via the session field. On key collisions, the constructor providerData wins.
It has five branches plus two session-level fields:
- stt: STT tuning, such as prompt, voice_profile, language_hints, and VAD or end-of-turn thresholds.
- tts: TTS segmentation and delivery, such as segmenter_strategy, steering_handling, delivery_mode, conversational, and user_turn_mode.
- memory: automatic rolling memory, such as enabled, turn_interval, and max_facts. Inworld echoes its state back through the memory event.
- backchannel: short acknowledgements ("uh-huh") while the user speaks. Audio arrives on the backchannel event.
- responsiveness: early filler audio while the main response generates. Filler audio reuses the normal speaker and speaking events, so there are no distinct events.
- user_id and metadata: session-level identifiers passed through to Inworld.
const voice = new InworldRealtimeVoice({
  providerData: {
    stt: { voice_profile: true, language_hints: ['en-US'] },
    tts: { delivery_mode: 'CREATIVE', segmenter_strategy: 'balanced' },
    memory: { enabled: true, turn_interval: 4 },
    backchannel: { enabled: true, max_per_turn: 1 },
    user_id: 'user-123',
  },
})
Methods
connect()
Opens the WebSocket connection, sends the initial session.update, and resolves once the server acknowledges with session.updated. Must be called before speak(), listen(), or send().
A pre-open error or close on the WebSocket, or a handshake that exceeds connectTimeoutMs (15 s by default), surfaces as a rejected promise instead of an uncaught socket error. On rejection, the half-open socket is closed.
await voice.connect()
Returns: Promise<void>
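The timeout and cleanup behavior described for connect() can be approximated with a small wrapper. This is a sketch under assumptions, not the package's code: withConnectTimeout and its cleanup callback are hypothetical names, and the handshake promise stands in for "socket opened and session.updated received".

```typescript
// Rejects if `handshake` fails or does not settle within `timeoutMs`.
// `cleanup` runs once on any rejection, e.g. to close a half-open socket.
function withConnectTimeout<T>(handshake: Promise<T>, timeoutMs: number, cleanup: () => void): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let settled = false
    const timer = setTimeout(() => fail(new Error(`Handshake exceeded ${timeoutMs}ms`)), timeoutMs)

    function fail(error: Error): void {
      if (settled) return
      settled = true
      clearTimeout(timer)
      cleanup()
      reject(error)
    }

    handshake.then(
      value => {
        if (settled) return
        settled = true
        clearTimeout(timer)
        resolve(value)
      },
      reason => fail(reason instanceof Error ? reason : new Error(String(reason))),
    )
  })
}

// Usage: a handshake that never settles is rejected after 50 ms and cleaned up.
let socketClosed = false
const neverSettles = new Promise<void>(() => {})
const attempt = withConnectTimeout(neverSettles, 50, () => {
  socketClosed = true
})
attempt.catch(error => console.log((error as Error).message))
```

Callers of the real connect() only need a try/catch around the await; the wrapper simply illustrates why a slow or failed handshake shows up as a rejection rather than an uncaught socket error.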
speak()
Sends a text message to the model and triggers an audio response. The returned promise resolves only after the full response lifecycle completes (response.done for the response this call triggered), and rejects if the response is interrupted by user speech or if a transport error occurs.
Serial speak() calls are the supported pattern. Concurrent calls share the same listener pool and have undefined response-pinning order.
input:
options?:
speaker?:
Returns: Promise<void>
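Because each speak() promise resolves only after its response completes, serial use amounts to awaiting one call before starting the next. The sketch below illustrates that pattern with a stand-in object; SpeakCapable, speakInOrder, and fakeVoice are hypothetical names used only for the demonstration.

```typescript
// Minimal shape of the method used here; the real class has more options.
interface SpeakCapable {
  speak(input: string): Promise<void>
}

// Awaits each speak() before starting the next, so responses never overlap.
// A rejection (for example, an interrupted response) stops the remaining lines.
async function speakInOrder(voice: SpeakCapable, lines: string[]): Promise<void> {
  for (const line of lines) {
    await voice.speak(line)
  }
}

// Stand-in voice used only to demonstrate ordering; it is not the real class.
const log: string[] = []
const fakeVoice: SpeakCapable = {
  async speak(input) {
    log.push(`start:${input}`)
    await new Promise<void>(resolve => setTimeout(resolve, 10))
    log.push(`done:${input}`)
  },
}

const finished = speakInOrder(fakeVoice, ['Hello', 'How can I help?'])
finished.then(() => console.log(log.join(' | ')))
```

Firing both calls at once with Promise.all would be the concurrent pattern that the paragraph above warns against.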
listen()
Sends a single audio buffer as a user turn and asks the model to respond with text only.
audioData:
Returns: Promise<void>
send()
Streams audio data to the server in real time. Useful for continuous microphone input.
audioData:
eventId?:
Returns: Promise<void>
updateConfig()
Sends a session.update to the server. The typed session field is deep-merged into the payload, and any constructor providerData is nested under session.providerData.
sessionConfig:
Returns: void
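One way to picture the merge described above is the sketch below. It is an assumption-laden illustration rather than the package's implementation: deepMerge and buildSessionUpdate are hypothetical helpers, the collision rule is applied to top-level providerData keys only, and the sample values are invented.

```typescript
type Plain = { [key: string]: unknown }

function isPlainObject(value: unknown): value is Plain {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

// Recursively merges `override` into `base`; non-object values (including null) replace.
function deepMerge(base: Plain, override: Plain): Plain {
  const result: Plain = { ...base }
  for (const [key, value] of Object.entries(override)) {
    const existing = result[key]
    result[key] = isPlainObject(existing) && isPlainObject(value) ? deepMerge(existing, value) : value
  }
  return result
}

// Builds a session.update payload from defaults, the typed session field,
// and the constructor-level providerData.
function buildSessionUpdate(defaults: Plain, session: Plain, providerData: Plain): Plain {
  const merged = deepMerge(defaults, session)
  const sessionData = merged['providerData']
  const fromSession = isPlainObject(sessionData) ? sessionData : {}
  // Constructor providerData wins on key collisions.
  return {
    type: 'session.update',
    session: { ...merged, providerData: { ...fromSession, ...providerData } },
  }
}

const update = buildSessionUpdate(
  { audio: { output: { voice: 'Sarah' } } },
  { audio: { output: { speed: 1.1 } }, providerData: { user_id: 'from-session', metadata: { channel: 'web' } } },
  { user_id: 'user-123' },
)
console.log(JSON.stringify(update, null, 2))
```

In this example the voice set from the defaults survives next to the new speed, and user_id resolves to the constructor value while metadata from the session field is kept.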
addInstructions()
Sets the system instructions used on the next connect() or updateConfig() call.
instructions?:
Returns: void
addTools()
Registers tools that the model can call during the session. When InworldRealtimeVoice is attached to an Agent, tools configured for the Agent are made available automatically.
tools?:
Returns: void
answer()
Sends a response.create event to trigger a model response, optionally with per-response options.
options?:
Returns: Promise<void>
Turn-taking
commitInput()
Manually commits buffered input audio as a user turn. Use this for push-to-talk or manual turn-taking when turn_detection is set to null.
voice.commitInput()
Returns: void
clearInput()
Discards buffered input audio without committing it as a user turn.
voice.clearInput()
Returns: void
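A push-to-talk flow built on commitInput() and clearInput() might look like the sketch below. It assumes the session was created with turn_detection set to null. PushToTalk and ManualTurnVoice are hypothetical names, the Int16Array chunk type is an assumption, and the fake voice only records calls.

```typescript
// Subset of the methods documented on this page; the audio chunk type is an assumption.
interface ManualTurnVoice {
  send(audioData: Int16Array): Promise<void>
  commitInput(): void
  clearInput(): void
}

class PushToTalk {
  private held = false
  private readonly voice: ManualTurnVoice

  constructor(voice: ManualTurnVoice) {
    this.voice = voice
  }

  press(): void {
    this.held = true
  }

  // Forward microphone chunks only while the button is held.
  async onAudio(chunk: Int16Array): Promise<void> {
    if (this.held) await this.voice.send(chunk)
  }

  // Releasing commits the buffered audio as a user turn.
  release(): void {
    if (!this.held) return
    this.held = false
    this.voice.commitInput()
  }

  // Cancelling discards the buffered audio instead of committing it.
  cancel(): void {
    if (!this.held) return
    this.held = false
    this.voice.clearInput()
  }
}

// Stand-in voice that records which methods were called.
const calls: string[] = []
const fake: ManualTurnVoice = {
  async send() {
    calls.push('send')
  },
  commitInput() {
    calls.push('commit')
  },
  clearInput() {
    calls.push('clear')
  },
}

const ptt = new PushToTalk(fake)
ptt.press()
void ptt.onAudio(new Int16Array(480))
ptt.release()
void ptt.onAudio(new Int16Array(480)) // ignored: the button is no longer held
console.log(calls.join(', '))
```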
clearOutput()
Clears the server's entire output audio buffer, stopping playback. This also stops any in-flight back-channel audio. The default barge-in path (response.cancel on interrupted) is back-channel-safe; prefer it. Use clearOutput() only when you want to flush everything.
voice.clearOutput()
Returns: void
close() and disconnect()
Both methods close the WebSocket and mark the instance as disconnected.
Returns: void
getSpeakers()
Returns the curated voice list bundled with the package. Inworld's catalog is larger than this list; any voice ID can be passed to speaker at runtime.
Returns: Promise<Array<{ voiceId: string }>>
on() and off()
Register and remove event listeners. See Events below.
Events
The InworldRealtimeVoice class emits the following events:
speaker:
speaking:
speaking.done:
writing:
speech-started:
speech-stopped:
interrupted:
turn-suggestion:
turn-suggestion-revoked:
input-committed:
input-cleared:
input-timeout:
output-audio-started:
output-audio-stopped:
output-audio-cleared:
memory:
backchannel:
backchannel.done:
backchannel.skipped:
response.created:
response.done:
conversation.item.added:
conversation.item.done:
function_call.arguments:
tool-call-start:
tool-call-result:
error:
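To illustrate how an event map can give known event names a typed payload while unknown names fall back to unknown, here is a self-contained sketch. It does not reproduce InworldVoiceEventMap: ExampleEventMap and TypedEmitter are hypothetical, and only the writing payload shape ({ text, role }) is taken from the usage example on this page.

```typescript
// Hypothetical one-entry map; the real map covers every event listed above.
type ExampleEventMap = {
  writing: { text: string; role: string }
}

// Known names resolve to their mapped payload, anything else to `unknown`.
type PayloadOf<M, K extends string> = K extends keyof M ? M[K] : unknown
type Listener<T> = (payload: T) => void

class TypedEmitter<M> {
  private listeners = new Map<string, Set<Listener<unknown>>>()

  on<K extends string>(event: K, listener: Listener<PayloadOf<M, K>>): void {
    const set = this.listeners.get(event) ?? new Set<Listener<unknown>>()
    set.add(listener as Listener<unknown>)
    this.listeners.set(event, set)
  }

  off<K extends string>(event: K, listener: Listener<PayloadOf<M, K>>): void {
    this.listeners.get(event)?.delete(listener as Listener<unknown>)
  }

  emit<K extends string>(event: K, payload: PayloadOf<M, K>): void {
    this.listeners.get(event)?.forEach(listener => listener(payload))
  }
}

const emitter = new TypedEmitter<ExampleEventMap>()
const lines: string[] = []

// `text` and `role` are typed as strings because 'writing' is a known name.
emitter.on('writing', ({ text, role }) => lines.push(`${role}: ${text}`))
// 'custom-event' is not in the map, so `payload` is `unknown` and must be narrowed.
emitter.on('custom-event', payload => lines.push(typeof payload))

emitter.emit('writing', { text: 'Hello', role: 'assistant' })
emitter.emit('custom-event', 42)
console.log(lines)
```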
Voices
The package ships with a curated set of voice IDs returned from getSpeakers():
- Dennis
- Hades
- Wendy
- Edward
- Olivia
- Sarah
- Timothy
- Priya
- Ronald
- Deborah
Any voice ID from Inworld's voice catalog can be passed to speaker at runtime.
Notes
- API keys can be provided via constructor options or the INWORLD_API_KEY environment variable. Keys are pre-Basic-encoded; do not re-encode them.
- The WebSocket URL appends ?key=<sessionId>&protocol=realtime. The model is configured via the initial session.update, not the URL.
- Per-call speak(input, { speaker }) scopes the voice override to a single response (via the flat response.voice field) and does NOT mutate the session.
- Audio output defaults to PCM16 at 24 kHz. Telephony audio/pcmu and audio/pcma at 8 kHz, and audio/float32, are also supported via session.audio.output.format.
- Use connect() before any send, speak, or listen call. Events sent before the WebSocket is open are queued and flushed once the server acknowledges session.updated.
- The voice instance must be closed with close() or disconnect() to release the WebSocket.
- audio.input.turn_detection defaults to semantic VAD when session does not supply it. Override with your own object, or pass null to disable turn detection entirely.
- audio.input.transcription defaults to { model: 'inworld/inworld-stt-1' }, so user-side writing events fire out of the box. Override with your own object, or pass null to disable user-side transcription.
- on() and off() are typed against InworldVoiceEventMap: known event names yield a typed callback payload, unknown names fall back to unknown.
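The difference between omitting turn_detection or transcription and passing null can be sketched as below. This is illustrative only: resolveInput is a hypothetical helper, the transcription default is the one documented in the notes, and the exact shape of the semantic VAD default object is an assumption.

```typescript
type TurnDetection = { type: string; eagerness?: string } | null
type Transcription = { model: string } | null

interface InputOptions {
  turn_detection?: TurnDetection
  transcription?: Transcription
}

// Assumed default shapes: the transcription default is documented above,
// while the exact semantic VAD default object is a guess.
const DEFAULT_TURN_DETECTION: TurnDetection = { type: 'semantic_vad' }
const DEFAULT_TRANSCRIPTION: Transcription = { model: 'inworld/inworld-stt-1' }

function resolveInput(input: InputOptions = {}): Required<InputOptions> {
  return {
    // `undefined` means "not supplied" and takes the default;
    // `null` is kept as-is so the feature stays disabled.
    turn_detection: input.turn_detection === undefined ? DEFAULT_TURN_DETECTION : input.turn_detection,
    transcription: input.transcription === undefined ? DEFAULT_TRANSCRIPTION : input.transcription,
  }
}

const withDefaults = resolveInput()
const manualTurns = resolveInput({ turn_detection: null })
console.log(JSON.stringify(withDefaults))
console.log(JSON.stringify(manualTurns))
```

The explicit undefined check matters here: the ?? operator would treat null the same as a missing value and silently re-enable the default.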