Inworld's Realtime API now plugs into Mastra. The Inworld team built @mastra/voice-inworld-realtime, and one WebSocket session gets your agent speech in, speech out, semantic turn detection, barge-in, and tool calling.
Inworld wrote a guide from their side covering the Realtime API session model, the event surface, and the conversational details that make a voice agent feel natural. They also shipped a reference CLI that runs the full loop in about 70 lines of TypeScript, plus a demo video. On the Mastra end, the whole point is what doesn't change: voice attaches to the agent you already have, tools and all.
Voice is a property of the agent
In Mastra, voice is a field on the Agent, the same way tools and instructions are. Every provider implements the MastraVoice interface, so attaching one is a constructor argument and not an architectural decision. Inworld plugs into that interface for both realtime speech-to-speech and TTS, so a single provider covers the whole range of voice an agent might need.
The LLM stays swappable. Inworld's Realtime API routes to OpenAI, Anthropic, Google, and others through one model string, so you can change the model behind the conversation without touching the integration.
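Because the model is a single string, swapping it can be treated as configuration. A minimal sketch, assuming the routing string follows the provider/model shape used in this post's config example; the helper names, the INWORLD_MODEL variable, and the 'anthropic/example-model' value are hypothetical and not part of either package:

```typescript
// Hypothetical helpers; neither function is part of @mastra/voice-inworld-realtime.
const DEFAULT_MODEL = 'openai/gpt-5.4-nano';

interface ModelRoute {
  provider: string;
  model: string;
}

// Split a "provider/model" routing string at the first slash (assumed format).
function parseModelRoute(route: string): ModelRoute {
  const slash = route.indexOf('/');
  if (slash <= 0 || slash === route.length - 1) {
    throw new Error(`Expected "provider/model", got "${route}"`);
  }
  return { provider: route.slice(0, slash), model: route.slice(slash + 1) };
}

// Pick the routing string from an env-like record, falling back to the default.
function resolveModel(env: Record<string, string | undefined>): string {
  const route = env.INWORLD_MODEL?.trim() || DEFAULT_MODEL;
  parseModelRoute(route); // validate early so a typo fails at startup
  return route;
}
```

The resolved string would then be passed as the model field of the voice config, leaving the rest of the integration untouched.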
Attach a voice, keep your agent
The minimum config is two fields:
import { InworldRealtimeVoice } from '@mastra/voice-inworld-realtime';

const voice = new InworldRealtimeVoice({
  model: 'openai/gpt-5.4-nano',
  speaker: 'Jason',
});

model is the routing string, and speaker is any voice in Inworld's library, including cloned custom voices. Attaching it looks like attaching anything else to an agent:
import { Agent } from '@mastra/core/agent';

// getCurrentTime is the tool defined in the next section.
const agent = new Agent({
  id: 'voice-demo',
  name: 'Voice Demo',
  instructions: 'You are a concise voice assistant. Reply in one or two short sentences.',
  tools: { getCurrentTime },
  voice,
});

That's the whole integration surface. The CLI demo's src/main.ts adds mic and speaker plumbing around it, and the result is a full-duplex voice agent running in your terminal.
Your tools already work in voice
The tool surface doesn't change because the voice is realtime. The demo registers one tool, defined exactly as a text agent would define it:
import { createTool } from '@mastra/core/tools';
import { z } from 'zod';

// Takes no input and returns the local time as a string.
const getCurrentTime = createTool({
  id: 'get-current-time',
  description: 'Returns the current local time.',
  inputSchema: z.object({}),
  outputSchema: z.object({ time: z.string() }),
  execute: async () => ({ time: new Date().toLocaleTimeString() }),
});

When the user asks for the time, the LLM inside the realtime session emits a tool call, the voice package routes it to execute(), and the result streams back into the session as a tool result. The audio channel stays open the whole way through. In a cascaded voice pipeline, a tool call usually means a stall while the pipeline tears down and rebuilds; here the conversation just continues.
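The round trip described above can be sketched without any voice machinery. This is an illustration only: ToolCallEvent, ToolResult, ToolLike, and dispatchToolCall are hypothetical names, and the voice package's real event shapes may well differ.

```typescript
// Hypothetical shapes for illustration; not the package's actual types.
interface ToolCallEvent {
  callId: string;
  toolId: string;
  args: unknown;
}

interface ToolResult {
  callId: string;
  ok: boolean;
  output: unknown;
}

interface ToolLike {
  id: string;
  execute: (args: unknown) => Promise<unknown>;
}

// Look up the requested tool, run it, and package the outcome as a tool result.
// Failures are returned as results rather than thrown, so the session can continue.
async function dispatchToolCall(
  tools: Map<string, ToolLike>,
  call: ToolCallEvent,
): Promise<ToolResult> {
  const tool = tools.get(call.toolId);
  if (!tool) {
    return { callId: call.callId, ok: false, output: `Unknown tool: ${call.toolId}` };
  }
  try {
    const output = await tool.execute(call.args);
    return { callId: call.callId, ok: true, output };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { callId: call.callId, ok: false, output: message };
  }
}

// A stand-in registry holding one tool with the same id as the demo's.
const toolRegistry = new Map<string, ToolLike>([
  [
    'get-current-time',
    {
      id: 'get-current-time',
      execute: async () => ({ time: new Date().toLocaleTimeString() }),
    },
  ],
]);
```

Nothing in the dispatch step touches audio, which is one way to see why a tool call need not interrupt the open channel.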
The contract is the same one every Mastra agent uses, so anything you'd register on a non-voice agent works: an HTTP call, a database query, an MCP tool. If your agent already has a tool belt, it keeps it when it gains a voice. And the extension points are the usual Mastra ones too. Memory, scorers, and workflows aren't in this demo, but they attach to a voice agent just as they would to any other.
What Inworld's realtime layer adds
Everything above is the Mastra half. Inworld's layer is what makes the conversation feel live: semantic VAD that ends the user's turn when they finish a thought rather than when they pause, barge-in that cuts the assistant off within ~100ms when the user starts talking, and inline steering tags that the TTS renders as prosody without a second model call. Their post goes deep on each of these, and the CLI demo wires them up end to end. Try talking over the assistant and watch playback stop.
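On the client side, barge-in comes down to dropping queued assistant audio the moment the user starts speaking. A rough sketch of that idea, assuming the session signals user speech in some way; PlaybackQueue and its method names are hypothetical and are not Inworld's or Mastra's API:

```typescript
// Hypothetical client-side buffer for assistant audio; illustration only.
class PlaybackQueue {
  private chunks: Int16Array[] = [];
  private generation = 0;

  // Generation counter that callers attach to the audio they enqueue.
  get currentGeneration(): number {
    return this.generation;
  }

  // Number of chunks waiting to be played.
  get pending(): number {
    return this.chunks.length;
  }

  // Queue a chunk of assistant audio. Chunks tagged with a stale generation
  // (audio from a response that was already interrupted) are rejected.
  enqueue(chunk: Int16Array, generation: number): boolean {
    if (generation !== this.generation) return false;
    this.chunks.push(chunk);
    return true;
  }

  // Hand the next chunk to the audio device, or undefined if nothing is queued.
  next(): Int16Array | undefined {
    return this.chunks.shift();
  }

  // Call when the user starts talking: discard queued audio and bump the
  // generation so late-arriving chunks from the old response are ignored.
  interrupt(): number {
    const dropped = this.chunks.length;
    this.chunks = [];
    this.generation += 1;
    return dropped;
  }
}
```

The generation counter is a design choice for this sketch: it handles audio that was already in flight when the interruption happened, which a plain queue flush would let through.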
For a more complete application example, Inworld also built a voice design agent that lives on a website and edits CSS live while you talk to it.
Start building
Clone the CLI demo, add your Inworld API key, and you're talking to a Mastra agent in a couple of minutes. The voice docs list the full provider set, and Inworld's guide covers the realtime session in depth.
Thanks to Cale and the Inworld team for the integration and the demos.
