xAI Realtime voice

The XAIRealtimeVoice class provides realtime voice interaction capabilities using the xAI Grok Voice Agent API. It implements Mastra's MastraVoice realtime contract and supports bidirectional audio streaming, text turns, server VAD, xAI voices, function tools, and xAI server-side tools.

Usage example

import { Agent } from '@mastra/core/agent'
import { getMicrophoneStream, playAudio } from '@mastra/node-audio'
import { XAIRealtimeVoice } from '@mastra/voice-xai-realtime'

const voice = new XAIRealtimeVoice({
  apiKey: process.env.XAI_API_KEY,
  model: 'grok-voice-think-fast-1.0',
  speaker: 'eve',
  instructions: 'You are a concise voice assistant.',
  turnDetection: { type: 'server_vad' },
})

const agent = new Agent({
  id: 'voice-agent',
  name: 'Voice Agent',
  instructions: 'You are a helpful voice assistant.',
  model: 'xai/grok-4.3',
  voice,
})

await agent.voice.connect()

agent.voice.on('speaker', audioStream => {
  playAudio(audioStream)
})

agent.voice.on('writing', ({ text, role }) => {
  console.log(`${role}: ${text}`)
})

await agent.voice.speak('How can I help you today?')

const microphoneStream = getMicrophoneStream()
await agent.voice.send(microphoneStream)

agent.voice.close()

Configuration

Constructor options

apiKey?: string
xAI API key. Falls back to the XAI_API_KEY environment variable.

ephemeralToken?: string
Short-lived xAI token sent with the WebSocket protocol instead of an authorization header.

model?: XAIRealtimeModel = 'grok-voice-think-fast-1.0'
The Grok voice model to use.

speaker?: XAIVoice = 'eve'
Voice ID to use for speech output. Built-in values are eve, ara, rex, sal, and leo. Custom xAI voice IDs are also supported.

instructions?: string
System instructions sent in session.update.

turnDetection?: XAITurnDetection = { type: 'server_vad' }
Voice activity detection configuration.

audio?: XAIAudioConfig
Input and output audio format configuration. Defaults to 24 kHz audio/pcm input and output.

serverTools?: XAIServerTool[]
xAI server-side tools to send in session.update. Supports file_search, web_search, x_search, and mcp. These are merged with session.tools.

session?: Partial<XAISessionConfig>
Additional xAI session fields to merge into the initial session.update event.

url?: string = 'wss://api.x.ai/v1/realtime'
Override the xAI realtime WebSocket URL.

debug?: boolean = false
Enable debug logging for received xAI events. Debug logs can include transcripts and tool-call arguments.

VoiceConfig pattern

You can also use Mastra's shared voice configuration shape:

const voice = new XAIRealtimeVoice({
  speaker: 'ara',
  realtimeConfig: {
    model: 'grok-voice-think-fast-1.0',
    apiKey: process.env.XAI_API_KEY,
    options: {
      instructions: 'Answer briefly.',
      turnDetection: { type: 'server_vad', threshold: 0.85 },
    },
  },
})

Authentication

Use apiKey or XAI_API_KEY for server-side applications. This provider is built for Node.js server-side runtimes. If you already mint xAI ephemeral tokens on your server, you can pass one as ephemeralToken; the provider uses the xai-client-secret.<token> WebSocket protocol instead of an authorization header. If both apiKey and ephemeralToken are configured, the provider uses the ephemeral token.
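
For example, a server that already mints xAI ephemeral tokens can hand one to the provider instead of the long-lived API key. This is only a sketch: mintEphemeralToken() stands in for your own token-exchange code.

// mintEphemeralToken() is a placeholder for your own server-side token exchange.
const ephemeralToken = await mintEphemeralToken()

const voice = new XAIRealtimeVoice({
  // Takes precedence over apiKey / XAI_API_KEY when both are configured.
  ephemeralToken,
  model: 'grok-voice-think-fast-1.0',
})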

Methods

connect()

Establishes the WebSocket connection and sends the initial session.update.

requestContext?: RequestContext
Optional Mastra request context passed to function tool executions.

Returns: Promise<void>

close()

Closes the WebSocket connection, ends active speaker streams, and clears queued events, pending function-call state, and request context. disconnect() is an alias for close().

Returns: void

addInstructions()

Sets session instructions. If the WebSocket is open, the provider immediately sends a session.update. Passing undefined stores an empty string, clearing the instructions for the current session if connected, or for the next connection otherwise.

instructions?: string
System instructions to send to xAI.

Returns: void
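
A small illustration of the behavior described above:

// Replaces the active instructions; sends session.update if the socket is open.
voice.addInstructions('Keep answers under two sentences.')

// Passing undefined clears the instructions for the current or next session.
voice.addInstructions(undefined)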

addTools()

Registers Mastra function tools and, when connected, refreshes the session tools with session.update.

tools?: ToolsInput
Mastra tools to expose as xAI function tools.

Returns: void

updateConfig()

Sends a session.update event with additional xAI session fields.

sessionConfig: Partial<XAISessionConfig>
Session fields to update.

Returns: void
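
As a sketch, assuming you want to adjust the session instructions mid-conversation (the accepted fields are whatever XAISessionConfig defines):

// Pushes additional session fields onto the live session via session.update.
voice.updateConfig({
  instructions: 'From now on, answer in French.',
})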

speak()

Sends a text turn using conversation.item.create and then requests a response.

input: string | NodeJS.ReadableStream
Text or a readable text stream to send as user input.

options.speaker?: XAIVoice
Voice override. This updates the active xAI session voice and is used for subsequent turns.

options.response?: Record<string, unknown>
Additional xAI response.create fields.

Returns: Promise<void>
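
For example, to send a text turn while switching the session voice (rex is one of the built-in voices listed above):

// The speaker override updates the session voice for this and later turns.
await voice.speak('Summarize the last answer in one sentence.', {
  speaker: 'rex',
})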

send()

Streams realtime audio chunks with input_audio_buffer.append.

send() requires an open connection. Use it for live microphone audio after connect() resolves. Readable stream chunks must be binary audio chunks (Buffer, ArrayBuffer, or a typed array).

audioData: NodeJS.ReadableStream | Int16Array
PCM audio stream or Int16Array audio data.

eventId?: string
Optional xAI event ID.

Returns: Promise<void>
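
Besides readable streams, send() accepts raw PCM16 frames. In this sketch, captureFrame() is a placeholder for your own capture loop producing 24 kHz PCM16 samples:

// Push one raw PCM16 frame into the input audio buffer.
const frame: Int16Array = captureFrame()
await voice.send(frame)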

listen()

Sends a finite audio stream with input_audio_buffer.append. By default it commits the input buffer and requests a response.

audioData: NodeJS.ReadableStream
Audio stream to send.

options.commit?: boolean = true
Whether to send input_audio_buffer.commit after the audio item.

options.createResponse?: boolean = true
Whether to send response.create after the audio item.

Returns: Promise<void>

answer()

Sends response.create to ask xAI to continue the conversation.

Returns: Promise<void>
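
Combining listen() and answer() gives a simple manual flow for pre-recorded audio. recordingStream below is a placeholder for any finite PCM16 readable stream, such as a decoded file:

// Send the clip and commit it, but hold off on the response...
await voice.listen(recordingStream, { createResponse: false })

// ...then request the response explicitly.
await voice.answer()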

commitAudioBuffer() and clearAudioBuffer()

Send the matching xAI realtime client events for manual turn control.

Returns: Promise<void>
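
A sketch of fully manual turn control, assuming you manage turns yourself rather than relying on server VAD, and reusing the microphoneStream from the usage example:

// Stream audio, then close out the turn by hand.
await voice.send(microphoneStream)
await voice.commitAudioBuffer()
await voice.answer()

// Or discard whatever has been buffered instead of committing it.
await voice.clearAudioBuffer()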

cancelResponse()

Sends response.cancel to interrupt an in-flight response.

responseId?: string
Optional xAI response ID to cancel.

eventId?: string
Optional xAI event ID.

Returns: Promise<void>
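
For example, to support barge-in, cancel the in-flight response as soon as your own speech detection fires (onUserSpeechDetected is a placeholder for that signal):

onUserSpeechDetected(async () => {
  // Stop the assistant's current response so the user can take the turn.
  await voice.cancelResponse()
})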

Events

XAIRealtimeVoice maps xAI realtime server events onto Mastra voice events:

  • speaker: emits a readable stream for assistant audio.
  • speaking: emits assistant audio deltas.
  • speaking.done: emits when an assistant audio response completes.
  • writing: emits assistant text deltas and user input transcriptions.
  • error: emits xAI errors, provider execution errors, tool execution errors, and malformed function-call arguments. Tool errors include details.call_id and details.name.
  • close: emits when the WebSocket closes.
  • tool-call-start: emits before a Mastra function tool is executed.
  • tool-call-result: emits after a Mastra function tool returns.

Raw xAI event names are also emitted, so you can subscribe to events such as response.output_audio.delta, response.text.delta, response.function_call_arguments.done, and response.done.
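
For example, to observe tool execution and one raw xAI event (the payloads are simply logged as-is here):

voice.on('tool-call-start', payload => {
  console.log('tool call started', payload)
})

voice.on('tool-call-result', payload => {
  console.log('tool call finished', payload)
})

// Raw xAI server events are re-emitted under their original names.
voice.on('response.done', event => {
  console.log('response finished', event)
})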

Tools

Mastra function tools

Tools added with addTools() are converted into xAI function tools and included in session.update.

import { createTool } from '@mastra/core/tools'
import { z } from 'zod'

const weatherTool = createTool({
  id: 'getWeather',
  description: 'Get current weather for a location.',
  inputSchema: z.object({
    location: z.string(),
  }),
  execute: async ({ location }) => {
    return { location, temperature: 22 }
  },
})

voice.addTools({ getWeather: weatherTool })

When xAI emits response.function_call_arguments.done, the provider executes the matching Mastra tool and sends a function_call_output item. If xAI emits multiple function calls for one response, the provider waits for every tool result and the response's response.done event before sending one continuation response.create.

xAI server-side tools

xAI server-side tools are passed through in the session configuration and executed by xAI. Tools passed in session.tools and serverTools are merged:

const voice = new XAIRealtimeVoice({
  apiKey: process.env.XAI_API_KEY,
  serverTools: [
    { type: 'web_search' },
    { type: 'x_search', allowed_x_handles: ['xai'] },
    { type: 'file_search', vector_store_ids: ['collection_123'], max_num_results: 10 },
    {
      type: 'mcp',
      server_url: 'https://mcp.example.com/mcp',
      server_label: 'business-tools',
      allowed_tools: ['lookup_order'],
    },
  ],
})

Audio formats

The default input and output format is 24 kHz PCM16. You can also configure supported PCM sample rates or telephony codecs:

const voice = new XAIRealtimeVoice({
  audio: {
    input: { format: { type: 'audio/pcm', rate: 16000 } },
    output: { format: { type: 'audio/pcm', rate: 16000 } },
  },
})

Supported format types are audio/pcm, audio/pcmu, and audio/pcma. PCM supports the documented sample rates from 8 kHz through 48 kHz. audio/pcmu and audio/pcma are G.711 telephony codecs and use 8 kHz.
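
As a sketch of a telephony setup, the same shape can select a G.711 codec. Whether the 8 kHz rate must be stated explicitly for audio/pcmu depends on XAIAudioConfig; it is shown here for clarity.

const voice = new XAIRealtimeVoice({
  audio: {
    // G.711 mu-law; the codec implies 8 kHz.
    input: { format: { type: 'audio/pcmu', rate: 8000 } },
    output: { format: { type: 'audio/pcmu', rate: 8000 } },
  },
})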