Google Gemini Live Voice

The GeminiLiveVoice class provides real-time voice interaction capabilities using Google’s Gemini Live API. It supports bidirectional audio streaming, tool calling, session management, and both standard Google API and Vertex AI authentication methods.

Usage Example


import { GeminiLiveVoice } from "@mastra/voice-google-gemini-live";
import { playAudio, getMicrophoneStream } from "@mastra/node-audio";
 
// Initialize with Gemini API (using API key)
const voice = new GeminiLiveVoice({
  apiKey: process.env.GOOGLE_API_KEY, // Required for Gemini API
  model: "gemini-2.0-flash-exp",
  speaker: "Puck", // Default voice
  debug: true,
});
 
// Or initialize with Vertex AI (using OAuth)
const voiceWithVertexAI = new GeminiLiveVoice({
  vertexAI: true,
  project: "your-gcp-project",
  location: "us-central1",
  serviceAccountKeyFile: "/path/to/service-account.json",
  model: "gemini-2.0-flash-exp",
  speaker: "Puck",
});
 
// Or use the VoiceConfig pattern (recommended for consistency with other providers)
const voiceWithConfig = new GeminiLiveVoice({
  speechModel: {
    name: "gemini-2.0-flash-exp",
    apiKey: process.env.GOOGLE_API_KEY,
  },
  speaker: "Puck",
  realtimeConfig: {
    model: "gemini-2.0-flash-exp",
    apiKey: process.env.GOOGLE_API_KEY,
    options: {
      debug: true,
      sessionConfig: {
        interrupts: { enabled: true },
      },
    },
  },
});
 
// Establish connection (required before using other methods)
await voice.connect();
 
// Set up event listeners
voice.on("speaker", (audioStream) => {
  // Handle audio stream (NodeJS.ReadableStream)
  playAudio(audioStream);
});
 
voice.on("writing", ({ text, role }) => {
  // Handle transcribed text
  console.log(`${role}: ${text}`);
});
 
voice.on("turnComplete", ({ timestamp }) => {
  // Handle turn completion
  console.log("Turn completed at:", timestamp);
});
 
// Convert text to speech
await voice.speak("Hello, how can I help you today?", {
  speaker: "Charon", // Override default voice
  responseModalities: ["AUDIO", "TEXT"],
});
 
// Process audio input
const microphoneStream = getMicrophoneStream();
await voice.send(microphoneStream);
 
// Update session configuration
await voice.updateSessionConfig({
  speaker: "Kore",
  instructions: "Be more concise in your responses",
});
 
// When done, disconnect
await voice.disconnect();
// Or use the synchronous wrapper
voice.close();

Configuration

Constructor Options

apiKey?:

string

Google API key for Gemini API authentication. Required unless using Vertex AI.

model?:

GeminiVoiceModel

= 'gemini-2.0-flash-exp'

The model ID to use for real-time voice interactions.

speaker?:

GeminiVoiceName

= 'Puck'

Default voice ID for speech synthesis.

vertexAI?:

boolean

= false

Use Vertex AI instead of Gemini API for authentication.

project?:

string

Google Cloud project ID (required for Vertex AI).

location?:

string

= 'us-central1'

Google Cloud region for Vertex AI.

serviceAccountKeyFile?:

string

Path to service account JSON key file for Vertex AI authentication.

serviceAccountEmail?:

string

Service account email for impersonation (alternative to key file).

instructions?:

string

System instructions for the model.

sessionConfig?:

GeminiSessionConfig

Session configuration including interrupt and context settings.

debug?:

boolean

= false

Enable debug logging for troubleshooting.

Session Configuration

interrupts?:

object

Interrupt handling configuration.

interrupts.enabled?:

boolean

= true

Enable interrupt handling.

interrupts.allowUserInterruption?:

boolean

= true

Allow user to interrupt model responses.

contextCompression?:

boolean

= false

Enable automatic context compression.

Methods

connect()

Establishes a connection to the Gemini Live API. Must be called before using speak, listen, or send methods.

runtimeContext?:

object

Optional runtime context for the connection.

returns:

Promise<void>

Promise that resolves when the connection is established.

speak()

Converts text to speech and sends it to the model. Can accept either a string or a readable stream as input.

input:

string | NodeJS.ReadableStream

Text or text stream to convert to speech.

options?:

GeminiLiveVoiceOptions

Optional speech configuration.

options.speaker?:

GeminiVoiceName

= Constructor's speaker value

Voice ID to use for this specific speech request.

options.languageCode?:

string

Language code for the response.

options.responseModalities?:

('AUDIO' | 'TEXT')[]

= ['AUDIO', 'TEXT']

Response modalities to receive from the model.

Returns: Promise<void> (responses are emitted via speaker and writing events)

listen()

Processes audio input for speech recognition. Takes a readable stream of audio data and returns the transcribed text.

audioStream:

NodeJS.ReadableStream

Audio stream to transcribe.

options?:

GeminiLiveVoiceOptions

Optional listening configuration.

Returns: Promise<string> - The transcribed text

send()

Streams audio data in real-time to the Gemini service for continuous audio streaming scenarios like live microphone input.

audioData:

NodeJS.ReadableStream | Int16Array

Audio stream or buffer to send to the service.

Returns: Promise<void>

updateSessionConfig()

Updates the session configuration dynamically. This can be used to modify voice settings, speaker selection, and other runtime configurations.

config:

Partial<GeminiLiveVoiceConfig>

Configuration updates to apply.

Returns: Promise<void>

addTools()

Adds a set of tools to the voice instance. Tools allow the model to perform additional actions during conversations. When GeminiLiveVoice is added to an Agent, any tools configured for the Agent will automatically be available to the voice interface.

tools:

ToolsInput

Tools configuration to equip.

Returns: void

addInstructions()

Adds or updates system instructions for the model.

instructions?:

string

System instructions to set.

Returns: void

answer()

Triggers a response from the model. This method is primarily used internally when integrated with an Agent.

options?:

Record<string, unknown>

Optional parameters for the answer request.

Returns: Promise<void>

getSpeakers()

Returns a list of available voice speakers for the Gemini Live API.

Returns: Promise<Array<{ voiceId: string; description?: string }>>

disconnect()

Disconnects from the Gemini Live session and cleans up resources. This is the async method that properly handles cleanup.

Returns: Promise<void>

close()

Synchronous wrapper for disconnect(). Calls disconnect() internally without awaiting.

Returns: void

on()

Registers an event listener for voice events.

event:

string

Name of the event to listen for.

callback:

Function

Function to call when the event occurs.

Returns: void

off()

Removes a previously registered event listener.

event:

string

Name of the event to stop listening to.

callback:

Function

The specific callback function to remove.

Returns: void

Events

The GeminiLiveVoice class emits the following events:

speaker:

event

Emitted when audio data is received from the model. Callback receives a NodeJS.ReadableStream.

speaking:

event

Emitted with audio metadata. Callback receives { audioData?: Int16Array, sampleRate?: number }.

writing:

event

Emitted when transcribed text is available. Callback receives { text: string, role: 'assistant' | 'user' }.

session:

event

Emitted on session state changes. Callback receives { state: 'connecting' | 'connected' | 'disconnected' | 'disconnecting' | 'updated', config?: object }.

turnComplete:

event

Emitted when a conversation turn is completed. Callback receives { timestamp: number }.

toolCall:

event

Emitted when the model requests a tool call. Callback receives { name: string, args: object, id: string }.

usage:

event

Emitted with token usage information. Callback receives { inputTokens: number, outputTokens: number, totalTokens: number, modality: string }.

error:

event

Emitted when an error occurs. Callback receives { message: string, code?: string, details?: unknown }.

interrupt:

event

Interrupt events. Callback receives { type: 'user' | 'model', timestamp: number }.

Available Models

The following Gemini Live models are available:

gemini-2.0-flash-exp (default)
gemini-2.0-flash-exp-image-generation
gemini-2.0-flash-live-001
gemini-live-2.5-flash-preview-native-audio
gemini-2.5-flash-exp-native-audio-thinking-dialog
gemini-live-2.5-flash-preview
gemini-2.6.flash-preview-tts

Available Voices

The following voice options are available:

Puck (default): Conversational, friendly
Charon: Deep, authoritative
Kore: Neutral, professional
Fenrir: Warm, approachable

Authentication Methods

Gemini API (Development)

The simplest method using an API key from Google AI Studio :


const voice = new GeminiLiveVoice({
  apiKey: "your-api-key", // Required for Gemini API
  model: "gemini-2.0-flash-exp",
});

Vertex AI (Production)

For production use with OAuth authentication and Google Cloud Platform:


// Using service account key file
const voice = new GeminiLiveVoice({
  vertexAI: true,
  project: "your-gcp-project",
  location: "us-central1",
  serviceAccountKeyFile: "/path/to/service-account.json",
});
 
// Using Application Default Credentials
const voice = new GeminiLiveVoice({
  vertexAI: true,
  project: "your-gcp-project",
  location: "us-central1",
});
 
// Using service account impersonation
const voice = new GeminiLiveVoice({
  vertexAI: true,
  project: "your-gcp-project",
  location: "us-central1",
  serviceAccountEmail: "service-account@project.iam.gserviceaccount.com",
});

Advanced Features

Session Management

The Gemini Live API supports session resumption for handling network interruptions:


voice.on("sessionHandle", ({ handle, expiresAt }) => {
  // Store session handle for resumption
  saveSessionHandle(handle, expiresAt);
});
 
// Resume a previous session
const voice = new GeminiLiveVoice({
  sessionConfig: {
    enableResumption: true,
    maxDuration: "2h",
  },
});

Tool Calling

Enable the model to call functions during conversations:


import { z } from 'zod';
 
voice.addTools({
  weather: {
    description: "Get weather information",
    parameters: z.object({
      location: z.string(),
    }),
    execute: async ({ location }) => {
      const weather = await getWeather(location);
      return weather;
    },
  },
});
 
voice.on("toolCall", ({ name, args, id }) => {
  console.log(`Tool called: ${name} with args:`, args);
});

Notes

The Gemini Live API uses WebSockets for real-time communication
Audio is processed as 16kHz PCM16 for input and 24kHz PCM16 for output
The voice instance must be connected with connect() before using other methods
Always call close() when done to properly clean up resources
Vertex AI authentication requires appropriate IAM permissions (aiplatform.user role)
Session resumption allows recovery from network interruptions
The API supports real-time interactions with text and audio

Google Gemini Live Voice

Usage Example


import { GeminiLiveVoice } from "@mastra/voice-google-gemini-live";
import { playAudio, getMicrophoneStream } from "@mastra/node-audio";
 
// Initialize with Gemini API (using API key)
const voice = new GeminiLiveVoice({
  apiKey: process.env.GOOGLE_API_KEY, // Required for Gemini API
  model: "gemini-2.0-flash-exp",
  speaker: "Puck", // Default voice
  debug: true,
});
 
// Or initialize with Vertex AI (using OAuth)
const voiceWithVertexAI = new GeminiLiveVoice({
  vertexAI: true,
  project: "your-gcp-project",
  location: "us-central1",
  serviceAccountKeyFile: "/path/to/service-account.json",
  model: "gemini-2.0-flash-exp",
  speaker: "Puck",
});
 
// Or use the VoiceConfig pattern (recommended for consistency with other providers)
const voiceWithConfig = new GeminiLiveVoice({
  speechModel: {
    name: "gemini-2.0-flash-exp",
    apiKey: process.env.GOOGLE_API_KEY,
  },
  speaker: "Puck",
  realtimeConfig: {
    model: "gemini-2.0-flash-exp",
    apiKey: process.env.GOOGLE_API_KEY,
    options: {
      debug: true,
      sessionConfig: {
        interrupts: { enabled: true },
      },
    },
  },
});
 
// Establish connection (required before using other methods)
await voice.connect();
 
// Set up event listeners
voice.on("speaker", (audioStream) => {
  // Handle audio stream (NodeJS.ReadableStream)
  playAudio(audioStream);
});
 
voice.on("writing", ({ text, role }) => {
  // Handle transcribed text
  console.log(`${role}: ${text}`);
});
 
voice.on("turnComplete", ({ timestamp }) => {
  // Handle turn completion
  console.log("Turn completed at:", timestamp);
});
 
// Convert text to speech
await voice.speak("Hello, how can I help you today?", {
  speaker: "Charon", // Override default voice
  responseModalities: ["AUDIO", "TEXT"],
});
 
// Process audio input
const microphoneStream = getMicrophoneStream();
await voice.send(microphoneStream);
 
// Update session configuration
await voice.updateSessionConfig({
  speaker: "Kore",
  instructions: "Be more concise in your responses",
});
 
// When done, disconnect
await voice.disconnect();
// Or use the synchronous wrapper
voice.close();

Configuration

Constructor Options

apiKey?:

string

Google API key for Gemini API authentication. Required unless using Vertex AI.

model?:

GeminiVoiceModel

= 'gemini-2.0-flash-exp'

The model ID to use for real-time voice interactions.

speaker?:

GeminiVoiceName

= 'Puck'

Default voice ID for speech synthesis.

vertexAI?:

boolean

= false

Use Vertex AI instead of Gemini API for authentication.

project?:

string

Google Cloud project ID (required for Vertex AI).

location?:

string

= 'us-central1'

Google Cloud region for Vertex AI.

serviceAccountKeyFile?:

string

Path to service account JSON key file for Vertex AI authentication.

serviceAccountEmail?:

string

Service account email for impersonation (alternative to key file).

instructions?:

string

System instructions for the model.

sessionConfig?:

GeminiSessionConfig

Session configuration including interrupt and context settings.

debug?:

boolean

= false

Enable debug logging for troubleshooting.

Session Configuration

interrupts?:

object

Interrupt handling configuration.

interrupts.enabled?:

boolean

= true

Enable interrupt handling.

interrupts.allowUserInterruption?:

boolean

= true

Allow user to interrupt model responses.

contextCompression?:

boolean

= false

Enable automatic context compression.

Methods

connect()

Establishes a connection to the Gemini Live API. Must be called before using speak, listen, or send methods.

runtimeContext?:

object

Optional runtime context for the connection.

returns:

Promise<void>

Promise that resolves when the connection is established.

speak()

Converts text to speech and sends it to the model. Can accept either a string or a readable stream as input.

input:

string | NodeJS.ReadableStream

Text or text stream to convert to speech.

options?:

GeminiLiveVoiceOptions

Optional speech configuration.

options.speaker?:

GeminiVoiceName

= Constructor's speaker value

Voice ID to use for this specific speech request.

options.languageCode?:

string

Language code for the response.

options.responseModalities?:

('AUDIO' | 'TEXT')[]

= ['AUDIO', 'TEXT']

Response modalities to receive from the model.

Returns: Promise<void> (responses are emitted via speaker and writing events)

listen()

Processes audio input for speech recognition. Takes a readable stream of audio data and returns the transcribed text.

audioStream:

NodeJS.ReadableStream

Audio stream to transcribe.

options?:

GeminiLiveVoiceOptions

Optional listening configuration.

Returns: Promise<string> - The transcribed text

send()

Streams audio data in real-time to the Gemini service for continuous audio streaming scenarios like live microphone input.

audioData:

NodeJS.ReadableStream | Int16Array

Audio stream or buffer to send to the service.

Returns: Promise<void>

updateSessionConfig()

Updates the session configuration dynamically. This can be used to modify voice settings, speaker selection, and other runtime configurations.

config:

Partial<GeminiLiveVoiceConfig>

Configuration updates to apply.

Returns: Promise<void>

addTools()

tools:

ToolsInput

Tools configuration to equip.

Returns: void

addInstructions()

Adds or updates system instructions for the model.

instructions?:

string

System instructions to set.

Returns: void

answer()

Triggers a response from the model. This method is primarily used internally when integrated with an Agent.

options?:

Record<string, unknown>

Optional parameters for the answer request.

Returns: Promise<void>

getSpeakers()

Returns a list of available voice speakers for the Gemini Live API.

Returns: Promise<Array<{ voiceId: string; description?: string }>>

disconnect()

Disconnects from the Gemini Live session and cleans up resources. This is the async method that properly handles cleanup.

Returns: Promise<void>

close()

Synchronous wrapper for disconnect(). Calls disconnect() internally without awaiting.

Returns: void

on()

Registers an event listener for voice events.

event:

string

Name of the event to listen for.

callback:

Function

Function to call when the event occurs.

Returns: void

off()

Removes a previously registered event listener.

event:

string

Name of the event to stop listening to.

callback:

Function

The specific callback function to remove.

Returns: void

Events

The GeminiLiveVoice class emits the following events:

speaker:

event

Emitted when audio data is received from the model. Callback receives a NodeJS.ReadableStream.

speaking:

event

Emitted with audio metadata. Callback receives { audioData?: Int16Array, sampleRate?: number }.

writing:

event

Emitted when transcribed text is available. Callback receives { text: string, role: 'assistant' | 'user' }.

session:

event

Emitted on session state changes. Callback receives { state: 'connecting' | 'connected' | 'disconnected' | 'disconnecting' | 'updated', config?: object }.

turnComplete:

event

Emitted when a conversation turn is completed. Callback receives { timestamp: number }.

toolCall:

event

Emitted when the model requests a tool call. Callback receives { name: string, args: object, id: string }.

usage:

event

Emitted with token usage information. Callback receives { inputTokens: number, outputTokens: number, totalTokens: number, modality: string }.

error:

event

Emitted when an error occurs. Callback receives { message: string, code?: string, details?: unknown }.

interrupt:

event

Interrupt events. Callback receives { type: 'user' | 'model', timestamp: number }.

Available Models

The following Gemini Live models are available:

gemini-2.0-flash-exp (default)
gemini-2.0-flash-exp-image-generation
gemini-2.0-flash-live-001
gemini-live-2.5-flash-preview-native-audio
gemini-2.5-flash-exp-native-audio-thinking-dialog
gemini-live-2.5-flash-preview
gemini-2.6.flash-preview-tts

Available Voices

The following voice options are available:

Puck (default): Conversational, friendly
Charon: Deep, authoritative
Kore: Neutral, professional
Fenrir: Warm, approachable

Authentication Methods

Gemini API (Development)

The simplest method using an API key from Google AI Studio :


const voice = new GeminiLiveVoice({
  apiKey: "your-api-key", // Required for Gemini API
  model: "gemini-2.0-flash-exp",
});

Vertex AI (Production)

For production use with OAuth authentication and Google Cloud Platform:


// Using service account key file
const voice = new GeminiLiveVoice({
  vertexAI: true,
  project: "your-gcp-project",
  location: "us-central1",
  serviceAccountKeyFile: "/path/to/service-account.json",
});
 
// Using Application Default Credentials
const voice = new GeminiLiveVoice({
  vertexAI: true,
  project: "your-gcp-project",
  location: "us-central1",
});
 
// Using service account impersonation
const voice = new GeminiLiveVoice({
  vertexAI: true,
  project: "your-gcp-project",
  location: "us-central1",
  serviceAccountEmail: "service-account@project.iam.gserviceaccount.com",
});

Advanced Features

Session Management

The Gemini Live API supports session resumption for handling network interruptions:


voice.on("sessionHandle", ({ handle, expiresAt }) => {
  // Store session handle for resumption
  saveSessionHandle(handle, expiresAt);
});
 
// Resume a previous session
const voice = new GeminiLiveVoice({
  sessionConfig: {
    enableResumption: true,
    maxDuration: "2h",
  },
});

Tool Calling

Enable the model to call functions during conversations:


import { z } from 'zod';
 
voice.addTools({
  weather: {
    description: "Get weather information",
    parameters: z.object({
      location: z.string(),
    }),
    execute: async ({ location }) => {
      const weather = await getWeather(location);
      return weather;
    },
  },
});
 
voice.on("toolCall", ({ name, args, id }) => {
  console.log(`Tool called: ${name} with args:`, args);
});

Notes

The Gemini Live API uses WebSockets for real-time communication
Audio is processed as 16kHz PCM16 for input and 24kHz PCM16 for output
The voice instance must be connected with connect() before using other methods
Always call close() when done to properly clean up resources
Vertex AI authentication requires appropriate IAM permissions (aiplatform.user role)
Session resumption allows recovery from network interruptions
The API supports real-time interactions with text and audio