Building Low-Latency Agents That Help Humans Respond on Live Calls

The engineering and context principles Micro uses to deliver sub-3s answers on live calls with Mastra.

Sam Bhagwat

Jun 2, 2026


Voice and video are the final boss of agent latency. One of our users shared the engineering and context principles they used to deliver accurate answers in under 3 seconds round trip.

@microHQ runs Mastra agents on top of Recall.ai's real-time transcripts to help salespeople and founders respond to customer questions. The agents advise a human who is mid-call, so their suggestions have to arrive in real time.

So the first thing they do is optimize context: treat the live transcript as an event stream, keep a rolling local buffer, and send only the recent relevant window into the agent.

The flow is roughly:

  1. Recall.ai sends transcript.partial_data and transcript.data webhooks in low-latency mode.
  2. They fan those out to the client over WebSocket as partial/final utterances.
  3. The UI renders the transcript immediately.
  4. Separately, the “meeting intelligence” loop wakes up after the first few utterances, then every ~5 new utterances or ~30s of new content.
  5. That loop sends a rolling window into the agent: currently the last ~60 utterances, selected client-side, then capped again server-side to the last ~4k characters.
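
To make steps 4 and 5 concrete, here is a minimal sketch of a rolling buffer with a wake-up check. It is not Micro's code: the class, its method names, and the warm-up count of 3 are hypothetical, and "~30s of new content" is interpreted here as "at least one new utterance and 30 seconds since the last run". The other thresholds mirror the approximate figures above.

```typescript
// Hypothetical sketch; names and the warm-up value are assumptions.
interface Utterance {
  speaker: string;
  text: string;
}

const MAX_UTTERANCES = 60;    // rolling window size (~60 utterances)
const MAX_CHARS = 4000;       // second cap (~4k chars)
const WARMUP_UTTERANCES = 3;  // "first few utterances" (assumed value)
const EVERY_N_UTTERANCES = 5; // ~5 new utterances
const EVERY_MS = 30000;       // ~30s

class TranscriptBuffer {
  private utterances: Utterance[] = [];
  private totalSeen = 0;
  private seenAtLastRun = 0;
  private lastRunAtMs: number | null = null;

  // Buffer final utterances only; partials are just rendered in the UI.
  addFinal(u: Utterance): void {
    this.utterances.push(u);
    if (this.utterances.length > MAX_UTTERANCES) this.utterances.shift();
    this.totalSeen += 1;
  }

  // Should the intelligence loop wake up now?
  shouldRun(nowMs: number): boolean {
    if (this.lastRunAtMs === null) return this.totalSeen >= WARMUP_UTTERANCES;
    const fresh = this.totalSeen - this.seenAtLastRun;
    if (fresh === 0) return false; // nothing new to say anything about
    return fresh >= EVERY_N_UTTERANCES || nowMs - this.lastRunAtMs >= EVERY_MS;
  }

  // Record that the loop ran, so later checks count only newer utterances.
  markRun(nowMs: number): void {
    this.seenAtLastRun = this.totalSeen;
    this.lastRunAtMs = nowMs;
  }

  // Recent utterances as text, trimmed to the last MAX_CHARS characters.
  window(): string {
    const text = this.utterances.map((u) => `${u.speaker}: ${u.text}`).join("\n");
    return text.length > MAX_CHARS ? text.slice(-MAX_CHARS) : text;
  }
}
```

A caller would check `shouldRun(Date.now())` after each final utterance, and on a timer, then pass `window()` to the agent and call `markRun`. In this sketch both caps live in one class for brevity, whereas the post describes one cap on the client and one on the server.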

For long calls, the important thing is: don’t keep appending the whole transcript to the prompt. Use a rolling window plus durable state. The live path should care about “what was just said.”

If you need continuity across 60+ minutes, maintain a compact running summary / open-issues / commitments object and include that alongside the most recent transcript slice.
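
One way to picture the "rolling window plus durable state" idea is the sketch below. The `CallState` shape, the `mergeState` and `buildPrompt` helpers, and the item cap are all illustrative assumptions. The post only says the object holds a running summary, open issues, and commitments.

```typescript
// Hypothetical shape for the compact durable state.
interface CallState {
  summary: string;
  openIssues: string[];
  commitments: string[];
}

const MAX_ITEMS = 10; // assumed cap so the state itself stays small

// Merge an update into the state, de-duplicating and keeping the newest items.
function mergeState(prev: CallState, update: Partial<CallState>): CallState {
  const merge = (a: string[], b: string[] = []): string[] =>
    Array.from(new Set([...a, ...b])).slice(-MAX_ITEMS);
  return {
    summary: update.summary ?? prev.summary,
    openIssues: merge(prev.openIssues, update.openIssues),
    commitments: merge(prev.commitments, update.commitments),
  };
}

// Build the prompt from durable state plus the recent transcript slice only.
function buildPrompt(state: CallState, recentTranscript: string): string {
  const list = (items: string[]): string =>
    items.length > 0 ? items.map((i) => `- ${i}`).join("\n") : "- (none)";
  return [
    "Summary of the call so far:",
    state.summary || "(none yet)",
    "",
    "Open issues:",
    list(state.openIssues),
    "",
    "Commitments made:",
    list(state.commitments),
    "",
    "Most recent transcript:",
    recentTranscript,
  ].join("\n");
}
```

With this shape the prompt size is bounded by the state cap plus the transcript window, regardless of how long the call runs.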

For live output, they use two channels:

  • WebSocket for transcript updates.
  • SSE for agent output.

The backend calls streamText, writes each delta as an SSE text event, and then sends a final done event with parsed sections like insights, talking points, and questions.

The UI shows the raw streamed text immediately, then swaps into structured cards/chips when the final parse arrives.
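
The SSE side of this can be sketched without any framework. The example below takes any async iterable of text deltas (standing in for the model's text stream), writes one `text` event per delta, and finishes with a `done` event carrying parsed sections. The payload shapes, the `pipeToSse` and `sseFrame` helpers, and the line-prefix format that `parseSections` expects are assumptions for illustration, not Micro's wire format.

```typescript
interface ParsedSections {
  insights: string[];
  talkingPoints: string[];
  questions: string[];
}

// One Server-Sent Events frame: an "event:" line, a "data:" line, a blank line.
// JSON.stringify escapes newlines, so the payload always fits on one data line.
function sseFrame(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Hypothetical parser: assumes the model was prompted to emit lines prefixed
// with "INSIGHT:", "TALKING POINT:", or "QUESTION:".
function parseSections(fullText: string): ParsedSections {
  const out: ParsedSections = { insights: [], talkingPoints: [], questions: [] };
  for (const line of fullText.split("\n")) {
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const label = line.slice(0, idx).trim().toUpperCase();
    const body = line.slice(idx + 1).trim();
    if (!body) continue;
    if (label === "INSIGHT") out.insights.push(body);
    else if (label === "TALKING POINT") out.talkingPoints.push(body);
    else if (label === "QUESTION") out.questions.push(body);
  }
  return out;
}

// Forward each delta immediately, then send the structured result once.
async function pipeToSse(
  textStream: AsyncIterable<string>,
  write: (frame: string) => void,
): Promise<void> {
  let full = "";
  for await (const delta of textStream) {
    full += delta;
    write(sseFrame("text", { delta }));
  }
  write(sseFrame("done", parseSections(full)));
}
```

In a real handler, `write` would be the HTTP response's write function on a `text/event-stream` response. The client appends `text` deltas as they arrive and replaces them with cards when `done` lands.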

The Micro team built quickly and kept iterating. Latency-wise, the biggest wins came from:

  • using Recall’s low-latency streaming transcript mode
  • not invoking the agent on every partial
  • keeping prompts small with a rolling transcript window
  • using a fast sub-agent model with tight maxOutputTokens
  • streaming the model response instead of waiting for completion
  • doing enrichment in parallel with a hard timeout, currently ~3s
  • aborting stale in-flight intelligence requests when fresher transcript context arrives
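
The last item, aborting stale requests, is a small pattern worth spelling out. The sketch below uses the standard `AbortController`. The `LatestOnlyRunner` class and the `RunAgent` signature are hypothetical; the real agent call would need to accept and honor an abort signal for the cancellation to actually stop the model request.

```typescript
// Hypothetical agent call that accepts an AbortSignal.
type RunAgent = (transcriptWindow: string, signal: AbortSignal) => Promise<string>;

class LatestOnlyRunner {
  private inFlight: AbortController | null = null;
  private readonly runAgent: RunAgent;

  constructor(runAgent: RunAgent) {
    this.runAgent = runAgent;
  }

  // Resolves with the agent output, or null if a newer request superseded it.
  async run(transcriptWindow: string): Promise<string | null> {
    this.inFlight?.abort(); // cancel whatever was still running
    const controller = new AbortController();
    this.inFlight = controller;
    try {
      const result = await this.runAgent(transcriptWindow, controller.signal);
      // Drop results that finished after being superseded.
      return controller.signal.aborted ? null : result;
    } catch (err) {
      if (controller.signal.aborted) return null; // expected: we cancelled it
      throw err; // a real failure, surface it
    } finally {
      if (this.inFlight === controller) this.inFlight = null;
    }
  }
}
```

Callers treat `null` as "ignore this result", so the UI only ever renders output computed from the freshest transcript window.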

Slower side tasks stay off the live path: enrichment runs in parallel, in the background.

They kick off attendee/contact lookup and vault search together, wait up to a short cap, and proceed with whatever came back.

If a lookup misses that cap, the live assistant carries on without it instead of blocking.
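
A minimal sketch of that "parallel with a hard cap" behavior follows. The `withTimeout` and `enrich` helpers and the two lookup function parameters are hypothetical stand-ins for the attendee/contact lookup and vault search. The 3000 ms default reflects the ~3s figure mentioned above.

```typescript
const ENRICHMENT_TIMEOUT_MS = 3000; // ~3s hard cap

// Resolve with the task's value, or with null if it fails or exceeds the cap.
function withTimeout<T>(task: Promise<T>, ms: number): Promise<T | null> {
  return new Promise<T | null>((resolve) => {
    const timer = setTimeout(() => resolve(null), ms);
    task.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      () => {
        clearTimeout(timer);
        resolve(null);
      },
    );
  });
}

interface Enrichment {
  contacts: string[] | null;  // null means "did not arrive in time"
  vaultHits: string[] | null;
}

// Start both lookups together; each gets the same cap independently.
async function enrich(
  lookupContacts: () => Promise<string[]>,
  searchVault: () => Promise<string[]>,
  timeoutMs: number = ENRICHMENT_TIMEOUT_MS,
): Promise<Enrichment> {
  const [contacts, vaultHits] = await Promise.all([
    withTimeout(lookupContacts(), timeoutMs),
    withTimeout(searchVault(), timeoutMs),
  ]);
  return { contacts, vaultHits };
}
```

Because failures and timeouts both collapse to `null`, the caller can build the prompt from whatever fields are present and never waits longer than the cap.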


Sam Bhagwat is the founder and CEO of Mastra. He co-founded Gatsby, which was used by hundreds of thousands of developers. A Stanford graduate and veteran of web development, he authored 'Principles of Building AI Agents' (2025).
