Voice/video is the final boss of agent latency. One of our users shared eng + context principles they used to deliver accurate answers in <3s RTT.
@microHQ runs Mastra agents on top of Recall.ai's real-time transcripts to help salespeople and founders respond to customer questions. Because the agent is advising a human mid-call, its answers have to arrive while the question is still live.
So the first thing they do is optimize context: treat the live transcript as an event stream, keep a rolling local buffer, and send only the recent relevant window into the agent.
The flow is roughly:
- Recall.ai sends transcript.partial_data and transcript.data webhooks in low-latency mode.
- They fan those out to the client over WebSocket as partial/final utterances.
- The UI renders the transcript immediately.
- Separately, the “meeting intelligence” loop wakes up after the first few utterances, then every ~5 new utterances or ~30s of new content.
- That loop sends a rolling window, currently the last ~60 utterances client-side and capped again server-side to the last ~4k chars, into the agent.
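A minimal sketch of the buffer and wake-up rule described above. The class name, method names, and the warm-up count of 3 are assumptions for illustration; the other thresholds mirror the approximate numbers in the list. "~30s of new content" is interpreted here as "at least one new utterance and 30s since the last run", which may differ from what Micro actually does.

```typescript
interface Utterance {
  speaker: string;
  text: string;
}

const MAX_UTTERANCES = 60;    // rolling window size (client-side in the post)
const MAX_CHARS = 4000;       // second cap (server-side in the post)
const WARMUP_UTTERANCES = 3;  // "first few" utterances, assumed value
const EVERY_N_UTTERANCES = 5;
const EVERY_MS = 30000;

class TranscriptBuffer {
  private utterances: Utterance[] = [];
  private freshCount = 0; // final utterances seen since the last run
  private lastRunAtMs: number | null = null;

  // Only final utterances are stored; partials are for rendering only.
  addFinal(u: Utterance): void {
    this.utterances.push(u);
    if (this.utterances.length > MAX_UTTERANCES) this.utterances.shift();
    this.freshCount += 1;
  }

  // True when the intelligence loop should wake up.
  shouldRun(nowMs: number): boolean {
    if (this.lastRunAtMs === null) return this.freshCount >= WARMUP_UTTERANCES;
    if (this.freshCount === 0) return false;
    return (
      this.freshCount >= EVERY_N_UTTERANCES ||
      nowMs - this.lastRunAtMs >= EVERY_MS
    );
  }

  markRun(nowMs: number): void {
    this.freshCount = 0;
    this.lastRunAtMs = nowMs;
  }

  // Recent utterances as text, keeping only the trailing MAX_CHARS characters.
  // The cut can land mid-utterance; that is accepted here for simplicity.
  recentWindow(): string {
    const text = this.utterances
      .map((u) => `${u.speaker}: ${u.text}`)
      .join("\n");
    return text.length > MAX_CHARS ? text.slice(-MAX_CHARS) : text;
  }
}
```

The caller would check shouldRun() after each final utterance (and on a timer), send recentWindow() to the agent, then call markRun().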
For long calls, the important thing is: don’t keep appending the whole transcript to the prompt. Use a rolling window plus durable state. The live path should care about “what was just said.”
If you need continuity across 60+ minutes, maintain a compact running summary / open-issues / commitments object and include that alongside the most recent transcript slice.
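One way to picture that durable state, as a sketch only: the CallState shape, the item cap, and the prompt layout below are all hypothetical, not Micro's actual schema. The point is that the prompt stays bounded because both the state object and the transcript slice are bounded.

```typescript
interface CallState {
  summary: string;       // compact running summary of the call so far
  openIssues: string[];
  commitments: string[];
}

const MAX_ITEMS = 10; // assumed cap so the state itself stays small

// Merge new items into the state, dropping duplicates and keeping the newest.
function mergeState(prev: CallState, update: Partial<CallState>): CallState {
  const merge = (a: string[], b: string[] = []): string[] => {
    const all = a.concat(b);
    return all.filter((x, i) => all.indexOf(x) === i).slice(-MAX_ITEMS);
  };
  return {
    summary: update.summary ?? prev.summary,
    openIssues: merge(prev.openIssues, update.openIssues),
    commitments: merge(prev.commitments, update.commitments),
  };
}

// Combine the durable state with only the most recent transcript slice.
function buildPrompt(state: CallState, recentTranscript: string): string {
  const list = (items: string[]): string =>
    items.length > 0 ? items.map((i) => `- ${i}`).join("\n") : "- (none)";
  return [
    "Summary so far:",
    state.summary || "(no summary yet)",
    "Open issues:",
    list(state.openIssues),
    "Commitments:",
    list(state.commitments),
    "Most recent transcript:",
    recentTranscript,
  ].join("\n");
}
```

How the summary gets refreshed (for example, by a slower background agent call) is left out here; the source does not say.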
For live output, they use two channels:
- WebSocket for transcript updates.
- SSE for agent output.
The backend calls streamText, writes each delta as an SSE text event, and then sends a final done event with parsed sections like insights, talking points, and questions.
The UI shows the raw streamed text immediately, then swaps into structured cards/chips when the final parse arrives.
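A rough sketch of the "stream deltas, then send a parsed done event" pattern. It deliberately avoids any real SDK call: `deltas` stands in for whatever async text stream the model call exposes, and `write` stands in for the HTTP response writer. The section header convention ("Insights:", "Talking points:", "Questions:" followed by "- " bullets) is an assumption about the model's output format, not something the source specifies.

```typescript
// One Server-Sent Events frame: "event: <name>\ndata: <json>\n\n".
// JSON.stringify escapes newlines, so each frame has a single data line.
function sseFrame(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

interface ParsedSections {
  insights: string[];
  talkingPoints: string[];
  questions: string[];
}

const HEADERS = new Map<string, keyof ParsedSections>([
  ["insights:", "insights"],
  ["talking points:", "talkingPoints"],
  ["questions:", "questions"],
]);

// Parse the completed text into sections (assumed header + bullet format).
function parseSections(fullText: string): ParsedSections {
  const sections: ParsedSections = { insights: [], talkingPoints: [], questions: [] };
  let current: keyof ParsedSections | null = null;
  for (const raw of fullText.split("\n")) {
    const line = raw.trim();
    const header = HEADERS.get(line.toLowerCase());
    if (header !== undefined) {
      current = header;
    } else if (current !== null && line.startsWith("- ")) {
      sections[current].push(line.slice(2).trim());
    }
  }
  return sections;
}

// Forward each delta as a "text" event, then emit one "done" event
// carrying the structured parse of the full response.
async function streamAsSse(
  deltas: AsyncIterable<string>,
  write: (frame: string) => void,
): Promise<ParsedSections> {
  let fullText = "";
  for await (const delta of deltas) {
    fullText += delta;
    write(sseFrame("text", { delta }));
  }
  const sections = parseSections(fullText);
  write(sseFrame("done", sections));
  return sections;
}
```

On the client, the "text" events drive the raw streaming view and the "done" event triggers the swap to cards/chips.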
The Micro team built quickly and kept iterating. Latency-wise, the biggest wins came from:
- using Recall’s low-latency streaming transcript mode
- not invoking the agent on every partial
- keeping prompts small with a rolling transcript window
- using a fast sub-agent model with tight maxOutputTokens
- streaming the model response instead of waiting for completion
- doing enrichment in parallel with a hard timeout, currently ~3s
- aborting stale in-flight intelligence requests when fresher transcript context arrives
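The last item, aborting stale requests, can be sketched with the standard AbortController. The LatestOnlyRunner class and the AgentCall signature are hypothetical; the assumption is that whatever function calls the agent accepts an AbortSignal and rejects when it fires.

```typescript
// Hypothetical signature for "call the agent with this transcript window".
type AgentCall = (transcriptWindow: string, signal: AbortSignal) => Promise<string>;

// Keeps at most one intelligence request in flight; a newer run cancels the older one.
class LatestOnlyRunner {
  private inFlight: AbortController | null = null;
  private callAgent: AgentCall;

  constructor(callAgent: AgentCall) {
    this.callAgent = callAgent;
  }

  // Resolves with the agent output, or null if a newer run superseded this one.
  async run(transcriptWindow: string): Promise<string | null> {
    if (this.inFlight !== null) this.inFlight.abort(); // cancel the stale request
    const controller = new AbortController();
    this.inFlight = controller;
    try {
      const result = await this.callAgent(transcriptWindow, controller.signal);
      // Guard against callers that ignore the signal and still resolve.
      return controller.signal.aborted ? null : result;
    } catch (err) {
      if (controller.signal.aborted) return null; // superseded, not a real failure
      throw err;
    } finally {
      if (this.inFlight === controller) this.inFlight = null;
    }
  }
}
```

Besides saving tokens, this keeps the UI from flashing advice about a part of the conversation that has already moved on.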
Slower side tasks, like enrichment, run in parallel in the background.
They kick off attendee/contact lookup and vault search together, wait up to a short cap, and proceed with whatever came back.
If a lookup misses the fast path, it doesn’t block the live assistant.
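A sketch of "start both lookups, wait up to a cap, proceed with whatever came back". The function names and the Enrichment shape are illustrative; the real lookups would be passed in as `lookupContacts` and `searchVault`. Note that a lookup that times out is not cancelled here, it just stops being waited on.

```typescript
const ENRICHMENT_TIMEOUT_MS = 3000; // the "~3s" hard cap

// Resolve with the task's value, or null if it fails or misses the deadline.
function withTimeout<T>(task: Promise<T>, ms: number): Promise<T | null> {
  return new Promise<T | null>((resolve) => {
    const timer = setTimeout(() => resolve(null), ms);
    task.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      () => {
        clearTimeout(timer);
        resolve(null);
      },
    );
  });
}

interface Enrichment {
  contacts: string[] | null;  // null means "not available in time"
  vaultHits: string[] | null;
}

async function enrich(
  lookupContacts: () => Promise<string[]>,
  searchVault: () => Promise<string[]>,
  timeoutMs: number = ENRICHMENT_TIMEOUT_MS,
): Promise<Enrichment> {
  // Both lookups are started before either is awaited, so they run concurrently
  // and the total wait is bounded by timeoutMs, not the sum of both.
  const [contacts, vaultHits] = await Promise.all([
    withTimeout(lookupContacts(), timeoutMs),
    withTimeout(searchVault(), timeoutMs),
  ]);
  return { contacts, vaultHits };
}
```

The agent prompt can then include only the non-null fields, so a slow vault search degrades the answer slightly instead of delaying it.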
