Building a voice research agent with Mastra, Render, and AssemblyAI

This week, the teams at Render and AssemblyAI shipped ravendr. It's a production voice research agent. The whole flow—classify, plan, search, synthesize, verify—happens in under 60 seconds with zero silence on the line.

Ryan Seams (AssemblyAI) and Ojus Save (Render) detailed the orchestration behind the build.

This post covers the agent-layer view.

Both posts dig into the same code.

Voice agents are multi-agent systems

Sure, you could throw every tool at one massive LLM and just let it run. Some teams do exactly that, and it can actually work for text chat. The model just figures everything out on its own.

But voice? That's a different story. It breaks down for a few key reasons.

First up: latency. Everything—tool selection, execution, synthesis—happens one after another. The user is just stuck waiting, listening to silence on the other end of the line.

Then there's the verification problem. A single loop has no natural checkpoint between steps, so there's no good spot to check the work. You only learn the answer is wrong when your user tells you.

And failures get messy. One bad classification can poison the entire loop. You can't just retry one part; you're stuck starting over.

That's why ravendr splits the work into distinct stages: classify, plan, search, synthesize, and verify. Each stage gets its own dedicated agent. The actual voice agent—the one your user talks to—sits upstream in AssemblyAI's Voice Agent API, handling the conversation.

The architecture

The architecture is simple: three layers, each handling what it does best.

Voice (AssemblyAI). User audio streams via WebSocket to AssemblyAI's Voice Agent API, which handles speech-to-text, the LLM conversation, and text-to-speech. If the model triggers a tool (say, "research this question for me"), AssemblyAI sends an event. ravendr picks that up and routes it into the research pipeline.

Orchestration (Render Workflows). The research pipeline is a Render Workflow task called research, broken into subtasks: classify_ask, plan_queries, search_branch, synthesize, and verify. Each subtask gets its own compute plan, timeout, retry policy, and replay log. Render's post explains the details.

Agents (Mastra). ravendr calls on Mastra agents for the classify, plan, synthesize, and verify steps. Each agent has one job and a very specific set of rules. The best part? They're loosely coupled. The agents don't know about Render Workflows, and the Workflows don't care about Mastra's internals.

One agent per stage

Here's how the Mastra agent factory pattern looks in src/mastra/agents.ts:

import { Agent } from "@mastra/core/agent";
// CLASSIFY_INSTRUCTIONS (the classifier's instructions string) is defined outside this excerpt.

function normalize(model: string): string {
  return model.includes("/") ? model : `anthropic/${model}`;
}
 
export function classifierAgent(anthropicModel: string): Agent {
  return new Agent({
    id: "ravendr-classifier",
    name: "ravendr-classifier",
    instructions: CLASSIFY_INSTRUCTIONS,
    model: normalize(anthropicModel),
  });
}

The model's just a string. Mastra's router parses that provider/model-name format and figures out where to send it. Instead of making these agents singletons, we build them fresh each call—that keeps model selection configurable. The instructions stay focused: the classifier's job is simply dropping the user's request into one of our predefined output shapes.

The planner, synthesizer, and verifier all use this same factory approach. Their prompts adapt based on whatever shape the classifier spit out.

JSON parsing happens down in the subtask layer. A tiny helper pulls JSON from the agent's text response, and if the model's output is broken, it falls back to something sensible. The verify step uses this exact same pattern.
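Here's a minimal sketch of what a helper like that could look like. The names extractJson and parseShapeSketch, and the three shape values, are made up for illustration; ravendr's real helper and shape list may differ.

```typescript
// Hypothetical shape names for illustration only; ravendr defines its own list.
const SHAPES = ["quick_fact", "comparison", "deep_dive"] as const;
type Shape = (typeof SHAPES)[number];
const FALLBACK_SHAPE: Shape = "quick_fact";

// Take the span from the first "{" to the last "}" so that prose or a
// markdown fence around the JSON is ignored. Returns undefined if it
// cannot be parsed.
function extractJson(text: string): unknown {
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) return undefined;
  try {
    return JSON.parse(text.slice(start, end + 1));
  } catch {
    return undefined;
  }
}

// Map the model's text to a known shape, falling back to a safe default
// when the output is missing, malformed, or names an unknown shape.
function parseShapeSketch(text: string): Shape {
  const parsed = extractJson(text);
  if (typeof parsed === "object" && parsed !== null && "shape" in parsed) {
    const candidate = (parsed as { shape: unknown }).shape;
    const match = SHAPES.find((s) => s === candidate);
    if (match !== undefined) return match;
  }
  return FALLBACK_SHAPE;
}
```

The point of the fallback is that a chatty or truncated model response degrades the answer a little instead of failing the subtask.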

To extend this with memory, tools, or scorers, see the Mastra docs.

Durability from Render and intelligence from Mastra

At its core, calling a Mastra agent is just a function call. To make it durable (restartable, retryable, observable), ravendr runs each one inside a Render Workflow subtask. Here's classify_ask:

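import { task } from "@renderinc/sdk/workflows";
import { classifierAgent } from "../../../mastra/agents.js";
// loadWorkflowConfig, parseShape, and events come from other ravendr modules;
// their imports are omitted from this excerpt.
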
export const classify_ask = task(
  {
    name: "classify_ask",
    plan: "starter",
    timeoutSeconds: 30,
    retry: { maxRetries: 2, waitDurationMs: 500, backoffScaling: 1.5 },
  },
  async function classify_ask(sessionId: string, topic: string) {
    const config = loadWorkflowConfig();
    const agent = classifierAgent(config.ANTHROPIC_MODEL);
    const result = await agent.generate(`User ask: "${topic}"\n\nClassify. JSON only.`);
    const shape = parseShape(result.text ?? "");
    await events.publish({ sessionId, kind: "ask.classified", shape });
    return { shape };
  },
);

Render handles the durability; Mastra handles the brains. The layers stay isolated because we pass data around, not control flow. The classifier outputs a shape. That shape feeds into the planner, which generates queries. Then the synthesizer takes the search results and builds a briefing.

For task definitions, plan selection, and the dashboard view, see Ryan and Ojus's write-up.

The voice channel: AssemblyAI Voice Agent API

Mastra and Render never actually touch the user's voice. AssemblyAI's Voice Agent API manages the WebSocket, STT, conversational model, and TTS, while ravendr exists purely as a tool for the agent to call.

Whenever the agent invokes the research tool, the voice_session Render task receives the event. It triggers the research chain and streams progress back. AssemblyAI's agent stays live on the call, saying "let me look that up for you" while the heavy processing runs in the background.

The browser-to-task connection relies on a reverse WebSocket pattern. The browser connects to a broker; the Render task does the same. The broker pairs them, leaving the browser and task connected without either side knowing the other's address.
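To make the pairing idea concrete, here's a rough in-memory sketch. Everything in it is hypothetical: the Peer interface stands in for a WebSocket, and keying the pairing by session id is an assumption, not something taken from the ravendr source.

```typescript
// Stand-in for a WebSocket connection: anything that can receive a message.
interface Peer {
  send(data: string): void;
}
type Role = "browser" | "task";

class PairingBroker {
  // One side waiting per session id until its counterpart shows up.
  private waiting = new Map<string, { role: Role; peer: Peer }>();
  // Once paired, each peer maps to its partner.
  private partners = new Map<Peer, Peer>();

  // Register one side. Returns true if this call completed a pair.
  join(sessionId: string, role: Role, peer: Peer): boolean {
    const other = this.waiting.get(sessionId);
    if (other !== undefined && other.role !== role) {
      this.waiting.delete(sessionId);
      this.partners.set(peer, other.peer);
      this.partners.set(other.peer, peer);
      return true;
    }
    this.waiting.set(sessionId, { role, peer });
    return false;
  }

  // Forward a message to the sender's partner. Returns false if unpaired.
  relay(from: Peer, data: string): boolean {
    const to = this.partners.get(from);
    if (to === undefined) return false;
    to.send(data);
    return true;
  }
}
```

Both sides dial out to the broker, so neither needs an address the other can reach. That matters when the task runs on compute that doesn't accept inbound connections.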

Every stage gets a budget

Think of the deadline not as one hard 60-second timeout, but as a budget that trickles down through the pipeline.

From src/render/tasks/research.ts:

const OVERALL_BUDGET_MS = 55_000; // 55s leaves ~5s for broker + browser render
const SYNTH_RESERVE_MS  = 12_000; // minimum runway to write the briefing
const VERIFY_RESERVE_MS = 6_000;  // minimum runway for verify
const RETRY_RESERVE_MS  = 35_000; // don't attempt a retry unless this much is left

These are reserves, not allocations. Synthesis is guaranteed at least 12 seconds of runway and verify at least 6, and we'll only attempt a retry if 35 seconds are still on the clock. Each stage checks the time and decides whether to run or bow out.
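One way to picture the "check the time, then decide" step is a small deadline helper plus a gate per stage. This is a sketch, not ravendr's code: startBudget, canVerify, and canRetry are hypothetical names, and treating each reserve as a simple threshold is an assumption.

```typescript
const OVERALL_BUDGET_MS = 55_000;
const VERIFY_RESERVE_MS = 6_000;
const RETRY_RESERVE_MS = 35_000;

// Hypothetical helper: start the clock once, then ask how many ms are left.
// The clock is injectable so the logic can be exercised without waiting.
function startBudget(totalMs: number, now: () => number = Date.now): () => number {
  const deadline = now() + totalMs;
  return () => Math.max(0, deadline - now());
}

// A stage runs only if its reserve still fits in what remains.
function canVerify(remainingMs: number): boolean {
  return remainingMs >= VERIFY_RESERVE_MS;
}

function canRetry(remainingMs: number): boolean {
  return remainingMs >= RETRY_RESERVE_MS;
}

// Usage: create the budget when the research task starts.
const remaining = startBudget(OVERALL_BUDGET_MS);
```

For example, 25 seconds into a run there are 30,000 ms left. That clears the 6,000 ms verify reserve but not the 35,000 ms retry reserve, so the pipeline would verify and then ship without retrying.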

This is where the search design really pays off. The planner spits out N queries—just 3 for a simple request, but up to 40 when we need to be thorough. Then ravendr races them in parallel using whatever budget remains:

const searchBudget = Math.max(5_000, remaining() - SYNTH_RESERVE_MS);
const branches = await racePartial(
  plan.queries.map((q) => search_branch(sessionId, q.angle, q.query, q.tier)),
  searchBudget,
);

racePartial returns whatever branches finish before the budget expires. So if we launch 30 searches and only 18 complete in time, the synthesizer works with those 18 results while the others keep chugging away as orphaned subtasks.

Here's the clever part: if nothing comes back in time, the pipeline doesn't throw an error. Instead, it seeds the synthesizer with an "overview" branch that says "no research results came back in time." The synthesizer can still write a "couldn't find anything" briefing. Users notice silence first, and a partial answer is always better than none.
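The partial race and the empty-result fallback described above could be sketched roughly like this. racePartialSketch and withFallback are illustrative stand-ins, not the functions in the repo; the Branch type and the choice to ignore rejected branches are assumptions.

```typescript
// Hypothetical result type; ravendr's search_branch return value may differ.
interface Branch {
  angle: string;
  text: string;
}

// Resolve with every value that settled before budgetMs elapsed.
// Rejected branches are ignored (an assumption), and results arrive in
// completion order rather than input order.
async function racePartialSketch<T>(tasks: Promise<T>[], budgetMs: number): Promise<T[]> {
  const finished: T[] = [];
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<void>((resolve) => {
    timer = setTimeout(resolve, budgetMs);
  });
  const tracked = tasks.map((p) =>
    p.then(
      (value) => {
        finished.push(value);
      },
      () => {
        // A failed branch simply contributes nothing.
      },
    ),
  );
  // Whichever comes first: every branch settled, or the budget ran out.
  await Promise.race([Promise.all(tracked), deadline]);
  clearTimeout(timer);
  // Copy so late finishers cannot change what the caller received.
  return finished.slice();
}

// If nothing came back, hand the synthesizer a placeholder branch
// instead of throwing.
function withFallback(branches: Branch[]): Branch[] {
  if (branches.length > 0) return branches;
  return [{ angle: "overview", text: "no research results came back in time" }];
}
```

Note that nothing here cancels the slow branches. They keep running after the function returns, which matches the "orphaned subtasks" behavior: the caller just stops waiting.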

Verify, retry once, default to ship

After synthesis finishes, the verifier runs on the same factory pattern. It checks the briefing to decide if the question was actually answered.

We parse the verdict exactly like the classifier's output. If that fails, the verifier defaults to pass:

} catch (err) {
  logger.warn({ err, raw: text.slice(0, 200) },
    "verify: unparseable verdict, defaulting to pass");
  return { passes: true, reason: "verifier output unreadable — defaulting to pass", feedback: "" };
}

A misbehaving guard fails open. The pipeline already holds a synthesized response, so the verifier acts as a guardrail rather than a gate.

When the verifier returns a clean fail, the feedback routes to the planner instead of the synthesizer. The next loop iteration calls plan_queries with that feedback baked in. New queries get drafted, fresh searches run, and another synthesis kicks off. One retry only; then ravendr ships whatever it has.
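Put together, the verify-and-retry loop might look something like the sketch below. The Stages interface and researchWithRetry are hypothetical, and search plus synthesis are collapsed into a single research step to keep it short.

```typescript
interface Verdict {
  passes: boolean;
  reason: string;
  feedback: string;
}

// Hypothetical stage functions; in ravendr these are Render subtasks.
interface Stages {
  plan(topic: string, feedback: string): Promise<string[]>;
  research(queries: string[]): Promise<string>; // search + synthesize
  verify(topic: string, briefing: string): Promise<Verdict>;
}

const RETRY_RESERVE_MS = 35_000;
const MAX_ATTEMPTS = 2; // the first pass plus one retry

async function researchWithRetry(
  topic: string,
  stages: Stages,
  remaining: () => number,
): Promise<string> {
  let feedback = "";
  let briefing = "";
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    // Verifier feedback flows into planning, not synthesis.
    const queries = await stages.plan(topic, feedback);
    briefing = await stages.research(queries);
    const verdict = await stages.verify(topic, briefing);
    if (verdict.passes) break;
    // Out of attempts or out of time: ship the latest briefing as is.
    if (attempt === MAX_ATTEMPTS || remaining() < RETRY_RESERVE_MS) break;
    feedback = verdict.feedback;
  }
  return briefing;
}
```

Every exit path returns a briefing. The verifier can trigger one more lap, but it can never leave the user with nothing.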

Streaming progress to the UI

Fifty-five seconds of dead air? The audio channel can't handle that alone. When your voice agent says "let me look that up for you," the UI needs to show what's happening behind the scenes. That's where phase-level progress comes in: ask.classified, plan.ready, youcom.call.started, verify.started, briefing.ready.

Here's how it works. Every Render subtask publishes events to a Postgres-backed event bus. The browser then subscribes through SSE—running through our reverse-tunnel broker.

Yeah, SSE's been around forever. But voice UX demands visual feedback when the audio channel's tied up. Users need to know something's actually happening.
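For a sense of what goes over the wire, here's a small sketch that turns a phase event into a Server-Sent Events frame. The event kinds are the ones listed above; the PhaseEvent type and toSseFrame helper are hypothetical, and the real payload fields may differ.

```typescript
type PhaseKind =
  | "ask.classified"
  | "plan.ready"
  | "youcom.call.started"
  | "verify.started"
  | "briefing.ready";

// Hypothetical payload: a session id, a kind, and any stage-specific extras.
interface PhaseEvent {
  sessionId: string;
  kind: PhaseKind;
  [extra: string]: unknown;
}

// Serialize one event as an SSE frame: optional id, event name, JSON data,
// then a blank line to terminate the frame.
function toSseFrame(phase: PhaseEvent, id?: number): string {
  const lines: string[] = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  lines.push(`event: ${phase.kind}`);
  lines.push(`data: ${JSON.stringify(phase)}`);
  return lines.join("\n") + "\n\n";
}
```

On the browser side, an EventSource listener registered for a kind such as "plan.ready" would fire once per matching frame, which is enough to drive a step-by-step progress display.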

Concurrency, deployment, env

ravendr handles up to 100 concurrent sessions, each with a 15-minute TTL. There's also a cleanup daemon that takes care of expired sessions. For deployment, you'll use a render.yaml that sets up two services—web and workflow—plus a Postgres database for storing session state and powering the event bus. You'll need four environment variables: ANTHROPIC_API_KEY, ASSEMBLYAI_API_KEY, YOUCOM_API_KEY, and a Render API key. Full setup is in the ravendr README.
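As a rough picture of how a session cap and TTL fit together, here's an in-memory sketch. It's a stand-in only: ravendr keeps session state in Postgres, and the SessionRegistry class and its method names are hypothetical.

```typescript
const MAX_SESSIONS = 100;
const SESSION_TTL_MS = 15 * 60 * 1000;

class SessionRegistry {
  private readonly expiresAt = new Map<string, number>();
  private readonly now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Admit a session if there is room. Returns false when at capacity.
  open(sessionId: string): boolean {
    this.sweep();
    if (!this.expiresAt.has(sessionId) && this.expiresAt.size >= MAX_SESSIONS) {
      return false;
    }
    this.expiresAt.set(sessionId, this.now() + SESSION_TTL_MS);
    return true;
  }

  // Drop sessions whose TTL has passed. Returns how many were removed.
  sweep(): number {
    const cutoff = this.now();
    let removed = 0;
    this.expiresAt.forEach((expiry, id) => {
      if (expiry <= cutoff) {
        this.expiresAt.delete(id);
        removed += 1;
      }
    });
    return removed;
  }

  get size(): number {
    return this.expiresAt.size;
  }
}
```

A cleanup daemon in this picture is just something that calls sweep on a timer, so that expired sessions free up capacity even when no new session is knocking.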

What this gives you

Your voice channel and agent work don't need to wait on each other. AssemblyAI handles the audio while Render Workflows does the heavy lifting in parallel, so your users never hear dead air.

Each research stage runs as its own small agent, not just another slice of some massive prompt. That's where Mastra's per-stage agent factories really shine.

Every stage gets its own budget. The pipeline knows how to fail gracefully: if the search fan-out blows past its budget, it synthesizes whatever came back; if the verifier's output can't be parsed, the briefing still ships; if there's no time for a retry, it takes the first answer. It always delivers something.

Render and AssemblyAI built ravendr as a reference implementation. Clone it, deploy it, start talking to it. Throw in memory, tools, or scorers when your project needs them.

git clone https://github.com/render-examples/ravendr

For the orchestration deep-dive, see Ryan and Ojus's post. For the agent layer, the Mastra docs.