Building a NotebookLM clone with an agent orchestrator

Jan 16, 2025

We've recently been polishing up our multi-agent functionality, and decided to build a NotebookLM clone to show off some of the capabilities.

Here's the deployed version and the code.

The initial deterministic workflow

We created a deterministic workflow using Mastra's workflow functionality to get the source material into the DB and embedded into pgvector for summarization.

The first step was to use LlamaIndex Cloud to parse the uploaded source material and turn it into a markdown file. Then, we used Mastra's @mastra/rag package to chunk the markdown file, and embed the chunks into our Postgres DB using PgVectorStore.
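
In sketch form, that ingestion step looks something like the following. The chunking options, embedding model, and vector-store calls shown here are illustrative rather than lifted from the repo, and import paths may differ by Mastra version:

import { MDocument, embed } from "@mastra/rag";
import { PgVector } from "@mastra/pg"; // import path/package name may differ by version

// `markdown` is the parsed output we get back from LlamaIndex Cloud
const doc = MDocument.fromMarkdown(markdown);

// Split the markdown into embedding-sized chunks
const chunks = await doc.chunk({
  strategy: "recursive",
  size: 512,
  overlap: 50,
});

// Embed every chunk (the model choice here is an assumption)
const { embeddings } = await embed(chunks, {
  provider: "OPEN_AI",
  model: "text-embedding-3-small",
});

// Store the vectors plus the chunk text in Postgres via pgvector
// (index name and upsert signature are illustrative)
const pgVector = new PgVector(process.env.POSTGRES_CONNECTION_STRING!);
await pgVector.createIndex("source_chunks", 1536);
await pgVector.upsert(
  "source_chunks",
  embeddings,
  chunks.map((chunk) => ({ text: chunk.text })),
);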

Next, we needed to turn that into an audio podcast, with source material included.

The decision to use an agent orchestrator

There were two fundamental agent architecture approaches we could have taken.

Because we were essentially building an audio processing pipeline, we could have made another structured workflow with a defined set of steps.

However, we found that giving a single "orchestrator" agent a number of tools, each of which called a separate agent and could be invoked in any order, worked well enough... assuming we wrote an INCREDIBLY DETAILED prompt.

The orchestrator agent and its tools

But we'll get to that in a bit. Here's the code for the orchestrator agent:

import { Agent } from "@mastra/core";

// orchestratorInstructions and the tool definitions are imported from
// elsewhere in the project; the instructions themselves are shown later in this post.
export const orchestrator = new Agent({
  name: "orchestrator",
  instructions: orchestratorInstructions,
  model: {
    provider: "ANTHROPIC",
    name: "claude-3-5-sonnet-20241022",
    toolChoice: "required",
  },
  tools: {
    validateSourcesAvailability,
    querySourceSummaryAndChunks,
    savePodcastDetails,
    generatePodcastOutline,
    generatePodcastScript,
    submitForAudioProduction,
  },
});
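
Once it's defined, kicking off a run is a single call to this agent. The exact generate call and message shape have changed across Mastra versions, so treat this as a sketch of the idea rather than the literal code:

const notebookId = "example-notebook-id"; // would come from the request in practice

// Hand the orchestrator one user message and let it drive the tools from there.
const result = await orchestrator.generate([
  {
    role: "user",
    content: `Generate a podcast for notebook ${notebookId} using its available sources.`,
  },
]);

console.log(result.text);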

Let's dive into some of the tools listed here. Quick summaries:

  • validateSourcesAvailability does a "normal" database query to check if the sources are available.
  • querySourceSummaryAndChunks uses the embed function from @mastra/rag to generate embeddings for a query, and then uses pgvector to find the closest embeddings in the database (see the sketch after this list).
  • generatePodcastOutline uses a specialized agent that is good at extracting insights and turning them into an outline.
  • generatePodcastScript uses a specialized agent that can take that outline and turn it into a script.
  • submitForAudioProduction calls a specialized API, play.ai, that can turn the two-person script into a podcast. (We tried other audio APIs, like ElevenLabs, but found that Play.ai did the best job at making the interchange between the two hosts sound natural.)
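
The RAG tool is the most involved of the lot. Stripped of its tool wrapper, it boils down to something like this (the embed call and vector-store query signature are illustrative, not copied from the repo):

import { embed } from "@mastra/rag";
import { PgVector } from "@mastra/pg"; // import path/package name may differ by version

const pgVector = new PgVector(process.env.POSTGRES_CONNECTION_STRING!);

// Simplified core of querySourceSummaryAndChunks
async function querySourceChunks(
  query: string,
  notebookId: string,
  topK = 5,
  minSimilarity = 0.5,
) {
  // Embed the query with the same model used at ingestion time
  const { embedding } = await embed(query, {
    provider: "OPEN_AI",
    model: "text-embedding-3-small",
  });

  // Nearest-neighbor search in pgvector, scoped to this notebook's sources
  const matches = await pgVector.query("source_chunks", embedding, topK, {
    notebookId,
  });

  // Drop anything below the similarity threshold before handing it back
  return matches.filter((match) => match.score >= minSimilarity);
}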

You'll notice that many of these tools are specialized agents, and one is doing RAG. The hard work, though, was describing them in a way that the LLM could understand.
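
Describing them well starts with the tool definitions themselves. For the agent-backed tools, the shape is roughly the following (createTool field names have shifted between Mastra versions, and the outline agent module here is hypothetical):

import { createTool } from "@mastra/core";
import { z } from "zod";
import { outlineAgent } from "./agents/outline"; // hypothetical module holding the specialized outline agent

export const generatePodcastOutline = createTool({
  id: "generatePodcastOutline",
  description:
    "Generates a show outline for the podcast from instructions and a list of key insights.",
  inputSchema: z.object({
    instructions: z.string(),
    keyInsights: z.array(z.string()),
  }),
  execute: async ({ context }) => {
    // Delegate to the specialized outline agent and return its text output
    const result = await outlineAgent.generate([
      {
        role: "user",
        content: `${context.instructions}\n\nKey insights:\n- ${context.keyInsights.join("\n- ")}`,
      },
    ]);
    return { outline: result.text };
  },
});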

The orchestrator agent's instructions

Now let's take a look at the instructions we gave to the orchestrator agent. The prompt is...a bit of a beast:


export const orchestratorInstructions = `
You are an orchestrator tasked with coordinating the generation of a podcast from written sources.
Your role is to manage the entire process, from content research to final audio production, ensuring quality and coherence throughout.

Phases
1. Validate all required sources are available
2. Query source summaries and chunks to get the available content
3. Identify key insights and themes
4. Generate an outline for the podcast targeting 15 to 30 minutes
5. Use the outline to generate a script
6. Review the script

Script requirements
- Maintain consistent voice and tone
- The script should only contain the spoken words of the hosts
- The script should NOT include non-verbal cues, directions, instructions, etc
- The script should be formatted in the following way
  - Prefix each speaker's turn with either 'Host 1:' or 'Host 2:'
  - Example format:
      Host 1: Hello there. Today we're talking about something very interesting.
      Host 2: Very interesting doesn't even begin to describe how interesting this is, I'm particularly fascinated...

Tools
You have access to the following tools to help you with your task
- 'validateSourcesAvailability': this tool helps you validate if the sources are available. it accepts a notebookId, which will retrieve all relevant sources and it will return an object with the following shape
- 'querySourceSummaryAndChunks': This tool takes a query string, notebookId, similarity threshold, and limit as input, and returns an array of sources (containing sourceId, sourceTitle, sourceSummary, and sourceChunks) by comparing vector embeddings in a PostgreSQL database.
- 'submitForAudioProduction': This tool accepts a podcast transcript and voice configuration options as input, submits it to the PlayDialog API for processing with alternating voices, and returns the URL the user will use to poll for completion of the audio production job.
- 'savePodcastDetails': This tool is used to save podcast details like the audio_url and podcast_script for the notebook. Use it to always save the details you get. You don't have to pass all the details at the same time. Ensure you save the script before you submit for podcast generation.
- 'generatePodcastOutline': This tool is used to generate a show outline for the podcast. This outline will be used to plan the scripting process. You need to pass instructions and a list of key insights and it will give you back the outline
- 'generatePodcastScript': This tool is used to generate script for the podcast. Look at the result of this tool and make sure it follows the prescribed format and is long enough for the target time, you can use it again to regenerate the script until it meets the requirements.

DO NOT STOP after the outline has been generated. Make sure to go all the way until you submit the script for audio production.
`;

So why did we write such a long prompt?

The length and detail proved to be crucial for reliable orchestration:

  • Without explicit phases, the agent tried to skip steps or execute them out of order
  • Without detailed script requirements, it sometimes generated unusable formats with stage directions or inconsistent speaker labels that would break the audio synthesis pipeline (a simple check like the one sketched after this list catches these).
  • Without exhaustive tool definitions, the agent hallucinated capabilities (like trying to directly generate audio) or misused available tools (like submitting unformatted text to the audio production API).
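
To make the format point concrete: the prescribed script format is strict enough that a trivial check catches most bad generations. A hypothetical helper like this (not code from the repo) is enough to decide whether to ask generatePodcastScript for another pass:

// Returns true only if every non-empty line is a spoken turn from one of the
// two hosts, which is what the downstream audio production step expects.
function isValidScript(script: string): boolean {
  return script
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .every((line) => line.startsWith("Host 1:") || line.startsWith("Host 2:"));
}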

It's an example of "explicit is better than implicit" in system design - while the prompt could be shorter, the additional verbosity dramatically improves reliability and reduces edge cases.

Our lessons

  1. Explicit instructions beat implicit inference: Even with a state-of-the-art model like Claude Sonnet, additional verbosity improved reliability and reduced edge cases.

  2. Tool definitions are critical: The success of an orchestrator agent depended heavily on clear tool definitions.

  3. Architectures are flexible if you're willing to put in the work: While we could have used a deterministic workflow with fixed steps, the agent orchestrator approach worked well - but only when combined with very detailed instructions.

And, finally, one meta-lesson: building even relatively simple AI applications at a high level of quality requires multiple AI primitives (agents, workflows, tools, and RAG).
