Building a NotebookLM clone with an agent orchestrator

Jan 16, 2025

We've recently been polishing up our multi-agent functionality, and decided to build a NotebookLM clone to show off some of the capabilities.

Here's the deployed version and the code.

The initial deterministic workflow

We created a deterministic workflow using Mastra's workflow functionality to get the source material into the DB and embedded into pgvector for summarization.

The first step was to use LlamaIndex Cloud to parse the uploaded source material and turn it into a markdown file. Then, we used Mastra's @mastra/rag package to chunk the markdown file, and embed the chunks into our Postgres DB using PgVectorStore.
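
In sketch form, that ingestion step looks something like the following. The chunking options, embedding model, and vector-store calls shown here are illustrative rather than lifted from the repo, and import paths may differ by Mastra version:

import { MDocument, embed } from "@mastra/rag";
import { PgVector } from "@mastra/pg"; // import path/package name may differ by version

// `markdown` is the parsed output we get back from LlamaIndex Cloud
const doc = MDocument.fromMarkdown(markdown);

// Split the markdown into embedding-sized chunks
const chunks = await doc.chunk({
  strategy: "recursive",
  size: 512,
  overlap: 50,
});

// Embed every chunk (the model choice here is an assumption)
const { embeddings } = await embed(chunks, {
  provider: "OPEN_AI",
  model: "text-embedding-3-small",
});

// Store the vectors plus the chunk text in Postgres via pgvector
// (index name and upsert signature are illustrative)
const pgVector = new PgVector(process.env.POSTGRES_CONNECTION_STRING!);
await pgVector.createIndex("source_chunks", 1536);
await pgVector.upsert(
  "source_chunks",
  embeddings,
  chunks.map((chunk) => ({ text: chunk.text })),
);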

Next, we needed to turn that into an audio podcast, with source material included.

The decision to use an agent orchestrator

There were two fundamental agent architecture approaches we could have taken.

Because we were essentially building an audio processing pipeline, we could have made another structured workflow with a defined set of steps.

However, we found that giving a single "orchestrator" agent a number of tools, each of which called a separate agent and could be invoked in any order, worked well enough... assuming we wrote an INCREDIBLY DETAILED prompt.

The orchestrator agent and its tools

But we'll get to that in a bit. Here's the code for the orchestrator agent:

import { Agent } from "@mastra/core";

// orchestratorInstructions and the tool definitions are imported from
// elsewhere in the project; the instructions themselves are shown later in this post.
export const orchestrator = new Agent({
  name: "orchestrator",
  instructions: orchestratorInstructions,
  model: {
    provider: "ANTHROPIC",
    name: "claude-3-5-sonnet-20241022",
    toolChoice: "required",
  },
  tools: {
    validateSourcesAvailability,
    querySourceSummaryAndChunks,
    savePodcastDetails,
    generatePodcastOutline,
    generatePodcastScript,
    submitForAudioProduction,
  },
});
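
Once it's defined, kicking off a run is a single call to this agent. The exact generate call and message shape have changed across Mastra versions, so treat this as a sketch of the idea rather than the literal code:

const notebookId = "example-notebook-id"; // would come from the request in practice

// Hand the orchestrator one user message and let it drive the tools from there.
const result = await orchestrator.generate([
  {
    role: "user",
    content: `Generate a podcast for notebook ${notebookId} using its available sources.`,
  },
]);

console.log(result.text);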

Let's dive into some of the tools listed here. Quick summaries:

  • validateSourcesAvailability does a "normal" database query to check if the sources are available.
  • querySourceSummaryAndChunks uses the embed function from @mastra/rag to generate embeddings for a query, and then uses pgvector to find the closest embeddings in the database (see the sketch after this list).
  • generatePodcastOutline uses a specialized agent that is good at extracting insights and turning them into an outline.
  • generatePodcastScript uses a specialized agent that can take that outline and turn it into a script.
  • submitForAudioProduction calls a specialized API, play.ai, that can turn the two-person script into a podcast. (We tried other audio APIs, like ElevenLabs, but found that Play.ai did the best job at making the interchange between the two hosts sound natural.)
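
The RAG tool is the most involved of the lot. Stripped of its tool wrapper, it boils down to something like this (the embed call and vector-store query signature are illustrative, not copied from the repo):

import { embed } from "@mastra/rag";
import { PgVector } from "@mastra/pg"; // import path/package name may differ by version

const pgVector = new PgVector(process.env.POSTGRES_CONNECTION_STRING!);

// Simplified core of querySourceSummaryAndChunks
async function querySourceChunks(
  query: string,
  notebookId: string,
  topK = 5,
  minSimilarity = 0.5,
) {
  // Embed the query with the same model used at ingestion time
  const { embedding } = await embed(query, {
    provider: "OPEN_AI",
    model: "text-embedding-3-small",
  });

  // Nearest-neighbor search in pgvector, scoped to this notebook's sources
  const matches = await pgVector.query("source_chunks", embedding, topK, {
    notebookId,
  });

  // Drop anything below the similarity threshold before handing it back
  return matches.filter((match) => match.score >= minSimilarity);
}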

You'll notice that many of these tools are specialized agents, and one is doing RAG. The hard work, though, was describing them in a way that the LLM could understand.
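
Describing them well starts with the tool definitions themselves. For the agent-backed tools, the shape is roughly the following (createTool field names have shifted between Mastra versions, and the outline agent module here is hypothetical):

import { createTool } from "@mastra/core";
import { z } from "zod";
import { outlineAgent } from "./agents/outline"; // hypothetical module holding the specialized outline agent

export const generatePodcastOutline = createTool({
  id: "generatePodcastOutline",
  description:
    "Generates a show outline for the podcast from instructions and a list of key insights.",
  inputSchema: z.object({
    instructions: z.string(),
    keyInsights: z.array(z.string()),
  }),
  execute: async ({ context }) => {
    // Delegate to the specialized outline agent and return its text output
    const result = await outlineAgent.generate([
      {
        role: "user",
        content: `${context.instructions}\n\nKey insights:\n- ${context.keyInsights.join("\n- ")}`,
      },
    ]);
    return { outline: result.text };
  },
});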

The orchestrator agent's instructions

Now let's take a look at the instructions we gave to the orchestrator agent. The prompt is...a bit of a beast:


export const orchestratorInstructions = `
You are an orchestrator tasked with coordinating the generation of a podcast from written sources.
Your role is to manage the entire process, from content research to final audio production, ensuring quality and coherence throughout.

Phases
1. Validate all required sources are available
2. Query source summaries and chunks to get the available content
3. Identify key insights and themes
4. Generate an outline for the podcast targeting 15 to 30 minutes
5. Use the outline to generate a script
6. Review the script

Script requirements
- Maintain consistent voice and tone
- The script should only contain the spoken words of the hosts
- The script should NOT include non-verbal cues, directions, instructions, etc
- The script should be formatted in the following way
  - Prefix each speaker's turn with either 'Host 1:' or 'Host 2:'
  - Example format:
      Host 1: Hello there. Today we're talking about something very interesting.
      Host 2: Very interesting doesn't even begin to describe how interesting this is, I'm particularly fascinated...

Tools
You have access to the following tools to help you with your task
- 'validateSourcesAvailability': this tool helps you validate if the sources are available. it accepts a notebookId, which will retrieve all relevant sources and it will return an object with the following shape
- 'querySourceSummaryAndChunks': This tool takes a query string, notebookId, similarity threshold, and limit as input, and returns an array of sources (containing sourceId, sourceTitle, sourceSummary, and sourceChunks) by comparing vector embeddings in a PostgreSQL database.
- 'submitForAudioProduction': This tool accepts a podcast transcript and voice configuration options as input, submits it to the PlayDialog API for processing with alternating voices, and returns the URL the user will use to poll for completion of the audio production job.
- 'savePodcastDetails': This tool is used to save podcast details like the audio_url and podcast_script for the notebook. Use it to always save the details you get. You don't have to pass all the details at the same time. Ensure you save the script before you submit for podcast generation.
- 'generatePodcastOutline': This tool is used to generate a show outline for the podcast. This outline will be used to plan the scripting process. You need to pass instructions and a list of key insights and it will give you back the outline
- 'generatePodcastScript': This tool is used to generate script for the podcast. Look at the result of this tool and make sure it follows the prescribed format and is long enough for the target time, you can use it again to regenerate the script until it meets the requirements.

DO NOT STOP after the outline has been generated. Make sure to go all the way until you submit the script for audio production.
`;

So why did we write such a long prompt?

The length and detail proved to be crucial for reliable orchestration:

  • Without explicit phases, the agent tried to skip steps or execute them out of order
  • Without detailed script requirements, it sometimes generated unusable formats with stage directions or inconsistent speaker labels that would break the audio synthesis pipeline (a simple check like the one sketched after this list catches these).
  • Without exhaustive tool definitions, the agent hallucinated capabilities (like trying to directly generate audio) or misused available tools (like submitting unformatted text to the audio production API).
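
To make the format point concrete: the prescribed script format is strict enough that a trivial check catches most bad generations. A hypothetical helper like this (not code from the repo) is enough to decide whether to ask generatePodcastScript for another pass:

// Returns true only if every non-empty line is a spoken turn from one of the
// two hosts, which is what the downstream audio production step expects.
function isValidScript(script: string): boolean {
  return script
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .every((line) => line.startsWith("Host 1:") || line.startsWith("Host 2:"));
}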

It's an example of "explicit is better than implicit" in system design - while the prompt could be shorter, the additional verbosity dramatically improves reliability and reduces edge cases.

Our lessons

  1. Explicit instructions beat implicit inference: Even with a state-of-the-art model like Claude Sonnet, additional verbosity improved reliability and reduced edge cases.

  2. Tool definitions are critical: The success of an orchestrator agent depended heavily on clear tool definitions.

  3. Architectures are flexible if you're willing to put in the work: While we could have used a deterministic workflow with fixed steps, the agent orchestrator approach worked well - but only when combined with very detailed instructions.

And, finally, one meta-lesson: building even relatively simple AI applications at a high level of quality requires multiple AI primitives (agents, workflows, tools, and RAG).
