
Setting up a RAG pipeline with Mastra

Jan 16, 2025

Understanding RAG: The Theory

Large Language Models (LLMs) face a fundamental challenge: they can only work with information they were trained on. This creates limitations when dealing with new, private, or domain-specific information. Additionally, LLMs sometimes generate plausible but incorrect information – a problem known as hallucination.

Retrieval-Augmented Generation (RAG) addresses these limitations by connecting LLMs to external knowledge sources. Think of RAG as giving an LLM the ability to "look up" information before responding, similar to how a human might consult reference materials to answer a question accurately.

How RAG Works

The RAG process mirrors how humans research and answer questions. When a query comes in, the system first searches its knowledge base for relevant information using vector embeddings – numerical representations that capture the meaning of text. This retrieved information is then formatted into a prompt for the LLM, which combines this specific knowledge with its general capabilities to generate an accurate, contextual response.
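
To make the flow concrete, here's a minimal, framework-agnostic sketch of the retrieve-then-generate loop. The function parameters are illustrative placeholders, not Mastra APIs; the concrete Mastra calls follow in the sections below.

type Chunk = { text: string };

async function answerWithRag(
  query: string,
  embedQuery: (text: string) => Promise<number[]>,
  searchKnowledgeBase: (queryVector: number[], topK: number) => Promise<Chunk[]>,
  generate: (prompt: string) => Promise<string>
): Promise<string> {
  // 1. Turn the query into a vector embedding
  const queryVector = await embedQuery(query);

  // 2. Retrieve the most relevant chunks from the knowledge base
  const chunks = await searchKnowledgeBase(queryVector, 4);

  // 3. Combine the retrieved context and the question into a prompt
  const context = chunks.map((c) => c.text).join("\n\n");
  const prompt = `Answer the question using only this context:\n${context}\n\nQuestion: ${query}`;

  // 4. Let the LLM generate a grounded answer
  return generate(prompt);
}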


Mastra has built-in support for RAG and allows developers to easily integrate a RAG workflow into their applications.

Now, let's build a RAG workflow using Mastra, including reranking capabilities for improved accuracy.

Part 1: Document Ingestion and Chunking

Let’s start by ingesting a document and splitting it into bite-sized pieces for our search; this is called “chunking.”

The goal is to split at natural boundaries like topic transitions and new sections, so that each chunk remains semantically coherent.


import { Mastra, EmbedManyResult, EmbedResult } from "@mastra/core";
import { embed, MDocument, PgVector, Reranker } from "@mastra/rag";

// Load the raw text into a Mastra document
const doc = MDocument.fromText(`Your text content here...`);

// Split the document into overlapping chunks at natural boundaries
const chunks = await doc.chunk({
  strategy: "recursive",
  size: 512,
  overlap: 50,
  separator: "\n",
});

Part 2: Embedding Generation

After chunking, we'll need to embed our data – transform each chunk into a vector, an array of 1536 floating-point numbers that represents the meaning of the text.

We do this with an embedding model, a neural network trained specifically to produce these vectors. OpenAI has an API for this; we'll use its text-embedding-3-small model.


// Embed every chunk in a single batched call
const { embeddings } = (await embed(chunks, {
  provider: "OPEN_AI",
  model: "text-embedding-3-small",
  maxRetries: 3,
})) as EmbedManyResult<string>;


Part 3: Vector Storage and Management

We need a vector database that can store these vectors and run similarity searches over them. We'll use pgvector, a Postgres extension that adds vector storage and search to a standard Postgres database.

Once we pick a vector DB, we need to set up an index to store our document chunks, represented as vector embeddings.


const pgVector = new PgVector(process.env.POSTGRES_CONNECTION_STRING!);

// add to the Mastra object to get logging
export const mastra = new Mastra({
  vectors: { pgVector },
});

// Create an index sized for 1536-dimensional embeddings
await pgVector.createIndex("embeddings", 1536);

// Store each vector along with its chunk text as metadata
await pgVector.upsert(
  "embeddings",
  embeddings,
  chunks?.map((chunk: any) => ({ text: chunk.text }))
);

Part 4: Query Processing and Response Generation

Let's set up the LLM we'll use for generating responses and define our query. We'll then generate an embedding for the query using the same embedding API we used for the document chunks.

// Configure the LLM we'll use to generate answers
const llm = mastra.LLM({
  provider: "OPEN_AI",
  name: "gpt-4o-mini",
});

// Embed the query with the same model used for the document chunks
const query = "insert query here";
const { embedding } = (await embed(query, {
  provider: "OPEN_AI",
  model: "text-embedding-3-small",
  maxRetries: 3,
})) as EmbedResult<string>;

Okay, after that setup, we can now query the database!

Under the hood, Mastra runs an algorithm that compares our query embedding to the embeddings of every chunk in the database and returns the most similar ones.

The algorithm used here is cosine similarity. Conceptually, it's similar to a geo query over latitude/longitude, except the search runs over 1536 dimensions instead of two. Other distance metrics, such as Euclidean distance or dot product, can be used as well.
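
To see what that similarity calculation looks like, here's a plain TypeScript version of cosine similarity. This is just to illustrate the math: pgvector computes this inside the database, so you never need to write it yourself.

// Cosine similarity: the dot product of two vectors divided by the
// product of their magnitudes. 1 means identical direction, 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}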

Now, we can construct a prompt using the results as context.

// Perform vector similarity search
const results = await pgVector.query("embeddings", embedding);

// Extract and combine relevant chunks
const relevantChunks = results.map((result) => result?.metadata?.text);
const relevantContext = relevantChunks.join("\n\n");

const prompt = `
    Please answer the following question:
    ${query}

    Please base your answer only on this context ${relevantContext}. 
    If the context doesn't contain enough information to fully answer the question, please state that explicitly.
`;

We can now pass this prompt into an LLM to generate a response.

const completion = await llm.generate(prompt);
console.log(completion.text);

Part 5: Enhanced Retrieval with Reranking

Optionally, after querying and getting our results, we can use a reranker. Reranking scores relevance with a more computationally expensive, and typically more accurate, method than the initial vector search.

It would take too long to run it over the entire database, but we can run it over our results to improve the ordering.


// Use an LLM-based agent as the semantic scorer for reranking
const reranker = new Reranker({
  semanticProvider: "agent",
  agentProvider: {
    provider: "OPEN_AI",
    name: "gpt-4o-mini",
  },
});

const rerankedResults = await reranker.rerank({
  query: query,
  vectorStoreResults: results,
  topK: 3,
});
// Process reranked results
const rerankedChunks = rerankedResults.map(
  ({ result }) => result?.metadata?.text
);

// Combine reranked chunks into a context
const rerankedContext = rerankedChunks.join("\n\n");

const rerankedPrompt = `
    Please answer the following question:
    ${query}

    Please base your answer only on this context ${rerankedContext}. 
    If the context doesn't contain enough information to fully answer the question, please state that explicitly.
`;

Finally, we can generate a response using the reranked prompt:

const rerankedCompletion = await llm.generate(rerankedPrompt);
console.log(rerankedCompletion.text);

Common Patterns and Best Practices

When working with this RAG implementation, keep these principles in mind:

The chunk size and overlap settings in the document processing significantly impact retrieval quality. Using a recursive strategy with moderate overlap (like 50 tokens) often provides good results while maintaining context.

The choice of embedding model affects both the quality of retrieval and the cost of operation. While OpenAI's embeddings provide excellent results, there are open-source alternatives available for different needs.

The number of chunks retrieved (the topK parameter passed to the vector query and the reranker) is a balance between providing enough context and staying within the LLM's context window. Start with 3-4 chunks and adjust based on need, as in the sketch below.
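
For example, assuming your version of PgVector.query accepts the number of results as an optional third argument, you can retrieve a slightly larger candidate set and let the reranker narrow it down:

const topK = 4;

// Pull a few extra candidates from the vector store...
const candidateResults = await pgVector.query("embeddings", embedding, topK * 2);

// ...then keep only the topK best after reranking
const narrowedResults = await reranker.rerank({
  query: query,
  vectorStoreResults: candidateResults,
  topK,
});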

Consider the tradeoff between response time and quality when deciding whether to use reranking. While it can improve accuracy, it adds an additional processing step to the workflow.

Author


Nik Aiyer
