
Chunking and Embedding Documents

Before processing, create an MDocument instance from your content. You can initialize it from various formats:

const docFromText = MDocument.fromText("Your plain text content...");
const docFromHTML = MDocument.fromHTML("<html>Your HTML content...</html>");
const docFromMarkdown = MDocument.fromMarkdown("# Your Markdown content...");
const docFromJSON = MDocument.fromJSON(`{ "key": "value" }`);

Step 1: Document Processing

Use chunk to split documents into manageable pieces. Mastra supports multiple chunking strategies optimized for different document types:

  • recursive: Smart splitting based on content structure
  • character: Simple character-based splits
  • token: Token-aware splitting
  • markdown: Markdown-aware splitting
  • html: HTML structure-aware splitting
  • json: JSON structure-aware splitting
  • latex: LaTeX structure-aware splitting

Here’s an example of how to use the recursive strategy:

const chunks = await doc.chunk({
  strategy: "recursive",
  size: 512,
  overlap: 50,
  separator: "\n",
  extract: {
    metadata: true, // Optionally extract metadata
  }
});

Note: Metadata extraction may use LLM calls, so ensure your API key is set.
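To build intuition for how `size` and `overlap` interact, here is a minimal character-based splitter. This is an illustration only, not Mastra's internal implementation; the real strategies are structure- and separator-aware:

```typescript
// Illustrative only: a naive character splitter showing how `size`
// and `overlap` interact. Consecutive chunks share `overlap`
// characters, so the window advances by `size - overlap` each step.
function naiveChunk(text: string, size: number, overlap: number): string[] {
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break;
  }
  return chunks;
}

const pieces = naiveChunk("abcdefghij", 4, 2);
// → ["abcd", "cdef", "efgh", "ghij"]
```

Each chunk is at most 4 characters long and shares 2 characters with its neighbor, which is why overlapping chunks preserve context that would otherwise be cut at a boundary.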

We go deeper into chunking strategies in our chunk documentation.

Step 2: Embedding Generation

Transform chunks into embeddings using your preferred provider. Mastra supports both OpenAI and Cohere embeddings:

Using OpenAI

import { embed } from "@mastra/rag";
 
const embeddings = await embed(chunks, {
  provider: "OPEN_AI",
  model: "text-embedding-ada-002",
  maxRetries: 3
});

Using Cohere

const embeddings = await embed(chunks, {
  provider: "COHERE",
  model: "embed-english-v3.0",
  maxRetries: 3
});

The embed function returns vectors: arrays of numbers that represent the semantic meaning of your text, ready for similarity search in your vector database.
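Similarity search typically ranks these vectors by cosine similarity. As a sketch of what happens downstream (the helper below is illustrative, not part of @mastra/rag; your vector database computes this, or an approximation, for you):

```typescript
// Illustrative helper: cosine similarity between two embedding
// vectors, i.e. the dot product divided by the product of norms.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Vectors pointing the same way score 1; orthogonal vectors score 0.
cosineSimilarity([1, 0], [1, 0]); // → 1
cosineSimilarity([1, 0], [0, 1]); // → 0
```

Because the score is normalized by vector length, only direction matters: chunks about similar topics end up with high similarity regardless of how long their text was.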

Example: Complete Pipeline

Here’s an example showing document processing and embedding generation with both providers:

import { MDocument, embed } from "@mastra/rag";
 
// Initialize document
const doc = MDocument.fromText(`
  Climate change poses significant challenges to global agriculture.
  Rising temperatures and changing precipitation patterns affect crop yields.
`);
 
// Create chunks
const chunks = await doc.chunk({
  strategy: "recursive",
  size: 256,
  overlap: 50
});
 
// Generate embeddings with OpenAI
const openAIEmbeddings = await embed(chunks, {
  provider: "OPEN_AI",
  model: "text-embedding-ada-002"
});
 
// Generate embeddings with Cohere
const cohereEmbeddings = await embed(chunks, {
  provider: "COHERE",
  model: "embed-english-v3.0"
});
 
// Store one set of embeddings in your vector database
await vectorStore.upsert("embeddings", openAIEmbeddings);

This example demonstrates how to process a document, split it into chunks, generate embeddings with both OpenAI and Cohere, and store the results in a vector database.
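Before upserting, you typically pair each chunk's text with its vector so the original content can be returned at query time. A minimal sketch, assuming embeddings arrive in the same order as chunks; the record shape and `toRecords` helper here are illustrative, not a Mastra schema:

```typescript
// Illustrative record shape for a vector store upsert: an id, the
// embedding vector, and the chunk text kept as metadata so search
// results can return the original content.
interface EmbeddingRecord {
  id: string;
  vector: number[];
  metadata: { text: string };
}

function toRecords(texts: string[], vectors: number[][]): EmbeddingRecord[] {
  return texts.map((text, i) => ({
    id: `chunk-${i}`,
    vector: vectors[i],
    metadata: { text },
  }));
}

const records = toRecords(
  ["first chunk", "second chunk"],
  [[0.1, 0.2], [0.3, 0.4]],
);
// records[0].id === "chunk-0"; records[1].metadata.text === "second chunk"
```

Keeping the text alongside the vector avoids a second lookup when a similarity search returns matching ids.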

For more examples of different chunking strategies and embedding configurations, see the chunk documentation.

