
Chunking and Embedding Documents

Before processing, create an MDocument instance from your content. You can initialize it from several formats:

```typescript
import { MDocument } from "@mastra/rag";

const docFromText = MDocument.fromText("Your plain text content...");
const docFromHTML = MDocument.fromHTML("<html>Your HTML content...</html>");
const docFromMarkdown = MDocument.fromMarkdown("# Your Markdown content...");
const docFromJSON = MDocument.fromJSON(`{ "key": "value" }`);
```

Step 1: Document Processing

Use chunk to split documents into manageable pieces. Mastra supports multiple chunking strategies optimized for different document types:

  • recursive: Smart splitting based on content structure
  • character: Simple character-based splits
  • token: Token-aware splitting
  • markdown: Markdown-aware splitting
  • html: HTML structure-aware splitting
  • json: JSON structure-aware splitting
  • latex: LaTeX structure-aware splitting

Here’s an example of how to use the recursive strategy:

```typescript
const chunks = await doc.chunk({
  strategy: "recursive",
  size: 512,
  overlap: 50,
  separator: "\n",
  extract: {
    metadata: true, // Optionally extract metadata
  },
});
```

Note: Metadata extraction may use LLM calls, so ensure your API key is set.
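Each chunk carries the split text (the later examples read chunk.text) along with any metadata produced during extraction. A quick way to inspect the output, assuming a metadata field alongside text:

```typescript
// Inspect the chunking result: each chunk exposes its text,
// plus extracted metadata when extract.metadata is enabled
console.log(`Produced ${chunks.length} chunks`);
console.log(chunks[0].text);     // the first chunk's content
console.log(chunks[0].metadata); // extracted metadata, if any
```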

We go deeper into chunking strategies in our chunk documentation.

Step 2: Embedding Generation

Transform chunks into embeddings using your preferred provider. Mastra supports many embedding providers, including OpenAI and Cohere:

Using OpenAI

```typescript
import { openai } from "@ai-sdk/openai";
import { embedMany } from "ai";

const { embeddings } = await embedMany({
  model: openai.embedding('text-embedding-3-small'),
  values: chunks.map(chunk => chunk.text),
});
```

Using Cohere

```typescript
import { cohere } from '@ai-sdk/cohere';
import { embedMany } from 'ai';

const { embeddings } = await embedMany({
  model: cohere.embedding('embed-english-v3.0'),
  values: chunks.map(chunk => chunk.text),
});
```

The embedding functions return vectors: arrays of numbers representing the semantic meaning of your text, ready for similarity searches in your vector database.
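As a quick sanity check, each embedding lines up index-for-index with the chunk it came from (a minimal sketch, continuing from the OpenAI example above):

```typescript
// embeddings is a number[][]: one vector per input value, in order
console.log(embeddings.length);    // matches chunks.length
console.log(embeddings[0].length); // 1536 for text-embedding-3-small
```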

Configuring Embedding Dimensions

Embedding models typically output vectors with a fixed number of dimensions (e.g., 1536 for OpenAI’s text-embedding-3-small). Some models support reducing this dimensionality, which can help:

  • Decrease storage requirements in vector databases
  • Reduce computational costs for similarity searches
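For example, stored as 4-byte float32 values, a 1536-dimension vector takes about 6 KB, while a 256-dimension vector takes about 1 KB, so a million stored chunks shrink from roughly 6 GB to 1 GB of raw vector data.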

Here are some supported models:

OpenAI (text-embedding-3 models):

```typescript
import { openai } from "@ai-sdk/openai";
import { embedMany } from "ai";

const { embeddings } = await embedMany({
  model: openai.embedding('text-embedding-3-small', {
    dimensions: 256, // Only supported in text-embedding-3 and later
  }),
  values: chunks.map(chunk => chunk.text),
});
```

Google (text-embedding-004):

```typescript
import { google } from '@ai-sdk/google';
import { embedMany } from 'ai';

const { embeddings } = await embedMany({
  model: google.textEmbeddingModel('text-embedding-004', {
    outputDimensionality: 256, // Truncates excess values from the end
  }),
  values: chunks.map(chunk => chunk.text),
});
```

Example: Complete Pipeline

Here’s an example showing document processing and embedding generation with both providers:

```typescript
import { embedMany } from "ai";
import { openai } from "@ai-sdk/openai";
import { cohere } from "@ai-sdk/cohere";
import { MDocument } from "@mastra/rag";

// Initialize document
const doc = MDocument.fromText(`
  Climate change poses significant challenges to global agriculture.
  Rising temperatures and changing precipitation patterns affect crop yields.
`);

// Create chunks
const chunks = await doc.chunk({
  strategy: "recursive",
  size: 256,
  overlap: 50,
});

// Generate embeddings with OpenAI
const { embeddings: openAIEmbeddings } = await embedMany({
  model: openai.embedding('text-embedding-3-small'),
  values: chunks.map(chunk => chunk.text),
});

// OR

// Generate embeddings with Cohere
const { embeddings: cohereEmbeddings } = await embedMany({
  model: cohere.embedding('embed-english-v3.0'),
  values: chunks.map(chunk => chunk.text),
});

// Store embeddings in your vector database
// (vectorStore is your configured vector store instance)
await vectorStore.upsert({
  indexName: "embeddings",
  vectors: openAIEmbeddings, // or cohereEmbeddings, depending on the provider you chose
});
```
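Whichever provider you choose, the index you upsert into must be created with a dimension that matches the model's output: 1536 for text-embedding-3-small, or 1024 for Cohere's embed-english-v3.0.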

For more examples of different chunking strategies and embedding configurations, see the chunk and embedMany reference documentation.