# Chunking and Embedding Documents
Before processing, create an `MDocument` instance from your content. You can initialize it from various formats:
```ts
import { MDocument } from "@mastra/rag";

const docFromText = MDocument.fromText("Your plain text content...");
const docFromHTML = MDocument.fromHTML("<html>Your HTML content...</html>");
const docFromMarkdown = MDocument.fromMarkdown("# Your Markdown content...");
const docFromJSON = MDocument.fromJSON(`{ "key": "value" }`);
```
## Step 1: Document Processing
Use `chunk` to split documents into manageable pieces. Mastra supports multiple chunking strategies optimized for different document types:
- `recursive`: Smart splitting based on content structure
- `character`: Simple character-based splits
- `token`: Token-aware splitting
- `markdown`: Markdown-aware splitting
- `html`: HTML structure-aware splitting
- `json`: JSON structure-aware splitting
- `latex`: LaTeX structure-aware splitting
Here’s an example of how to use the `recursive` strategy:
```ts
const chunks = await doc.chunk({
  strategy: "recursive",
  size: 512,
  overlap: 50,
  separator: "\n",
  extract: {
    metadata: true, // Optionally extract metadata
  },
});
```
Note: Metadata extraction may use LLM calls, so ensure your API key is set.
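For example, you can fail fast if the key is missing. A minimal guard, assuming OpenAI is the provider handling extraction in your setup:

```ts
// Guard against a missing key before running LLM-based metadata extraction.
// Assumes OpenAI; substitute the env var for your configured provider.
if (!process.env.OPENAI_API_KEY) {
  throw new Error("Set OPENAI_API_KEY before enabling metadata extraction.");
}
```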
We go deeper into chunking strategies in our `chunk` documentation.
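The other strategies generally take the same core options. For instance, a markdown-aware split, sketched here with the `size` and `overlap` values from the example above (strategy-specific options may also apply):

```ts
// Markdown-aware chunking respects heading and section boundaries
// where possible instead of cutting at arbitrary character offsets.
const mdChunks = await docFromMarkdown.chunk({
  strategy: "markdown",
  size: 512,
  overlap: 50,
});
```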
## Step 2: Embedding Generation
Transform chunks into embeddings using your preferred provider. Mastra supports both OpenAI and Cohere embeddings:
### Using OpenAI
```ts
import { openai } from "@ai-sdk/openai";
import { embedMany } from "ai";

const { embeddings } = await embedMany({
  model: openai.embedding('text-embedding-3-small'),
  values: chunks.map(chunk => chunk.text),
});
```
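As a quick sanity check, `embedMany` returns one vector per input, and `text-embedding-3-small` produces 1536-dimensional vectors:

```ts
console.log(embeddings.length);    // one embedding per chunk
console.log(embeddings[0].length); // 1536 for text-embedding-3-small
```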
### Using Cohere
```ts
import { embedMany } from 'ai';
import { cohere } from '@ai-sdk/cohere';

const { embeddings } = await embedMany({
  model: cohere.embedding('embed-english-v3.0'),
  values: chunks.map(chunk => chunk.text),
});
```
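Note that the two models produce vectors of different dimensionality (1536 for `text-embedding-3-small`, 1024 for `embed-english-v3.0`), so the index you create in your vector database must match the model you choose.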
The embedding functions return vectors: arrays of numbers representing the semantic meaning of your text, ready for similarity searches in your vector database.
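To make "similarity" concrete, vector stores commonly rank results by cosine similarity between the query vector and each stored vector. A minimal self-contained sketch:

```ts
// Cosine similarity: 1 means identical direction, 0 means unrelated.
// Both vectors must come from the same model (same dimensionality).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```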
## Example: Complete Pipeline
Here’s an example showing document processing and embedding generation with both providers:
```ts
import { embedMany } from "ai";
import { openai } from "@ai-sdk/openai";
import { cohere } from "@ai-sdk/cohere";
import { MDocument } from "@mastra/rag";

// Initialize document
const doc = MDocument.fromText(`
  Climate change poses significant challenges to global agriculture.
  Rising temperatures and changing precipitation patterns affect crop yields.
`);

// Create chunks
const chunks = await doc.chunk({
  strategy: "recursive",
  size: 256,
  overlap: 50,
});

// Generate embeddings with OpenAI
const { embeddings: openAIEmbeddings } = await embedMany({
  model: openai.embedding('text-embedding-3-small'),
  values: chunks.map(chunk => chunk.text),
});

// OR

// Generate embeddings with Cohere
const { embeddings: cohereEmbeddings } = await embedMany({
  model: cohere.embedding('embed-english-v3.0'),
  values: chunks.map(chunk => chunk.text),
});

// Store embeddings in your vector database.
// `vectorStore` is assumed to be an initialized vector store instance;
// pass the set of embeddings you generated above (here, the OpenAI ones).
await vectorStore.upsert("embeddings", openAIEmbeddings);
```
This example demonstrates how to process a document, split it into chunks, generate embeddings with either OpenAI or Cohere, and store the results in a vector database.
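From there, retrieval means embedding the query with the same model and asking the store for its nearest neighbors. A minimal sketch; the `query` call below is hypothetical, so check your vector store's actual API:

```ts
import { embed } from "ai";
import { openai } from "@ai-sdk/openai";

// Embed the question with the same model used for the chunks.
const { embedding: queryVector } = await embed({
  model: openai.embedding("text-embedding-3-small"),
  value: "How does climate change affect crop yields?",
});

// Hypothetical signature: query(indexName, queryVector, topK).
const topMatches = await vectorStore.query("embeddings", queryVector, 3);
```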
For more examples of different chunking strategies and embedding configurations, see: