Chunking and Embedding Documents
Before processing, create an MDocument instance from your content. You can initialize it from various formats:
```typescript
const docFromText = MDocument.fromText("Your plain text content...");
const docFromHTML = MDocument.fromHTML("<html>Your HTML content...</html>");
const docFromMarkdown = MDocument.fromMarkdown("# Your Markdown content...");
const docFromJSON = MDocument.fromJSON(`{ "key": "value" }`);
```
Step 1: Document Processing
Use `chunk` to split documents into manageable pieces. Mastra supports multiple chunking strategies optimized for different document types:
- `recursive`: Smart splitting based on content structure
- `character`: Simple character-based splits
- `token`: Token-aware splitting
- `markdown`: Markdown-aware splitting
- `html`: HTML structure-aware splitting
- `json`: JSON structure-aware splitting
- `latex`: LaTeX structure-aware splitting
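Which strategy fits best usually follows from the source format. As an illustrative sketch (this helper and its extension mapping are our own convention, not part of Mastra's API), you might pick a strategy from a file extension:

```typescript
// Illustrative helper (not part of Mastra): map a file extension
// to one of the chunking strategy names listed above.
type ChunkStrategy =
  | "recursive" | "character" | "token"
  | "markdown" | "html" | "json" | "latex";

function strategyForExtension(ext: string): ChunkStrategy {
  switch (ext.toLowerCase()) {
    case "md":   return "markdown";
    case "html": return "html";
    case "json": return "json";
    case "tex":  return "latex";
    default:     return "recursive"; // general-purpose default
  }
}
```

For formats without a structure-aware splitter, `recursive` is a reasonable default.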
Here’s an example of how to use the `recursive` strategy:
```typescript
const chunks = await doc.chunk({
  strategy: "recursive",
  size: 512,
  overlap: 50,
  separator: "\n",
  extract: {
    metadata: true, // optionally extract metadata
  },
});
```
Note: Metadata extraction may use LLM calls, so ensure your API key is set.
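One way to avoid failed LLM calls is to request metadata extraction only when a key is actually available. A sketch (the option shape mirrors the `chunk` call above; the guard itself is our own convention, not a Mastra feature):

```typescript
// Build chunk options, enabling metadata extraction only when an
// API key is available, since extraction may issue LLM calls.
function chunkOptions(hasApiKey: boolean) {
  return {
    strategy: "recursive" as const,
    size: 512,
    overlap: 50,
    separator: "\n",
    ...(hasApiKey ? { extract: { metadata: true } } : {}),
  };
}
```

You could then call `doc.chunk(chunkOptions(Boolean(process.env.OPENAI_API_KEY)))` so chunking still succeeds without a key, just without extracted metadata.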
We go deeper into chunking strategies in our `chunk` documentation.
Step 2: Embedding Generation
Transform chunks into embeddings using your preferred provider. Mastra supports both OpenAI and Cohere embeddings:
Using OpenAI
```typescript
import { embed } from "@mastra/rag";

const embeddings = await embed(chunks, {
  provider: "OPEN_AI",
  model: "text-embedding-ada-002",
  maxRetries: 3,
});
```
Using Cohere
```typescript
const embeddings = await embed(chunks, {
  provider: "COHERE",
  model: "embed-english-v3.0",
  maxRetries: 3,
});
```
The embedding functions return vectors, arrays of numbers representing the semantic meaning of your text, ready for similarity searches in your vector database.
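Similarity between two such vectors is commonly measured with cosine similarity, which is what most vector databases compute under the hood. A minimal sketch in plain TypeScript, independent of any particular database:

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// Values near 1 mean the two texts are semantically close;
// values near 0 mean they are unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```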
Example: Complete Pipeline
Here’s an example showing document processing and embedding generation with both providers:
```typescript
import { MDocument, embed } from "@mastra/rag";

// Initialize document
const doc = MDocument.fromText(`
  Climate change poses significant challenges to global agriculture.
  Rising temperatures and changing precipitation patterns affect crop yields.
`);

// Create chunks
const chunks = await doc.chunk({
  strategy: "recursive",
  size: 256,
  overlap: 50,
});

// Generate embeddings with OpenAI
const openAIEmbeddings = await embed(chunks, {
  provider: "OPEN_AI",
  model: "text-embedding-ada-002",
});

// Generate embeddings with Cohere
const cohereEmbeddings = await embed(chunks, {
  provider: "COHERE",
  model: "embed-english-v3.0",
});

// Store one set of embeddings in your vector database
await vectorStore.upsert("embeddings", openAIEmbeddings);
```
This example demonstrates how to process a document, split it into chunks, generate embeddings with both OpenAI and Cohere, and store the results in a vector database.
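With `size: 256` and `overlap: 50`, consecutive chunks advance by roughly size − overlap = 206 characters, so you can roughly estimate how many chunks (and therefore embedding calls) a document will produce. A sketch of that arithmetic (our own approximation; real strategies also respect separators and content structure):

```typescript
// Rough chunk-count estimate for fixed-size chunking with overlap.
// Each chunk after the first advances by (size - overlap) characters,
// so this is an upper-bound approximation, not Mastra's exact split.
function estimateChunkCount(textLength: number, size: number, overlap: number): number {
  if (textLength <= size) return 1;
  const stride = size - overlap;
  return Math.ceil((textLength - overlap) / stride);
}
```

This kind of estimate is handy for budgeting embedding API usage before running the pipeline on a large corpus.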
For more examples of different chunking strategies and embedding configurations, see: