Reference: .chunk()
The .chunk() function splits documents into smaller segments using various strategies and options.
Example
import { MDocument } from "@mastra/rag";

const doc = MDocument.fromMarkdown(`
# Introduction
This is a sample document that we want to split into chunks.
## Section 1
Here is the first section with some content.
## Section 2
Here is another section with different content.
`);

// Basic chunking with defaults
const chunks = await doc.chunk();

// Markdown-specific chunking with header extraction
const chunksWithMetadata = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ],
  extract: {
    summary: true, // Extract summaries with default settings
    keywords: true, // Extract keywords with default settings
  },
});
Parameters
The following parameters are available for all chunking strategies. Important: each strategy uses only the subset of these parameters that is relevant to it.
strategy?:
maxSize?:
size?:
overlap?:
lengthFunction?:
keepSeparator?:
addStartIndex?:
stripWhitespace?:
extract?:
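To make the interplay of maxSize and overlap concrete, here is a minimal, self-contained sketch of fixed-size chunking with overlap. It is not Mastra's implementation: slidingWindowChunks is a hypothetical helper, and it assumes sizes are measured in characters, which is the kind of measure a custom lengthFunction would replace.

```typescript
// Hypothetical helper: illustrates maxSize/overlap semantics only.
// Assumption: length is counted in characters.
function slidingWindowChunks(
  text: string,
  maxSize: number,
  overlap: number,
): string[] {
  if (maxSize <= 0) throw new Error("maxSize must be positive");
  if (overlap < 0 || overlap >= maxSize) {
    throw new Error("overlap must be in [0, maxSize)");
  }

  // Each new chunk starts (maxSize - overlap) characters after the previous one,
  // so consecutive chunks share `overlap` characters.
  const step = maxSize - overlap;
  const result: string[] = [];

  for (let start = 0; start < text.length; start += step) {
    result.push(text.slice(start, start + maxSize));
    // Stop once the current window has reached the end of the text.
    if (start + maxSize >= text.length) break;
  }
  return result;
}

const windowed = slidingWindowChunks("abcdefghij", 4, 1);
console.log(windowed); // [ 'abcd', 'defg', 'ghij' ]
```

Each chunk repeats the last character of the previous one, which is the effect an overlap of 1 is meant to have. Real strategies typically split on separators or tokens first and only then pack pieces up to maxSize, so their boundaries will differ from this sketch.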
Strategy-Specific Options
Strategy-specific options are passed as top-level parameters alongside the strategy parameter. For example:
// Character strategy example
const characterChunks = await doc.chunk({
  strategy: "character",
  separator: ".", // Character-specific option
  isSeparatorRegex: false, // Character-specific option
  maxSize: 300, // General option
});

// Recursive strategy example
const recursiveChunks = await doc.chunk({
  strategy: "recursive",
  separators: ["\n\n", "\n", " "], // Recursive-specific option
  language: "markdown", // Recursive-specific option
  maxSize: 500, // General option
});

// Sentence strategy example
const sentenceChunks = await doc.chunk({
  strategy: "sentence",
  maxSize: 450, // Required for sentence strategy
  minSize: 50, // Sentence-specific option
  sentenceEnders: ["."], // Sentence-specific option
  fallbackToCharacters: false, // Sentence-specific option
  keepSeparator: true, // General option
});

// HTML strategy example
const htmlChunks = await doc.chunk({
  strategy: "html",
  headers: [
    ["h1", "title"],
    ["h2", "subtitle"],
  ], // HTML-specific option
});

// Markdown strategy example
const markdownChunks = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ], // Markdown-specific option
  stripHeaders: true, // Markdown-specific option
});

// Token strategy example
const tokenChunks = await doc.chunk({
  strategy: "token",
  encodingName: "gpt2", // Token-specific option
  modelName: "gpt-3.5-turbo", // Token-specific option
  maxSize: 1000, // General option
});
The options documented below are passed directly at the top level of the configuration object, not nested within a separate options object.
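If existing code keeps strategy options in a nested object, a small adapter can hoist them to the top level before calling .chunk(). The sketch below is an assumption-laden illustration: flattenChunkConfig and the NestedChunkConfig shape are hypothetical and not part of @mastra/rag.

```typescript
// Hypothetical input shape: strategy options tucked under an `options` key.
interface NestedChunkConfig {
  strategy: string;
  options?: Record<string, unknown>;
  [key: string]: unknown;
}

// Hypothetical helper: hoists nested options to the top level.
// Keys already present at the top level win over nested ones.
function flattenChunkConfig(
  config: NestedChunkConfig,
): Record<string, unknown> & { strategy: string } {
  const { options, ...rest } = config;
  return { ...(options ?? {}), ...rest };
}

const flatConfig = flattenChunkConfig({
  strategy: "character",
  maxSize: 300,
  options: { separator: ".", isSeparatorRegex: false },
});
console.log(flatConfig);
```

The resulting object has separator, isSeparatorRegex, strategy, and maxSize side by side, which is the flat shape described above.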
Character
separator?:
isSeparatorRegex?:
Recursive
separators?:
isSeparatorRegex?:
language?:
Sentence
maxSize:
minSize?:
targetSize?:
sentenceEnders?:
fallbackToWords?:
fallbackToCharacters?:
HTML
headers:
sections:
returnEachLine?:
Important: When using the HTML strategy, all general options are ignored. Use headers for header-based splitting or sections for section-based splitting. If used together, sections will be ignored.
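The precedence rule for the HTML strategy can be expressed as a small check to run before building a configuration. This is only a sketch: resolveHtmlSplitMode and HtmlOptionsSketch are hypothetical names, the sections option is assumed to take the same pair format as headers, and only two general options are inspected for brevity.

```typescript
type TagPairs = Array<[string, string]>;

// Hypothetical, reduced view of the HTML strategy options.
interface HtmlOptionsSketch {
  headers?: TagPairs;
  sections?: TagPairs; // Assumption: same pair shape as `headers`.
  maxSize?: number;
  overlap?: number;
}

// Hypothetical helper: reports which mode applies and which options are ignored.
function resolveHtmlSplitMode(opts: HtmlOptionsSketch): {
  mode: "headers" | "sections";
  ignored: string[];
} {
  const ignored: string[] = [];
  // General options are ignored by the HTML strategy.
  if (opts.maxSize !== undefined) ignored.push("maxSize");
  if (opts.overlap !== undefined) ignored.push("overlap");

  if (opts.headers && opts.headers.length > 0) {
    // headers takes precedence; sections is ignored when both are given.
    if (opts.sections) ignored.push("sections");
    return { mode: "headers", ignored };
  }
  if (opts.sections && opts.sections.length > 0) {
    return { mode: "sections", ignored };
  }
  throw new Error("HTML strategy needs either headers or sections");
}

const htmlMode = resolveHtmlSplitMode({
  headers: [["h1", "title"]],
  sections: [["div", "block"]],
  maxSize: 500,
});
console.log(htmlMode);
```

A check like this is mainly useful for surfacing silently ignored options during development.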
Markdown
headers?:
stripHeaders?:
returnEachLine?:
Important: When using the headers option, the markdown strategy ignores all general options and content is split based on the markdown header structure. To use size-based chunking with markdown, omit the headers parameter.
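To illustrate what header-based splitting means in practice, the following self-contained sketch splits markdown on the configured header markers and records each header's text under its metadata key. It is an approximation for illustration, not the library's algorithm: splitByMarkdownHeaders and SketchChunk are hypothetical names, header lines are dropped from the chunk text, and edge cases such as headers inside fenced code are not handled.

```typescript
interface SketchChunk {
  text: string;
  metadata: Record<string, string>;
}

// Hypothetical helper: header-based splitting with [marker, metadataKey] pairs.
function splitByMarkdownHeaders(
  markdown: string,
  headers: Array<[string, string]>,
): SketchChunk[] {
  // Test longer markers first so "##" is not mistaken for "#".
  const ordered = [...headers].sort((a, b) => b[0].length - a[0].length);
  const chunks: SketchChunk[] = [];
  const metadata: Record<string, string> = {};
  let body: string[] = [];

  // Emit the accumulated body (if any) with a snapshot of the current metadata.
  const flush = () => {
    const text = body.join("\n").trim();
    if (text) chunks.push({ text, metadata: { ...metadata } });
    body = [];
  };

  for (const line of markdown.split("\n")) {
    const match = ordered.find(([marker]) => line.startsWith(marker + " "));
    if (!match) {
      body.push(line);
      continue;
    }
    flush();
    const [marker, key] = match;
    // A new header clears metadata from headers of the same or deeper level.
    for (const [m, k] of headers) {
      if (m.length >= marker.length) delete metadata[k];
    }
    metadata[key] = line.slice(marker.length + 1).trim();
  }
  flush();
  return chunks;
}

const sampleMarkdown = [
  "# Introduction",
  "This is a sample document.",
  "## Section 1",
  "First section.",
  "## Section 2",
  "Second section.",
].join("\n");

const headerChunks = splitByMarkdownHeaders(sampleMarkdown, [
  ["#", "title"],
  ["##", "section"],
]);
console.log(headerChunks);
```

Note that chunk boundaries here depend only on header positions, never on size, which is why size-related general options have no effect in this mode.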
Token
encodingName?:
modelName?:
allowedSpecial?:
disallowedSpecial?:
JSON
maxSize:
minSize?:
ensureAscii?:
convertLists?:
Latex
The Latex strategy uses only the general chunking options listed above. It provides LaTeX-aware splitting optimized for mathematical and academic documents.
Return Value
Returns an MDocument instance containing the chunked documents. Each chunk includes:
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}
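As a sketch of how chunks of this shape might be inspected, the helper below builds a one-line preview per node. previewNodes and the mock data are hypothetical and stand in for real chunker output; the interface is repeated so the snippet is self-contained.

```typescript
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}

// Hypothetical helper: one summary line per chunk.
function previewNodes(nodes: DocumentNode[], width = 20): string[] {
  return nodes.map((node, i) => {
    const keys = Object.keys(node.metadata).sort().join(",") || "none";
    // embedding is optional, so report its dimension only when present.
    const embedded = node.embedding
      ? `${node.embedding.length}d`
      : "no embedding";
    return `#${i} [${keys}] (${embedded}) ${node.text.slice(0, width)}`;
  });
}

// Mock nodes for illustration only.
const mockNodes: DocumentNode[] = [
  {
    text: "Here is the first section with some content.",
    metadata: { section: "Section 1", title: "Introduction" },
  },
  { text: "Short.", metadata: {}, embedding: [0.1, 0.2, 0.3] },
];

const previews = previewNodes(mockNodes);
console.log(previews.join("\n"));
```

Because embedding is optional, code that consumes chunks should check for it before use, as the helper does.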