Reference: .chunk()
The .chunk() function splits documents into smaller segments using various strategies and options.
Example
import { MDocument } from "@mastra/rag";
const doc = MDocument.fromMarkdown(`
# Introduction
This is a sample document that we want to split into chunks.
## Section 1
Here is the first section with some content.
## Section 2
Here is another section with different content.
`);
// Basic chunking with defaults
const chunks = await doc.chunk();
// Markdown-specific chunking with header extraction
const chunksWithMetadata = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ],
  extract: {
    summary: true, // Extract summaries with default settings
    keywords: true, // Extract keywords with default settings
  },
});
Parameters
strategy?:
'recursive' | 'character' | 'token' | 'markdown' | 'html' | 'json' | 'latex'
The chunking strategy to use. If not specified, it defaults based on the document type: .md files → 'markdown', .html/.htm → 'html', .json → 'json', .tex → 'latex', others → 'recursive'. Each strategy accepts additional options, described under Strategy-Specific Options.
size?:
number
= 512
Maximum size of each chunk
overlap?:
number
= 50
Number of characters/tokens that overlap between chunks.
separator?:
string
= \n\n
Character(s) to split on. Defaults to double newline for text content.
isSeparatorRegex?:
boolean
= false
Whether the separator is a regex pattern
keepSeparator?:
'start' | 'end'
Whether to keep the separator at the start or end of chunks
extract?:
ExtractParams
Metadata extraction configuration. See [ExtractParams reference](./extract-params) for details.
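To make the relationship between size and overlap concrete, here is a minimal, self-contained sketch of a fixed-size sliding window over characters. It is an illustration only, not the @mastra/rag implementation: the function name slidingWindowChunks is hypothetical, and the real strategies also take separators and document structure into account.

```typescript
// Illustrative only: a naive character-based splitter (hypothetical helper,
// not part of @mastra/rag). Each window holds at most `size` characters and
// shares `overlap` characters with the previous window.
function slidingWindowChunks(text: string, size: number, overlap: number): string[] {
  if (size <= 0) throw new Error("size must be positive");
  if (overlap < 0 || overlap >= size) throw new Error("overlap must be in [0, size)");

  const step = size - overlap; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this window reached the end
  }
  return chunks;
}

const sample = "abcdefghijklmnopqrstuvwxyz"; // 26 characters
const windows = slidingWindowChunks(sample, 10, 3);
console.log(windows);
// [ 'abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz' ]
```

Adjacent windows share three characters ("hij", "opq", "vwx"), which is what overlap controls. A larger overlap preserves more context across chunk boundaries at the cost of producing more chunks.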
Strategy-Specific Options
Strategy-specific options are passed as top-level parameters alongside the strategy parameter. For example:
// HTML strategy example
const htmlChunks = await doc.chunk({
  strategy: "html",
  headers: [
    ["h1", "title"],
    ["h2", "subtitle"],
  ], // HTML-specific option
  sections: [["div.content", "main"]], // HTML-specific option
  size: 500, // general option
});
// Markdown strategy example
const markdownChunks = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ], // Markdown-specific option
  stripHeaders: true, // Markdown-specific option
  overlap: 50, // general option
});
// Token strategy example
const tokenChunks = await doc.chunk({
  strategy: "token",
  encodingName: "gpt2", // Token-specific option
  modelName: "gpt-3.5-turbo", // Token-specific option
  size: 1000, // general option
});
The options documented below are passed directly at the top level of the configuration object, not nested within a separate options object.
HTML
headers:
Array<[string, string]>
Array of [selector, metadata key] pairs for header-based splitting
sections:
Array<[string, string]>
Array of [selector, metadata key] pairs for section-based splitting
returnEachLine?:
boolean
Whether to return each line as a separate chunk
Markdown
headers:
Array<[string, string]>
Array of [header level, metadata key] pairs
stripHeaders?:
boolean
Whether to remove headers from the output
returnEachLine?:
boolean
Whether to return each line as a separate chunk
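The [header level, metadata key] pairs can be read as "when a header of this level is seen, record its text under this metadata key". The following simplified sketch shows that idea in isolation. It is an assumption about the general technique, not the library's code: splitByHeaders, HeaderPair, and SketchChunk are hypothetical names, and the sketch ignores code fences, size limits, and overlap.

```typescript
// Hypothetical types and helper for illustration; not part of @mastra/rag.
type HeaderPair = [marker: string, metadataKey: string];

interface SketchChunk {
  text: string;
  metadata: Record<string, string>;
}

function splitByHeaders(markdown: string, headers: HeaderPair[], stripHeaders = true): SketchChunk[] {
  const result: SketchChunk[] = [];
  let current: Record<string, string> = {}; // header metadata in effect
  let buffer: string[] = []; // body lines collected since the last header

  const flush = (): void => {
    const text = buffer.join("\n").trim();
    if (text.length > 0) result.push({ text, metadata: { ...current } });
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    // "# " does not match "## Title", so the trailing space disambiguates levels.
    const match = headers.find(([marker]) => line.startsWith(marker + " "));
    if (!match) {
      buffer.push(line);
      continue;
    }
    flush();
    const [marker, key] = match;
    // Keep metadata only from shallower headers; same or deeper levels reset.
    const next: Record<string, string> = {};
    for (const [m, k] of headers) {
      const kept = current[k];
      if (m.length < marker.length && kept !== undefined) next[k] = kept;
    }
    next[key] = line.slice(marker.length + 1).trim();
    current = next;
    if (!stripHeaders) buffer.push(line); // keep the header line in the chunk text
  }
  flush();
  return result;
}

const md = [
  "# Introduction",
  "Intro text.",
  "## Section 1",
  "First section.",
  "## Section 2",
  "Second section.",
].join("\n");

const sections = splitByHeaders(md, [["#", "title"], ["##", "section"]]);
console.log(sections);
```

With stripHeaders left at true, the three resulting chunks contain only body text, and the header text survives in metadata (for example, the last chunk carries title "Introduction" and section "Section 2").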
Token
encodingName?:
string
Name of the token encoding to use
modelName?:
string
Name of the model for tokenization
JSON
maxSize:
number
Maximum size of each chunk
minSize?:
number
Minimum size of each chunk
ensureAscii?:
boolean
Whether to ensure ASCII encoding
convertLists?:
boolean
Whether to convert lists in the JSON
Return Value
Returns an MDocument instance containing the chunked documents. Each chunk includes:
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}
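Because every chunk carries a free-form metadata record, downstream code often groups or filters chunks by a metadata key. The sketch below shows one way to do that against the DocumentNode shape above. The helper groupByMetadata and the sample nodes are hypothetical and hand-written for illustration; they are not output from .chunk().

```typescript
// Same shape as the DocumentNode interface documented above.
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}

// Hypothetical helper: collects chunk text under the value of one metadata key.
// Chunks without that key are grouped under "(none)".
function groupByMetadata(nodes: DocumentNode[], key: string): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const node of nodes) {
    const label = String(node.metadata[key] ?? "(none)");
    const texts = groups.get(label) ?? [];
    texts.push(node.text);
    groups.set(label, texts);
  }
  return groups;
}

// Hand-written sample data standing in for real chunks.
const exampleNodes: DocumentNode[] = [
  { text: "Here is the first section.", metadata: { section: "Section 1" } },
  { text: "More on the first section.", metadata: { section: "Section 1" } },
  { text: "Here is another section.", metadata: { section: "Section 2" } },
  { text: "Untitled preamble.", metadata: {} },
];

const grouped = groupByMetadata(exampleNodes, "section");

// embedding is optional, so nodes that still need one can be found with a filter.
const withoutEmbedding = exampleNodes.filter((node) => node.embedding === undefined);
console.log(grouped, withoutEmbedding.length);
```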