Reference: .chunk()
The .chunk() function splits documents into smaller segments using various strategies and options.
Example
import { Document } from '@mastra/core';
const doc = new Document(`
# Introduction
This is a sample document that we want to split into chunks.
## Section 1
Here is the first section with some content.
## Section 2
Here is another section with different content.
`);
// Basic chunking with defaults
const chunks = await doc.chunk();
// Markdown-specific chunking with header extraction
const chunksWithMetadata = await doc.chunk({
  strategy: 'markdown',
  headers: [['#', 'title'], ['##', 'section']],
  extract: {
    fields: [
      { name: 'summary', description: 'A brief summary of the chunk content' },
      { name: 'keywords', description: 'Key terms found in the chunk' }
    ]
  }
});
Parameters
strategy?:
'recursive' | 'character' | 'token' | 'markdown' | 'html' | 'json' | 'latex'
The chunking strategy to use. If not specified, the default is chosen from the document type: .md files → 'markdown', .html/.htm → 'html', .json → 'json', .tex → 'latex', others → 'recursive'. Each strategy may accept additional options, described under Strategy-Specific Options.
size?:
number
= 512
Maximum size of each chunk, in characters or tokens depending on the strategy.
overlap?:
number
= 50
Number of characters/tokens that overlap between chunks.
separator?:
string
= \n\n
Character(s) to split on. Defaults to double newline for text content.
isSeparatorRegex?:
boolean
= false
Whether the separator is a regex pattern
keepSeparator?:
'start' | 'end'
Whether to keep the separator at the start or end of chunks
extract?:
ExtractParams
Metadata extraction configuration. See [ExtractParams reference](./extract-params) for details.
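To make the interaction between size and overlap concrete, here is a minimal character-based sketch. It is not Mastra's implementation: the function name chunkWithOverlap is hypothetical, it ignores separators entirely, and it only illustrates how consecutive chunks share their trailing and leading characters.

```typescript
// Hypothetical helper for illustration only; not part of Mastra.
// Splits text into windows of at most `size` characters, where each
// window starts `size - overlap` characters after the previous one.
function chunkWithOverlap(text: string, size = 512, overlap = 50): string[] {
  if (size <= 0) throw new RangeError('size must be positive');
  if (overlap < 0 || overlap >= size) {
    throw new RangeError('overlap must be at least 0 and less than size');
  }
  const step = size - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    // Stop once the current window has reached the end of the text.
    if (start + size >= text.length) break;
  }
  return chunks;
}

// chunkWithOverlap('abcdefghij', 4, 1) → ['abcd', 'defg', 'ghij']
```

With size 4 and overlap 1, each chunk repeats the last character of the previous one, which is the effect the overlap parameter describes.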
Strategy-Specific Options
Strategy-specific options are passed as top-level parameters alongside the strategy parameter. For example:
// HTML strategy example
const htmlChunks = await doc.chunk({
  strategy: 'html',
  headers: [['h1', 'title'], ['h2', 'subtitle']], // HTML-specific option
  sections: [['div.content', 'main']], // HTML-specific option
  size: 500 // general option
});
// Markdown strategy example
const markdownChunks = await doc.chunk({
  strategy: 'markdown',
  headers: [['#', 'title'], ['##', 'section']], // Markdown-specific option
  stripHeaders: true, // Markdown-specific option
  overlap: 50 // general option
});
// Token strategy example
const tokenChunks = await doc.chunk({
  strategy: 'token',
  encodingName: 'gpt2', // Token-specific option
  modelName: 'gpt-3.5-turbo', // Token-specific option
  size: 1000 // general option
});
The options documented below are passed directly at the top level of the configuration object, not nested within a separate options object.
HTML
headers:
Array<[string, string]>
Array of [selector, metadata key] pairs for header-based splitting
sections:
Array<[string, string]>
Array of [selector, metadata key] pairs for section-based splitting
returnEachLine?:
boolean
Whether to return each line as a separate chunk
Markdown
headers:
Array<[string, string]>
Array of [header level, metadata key] pairs
stripHeaders?:
boolean
Whether to remove headers from the output
returnEachLine?:
boolean
Whether to return each line as a separate chunk
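The way [header level, metadata key] pairs turn headings into chunk metadata can be sketched as follows. This is a simplified illustration under stated assumptions, not Mastra's actual splitter: the names splitByHeaders and HeaderChunk are hypothetical, only lines that begin with the marker followed by a space count as headers, and size and overlap are ignored.

```typescript
// Hypothetical types and helper for illustration only; not part of Mastra.
interface HeaderChunk {
  text: string;
  metadata: Record<string, string>;
}

function splitByHeaders(
  markdown: string,
  headers: Array<[string, string]>,
  stripHeaders = true
): HeaderChunk[] {
  const chunks: HeaderChunk[] = [];
  let current: Record<string, string> = {};
  let buffer: string[] = [];

  // Emit the buffered lines as one chunk tagged with the active headers.
  const flush = (): void => {
    const text = buffer.join('\n').trim();
    if (text) chunks.push({ text, metadata: { ...current } });
    buffer = [];
  };

  for (const line of markdown.split('\n')) {
    const match = headers.find(([marker]) => line.startsWith(marker + ' '));
    if (match) {
      flush();
      const [marker, key] = match;
      // A new header invalidates any deeper header recorded earlier.
      for (const [m, k] of headers) {
        if (m.length > marker.length) delete current[k];
      }
      current[key] = line.slice(marker.length + 1).trim();
      if (!stripHeaders) buffer.push(line);
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}
```

Applied to the sample document from the Example section with headers [['#', 'title'], ['##', 'section']], this sketch yields three chunks, and the last two carry both a title and a section entry in their metadata.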
Token
encodingName?:
string
Name of the token encoding to use
modelName?:
string
Name of the model for tokenization
JSON
maxSize:
number
Maximum size of each chunk
minSize?:
number
Minimum size of each chunk
ensureAscii?:
boolean
Whether to ensure ASCII encoding
convertLists?:
boolean
Whether to convert lists in the JSON
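The flat, top-level layout of the strategy-specific options listed above can be summarized as a TypeScript discriminated union. This is a sketch only: all type names and the describeOptions helper are hypothetical, the extract parameter is omitted because ExtractParams is defined elsewhere, and reading entries without a trailing ? as required is an assumption about the notation used in this reference.

```typescript
// Hypothetical type names; field names and optionality are copied from
// the parameter lists in this reference, not from Mastra's source.
type KeyPairs = Array<[string, string]>;

interface GeneralOptions {
  size?: number;
  overlap?: number;
  separator?: string;
  isSeparatorRegex?: boolean;
  keepSeparator?: 'start' | 'end';
}

interface HtmlOptions extends GeneralOptions {
  strategy: 'html';
  headers: KeyPairs;
  sections: KeyPairs;
  returnEachLine?: boolean;
}

interface MarkdownOptions extends GeneralOptions {
  strategy: 'markdown';
  headers: KeyPairs;
  stripHeaders?: boolean;
  returnEachLine?: boolean;
}

interface TokenOptions extends GeneralOptions {
  strategy: 'token';
  encodingName?: string;
  modelName?: string;
}

interface JsonOptions extends GeneralOptions {
  strategy: 'json';
  maxSize: number;
  minSize?: number;
  ensureAscii?: boolean;
  convertLists?: boolean;
}

type StrategyOptions = HtmlOptions | MarkdownOptions | TokenOptions | JsonOptions;

// Narrowing on `strategy` gives access to that strategy's own fields,
// which sit next to the general ones rather than in a nested object.
function describeOptions(options: StrategyOptions): string {
  switch (options.strategy) {
    case 'html':
      return `html: ${options.headers.length} header selector(s), ${options.sections.length} section selector(s)`;
    case 'markdown':
      return `markdown: ${options.headers.length} header level(s)`;
    case 'token':
      return `token: encoding ${options.encodingName ?? 'unspecified'}`;
    case 'json':
      return `json: maxSize ${options.maxSize}`;
  }
}
```

Modeling the options this way lets the compiler reject, for example, a stripHeaders field on a token configuration while still allowing size and overlap on every strategy.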
Return Value
Returns an MDocument instance containing the chunked documents. Each chunk includes:
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}
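As an example of working with this shape, the sketch below groups chunks by one metadata key, such as the section key produced by header-based splitting. It assumes you already hold an array of chunk objects matching the interface above; how that array is obtained from the returned instance is not covered here, and groupByMetadata is a hypothetical helper, not a Mastra function.

```typescript
// Same shape as the interface documented above, repeated so the
// example is self-contained.
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}

// Hypothetical helper for illustration only; not part of Mastra.
// Buckets chunks by the string form of metadata[key]; chunks that
// lack the key are collected under '(none)'.
function groupByMetadata(
  nodes: DocumentNode[],
  key: string
): Map<string, DocumentNode[]> {
  const groups = new Map<string, DocumentNode[]>();
  for (const node of nodes) {
    const value = String(node.metadata[key] ?? '(none)');
    const bucket = groups.get(value) ?? [];
    bucket.push(node);
    groups.set(value, bucket);
  }
  return groups;
}
```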