Reference: .chunk()

The .chunk() function splits documents into smaller segments using various strategies and options.

Example

```typescript
import { MDocument } from "@mastra/rag";

const doc = MDocument.fromMarkdown(`
# Introduction
This is a sample document that we want to split into chunks.

## Section 1
Here is the first section with some content.

## Section 2
Here is another section with different content.
`);

// Basic chunking with defaults
const chunks = await doc.chunk();

// Markdown-specific chunking with header extraction
const chunksWithMetadata = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ],
  extract: {
    summary: true, // Extract summaries with default settings
    keywords: true, // Extract keywords with default settings
  },
});
```

Parameters

The following parameters are accepted by all chunking strategies. Important: Each strategy uses only the subset of these parameters that is relevant to it.

strategy?:

'recursive' | 'character' | 'token' | 'markdown' | 'semantic-markdown' | 'html' | 'json' | 'latex' | 'sentence'
The chunking strategy to use. If not specified, the default depends on the document type. Each strategy accepts additional options, described under Strategy-Specific Options. Defaults: .md files → 'markdown', .html/.htm → 'html', .json → 'json', .tex → 'latex', others → 'recursive'

maxSize?:

number
= 4000
Maximum size of each chunk. **Note:** Some strategy configurations (markdown with headers, HTML with headers) ignore this parameter.

size?:

number
**Deprecated:** Use `maxSize` instead. This parameter will be removed in the next major version.

overlap?:

number
= 50
Number of characters/tokens that overlap between chunks.

lengthFunction?:

(text: string) => number
Function to calculate text length. Defaults to character count.

keepSeparator?:

boolean | 'start' | 'end'
Whether to keep the separator at the start or end of chunks

addStartIndex?:

boolean
= false
Whether to add start index metadata to chunks.

stripWhitespace?:

boolean
= true
Whether to strip whitespace from chunks.

extract?:

ExtractParams
Metadata extraction configuration.

See ExtractParams reference for details on the extract parameter.
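
To make the interplay of maxSize, overlap, and addStartIndex concrete, here is a minimal sketch of fixed-size windowing over characters. It is an illustration only, not Mastra's implementation: the function name chunkWithOverlap and the SketchChunk type are hypothetical, and the real strategies also split on separators, tokens, or document structure.

```typescript
// Illustrative sketch only; chunkWithOverlap and SketchChunk are hypothetical
// names, not part of @mastra/rag.
interface SketchChunk {
  text: string;
  metadata: { startIndex?: number };
}

function chunkWithOverlap(
  text: string,
  maxSize: number,
  overlap: number,
  addStartIndex = false,
): SketchChunk[] {
  if (maxSize <= 0) throw new Error("maxSize must be positive");
  if (overlap < 0 || overlap >= maxSize) {
    throw new Error("overlap must be in [0, maxSize)");
  }
  const step = maxSize - overlap; // how far each window advances
  const result: SketchChunk[] = [];
  for (let start = 0; start < text.length; start += step) {
    result.push({
      text: text.slice(start, start + maxSize),
      metadata: addStartIndex ? { startIndex: start } : {},
    });
    if (start + maxSize >= text.length) break; // this window reached the end
  }
  return result;
}

const windows = chunkWithOverlap("abcdefghij", 4, 1, true);
console.log(windows.map((c) => `${c.metadata.startIndex}:${c.text}`));
// → [ '0:abcd', '3:defg', '6:ghij' ]
```

Each window repeats the last overlap characters of the previous one, so text near a chunk boundary appears in both neighbours.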

Strategy-Specific Options

Strategy-specific options are passed as top-level parameters alongside the strategy parameter. For example:

```typescript
// Character strategy example
const characterChunks = await doc.chunk({
  strategy: "character",
  separator: ".", // Character-specific option
  isSeparatorRegex: false, // Character-specific option
  maxSize: 300, // general option
});

// Recursive strategy example
const recursiveChunks = await doc.chunk({
  strategy: "recursive",
  separators: ["\n\n", "\n", " "], // Recursive-specific option
  language: "markdown", // Recursive-specific option
  maxSize: 500, // general option
});

// Sentence strategy example
const sentenceChunks = await doc.chunk({
  strategy: "sentence",
  maxSize: 450, // Required for sentence strategy
  minSize: 50, // Sentence-specific option
  sentenceEnders: ["."], // Sentence-specific option
  fallbackToCharacters: false, // Sentence-specific option
  keepSeparator: true, // general option
});

// HTML strategy example
const htmlChunks = await doc.chunk({
  strategy: "html",
  headers: [
    ["h1", "title"],
    ["h2", "subtitle"],
  ], // HTML-specific option
});

// Markdown strategy example
const markdownChunks = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ], // Markdown-specific option
  stripHeaders: true, // Markdown-specific option
});

// Semantic Markdown strategy example
const semanticChunks = await doc.chunk({
  strategy: "semantic-markdown",
  joinThreshold: 500, // Semantic Markdown-specific option
  modelName: "gpt-3.5-turbo", // Semantic Markdown-specific option
});

// Token strategy example
const tokenChunks = await doc.chunk({
  strategy: "token",
  encodingName: "gpt2", // Token-specific option
  modelName: "gpt-3.5-turbo", // Token-specific option
  maxSize: 1000, // general option
});
```

The options documented below are passed directly at the top level of the configuration object, not nested within a separate options object.

Character

separator?:

string
The separator to split the text on, as used in the character example above (separator: "."). Treated as a regex pattern when isSeparatorRegex is true.

isSeparatorRegex?:

boolean
= false
Whether the separator is a regex pattern
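
The following sketch shows how a single separator string and the isSeparatorRegex flag might interact, matching the character example earlier on this page (separator: "."). The helper splitOnSeparator is hypothetical and is not the library's code. It assumes whitespace is stripped from each piece and empty pieces are dropped, and it leaves out the size-based merging that maxSize would add.

```typescript
// Illustrative sketch only; splitOnSeparator is a hypothetical helper.
function splitOnSeparator(
  text: string,
  separator: string,
  isSeparatorRegex = false,
): string[] {
  // A string passed to split() is matched literally; a RegExp is matched as a pattern.
  const pattern: string | RegExp = isSeparatorRegex
    ? new RegExp(separator)
    : separator;
  return text
    .split(pattern)
    .map((piece) => piece.trim()) // assumed behaviour, similar to stripWhitespace
    .filter((piece) => piece.length > 0);
}

console.log(splitOnSeparator("One. Two! Three.", "."));
// → [ 'One', 'Two! Three' ]
console.log(splitOnSeparator("One. Two! Three.", "[.!]", true));
// → [ 'One', 'Two', 'Three' ]
```

With isSeparatorRegex left at false, a separator such as "." is matched literally rather than as "any character".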

Recursive

separators?:

string[]
Array of separators to try in order of preference. The strategy will attempt to split on the first separator, then fall back to subsequent ones.

isSeparatorRegex?:

boolean
= false
Whether the separators are regex patterns

language?:

Language
Programming or markup language for language-specific splitting behavior. See Language enum for supported values.
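
The separator fallback described for separators can be sketched as a short recursive function. This is a simplified illustration under stated assumptions, not the library's algorithm: recursiveSplit is a hypothetical name, sizes are counted in characters, separators are dropped from the output, and small pieces are not merged back together up to maxSize as a production splitter would typically do.

```typescript
// Illustrative sketch only; recursiveSplit is a hypothetical helper.
function recursiveSplit(
  text: string,
  separators: string[],
  maxSize: number,
): string[] {
  if (text.length <= maxSize) return text.length > 0 ? [text] : [];

  if (separators.length === 0) {
    // No separators left to try: cut into fixed-size pieces.
    const pieces: string[] = [];
    for (let i = 0; i < text.length; i += maxSize) {
      pieces.push(text.slice(i, i + maxSize));
    }
    return pieces;
  }

  // Split on the preferred separator, then retry oversized parts
  // with the remaining, finer-grained separators.
  const [first, ...rest] = separators;
  const out: string[] = [];
  for (const part of text.split(first)) {
    out.push(...recursiveSplit(part, rest, maxSize));
  }
  return out;
}

console.log(
  recursiveSplit("alpha beta\n\ngamma delta epsilon", ["\n\n", " "], 12),
);
// → [ 'alpha beta', 'gamma', 'delta', 'epsilon' ]
```

The first paragraph fits within maxSize and is kept whole; the second does not, so it falls through to the next separator.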

Sentence

maxSize:

number
Maximum size of each chunk (required for sentence strategy)

minSize?:

number
= 50
Minimum size of each chunk. Chunks smaller than this will be merged with adjacent chunks when possible.

targetSize?:

number
Preferred target size for chunks. Defaults to 80% of maxSize. The strategy will try to create chunks close to this size.

sentenceEnders?:

string[]
= ['.', '!', '?']
Array of characters that mark sentence endings for splitting boundaries.

fallbackToWords?:

boolean
= true
Whether to fall back to word-level splitting for sentences that exceed maxSize.

fallbackToCharacters?:

boolean
= true
Whether to fall back to character-level splitting for words that exceed maxSize. Only applies if fallbackToWords is enabled.
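
As a rough illustration of the defaults listed above, the sketch below resolves a sentence-strategy configuration and performs a naive sentence split. The names SentenceSketchOptions, resolveSentenceOptions, and splitSentences are hypothetical. Rounding the 80% target to an integer is an assumption, and the splitter does not handle abbreviations or decimal numbers.

```typescript
// Illustrative sketch only; these names are not part of @mastra/rag.
interface SentenceSketchOptions {
  maxSize: number; // required for the sentence strategy
  minSize?: number;
  targetSize?: number;
  sentenceEnders?: string[];
  fallbackToWords?: boolean;
  fallbackToCharacters?: boolean;
}

function resolveSentenceOptions(
  options: SentenceSketchOptions,
): Required<SentenceSketchOptions> {
  return {
    maxSize: options.maxSize,
    minSize: options.minSize ?? 50,
    // Documented default: 80% of maxSize (rounding is an assumption).
    targetSize: options.targetSize ?? Math.round(options.maxSize * 0.8),
    sentenceEnders: options.sentenceEnders ?? [".", "!", "?"],
    fallbackToWords: options.fallbackToWords ?? true,
    fallbackToCharacters: options.fallbackToCharacters ?? true,
  };
}

function splitSentences(text: string, sentenceEnders: string[]): string[] {
  const sentences: string[] = [];
  let current = "";
  for (let i = 0; i < text.length; i++) {
    current += text[i];
    if (sentenceEnders.indexOf(text[i]) !== -1) {
      if (current.trim()) sentences.push(current.trim());
      current = "";
    }
  }
  if (current.trim()) sentences.push(current.trim()); // trailing text without an ender
  return sentences;
}

const resolved = resolveSentenceOptions({ maxSize: 450 });
console.log(resolved.targetSize); // → 360
console.log(splitSentences("Hi there. How are you? Fine!", resolved.sentenceEnders));
// → [ 'Hi there.', 'How are you?', 'Fine!' ]
```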

HTML

headers:

Array<[string, string]>
Array of [selector, metadata key] pairs for header-based splitting

sections:

Array<[string, string]>
Array of [selector, metadata key] pairs for section-based splitting

returnEachLine?:

boolean
Whether to return each line as a separate chunk

Important: When using the HTML strategy, all general options are ignored. Use headers for header-based splitting or sections for section-based splitting. If used together, sections will be ignored.

Markdown

headers?:

Array<[string, string]>
Array of [header level, metadata key] pairs

stripHeaders?:

boolean
Whether to remove headers from the output

returnEachLine?:

boolean
Whether to return each line as a separate chunk

Important: When using the headers option, the markdown strategy ignores all general options and content is split based on the markdown header structure. To use size-based chunking with markdown, omit the headers parameter.
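
To show what splitting "based on the markdown header structure" can look like, here is a minimal sketch that turns [header level, metadata key] pairs into per-chunk metadata. It is an assumption-laden illustration, not Mastra's implementation: splitByHeaders and HeaderChunk are hypothetical names, header lines are always removed from the chunk text (similar to stripHeaders: true), and "#" lines inside fenced code are not treated specially.

```typescript
// Illustrative sketch only; splitByHeaders and HeaderChunk are hypothetical.
interface HeaderChunk {
  text: string;
  metadata: Record<string, string>;
}

function splitByHeaders(
  markdown: string,
  headers: Array<[string, string]>,
): HeaderChunk[] {
  const result: HeaderChunk[] = [];
  // Headers currently in scope, outermost first.
  let active: Array<{ level: number; key: string; value: string }> = [];
  let body: string[] = [];

  const flush = (): void => {
    const text = body.join("\n").trim();
    if (text) {
      const metadata: Record<string, string> = {};
      for (const h of active) metadata[h.key] = h.value;
      result.push({ text, metadata });
    }
    body = [];
  };

  for (const line of markdown.split("\n")) {
    const match = headers.find(([marker]) => line.startsWith(marker + " "));
    if (match) {
      flush(); // close the chunk that preceded this header
      const [marker, key] = match;
      const level = marker.length;
      // A new header replaces any header at the same or a deeper level.
      active = active.filter((h) => h.level < level);
      active.push({ level, key, value: line.slice(marker.length).trim() });
    } else {
      body.push(line);
    }
  }
  flush();
  return result;
}

const sample = ["# Intro", "Hello.", "## A", "First.", "## B", "Second."].join("\n");
console.log(splitByHeaders(sample, [["#", "title"], ["##", "section"]]));
// → three chunks; the last is
//   { text: 'Second.', metadata: { title: 'Intro', section: 'B' } }
```

Chunk sizes here follow the document's structure alone, which is why size options such as maxSize have no effect in this mode.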

Semantic Markdown

joinThreshold?:

number
= 500
Maximum token count for merging related sections. Sections exceeding this limit individually are left intact, but smaller sections are merged with siblings or parents if the combined size stays under this threshold.

modelName?:

string
Name of the model for tokenization. If provided, the model's underlying tokenization `encodingName` will be used.

encodingName?:

string
= 'cl100k_base'
Name of the token encoding to use. Derived from `modelName` if available.

allowedSpecial?:

Set<string> | 'all'
Set of special tokens allowed during tokenization, or 'all' to allow all special tokens

disallowedSpecial?:

Set<string> | 'all'
= 'all'
Set of special tokens to disallow during tokenization, or 'all' to disallow all special tokens
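
The merging rule described for joinThreshold can be approximated with a greedy pass over adjacent sections. This sketch rests on several assumptions and is not the library's algorithm: mergeSmallSections and countTokens are hypothetical names, tokens are approximated by whitespace-separated words instead of a real encoding such as cl100k_base, the heading hierarchy (siblings and parents) is flattened into a simple list, and a combined size equal to the threshold is allowed.

```typescript
// Illustrative sketch only; these helpers are hypothetical.
// Stand-in tokenizer: counts whitespace-separated words.
const countTokens = (text: string): number =>
  text.split(/\s+/).filter(Boolean).length;

function mergeSmallSections(
  sections: string[],
  joinThreshold: number,
  count: (text: string) => number = countTokens,
): string[] {
  const merged: string[] = [];
  for (const section of sections) {
    const last = merged[merged.length - 1];
    if (last !== undefined && count(last) + count(section) <= joinThreshold) {
      // Combined size stays within the threshold: join with the previous section.
      merged[merged.length - 1] = `${last}\n\n${section}`;
    } else {
      // Oversized sections are kept intact as their own chunk.
      merged.push(section);
    }
  }
  return merged;
}

console.log(mergeSmallSections(["a b", "c d", "e f g h i j", "k"], 5));
// → [ 'a b\n\nc d', 'e f g h i j', 'k' ]
```

The six-word section exceeds the threshold on its own, so it is left as is and nothing is merged into it.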

Token

encodingName?:

string
Name of the token encoding to use

modelName?:

string
Name of the model for tokenization

allowedSpecial?:

Set<string> | 'all'
Set of special tokens allowed during tokenization, or 'all' to allow all special tokens

disallowedSpecial?:

Set<string> | 'all'
Set of special tokens to disallow during tokenization, or 'all' to disallow all special tokens

JSON

maxSize:

number
Maximum size of each chunk

minSize?:

number
Minimum size of each chunk

ensureAscii?:

boolean
Whether to ensure ASCII encoding

convertLists?:

boolean
Whether to convert lists in the JSON

Latex

The Latex strategy uses only the general chunking options listed above. It provides LaTeX-aware splitting optimized for mathematical and academic documents.

Return Value

Returns an MDocument instance containing the chunked documents. Each chunk includes:

```typescript
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}
```