Reference: .chunk()

The .chunk() method splits a document into smaller segments. The chosen strategy and its options determine where the splits fall.

Example

```typescript
import { MDocument } from "@mastra/rag";

const doc = MDocument.fromMarkdown(`
# Introduction
This is a sample document that we want to split into chunks.

## Section 1
Here is the first section with some content.

## Section 2
Here is another section with different content.
`);

// Basic chunking with defaults
const chunks = await doc.chunk();

// Markdown-specific chunking with header extraction
const chunksWithMetadata = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ],
  extract: {
    summary: true, // Extract summaries with default settings
    keywords: true, // Extract keywords with default settings
  },
});
```

Parameters

The following parameters are available for all chunking strategies. Important: Each strategy will only utilize a subset of these parameters relevant to its specific use case.

strategy?:

'recursive' | 'character' | 'token' | 'markdown' | 'html' | 'json' | 'latex' | 'sentence'
The chunking strategy to use. If not specified, the default is chosen from the document type. Some strategies accept additional options, described under Strategy-Specific Options. Defaults: .md files → 'markdown', .html/.htm → 'html', .json → 'json', .tex → 'latex', others → 'recursive'

maxSize?:

number
= 4000
Maximum size of each chunk. **Note:** Some strategy configurations (markdown with headers, HTML with headers) ignore this parameter.

size?:

number
**Deprecated:** Use `maxSize` instead. This parameter will be removed in the next major version.

overlap?:

number
= 50
Number of characters/tokens that overlap between chunks.

lengthFunction?:

(text: string) => number
Function to calculate text length. Defaults to character count.

keepSeparator?:

boolean | 'start' | 'end'
Whether to keep the separator in the output and, if kept, whether it is attached to the start or end of each chunk.

addStartIndex?:

boolean
= false
Whether to add start index metadata to chunks.

stripWhitespace?:

boolean
= true
Whether to strip whitespace from chunks.

extract?:

ExtractParams
Metadata extraction configuration. See [ExtractParams reference](/reference/rag/extract-params) for details.
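To make the interplay of maxSize, overlap, and addStartIndex described above concrete, here is a minimal sketch of fixed-window splitting over characters. It is not how @mastra/rag implements chunking: `splitWithOverlap` and `SketchChunk` are hypothetical names, and the `startIndex` metadata key is an assumption made for illustration.

```typescript
// Hypothetical sketch, not the @mastra/rag implementation.
interface SketchChunk {
  text: string;
  metadata: { startIndex?: number }; // key name is an assumption
}

function splitWithOverlap(
  text: string,
  maxSize: number,
  overlap: number,
  addStartIndex = false,
): SketchChunk[] {
  if (maxSize <= 0) throw new Error("maxSize must be positive");
  if (overlap < 0 || overlap >= maxSize) {
    throw new Error("overlap must be in [0, maxSize)");
  }
  const step = maxSize - overlap; // how far each window advances
  const chunks: SketchChunk[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push({
      text: text.slice(start, start + maxSize),
      metadata: addStartIndex ? { startIndex: start } : {},
    });
    if (start + maxSize >= text.length) break; // this window reached the end
  }
  return chunks;
}

const windowDemo = splitWithOverlap("abcdefghij", 4, 1, true);
console.log(windowDemo.map((c) => `${c.metadata.startIndex}:${c.text}`));
// → ["0:abcd", "3:defg", "6:ghij"]
```

Each window shares its first `overlap` characters with the end of the previous one, which is why the start indices advance by `maxSize - overlap` rather than by `maxSize`.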

Strategy-Specific Options

Strategy-specific options are passed as top-level parameters alongside the strategy parameter. For example:

```typescript
// Character strategy example
const characterChunks = await doc.chunk({
  strategy: "character",
  separator: ".", // Character-specific option
  isSeparatorRegex: false, // Character-specific option
  maxSize: 300, // general option
});

// Recursive strategy example
const recursiveChunks = await doc.chunk({
  strategy: "recursive",
  separators: ["\n\n", "\n", " "], // Recursive-specific option
  language: "markdown", // Recursive-specific option
  maxSize: 500, // general option
});

// Sentence strategy example
const sentenceChunks = await doc.chunk({
  strategy: "sentence",
  maxSize: 450, // Required for sentence strategy
  minSize: 50, // Sentence-specific option
  sentenceEnders: ["."], // Sentence-specific option
  fallbackToCharacters: false, // Sentence-specific option
  keepSeparator: true, // general option
});

// HTML strategy example
const htmlChunks = await doc.chunk({
  strategy: "html",
  headers: [
    ["h1", "title"],
    ["h2", "subtitle"],
  ], // HTML-specific option
});

// Markdown strategy example
const markdownChunks = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ], // Markdown-specific option
  stripHeaders: true, // Markdown-specific option
});

// Token strategy example
const tokenChunks = await doc.chunk({
  strategy: "token",
  encodingName: "gpt2", // Token-specific option
  modelName: "gpt-3.5-turbo", // Token-specific option
  maxSize: 1000, // general option
});
```

The options documented below are passed directly at the top level of the configuration object, not nested within a separate options object.

Character

separator?:

string
= \n\n
Character(s) to split on. Defaults to double newline for text content.

isSeparatorRegex?:

boolean
= false
Whether the separator is a regex pattern
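The difference between a literal separator and a regex separator can be sketched as follows. This is an illustration only, not the library's source: `splitOnSeparator` and `escapeRegExp` are hypothetical helpers, and the step that merges pieces back up to maxSize is left out.

```typescript
// Hypothetical sketch of separator handling; not the @mastra/rag source.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function splitOnSeparator(
  text: string,
  separator = "\n\n",
  isSeparatorRegex = false,
): string[] {
  // A literal separator is escaped so that "." matches only a period,
  // not "any character" as it would in a regex.
  const pattern = new RegExp(
    isSeparatorRegex ? separator : escapeRegExp(separator),
  );
  return text
    .split(pattern)
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

console.log(splitOnSeparator("One. Two. Three.", "."));
// → ["One", "Two", "Three"]
console.log(splitOnSeparator("a1b22c", "\\d+", true));
// → ["a", "b", "c"]
```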

Recursive

separators?:

string[]
Array of separators to try in order of preference. The strategy will attempt to split on the first separator, then fall back to subsequent ones.

isSeparatorRegex?:

boolean
= false
Whether the separators are regex patterns

language?:

Language
Programming or markup language for language-specific splitting behavior. See Language enum for supported values.
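The fallback behavior of separators can be illustrated with a small recursive splitter. This is a simplified sketch under stated assumptions, not the library's algorithm: `recursiveSplit` is a hypothetical function, sizes are measured in characters, separators are treated as literal strings, and overlap is ignored.

```typescript
// Hypothetical sketch of separator fallback; not the @mastra/rag source.
function recursiveSplit(
  text: string,
  separators: string[],
  maxSize: number,
): string[] {
  if (maxSize <= 0) throw new Error("maxSize must be positive");
  if (text.length <= maxSize) return text.length > 0 ? [text] : [];

  const [separator, ...rest] = separators;
  if (separator === undefined) {
    // No separators left: cut into fixed windows as a last resort.
    const pieces: string[] = [];
    for (let i = 0; i < text.length; i += maxSize) {
      pieces.push(text.slice(i, i + maxSize));
    }
    return pieces;
  }

  const chunks: string[] = [];
  let current = "";
  for (const part of text.split(separator)) {
    const candidate = current === "" ? part : current + separator + part;
    if (candidate.length <= maxSize) {
      current = candidate; // keep merging while the chunk still fits
      continue;
    }
    if (current !== "") chunks.push(current);
    if (part.length <= maxSize) {
      current = part;
    } else {
      // The part is still too large: retry it with the next separator.
      chunks.push(...recursiveSplit(part, rest, maxSize));
      current = "";
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

console.log(
  recursiveSplit(
    "alpha beta\n\ngamma delta epsilon\nzeta",
    ["\n\n", "\n", " "],
    12,
  ),
);
// → ["alpha beta", "gamma delta", "epsilon", "zeta"]
```

Listing coarse separators first keeps paragraphs together where they fit, and finer separators are used only on the pieces that are still too large.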

Sentence

maxSize:

number
Maximum size of each chunk (required for sentence strategy)

minSize?:

number
= 50
Minimum size of each chunk. Chunks smaller than this will be merged with adjacent chunks when possible.

targetSize?:

number
Preferred target size for chunks. Defaults to 80% of maxSize. The strategy will try to create chunks close to this size.

sentenceEnders?:

string[]
= ['.', '!', '?']
Array of characters that mark sentence endings for splitting boundaries.

fallbackToWords?:

boolean
= true
Whether to fall back to word-level splitting for sentences that exceed maxSize.

fallbackToCharacters?:

boolean
= true
Whether to fall back to character-level splitting for words that exceed maxSize. Only applies if fallbackToWords is enabled.
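As a rough illustration of how sentenceEnders and maxSize interact, the sketch below splits text at sentence enders and then packs whole sentences into chunks. It assumes character-based sizes and omits minSize, targetSize, and both fallbacks. `splitSentences` and `packSentences` are hypothetical names, and the naive ender check will also split abbreviations and decimals such as "3.14".

```typescript
// Hypothetical sketch of sentence-based chunking; not the @mastra/rag source.
function splitSentences(
  text: string,
  sentenceEnders: string[] = [".", "!", "?"],
): string[] {
  const sentences: string[] = [];
  let current = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    current += ch;
    if (sentenceEnders.includes(ch)) {
      const trimmed = current.trim();
      if (trimmed) sentences.push(trimmed);
      current = "";
    }
  }
  const tail = current.trim(); // text after the last ender, if any
  if (tail) sentences.push(tail);
  return sentences;
}

function packSentences(sentences: string[], maxSize: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (current === "" || candidate.length <= maxSize) {
      current = candidate; // a single oversized sentence is kept whole here
    } else {
      chunks.push(current);
      current = sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const sentences = splitSentences("Hi there. How are you? Fine!");
console.log(sentences);
// → ["Hi there.", "How are you?", "Fine!"]
console.log(packSentences(sentences, 25));
// → ["Hi there. How are you?", "Fine!"]
```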

HTML

headers:

Array<[string, string]>
Array of [selector, metadata key] pairs for header-based splitting

sections:

Array<[string, string]>
Array of [selector, metadata key] pairs for section-based splitting

returnEachLine?:

boolean
Whether to return each line as a separate chunk

Important: When using the HTML strategy, all general options are ignored. Use headers for header-based splitting or sections for section-based splitting. If both are provided, headers is used and sections is ignored.

Markdown

headers?:

Array<[string, string]>
Array of [header level, metadata key] pairs

stripHeaders?:

boolean
Whether to remove headers from the output

returnEachLine?:

boolean
Whether to return each line as a separate chunk

Important: When using the headers option, the markdown strategy ignores all general options and content is split based on the markdown header structure. To use size-based chunking with markdown, omit the headers parameter.
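Header-based splitting can be pictured with the sketch below, which walks the document line by line and attaches the active header values as metadata. It is an assumption-laden illustration rather than the library's implementation: `splitByHeaders` and `SectionChunk` are hypothetical names, header lines are always dropped from the chunk text, and fenced code blocks containing `#` lines are not handled.

```typescript
// Hypothetical sketch of header-based splitting; not the @mastra/rag source.
interface SectionChunk {
  text: string;
  metadata: Record<string, string>;
}

function splitByHeaders(
  markdown: string,
  headers: Array<[string, string]>,
): SectionChunk[] {
  const chunks: SectionChunk[] = [];
  const active: Record<string, string> = {}; // header values currently in scope
  let body: string[] = [];

  const flush = () => {
    const text = body.join("\n").trim();
    if (text) chunks.push({ text, metadata: { ...active } });
    body = [];
  };

  for (const line of markdown.split("\n")) {
    const match = headers.find(([marker]) => line.startsWith(marker + " "));
    if (!match) {
      body.push(line);
      continue;
    }
    flush(); // a header line closes the previous section
    const [marker, key] = match;
    // A new header invalidates values recorded for deeper header levels.
    for (const [otherMarker, otherKey] of headers) {
      if (otherMarker.length > marker.length) delete active[otherKey];
    }
    active[key] = line.slice(marker.length).trim();
  }
  flush();
  return chunks;
}

const sections = splitByHeaders(
  "# Introduction\nSample text.\n## Section 1\nFirst.\n## Section 2\nSecond.",
  [
    ["#", "title"],
    ["##", "section"],
  ],
);
console.log(sections.map((s) => s.metadata));
// → [
//   { title: "Introduction" },
//   { title: "Introduction", section: "Section 1" },
//   { title: "Introduction", section: "Section 2" },
// ]
```

Note that chunk boundaries here depend only on the header structure, which matches the note above that size-related general options do not apply when headers is set.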

Token

encodingName?:

string
Name of the token encoding to use

modelName?:

string
Name of the model for tokenization

allowedSpecial?:

Set<string> | 'all'
Set of special tokens allowed during tokenization, or 'all' to allow all special tokens

disallowedSpecial?:

Set<string> | 'all'
Set of special tokens to disallow during tokenization, or 'all' to disallow all special tokens

JSON

maxSize:

number
Maximum size of each chunk

minSize?:

number
Minimum size of each chunk

ensureAscii?:

boolean
Whether to ensure ASCII encoding

convertLists?:

boolean
Whether to convert lists in the JSON
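The ensureAscii description above is terse. One plausible reading, and it is only an assumption here, is that non-ASCII characters are written as `\uXXXX` escapes when a JSON chunk is serialized. The hypothetical `toAsciiJson` helper below sketches that behavior with standard JavaScript only, and may not match what the library does.

```typescript
// Assumption: "ensure ASCII" means escaping non-ASCII characters as \uXXXX.
// toAsciiJson is a hypothetical helper, not part of @mastra/rag.
function toAsciiJson(value: unknown): string {
  return JSON.stringify(value).replace(
    /[\u0080-\uffff]/g,
    (ch) => "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0"),
  );
}

console.log(toAsciiJson({ city: "Zürich" }));
// → {"city":"Z\u00fcrich"}
```

The escaped output is still valid JSON, so parsing it restores the original characters.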

Latex

The Latex strategy uses only the general chunking options listed above. It provides LaTeX-aware splitting optimized for mathematical and academic documents.

Return Value

Returns an MDocument instance containing the chunked documents. Each chunk includes:

```typescript
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}
```
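As a sketch of how chunks with this shape might be consumed, the example below works on hand-written sample nodes rather than real chunk output. The `groupTextsBy` helper and the sample metadata keys are hypothetical and only illustrate reading text, metadata, and the optional embedding.

```typescript
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}

// Hand-written sample nodes standing in for real chunk output.
const nodes: DocumentNode[] = [
  { text: "Intro text", metadata: { title: "Introduction" } },
  {
    text: "First section",
    metadata: { title: "Introduction", section: "Section 1" },
    embedding: [0.1, 0.2],
  },
];

// embedding is optional, so check for it before use.
const pendingEmbedding = nodes.filter((node) => node.embedding === undefined);
console.log(pendingEmbedding.length); // → 1

// Hypothetical helper: group chunk texts by one metadata key.
function groupTextsBy(
  items: DocumentNode[],
  key: string,
): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const item of items) {
    const value = String(item.metadata[key] ?? "(none)");
    const bucket = groups.get(value) ?? [];
    bucket.push(item.text);
    groups.set(value, bucket);
  }
  return groups;
}

console.log(groupTextsBy(nodes, "section"));
// → Map { "(none)" => ["Intro text"], "Section 1" => ["First section"] }
```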