Reference: .chunk()

The .chunk() function splits documents into smaller segments using various strategies and options.

Example

import { MDocument } from "@mastra/rag";

const doc = MDocument.fromMarkdown(`
# Introduction
This is a sample document that we want to split into chunks.

## Section 1
Here is the first section with some content.

## Section 2 
Here is another section with different content.
`);

// Basic chunking with defaults
const chunks = await doc.chunk();

// Markdown-specific chunking with header extraction
const chunksWithMetadata = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ],
  extract: {
    summary: true, // Extract summaries with default settings
    keywords: true, // Extract keywords with default settings
  },
});

Parameters

The following parameters are available for all chunking strategies. Important: each strategy uses only the subset of these parameters that is relevant to it.

strategy?:

The chunking strategy to use. If not specified, the default depends on the document type: .md files → 'markdown', .html/.htm → 'html', .json → 'json', .tex → 'latex', others → 'recursive'. Each strategy accepts additional options, described under Strategy-Specific Options.

maxSize?:

number

= 4000

Maximum size of each chunk. **Note:** Some strategy configurations (markdown with headers, HTML with headers) ignore this parameter.

size?:

number

**Deprecated:** Use `maxSize` instead. This parameter will be removed in the next major version.

overlap?:

number

= 50

Number of characters/tokens that overlap between chunks.

lengthFunction?:

(text: string) => number

Function to calculate text length. Defaults to character count.

keepSeparator?:

boolean | 'start' | 'end'

Whether to keep the separator at the start or end of chunks

addStartIndex?:

boolean

= false

Whether to add start index metadata to chunks.

stripWhitespace?:

boolean

= true

Whether to strip whitespace from chunks.

extract?:

ExtractParams

Metadata extraction configuration.

See ExtractParams reference for details on the extract parameter.
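
To make the interaction of maxSize, overlap, addStartIndex, and stripWhitespace concrete, here is a minimal character-based sketch. It is not the @mastra/rag implementation: `windowChunks`, `SketchChunk`, and `SketchOptions` are hypothetical names, and the fixed sliding window is a simplification that ignores separators and `lengthFunction`.

```typescript
// Hypothetical illustration; this is not how @mastra/rag chunks internally.
interface SketchChunk {
  text: string;
  metadata: { startIndex?: number };
}

interface SketchOptions {
  maxSize: number;
  overlap: number;
  addStartIndex?: boolean;
  stripWhitespace?: boolean;
}

function windowChunks(text: string, options: SketchOptions): SketchChunk[] {
  const { maxSize, overlap, addStartIndex = false, stripWhitespace = true } = options;
  if (maxSize <= 0) throw new RangeError("maxSize must be positive");
  if (overlap < 0 || overlap >= maxSize) {
    throw new RangeError("overlap must satisfy 0 <= overlap < maxSize");
  }

  // Each window starts `step` characters after the previous one,
  // so consecutive windows share `overlap` characters.
  const step = maxSize - overlap;
  const chunks: SketchChunk[] = [];

  for (let start = 0; start < text.length; start += step) {
    const raw = text.slice(start, start + maxSize);
    const body = stripWhitespace ? raw.trim() : raw;
    if (body.length > 0) {
      // startIndex records where the untrimmed window began in the source text.
      chunks.push({ text: body, metadata: addStartIndex ? { startIndex: start } : {} });
    }
    if (start + maxSize >= text.length) break; // the last window reached the end
  }
  return chunks;
}

const demo = windowChunks("abcdefghij", { maxSize: 4, overlap: 1, addStartIndex: true });
console.log(demo.map((c) => `${c.metadata.startIndex}:${c.text}`));
// prints [ '0:abcd', '3:defg', '6:ghij' ]
```

Each chunk repeats the last character of the previous one, which is what an overlap of 1 means in this sketch.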

Strategy-Specific Options

Strategy-specific options are passed as top-level parameters alongside the strategy parameter. For example:

// Character strategy example
const characterChunks = await doc.chunk({
  strategy: "character",
  separator: ".", // Character-specific option
  isSeparatorRegex: false, // Character-specific option
  maxSize: 300, // general option
});

// Recursive strategy example
const recursiveChunks = await doc.chunk({
  strategy: "recursive",
  separators: ["\n\n", "\n", " "], // Recursive-specific option
  language: "markdown", // Recursive-specific option
  maxSize: 500, // general option
});

// Sentence strategy example
const sentenceChunks = await doc.chunk({
  strategy: "sentence",
  maxSize: 450, // Required for sentence strategy
  minSize: 50, // Sentence-specific option
  sentenceEnders: ["."], // Sentence-specific option
  fallbackToCharacters: false, // Sentence-specific option
  keepSeparator: true, // general option
});

// HTML strategy example
const htmlChunks = await doc.chunk({
  strategy: "html",
  headers: [
    ["h1", "title"],
    ["h2", "subtitle"],
  ], // HTML-specific option
});

// Markdown strategy example
const markdownChunks = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ], // Markdown-specific option
  stripHeaders: true, // Markdown-specific option
});

// Semantic Markdown strategy example
const semanticMarkdownChunks = await doc.chunk({
  strategy: "semantic-markdown",
  joinThreshold: 500, // Semantic Markdown-specific option
  modelName: "gpt-3.5-turbo", // Semantic Markdown-specific option
});

// Token strategy example
const tokenChunks = await doc.chunk({
  strategy: "token",
  encodingName: "gpt2", // Token-specific option
  modelName: "gpt-3.5-turbo", // Token-specific option
  maxSize: 1000, // general option
});

The options documented below are passed directly at the top level of the configuration object, not nested within a separate options object.

Character

separators?:

string[]

Array of separators to try in order of preference. The strategy will attempt to split on the first separator, then fall back to subsequent ones.

isSeparatorRegex?:

boolean

= false

Whether the separator is a regex pattern

Recursive

separators?:

string[]

Array of separators to try in order of preference. The strategy will attempt to split on the first separator, then fall back to subsequent ones.

isSeparatorRegex?:

boolean

= false

Whether the separators are regex patterns

language?:

Language

Programming or markup language for language-specific splitting behavior. See Language enum for supported values.
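
The separator fallback described for `separators` can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `splitWithFallback` is a hypothetical helper, separators are treated as literal strings, the separators themselves are dropped, and small pieces are not merged back together up to maxSize the way a production splitter typically would.

```typescript
// Hypothetical sketch of "try the first separator, then fall back to the next".
function splitWithFallback(text: string, separators: string[], maxSize: number): string[] {
  if (maxSize <= 0) throw new RangeError("maxSize must be positive");
  if (text.length === 0) return [];
  if (text.length <= maxSize) return [text]; // already small enough

  const [first, ...rest] = separators;
  if (first === undefined) {
    // No separators left: cut at fixed offsets as a last resort.
    const pieces: string[] = [];
    for (let i = 0; i < text.length; i += maxSize) {
      pieces.push(text.slice(i, i + maxSize));
    }
    return pieces;
  }

  // Split on the preferred separator; any part that is still too large
  // is split again using the remaining, finer-grained separators.
  const result: string[] = [];
  for (const part of text.split(first)) {
    result.push(...splitWithFallback(part, rest, maxSize));
  }
  return result;
}

console.log(splitWithFallback("aa bb\n\ncc dd", ["\n\n", " "], 5));
// prints [ 'aa bb', 'cc dd' ]
console.log(splitWithFallback("aaaa bbbb", ["\n\n", " "], 4));
// prints [ 'aaaa', 'bbbb' ]
```

In the second call the paragraph separator does not occur in the text, so the sketch falls through to the space separator.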

Sentence

maxSize:

number

Maximum size of each chunk (required for sentence strategy)

minSize?:

number

= 50

Minimum size of each chunk. Chunks smaller than this will be merged with adjacent chunks when possible.

targetSize?:

number

Preferred target size for chunks. Defaults to 80% of maxSize. The strategy will try to create chunks close to this size.

sentenceEnders?:

string[]

= ['.', '!', '?']

Array of characters that mark sentence endings for splitting boundaries.

fallbackToWords?:

boolean

= true

Whether to fall back to word-level splitting for sentences that exceed maxSize.

fallbackToCharacters?:

boolean

= true

Whether to fall back to character-level splitting for words that exceed maxSize. Only applies if fallbackToWords is enabled.
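
As a rough illustration of how `sentenceEnders` and `maxSize` could work together, the sketch below splits text at sentence-ending characters and then packs whole sentences into chunks. `splitSentences` and `packSentences` are hypothetical helpers, not part of @mastra/rag. The sketch omits `minSize`, `targetSize`, and both fallback options, so a single sentence longer than maxSize is kept whole here.

```typescript
// Hypothetical sketch; the real sentence strategy handles more cases.
function splitSentences(text: string, sentenceEnders: string[] = [".", "!", "?"]): string[] {
  const sentences: string[] = [];
  let current = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    current += ch;
    if (sentenceEnders.indexOf(ch) !== -1) {
      const sentence = current.trim();
      if (sentence.length > 0) sentences.push(sentence);
      current = "";
    }
  }
  const tail = current.trim(); // text after the last sentence ender
  if (tail.length > 0) sentences.push(tail);
  return sentences;
}

function packSentences(sentences: string[], maxSize: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current === "" ? sentence : `${current} ${sentence}`;
    if (current === "" || candidate.length <= maxSize) {
      current = candidate; // the sentence still fits (or starts a new chunk)
    } else {
      chunks.push(current); // close the chunk and start the next one
      current = sentence;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const sentences = splitSentences("Hi there. How are you? Fine!");
console.log(packSentences(sentences, 22));
// prints [ 'Hi there. How are you?', 'Fine!' ]
```

Splitting on single characters also breaks on decimals and abbreviations such as "3.14" or "e.g.", which is one reason this stays a sketch.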

HTML

headers:

Array<[string, string]>

Array of [selector, metadata key] pairs for header-based splitting

sections:

Array<[string, string]>

Array of [selector, metadata key] pairs for section-based splitting

returnEachLine?:

boolean

Whether to return each line as a separate chunk

Important: When using the HTML strategy, all general options are ignored. Use headers for header-based splitting or sections for section-based splitting. If used together, sections will be ignored.

Markdown

headers?:

Array<[string, string]>

Array of [header level, metadata key] pairs

stripHeaders?:

boolean

Whether to remove headers from the output

returnEachLine?:

boolean

Whether to return each line as a separate chunk

Important: When using the headers option, the markdown strategy ignores all general options and content is split based on the markdown header structure. To use size-based chunking with markdown, omit the headers parameter.
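
To show what header-based splitting with [header level, metadata key] pairs might produce, here is a small self-contained sketch. `splitByHeaders` and `HeaderChunk` are hypothetical, and the way metadata is inherited from parent headers is an assumption for illustration, not documented library behavior. Header lines are always dropped from the chunk text, similar in spirit to `stripHeaders: true`.

```typescript
// Hypothetical sketch of header-based markdown splitting.
interface HeaderChunk {
  text: string;
  metadata: Record<string, string>;
}

function splitByHeaders(markdown: string, headers: Array<[string, string]>): HeaderChunk[] {
  const chunks: HeaderChunk[] = [];
  const active: Record<string, string> = {}; // header values currently in scope
  let body: string[] = [];

  const flush = (): void => {
    const text = body.join("\n").trim();
    if (text.length > 0) chunks.push({ text, metadata: { ...active } });
    body = [];
  };

  for (const line of markdown.split("\n")) {
    // "# " does not match a "## " line because the second character differs.
    const match = headers.find(([marker]) => line.startsWith(marker + " "));
    if (match === undefined) {
      body.push(line);
      continue;
    }
    flush(); // a header closes the chunk collected so far
    const [marker, key] = match;
    // Assumption: a new header clears values recorded for deeper levels.
    for (const [otherMarker, otherKey] of headers) {
      if (otherMarker.length > marker.length) delete active[otherKey];
    }
    active[key] = line.slice(marker.length + 1).trim();
  }
  flush();
  return chunks;
}

const sample = [
  "# Introduction",
  "Sample text.",
  "",
  "## Section 1",
  "First.",
  "",
  "## Section 2",
  "Second.",
].join("\n");

console.log(splitByHeaders(sample, [["#", "title"], ["##", "section"]]));
```

With the sample above, the sketch yields three chunks; the last one has the text "Second." and the metadata `{ title: "Introduction", section: "Section 2" }`.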

Semantic Markdown

joinThreshold?:

number

= 500

Maximum token count for merging related sections. Sections exceeding this limit individually are left intact, but smaller sections are merged with siblings or parents if the combined size stays under this threshold.

modelName?:

string

Name of the model for tokenization. If provided, the model's underlying tokenization `encodingName` will be used.

encodingName?:

string

= cl100k_base

Name of the token encoding to use. Derived from `modelName` if available.

allowedSpecial?:

Set<string> | 'all'

Set of special tokens allowed during tokenization, or 'all' to allow all special tokens

disallowedSpecial?:

Set<string> | 'all'

= all

Set of special tokens to disallow during tokenization, or 'all' to disallow all special tokens
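
The merging rule described for `joinThreshold` can be approximated with the sketch below. It makes several assumptions: sections are a flat list rather than a header tree, only adjacent sections are merged, and token counts are approximated by whitespace-separated words instead of a real tokenizer. `Section`, `approxTokens`, and `mergeSmallSections` are hypothetical names.

```typescript
// Hypothetical sketch of threshold-based merging; not the library's algorithm.
interface Section {
  title: string;
  text: string;
}

// Stand-in for a tokenizer: counts whitespace-separated words.
const approxTokens = (text: string): number => {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
};

function mergeSmallSections(sections: Section[], joinThreshold: number): Section[] {
  const merged: Section[] = [];
  for (const section of sections) {
    const previous = merged[merged.length - 1];
    const fits =
      previous !== undefined &&
      approxTokens(previous.text) + approxTokens(section.text) <= joinThreshold;
    if (previous !== undefined && fits) {
      // Combined size stays within the threshold: fold into the previous section.
      merged[merged.length - 1] = {
        title: previous.title,
        text: `${previous.text}\n\n${section.text}`,
      };
    } else {
      // Too large to combine (or first section): keep it as its own entry.
      merged.push(section);
    }
  }
  return merged;
}

const mergedSections = mergeSmallSections(
  [
    { title: "A", text: "one two" },
    { title: "B", text: "three" },
    { title: "C", text: "four five six seven" },
  ],
  3,
);
console.log(mergedSections.map((s) => s.title)); // prints [ 'A', 'C' ]
```

Section C is larger than the threshold on its own, so it is left intact rather than split, which matches the behavior the description attributes to oversized sections.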

Token

encodingName?:

string

Name of the token encoding to use

modelName?:

string

Name of the model for tokenization

allowedSpecial?:

Set<string> | 'all'

Set of special tokens allowed during tokenization, or 'all' to allow all special tokens

disallowedSpecial?:

Set<string> | 'all'

Set of special tokens to disallow during tokenization, or 'all' to disallow all special tokens

JSON

maxSize:

number

Maximum size of each chunk

minSize?:

number

Minimum size of each chunk

ensureAscii?:

boolean

Whether to ensure ASCII encoding

convertLists?:

boolean

Whether to convert lists in the JSON

Latex

The Latex strategy uses only the general chunking options listed above. It provides LaTeX-aware splitting optimized for mathematical and academic documents.

Return Value

Returns an MDocument instance containing the chunked documents. Each chunk includes:

interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}
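
Once chunks have this shape, downstream code can work with them as plain objects. The sketch below groups chunk texts by a metadata key. `groupTextsBy` and `sampleNodes` are hypothetical, and the sample data is made up to resemble what header extraction might attach. It does not call the library.

```typescript
// The interface is repeated here so the sketch is self-contained.
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}

// Hypothetical sample data shaped like the interface above.
const sampleNodes: DocumentNode[] = [
  { text: "Intro text", metadata: { title: "Introduction" } },
  { text: "First part", metadata: { title: "Introduction", section: "Section 1" } },
  { text: "Second part", metadata: { title: "Introduction", section: "Section 1" } },
];

// Groups chunk texts by the string value stored under `key` in metadata.
function groupTextsBy(nodes: DocumentNode[], key: string): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const node of nodes) {
    const value: unknown = node.metadata[key];
    const label = typeof value === "string" ? value : "(none)";
    const bucket = groups[label] ?? [];
    bucket.push(node.text);
    groups[label] = bucket;
  }
  return groups;
}

console.log(groupTextsBy(sampleNodes, "section"));
// prints { '(none)': [ 'Intro text' ], 'Section 1': [ 'First part', 'Second part' ] }
```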