Reference: .chunk()

The .chunk() method splits a document into smaller segments according to a chunking strategy and its options.

Example


import { MDocument } from "@mastra/rag";
 
const doc = MDocument.fromMarkdown(`
# Introduction
This is a sample document that we want to split into chunks.
 
## Section 1
Here is the first section with some content.
 
## Section 2 
Here is another section with different content.
`);
 
// Basic chunking with defaults
const chunks = await doc.chunk();
 
// Markdown-specific chunking with header extraction
const chunksWithMetadata = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ],
  extract: {
    summary: true, // Extract summaries with default settings
    keywords: true, // Extract keywords with default settings
  },
});

Parameters

strategy?:

The chunking strategy to use. If not specified, the default is chosen by document type: .md files → 'markdown', .html/.htm → 'html', .json → 'json', .tex → 'latex', all others → 'recursive'. Each strategy accepts additional options, described under Strategy-Specific Options.
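The extension-based defaults listed above can be pictured as a small lookup table. The sketch below is illustrative only: `defaultStrategyFor` is a hypothetical helper, not part of `@mastra/rag`, and the case-insensitive extension match is an assumption.

```typescript
// Hypothetical helper (not part of @mastra/rag): encodes the default
// strategy mapping described in this section.
type ChunkStrategy = "markdown" | "html" | "json" | "latex" | "recursive";

const DEFAULT_STRATEGY_BY_EXTENSION: Record<string, ChunkStrategy> = {
  ".md": "markdown",
  ".html": "html",
  ".htm": "html",
  ".json": "json",
  ".tex": "latex",
};

function defaultStrategyFor(filename: string): ChunkStrategy {
  const dot = filename.lastIndexOf(".");
  // No extension at all: fall back to the generic strategy.
  if (dot === -1) return "recursive";
  // Assumption: extensions are compared case-insensitively.
  const ext = filename.slice(dot).toLowerCase();
  return DEFAULT_STRATEGY_BY_EXTENSION[ext] ?? "recursive";
}

console.log(defaultStrategyFor("guide.md")); // markdown
console.log(defaultStrategyFor("notes.txt")); // recursive
```

Passing `strategy` explicitly bypasses this kind of lookup.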

size?:

number

= 512

Maximum size of each chunk

overlap?:

number

= 50

Number of characters/tokens that overlap between chunks.
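To see how size and overlap interact, here is a minimal character-based sliding-window sketch. It is not how `@mastra/rag` chunks text (the real strategies also take separators and document structure into account); `slidingWindowChunks` is a hypothetical name used only for illustration, with the defaults of 512 and 50 taken from the parameters above.

```typescript
// Illustrative only: fixed-size character windows that share `overlap`
// characters with the previous window.
function slidingWindowChunks(text: string, size = 512, overlap = 50): string[] {
  if (size <= 0) throw new Error("size must be positive");
  if (overlap < 0 || overlap >= size) {
    throw new Error("overlap must be in the range [0, size)");
  }
  const step = size - overlap; // how far each new window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    // Stop once a window has reached the end of the text.
    if (start + size >= text.length) break;
  }
  return chunks;
}

console.log(slidingWindowChunks("abcdefghij", 4, 1));
// → [ "abcd", "defg", "ghij" ]
```

Each chunk starts `size - overlap` characters after the previous one, so a larger overlap yields more chunks for the same input.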

separator?:

string

= \n\n

Character(s) to split on. Defaults to double newline for text content.

isSeparatorRegex?:

boolean

= false

Whether the separator is a regex pattern

keepSeparator?:

'start' | 'end'

Whether to keep the separator at the start or end of chunks
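The three separator options can be illustrated together. The sketch below is an assumption about the general technique, not the library's code: `splitWithSeparator` and `escapeRegExp` are hypothetical helpers, and empty pieces are dropped for readability.

```typescript
// Illustrative sketch only; not the @mastra/rag implementation.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function splitWithSeparator(
  text: string,
  separator = "\n\n",
  isSeparatorRegex = false,
  keepSeparator?: "start" | "end",
): string[] {
  // A literal separator is escaped so characters like "." match themselves.
  const source = isSeparatorRegex ? separator : escapeRegExp(separator);
  const regex = new RegExp(source, "g");
  const pieces: string[] = [];
  let pieceStart = 0; // index where the current piece begins
  let match: RegExpExecArray | null;

  while ((match = regex.exec(text)) !== null) {
    if (match[0].length === 0) {
      regex.lastIndex++; // skip empty matches so the loop always advances
      continue;
    }
    const sepStart = match.index;
    const sepEnd = sepStart + match[0].length;
    if (keepSeparator === "end") {
      pieces.push(text.slice(pieceStart, sepEnd)); // separator closes this piece
      pieceStart = sepEnd;
    } else {
      pieces.push(text.slice(pieceStart, sepStart));
      // "start": the separator opens the next piece; otherwise it is dropped.
      pieceStart = keepSeparator === "start" ? sepStart : sepEnd;
    }
  }
  pieces.push(text.slice(pieceStart));
  return pieces.filter((piece) => piece !== "");
}

console.log(splitWithSeparator("one. two. three", ". ", false, "end"));
// → [ "one. ", "two. ", "three" ]
```

With `keepSeparator` set to "start", the same input yields "one", ". two", and ". three"; with it unset, the separators are discarded.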

extract?:

ExtractParams

Metadata extraction configuration. See [ExtractParams reference](./extract-params) for details.

Strategy-Specific Options

Strategy-specific options are passed as top-level parameters alongside the strategy parameter. For example:


// HTML strategy example
const htmlChunks = await doc.chunk({
  strategy: "html",
  headers: [
    ["h1", "title"],
    ["h2", "subtitle"],
  ], // HTML-specific option
  sections: [["div.content", "main"]], // HTML-specific option
  size: 500, // general option
});

// Markdown strategy example
const markdownChunks = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ], // Markdown-specific option
  stripHeaders: true, // Markdown-specific option
  overlap: 50, // general option
});

// Token strategy example
const tokenChunks = await doc.chunk({
  strategy: "token",
  encodingName: "gpt2", // Token-specific option
  modelName: "gpt-3.5-turbo", // Token-specific option
  size: 1000, // general option
});

The options documented below are passed directly at the top level of the configuration object, not nested within a separate options object.

HTML

headers:

Array<[string, string]>

Array of [selector, metadata key] pairs for header-based splitting

sections:

Array<[string, string]>

Array of [selector, metadata key] pairs for section-based splitting

returnEachLine?:

boolean

Whether to return each line as a separate chunk

Markdown

headers:

Array<[string, string]>

Array of [header level, metadata key] pairs

stripHeaders?:

boolean

Whether to remove headers from the output

returnEachLine?:

boolean

Whether to return each line as a separate chunk
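As a rough picture of how [header level, metadata key] pairs and stripHeaders could turn headings into chunk metadata, here is a simplified sketch. It is not the library's implementation: `splitByHeaders` and `HeaderChunk` are hypothetical names, and the sketch ignores edge cases such as "#" lines inside fenced code blocks.

```typescript
// Illustrative sketch only; not the @mastra/rag implementation.
interface HeaderChunk {
  text: string;
  metadata: Record<string, string>;
}

function splitByHeaders(
  markdown: string,
  headers: Array<[string, string]>,
  stripHeaders = true,
): HeaderChunk[] {
  const chunks: HeaderChunk[] = [];
  // Headers currently in scope, outermost first.
  let active: Array<{ level: number; key: string; value: string }> = [];
  let buffer: string[] = [];

  const flush = (): void => {
    const text = buffer.join("\n").trim();
    if (text !== "") {
      const metadata: Record<string, string> = {};
      for (const header of active) metadata[header.key] = header.value;
      chunks.push({ text, metadata });
    }
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    const hit = headers.find(([marker]) => line.startsWith(marker + " "));
    if (!hit) {
      buffer.push(line);
      continue;
    }
    flush(); // a header line ends the previous chunk
    const [marker, key] = hit;
    const level = marker.length;
    // A new header replaces any header at the same or a deeper level.
    active = active.filter((header) => header.level < level);
    active.push({ level, key, value: line.slice(level).trim() });
    if (!stripHeaders) buffer.push(line);
  }
  flush();
  return chunks;
}

const sample = "# Intro\nHello.\n\n## Part A\nFirst.\n\n## Part B\nSecond.";
console.log(splitByHeaders(sample, [["#", "title"], ["##", "section"]]));
```

In this sketch the second chunk has the text "First." and carries both `title: "Intro"` and `section: "Part A"`, because the level-1 header stays in scope until another level-1 header replaces it.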

Token

encodingName?:

string

Name of the token encoding to use

modelName?:

string

Name of the model for tokenization

JSON

maxSize:

number

Maximum size of each chunk

minSize?:

number

Minimum size of each chunk

ensureAscii?:

boolean

Whether to ensure ASCII encoding

convertLists?:

boolean

Whether to convert lists in the JSON
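The description of ensureAscii is brief, so the following is only one plausible reading: in many JSON serializers, "ensure ASCII" means escaping every non-ASCII character as a \uXXXX sequence. The sketch shows that technique in isolation; `stringifyAscii` is a hypothetical helper, and whether `@mastra/rag` behaves exactly this way is an assumption.

```typescript
// Illustrative sketch only: escape each non-ASCII UTF-16 code unit as
// \uXXXX in the serialized JSON. Not part of @mastra/rag.
function stringifyAscii(value: object): string {
  return JSON.stringify(value).replace(/[\u0080-\uffff]/g, (ch) => {
    const hex = ch.charCodeAt(0).toString(16);
    return "\\u" + ("0000" + hex).slice(-4); // left-pad to four hex digits
  });
}

const ascii = stringifyAscii({ city: "Zürich" });
console.log(ascii); // {"city":"Z\u00fcrich"}
console.log(JSON.parse(ascii).city === "Zürich"); // true
```

The escaped output is still valid JSON and parses back to the original value, which is why this form is useful when downstream tools expect plain ASCII.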

Return Value

Returns an MDocument instance containing the chunked documents. Each chunk includes:


interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}
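Once chunks are available as objects shaped like DocumentNode, their metadata can drive downstream processing. The sketch below groups chunk text by one metadata key; `groupByMetadata` is a hypothetical helper, and the sample nodes are made up for illustration rather than produced by the library.

```typescript
// Same shape as the DocumentNode interface shown above.
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}

// Hypothetical helper: collect chunk texts under the value of one metadata key.
function groupByMetadata(nodes: DocumentNode[], key: string): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const node of nodes) {
    // Chunks without the key are collected under "(none)".
    const value = String(node.metadata[key] ?? "(none)");
    const texts = groups.get(value) ?? [];
    texts.push(node.text);
    groups.set(value, texts);
  }
  return groups;
}

// Made-up sample data for illustration.
const nodes: DocumentNode[] = [
  { text: "Intro text", metadata: { section: "Intro" } },
  { text: "More intro", metadata: { section: "Intro" } },
  { text: "No section", metadata: {} },
];
const grouped = groupByMetadata(nodes, "section");
console.log(grouped.get("Intro")); // [ "Intro text", "More intro" ]
```

The optional `embedding` field is left untouched here; it only matters once vectors have been computed for the chunks.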