Reference: .chunk()

The .chunk() method splits a document into smaller segments according to a configurable strategy and options.

Example

import { Document } from '@mastra/core';
 
const doc = new Document(`
# Introduction
This is a sample document that we want to split into chunks.
 
## Section 1
Here is the first section with some content.
 
## Section 2 
Here is another section with different content.
`);
 
// Basic chunking with defaults
const chunks = await doc.chunk();
 
// Markdown-specific chunking with header extraction
const chunksWithMetadata = await doc.chunk({
  strategy: 'markdown',
  headers: [['#', 'title'], ['##', 'section']],
  extract: {
    fields: [
      { name: 'summary', description: 'A brief summary of the chunk content' },
      { name: 'keywords', description: 'Key terms found in the chunk' }
    ]
  }
});

Parameters

strategy?:

'recursive' | 'character' | 'token' | 'markdown' | 'html' | 'json' | 'latex'
The chunking strategy to use. If not specified, the default is chosen from the document type: .md files → 'markdown', .html/.htm → 'html', .json → 'json', .tex → 'latex', others → 'recursive'. Some strategies accept additional options, described under Strategy-Specific Options.
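The default selection described above can be sketched as a lookup on the file extension. This is only an illustration of the documented mapping: defaultStrategyFor is a hypothetical helper, not part of the library, and matching the extension case-insensitively is an assumption.

```typescript
type ChunkStrategy =
  | 'recursive' | 'character' | 'token' | 'markdown' | 'html' | 'json' | 'latex';

// Hypothetical helper: mirrors the documented defaults, not the library's code.
function defaultStrategyFor(filename: string): ChunkStrategy {
  const dot = filename.lastIndexOf('.');
  // Assumption: extensions are compared case-insensitively.
  const ext = dot >= 0 ? filename.slice(dot + 1).toLowerCase() : '';
  switch (ext) {
    case 'md':
      return 'markdown';
    case 'html':
    case 'htm':
      return 'html';
    case 'json':
      return 'json';
    case 'tex':
      return 'latex';
    default:
      return 'recursive'; // everything else falls back to 'recursive'
  }
}
```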

size?:

number
= 512
Maximum size of each chunk, measured in the same units as overlap (characters or tokens).

overlap?:

number
= 50
Number of characters/tokens that overlap between chunks.
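How size and overlap interact can be illustrated with a plain character-based sliding window. This is a simplified sketch, not the library's algorithm (the built-in strategies also split on separators or tokens); slidingWindows is a hypothetical helper.

```typescript
// Hypothetical helper: fixed-size windows where consecutive windows
// share `overlap` characters.
function slidingWindows(text: string, size: number, overlap: number): string[] {
  if (size <= 0) throw new Error('size must be positive');
  if (overlap < 0 || overlap >= size) {
    throw new Error('overlap must be at least 0 and smaller than size');
  }
  const step = size - overlap; // how far each new window advances
  const windows: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    windows.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return windows;
}

// slidingWindows('abcdefghij', 4, 1) returns ['abcd', 'defg', 'ghij']
```

With size 4 and overlap 1, each window starts on the last character of the previous one, so content near a boundary appears in both chunks.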

separator?:

string
= \n\n
Character(s) to split on. Defaults to double newline for text content.

isSeparatorRegex?:

boolean
= false
Whether the separator is a regex pattern

keepSeparator?:

'start' | 'end'
Whether to keep the separator at the start or end of chunks
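The three separator options can be read together as follows. The sketch below is an assumption about their intended behavior, written as a standalone function (splitOnSeparator and escapeRegex are hypothetical names), and it assumes a regex separator contains no capture groups of its own.

```typescript
// Escape regex metacharacters so a literal separator can be used in a RegExp.
function escapeRegex(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper illustrating separator, isSeparatorRegex and keepSeparator.
function splitOnSeparator(
  text: string,
  separator: string = '\n\n',
  isSeparatorRegex: boolean = false,
  keepSeparator?: 'start' | 'end',
): string[] {
  const pattern = isSeparatorRegex ? separator : escapeRegex(separator);
  if (!keepSeparator) {
    // Separators are dropped entirely.
    return text.split(new RegExp(pattern)).filter((part) => part !== '');
  }
  // A capture group makes split() return the separators as well:
  // [text0, sep0, text1, sep1, text2, ...]
  const parts = text.split(new RegExp('(' + pattern + ')'));
  const out: string[] = [];
  if (keepSeparator === 'end') {
    // Attach each separator to the text that precedes it.
    for (let i = 0; i < parts.length; i += 2) {
      out.push(parts[i] + (parts[i + 1] ?? ''));
    }
  } else {
    // Attach each separator to the text that follows it.
    out.push(parts[0]);
    for (let i = 1; i < parts.length; i += 2) {
      out.push(parts[i] + (parts[i + 1] ?? ''));
    }
  }
  return out.filter((part) => part !== '');
}

// splitOnSeparator('a. b. c', '. ', false, 'end')   returns ['a. ', 'b. ', 'c']
// splitOnSeparator('a. b. c', '. ', false, 'start') returns ['a', '. b', '. c']
```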

extract?:

ExtractParams
Metadata extraction configuration. See [ExtractParams reference](./extract-params) for details.

Strategy-Specific Options

Strategy-specific options are passed as top-level parameters alongside the strategy parameter. For example:

// HTML strategy example
const htmlChunks = await doc.chunk({
  strategy: 'html',
  headers: [['h1', 'title'], ['h2', 'subtitle']], // HTML-specific option
  sections: [['div.content', 'main']], // HTML-specific option
  size: 500 // general option
});

// Markdown strategy example
const markdownChunks = await doc.chunk({
  strategy: 'markdown',
  headers: [['#', 'title'], ['##', 'section']], // Markdown-specific option
  stripHeaders: true, // Markdown-specific option
  overlap: 50 // general option
});

// Token strategy example
const tokenChunks = await doc.chunk({
  strategy: 'token',
  encodingName: 'gpt2', // Token-specific option
  modelName: 'gpt-3.5-turbo', // Token-specific option
  size: 1000 // general option
});

The options documented below are passed directly at the top level of the configuration object, not nested within a separate options object.
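One way to picture this flat shape is as a TypeScript union discriminated by strategy. The types below are an illustrative sketch built only from the option names on this page; they are not the library's actual type definitions, and metadataKeys is a hypothetical helper.

```typescript
// Illustrative types only; names ending in "Sketch" are not library exports.
interface GeneralOptionsSketch {
  size?: number;
  overlap?: number;
}

interface HtmlOptionsSketch extends GeneralOptionsSketch {
  strategy: 'html';
  headers: Array<[string, string]>;
  sections: Array<[string, string]>;
  returnEachLine?: boolean;
}

interface MarkdownOptionsSketch extends GeneralOptionsSketch {
  strategy: 'markdown';
  headers: Array<[string, string]>;
  stripHeaders?: boolean;
  returnEachLine?: boolean;
}

interface TokenOptionsSketch extends GeneralOptionsSketch {
  strategy: 'token';
  encodingName?: string;
  modelName?: string;
}

type ChunkOptionsSketch =
  | HtmlOptionsSketch
  | MarkdownOptionsSketch
  | TokenOptionsSketch;

// Narrowing on `strategy` exposes the strategy-specific fields at the top
// level, next to general options such as size and overlap.
function metadataKeys(opts: ChunkOptionsSketch): string[] {
  switch (opts.strategy) {
    case 'html':
      return [...opts.headers, ...opts.sections].map(([, key]) => key);
    case 'markdown':
      return opts.headers.map(([, key]) => key);
    case 'token':
      return []; // the token options listed here define no metadata keys
  }
}
```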

HTML

headers:

Array<[string, string]>
Array of [selector, metadata key] pairs for header-based splitting

sections:

Array<[string, string]>
Array of [selector, metadata key] pairs for section-based splitting

returnEachLine?:

boolean
Whether to return each line as a separate chunk

Markdown

headers:

Array<[string, string]>
Array of [header level, metadata key] pairs

stripHeaders?:

boolean
Whether to remove headers from the output

returnEachLine?:

boolean
Whether to return each line as a separate chunk
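To show how [header level, metadata key] pairs and stripHeaders might translate into chunks, here is a minimal standalone sketch. It is not the library's implementation: splitByMarkdownHeaders is a hypothetical function, it ignores size, overlap, and fenced code blocks, and it only recognizes headers written as the marker followed by a space.

```typescript
interface HeaderChunk {
  text: string;
  metadata: Record<string, string>;
}

// Hypothetical helper: one chunk per header section, with the enclosing
// header titles recorded under the configured metadata keys.
function splitByMarkdownHeaders(
  markdown: string,
  headers: Array<[string, string]>,
  stripHeaders: boolean,
): HeaderChunk[] {
  const chunks: HeaderChunk[] = [];
  const active: Record<string, string> = {}; // header titles currently in scope
  let lines: string[] = [];
  let metadata: Record<string, string> = {};

  const flush = () => {
    const text = lines.join('\n').trim();
    if (text) chunks.push({ text, metadata });
    lines = [];
  };

  for (const line of markdown.split('\n')) {
    const hit = headers.find(([marker]) => line.indexOf(marker + ' ') === 0);
    if (!hit) {
      lines.push(line);
      continue;
    }
    flush(); // close the section that ended at this header
    const [marker, key] = hit;
    // A new header replaces any header of the same or deeper level.
    for (const [m, k] of headers) {
      if (m.length >= marker.length) delete active[k];
    }
    active[key] = line.slice(marker.length).trim();
    metadata = { ...active };
    if (!stripHeaders) lines.push(line);
  }
  flush();
  return chunks;
}
```

Applied to the sample document from the Example section with headers [['#', 'title'], ['##', 'section']], this sketch yields three chunks; the last two carry both title: 'Introduction' and their own section value.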

Token

encodingName?:

string
Name of the token encoding to use

modelName?:

string
Name of the model for tokenization

JSON

maxSize:

number
Maximum size of each chunk

minSize?:

number
Minimum size of each chunk

ensureAscii?:

boolean
Whether to ensure ASCII encoding

convertLists?:

boolean
Whether to convert lists in the JSON
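The effect of maxSize can be illustrated with a deliberately simplified sketch that packs top-level keys into chunks whose serialized length stays within the limit. This is an assumption about the general idea, not the library's algorithm: splitJsonByKeys is a hypothetical helper, it does not recurse into nested values, and it ignores minSize, ensureAscii, and convertLists.

```typescript
// Hypothetical helper: greedy packing of top-level keys by serialized size.
function splitJsonByKeys(
  obj: Record<string, unknown>,
  maxSize: number,
): Array<Record<string, unknown>> {
  const chunks: Array<Record<string, unknown>> = [];
  let current: Record<string, unknown> = {};

  for (const key of Object.keys(obj)) {
    const candidate = { ...current, [key]: obj[key] };
    const tooBig = JSON.stringify(candidate).length > maxSize;
    if (tooBig && Object.keys(current).length > 0) {
      chunks.push(current); // close the current chunk and start a new one
      current = { [key]: obj[key] };
    } else {
      // Fits, or it is a single oversized entry that cannot be split further.
      current = candidate;
    }
  }
  if (Object.keys(current).length > 0) chunks.push(current);
  return chunks;
}

// splitJsonByKeys({ a: 'xxxxx', b: 'yyyyy', c: 'z' }, 22)
// returns [{ a: 'xxxxx' }, { b: 'yyyyy', c: 'z' }]
```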

Return Value

Returns an MDocument instance containing the chunked documents. Each chunk has the following shape:

interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}
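Once chunks are available as DocumentNode objects, their metadata can drive downstream processing. The sketch below groups nodes by one metadata key; groupByMetadata is a hypothetical helper, and how the node array is obtained from the returned instance is not covered on this page, so the function simply takes an array.

```typescript
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}

// Hypothetical helper: bucket nodes by the value stored under a metadata key.
function groupByMetadata(
  nodes: DocumentNode[],
  key: string,
): Record<string, DocumentNode[]> {
  const groups: Record<string, DocumentNode[]> = {};
  for (const node of nodes) {
    // Nodes without the key are collected under '(none)'.
    const value = String(node.metadata[key] ?? '(none)');
    if (!groups[value]) groups[value] = [];
    groups[value].push(node);
  }
  return groups;
}
```

For example, grouping by the section key produced by the markdown headers option would collect all chunks of one section under that section's title.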