# Reference: .chunk()

The `.chunk()` function splits documents into smaller segments using various strategies and options.

## Example

```typescript
import { MDocument } from "@mastra/rag";

const doc = MDocument.fromMarkdown(`
# Introduction
This is a sample document that we want to split into chunks.

## Section 1
Here is the first section with some content.

## Section 2
Here is another section with different content.
`);

// Basic chunking with defaults
const chunks = await doc.chunk();

// Markdown-specific chunking with header extraction
const chunksWithMetadata = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ],
  extract: {
    summary: true, // Extract summaries with default settings
    keywords: true, // Extract keywords with default settings
  },
});
```

## Parameters

The following parameters are available for all chunking strategies. **Important:** Each strategy will only utilize a subset of these parameters relevant to its specific use case.

**strategy?:** (`'recursive' | 'character' | 'token' | 'markdown' | 'semantic-markdown' | 'html' | 'json' | 'latex' | 'sentence'`): The chunking strategy to use. If not specified, defaults based on document type. Depending on the chunking strategy, there are additional optionals. Defaults: .md files → 'markdown', .html/.htm → 'html', .json → 'json', .tex → 'latex', others → 'recursive'

**maxSize?:** (`number`): Maximum size of each chunk. \*\*Note:\*\* Some strategy configurations (markdown with headers, HTML with headers) ignore this parameter. (Default: `4000`)

**overlap?:** (`number`): Number of characters/tokens that overlap between chunks. (Default: `50`)

**lengthFunction?:** (`(text: string) => number`): Function to calculate text length. Defaults to character count.

**separatorPosition?:** (`'start' | 'end'`): Where to position the separator in chunks. 'start' attaches to beginning of next chunk, 'end' attaches to end of current chunk. If not specified, separators are discarded.

**addStartIndex?:** (`boolean`): Whether to add start index metadata to chunks. (Default: `false`)

**stripWhitespace?:** (`boolean`): Whether to strip whitespace from chunks. (Default: `true`)

**extract?:** (`ExtractParams`): Metadata extraction configuration.

See [ExtractParams reference](https://mastra.ai/reference/rag/extract-params) for details on the `extract` parameter.

## Strategy-Specific Options

Strategy-specific options are passed as top-level parameters alongside the strategy parameter. For example:

```typescript
// Character strategy example
const chunks = await doc.chunk({
  strategy: "character",
  separator: ".", // Character-specific option
  isSeparatorRegex: false, // Character-specific option
  maxSize: 300, // general option
});

// Recursive strategy example
const chunks = await doc.chunk({
  strategy: "recursive",
  separators: ["\n\n", "\n", " "], // Recursive-specific option
  language: "markdown", // Recursive-specific option
  maxSize: 500, // general option
});

// Sentence strategy example
const chunks = await doc.chunk({
  strategy: "sentence",
  maxSize: 450, // Required for sentence strategy
  minSize: 50, // Sentence-specific option
  sentenceEnders: ["."], // Sentence-specific option
  fallbackToCharacters: false, // Sentence-specific option
});

// HTML strategy example
const chunks = await doc.chunk({
  strategy: "html",
  headers: [
    ["h1", "title"],
    ["h2", "subtitle"],
  ], // HTML-specific option
});

// Markdown strategy example
const chunks = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ], // Markdown-specific option
  stripHeaders: true, // Markdown-specific option
});

// Semantic Markdown strategy example
const chunks = await doc.chunk({
  strategy: "semantic-markdown",
  joinThreshold: 500, // Semantic Markdown-specific option
  modelName: "gpt-3.5-turbo", // Semantic Markdown-specific option
});

// Token strategy example
const chunks = await doc.chunk({
  strategy: "token",
  encodingName: "gpt2", // Token-specific option
  modelName: "gpt-3.5-turbo", // Token-specific option
  maxSize: 1000, // general option
});
```

The options documented below are passed directly at the top level of the configuration object, not nested within a separate options object.

### Character

**separators?:** (`string[]`): Array of separators to try in order of preference. The strategy will attempt to split on the first separator, then fall back to subsequent ones.

**isSeparatorRegex?:** (`boolean`): Whether the separator is a regex pattern (Default: `false`)

### Recursive

**separators?:** (`string[]`): Array of separators to try in order of preference. The strategy will attempt to split on the first separator, then fall back to subsequent ones.

**isSeparatorRegex?:** (`boolean`): Whether the separators are regex patterns (Default: `false`)

**language?:** (`Language`): Programming or markup language for language-specific splitting behavior. See Language enum for supported values.

### Sentence

**maxSize:** (`number`): Maximum size of each chunk (required for sentence strategy)

**minSize?:** (`number`): Minimum size of each chunk. Chunks smaller than this will be merged with adjacent chunks when possible. (Default: `50`)

**targetSize?:** (`number`): Preferred target size for chunks. Defaults to 80% of maxSize. The strategy will try to create chunks close to this size.

**sentenceEnders?:** (`string[]`): Array of characters that mark sentence endings for splitting boundaries. (Default: `['.', '!', '?']`)

**fallbackToWords?:** (`boolean`): Whether to fall back to word-level splitting for sentences that exceed maxSize. (Default: `true`)

**fallbackToCharacters?:** (`boolean`): Whether to fall back to character-level splitting for words that exceed maxSize. Only applies if fallbackToWords is enabled. (Default: `true`)

### HTML

**headers:** (`Array<[string, string]>`): Array of \[selector, metadata key] pairs for header-based splitting

**sections:** (`Array<[string, string]>`): Array of \[selector, metadata key] pairs for section-based splitting

**returnEachLine?:** (`boolean`): Whether to return each line as a separate chunk

**Important:** When using the HTML strategy, all general options are ignored. Use `headers` for header-based splitting or `sections` for section-based splitting. If used together, `sections` will be ignored.

### Markdown

**headers?:** (`Array<[string, string]>`): Array of \[header level, metadata key] pairs

**stripHeaders?:** (`boolean`): Whether to remove headers from the output

**returnEachLine?:** (`boolean`): Whether to return each line as a separate chunk

**Important:** When using the `headers` option, the markdown strategy ignores all general options and content is split based on the markdown header structure. To use size-based chunking with markdown, omit the `headers` parameter.

### Semantic Markdown

**joinThreshold?:** (`number`): Maximum token count for merging related sections. Sections exceeding this limit individually are left intact, but smaller sections are merged with siblings or parents if the combined size stays under this threshold. (Default: `500`)

**modelName?:** (`string`): Name of the model for tokenization. If provided, the model's underlying tokenization \`encodingName\` will be used.

**encodingName?:** (`string`): Name of the token encoding to use. Derived from \`modelName\` if available. (Default: `cl100k_base`)

**allowedSpecial?:** (`Set<string> | 'all'`): Set of special tokens allowed during tokenization, or 'all' to allow all special tokens

**disallowedSpecial?:** (`Set<string> | 'all'`): Set of special tokens to disallow during tokenization, or 'all' to disallow all special tokens (Default: `all`)

### Token

**encodingName?:** (`string`): Name of the token encoding to use

**modelName?:** (`string`): Name of the model for tokenization

**allowedSpecial?:** (`Set<string> | 'all'`): Set of special tokens allowed during tokenization, or 'all' to allow all special tokens

**disallowedSpecial?:** (`Set<string> | 'all'`): Set of special tokens to disallow during tokenization, or 'all' to disallow all special tokens

### JSON

**maxSize:** (`number`): Maximum size of each chunk

**minSize?:** (`number`): Minimum size of each chunk

**ensureAscii?:** (`boolean`): Whether to ensure ASCII encoding

**convertLists?:** (`boolean`): Whether to convert lists in the JSON

### Latex

The Latex strategy uses only the general chunking options listed above. It provides LaTeX-aware splitting optimized for mathematical and academic documents.

## Return Value

Returns a `MDocument` instance containing the chunked documents. Each chunk includes:

```typescript
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}
```