Reference: .chunk()

The .chunk() function splits a document into smaller segments using a variety of strategies and options.

Example


import { MDocument } from "@mastra/rag";
 
const doc = MDocument.fromMarkdown(`
# Introduction
This is a sample document that we want to split into chunks.
 
## Section 1
Here is the first section with some content.
 
## Section 2 
Here is another section with different content.
`);
 
// Basic chunking with defaults
const chunks = await doc.chunk();
 
// Markdown-specific chunking with header extraction
const chunksWithMetadata = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ],
  extract: {
    summary: true, // Extract summaries with default settings
    keywords: true, // Extract keywords with default settings
  },
});

Parameters

strategy?:

The chunking strategy to use. If not specified, a default is chosen based on the document type. Some strategies accept additional options. Defaults: .md files → 'markdown', .html/.htm → 'html', .json → 'json', .tex → 'latex', anything else → 'recursive'
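The default mapping above can be read as a simple lookup on the file extension. The sketch below only restates that table; `defaultStrategyFor` and `DefaultStrategy` are hypothetical names introduced for illustration and are not part of @mastra/rag.

```typescript
// Hypothetical helper (not a library API): mirrors the default mapping listed above.
type DefaultStrategy = "markdown" | "html" | "json" | "latex" | "recursive";

function defaultStrategyFor(filename: string): DefaultStrategy {
  // Take everything from the last dot onward, case-insensitively.
  const dot = filename.lastIndexOf(".");
  const ext = dot >= 0 ? filename.slice(dot).toLowerCase() : "";
  switch (ext) {
    case ".md":
      return "markdown";
    case ".html":
    case ".htm":
      return "html";
    case ".json":
      return "json";
    case ".tex":
      return "latex";
    default:
      // Any other document type falls back to recursive chunking.
      return "recursive";
  }
}
```

Passing strategy explicitly bypasses this default selection.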

size?:

number

= 512

Maximum size of each chunk

overlap?:

number

= 50

Number of characters or tokens that overlap between chunks.
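To see how size and overlap interact, here is a minimal character-based sliding-window sketch. It is an illustration of the general idea only, not the library's implementation (the real strategies also respect separators and document structure); `slidingWindowChunks` is a hypothetical name.

```typescript
// Illustrative only: fixed-size character windows that share `overlap` characters.
function slidingWindowChunks(text: string, size: number, overlap: number): string[] {
  if (size <= 0) throw new Error("size must be positive");
  if (overlap < 0 || overlap >= size) throw new Error("overlap must be in [0, size)");

  const step = size - overlap; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this window already reached the end
  }
  return chunks;
}

// With size 4 and overlap 1, each window starts 3 characters after the previous one.
const windows = slidingWindowChunks("abcdefghij", 4, 1);
// windows is ["abcd", "defg", "ghij"]
```

A larger overlap preserves more context across chunk boundaries at the cost of producing more chunks.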

separator?:

string

= \n\n

Character(s) to split on. For text content, the default is a double newline.

isSeparatorRegex?:

boolean

= false

Whether the separator is a regular expression pattern

keepSeparator?:

'start' | 'end'

Whether to keep the separator at the start or end of each chunk
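The separator, isSeparatorRegex, and keepSeparator options can be pictured with the following sketch. It is a simplified model under stated assumptions, not the library's code: `splitKeepingSeparator` is a hypothetical helper, and a regex separator is assumed to contain no capturing groups of its own.

```typescript
// Hypothetical helper illustrating the three separator options described above.
function splitKeepingSeparator(
  text: string,
  separator: string,
  isSeparatorRegex = false,
  keepSeparator?: "start" | "end",
): string[] {
  // Treat the separator literally unless it is flagged as a regex pattern.
  const source = isSeparatorRegex
    ? separator
    : separator.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

  // A capturing group makes split() return [piece, sep, piece, sep, ...].
  const parts = text.split(new RegExp(`(${source})`));

  const pieces: string[] = [];
  for (let i = 0; i < parts.length; i += 2) {
    const body = parts[i] ?? "";
    const before = i > 0 ? parts[i - 1] ?? "" : "";
    const after = i + 1 < parts.length ? parts[i + 1] ?? "" : "";

    let piece = body; // default: separator is dropped
    if (keepSeparator === "start") piece = before + body;
    else if (keepSeparator === "end") piece = body + after;

    if (piece.length > 0) pieces.push(piece);
  }
  return pieces;
}

const sample = "a\n\nb\n\nc";
const dropped = splitKeepingSeparator(sample, "\n\n");
const keptAtEnd = splitKeepingSeparator(sample, "\n\n", false, "end");
const keptAtStart = splitKeepingSeparator(sample, "\n\n", false, "start");
```

In this model, keeping the separator at the end attaches it to the preceding piece, while keeping it at the start attaches it to the following piece.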

extract?:

ExtractParams

Metadata extraction configuration. See the [ExtractParams reference](./extract-params) for details.

Strategy-specific options

Strategy-specific options are passed as top-level parameters alongside the strategy parameter. For example:


// HTML strategy example
const htmlChunks = await doc.chunk({
  strategy: "html",
  headers: [
    ["h1", "title"],
    ["h2", "subtitle"],
  ], // HTML-specific option
  sections: [["div.content", "main"]], // HTML-specific option
  size: 500, // general option
});

// Markdown strategy example
const markdownChunks = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ], // Markdown-specific option
  stripHeaders: true, // Markdown-specific option
  overlap: 50, // general option
});

// Token strategy example
const tokenChunks = await doc.chunk({
  strategy: "token",
  encodingName: "gpt2", // Token-specific option
  modelName: "gpt-3.5-turbo", // Token-specific option
  size: 1000, // general option
});

The options documented below are passed directly at the top level of the configuration object, not nested inside a separate options object.

HTML

headers:

Array<[string, string]>

Array of [selector, metadata key] pairs for splitting by header

sections:

Array<[string, string]>

Array of [selector, metadata key] pairs for splitting by section

returnEachLine?:

boolean

Whether to return each line as an individual chunk

Markdown

headers:

Array<[string, string]>

Array of [header level, metadata key] pairs

stripHeaders?:

boolean

Whether to remove headers from the output

returnEachLine?:

boolean

Whether to return each line as an individual chunk

Token

encodingName?:

string

Name of the token encoding to use

modelName?:

string

Name of the model to use for tokenization

JSON

maxSize:

number

Maximum size of each chunk

minSize?:

number

Minimum size of each chunk

ensureAscii?:

boolean

Whether to ensure ASCII encoding

convertLists?:

boolean

Whether to convert lists in the JSON

Returns

Returns an MDocument instance containing the chunked document. Each chunk includes:


interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}
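As a rough illustration of working with this shape, the sketch below groups chunk text by a metadata key. The sample data and the `groupByMetadataKey` helper are hypothetical, and it assumes that header extraction stores the header text under the configured metadata key (for example "section"); actual metadata contents depend on the strategy and options used.

```typescript
// Same shape as documented above, repeated so the sketch is self-contained.
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}

// Hypothetical sample data; real chunks come from doc.chunk().
const sampleChunks: DocumentNode[] = [
  { text: "Here is the first section with some content.", metadata: { section: "Section 1" } },
  { text: "Here is another section with different content.", metadata: { section: "Section 2" } },
  { text: "More content for the first section.", metadata: { section: "Section 1" } },
  { text: "Text with no section header.", metadata: {} },
];

// Hypothetical helper: collect chunk text under the value of one metadata key.
function groupByMetadataKey(nodes: DocumentNode[], key: string): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const node of nodes) {
    const value = node.metadata[key];
    if (typeof value !== "string") continue; // skip chunks without this key
    const bucket = groups.get(value) ?? [];
    bucket.push(node.text);
    groups.set(value, bucket);
  }
  return groups;
}

const bySection = groupByMetadataKey(sampleChunks, "section");
```

Because embedding is optional, code that reads it should check for undefined before use.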