Summarization Evaluation

This example shows how to use Mastra's Summarization metric to evaluate how accurately an LLM-generated summary captures the source content while maintaining factual accuracy.

Overview

This example demonstrates how to:

  1. Configure the Summarization metric with an LLM
  2. Evaluate summary quality and factual accuracy
  3. Analyze alignment and coverage scores
  4. Handle different summarization scenarios

Setup

Environment Setup

Make sure to set your environment variables:

.env
OPENAI_API_KEY=your_api_key_here

Dependencies

Import the necessary dependencies:

src/index.ts
import { openai } from "@ai-sdk/openai";
import { SummarizationMetric } from "@mastra/evals/llm";

Metric Configuration

Set up the Summarization metric with an OpenAI model:

src/index.ts
const metric = new SummarizationMetric(openai("gpt-4o-mini"));

Example Usage

High-quality Summary Example

Evaluate a summary that maintains both factual accuracy and complete coverage:

src/index.ts
const input1 = `The electric car company Tesla was founded in 2003 by Martin Eberhard and Marc Tarpenning. Elon Musk joined in 2004 as the largest investor and became CEO in 2008. The company's first car, the Roadster, was launched in 2008.`;

const output1 = `Tesla, founded by Martin Eberhard and Marc Tarpenning in 2003, launched its first car, the Roadster, in 2008. Elon Musk joined as the largest investor in 2004 and became CEO in 2008.`;

console.log("Example 1 - High-quality Summary:");
console.log("Input:", input1);
console.log("Output:", output1);

const result1 = await metric.measure(input1, output1);

console.log("Metric Result:", {
  score: result1.score,
  info: {
    reason: result1.info.reason,
    alignmentScore: result1.info.alignmentScore,
    coverageScore: result1.info.coverageScore,
  },
});

// Example Output:
// Metric Result: {
//   score: 1,
//   info: {
//     reason: "The score is 1 because the summary maintains perfect factual accuracy and includes all key information from the source text.",
//     alignmentScore: 1,
//     coverageScore: 1
//   }
// }

Partial Coverage Example

Evaluate a summary that is factually accurate but omits important information:

src/index.ts
const input2 = `The Python programming language was created by Guido van Rossum and was first released in 1991. It emphasizes code readability with its notable use of significant whitespace. Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured, object-oriented, and functional programming.`;

const output2 = `Python, created by Guido van Rossum, is a programming language known for its readable code and use of whitespace. It was released in 1991.`;

console.log("Example 2 - Partial Coverage:");
console.log("Input:", input2);
console.log("Output:", output2);

const result2 = await metric.measure(input2, output2);

console.log("Metric Result:", {
  score: result2.score,
  info: {
    reason: result2.info.reason,
    alignmentScore: result2.info.alignmentScore,
    coverageScore: result2.info.coverageScore,
  },
});

// Example Output:
// Metric Result: {
//   score: 0.4,
//   info: {
//     reason: "The score is 0.4 because while the summary is factually accurate (alignment score: 1), it only covers a portion of the key information from the source text (coverage score: 0.4), omitting several important technical details.",
//     alignmentScore: 1,
//     coverageScore: 0.4
//   }
// }

Inaccurate Summary Example

Evaluate a summary that contains factual errors and misrepresentations:

src/index.ts
const input3 = `The World Wide Web was invented by Tim Berners-Lee in 1989 while working at CERN. He published the first website in 1991. Berners-Lee made the Web freely available, with no patent and no royalties due.`;

const output3 = `The Internet was created by Tim Berners-Lee at MIT in the early 1990s, and he went on to commercialize the technology through patents.`;

console.log("Example 3 - Inaccurate Summary:");
console.log("Input:", input3);
console.log("Output:", output3);

const result3 = await metric.measure(input3, output3);

console.log("Metric Result:", {
  score: result3.score,
  info: {
    reason: result3.info.reason,
    alignmentScore: result3.info.alignmentScore,
    coverageScore: result3.info.coverageScore,
  },
});

// Example Output:
// Metric Result: {
//   score: 0,
//   info: {
//     reason: "The score is 0 because the summary contains multiple factual errors and misrepresentations of key details from the source text, despite covering some of the basic information.",
//     alignmentScore: 0,
//     coverageScore: 0.6
//   }
// }

Understanding the Results

The metric evaluates summaries through two components:

  1. Alignment Score (0-1):

    • 1.0: Perfect factual accuracy
    • 0.7-0.9: Minor factual discrepancies
    • 0.4-0.6: Some factual errors
    • 0.1-0.3: Significant inaccuracies
    • 0.0: Complete factual misrepresentation
  2. Coverage Score (0-1):

    • 1.0: Complete information coverage
    • 0.7-0.9: Most key information included
    • 0.4-0.6: Partial coverage of key points
    • 0.1-0.3: Most important details missing
    • 0.0: No relevant information included

The final score is the lower of these two scores, which ensures that a high-quality summary must be both factually accurate and informationally complete.
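Because the overall score takes the minimum, a summary cannot compensate for poor coverage with perfect accuracy (or vice versa). This can be sketched as follows (a minimal illustration; `finalScore` is a hypothetical helper, not part of the Mastra API):

```typescript
// Hypothetical helper: the overall score is the lower of the two components,
// so a summary must do well on BOTH accuracy and coverage to score well overall.
function finalScore(alignmentScore: number, coverageScore: number): number {
  return Math.min(alignmentScore, coverageScore);
}

// Mirrors Example 2 above: perfectly accurate (1) but incomplete (0.4) → 0.4
console.log(finalScore(1, 0.4));
```

This matches the example outputs above: Example 2 scores 0.4 despite an alignment of 1, and Example 3 scores 0 despite a coverage of 0.6.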






View the example on GitHub