Summarization Evaluation
Use SummarizationMetric to evaluate how well a response captures key information from the source while maintaining factual accuracy. The metric accepts a query and a response, and returns a score and an info object containing a reason, an alignment score, and a coverage score.
Installation
npm install @mastra/evals
Accurate summary example
In this example, the summary accurately preserves all important facts from the source while maintaining faithful phrasing. The score reflects both complete coverage and perfect factual alignment.
import { openai } from "@ai-sdk/openai";
import { SummarizationMetric } from "@mastra/evals/llm";
const metric = new SummarizationMetric(openai("gpt-4o-mini"));
const query =
"The electric car company Tesla was founded in 2003 by Martin Eberhard and Marc Tarpenning. Elon Musk joined in 2004 as the largest investor and became CEO in 2008. The company's first car, the Roadster, was launched in 2008.";
const response =
"Tesla, founded by Martin Eberhard and Marc Tarpenning in 2003, launched its first car, the Roadster, in 2008. Elon Musk joined as the largest investor in 2004 and became CEO in 2008.";
const result = await metric.measure(query, response);
console.log(result);
Accurate summary output
A high score indicates that the summary captures all key details from the input without introducing errors. The info field confirms full alignment and complete coverage.
{
score: 1,
info: {
reason: 'The score is 1 because the summary is completely factual and covers all key information from the original text.',
alignmentScore: 1,
coverageScore: 1
}
}
Partial summary example
In this example, the summary is factually accurate but leaves out several key points from the source. The score reflects incomplete coverage despite strong alignment.
import { openai } from "@ai-sdk/openai";
import { SummarizationMetric } from "@mastra/evals/llm";
const metric = new SummarizationMetric(openai("gpt-4o-mini"));
const query =
"The Python programming language was created by Guido van Rossum and was first released in 1991. It emphasizes code readability with its notable use of significant whitespace. Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured, object-oriented, and functional programming.";
const response = "Python, created by Guido van Rossum, is a programming language known for its readable code and use of whitespace. It was released in 1991.";
const result = await metric.measure(query, response);
console.log(result);
Partial summary output
The summary receives a moderate score due to missing key information from the input. The info field confirms factual alignment but highlights gaps in content coverage.
{
score: 0.7,
info: {
reason: "The score is 0.7 because the summary accurately captures key facts about Python's creation, release date, and emphasis on readability, achieving a perfect alignment score. However, it fails to mention that Python is dynamically typed, garbage-collected, and supports multiple programming paradigms, which affects the coverage score.",
alignmentScore: 1,
coverageScore: 0.7
}
}
Inaccurate summary example
In this example, the summary includes factual errors and misrepresents key details from the source. The score reflects poor alignment, even if some information is partially covered.
import { openai } from "@ai-sdk/openai";
import { SummarizationMetric } from "@mastra/evals/llm";
const metric = new SummarizationMetric(openai("gpt-4o-mini"));
const query =
"The World Wide Web was invented by Tim Berners-Lee in 1989 while working at CERN. He published the first website in 1991. Berners-Lee made the Web freely available, with no patent and no royalties due.";
const response = "The Internet was created by Tim Berners-Lee at MIT in the early 1990s, and he went on to commercialize the technology through patents.";
const result = await metric.measure(query, response);
console.log(result);
Inaccurate summary output
The summary receives a low score due to factual inaccuracies and misalignment with the input. The info field explains which details were incorrect and how the summary deviated from the source.
{
score: 0,
info: {
reason: 'The score is 0 because the summary contains factual inaccuracies and fails to cover essential details from the original text. The claim that the Internet was created at MIT in the early 1990s contradicts the original text, which states that the World Wide Web was invented at CERN in 1989. Additionally, the summary incorrectly states that Berners-Lee commercialized the technology through patents, while the original text clearly mentions that he made the Web freely available with no patents or royalties.',
alignmentScore: 0,
coverageScore: 0.17
}
}
Metric configuration
You can create a SummarizationMetric instance by providing a model. No additional configuration is required.
const metric = new SummarizationMetric(openai("gpt-4o-mini"));
See SummarizationMetric for a full list of configuration options.
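If you need a score range other than 0–1, other LLM metrics in @mastra/evals accept an options object with a scale value. The sketch below assumes SummarizationMetric supports the same option; confirm this against the SummarizationMetric reference before relying on it.
import { openai } from "@ai-sdk/openai";
import { SummarizationMetric } from "@mastra/evals/llm";
// Assumption: SummarizationMetric accepts an options object with `scale`,
// as other @mastra/evals LLM metrics do. Verify in the reference docs.
const scaledMetric = new SummarizationMetric(openai("gpt-4o-mini"), {
  scale: 10, // scores would then be reported on a 0–10 range
});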
Understanding the results
SummarizationMetric returns a result in the following shape:
{
score: number,
info: {
reason: string,
alignmentScore: number,
coverageScore: number
}
}
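As a minimal sketch of consuming this result, here is one way to gate a summary on the overall score. The source text, candidate summary, and threshold are arbitrary placeholders for this example, not values supplied by the metric.
import { openai } from "@ai-sdk/openai";
import { SummarizationMetric } from "@mastra/evals/llm";
const metric = new SummarizationMetric(openai("gpt-4o-mini"));
// Hypothetical source text and candidate summary
const source = "Mastra is a TypeScript framework for building AI agents.";
const summary = "Mastra is a TypeScript framework for building AI agents.";
const result = await metric.measure(source, summary);
// Arbitrary threshold chosen for this sketch
const PASS_THRESHOLD = 0.7;
if (result.score < PASS_THRESHOLD) {
  console.warn(`Summary scored ${result.score}: ${result.info.reason}`);
}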
Summarization score
A summarization score between 0 and 1:
- 1.0: Perfect summary – fully accurate and complete.
- 0.7–0.9: Strong summary – minor omissions or slight inaccuracies.
- 0.4–0.6: Mixed summary – partially accurate or incomplete.
- 0.1–0.3: Weak summary – significant gaps or errors.
- 0.0: Failed summary – mostly inaccurate or missing key content.
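In the examples above, the overall score matches the lower of the two sub-scores, so a summary is only as strong as its weaker dimension. Treat this as an observed pattern and check the SummarizationMetric reference for the exact formula.
// Observed pattern from the examples above — verify against the reference:
// the overall score tracks the weaker of alignment and coverage.
const overallScore = (alignmentScore: number, coverageScore: number) =>
  Math.min(alignmentScore, coverageScore);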
Summarization info
An explanation for the score, with details including:
- Alignment with factual content from the input.
- Coverage of key points from the source.
- Individual scores for alignment and coverage.
- Justification describing what was preserved, omitted, or misstated.