Answer Relevancy Scorer

Use createAnswerRelevancyScorer to score how relevant the response is to the original query.

Installation


npm install @mastra/evals

For complete API documentation and configuration options, see createAnswerRelevancyScorer.

High relevancy example

In this example, the response accurately addresses the input query with specific and relevant information.

src/example-high-answer-relevancy.ts


import { openai } from "@ai-sdk/openai";
import { createAnswerRelevancyScorer } from "@mastra/evals/scorers/llm";
 
const scorer = createAnswerRelevancyScorer({ model: openai("gpt-4o-mini") });
 
const inputMessages = [{ role: 'user', content: "What are the health benefits of regular exercise?" }];
const outputMessage = { text: "Regular exercise improves cardiovascular health, strengthens muscles, boosts metabolism, and enhances mental well-being through the release of endorphins." };
 
const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
});
 
console.log(result);

High relevancy output

The output receives a high score because it accurately answers the query without including unrelated information.


{
  score: 1,
  reason: 'The score is 1 because the output directly addresses the question by providing multiple explicit health benefits of regular exercise, including improvements in cardiovascular health, muscle strength, metabolism, and mental well-being. Each point is relevant and contributes to a comprehensive understanding of the health benefits.'
}

Partial relevancy example

In this example, the response addresses the query in part but includes additional information that isn’t directly relevant.

src/example-partial-answer-relevancy.ts


import { openai } from "@ai-sdk/openai";
import { createAnswerRelevancyScorer } from "@mastra/evals/scorers/llm";
 
const scorer = createAnswerRelevancyScorer({ model: openai("gpt-4o-mini") });
 
const inputMessages = [{ role: 'user', content: "What should a healthy breakfast include?" }];
const outputMessage = { text: "A nutritious breakfast should include whole grains and protein. However, the timing of your breakfast is just as important - studies show eating within 2 hours of waking optimizes metabolism and energy levels throughout the day." };
 
const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
});
 
console.log(result);

Partial relevancy output

The output receives a lower score because it partially answers the query. While some relevant information is included, unrelated details reduce the overall relevance.


{
  score: 0.25,
  reason: 'The score is 0.25 because the output provides a direct answer by mentioning whole grains and protein as components of a healthy breakfast, which is relevant. However, the additional information about the timing of breakfast and its effects on metabolism and energy levels is not directly related to the question, leading to a lower overall relevance score.'
}

Low relevancy example

In this example, the response does not address the query and contains information that is entirely unrelated.

src/example-low-answer-relevancy.ts


import { openai } from "@ai-sdk/openai";
import { createAnswerRelevancyScorer } from "@mastra/evals/scorers/llm";
 
const scorer = createAnswerRelevancyScorer({ model: openai("gpt-4o-mini") });
 
const inputMessages = [{ role: 'user', content: "What are the benefits of meditation?" }];
const outputMessage = { text: "The Great Wall of China is over 13,000 miles long and was built during the Ming Dynasty to protect against invasions." };
 
const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
});
 
console.log(result);

Low relevancy output

The output receives a score of 0 because it fails to answer the query or provide any relevant information.


{
  score: 0,
  reason: 'The score is 0 because the output about the Great Wall of China is completely unrelated to the benefits of meditation, providing no relevant information or context that addresses the input question.'
}

Scorer configuration

You can customize how the Answer Relevancy Scorer calculates scores by adjusting optional parameters. For example, uncertaintyWeight controls how much weight to give to uncertain responses, and scale sets the maximum possible score.


const scorer = createAnswerRelevancyScorer({ model: openai("gpt-4o-mini"), options: { uncertaintyWeight: 0.3, scale: 1 } });

See createAnswerRelevancyScorer for a full list of configuration options.

Understanding the results

.run() returns a result in the following shape:


{
  runId: string,
  score: number,
  extractPrompt: string,
  extractStepResult: { statements: string[] },
  analyzePrompt: string,
  analyzeStepResult: { results: Array<{ result: 'yes' | 'unsure' | 'no', reason: string }> },
  reasonPrompt: string,
  reason: string
}

score

A relevancy score between 0 and 1:

1.0: The response fully answers the query with relevant and focused information.
0.7–0.9: The response mostly answers the query but may include minor unrelated content.
0.4–0.6: The response partially answers the query, mixing relevant and unrelated information.
0.1–0.3: The response includes minimal relevant content and largely misses the intent of the query.
0.0: The response is entirely unrelated and does not answer the query.

runId

The unique identifier for this scorer run.

extractPrompt

The prompt sent to the LLM for the extract step.

extractStepResult

The extracted statements from the output, e.g. { statements: string[] }.

analyzePrompt

The prompt sent to the LLM for the analyze step.

analyzeStepResult

The analysis results, e.g. { results: Array<{ result: 'yes' | 'unsure' | 'no', reason: string }> }.

reasonPrompt

The prompt sent to the LLM for the reason step.

reason

The explanation for the score, including alignment, focus, and suggestions for improvement.

View Example on GitHub