AnswerRelevancyMetric
New Scorer API
We just released a new evals API called Scorers, with a more ergonomic API, more metadata stored for error analysis, and more flexibility in the data structures you can evaluate. Migration is fairly simple, and we will continue to support the existing Evals API.
The AnswerRelevancyMetric class evaluates how well an LLM's output answers or addresses the input query. It uses a judge-based system to determine relevancy and provides detailed scoring and reasoning.
Basic Usage
import { openai } from "@ai-sdk/openai";
import { AnswerRelevancyMetric } from "@mastra/evals/llm";
// Configure the model for evaluation
const model = openai("gpt-4o-mini");
const metric = new AnswerRelevancyMetric(model, {
uncertaintyWeight: 0.3,
scale: 1,
});
const result = await metric.measure(
"What is the capital of France?",
"Paris is the capital of France.",
);
console.log(result.score); // Score from 0-1
console.log(result.info.reason); // Explanation of the score
Constructor Parameters
model: LanguageModel
Configuration for the model used to evaluate relevancy.
options?: AnswerRelevancyMetricOptions = { uncertaintyWeight: 0.3, scale: 1 }
Configuration options for the metric.
AnswerRelevancyMetricOptions
uncertaintyWeight?: number = 0.3
Weight given to "unsure" verdicts in scoring (0-1).
scale?: number = 1
Maximum score value.
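Because both options have defaults, the options object can be omitted entirely; a minimal construction (assuming the defaults listed above) looks like this:
import { openai } from "@ai-sdk/openai";
import { AnswerRelevancyMetric } from "@mastra/evals/llm";
// Omitting options applies the defaults: { uncertaintyWeight: 0.3, scale: 1 }
const metric = new AnswerRelevancyMetric(openai("gpt-4o-mini"));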
measure() Parameters
input: string
The original query or prompt.
output: string
The LLM's response to evaluate.
Returns
score: number
Relevancy score (0 to scale, default 0-1).
info: object
Object containing the reason for the score.
info.reason: string
Explanation of the score.
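Taken together, the result has the shape sketched below. This is based only on the fields documented above; AnswerRelevancyResult is a hypothetical name for illustration, not a documented export.
// Sketch of the measure() result, from the fields above.
// "AnswerRelevancyResult" is a hypothetical name, not a library export.
interface AnswerRelevancyResult {
  score: number; // 0 to scale (default 0-1)
  info: {
    reason: string; // explanation of the score
  };
}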
Scoring Details
The metric evaluates relevancy through query-answer alignment, considering completeness, accuracy, and detail level.
Scoring Process
- Statement Analysis:
  - Breaks output into meaningful statements while preserving context
  - Evaluates each statement against query requirements
- Relevance Verdicts, one per statement:
  - "yes": Full weight for direct matches
  - "unsure": Partial weight (default: 0.3) for approximate matches
  - "no": Zero weight for irrelevant content
Final score: ((direct + uncertainty * partial) / total_statements) * scale
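To make the formula concrete, here is a worked example with hypothetical verdict counts (3 "yes", 1 "unsure", 1 "no" across 5 statements) at the default settings:
// Hypothetical verdict counts for a 5-statement output
const direct = 3;          // "yes" verdicts
const partial = 1;         // "unsure" verdicts
const totalStatements = 5; // includes 1 "no" verdict, which adds zero weight
const uncertaintyWeight = 0.3;
const scale = 1;
// ((direct + uncertainty * partial) / total_statements) * scale
const score = ((direct + uncertaintyWeight * partial) / totalStatements) * scale;
console.log(score); // 0.66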
Score Interpretation
(0 to scale, default 0-1)
- 1.0: Perfect relevance - complete and accurate
- 0.7-0.9: High relevance - minor gaps or imprecisions
- 0.4-0.6: Moderate relevance - significant gaps
- 0.1-0.3: Low relevance - major issues
- 0.0: No relevance - incorrect or off-topic
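In application code, these bands can drive a simple quality gate. The helper below is hypothetical (not part of @mastra/evals), and the band boundaries are approximated with thresholds:
// Hypothetical helper mapping a score to the interpretation bands above
function interpretRelevancy(score: number, scale = 1): string {
  const normalized = score / scale; // bands are defined on the 0-1 scale
  if (normalized >= 1.0) return "perfect relevance";
  if (normalized >= 0.7) return "high relevance";
  if (normalized >= 0.4) return "moderate relevance";
  if (normalized >= 0.1) return "low relevance";
  return "no relevance";
}
console.log(interpretRelevancy(4.5, 5)); // "high relevance"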
Example with Custom Configuration
import { openai } from "@ai-sdk/openai";
import { AnswerRelevancyMetric } from "@mastra/evals/llm";
// Configure the model for evaluation
const model = openai("gpt-4o-mini");
const metric = new AnswerRelevancyMetric(model, {
uncertaintyWeight: 0.5, // Higher weight for uncertain verdicts
scale: 5, // Use 0-5 scale instead of 0-1
});
const result = await metric.measure(
"What are the benefits of exercise?",
"Regular exercise improves cardiovascular health, builds strength, and boosts mental wellbeing.",
);
// Example output:
// {
// score: 4.5,
// info: {
// reason: "The score is 4.5 out of 5 because the response directly addresses the query
// with specific, accurate benefits of exercise. It covers multiple aspects
// (cardiovascular, muscular, and mental health) in a clear and concise manner.
// The answer is highly relevant and provides appropriate detail without
// including unnecessary information."
// }
// }
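Note that because scale is set to 5 here, the score of 4.5 corresponds to 0.9 on the default 0-1 scale, placing this response in the high-relevance band described above.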