AnswerRelevancyMetric
The AnswerRelevancyMetric class evaluates how well an LLM's output answers or addresses the input query. It uses a judge-based system to determine relevancy and provides detailed scoring and reasoning.
Basic Usage
```typescript
import { openai } from "@ai-sdk/openai";
import { AnswerRelevancyMetric } from "@mastra/evals/llm";

// Configure the model for evaluation
const model = openai("gpt-4o-mini");

const metric = new AnswerRelevancyMetric(model, {
  uncertaintyWeight: 0.3,
  scale: 1,
});

const result = await metric.measure(
  "What is the capital of France?",
  "Paris is the capital of France.",
);

console.log(result.score); // Score from 0-1
console.log(result.info.reason); // Explanation of the score
```
Constructor Parameters
- `model` (`LanguageModel`): Configuration for the model used to evaluate relevancy.
- `options?` (`AnswerRelevancyMetricOptions`, default: `{ uncertaintyWeight: 0.3, scale: 1 }`): Configuration options for the metric.
AnswerRelevancyMetricOptions
- `uncertaintyWeight?` (`number`, default: `0.3`): Weight given to "unsure" verdicts in scoring (0-1).
- `scale?` (`number`, default: `1`): Maximum score value.
measure() Parameters
- `input` (`string`): The original query or prompt.
- `output` (`string`): The LLM's response to evaluate.
Returns
- `score` (`number`): Relevancy score from 0 to `scale` (default 0-1).
- `info` (`object`): Object containing the reason for the score.
  - `reason` (`string`): Explanation of the score.
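The return shape documented above can be sketched as a TypeScript interface. This is an illustration based on this reference; the actual type exported by `@mastra/evals` may use a different name.

```typescript
// Sketch of the measure() result shape described above;
// the exported type name in @mastra/evals may differ.
interface AnswerRelevancyResult {
  score: number; // 0 to scale (default 0-1)
  info: {
    reason: string; // judge's explanation of the score
  };
}

const example: AnswerRelevancyResult = {
  score: 1,
  info: { reason: "The response directly and completely answers the query." },
};

console.log(example.score, example.info.reason);
```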
Scoring Details
The metric evaluates relevancy through query-answer alignment, considering completeness, accuracy, and detail level.
Scoring Process
1. Statement analysis:
   - Breaks the output into meaningful statements while preserving context
   - Evaluates each statement against the query's requirements
2. Relevance evaluation for each statement:
   - "yes": Full weight for direct matches
   - "unsure": Partial weight (default: 0.3) for approximate matches
   - "no": Zero weight for irrelevant content

Final score: `((direct + uncertaintyWeight * partial) / total_statements) * scale`
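The scoring formula can be sketched as a standalone function. This is a hypothetical illustration of the arithmetic only, not the library's internal implementation.

```typescript
// Hypothetical sketch of the scoring arithmetic described above;
// not the actual @mastra/evals implementation.
type Verdict = "yes" | "unsure" | "no";

function computeScore(
  verdicts: Verdict[],
  uncertaintyWeight = 0.3,
  scale = 1,
): number {
  if (verdicts.length === 0) return 0;
  const direct = verdicts.filter((v) => v === "yes").length;
  const partial = verdicts.filter((v) => v === "unsure").length;
  return ((direct + uncertaintyWeight * partial) / verdicts.length) * scale;
}

// Three direct matches and one approximate match out of four statements:
// (3 + 0.3 * 1) / 4 * 1 ≈ 0.825
console.log(computeScore(["yes", "yes", "yes", "unsure"]));
```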
Score Interpretation
(0 to scale, default 0-1)
- 1.0: Perfect relevance - complete and accurate
- 0.7-0.9: High relevance - minor gaps or imprecisions
- 0.4-0.6: Moderate relevance - significant gaps
- 0.1-0.3: Low relevance - major issues
- 0.0: No relevance - incorrect or off-topic
Example with Custom Configuration
```typescript
import { openai } from "@ai-sdk/openai";
import { AnswerRelevancyMetric } from "@mastra/evals/llm";

// Configure the model for evaluation
const model = openai("gpt-4o-mini");

const metric = new AnswerRelevancyMetric(model, {
  uncertaintyWeight: 0.5, // Higher weight for uncertain verdicts
  scale: 5, // Use 0-5 scale instead of 0-1
});

const result = await metric.measure(
  "What are the benefits of exercise?",
  "Regular exercise improves cardiovascular health, builds strength, and boosts mental wellbeing.",
);

// Example output:
// {
//   score: 4.5,
//   info: {
//     reason: "The score is 4.5 out of 5 because the response directly addresses the query
//       with specific, accurate benefits of exercise. It covers multiple aspects
//       (cardiovascular, muscular, and mental health) in a clear and concise manner.
//       The answer is highly relevant and provides appropriate detail without
//       including unnecessary information."
//   }
// }
```
// }