AnswerRelevancyMetric

The AnswerRelevancyMetric class evaluates how well an LLM’s output answers or addresses the input query. It uses a judge-based system to determine relevancy and provides detailed scoring and reasoning.

Basic Usage

import { openai } from "@ai-sdk/openai";
import { AnswerRelevancyMetric } from "@mastra/evals/llm";
 
// Configure the model for evaluation
const model = openai("gpt-4o-mini");
 
const metric = new AnswerRelevancyMetric(model, {
  uncertaintyWeight: 0.3,
  scale: 1,
});
 
const result = await metric.measure(
  "What is the capital of France?",
  "Paris is the capital of France.",
);
 
console.log(result.score); // Score from 0-1
console.log(result.info.reason); // Explanation of the score

Constructor Parameters

model: LanguageModel
  Configuration for the model used to evaluate relevancy

options?: AnswerRelevancyMetricOptions = { uncertaintyWeight: 0.3, scale: 1 }
  Configuration options for the metric

AnswerRelevancyMetricOptions

uncertaintyWeight?: number = 0.3
  Weight given to 'unsure' verdicts in scoring (0-1)

scale?: number = 1
  Maximum score value
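
Both options may be omitted. Given the defaults documented above, the following two constructions should be equivalent:

// Equivalent, given the documented defaults
const metricWithDefaults = new AnswerRelevancyMetric(model);
const metricExplicit = new AnswerRelevancyMetric(model, {
  uncertaintyWeight: 0.3,
  scale: 1,
});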

measure() Parameters

input: string
  The original query or prompt

output: string
  The LLM's response to evaluate

Returns

score: number
  Relevancy score (0 to scale, default 0-1)

info: object
  Object containing the reason for the score

  reason: string
    Explanation of the score
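
For reference, the returned object can be described with a type along these lines (a sketch inferred from the fields above; the package may export its own type for this):

// Sketch of the result shape based on the documented fields;
// not necessarily the type exported by @mastra/evals.
interface AnswerRelevancyResult {
  score: number; // 0 to scale (default 0-1)
  info: {
    reason: string; // explanation of the score
  };
}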

Scoring Details

The metric evaluates relevancy through query-answer alignment, considering completeness, accuracy, and detail level.

Scoring Process

  1. Statement Analysis:

    • Breaks the output into meaningful statements while preserving context
    • Evaluates each statement against the query’s requirements
  2. Verdict Assignment: each statement receives one of three relevance verdicts:

    • “yes”: full weight for direct matches
    • “unsure”: partial weight (default: 0.3) for approximate matches
    • “no”: zero weight for irrelevant content

Final score: ((direct + uncertaintyWeight * partial) / total_statements) * scale, where direct is the count of “yes” verdicts and partial is the count of “unsure” verdicts.
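
To make the formula concrete, here is a minimal standalone sketch of the scoring arithmetic (the Verdict type and scoreVerdicts helper are illustrative, not the library's internal implementation):

type Verdict = "yes" | "unsure" | "no";

// Applies the formula above to a list of verdicts.
function scoreVerdicts(
  verdicts: Verdict[],
  uncertaintyWeight = 0.3,
  scale = 1,
): number {
  const direct = verdicts.filter((v) => v === "yes").length;
  const partial = verdicts.filter((v) => v === "unsure").length;
  return ((direct + uncertaintyWeight * partial) / verdicts.length) * scale;
}

// Three statements: two direct matches and one approximate match.
// ((2 + 0.3 * 1) / 3) * 1 ≈ 0.77
console.log(scoreVerdicts(["yes", "yes", "unsure"]));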

Score Interpretation

(0 to scale, default 0-1)

  • 1.0: Perfect relevance - complete and accurate
  • 0.7-0.9: High relevance - minor gaps or imprecisions
  • 0.4-0.6: Moderate relevance - significant gaps
  • 0.1-0.3: Low relevance - major issues
  • 0.0: No relevance - incorrect or off-topic
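
In practice, a pipeline might gate on the score. For example (the threshold here is illustrative, not a library default):

// Example threshold chosen for illustration, not a library default
const PASS_THRESHOLD = 0.7;

if (result.score < PASS_THRESHOLD) {
  console.warn(`Low relevancy (${result.score}): ${result.info.reason}`);
}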

Example with Custom Configuration

import { openai } from "@ai-sdk/openai";
import { AnswerRelevancyMetric } from "@mastra/evals/llm";
 
// Configure the model for evaluation
const model = openai("gpt-4o-mini");
 
const metric = new AnswerRelevancyMetric(
  model,
  {
    uncertaintyWeight: 0.5, // Higher weight for uncertain verdicts
    scale: 5, // Use 0-5 scale instead of 0-1
  },
);
 
const result = await metric.measure(
  "What are the benefits of exercise?",
  "Regular exercise improves cardiovascular health, builds strength, and boosts mental wellbeing.",
);
 
// Example output:
// {
//   score: 4.5,
//   info: {
//     reason: "The score is 4.5 out of 5 because the response directly addresses the query
//           with specific, accurate benefits of exercise. It covers multiple aspects
//           (cardiovascular, muscular, and mental health) in a clear and concise manner.
//           The answer is highly relevant and provides appropriate detail without
//           including unnecessary information."
//   }
// }