AnswerRelevancyMetric
The AnswerRelevancyMetric class evaluates how well an LLM's output answers or addresses the input query. It uses a judge-based system to determine relevancy and provides detailed scoring and reasoning.
Basic Usage
```typescript
import { openai } from "@ai-sdk/openai";
import { AnswerRelevancyMetric } from "@mastra/evals/llm";

// Configure the model for evaluation
const model = openai("gpt-4o-mini");

const metric = new AnswerRelevancyMetric(model, {
  uncertaintyWeight: 0.3,
  scale: 1,
});

const result = await metric.measure(
  "What is the capital of France?",
  "Paris is the capital of France.",
);

console.log(result.score); // Score from 0-1
console.log(result.info.reason); // Explanation of the score
```
Constructor Parameters
- `model` (`LanguageModel`): Configuration for the model used to evaluate relevancy.
- `options?` (`AnswerRelevancyMetricOptions`, default: `{ uncertaintyWeight: 0.3, scale: 1 }`): Configuration options for the metric.
AnswerRelevancyMetricOptions
- `uncertaintyWeight?` (`number`, default: `0.3`): Weight given to "unsure" verdicts in scoring (0-1).
- `scale?` (`number`, default: `1`): Maximum score value.
measure() Parameters
- `input` (`string`): The original query or prompt.
- `output` (`string`): The LLM's response to evaluate.
Returns
- `score` (`number`): Relevancy score from 0 to `scale` (default 0-1).
- `info` (`object`): Object containing the reason for the score.
  - `reason` (`string`): Explanation of the score.
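The return shape documented above can be sketched as a TypeScript interface. This is an illustration based on this reference; the actual type exported by `@mastra/evals` may use a different name.

```typescript
// Sketch of the measure() result shape described above;
// the exported type name in @mastra/evals may differ.
interface AnswerRelevancyResult {
  score: number; // 0 to scale (default 0-1)
  info: {
    reason: string; // judge's explanation of the score
  };
}

const example: AnswerRelevancyResult = {
  score: 1,
  info: { reason: "The response directly and completely answers the query." },
};

console.log(example.score, example.info.reason);
```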
Scoring Details
The metric evaluates relevancy through query-answer alignment, considering completeness, accuracy, and detail level.
Scoring Process
1. Statement analysis:
   - Breaks the output into meaningful statements while preserving context
   - Evaluates each statement against the query's requirements
2. Relevance evaluation for each statement:
   - "yes": Full weight for direct matches
   - "unsure": Partial weight (default: 0.3) for approximate matches
   - "no": Zero weight for irrelevant content

Final score: `((direct + uncertaintyWeight * partial) / total_statements) * scale`
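The scoring formula can be sketched as a standalone function. This is a hypothetical illustration of the arithmetic only, not the library's internal implementation.

```typescript
// Hypothetical sketch of the scoring arithmetic described above;
// not the actual @mastra/evals implementation.
type Verdict = "yes" | "unsure" | "no";

function computeScore(
  verdicts: Verdict[],
  uncertaintyWeight = 0.3,
  scale = 1,
): number {
  if (verdicts.length === 0) return 0;
  const direct = verdicts.filter((v) => v === "yes").length;
  const partial = verdicts.filter((v) => v === "unsure").length;
  return ((direct + uncertaintyWeight * partial) / verdicts.length) * scale;
}

// Three direct matches and one approximate match out of four statements:
// (3 + 0.3 * 1) / 4 * 1 ≈ 0.825
console.log(computeScore(["yes", "yes", "yes", "unsure"]));
```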
Score Interpretation
(0 to scale, default 0-1)
- 1.0: Perfect relevance - complete and accurate
- 0.7-0.9: High relevance - minor gaps or imprecisions
- 0.4-0.6: Moderate relevance - significant gaps
- 0.1-0.3: Low relevance - major issues
- 0.0: No relevance - incorrect or off-topic
Example with Custom Configuration
```typescript
import { openai } from "@ai-sdk/openai";
import { AnswerRelevancyMetric } from "@mastra/evals/llm";

// Configure the model for evaluation
const model = openai("gpt-4o-mini");

const metric = new AnswerRelevancyMetric(model, {
  uncertaintyWeight: 0.5, // Higher weight for uncertain verdicts
  scale: 5, // Use 0-5 scale instead of 0-1
});

const result = await metric.measure(
  "What are the benefits of exercise?",
  "Regular exercise improves cardiovascular health, builds strength, and boosts mental wellbeing.",
);

// Example output:
// {
//   score: 4.5,
//   info: {
//     reason: "The score is 4.5 out of 5 because the response directly addresses the query
//       with specific, accurate benefits of exercise. It covers multiple aspects
//       (cardiovascular, muscular, and mental health) in a clear and concise manner.
//       The answer is highly relevant and provides appropriate detail without
//       including unnecessary information."
//   }
// }
```
// }