
AnswerRelevancyMetric

New Scorer API

We just released a new evals API called Scorers. It has a more ergonomic API, stores more metadata for error analysis, and offers more flexibility in the data structures you can evaluate. Migration is fairly simple, and we will continue to support the existing Evals API.

The AnswerRelevancyMetric class evaluates how well an LLM’s output answers or addresses the input query. It uses a judge-based system to determine relevancy and provides detailed scoring and reasoning.

Basic Usage

```typescript
import { openai } from "@ai-sdk/openai";
import { AnswerRelevancyMetric } from "@mastra/evals/llm";

// Configure the model for evaluation
const model = openai("gpt-4o-mini");

const metric = new AnswerRelevancyMetric(model, {
  uncertaintyWeight: 0.3,
  scale: 1,
});

const result = await metric.measure(
  "What is the capital of France?",
  "Paris is the capital of France.",
);

console.log(result.score); // Score from 0-1
console.log(result.info.reason); // Explanation of the score
```

Constructor Parameters

model: LanguageModel
The model used to evaluate relevancy.

options?: AnswerRelevancyMetricOptions = { uncertaintyWeight: 0.3, scale: 1 }
Configuration options for the metric.

AnswerRelevancyMetricOptions

uncertaintyWeight?: number = 0.3
Weight given to 'unsure' verdicts in scoring (0-1).

scale?: number = 1
Maximum score value.

measure() Parameters

input: string
The original query or prompt.

output: string
The LLM's response to evaluate.

Returns

score: number
Relevancy score (0 to scale, default 0-1).

info: object
Object containing the reason for the score.

info.reason: string
Explanation of the score.
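
Put together, the return value of measure() has roughly this shape (a sketch inferred from the fields documented above, not a type exported by the library):

```typescript
// Approximate shape of the object resolved by measure(),
// inferred from the documented fields. Hypothetical name.
interface AnswerRelevancyResult {
  score: number; // 0 to scale (default 0-1)
  info: {
    reason: string; // explanation of the score
  };
}
```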

Scoring Details

The metric evaluates relevancy through query-answer alignment, considering completeness, accuracy, and detail level.

Scoring Process

  1. Statement Analysis:

    • Breaks the output into meaningful statements while preserving context
    • Evaluates each statement against the query's requirements
  2. Relevance Evaluation:

    • “yes”: Full weight for direct matches
    • “unsure”: Partial weight (default: 0.3) for approximate matches
    • “no”: Zero weight for irrelevant content

Final score: ((direct + uncertaintyWeight * partial) / total_statements) * scale
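
As a rough illustration, the formula above could be computed like this (a minimal sketch; the Verdict type and computeRelevancyScore helper are hypothetical, not part of @mastra/evals):

```typescript
type Verdict = "yes" | "unsure" | "no";

// Hypothetical helper illustrating the scoring formula above.
function computeRelevancyScore(
  verdicts: Verdict[],
  uncertaintyWeight = 0.3,
  scale = 1,
): number {
  if (verdicts.length === 0) return 0;
  const direct = verdicts.filter((v) => v === "yes").length;
  const partial = verdicts.filter((v) => v === "unsure").length;
  return ((direct + uncertaintyWeight * partial) / verdicts.length) * scale;
}

// Two direct matches and one approximate match with the defaults:
// (2 + 0.3 * 1) / 3 ≈ 0.77
console.log(computeRelevancyScore(["yes", "yes", "unsure"]));
```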

Score interpretation

(0 to scale, default 0-1)

  • 1.0: Perfect relevance - complete and accurate
  • 0.7-0.9: High relevance - minor gaps or imprecisions
  • 0.4-0.6: Moderate relevance - significant gaps
  • 0.1-0.3: Low relevance - major issues
  • 0.0: No relevance - incorrect or off-topic
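
If you want to act on these bands programmatically, one option is a simple threshold check in a test or CI gate (a sketch; the 0.7 cutoff is an arbitrary choice for this example, not a library default):

```typescript
import { openai } from "@ai-sdk/openai";
import { AnswerRelevancyMetric } from "@mastra/evals/llm";

const metric = new AnswerRelevancyMetric(openai("gpt-4o-mini"));

const result = await metric.measure(
  "What is the capital of France?",
  "Paris is the capital of France.",
);

// Fail when the score falls below the "high relevance" band
// (0.7 on the default 0-1 scale; arbitrary cutoff for this example).
if (result.score < 0.7) {
  throw new Error(`Low relevancy (${result.score}): ${result.info.reason}`);
}
```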

Example with Custom Configuration

```typescript
import { openai } from "@ai-sdk/openai";
import { AnswerRelevancyMetric } from "@mastra/evals/llm";

// Configure the model for evaluation
const model = openai("gpt-4o-mini");

const metric = new AnswerRelevancyMetric(model, {
  uncertaintyWeight: 0.5, // Higher weight for uncertain verdicts
  scale: 5, // Use 0-5 scale instead of 0-1
});

const result = await metric.measure(
  "What are the benefits of exercise?",
  "Regular exercise improves cardiovascular health, builds strength, and boosts mental wellbeing.",
);

// Example output:
// {
//   score: 4.5,
//   info: {
//     reason: "The score is 4.5 out of 5 because the response directly addresses the query
//             with specific, accurate benefits of exercise. It covers multiple aspects
//             (cardiovascular, muscular, and mental health) in a clear and concise manner.
//             The answer is highly relevant and provides appropriate detail without
//             including unnecessary information."
//   }
// }
```