
AnswerRelevancyMetric

New Scorer API

We just released a new evals API called Scorers. It has a more ergonomic API, stores more metadata for error analysis, and offers more flexibility in the data structures you can evaluate. Migration is fairly simple, and we will continue to support the existing Evals API.

The AnswerRelevancyMetric class evaluates how well an LLM’s output answers or addresses the input query. It uses a judge-based system to determine relevancy and provides detailed scoring and reasoning.

Basic Usage

```typescript
import { openai } from "@ai-sdk/openai";
import { AnswerRelevancyMetric } from "@mastra/evals/llm";

// Configure the model for evaluation
const model = openai("gpt-4o-mini");

const metric = new AnswerRelevancyMetric(model, {
  uncertaintyWeight: 0.3,
  scale: 1,
});

const result = await metric.measure(
  "What is the capital of France?",
  "Paris is the capital of France.",
);

console.log(result.score); // Score from 0-1
console.log(result.info.reason); // Explanation of the score
```

Constructor Parameters

model: LanguageModel
The model used to evaluate relevancy.

options?: AnswerRelevancyMetricOptions = { uncertaintyWeight: 0.3, scale: 1 }
Configuration options for the metric.

AnswerRelevancyMetricOptions

uncertaintyWeight?: number = 0.3
Weight given to 'unsure' verdicts in scoring (0-1).

scale?: number = 1
Maximum score value.

measure() Parameters

input: string
The original query or prompt.

output: string
The LLM's response to evaluate.

Returns

score: number
Relevancy score (0 to scale, default 0-1).

info: object
Object containing the reason for the score.

info.reason: string
Explanation of the score.
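
Put together, the return value of measure() has roughly this shape (a sketch inferred from the fields documented above, not a type exported by the library):

```typescript
// Approximate shape of the object resolved by measure(),
// inferred from the documented fields. Hypothetical name.
interface AnswerRelevancyResult {
  score: number; // 0 to scale (default 0-1)
  info: {
    reason: string; // explanation of the score
  };
}
```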

Scoring Details

The metric evaluates relevancy through query-answer alignment, considering completeness, accuracy, and detail level.

Scoring Process

  1. Statement Analysis:

    • Breaks the output into meaningful statements while preserving context
    • Evaluates each statement against the query's requirements
  2. Relevance Evaluation:

    • “yes”: Full weight for direct matches
    • “unsure”: Partial weight (default: 0.3) for approximate matches
    • “no”: Zero weight for irrelevant content

Final score: ((direct + uncertaintyWeight * partial) / total_statements) * scale
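
As a rough illustration, the formula above could be computed like this (a minimal sketch; the Verdict type and computeRelevancyScore helper are hypothetical, not part of @mastra/evals):

```typescript
type Verdict = "yes" | "unsure" | "no";

// Hypothetical helper illustrating the scoring formula above.
function computeRelevancyScore(
  verdicts: Verdict[],
  uncertaintyWeight = 0.3,
  scale = 1,
): number {
  if (verdicts.length === 0) return 0;
  const direct = verdicts.filter((v) => v === "yes").length;
  const partial = verdicts.filter((v) => v === "unsure").length;
  return ((direct + uncertaintyWeight * partial) / verdicts.length) * scale;
}

// Two direct matches and one approximate match with the defaults:
// (2 + 0.3 * 1) / 3 ≈ 0.77
console.log(computeRelevancyScore(["yes", "yes", "unsure"]));
```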

Score interpretation

(0 to scale, default 0-1)

  • 1.0: Perfect relevance - complete and accurate
  • 0.7-0.9: High relevance - minor gaps or imprecisions
  • 0.4-0.6: Moderate relevance - significant gaps
  • 0.1-0.3: Low relevance - major issues
  • 0.0: No relevance - incorrect or off-topic
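
If you want to act on these bands programmatically, one option is a simple threshold check in a test or CI gate (a sketch; the 0.7 cutoff is an arbitrary choice for this example, not a library default):

```typescript
import { openai } from "@ai-sdk/openai";
import { AnswerRelevancyMetric } from "@mastra/evals/llm";

const metric = new AnswerRelevancyMetric(openai("gpt-4o-mini"));

const result = await metric.measure(
  "What is the capital of France?",
  "Paris is the capital of France.",
);

// Fail when the score falls below the "high relevance" band
// (0.7 on the default 0-1 scale; arbitrary cutoff for this example).
if (result.score < 0.7) {
  throw new Error(`Low relevancy (${result.score}): ${result.info.reason}`);
}
```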

Example with Custom Configuration

```typescript
import { openai } from "@ai-sdk/openai";
import { AnswerRelevancyMetric } from "@mastra/evals/llm";

// Configure the model for evaluation
const model = openai("gpt-4o-mini");

const metric = new AnswerRelevancyMetric(model, {
  uncertaintyWeight: 0.5, // Higher weight for uncertain verdicts
  scale: 5, // Use 0-5 scale instead of 0-1
});

const result = await metric.measure(
  "What are the benefits of exercise?",
  "Regular exercise improves cardiovascular health, builds strength, and boosts mental wellbeing.",
);

// Example output:
// {
//   score: 4.5,
//   info: {
//     reason: "The score is 4.5 out of 5 because the response directly addresses the query
//             with specific, accurate benefits of exercise. It covers multiple aspects
//             (cardiovascular, muscular, and mental health) in a clear and concise manner.
//             The answer is highly relevant and provides appropriate detail without
//             including unnecessary information."
//   }
// }
```