ContextRelevancyMetric
We recently released a new evals API called Scorers, with a more ergonomic API, more metadata stored for error analysis, and more flexibility in the data structures you can evaluate. It's fairly simple to migrate, but we will continue to support the existing Evals API.
The ContextRelevancyMetric class evaluates the quality of your RAG (Retrieval-Augmented Generation) pipeline's retriever by measuring how relevant the retrieved context is to the input query. It uses an LLM-based evaluation system that first extracts statements from the context and then assesses their relevance to the input.
Basic Usage
import { openai } from "@ai-sdk/openai";
import { ContextRelevancyMetric } from "@mastra/evals/llm";
// Configure the model for evaluation
const model = openai("gpt-4o-mini");
const metric = new ContextRelevancyMetric(model, {
  context: [
    "All data is encrypted at rest and in transit",
    "Two-factor authentication is mandatory",
    "The platform supports multiple languages",
    "Our offices are located in San Francisco",
  ],
});

const result = await metric.measure(
  "What are our product's security features?",
  "Our product uses encryption and requires 2FA.",
);
console.log(result.score); // Score from 0-1
console.log(result.info.reason); // Explanation of the relevancy assessment
Constructor Parameters
model: The language model used to evaluate context relevancy.
options: ContextRelevancyMetricOptions. Configuration options for the metric.
ContextRelevancyMetricOptions
scale?: number (optional). Maximum score value. Defaults to 1.
context: string[]. Array of retrieved context pieces to evaluate against the query.
measure() Parameters
input: string. The query or prompt sent to the RAG pipeline.
output: string. The LLM's response to evaluate.
Returns
score: number. Relevancy score from 0 to the configured scale (default 0-1).
info: object. Details about the evaluation.
reason: string. Explanation of the relevancy assessment.
Scoring Details
The metric evaluates how well retrieved context matches the query through binary relevance classification.
Scoring Process
1. Extracts statements from context:
   - Breaks down context into meaningful units
   - Preserves semantic relationships
2. Evaluates statement relevance:
   - Assesses each statement against the query
   - Counts relevant statements
   - Calculates the relevance ratio
Final score: (relevant_statements / total_statements) * scale
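As a quick worked example (illustrative only; the metric performs this calculation internally), three relevant statements out of five score 0.6 on the default 0-1 scale, or 60 on a 0-100 scale:

// Sketch of the scoring arithmetic, not part of the @mastra/evals API
const relevantStatements = 3;
const totalStatements = 5;

const defaultScaleScore = (relevantStatements / totalStatements) * 1; // 0.6
const customScaleScore = (relevantStatements / totalStatements) * 100; // 60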
Score interpretation (0 to scale, default 0-1)
- 1.0: Perfect relevancy - all retrieved context is relevant
- 0.7-0.9: High relevancy - most context is relevant with few irrelevant pieces
- 0.4-0.6: Moderate relevancy - a mix of relevant and irrelevant context
- 0.1-0.3: Low relevancy - mostly irrelevant context
- 0.0: No relevancy - completely irrelevant context
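If you want to act on these bands programmatically, a small helper like the one below can translate a 0-1 score into a label. The function name and cutoffs are illustrative, not part of @mastra/evals:

// Hypothetical helper: maps a 0-1 relevancy score to the bands described above
function interpretRelevancy(score: number): string {
  if (score >= 1.0) return "Perfect relevancy";
  if (score >= 0.7) return "High relevancy";
  if (score >= 0.4) return "Moderate relevancy";
  if (score >= 0.1) return "Low relevancy";
  return "No relevancy";
}

console.log(interpretRelevancy(0.6)); // "Moderate relevancy"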
Example with Custom Configuration
import { openai } from "@ai-sdk/openai";
import { ContextRelevancyMetric } from "@mastra/evals/llm";
// Configure the model for evaluation
const model = openai("gpt-4o-mini");
const metric = new ContextRelevancyMetric(model, {
  scale: 100, // Use 0-100 scale instead of 0-1
  context: [
    "Basic plan costs $10/month",
    "Pro plan includes advanced features at $30/month",
    "Enterprise plan has custom pricing",
    "Our company was founded in 2020",
    "We have offices worldwide",
  ],
});

const result = await metric.measure(
  "What are our pricing plans?",
  "We offer Basic, Pro, and Enterprise plans.",
);
// Example output:
// {
//   score: 60,
//   info: {
//     reason: "3 out of 5 statements are relevant to pricing plans. The statements about
//       company founding and office locations are not relevant to the pricing query."
//   }
// }
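In practice you might use the score to gate a retrieval quality check. The sketch below assumes a simple, arbitrary threshold of 70 on the 0-100 scale used above; pick a value that matches your own quality bar:

// Hypothetical threshold check using the result from the example above
const MIN_RELEVANCY = 70; // assumed threshold on the 0-100 scale

if (result.score < MIN_RELEVANCY) {
  console.warn(
    `Retriever returned weakly relevant context (score ${result.score}): ${result.info.reason}`,
  );
}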