Context Relevancy Evaluation
We recently released a new evals API called Scorers. It offers a more ergonomic interface, stores more metadata for error analysis, and is more flexible about the data structures it can evaluate. Migration is straightforward, and we will continue to support the existing Evals API.
Use ContextRelevancyMetric to evaluate how well the retrieved context aligns with the original query. The metric accepts a query and a response, and returns a score and an info object containing a reason.
Installation
npm install @mastra/evals
High relevancy example
In this example, the response uses only context that is directly relevant to the query. Every context item supports the answer, resulting in a perfect relevancy score.
import { openai } from "@ai-sdk/openai";
import { ContextRelevancyMetric } from "@mastra/evals/llm";

const metric = new ContextRelevancyMetric(openai("gpt-4o-mini"), {
  context: [
    "Einstein won the Nobel Prize for his discovery of the photoelectric effect.",
    "He published his theory of relativity in 1905.",
    "His work revolutionized modern physics."
  ]
});

const query = "What were some of Einstein's achievements?";
const response =
  "Einstein won the Nobel Prize for discovering the photoelectric effect and published his groundbreaking theory of relativity.";

const result = await metric.measure(query, response);

console.log(result);
High relevancy output
The output receives a perfect score because all context statements directly contribute to answering the query without including any unrelated information.
{
  score: 1,
  info: {
    reason: "The score is 1 because the retrieval context directly addresses the input by highlighting Einstein's significant achievements, making it entirely relevant."
  }
}
Mixed relevancy example
In this example, the context mixes items that directly explain the answer with items that are unrelated or only loosely related to the query. The off-topic items lower the overall relevancy score.
import { openai } from "@ai-sdk/openai";
import { ContextRelevancyMetric } from "@mastra/evals/llm";

const metric = new ContextRelevancyMetric(openai("gpt-4o-mini"), {
  context: [
    "Solar eclipses occur when the Moon blocks the Sun.",
    "The Moon moves between the Earth and Sun during eclipses.",
    "The Moon is visible at night.",
    "The Moon has no atmosphere."
  ]
});

const query = "What causes solar eclipses?";
const response =
  "Solar eclipses happen when the Moon moves between Earth and the Sun, blocking sunlight.";

const result = await metric.measure(query, response);

console.log(result);
Mixed relevancy output
The output receives a mid-range score because the context contains relevant statements about the eclipse mechanism alongside unrelated facts that dilute the overall relevancy.
{
  score: 0.5,
  info: {
    reason: "The score is 0.5 because the retrieval context contains statements that are irrelevant to the input, such as 'The Moon is visible at night' and 'The Moon has no atmosphere', which do not explain the causes of solar eclipses. This lack of relevant information significantly lowers the contextual relevancy score."
  }
}
Low relevancy example
In this example, most of the context is unrelated to the query. Only one item is relevant, which results in a low relevancy score.
import { openai } from "@ai-sdk/openai";
import { ContextRelevancyMetric } from "@mastra/evals/llm";

const metric = new ContextRelevancyMetric(openai("gpt-4o-mini"), {
  context: [
    "The Great Barrier Reef is in Australia.",
    "Coral reefs need warm water to survive.",
    "Marine life depends on coral reefs.",
    "The capital of Australia is Canberra."
  ]
});

const query = "What is the capital of Australia?";
const response = "The capital of Australia is Canberra.";

const result = await metric.measure(query, response);

console.log(result);
Low relevancy output
The output receives a low score because only one context item is relevant to the query. The remaining items introduce unrelated information that does not support the response.
{
  score: 0.25,
  info: {
    reason: "The score is 0.25 because the retrieval context contains statements that are completely irrelevant to the input question about the capital of Australia. For instance, 'The Great Barrier Reef is in Australia' and 'Coral reefs need warm water to survive' do not provide any geographical or political information related to the capital, thus failing to address the inquiry."
  }
}
Metric configuration
You can create a ContextRelevancyMetric instance by providing a context array representing background information relevant to a query. You can also configure optional parameters such as scale to define the scoring range.
const metric = new ContextRelevancyMetric(openai("gpt-4o-mini"), {
  context: [""],
  scale: 1
});
See ContextRelevancyMetric for a full list of configuration options.
Understanding the results
ContextRelevancyMetric returns a result in the following shape:
{
  score: number,
  info: {
    reason: string
  }
}
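Because measure is asynchronous and returns this shape, the result can be consumed directly in evaluation scripts or tests. Below is a minimal sketch that reuses the metric, query, and response from the examples above; the 0.7 threshold is an arbitrary value chosen for illustration, not a library default:

const MIN_RELEVANCY = 0.7; // hypothetical quality bar, not part of @mastra/evals

const result = await metric.measure(query, response);

if (result.score < MIN_RELEVANCY) {
  // Surface the judge's explanation so the retrieval step can be debugged.
  console.warn(`Low context relevancy (${result.score}): ${result.info.reason}`);
}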
Relevancy score
A relevancy score between 0 and 1:
- 1.0: Perfect relevancy – all context directly relevant to query.
- 0.7–0.9: High relevancy – most context relevant to query.
- 0.4–0.6: Mixed relevancy – some context relevant to query.
- 0.1–0.3: Low relevancy – little context relevant to query.
- 0.0: No relevancy – no context relevant to query.
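If you want to report these bands alongside the raw score, a small helper can map one to the other. This is a sketch only; the band names and cutoffs simply mirror the list above and are not produced by the metric itself:

// Hypothetical helper that labels a score using the bands listed above.
function relevancyBand(score: number): "perfect" | "high" | "mixed" | "low" | "none" {
  if (score === 1.0) return "perfect";
  if (score >= 0.7) return "high";
  if (score >= 0.4) return "mixed";
  if (score >= 0.1) return "low";
  return "none";
}

console.log(relevancyBand(0.5)); // "mixed", matching the solar eclipse example above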
Relevancy info
An explanation for the score, with details including:
- Relevance to input query.
- Statement extraction from context.
- Usefulness for response.
- Overall context quality.