Context Precision Evaluation
We recently released a new evals API called Scorers, with a more ergonomic API, more metadata stored for error analysis, and more flexibility in the data structures you can evaluate. Migration is fairly simple, and we will continue to support the existing Evals API.
Use ContextPrecisionMetric to evaluate whether the response relies on the most relevant parts of the provided context. The metric accepts a query and a response, and returns a score and an info object containing a reason.
Installation
npm install @mastra/evals
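The examples below also import the openai model provider from @ai-sdk/openai; if it is not already part of your project, install it alongside the evals package:
npm install @ai-sdk/openai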
High precision example
In this example, the response draws only from context that is directly relevant to the query. Every piece of context supports the answer, resulting in a high precision score.
import { openai } from "@ai-sdk/openai";
import { ContextPrecisionMetric } from "@mastra/evals/llm";
const metric = new ContextPrecisionMetric(openai("gpt-4o-mini"), {
context: [
"Photosynthesis converts sunlight into energy.",
"Plants use chlorophyll for photosynthesis.",
"Photosynthesis produces oxygen as a byproduct.",
"The process requires sunlight and chlorophyll."
]
});
const query = "What is photosynthesis and how does it work?";
const response = "Photosynthesis is a process where plants convert sunlight into energy using chlorophyll, producing oxygen as a byproduct.";
const result = await metric.measure(query, response);
console.log(result);
High precision output
The output receives a perfect score because all context statements directly contribute to answering the query, without including unrelated information.
{
score: 1,
info: {
reason: 'The score is 1 because all context nodes are relevant and provide direct definitions, explanations, and essential components of photosynthesis, with the first node defining the process, followed by supporting details about chlorophyll and oxygen production, all in optimal ranking order.'
}
}
Mixed precision example
In this example, the response draws on the relevant context items, but the provided context also contains unrelated or distracting information, which reduces the overall precision score.
import { openai } from "@ai-sdk/openai";
import { ContextPrecisionMetric } from "@mastra/evals/llm";
const metric = new ContextPrecisionMetric(openai("gpt-4o-mini"), {
context: [
"Volcanoes are openings in the Earth's crust.",
"Volcanoes can be active, dormant, or extinct.",
"Hawaii has many active volcanoes.",
"The Pacific Ring of Fire has many volcanoes."
]
});
const query = "What are the different types of volcanoes?";
const response = "Volcanoes can be classified as active, dormant, or extinct based on their activity status.";
const result = await metric.measure(query, response);
console.log(result);
Mixed precision output
The output receives a mid-range score because the response uses the relevant context items, but the surrounding context contains unrelated information that lowers the overall precision.
{
score: 0.58,
info: {
reason: 'The score is 0.58 because while the second and third nodes provided direct definitions and examples of volcano types, the first and fourth nodes were irrelevant, leading to a lower precision score. The relevant nodes were not optimally ordered, as the most useful context was not the first, which affected the overall effectiveness.'
}
}
Low precision example
In this example, the response uses only a small portion of the provided context. Most of the context is unrelated to the query, resulting in a low precision score.
import { openai } from "@ai-sdk/openai";
import { ContextPrecisionMetric } from "@mastra/evals/llm";
const metric = new ContextPrecisionMetric(openai("gpt-4o-mini"), {
context: [
"The Nile River is in Africa.",
"The Nile is the longest river.",
"Ancient Egyptians used the Nile.",
"The Nile flows north."
]
});
const query = "Which direction does the Nile River flow?";
const response = "The Nile River flows northward.";
const result = await metric.measure(query, response);
console.log(result);
Low precision output
The output receives a low score because only one piece of context is relevant to the query. The rest of the context is unrelated and does not contribute to the response.
{
score: 0.25,
info: {
reason: "The score is 0.25 because only the fourth context node directly answers the question about the direction of the Nile River's flow, while the first three nodes are irrelevant, providing no useful information. This highlights a significant limitation in the overall relevance of the retrieved contexts, as the majority did not contribute to the expected output."
}
}
Metric configuration
You can create a ContextPrecisionMetric instance by providing a context array that represents the relevant background information. You can also configure optional parameters such as scale to set the maximum possible score.
const metric = new ContextPrecisionMetric(openai("gpt-4o-mini"), {
context: [""],
scale: 1
});
See ContextPrecisionMetric for a full list of configuration options.
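As a quick illustration, assuming scale simply rescales the maximum score, a metric configured with scale: 10 would return scores between 0 and 10 instead of 0 and 1:

const scaledMetric = new ContextPrecisionMetric(openai("gpt-4o-mini"), {
  context: [
    "Photosynthesis converts sunlight into energy.",
    "Plants use chlorophyll for photosynthesis."
  ],
  scale: 10
});

// Under the assumed scaling behaviour, a fully relevant context scores 10 rather than 1.
const scaledResult = await scaledMetric.measure(
  "What is photosynthesis?",
  "Photosynthesis converts sunlight into energy using chlorophyll."
);
console.log(scaledResult.score);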
Understanding the results
ContextPrecisionMetric returns a result in the following shape:
{
score: number,
info: {
reason: string
}
}
Precision score
A precision score between 0 and 1:
- 1.0: Perfect precision – all context items are relevant and used.
- 0.7–0.9: High precision – most context items are relevant.
- 0.4–0.6: Mixed precision – some context items are relevant.
- 0.1–0.3: Low precision – few context items are relevant.
- 0.0: No precision – no context items are relevant.
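If you want to act on these bands programmatically, a small helper like the hypothetical labelPrecision below maps a score (assuming the default scale of 1) onto the ranges above:

// Hypothetical helper: maps a 0-1 precision score onto the bands described above.
function labelPrecision(score: number): string {
  if (score === 1) return "perfect precision";
  if (score >= 0.7) return "high precision";
  if (score >= 0.4) return "mixed precision";
  if (score > 0) return "low precision";
  return "no precision";
}

console.log(labelPrecision(result.score)); // "mixed precision" for the 0.58 example above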
Precision info
An explanation for the score, with details including:
- Relevance of each context item to the query and response.
- Whether relevant items were included in the response.
- Whether irrelevant context was mistakenly included.
- Overall usefulness and focus of the response relative to the provided context.
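For error analysis, the reason can be logged or stored alongside the score, for example:

// Log the score together with its explanation for later error analysis.
console.log(`score: ${result.score}`);
console.log(`reason: ${result.info.reason}`);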