ContextPrecisionMetric

The ContextPrecisionMetric class evaluates how relevant and precise the retrieved context nodes are for generating the expected output. It uses a judge-based system to analyze each context piece's contribution and provides weighted scoring based on position.
Basic Usage
```typescript
import { openai } from "@ai-sdk/openai";
import { ContextPrecisionMetric } from "@mastra/evals/llm";

// Configure the model for evaluation
const model = openai("gpt-4o-mini");

const metric = new ContextPrecisionMetric(model, {
  context: [
    "Photosynthesis is a biological process used by plants to create energy from sunlight.",
    "Plants need water and nutrients from the soil to grow.",
    "The process of photosynthesis produces oxygen as a byproduct.",
  ],
});

const result = await metric.measure(
  "What is photosynthesis?",
  "Photosynthesis is the process by which plants convert sunlight into energy.",
);

console.log(result.score); // Precision score from 0-1
console.log(result.info.reason); // Explanation of the score
```
Constructor Parameters

- `model` (`LanguageModel`): Configuration for the model used to evaluate context relevance.
- `options` (`ContextPrecisionMetricOptions`): Configuration options for the metric.
ContextPrecisionMetricOptions
scale?:
number
= 1
Maximum score value
context:
string[]
Array of context pieces in their retrieval order
measure() Parameters

- `input` (`string`): The original query or prompt.
- `output` (`string`): The generated response to evaluate.
Returns
score:
number
Precision score (0 to scale, default 0-1)
info:
object
Object containing the reason for the score
string
reason:
string
Detailed explanation of the score
Scoring Details
The metric evaluates context precision through binary relevance assessment and Mean Average Precision (MAP) scoring.
Scoring Process

- Assigns binary relevance scores:
  - Relevant context: 1
  - Irrelevant context: 0
- Calculates Mean Average Precision (as sketched below):
  - Computes precision at each position
  - Weights earlier positions more heavily
  - Normalizes to configured scale

Final score: `Mean Average Precision * scale`
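To make the weighting concrete, here is a minimal sketch of standard Mean Average Precision computed over binary relevance verdicts. This illustrates the general technique, not the library's internal implementation; the judge's verdicts and exact weighting may differ.

```typescript
// Standard Average Precision over binary relevance verdicts.
// verdicts[i] is 1 if context piece i was judged relevant, else 0.
function meanAveragePrecision(verdicts: number[], scale = 1): number {
  let relevantSoFar = 0;
  let precisionSum = 0;

  verdicts.forEach((verdict, index) => {
    if (verdict === 1) {
      relevantSoFar += 1;
      // Precision at this position: relevant pieces seen so far
      // divided by total pieces seen so far. Relevant pieces at
      // earlier positions therefore contribute larger values.
      precisionSum += relevantSoFar / (index + 1);
    }
  });

  const totalRelevant = verdicts.filter((v) => v === 1).length;
  if (totalRelevant === 0) return 0; // no relevant context at all

  // Normalize to the configured scale (default 0-1).
  return (precisionSum / totalRelevant) * scale;
}

// Example: relevant pieces at positions 1 and 3 of four retrieved pieces.
console.log(meanAveragePrecision([1, 0, 1, 0])); // (1/1 + 2/3) / 2 ≈ 0.83
```

Note how moving a relevant piece earlier in the list raises the score: `[1, 1, 0, 0]` yields 1.0, while `[0, 0, 1, 1]` yields roughly 0.42 for the same set of relevant pieces.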
Score interpretation
(0 to scale, default 0-1)
- 1.0: All relevant context in optimal order
- 0.7-0.9: Mostly relevant context with good ordering
- 0.4-0.6: Mixed relevance or suboptimal ordering
- 0.1-0.3: Limited relevance or poor ordering
- 0.0: No relevant context
Example with Analysis
```typescript
import { openai } from "@ai-sdk/openai";
import { ContextPrecisionMetric } from "@mastra/evals/llm";

// Configure the model for evaluation
const model = openai("gpt-4o-mini");

const metric = new ContextPrecisionMetric(model, {
  context: [
    "Exercise strengthens the heart and improves blood circulation.",
    "A balanced diet is important for health.",
    "Regular physical activity reduces stress and anxiety.",
    "Exercise equipment can be expensive.",
  ],
});

const result = await metric.measure(
  "What are the benefits of exercise?",
  "Regular exercise improves cardiovascular health and mental wellbeing.",
);

// Example output:
// {
//   score: 0.75,
//   info: {
//     reason: "The score is 0.75 because the first and third contexts are highly relevant
//       to the benefits mentioned in the output, while the second and fourth contexts
//       are not directly related to exercise benefits. The relevant contexts are well-positioned
//       at the beginning and middle of the sequence."
//   }
// }
```