Using Evaluation Metrics in Mastra
Evaluation metrics help measure and validate AI model outputs across different dimensions. Here's a comprehensive guide to Mastra's evaluation suite. All of these metrics output a score between 0 and 1 by default; metrics that accept a scale option report their score on that range instead.
We are starting with some NLP-based metrics, and will add additional LLM-as-judge metrics soon.
Core Evaluation Metrics
Model Configuration
First, set up the model configuration used by the LLM-based metrics:
import { ModelConfig } from "@mastra/core";
const model: ModelConfig = {
  provider: "OPEN_AI",
  model: "gpt-4",
  apiKey: process.env.OPENAI_API_KEY,
};
Answer Relevancy
Answer relevancy evaluates if responses address queries appropriately:
import { AnswerRelevancyMetric } from "@mastra/evals/llm";
const metric = new AnswerRelevancyMetric(model, {
  uncertaintyWeight: 0.3,
  scale: 10,
});
const result = await metric.measure({
  input: "What is the capital of France?",
  output: "Paris is the capital of France.",
});
This metric uses an LLM to judge how well responses address queries, scoring yes/no/unsure verdicts with uncertainty weighting. The score is calculated as (relevancyCount / totalVerdicts) * scale, where "unsure" verdicts count as the uncertaintyWeight (0.3 in the example above).
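As a hypothetical illustration of that formula (not library code), three verdicts of yes, unsure, and no with the configuration above would score like this:
// Hypothetical illustration of the answer relevancy formula.
// "yes" counts as 1, "unsure" as the uncertaintyWeight (0.3), "no" as 0.
const verdicts = ["yes", "unsure", "no"];
const uncertaintyWeight = 0.3;
const scale = 10;
const relevancyCount = verdicts.reduce(
  (sum, v) => sum + (v === "yes" ? 1 : v === "unsure" ? uncertaintyWeight : 0),
  0,
); // 1 + 0.3 + 0 = 1.3
const score = (relevancyCount / verdicts.length) * scale; // (1.3 / 3) * 10 ≈ 4.33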
Completeness
Completeness measures how thoroughly a response covers the key elements from the input:
import { CompletenessMetric } from "@mastra/evals/nlp";
const metric = new CompletenessMetric();
const result = await metric.measure({
  input: "Explain the water cycle: evaporation, condensation, precipitation",
  output:
    "Water evaporates from surfaces, forms clouds through condensation, and returns as precipitation",
});
Specifically, this metric extracts and compares key elements (nouns, verbs, topics) between input and output using NLP. Score represents the ratio of matched elements to total elements, with intelligent partial matching for longer words.
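The scoring itself is a plain ratio. Here is a rough sketch with hypothetical element lists; the actual extraction and partial matching are handled by the library:
// Rough sketch of the completeness ratio (element lists are hypothetical).
const inputElements = ["water", "cycle", "evaporation", "condensation", "precipitation"];
const matchedElements = ["water", "condensation", "precipitation"]; // elements found in the output
const score = matchedElements.length / inputElements.length; // 3 / 5 = 0.6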
Content Similarity
This metric performs direct string comparison using a similarity library, with configurable case and whitespace sensitivity. Perfect matches score 1.0, with decreasing scores based on string differences.
import { ContentSimilarityMetric } from "@mastra/evals/nlp";
const metric = new ContentSimilarityMetric({
  ignoreCase: true,
  ignoreWhitespace: true,
});
const result = await metric.measure({
  input: "The quick brown fox",
  output: "the Quick Brown fox",
});
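With ignoreCase and ignoreWhitespace enabled, both strings are normalized before comparison, so the example above would score 1.0. A minimal sketch of the normalization idea (not the library's internals):
// Sketch of the normalization implied by the options above.
const normalize = (s: string) => s.toLowerCase().replace(/\s+/g, " ").trim();
const isMatch = normalize("The quick brown fox") === normalize("the Quick Brown fox"); // true → perfect match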
Context Position
Context position evaluates how well the model uses ordered context. Earlier positions are weighted more heavily (weight = 1/position).
The final score is the ratio of weighted relevant items to maximum possible weighted sum.
import { ContextPositionMetric } from "@mastra/evals/llm";
const metric = new ContextPositionMetric(model, {
  scale: 10,
});
const result = await metric.measure({
  input: "Summarize the events",
  output: "First came A, then B, finally C",
  context: [
    "A occurred in the morning",
    "B happened at noon",
    "C took place in the evening",
  ],
});
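As a worked example of the position weighting (hypothetical relevance verdicts, not library output): if the first and third context items are judged relevant, the calculation looks like this:
// Hypothetical worked example of the 1/position weighting described above.
const relevance = [true, false, true]; // which context items were judged relevant
const weights = relevance.map((_, i) => 1 / (i + 1)); // [1, 0.5, 0.333...]
const weightedRelevant = relevance.reduce(
  (sum, isRelevant, i) => sum + (isRelevant ? weights[i] : 0),
  0,
); // 1.333...
const maxPossible = weights.reduce((a, b) => a + b, 0); // 1.833...
const score = (weightedRelevant / maxPossible) * 10; // ≈ 7.27 on a scale of 10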
Context Precision
Context precision measures accurate use of provided context by calculating precision at each relevant position in the response. The score is the normalized sum of precision values at relevant positions, divided by the number of relevant items.
import { ContextPrecisionMetric } from "@mastra/evals/llm";
const metric = new ContextPrecisionMetric(model, {
  scale: 10,
});
const result = await metric.measure({
  input: "What did the research find?",
  output: "The study found significant improvements",
  context: [
    "Research showed 45% improvement",
    "Results were statistically significant",
  ],
});
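The precision-at-position calculation works roughly like this (hypothetical relevance verdicts; in practice they come from the LLM judge):
// Hypothetical illustration of precision at each relevant position.
const relevant = [true, false, true]; // verdict per context item, in order
let relevantSoFar = 0;
let precisionSum = 0;
relevant.forEach((isRelevant, i) => {
  if (isRelevant) {
    relevantSoFar++;
    precisionSum += relevantSoFar / (i + 1); // precision at this position
  }
});
const totalRelevant = relevant.filter(Boolean).length; // 2
const score = (precisionSum / totalRelevant) * 10; // (1/1 + 2/3) / 2 * 10 ≈ 8.33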
Difference
The textual difference metric calculates detailed differences between the input and output texts:
import { TextualDifferenceMetric } from "@mastra/evals/nlp";
const metric = new TextualDifferenceMetric();
const result = await metric.measure({
  input: "Original text version",
  output: "Modified text version",
});
// Provides ratio, number of changes, and length differences
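The similarity ratio can be thought of as a sequence-matcher style ratio, 2 × matches / total characters. A rough sketch with hypothetical numbers (the library's exact algorithm may differ):
// Rough sketch of a sequence-matcher style ratio (numbers are hypothetical).
const matchingChars = 15;   // characters the two strings have in common
const totalChars = 21 + 21; // combined length of input and output
const ratio = (2 * matchingChars) / totalChars; // ≈ 0.71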
Keyword Coverage
This metric extracts and compares keywords between input and output using a keyword extraction library. The score is simply the number of matched keywords divided by the total keywords from the input.
import { KeywordCoverageMetric } from "@mastra/evals/nlp";
const metric = new KeywordCoverageMetric();
const result = await metric.measure({
  input: "Explain photosynthesis: chlorophyll, sunlight, glucose",
  output: "Plants use chlorophyll to convert sunlight into glucose",
});
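The scoring is a plain ratio. A sketch with hypothetical keyword sets (the actual keyword extraction is done by the library):
// Sketch of the keyword coverage ratio with hypothetical keyword sets.
const inputKeywords = new Set(["photosynthesis", "chlorophyll", "sunlight", "glucose"]);
const outputKeywords = new Set(["plants", "chlorophyll", "sunlight", "glucose"]);
const matched = [...inputKeywords].filter((k) => outputKeywords.has(k)).length; // 3
const score = matched / inputKeywords.size; // 3 / 4 = 0.75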
Prompt Alignment
This metric uses an LLM to check adherence to specific instructions with binary yes/no scoring. The final score is the ratio of followed instructions to total instructions, scaled to the configured range.
import { PromptAlignmentMetric } from "@mastra/evals/llm";
const metric = new PromptAlignmentMetric(model, {
  instructions: [
    "Use formal language",
    "Include specific examples",
    "Stay under 100 words",
  ],
  scale: 10,
});
const result = await metric.measure({
  input: "Describe quantum computing",
  output: "Quantum computing uses quantum bits...",
});
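With three instructions and a scale of 10, the scoring looks like this (the yes/no verdicts here are hypothetical):
// Hypothetical illustration of the prompt alignment ratio.
const followedInstruction = [true, true, false]; // one verdict per instruction
const followed = followedInstruction.filter(Boolean).length; // 2
const score = (followed / followedInstruction.length) * 10; // (2 / 3) * 10 ≈ 6.67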
Tone Consistency
This metric analyzes sentiment consistency either between input and output or within the sentences of a single text. The score is based on sentiment difference or variance, with 1.0 indicating perfect consistency.
import { ToneConsistencyMetric } from "@mastra/evals/nlp";
const metric = new ToneConsistencyMetric();
const result = await metric.measure({
  input: "Write a positive product review",
  output: "This product exceeded my expectations!",
});
// Measures sentiment stability and alignment
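Conceptually, the closer the two sentiment values are, the higher the score. A rough sketch with hypothetical sentiment values (not the library's sentiment analysis):
// Rough sketch: closer sentiment values give a higher consistency score.
const inputSentiment = 0.8;  // hypothetical sentiment of the input
const outputSentiment = 0.7; // hypothetical sentiment of the output
const score = 1 - Math.abs(inputSentiment - outputSentiment); // 0.9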
Combining Metrics
Combine metrics for a more comprehensive assessment:
import { ModelConfig } from "@mastra/core";
import {
  AnswerRelevancyMetric,
  ContextPrecisionMetric,
  PromptAlignmentMetric,
} from "@mastra/evals/llm";
import {
  CompletenessMetric,
  ContentSimilarityMetric,
  TextualDifferenceMetric,
  KeywordCoverageMetric,
  ToneConsistencyMetric,
} from "@mastra/evals/nlp";
async function evaluateResponse({
  input,
  output,
  context,
  instructions,
}: {
  input: string;
  output: string;
  context?: string[];
  instructions?: string[];
}) {
  const model: ModelConfig = {
    provider: "OPEN_AI",
    model: "gpt-4",
    apiKey: process.env.OPENAI_API_KEY,
  };
  const metrics = [
    new AnswerRelevancyMetric(model),
    new CompletenessMetric(),
    new ContentSimilarityMetric(),
    new ContextPrecisionMetric(model),
    new TextualDifferenceMetric(),
    new KeywordCoverageMetric(),
    new PromptAlignmentMetric(model, { instructions: instructions || [] }),
    new ToneConsistencyMetric(),
  ];
  const results = await Promise.all(
    metrics.map((metric) =>
      metric.measure({
        input,
        output,
        context,
      }),
    ),
  );
  return results;
}
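A minimal usage sketch of the helper above (inputs are illustrative):
// Illustrative call to the evaluateResponse helper defined above.
const scores = await evaluateResponse({
  input: "Describe quantum computing",
  output: "Quantum computing uses quantum bits to perform parallel computations...",
  context: ["Qubits can represent multiple states simultaneously"],
  instructions: ["Use formal language", "Stay under 100 words"],
});
scores.forEach((result, i) => console.log(`Metric ${i}: ${result.score}`));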