PromptAlignmentMetric
We recently released Scorers, a new evals API with a more ergonomic interface, richer metadata for error analysis, and more flexibility in the data structures it can evaluate. Migrating is straightforward, and we will continue to support the existing Evals API.
The PromptAlignmentMetric class evaluates how strictly an LLM’s output follows a given set of prompt instructions. It uses a judge-based system to verify that each instruction is followed exactly and provides detailed reasoning for any deviations.
Basic Usage
import { openai } from "@ai-sdk/openai";
import { PromptAlignmentMetric } from "@mastra/evals/llm";
// Configure the model for evaluation
const model = openai("gpt-4o-mini");
const instructions = [
  "Start sentences with capital letters",
  "End each sentence with a period",
  "Use present tense",
];
const metric = new PromptAlignmentMetric(model, {
  instructions,
  scale: 1,
});
const result = await metric.measure(
  "describe the weather",
  "The sun is shining. Clouds float in the sky. A gentle breeze blows.",
);
console.log(result.score); // Alignment score from 0-1
console.log(result.info.reason); // Explanation of the score
Constructor Parameters
- model: The language model used to judge instruction alignment (for example, openai("gpt-4o-mini")).
- options: PromptAlignmentOptions
  - instructions: Array of instruction strings the output is evaluated against.
  - scale?: Maximum score value. Optional, defaults to 1.
measure() Parameters
- input: The original prompt or query sent to the LLM.
- output: The LLM response to evaluate.
Returns
- score: Alignment score from 0 to scale (default 0-1).
- info:
  - reason: A detailed explanation of the score.
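The shapes used on this page can be summarized in TypeScript as follows. This is a sketch inferred from the examples here, not the package’s authoritative type definitions; the PromptAlignmentResult name in particular is hypothetical.
// Sketch of the shapes as used on this page (inferred, not authoritative).
interface PromptAlignmentOptions {
  instructions: string[]; // instructions the output is checked against
  scale?: number; // maximum score, defaults to 1
}
// Hypothetical name for the shape returned by measure().
interface PromptAlignmentResult {
  score: number; // 0 to scale
  info: {
    reason: string; // explanation of the score
  };
}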
Scoring Details
The metric evaluates instruction alignment through:
- Applicability assessment for each instruction
- Strict compliance evaluation for applicable instructions
- Detailed reasoning for all verdicts
- Proportional scoring based on applicable instructions
Instruction Verdicts
Each instruction receives one of three verdicts:
- “yes”: Instruction is applicable and completely followed
- “no”: Instruction is applicable but not followed or only partially followed
- “n/a”: Instruction is not applicable to the given context
Scoring Process
1. Evaluates instruction applicability:
   - Determines whether each instruction applies to the context
   - Marks irrelevant instructions as “n/a”
   - Considers domain-specific requirements
2. Assesses compliance for applicable instructions:
   - Evaluates each applicable instruction independently
   - Requires complete compliance for a “yes” verdict
   - Documents specific reasons for all verdicts
3. Calculates the alignment score:
   - Counts followed instructions (“yes” verdicts)
   - Divides by the total number of applicable instructions (excluding “n/a”)
   - Scales the result to the configured range
Final score: (followed_instructions / applicable_instructions) * scale
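The calculation can be sketched in a few lines. The Verdict type and helper below are illustrative only, not Mastra’s internal implementation; the handling of zero applicable instructions is an assumption.
// Illustrative sketch of the formula above; not Mastra's internal code.
type Verdict = "yes" | "no" | "n/a";

function alignmentScore(verdicts: Verdict[], scale = 1): number {
  // "n/a" verdicts are excluded from the denominator.
  const applicable = verdicts.filter((v) => v !== "n/a");
  if (applicable.length === 0) return 0; // assumption: no applicable instructions yields 0
  const followed = applicable.filter((v) => v === "yes").length;
  return (followed / applicable.length) * scale;
}

alignmentScore(["yes", "no", "n/a"]); // 0.5 (one of two applicable instructions followed)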
Important Considerations
- Empty outputs:
  - All formatting instructions are considered applicable
  - Each is marked “no”, since an empty output cannot satisfy them (see the example after this list)
- Domain-specific instructions:
  - Always applicable when they concern the queried domain
  - Marked “no” if not followed, not “n/a”
- “n/a” verdicts:
  - Reserved for instructions about a completely different domain
  - Do not affect the final score calculation
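As a concrete illustration of the empty-output rule, reusing the metric configured in Basic Usage above (the expected score is inferred from the rules listed here, not captured output):
const emptyResult = await metric.measure("describe the weather", "");
// All three formatting instructions are applicable but cannot be satisfied
// by an empty output, so the expected score is 0.
console.log(emptyResult.score);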
Score interpretation (0 to scale, default 0-1)
- 1.0: All applicable instructions followed perfectly
- 0.7-0.9: Most applicable instructions followed
- 0.4-0.6: Mixed compliance with applicable instructions
- 0.1-0.3: Limited compliance with applicable instructions
- 0.0: No applicable instructions followed
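If you need to bucket scores programmatically, a small helper along these lines (hypothetical, assuming scale: 1) mirrors the bands above:
// Hypothetical helper mapping a score to the bands above (assumes scale: 1).
function interpretAlignment(score: number): string {
  if (score >= 1.0) return "All applicable instructions followed perfectly";
  if (score >= 0.7) return "Most applicable instructions followed";
  if (score >= 0.4) return "Mixed compliance with applicable instructions";
  if (score >= 0.1) return "Limited compliance with applicable instructions";
  return "No applicable instructions followed";
}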
Example with Analysis
import { openai } from "@ai-sdk/openai";
import { PromptAlignmentMetric } from "@mastra/evals/llm";
// Configure the model for evaluation
const model = openai("gpt-4o-mini");
const metric = new PromptAlignmentMetric(model, {
  instructions: [
    "Use bullet points for each item",
    "Include exactly three examples",
    "End each point with a semicolon",
  ],
  scale: 1,
});
const result = await metric.measure(
  "List three fruits",
  `• Apple is red and sweet;
• Banana is yellow and curved;
• Orange is citrus and round.`,
);
// Example output:
// {
// score: 1.0,
// info: {
// reason: "The score is 1.0 because all instructions were followed exactly:
// bullet points were used, exactly three examples were provided, and
// each point ends with a semicolon."
// }
// }
const result2 = await metric.measure(
  "List three fruits",
  `1. Apple
2. Banana
3. Orange and Grape`,
);
// Example output:
// {
// score: 0.33,
// info: {
// reason: "The score is 0.33 because: numbered lists were used instead of bullet points,
// no semicolons were used, and four fruits were listed instead of exactly three."
// }
// }