PromptAlignmentMetric
The PromptAlignmentMetric class evaluates how strictly an LLM’s output follows a set of given prompt instructions. It uses a judge-based system to verify that each instruction is followed exactly and provides detailed reasoning for any deviations.
Basic Usage
import { openai } from "@ai-sdk/openai";
import { PromptAlignmentMetric } from "@mastra/evals/llm";

// Configure the model for evaluation
const model = openai("gpt-4o-mini");

const instructions = [
  "Start sentences with capital letters",
  "End each sentence with a period",
  "Use present tense",
];

const metric = new PromptAlignmentMetric(model, {
  instructions,
  scale: 1,
});

const result = await metric.measure(
  "describe the weather",
  "The sun is shining. Clouds float in the sky. A gentle breeze blows.",
);

console.log(result.score); // Alignment score from 0-1
console.log(result.info.reason); // Explanation of the score
Constructor Parameters
- model (LanguageModel): Configuration for the model used to evaluate instruction alignment
- options (PromptAlignmentOptions): Configuration options for the metric
PromptAlignmentOptions
- instructions (string[]): Array of instructions that the output should follow
- scale? (number, default 1): Maximum score value
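The scale option only changes the range in which the score is reported. For example, to report alignment as a percentage, a minimal sketch (reusing the model from Basic Usage; the instruction text here is illustrative):

// Same evaluation, reported on a 0-100 range instead of 0-1.
const percentMetric = new PromptAlignmentMetric(model, {
  instructions: ["Respond in formal English"],
  scale: 100,
});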
measure() Parameters
- input (string): The original prompt or query
- output (string): The LLM's response to evaluate
Returns
- score (number): Alignment score (0 to scale, default 0-1)
- info (object): Object containing detailed metrics about instruction compliance
  - reason (string): Detailed explanation of the score and instruction compliance
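In a test or evaluation pipeline you would typically read both fields, for example to flag outputs that fall below a chosen threshold. A minimal sketch (the assertAligned helper and the 0.8 threshold are illustrative, not part of the library; metric is the instance from Basic Usage):

// Hypothetical guard: surface the judge's reasoning when alignment is too low.
async function assertAligned(input: string, output: string, minScore = 0.8) {
  const result = await metric.measure(input, output);
  if (result.score < minScore) {
    throw new Error(`Alignment ${result.score} below ${minScore}: ${result.info.reason}`);
  }
  return result;
}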
Scoring Details
The metric evaluates instruction alignment through:
- Applicability assessment for each instruction
- Strict compliance evaluation for applicable instructions
- Detailed reasoning for all verdicts
- Proportional scoring based on applicable instructions
Instruction Verdicts
Each instruction receives one of three verdicts:
- “yes”: Instruction is applicable and completely followed
- “no”: Instruction is applicable but not followed or only partially followed
- “n/a”: Instruction is not applicable to the given context
Scoring Process
1. Evaluates instruction applicability:
   - Determines if each instruction applies to the context
   - Marks irrelevant instructions as “n/a”
   - Considers domain-specific requirements
2. Assesses compliance for applicable instructions:
   - Evaluates each applicable instruction independently
   - Requires complete compliance for a “yes” verdict
   - Documents specific reasons for all verdicts
3. Calculates alignment score:
   - Counts followed instructions (“yes” verdicts)
   - Divides by total applicable instructions (excluding “n/a”)
   - Scales to the configured range
Final score: (followed_instructions / applicable_instructions) * scale
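To make the arithmetic concrete, the following sketch applies the same formula to a hypothetical set of verdicts (the Verdict type and the values are illustrative, not the library's internal representation):

// Hypothetical verdicts for three instructions: one followed, one violated, one inapplicable.
type Verdict = "yes" | "no" | "n/a";
const verdicts: Verdict[] = ["yes", "no", "n/a"];
const scale = 1;

const applicable = verdicts.filter((v) => v !== "n/a").length; // 2
const followed = verdicts.filter((v) => v === "yes").length; // 1
const score = applicable > 0 ? (followed / applicable) * scale : 0; // (1 / 2) * 1 = 0.5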
Important Considerations
- Empty outputs (see the sketch after this list):
  - All formatting instructions are considered applicable
  - Marked as “no” since an empty output cannot satisfy them
- Domain-specific instructions:
  - Always applicable if about the queried domain
  - Marked as “no” if not followed, not “n/a”
- “n/a” verdicts:
  - Only used for completely different domains
  - Do not affect the final score calculation
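For instance, given the three formatting instructions from Basic Usage, an empty output should score 0: every instruction still applies, but none can be followed (a sketch; the exact reason text will vary):

// Empty output: all formatting instructions remain applicable but none are satisfied.
const emptyResult = await metric.measure("describe the weather", "");
console.log(emptyResult.score); // 0
console.log(emptyResult.info.reason); // Explains why each applicable instruction got a "no" verdict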
Score Interpretation
(0 to scale, default 0-1)
- 1.0: All applicable instructions followed perfectly
- 0.7-0.9: Most applicable instructions followed
- 0.4-0.6: Mixed compliance with applicable instructions
- 0.1-0.3: Limited compliance with applicable instructions
- 0.0: No applicable instructions followed
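If you want to turn the numeric score into one of these bands programmatically, a small hypothetical helper (assuming the default scale of 1) might look like this:

// Hypothetical mapping from a 0-1 alignment score to the bands listed above.
function interpretAlignment(score: number): string {
  if (score === 1.0) return "all applicable instructions followed";
  if (score >= 0.7) return "most applicable instructions followed";
  if (score >= 0.4) return "mixed compliance";
  if (score >= 0.1) return "limited compliance";
  return "no applicable instructions followed";
}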
Example with Analysis
import { openai } from "@ai-sdk/openai";
import { PromptAlignmentMetric } from "@mastra/evals/llm";
// Configure the model for evaluation
const model = openai("gpt-4o-mini");
const metric = new PromptAlignmentMetric(model, {
  instructions: [
    "Use bullet points for each item",
    "Include exactly three examples",
    "End each point with a semicolon",
  ],
  scale: 1,
});

const result = await metric.measure(
  "List three fruits",
  "• Apple is red and sweet;\n• Banana is yellow and curved;\n• Orange is citrus and round.",
);
// Example output:
// {
//   score: 1.0,
//   info: {
//     reason: "The score is 1.0 because all instructions were followed exactly:
//       bullet points were used, exactly three examples were provided, and
//       each point ends with a semicolon."
//   }
// }
const result2 = await metric.measure(
  "List three fruits",
  "1. Apple\n2. Banana\n3. Orange and Grape",
);
// Example output:
// {
//   score: 0,
//   info: {
//     reason: "The score is 0 because: numbered lists were used instead of bullet points,
//       no semicolons were used, and four fruits were listed instead of exactly three."
//   }
// }