PromptAlignmentMetric
The PromptAlignmentMetric class evaluates how strictly an LLM’s output follows a set of given prompt instructions. It uses a judge-based system to verify that each instruction is followed exactly and provides detailed reasoning for any deviations.
Basic Usage
import { openai } from "@ai-sdk/openai";
import { PromptAlignmentMetric } from "@mastra/evals/llm";

// Configure the model for evaluation
const model = openai("gpt-4o-mini");

const instructions = [
  "Start sentences with capital letters",
  "End each sentence with a period",
  "Use present tense",
];

const metric = new PromptAlignmentMetric(model, {
  instructions,
  scale: 1,
});

const result = await metric.measure(
  "describe the weather",
  "The sun is shining. Clouds float in the sky. A gentle breeze blows.",
);

console.log(result.score); // Alignment score from 0-1
console.log(result.info.reason); // Explanation of the score
Constructor Parameters
- model (LanguageModel): Configuration for the model used to evaluate instruction alignment
- options (PromptAlignmentOptions): Configuration options for the metric
PromptAlignmentOptions
- instructions (string[]): Array of instructions that the output should follow
- scale? (number, default 1): Maximum score value
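The scale option only changes the range in which the score is reported. For example, to report alignment as a percentage, a minimal sketch (reusing the model from Basic Usage; the instruction text here is illustrative):

// Same evaluation, reported on a 0-100 range instead of 0-1.
const percentMetric = new PromptAlignmentMetric(model, {
  instructions: ["Respond in formal English"],
  scale: 100,
});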
measure() Parameters
- input (string): The original prompt or query
- output (string): The LLM's response to evaluate
Returns
- score (number): Alignment score (0 to scale, default 0-1)
- info (object): Object containing detailed metrics about instruction compliance
  - reason (string): Detailed explanation of the score and instruction compliance
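In a test or evaluation pipeline you would typically read both fields, for example to flag outputs that fall below a chosen threshold. A minimal sketch (the assertAligned helper and the 0.8 threshold are illustrative, not part of the library; metric is the instance from Basic Usage):

// Hypothetical guard: surface the judge's reasoning when alignment is too low.
async function assertAligned(input: string, output: string, minScore = 0.8) {
  const result = await metric.measure(input, output);
  if (result.score < minScore) {
    throw new Error(`Alignment ${result.score} below ${minScore}: ${result.info.reason}`);
  }
  return result;
}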
Scoring Details
The metric evaluates instruction alignment through:
- Applicability assessment for each instruction
- Strict compliance evaluation for applicable instructions
- Detailed reasoning for all verdicts
- Proportional scoring based on applicable instructions
Instruction Verdicts
Each instruction receives one of three verdicts:
- “yes”: Instruction is applicable and completely followed
- “no”: Instruction is applicable but not followed or only partially followed
- “n/a”: Instruction is not applicable to the given context
Scoring Process
1. Evaluates instruction applicability:
   - Determines if each instruction applies to the context
   - Marks irrelevant instructions as “n/a”
   - Considers domain-specific requirements
2. Assesses compliance for applicable instructions:
   - Evaluates each applicable instruction independently
   - Requires complete compliance for a “yes” verdict
   - Documents specific reasons for all verdicts
3. Calculates alignment score:
   - Counts followed instructions (“yes” verdicts)
   - Divides by total applicable instructions (excluding “n/a”)
   - Scales to the configured range
Final score: (followed_instructions / applicable_instructions) * scale
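To make the arithmetic concrete, the following sketch applies the same formula to a hypothetical set of verdicts (the Verdict type and the values are illustrative, not the library's internal representation):

// Hypothetical verdicts for three instructions: one followed, one violated, one inapplicable.
type Verdict = "yes" | "no" | "n/a";
const verdicts: Verdict[] = ["yes", "no", "n/a"];
const scale = 1;

const applicable = verdicts.filter((v) => v !== "n/a").length; // 2
const followed = verdicts.filter((v) => v === "yes").length; // 1
const score = applicable > 0 ? (followed / applicable) * scale : 0; // (1 / 2) * 1 = 0.5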
Important Considerations
- Empty outputs (see the sketch after this list):
  - All formatting instructions are considered applicable
  - Marked as “no” since an empty output cannot satisfy them
- Domain-specific instructions:
  - Always applicable if about the queried domain
  - Marked as “no” if not followed, not “n/a”
- “n/a” verdicts:
  - Only used for completely different domains
  - Do not affect the final score calculation
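For instance, given the three formatting instructions from Basic Usage, an empty output should score 0: every instruction still applies, but none can be followed (a sketch; the exact reason text will vary):

// Empty output: all formatting instructions remain applicable but none are satisfied.
const emptyResult = await metric.measure("describe the weather", "");
console.log(emptyResult.score); // 0
console.log(emptyResult.info.reason); // Explains why each applicable instruction got a "no" verdict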
Score Interpretation
(0 to scale, default 0-1)
- 1.0: All applicable instructions followed perfectly
- 0.7-0.9: Most applicable instructions followed
- 0.4-0.6: Mixed compliance with applicable instructions
- 0.1-0.3: Limited compliance with applicable instructions
- 0.0: No applicable instructions followed
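If you want to turn the numeric score into one of these bands programmatically, a small hypothetical helper (assuming the default scale of 1) might look like this:

// Hypothetical mapping from a 0-1 alignment score to the bands listed above.
function interpretAlignment(score: number): string {
  if (score === 1.0) return "all applicable instructions followed";
  if (score >= 0.7) return "most applicable instructions followed";
  if (score >= 0.4) return "mixed compliance";
  if (score >= 0.1) return "limited compliance";
  return "no applicable instructions followed";
}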
Example with Analysis
import { openai } from "@ai-sdk/openai";
import { PromptAlignmentMetric } from "@mastra/evals/llm";
// Configure the model for evaluation
const model = openai("gpt-4o-mini");
const metric = new PromptAlignmentMetric(model, {
  instructions: [
    "Use bullet points for each item",
    "Include exactly three examples",
    "End each point with a semicolon",
  ],
  scale: 1,
});

const result = await metric.measure(
  "List three fruits",
  "• Apple is red and sweet;\n• Banana is yellow and curved;\n• Orange is citrus and round.",
);
// Example output:
// {
//   score: 1.0,
//   info: {
//     reason: "The score is 1.0 because all instructions were followed exactly:
//       bullet points were used, exactly three examples were provided, and
//       each point ends with a semicolon."
//   }
// }
const result2 = await metric.measure(
  "List three fruits",
  "1. Apple\n2. Banana\n3. Orange and Grape",
);
// Example output:
// {
//   score: 0,
//   info: {
//     reason: "The score is 0 because: numbered lists were used instead of bullet points,
//       no semicolons were used, and four fruits were listed instead of exactly three."
//   }
// }