
PromptAlignmentMetric

The PromptAlignmentMetric class evaluates how strictly an LLM’s output follows a given set of prompt instructions. It uses a judge-based system to verify that each instruction is followed exactly and provides detailed reasoning for any deviations.

Basic Usage

import { openai } from "@ai-sdk/openai";
import { PromptAlignmentMetric } from "@mastra/evals/llm";
 
// Configure the model for evaluation
const model = openai("gpt-4o-mini");
 
const instructions = [
  "Start sentences with capital letters",
  "End each sentence with a period",
  "Use present tense",
];
 
const metric = new PromptAlignmentMetric(model, {
  instructions,
  scale: 1,
});
 
const result = await metric.measure(
  "describe the weather",
  "The sun is shining. Clouds float in the sky. A gentle breeze blows.",
);
 
console.log(result.score); // Alignment score from 0-1
console.log(result.info.reason); // Explanation of the score

Constructor Parameters

  • model (LanguageModel): Configuration for the model used to evaluate instruction alignment
  • options (PromptAlignmentOptions): Configuration options for the metric

PromptAlignmentOptions

  • instructions (string[]): Array of instructions that the output should follow
  • scale (number, optional, default 1): Maximum score value
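
Because the final score is multiplied by scale, raising it changes the range of result.score. A minimal sketch, reusing the model from the Basic Usage example (the instruction text here is purely illustrative):

// Scores now range from 0 to 100 instead of the default 0 to 1
const percentMetric = new PromptAlignmentMetric(model, {
  instructions: ["Answer in a single sentence"],
  scale: 100,
});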

measure() Parameters

  • input (string): The original prompt or query
  • output (string): The LLM's response to evaluate

Returns

  • score (number): Alignment score (0 to scale, default 0-1)
  • info (object): Detailed metrics about instruction compliance
    • reason (string): Detailed explanation of the score and instruction compliance
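
The return value can be destructured directly. A short sketch, assuming the metric configured in the Basic Usage example above:

const { score, info } = await metric.measure(
  "describe the weather",                         // input: the original prompt
  "The sun is shining. Clouds float in the sky.", // output: the response being evaluated
);
console.log(score);       // number between 0 and the configured scale
console.log(info.reason); // the judge's explanation behind the score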

Scoring Details

The metric evaluates instruction alignment through:

  • Applicability assessment for each instruction
  • Strict compliance evaluation for applicable instructions
  • Detailed reasoning for all verdicts
  • Proportional scoring based on applicable instructions

Instruction Verdicts

Each instruction receives one of three verdicts:

  • “yes”: Instruction is applicable and completely followed
  • “no”: Instruction is applicable but not followed or only partially followed
  • “n/a”: Instruction is not applicable to the given context
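
Conceptually, each verdict can be modeled as a small record. The shape below is an assumption for illustration only; the metric's internal representation is not part of its public API:

// Hypothetical verdict shape; field names are assumptions, not the library's actual types
type InstructionVerdict = {
  instruction: string;            // the instruction being checked
  verdict: "yes" | "no" | "n/a";  // applicability and compliance outcome
  reason: string;                 // the judge's explanation for this verdict
};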

Scoring Process

  1. Evaluates instruction applicability:

    • Determines if each instruction applies to the context
    • Marks irrelevant instructions as “n/a”
    • Considers domain-specific requirements
  2. Assesses compliance for applicable instructions:

    • Evaluates each applicable instruction independently
    • Requires complete compliance for “yes” verdict
    • Documents specific reasons for all verdicts
  3. Calculates alignment score:

    • Counts followed instructions (“yes” verdicts)
    • Divides by total applicable instructions (excluding “n/a”)
    • Scales to configured range

Final score: (followed_instructions / applicable_instructions) * scale
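
As an illustration, the calculation in the last step could be sketched as follows, reusing the hypothetical InstructionVerdict shape from the Instruction Verdicts section (this is not the library's internal code):

// Illustrative only: applies (followed_instructions / applicable_instructions) * scale
function calculateAlignmentScore(verdicts: InstructionVerdict[], scale = 1): number {
  const applicable = verdicts.filter((v) => v.verdict !== "n/a");
  if (applicable.length === 0) return 0; // assumption: no applicable instructions yields 0
  const followed = applicable.filter((v) => v.verdict === "yes").length;
  return (followed / applicable.length) * scale;
}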

Important Considerations

  • Empty outputs:
    • All formatting instructions are considered applicable
    • Each is marked as “no” because an empty output cannot satisfy it (see the example after this list)
  • Domain-specific instructions:
    • Always applicable if about the queried domain
    • Marked as “no” if not followed, not “n/a”
  • “n/a” verdicts:
    • Only used for completely different domains
    • Do not affect the final score calculation
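
For example, evaluating an empty output against the formatting instructions from the Basic Usage example should yield a score of 0, since every instruction is applicable but unmet (the exact reason text will vary by judge model):

const emptyResult = await metric.measure("describe the weather", "");
// Expected: emptyResult.score === 0
// All three formatting instructions are applicable, yet none can be satisfied by an empty output.
console.log(emptyResult.info.reason);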

Score interpretation

(0 to scale, default 0-1)

  • 1.0: All applicable instructions followed perfectly
  • 0.7-0.9: Most applicable instructions followed
  • 0.4-0.6: Mixed compliance with applicable instructions
  • 0.1-0.3: Limited compliance with applicable instructions
  • 0.0: No applicable instructions followed

Example with Analysis

import { openai } from "@ai-sdk/openai";
import { PromptAlignmentMetric } from "@mastra/evals/llm";
 
// Configure the model for evaluation
const model = openai("gpt-4o-mini");
 
const metric = new PromptAlignmentMetric(model, {
  instructions: [
    "Use bullet points for each item",
    "Include exactly three examples",
    "End each point with a semicolon"
  ],
  scale: 1
});
 
const result = await metric.measure(
  "List three fruits",
  "• Apple is red and sweet;
• Banana is yellow and curved;
• Orange is citrus and round."
);
 
// Example output:
// {
//   score: 1.0,
//   info: {
//     reason: "The score is 1.0 because all instructions were followed exactly:
//           bullet points were used, exactly three examples were provided, and
//           each point ends with a semicolon."
//   }
// }
 
const result2 = await metric.measure(
  "List three fruits",
  "1. Apple
2. Banana
3. Orange and Grape"
);
 
// Example output:
// {
//   score: 0,
//   info: {
//     reason: "The score is 0 because none of the instructions were followed: numbered lists
//           were used instead of bullet points, no semicolons were used, and four fruits were
//           listed instead of exactly three."
//   }
// }