Prompt Alignment Evaluation
We just released a new evals API called Scorers. It has a more ergonomic API, stores more metadata for error analysis, and is more flexible about the data structures it can evaluate. Migrating is fairly simple, and we will continue to support the existing Evals API.
Use `PromptAlignmentMetric` to evaluate how well a response follows a given set of instructions. The metric accepts a `query` and a `response`, and returns a score and an `info` object containing a reason and instruction-level alignment details.
Installation
```bash
npm install @mastra/evals
```
Perfect alignment example
In this example, the response follows all applicable instructions from the input. The score reflects full adherence, with no instructions missed or ignored.
```typescript
import { openai } from "@ai-sdk/openai";
import { PromptAlignmentMetric } from "@mastra/evals/llm";

// Define the instructions the response is expected to follow
const metric = new PromptAlignmentMetric(openai("gpt-4o-mini"), {
  instructions: [
    "Use complete sentences",
    "Include temperature in Celsius",
    "Mention wind conditions",
    "State precipitation chance"
  ]
});

const query = "What is the weather like?";
const response =
  "The temperature is 22 degrees Celsius with moderate winds from the northwest. There is a 30% chance of rain.";

const result = await metric.measure(query, response);

console.log(result);
```
Perfect alignment output
The response receives a high score because it fully satisfies all applicable instructions. The `info` field confirms that each instruction was followed without omissions.
```typescript
{
  score: 1,
  info: {
    reason: 'The score is 1 because the output fully aligns with all applicable instructions, providing a comprehensive weather report that includes temperature, wind conditions, and chance of precipitation, all presented in complete sentences.',
    scoreDetails: {
      totalInstructions: 4,
      applicableInstructions: 4,
      followedInstructions: 4,
      naInstructions: 0
    }
  }
}
```
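If you run this check in CI, you can assert on the result directly. A minimal sketch using Vitest (the choice of test runner is an assumption, not part of `@mastra/evals`), reusing `metric`, `query`, and `response` from the example above:

```typescript
import { describe, it, expect } from "vitest";

describe("weather response prompt alignment", () => {
  it("follows every applicable instruction", async () => {
    const result = await metric.measure(query, response);

    expect(result.score).toBe(1);
    expect(result.info.scoreDetails.followedInstructions).toBe(
      result.info.scoreDetails.applicableInstructions
    );
  });
});
```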
Mixed alignment example
In this example, the response follows some of the instructions but omits others. The score reflects partial adherence, with a mix of followed and missed instructions.
```typescript
import { openai } from "@ai-sdk/openai";
import { PromptAlignmentMetric } from "@mastra/evals/llm";

const metric = new PromptAlignmentMetric(openai("gpt-4o-mini"), {
  instructions: [
    "Use bullet points",
    "Include prices in USD",
    "Show stock status",
    "Add product descriptions"
  ]
});

const query = "List the available products";
const response = "• Coffee - $4.99 (In Stock)\n• Tea - $3.99\n• Water - $1.99 (Out of Stock)";

const result = await metric.measure(query, response);

console.log(result);
```
Mixed alignment output
The response receives a mixed score because it follows some of the instructions while missing others. The `info` field includes a breakdown of followed and missed instructions along with a justification for the score.
```typescript
{
  score: 0.75,
  info: {
    reason: 'The score is 0.75 because the output meets most of the instructions by using bullet points, including prices in USD, and showing stock status. However, it does not fully align with the instruction to provide product descriptions, which affects the overall score.',
    scoreDetails: {
      totalInstructions: 4,
      applicableInstructions: 4,
      followedInstructions: 3,
      naInstructions: 0
    }
  }
}
```
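The breakdown in `scoreDetails` also makes the score easy to sanity-check. Assuming the metric scores proportionally as `followedInstructions / applicableInstructions` times `scale` (an assumption that matches the outputs shown in this guide, not a documented formula), 3 of 4 applicable instructions gives 0.75:

```typescript
// Reconstruct the score from the breakdown above.
// Assumption: score = followedInstructions / applicableInstructions * scale.
const { followedInstructions, applicableInstructions } = result.info.scoreDetails;
const scale = 1; // the default

console.log((followedInstructions / applicableInstructions) * scale); // 3 / 4 * 1 = 0.75
```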
Non-applicable alignment example
In this example, the response does not address any of the instructions because they are unrelated to the query. The score reflects that the instructions were not applicable in this context.
```typescript
import { openai } from "@ai-sdk/openai";
import { PromptAlignmentMetric } from "@mastra/evals/llm";

const metric = new PromptAlignmentMetric(openai("gpt-4o-mini"), {
  instructions: [
    "Show account balance",
    "List recent transactions",
    "Display payment history"
  ]
});

const query = "What is the weather like?";
const response = "It is sunny and warm outside.";

const result = await metric.measure(query, response);

console.log(result);
```
Non-applicable alignment output
The response receives a score indicating that none of the instructions could be applied. The `info` field notes that the response and query are unrelated to the instructions, resulting in no measurable alignment.
```typescript
{
  score: 0,
  info: {
    reason: 'The score is 0 because the output does not follow any applicable instructions related to the context of a weather query, as the instructions provided are irrelevant to the input.',
    scoreDetails: {
      totalInstructions: 3,
      applicableInstructions: 0,
      followedInstructions: 0,
      naInstructions: 3
    }
  }
}
```
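A fully non-applicable instruction set yields the same score of 0 as a response that misses every applicable instruction, so it can be worth checking `applicableInstructions` before treating a 0 as a failure. A minimal sketch, reusing `result` from the example above:

```typescript
const { applicableInstructions, followedInstructions } = result.info.scoreDetails;

if (applicableInstructions === 0) {
  // None of the instructions applied to this query; a 0 here is not a failure signal.
  console.log("Instructions not applicable:", result.info.reason);
} else if (followedInstructions < applicableInstructions) {
  console.warn("Missed applicable instructions:", result.info.reason);
}
```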
Metric configuration
You can create a `PromptAlignmentMetric` instance by providing an `instructions` array that defines the expected behaviors or requirements. You can also configure optional parameters such as `scale`, which sets the maximum possible score (default: `1`):

```typescript
const metric = new PromptAlignmentMetric(openai("gpt-4o-mini"), {
  instructions: ["Use complete sentences"],
  scale: 1
});
```
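For example, raising `scale` widens the score range while keeping the same proportions. A sketch assuming the proportional scoring the example outputs above suggest:

```typescript
// With scale: 5, scores range from 0 to 5 instead of 0 to 1.
const scaledMetric = new PromptAlignmentMetric(openai("gpt-4o-mini"), {
  instructions: [
    "Use bullet points",
    "Include prices in USD",
    "Show stock status",
    "Add product descriptions"
  ],
  scale: 5
});

const scaled = await scaledMetric.measure(
  "List the available products",
  "• Coffee - $4.99 (In Stock)\n• Tea - $3.99\n• Water - $1.99 (Out of Stock)"
);

// The mixed example followed 3 of 4 applicable instructions (0.75);
// with scale: 5 that would be 3.75.
console.log(scaled.score);
```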
See PromptAlignmentMetric for a full list of configuration options.
Understanding the results
`PromptAlignmentMetric` returns a result in the following shape:
```typescript
{
  score: number,
  info: {
    reason: string,
    scoreDetails: {
      totalInstructions: number,
      applicableInstructions: number,
      followedInstructions: number,
      naInstructions: number
    }
  }
}
```
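If you want type-safe access to these fields in your own code, you can mirror the shape with an interface. This is a hand-written sketch; check whether `@mastra/evals` exports its own result types before duplicating them:

```typescript
interface PromptAlignmentResult {
  score: number;
  info: {
    reason: string;
    scoreDetails: {
      totalInstructions: number;
      applicableInstructions: number;
      followedInstructions: number;
      naInstructions: number;
    };
  };
}

// e.g. narrow the result of metric.measure() before logging details
function logDetails({ info }: PromptAlignmentResult): void {
  console.log(info.reason, info.scoreDetails);
}
```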
Prompt alignment score
A prompt alignment score between 0 and 1 (or 0 and `scale`, if configured); a helper that buckets these bands is sketched after this list:
- 1.0: Perfect alignment – all applicable instructions followed.
- 0.5–0.8: Mixed alignment – some instructions missed.
- 0.1–0.4: Poor alignment – most instructions not followed.
- 0.0: No alignment – no applicable instructions were followed, or none of the instructions applied to the query (as in the non-applicable example above).
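These bands translate directly into a small helper for dashboards or alerts. A minimal sketch; the labels and cutoffs are this guide's reading of the bands above, not part of the `@mastra/evals` API:

```typescript
// Map a prompt alignment score (0–1 scale) to a coarse label.
// Boundary values between bands are assigned to the higher band.
function alignmentLabel(score: number): "perfect" | "mixed" | "poor" | "none" {
  if (score >= 1.0) return "perfect";
  if (score >= 0.5) return "mixed";
  if (score > 0.0) return "poor";
  return "none"; // no instructions followed, or none applicable
}

console.log(alignmentLabel(0.75)); // "mixed"
```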
Prompt alignment info
An explanation for the score, with details including:
- Adherence to each instruction.
- Degree of applicability to the query.
- Classification of followed, missed, and non-applicable instructions.
- Reasoning for the alignment score.