Prompt Alignment Evaluation
We just released a new evals API called Scorers. It has a more ergonomic API, stores more metadata for error analysis, and is more flexible about the data structures it can evaluate. Migrating is fairly simple, and we will continue to support the existing Evals API.
Use `PromptAlignmentMetric` to evaluate how well a response follows a given set of instructions. The metric accepts a `query` and a `response`, and returns a score and an `info` object containing a reason and instruction-level alignment details.
Installation
```bash
npm install @mastra/evals
```
Perfect alignment example
In this example, the response follows all applicable instructions from the input. The score reflects full adherence, with no instructions missed or ignored.
```typescript
import { openai } from "@ai-sdk/openai";
import { PromptAlignmentMetric } from "@mastra/evals/llm";

// Define the instructions the response is expected to follow
const metric = new PromptAlignmentMetric(openai("gpt-4o-mini"), {
  instructions: [
    "Use complete sentences",
    "Include temperature in Celsius",
    "Mention wind conditions",
    "State precipitation chance"
  ]
});

const query = "What is the weather like?";
const response =
  "The temperature is 22 degrees Celsius with moderate winds from the northwest. There is a 30% chance of rain.";

const result = await metric.measure(query, response);

console.log(result);
```
Perfect alignment output
The response receives a high score because it fully satisfies all applicable instructions. The `info` field confirms that each instruction was followed without omissions.
```typescript
{
  score: 1,
  info: {
    reason: 'The score is 1 because the output fully aligns with all applicable instructions, providing a comprehensive weather report that includes temperature, wind conditions, and chance of precipitation, all presented in complete sentences.',
    scoreDetails: {
      totalInstructions: 4,
      applicableInstructions: 4,
      followedInstructions: 4,
      naInstructions: 0
    }
  }
}
```
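If you run this check in CI, you can assert on the result directly. A minimal sketch using Vitest (the choice of test runner is an assumption, not part of `@mastra/evals`), reusing `metric`, `query`, and `response` from the example above:

```typescript
import { describe, it, expect } from "vitest";

describe("weather response prompt alignment", () => {
  it("follows every applicable instruction", async () => {
    const result = await metric.measure(query, response);

    expect(result.score).toBe(1);
    expect(result.info.scoreDetails.followedInstructions).toBe(
      result.info.scoreDetails.applicableInstructions
    );
  });
});
```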
Mixed alignment example
In this example, the response follows some of the instructions but omits others. The score reflects partial adherence, with a mix of followed and missed instructions.
```typescript
import { openai } from "@ai-sdk/openai";
import { PromptAlignmentMetric } from "@mastra/evals/llm";

const metric = new PromptAlignmentMetric(openai("gpt-4o-mini"), {
  instructions: [
    "Use bullet points",
    "Include prices in USD",
    "Show stock status",
    "Add product descriptions"
  ]
});

const query = "List the available products";
const response = "• Coffee - $4.99 (In Stock)\n• Tea - $3.99\n• Water - $1.99 (Out of Stock)";

const result = await metric.measure(query, response);

console.log(result);
```
Mixed alignment output
The response receives a mixed score because it follows some of the instructions while missing others. The `info` field includes a breakdown of followed and missed instructions along with a justification for the score.
```typescript
{
  score: 0.75,
  info: {
    reason: 'The score is 0.75 because the output meets most of the instructions by using bullet points, including prices in USD, and showing stock status. However, it does not fully align with the instruction to provide product descriptions, which affects the overall score.',
    scoreDetails: {
      totalInstructions: 4,
      applicableInstructions: 4,
      followedInstructions: 3,
      naInstructions: 0
    }
  }
}
```
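The breakdown in `scoreDetails` also makes the score easy to sanity-check. Assuming the metric scores proportionally as `followedInstructions / applicableInstructions` times `scale` (an assumption that matches the outputs shown in this guide, not a documented formula), 3 of 4 applicable instructions gives 0.75:

```typescript
// Reconstruct the score from the breakdown above.
// Assumption: score = followedInstructions / applicableInstructions * scale.
const { followedInstructions, applicableInstructions } = result.info.scoreDetails;
const scale = 1; // the default

console.log((followedInstructions / applicableInstructions) * scale); // 3 / 4 * 1 = 0.75
```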
Non-applicable alignment example
In this example, the response does not address any of the instructions because they are unrelated to the query. The score reflects that the instructions were not applicable in this context.
```typescript
import { openai } from "@ai-sdk/openai";
import { PromptAlignmentMetric } from "@mastra/evals/llm";

const metric = new PromptAlignmentMetric(openai("gpt-4o-mini"), {
  instructions: [
    "Show account balance",
    "List recent transactions",
    "Display payment history"
  ]
});

const query = "What is the weather like?";
const response = "It is sunny and warm outside.";

const result = await metric.measure(query, response);

console.log(result);
```
Non-applicable alignment output
The response receives a score indicating that none of the instructions could be applied. The `info` field notes that the response and query are unrelated to the instructions, resulting in no measurable alignment.
```typescript
{
  score: 0,
  info: {
    reason: 'The score is 0 because the output does not follow any applicable instructions related to the context of a weather query, as the instructions provided are irrelevant to the input.',
    scoreDetails: {
      totalInstructions: 3,
      applicableInstructions: 0,
      followedInstructions: 0,
      naInstructions: 3
    }
  }
}
```
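A fully non-applicable instruction set yields the same score of 0 as a response that misses every applicable instruction, so it can be worth checking `applicableInstructions` before treating a 0 as a failure. A minimal sketch, reusing `result` from the example above:

```typescript
const { applicableInstructions, followedInstructions } = result.info.scoreDetails;

if (applicableInstructions === 0) {
  // None of the instructions applied to this query; a 0 here is not a failure signal.
  console.log("Instructions not applicable:", result.info.reason);
} else if (followedInstructions < applicableInstructions) {
  console.warn("Missed applicable instructions:", result.info.reason);
}
```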
Metric configuration
You can create a `PromptAlignmentMetric` instance by providing an `instructions` array that defines the expected behaviors or requirements. You can also configure optional parameters such as `scale`, which sets the maximum possible score (default: `1`):

```typescript
const metric = new PromptAlignmentMetric(openai("gpt-4o-mini"), {
  instructions: ["Use complete sentences"],
  scale: 1
});
```
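For example, raising `scale` widens the score range while keeping the same proportions. A sketch assuming the proportional scoring the example outputs above suggest:

```typescript
// With scale: 5, scores range from 0 to 5 instead of 0 to 1.
const scaledMetric = new PromptAlignmentMetric(openai("gpt-4o-mini"), {
  instructions: [
    "Use bullet points",
    "Include prices in USD",
    "Show stock status",
    "Add product descriptions"
  ],
  scale: 5
});

const scaled = await scaledMetric.measure(
  "List the available products",
  "• Coffee - $4.99 (In Stock)\n• Tea - $3.99\n• Water - $1.99 (Out of Stock)"
);

// The mixed example followed 3 of 4 applicable instructions (0.75);
// with scale: 5 that would be 3.75.
console.log(scaled.score);
```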
See PromptAlignmentMetric for a full list of configuration options.
Understanding the results
`PromptAlignmentMetric` returns a result in the following shape:
```typescript
{
  score: number,
  info: {
    reason: string,
    scoreDetails: {
      totalInstructions: number,
      applicableInstructions: number,
      followedInstructions: number,
      naInstructions: number
    }
  }
}
```
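If you want type-safe access to these fields in your own code, you can mirror the shape with an interface. This is a hand-written sketch; check whether `@mastra/evals` exports its own result types before duplicating them:

```typescript
interface PromptAlignmentResult {
  score: number;
  info: {
    reason: string;
    scoreDetails: {
      totalInstructions: number;
      applicableInstructions: number;
      followedInstructions: number;
      naInstructions: number;
    };
  };
}

// e.g. narrow the result of metric.measure() before logging details
function logDetails({ info }: PromptAlignmentResult): void {
  console.log(info.reason, info.scoreDetails);
}
```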
Prompt alignment score
A prompt alignment score between 0 and 1 (or 0 and `scale`, if configured); a helper that buckets these bands is sketched after this list:
- 1.0: Perfect alignment – all applicable instructions followed.
- 0.5–0.8: Mixed alignment – some instructions missed.
- 0.1–0.4: Poor alignment – most instructions not followed.
- 0.0: No alignment – no applicable instructions were followed, or none of the instructions applied to the query (as in the non-applicable example above).
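These bands translate directly into a small helper for dashboards or alerts. A minimal sketch; the labels and cutoffs are this guide's reading of the bands above, not part of the `@mastra/evals` API:

```typescript
// Map a prompt alignment score (0–1 scale) to a coarse label.
// Boundary values between bands are assigned to the higher band.
function alignmentLabel(score: number): "perfect" | "mixed" | "poor" | "none" {
  if (score >= 1.0) return "perfect";
  if (score >= 0.5) return "mixed";
  if (score > 0.0) return "poor";
  return "none"; // no instructions followed, or none applicable
}

console.log(alignmentLabel(0.75)); // "mixed"
```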
Prompt alignment info
An explanation for the score, with details including:
- Adherence to each instruction.
- Degree of applicability to the query.
- Classification of followed, missed, and non-applicable instructions.
- Reasoning for the alignment score.