# Prompt Alignment Scorer

Use `createPromptAlignmentScorerLLM` to score how well responses align with user prompts across intent, requirements, completeness, and format.

## Installation

```bash
npm install @mastra/evals
```

For complete API documentation and configuration options, see `createPromptAlignmentScorerLLM`.
## Excellent alignment example

In this example, the response fully addresses the user's prompt with all requirements met.
```typescript
import { openai } from "@ai-sdk/openai";
import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/llm";

const scorer = createPromptAlignmentScorerLLM({
  model: openai("gpt-4o-mini")
});

const inputMessages = [{
  role: 'user',
  content: "Write a Python function to calculate factorial with error handling for negative numbers"
}];

const outputMessage = {
  text: `def factorial(n):
    """Calculate factorial of a number."""
    if n < 0:
        raise ValueError("Factorial not defined for negative numbers")
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)`
};

const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
});

console.log(result);
```
## Excellent alignment output

The output receives a high score because it fully addresses the intent, fulfills all requirements, and uses an appropriate format.

```typescript
{
  score: 0.95,
  reason: 'The score is 0.95 because the response perfectly addresses the primary intent of creating a factorial function and fulfills all requirements including Python implementation, error handling for negative numbers, and proper documentation. The code format is appropriate and the implementation is complete.'
}
```
## Partial alignment example

In this example, the response addresses the core intent but misses some requirements or has format issues.
```typescript
import { openai } from "@ai-sdk/openai";
import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/llm";

const scorer = createPromptAlignmentScorerLLM({
  model: openai("gpt-4o-mini")
});

const inputMessages = [{
  role: 'user',
  content: "List the benefits of TypeScript in bullet points"
}];

const outputMessage = {
  text: "TypeScript provides static typing, better IDE support, and enhanced code reliability through compile-time error checking."
};

const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
});

console.log(result);
```
## Partial alignment output

The output receives a lower score because, while the content is accurate, it doesn't follow the requested format (bullet points).

```typescript
{
  score: 0.75,
  reason: 'The score is 0.75 because the response addresses the intent of explaining TypeScript benefits and provides accurate information, but fails to use the requested bullet point format, resulting in lower appropriateness scoring.'
}
```
## Poor alignment example

In this example, the response fails to address the user's specific requirements.
```typescript
import { openai } from "@ai-sdk/openai";
import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/llm";

const scorer = createPromptAlignmentScorerLLM({
  model: openai("gpt-4o-mini")
});

const inputMessages = [{
  role: 'user',
  content: "Write a Python class with initialization, validation, error handling, and documentation"
}];

const outputMessage = {
  text: `class Example:
    def __init__(self, value):
        self.value = value`
};

const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
});

console.log(result);
```
## Poor alignment output

The output receives a low score because it only partially fulfills the requirements, missing validation, error handling, and documentation.

```typescript
{
  score: 0.35,
  reason: 'The score is 0.35 because while the response addresses the basic intent of creating a Python class with initialization, it fails to include validation, error handling, and documentation as specifically requested, resulting in incomplete requirement fulfillment.'
}
```
## Scorer configuration

You can customize the Prompt Alignment Scorer by adjusting the `scale` parameter and evaluation mode to fit your scoring needs.

```typescript
import { openai } from "@ai-sdk/openai";
import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/llm";

const scorer = createPromptAlignmentScorerLLM({
  model: openai("gpt-4o-mini"),
  options: {
    scale: 10, // Score from 0-10 instead of 0-1
    evaluationMode: 'both' // 'user', 'system', or 'both' (default)
  }
});
```
## Evaluation Mode Examples

### User Mode - Focus on User Prompt Only

Evaluates how well the response addresses the user's request, ignoring system instructions:
```typescript
import { openai } from "@ai-sdk/openai";
import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/llm";

const scorer = createPromptAlignmentScorerLLM({
  model: openai("gpt-4o-mini"),
  options: { evaluationMode: 'user' }
});

const result = await scorer.run({
  input: {
    inputMessages: [{
      role: 'user',
      content: "Explain recursion with an example"
    }],
    systemMessages: [{
      role: 'system',
      content: "Always provide code examples in Python"
    }]
  },
  output: {
    text: "Recursion is when a function calls itself. For example: factorial(5) = 5 * factorial(4)"
  }
});

// Scores high for addressing the user's request, even without Python code
```
### System Mode - Focus on System Guidelines Only

Evaluates compliance with system behavioral guidelines and constraints:
```typescript
import { openai } from "@ai-sdk/openai";
import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/llm";

const scorer = createPromptAlignmentScorerLLM({
  model: openai("gpt-4o-mini"),
  options: { evaluationMode: 'system' }
});

const result = await scorer.run({
  input: {
    systemMessages: [{
      role: 'system',
      content: "You are a helpful assistant. Always be polite, concise, and provide examples."
    }],
    inputMessages: [{
      role: 'user',
      content: "What is machine learning?"
    }]
  },
  output: {
    text: "Machine learning is a subset of AI where computers learn from data. For example, spam filters learn to identify unwanted emails by analyzing patterns in previously marked spam."
  }
});

// Evaluates politeness, conciseness, and example provision
```
### Both Mode - Combined Evaluation (Default)

Evaluates both user intent fulfillment and system compliance with weighted scoring (70% user, 30% system):
```typescript
import { openai } from "@ai-sdk/openai";
import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/llm";

const scorer = createPromptAlignmentScorerLLM({
  model: openai("gpt-4o-mini"),
  options: { evaluationMode: 'both' } // This is the default
});

const result = await scorer.run({
  input: {
    systemMessages: [{
      role: 'system',
      content: "Always provide code examples when explaining programming concepts"
    }],
    inputMessages: [{
      role: 'user',
      content: "Explain how to reverse a string"
    }]
  },
  output: {
    text: `To reverse a string, you can iterate through it backwards. Here's an example in Python:

def reverse_string(s):
    return s[::-1]

# Usage: reverse_string("hello") returns "olleh"`
  }
});

// High score for both addressing the user's request AND following system guidelines
```
See `createPromptAlignmentScorerLLM` for a full list of configuration options.
## Understanding the results

`.run()` returns a result in the following shape:

```typescript
{
  runId: string,
  score: number,
  reason: string,
  analyzeStepResult: {
    intentAlignment: {
      score: number,
      primaryIntent: string,
      isAddressed: boolean,
      reasoning: string
    },
    requirementsFulfillment: {
      requirements: Array<{
        requirement: string,
        isFulfilled: boolean,
        reasoning: string
      }>,
      overallScore: number
    },
    completeness: {
      score: number,
      missingElements: string[],
      reasoning: string
    },
    responseAppropriateness: {
      score: number,
      formatAlignment: boolean,
      toneAlignment: boolean,
      reasoning: string
    },
    overallAssessment: string
  }
}
```
### score

A multi-dimensional alignment score between 0 and `scale` (default 0-1):

- **0.9-1.0**: Excellent alignment across all dimensions
- **0.8-0.9**: Very good alignment with minor gaps
- **0.7-0.8**: Good alignment but missing some requirements
- **0.6-0.7**: Moderate alignment with noticeable gaps
- **0.4-0.6**: Poor alignment with significant issues
- **0.0-0.4**: Very poor alignment; the response doesn't address the prompt effectively
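If you need to bucket scores programmatically (for dashboards or CI thresholds), the bands above can be encoded as a small helper. This is an illustrative sketch, not part of the library; the function name and labels are assumptions, and boundary values are assigned to the higher band:

```typescript
// Maps a normalized (0-1) score to the alignment bands documented above.
// Hypothetical helper; divide by your configured `scale` first if it isn't 1.
function alignmentBand(score: number): string {
  if (score >= 0.9) return "excellent";
  if (score >= 0.8) return "very good";
  if (score >= 0.7) return "good";
  if (score >= 0.6) return "moderate";
  if (score >= 0.4) return "poor";
  return "very poor";
}
```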
### Scoring dimensions

The scorer evaluates four weighted dimensions that adapt based on the evaluation mode:

**User Mode weights:**

- **Intent Alignment (40%)**: Whether the response addresses the user's core request
- **Requirements Fulfillment (30%)**: Whether all user requirements are met
- **Completeness (20%)**: Whether the response is comprehensive for the user's needs
- **Response Appropriateness (10%)**: Whether the format and tone match user expectations

**System Mode weights:**

- **Intent Alignment (35%)**: Whether the response follows system behavioral guidelines
- **Requirements Fulfillment (35%)**: Whether all system constraints are respected
- **Completeness (15%)**: Whether the response adheres to all system rules
- **Response Appropriateness (15%)**: Whether the format and tone match system specifications

**Both Mode (default):**

- Combines user alignment (70% weight) with system compliance (30% weight)
- Provides a balanced evaluation of both user satisfaction and system adherence
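The weighting scheme above can be sketched in code to make the arithmetic concrete. The weights come from this page, but the interfaces and function names below are illustrative assumptions, not the library's actual internals:

```typescript
// Per-dimension scores, each normalized to 0-1 (illustrative shape).
interface DimensionScores {
  intentAlignment: number;
  requirementsFulfillment: number;
  completeness: number;
  responseAppropriateness: number;
}

// Weights as documented for user mode (40/30/20/10) and system mode (35/35/15/15).
const USER_WEIGHTS = { intent: 0.4, requirements: 0.3, completeness: 0.2, appropriateness: 0.1 };
const SYSTEM_WEIGHTS = { intent: 0.35, requirements: 0.35, completeness: 0.15, appropriateness: 0.15 };

function weightedScore(d: DimensionScores, w: typeof USER_WEIGHTS): number {
  return (
    d.intentAlignment * w.intent +
    d.requirementsFulfillment * w.requirements +
    d.completeness * w.completeness +
    d.responseAppropriateness * w.appropriateness
  );
}

// In 'both' mode, user alignment (70%) is blended with system compliance (30%),
// then multiplied by the configured scale (default 1).
function combinedScore(user: DimensionScores, system: DimensionScores, scale = 1): number {
  return (0.7 * weightedScore(user, USER_WEIGHTS) + 0.3 * weightedScore(system, SYSTEM_WEIGHTS)) * scale;
}
```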
### runId

The unique identifier for this scorer run.

### reason

A detailed explanation of the score, including a breakdown by dimension and the specific issues identified.

### analyzeStepResult

The detailed analysis results showing scores and reasoning for each dimension.
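For example, the per-requirement entries in `analyzeStepResult.requirementsFulfillment` can be post-processed to list what the scorer flagged as missing. A minimal sketch, assuming the field names from the result shape above; the helper itself is hypothetical:

```typescript
// Matches the requirements entries documented in analyzeStepResult above.
interface RequirementCheck {
  requirement: string;
  isFulfilled: boolean;
  reasoning: string;
}

// Collects the requirements the scorer marked as unfulfilled.
function unfulfilledRequirements(requirements: RequirementCheck[]): string[] {
  return requirements
    .filter((r) => !r.isFulfilled)
    .map((r) => r.requirement);
}
```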