Prompt Alignment Scorer

The createPromptAlignmentScorerLLM() function creates a scorer that evaluates how well agent responses align with user prompts across multiple dimensions: intent understanding, requirement fulfillment, response completeness, and format appropriateness.

Parameters

model: MastraLanguageModel
The language model to use for evaluating prompt-response alignment.

options: PromptAlignmentOptions
Configuration options for the scorer.
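
Based on the options used in the examples on this page, the options object can be sketched roughly as follows; this is an inferred shape for illustration, not the package's published type:

// Inferred shape of PromptAlignmentOptions; only the fields used in this
// page's examples are listed, so treat this as a sketch, not the full type.
type PromptAlignmentOptions = {
  scale?: number; // maximum score; defaults to 1
  evaluationMode?: "user" | "system" | "both"; // defaults to "both"
};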

.run() Returns

score: number
Multi-dimensional alignment score between 0 and the configured scale (default 0-1).

reason: string
Human-readable explanation of the prompt alignment evaluation, with a detailed per-dimension breakdown.
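
Put together, the returned object can be sketched as follows (inferred from the fields above; the actual result may carry additional metadata):

// Inferred result shape, for illustration only.
type PromptAlignmentResult = {
  score: number; // 0 to scale
  reason: string; // explanation with per-dimension breakdown
};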

Scoring Details

Multi-Dimensional Analysis

Prompt Alignment evaluates responses across four key dimensions with weighted scoring that adapts based on the evaluation mode:

User Mode ('user')

Evaluates alignment with user prompts only:

  1. Intent Alignment (40% weight) - Whether the response addresses the user's core request
  2. Requirements Fulfillment (30% weight) - Whether all user requirements are met
  3. Completeness (20% weight) - Whether the response is comprehensive for user needs
  4. Response Appropriateness (10% weight) - Whether format and tone match user expectations

System Mode ('system')

Evaluates compliance with system guidelines only:

  1. Intent Alignment (35% weight) - Whether the response follows system behavioral guidelines
  2. Requirements Fulfillment (35% weight) - Whether all system constraints are respected
  3. Completeness (15% weight) - Whether the response adheres to all system rules
  4. Response Appropriateness (15% weight) - Whether format and tone match system specifications

Both Mode ('both' - default)

Combines evaluation of both user and system alignment:

  • User alignment: 70% of final score (using user mode weights)
  • System compliance: 30% of final score (using system mode weights)
  • Provides balanced assessment of user satisfaction and system adherence

Scoring Formula

User Mode:

Weighted Score = (intent_score × 0.4) + (requirements_score × 0.3) +
(completeness_score × 0.2) + (appropriateness_score × 0.1)
Final Score = Weighted Score × scale

System Mode:

Weighted Score = (intent_score × 0.35) + (requirements_score × 0.35) +
(completeness_score × 0.15) + (appropriateness_score × 0.15)
Final Score = Weighted Score × scale

Both Mode (default):

User Score = (user dimensions with user weights)
System Score = (system dimensions with system weights)
Weighted Score = (User Score × 0.7) + (System Score × 0.3)
Final Score = Weighted Score × scale
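
As a concrete illustration, here is a minimal TypeScript sketch of how these weights combine; the per-dimension scores and function names are hypothetical, not the library's internals:

// Hypothetical per-dimension scores in [0, 1]; names are illustrative only.
type DimensionScores = {
  intent: number;
  requirements: number;
  completeness: number;
  appropriateness: number;
};

const USER_WEIGHTS = { intent: 0.4, requirements: 0.3, completeness: 0.2, appropriateness: 0.1 };
const SYSTEM_WEIGHTS = { intent: 0.35, requirements: 0.35, completeness: 0.15, appropriateness: 0.15 };

function weightedScore(s: DimensionScores, w: typeof USER_WEIGHTS): number {
  return s.intent * w.intent + s.requirements * w.requirements +
    s.completeness * w.completeness + s.appropriateness * w.appropriateness;
}

// 'both' mode: 70% user alignment + 30% system compliance, then scaled.
function bothModeScore(user: DimensionScores, system: DimensionScores, scale = 1): number {
  return (0.7 * weightedScore(user, USER_WEIGHTS) + 0.3 * weightedScore(system, SYSTEM_WEIGHTS)) * scale;
}

For example, a user-mode weighted score of 0.9 combined with a system-mode weighted score of 0.7 yields 0.9 × 0.7 + 0.7 × 0.3 = 0.84 in both mode.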

Weight Distribution Rationale:

  • User Mode: Prioritizes intent (40%) and requirements (30%) for user satisfaction
  • System Mode: Balances behavioral compliance (35%) and constraints (35%) equally
  • Both Mode: 70/30 split ensures user needs are primary while maintaining system compliance

Score Interpretation

  • 0.9-1.0 = Excellent alignment across all dimensions
  • 0.8-0.9 = Very good alignment with minor gaps
  • 0.7-0.8 = Good alignment but missing some requirements or completeness
  • 0.6-0.7 = Moderate alignment with noticeable gaps
  • 0.4-0.6 = Poor alignment with significant issues
  • 0.0-0.4 = Very poor alignment, response doesn't address the prompt effectively
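
If you need to bucket scores programmatically, here is a simple sketch that mirrors the bands above; the labels and thresholds come from this list, nothing more:

// Maps a normalized 0-1 alignment score to the qualitative bands listed above.
// For a custom scale, normalize first: interpretAlignment(result.score / scale).
function interpretAlignment(score: number): string {
  if (score >= 0.9) return "Excellent alignment";
  if (score >= 0.8) return "Very good alignment";
  if (score >= 0.7) return "Good alignment";
  if (score >= 0.6) return "Moderate alignment";
  if (score >= 0.4) return "Poor alignment";
  return "Very poor alignment";
}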

Comparison with Other Scorers

| Aspect     | Prompt Alignment                           | Answer Relevancy             | Faithfulness                     |
| ---------- | ------------------------------------------ | ---------------------------- | -------------------------------- |
| Focus      | Multi-dimensional prompt adherence         | Query-response relevance     | Context groundedness             |
| Evaluation | Intent, requirements, completeness, format | Semantic similarity to query | Factual consistency with context |
| Use Case   | General prompt following                   | Information retrieval        | RAG/context-based systems        |
| Dimensions | 4 weighted dimensions                      | Single relevance dimension   | Single faithfulness dimension    |

When to Use Each Mode

User Mode ('user') - Use when:

  • Evaluating customer service responses for user satisfaction
  • Testing content generation quality from user perspective
  • Measuring how well responses address user questions
  • Focusing purely on request fulfillment without system constraints

System Mode ('system') - Use when:

  • Auditing AI safety and compliance with behavioral guidelines
  • Ensuring agents follow brand voice and tone requirements
  • Validating adherence to content policies and constraints
  • Testing system-level behavioral consistency

Both Mode ('both' - default, recommended) - Use when:

  • Comprehensive evaluation of overall AI agent performance
  • Balancing user satisfaction with system compliance
  • Production monitoring where both user and system requirements matter
  • Holistic assessment of prompt-response alignment

Usage Examples

Basic Configuration

import { openai } from "@ai-sdk/openai";
import { createPromptAlignmentScorerLLM } from "@mastra/evals";

const scorer = createPromptAlignmentScorerLLM({
  model: openai("gpt-4o"),
});

// Evaluate a code generation task
const result = await scorer.run({
  input: [
    {
      role: "user",
      content:
        "Write a Python function to calculate factorial with error handling",
    },
  ],
  output: {
    role: "assistant",
    text: `def factorial(n):
    if n < 0:
        raise ValueError("Factorial not defined for negative numbers")
    if n == 0:
        return 1
    return n * factorial(n-1)`,
  },
});
// Result: { score: 0.95, reason: "Excellent alignment - function addresses intent, includes error handling..." }

Custom Configuration Examples

// Configure scale and evaluation mode
const scorer = createPromptAlignmentScorerLLM({
  model: openai("gpt-4o"),
  options: {
    scale: 10, // Score from 0-10 instead of 0-1
    evaluationMode: "both", // 'user', 'system', or 'both' (default)
  },
});

// User-only evaluation - focus on user satisfaction
const userScorer = createPromptAlignmentScorerLLM({
  model: openai("gpt-4o"),
  options: { evaluationMode: "user" },
});

// System-only evaluation - focus on compliance
const systemScorer = createPromptAlignmentScorerLLM({
  model: openai("gpt-4o"),
  options: { evaluationMode: "system" },
});

const result = await scorer.run(testRun);
// Result: { score: 8.5, reason: "Score: 8.5 out of 10 - Good alignment with both user intent and system guidelines..." }

Format-Specific Evaluation

// Evaluate bullet point formatting
const result = await scorer.run({
  input: [
    {
      role: "user",
      content: "List the benefits of TypeScript in bullet points",
    },
  ],
  output: {
    role: "assistant",
    text: "TypeScript provides static typing, better IDE support, and enhanced code reliability.",
  },
});
// Result: Lower appropriateness score due to format mismatch (paragraph vs bullet points)

Usage Patterns

Code Generation Evaluation

Ideal for evaluating:

  • Programming task completion
  • Code quality and completeness
  • Adherence to coding requirements
  • Format specifications (functions, classes, etc.)

// Example: API endpoint creation
const codePrompt =
  "Create a REST API endpoint with authentication and rate limiting";
// Scorer evaluates: intent (API creation), requirements (auth + rate limiting),
// completeness (full implementation), format (code structure)

Instruction Following Assessment

Perfect for:

  • Task completion verification
  • Multi-step instruction adherence
  • Requirement compliance checking
  • Educational content evaluation

// Example: Multi-requirement task
const taskPrompt =
  "Write a Python class with initialization, validation, error handling, and documentation";
// Scorer tracks each requirement individually and provides detailed breakdown

Content Format Validation

Useful for:

  • Format specification compliance
  • Style guide adherence
  • Output structure verification
  • Response appropriateness checking

// Example: Structured output
const formatPrompt =
  "Explain the differences between let and const in JavaScript using bullet points";
// Scorer evaluates content accuracy AND format compliance

Common Use Cases

1. Agent Response Quality

Measure how well your AI agents follow user instructions:

import { Agent } from "@mastra/core/agent";

const agent = new Agent({
  name: "CodingAssistant",
  instructions:
    "You are a helpful coding assistant. Always provide working code examples.",
  model: openai("gpt-4o"),
});

// Evaluate comprehensive alignment (default)
const scorer = createPromptAlignmentScorerLLM({
  model: openai("gpt-4o-mini"),
  options: { evaluationMode: "both" }, // Evaluates both user intent and system guidelines
});

// Evaluate just user satisfaction
const userScorer = createPromptAlignmentScorerLLM({
  model: openai("gpt-4o-mini"),
  options: { evaluationMode: "user" }, // Focus only on user request fulfillment
});

// Evaluate system compliance
const systemScorer = createPromptAlignmentScorerLLM({
  model: openai("gpt-4o-mini"),
  options: { evaluationMode: "system" }, // Check adherence to system instructions
});

const result = await scorer.run(agentRun);

2. Prompt Engineering Optimization

Test different prompts to improve alignment:

const prompts = [
  "Write a function to calculate factorial",
  "Create a Python function that calculates factorial with error handling for negative inputs",
  "Implement a factorial calculator in Python with: input validation, error handling, and docstring",
];

// Compare alignment scores to find the best prompt.
// createTestRun is a user-defined helper that pairs each prompt with its response.
for (const prompt of prompts) {
  const result = await scorer.run(createTestRun(prompt, response));
  console.log(`Prompt alignment: ${result.score}`);
}

3. Multi-Agent System Evaluation

Compare different agents or models:

const agents = [agent1, agent2, agent3];
const testPrompts = [...]; // Array of test prompts

for (const agent of agents) {
  let totalScore = 0;
  for (const prompt of testPrompts) {
    const response = await agent.run(prompt);
    const evaluation = await scorer.run({ input: prompt, output: response });
    totalScore += evaluation.score;
  }
  console.log(`${agent.name} average alignment: ${totalScore / testPrompts.length}`);
}

Error Handling

The scorer handles various edge cases gracefully:

// Missing user prompt
try {
  await scorer.run({ input: [], output: response });
} catch (error) {
  // Error: "Both user prompt and agent response are required for prompt alignment scoring"
}

// Empty response
const result = await scorer.run({
  input: [userMessage],
  output: { role: "assistant", text: "" },
});
// Returns low scores with detailed reasoning about incompleteness