Prompt Alignment Scorer
The createPromptAlignmentScorerLLM() function creates a scorer that evaluates how well agent responses align with user prompts across four dimensions: intent understanding, requirement fulfillment, response completeness, and format appropriateness.
Parameters
- model: The language model used to run the evaluation (for example, openai('gpt-4o')).
- options: Optional configuration:
  - scale: Multiplier applied to the final score (default: 1, producing scores from 0 to 1).
  - evaluationMode: 'user', 'system', or 'both' (default: 'both'); see Multi-Dimensional Analysis below.
.run() Returns
- score: A number from 0 to scale indicating overall alignment quality.
- reason: A human-readable explanation of how the score was determined.
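For orientation, the shapes below are hypothetical TypeScript types inferred from the usage examples later on this page; they are not the library's exported types:

```typescript
// Hypothetical shapes inferred from the examples below, not exported by @mastra/evals.
type PromptAlignmentOptions = {
  scale?: number;                              // multiplier for the final score (default: 1)
  evaluationMode?: 'user' | 'system' | 'both'; // which prompts to evaluate against (default: 'both')
};

type PromptAlignmentResult = {
  score: number;  // 0 to `scale`
  reason: string; // natural-language justification for the score
};
```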
Scoring Details
Multi-Dimensional Analysis
Prompt Alignment evaluates responses across four key dimensions with weighted scoring that adapts based on the evaluation mode:
User Mode ('user')
Evaluates alignment with user prompts only:
- Intent Alignment (40% weight) - Whether the response addresses the user’s core request
- Requirements Fulfillment (30% weight) - If all user requirements are met
- Completeness (20% weight) - Whether the response is comprehensive for user needs
- Response Appropriateness (10% weight) - If format and tone match user expectations
System Mode ('system')
Evaluates compliance with system guidelines only:
- Intent Alignment (35% weight) - Whether the response follows system behavioral guidelines
- Requirements Fulfillment (35% weight) - If all system constraints are respected
- Completeness (15% weight) - Whether the response adheres to all system rules
- Response Appropriateness (15% weight) - If format and tone match system specifications
Both Mode ('both', default)
Combines evaluation of both user and system alignment:
- User alignment: 70% of final score (using user mode weights)
- System compliance: 30% of final score (using system mode weights)
- Provides balanced assessment of user satisfaction and system adherence
Scoring Formula
User Mode:
Weighted Score = (intent_score × 0.4) + (requirements_score × 0.3) +
(completeness_score × 0.2) + (appropriateness_score × 0.1)
Final Score = Weighted Score × scale
System Mode:
Weighted Score = (intent_score × 0.35) + (requirements_score × 0.35) +
(completeness_score × 0.15) + (appropriateness_score × 0.15)
Final Score = Weighted Score × scale
Both Mode (default):
User Score = (user dimensions with user weights)
System Score = (system dimensions with system weights)
Weighted Score = (User Score × 0.7) + (System Score × 0.3)
Final Score = Weighted Score × scale
Weight Distribution Rationale:
- User Mode: Prioritizes intent (40%) and requirements (30%) for user satisfaction
- System Mode: Balances behavioral compliance (35%) and constraints (35%) equally
- Both Mode: 70/30 split ensures user needs are primary while maintaining system compliance
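The weighting arithmetic itself is straightforward. The sketch below reproduces it in TypeScript with illustrative per-dimension scores; in the real scorer those scores come from the judge model, so this models only the math, not the implementation:

```typescript
// Per-dimension scores in [0, 1]; illustrative only — the scorer derives them with an LLM.
type DimensionScores = {
  intent: number;
  requirements: number;
  completeness: number;
  appropriateness: number;
};

const USER_WEIGHTS: DimensionScores = { intent: 0.4, requirements: 0.3, completeness: 0.2, appropriateness: 0.1 };
const SYSTEM_WEIGHTS: DimensionScores = { intent: 0.35, requirements: 0.35, completeness: 0.15, appropriateness: 0.15 };

function weightedScore(scores: DimensionScores, weights: DimensionScores): number {
  return (
    scores.intent * weights.intent +
    scores.requirements * weights.requirements +
    scores.completeness * weights.completeness +
    scores.appropriateness * weights.appropriateness
  );
}

// 'both' mode: 70% user alignment + 30% system compliance, multiplied by `scale`.
function bothModeScore(user: DimensionScores, system: DimensionScores, scale = 1): number {
  return (0.7 * weightedScore(user, USER_WEIGHTS) + 0.3 * weightedScore(system, SYSTEM_WEIGHTS)) * scale;
}
```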
Score Interpretation
- 0.9-1.0 = Excellent alignment across all dimensions
- 0.8-0.9 = Very good alignment with minor gaps
- 0.7-0.8 = Good alignment but missing some requirements or completeness
- 0.6-0.7 = Moderate alignment with noticeable gaps
- 0.4-0.6 = Poor alignment with significant issues
- 0.0-0.4 = Very poor alignment, response doesn’t address the prompt effectively
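As a quick illustration, a hypothetical helper can map a normalized score (default scale of 1) onto the bands above:

```typescript
// Hypothetical helper, assuming the default scale of 1.
function interpretAlignment(score: number): string {
  if (score >= 0.9) return 'Excellent alignment across all dimensions';
  if (score >= 0.8) return 'Very good alignment with minor gaps';
  if (score >= 0.7) return 'Good alignment, missing some requirements or completeness';
  if (score >= 0.6) return 'Moderate alignment with noticeable gaps';
  if (score >= 0.4) return 'Poor alignment with significant issues';
  return 'Very poor alignment';
}
```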
Comparison with Other Scorers
| Aspect | Prompt Alignment | Answer Relevancy | Faithfulness |
|---|---|---|---|
| Focus | Multi-dimensional prompt adherence | Query-response relevance | Context groundedness |
| Evaluation | Intent, requirements, completeness, format | Semantic similarity to query | Factual consistency with context |
| Use Case | General prompt following | Information retrieval | RAG/context-based systems |
| Dimensions | 4 weighted dimensions | Single relevance dimension | Single faithfulness dimension |
When to Use Each Mode
User Mode ('user') - Use when:
- Evaluating customer service responses for user satisfaction
- Testing content generation quality from user perspective
- Measuring how well responses address user questions
- Focusing purely on request fulfillment without system constraints
System Mode ('system') - Use when:
- Auditing AI safety and compliance with behavioral guidelines
- Ensuring agents follow brand voice and tone requirements
- Validating adherence to content policies and constraints
- Testing system-level behavioral consistency
Both Mode ('both') - Use when (default, recommended):
- Comprehensive evaluation of overall AI agent performance
- Balancing user satisfaction with system compliance
- Production monitoring where both user and system requirements matter
- Holistic assessment of prompt-response alignment
Usage Examples
Basic Configuration
import { openai } from '@ai-sdk/openai';
import { createPromptAlignmentScorerLLM } from '@mastra/evals';
const scorer = createPromptAlignmentScorerLLM({
model: openai('gpt-4o'),
});
// Evaluate a code generation task
const result = await scorer.run({
input: [{
role: 'user',
content: 'Write a Python function to calculate factorial with error handling'
}],
output: {
role: 'assistant',
    text: `def factorial(n):
      if n < 0:
        raise ValueError("Factorial not defined for negative numbers")
      if n == 0:
        return 1
      return n * factorial(n-1)`
}
});
// Result: { score: 0.95, reason: "Excellent alignment - function addresses intent, includes error handling..." }
Custom Configuration Examples
// Configure scale and evaluation mode
const scorer = createPromptAlignmentScorerLLM({
model: openai('gpt-4o'),
options: {
scale: 10, // Score from 0-10 instead of 0-1
evaluationMode: 'both' // 'user', 'system', or 'both' (default)
},
});
// User-only evaluation - focus on user satisfaction
const userScorer = createPromptAlignmentScorerLLM({
model: openai('gpt-4o'),
options: { evaluationMode: 'user' }
});
// System-only evaluation - focus on compliance
const systemScorer = createPromptAlignmentScorerLLM({
model: openai('gpt-4o'),
options: { evaluationMode: 'system' }
});
const result = await scorer.run(testRun);
// Result: { score: 8.5, reason: "Score: 8.5 out of 10 - Good alignment with both user intent and system guidelines..." }
Format-Specific Evaluation
// Evaluate bullet point formatting
const result = await scorer.run({
input: [{
role: 'user',
content: 'List the benefits of TypeScript in bullet points'
}],
output: {
role: 'assistant',
text: 'TypeScript provides static typing, better IDE support, and enhanced code reliability.'
}
});
// Result: Lower appropriateness score due to format mismatch (paragraph vs bullet points)
Usage Patterns
Code Generation Evaluation
Ideal for evaluating:
- Programming task completion
- Code quality and completeness
- Adherence to coding requirements
- Format specifications (functions, classes, etc.)
// Example: API endpoint creation
const codePrompt = "Create a REST API endpoint with authentication and rate limiting";
// Scorer evaluates: intent (API creation), requirements (auth + rate limiting),
// completeness (full implementation), format (code structure)
Instruction Following Assessment
Perfect for:
- Task completion verification
- Multi-step instruction adherence
- Requirement compliance checking
- Educational content evaluation
// Example: Multi-requirement task
const taskPrompt = "Write a Python class with initialization, validation, error handling, and documentation";
// Scorer tracks each requirement individually and provides detailed breakdown
Content Format Validation
Useful for:
- Format specification compliance
- Style guide adherence
- Output structure verification
- Response appropriateness checking
// Example: Structured output
const formatPrompt = "Explain the differences between let and const in JavaScript using bullet points";
// Scorer evaluates content accuracy AND format compliance
Common Use Cases
1. Agent Response Quality
Measure how well your AI agents follow user instructions:
import { Agent } from '@mastra/core/agent';

const agent = new Agent({
  name: 'CodingAssistant',
  instructions: 'You are a helpful coding assistant. Always provide working code examples.',
  model: openai('gpt-4o'),
});
// Evaluate comprehensive alignment (default)
const scorer = createPromptAlignmentScorerLLM({
model: openai('gpt-4o-mini'),
options: { evaluationMode: 'both' } // Evaluates both user intent and system guidelines
});
// Evaluate just user satisfaction
const userScorer = createPromptAlignmentScorerLLM({
model: openai('gpt-4o-mini'),
options: { evaluationMode: 'user' } // Focus only on user request fulfillment
});
// Evaluate system compliance
const systemScorer = createPromptAlignmentScorerLLM({
model: openai('gpt-4o-mini'),
options: { evaluationMode: 'system' } // Check adherence to system instructions
});
const result = await scorer.run(agentRun);
2. Prompt Engineering Optimization
Test different prompts to improve alignment:
const prompts = [
'Write a function to calculate factorial',
'Create a Python function that calculates factorial with error handling for negative inputs',
'Implement a factorial calculator in Python with: input validation, error handling, and docstring'
];
// Compare alignment scores to find the best prompt
for (const prompt of prompts) {
const result = await scorer.run(createTestRun(prompt, response));
console.log(`Prompt alignment: ${result.score}`);
}
3. Multi-Agent System Evaluation
Compare different agents or models:
const agents = [agent1, agent2, agent3];
const testPrompts = [...]; // Array of test prompts
for (const agent of agents) {
  let totalScore = 0;
  for (const prompt of testPrompts) {
    const response = await agent.generate(prompt);
    // Shape the run input/output the same way as the examples above
    const evaluation = await scorer.run({
      input: [{ role: 'user', content: prompt }],
      output: { role: 'assistant', text: response.text },
    });
    totalScore += evaluation.score;
  }
  console.log(`${agent.name} average alignment: ${totalScore / testPrompts.length}`);
}
Error Handling
The scorer handles various edge cases gracefully:
// Missing user prompt
try {
await scorer.run({ input: [], output: response });
} catch (error) {
// Error: "Both user prompt and agent response are required for prompt alignment scoring"
}
// Empty response
const result = await scorer.run({
input: [userMessage],
output: { role: 'assistant', text: '' }
});
// Returns low scores with detailed reasoning about incompleteness
Related
- Answer Relevancy Scorer - Evaluates query-response relevance
- Faithfulness Scorer - Measures context groundedness
- Tool Call Accuracy Scorer - Evaluates tool selection
- Custom Scorers - Creating your own evaluation metrics