Prompt Alignment Scorer

The createPromptAlignmentScorerLLM() function creates a scorer that evaluates how well agent responses align with user prompts across multiple dimensions: intent understanding, requirement fulfillment, response completeness, and format appropriateness.

Parameters

model:

MastraModelConfig
The language model to use for evaluating prompt-response alignment

options:

PromptAlignmentOptions
Configuration options for the scorer

.run() Returns

score:

number
Multi-dimensional alignment score between 0 and the configured scale (default 0-1)

reason:

string
Human-readable explanation of the prompt alignment evaluation with detailed breakdown

.run() returns a result in the following shape:

{
  runId: string,
  score: number,
  reason: string,
  analyzeStepResult: {
    intentAlignment: {
      score: number,
      primaryIntent: string,
      isAddressed: boolean,
      reasoning: string
    },
    requirementsFulfillment: {
      requirements: Array<{
        requirement: string,
        isFulfilled: boolean,
        reasoning: string
      }>,
      overallScore: number
    },
    completeness: {
      score: number,
      missingElements: string[],
      reasoning: string
    },
    responseAppropriateness: {
      score: number,
      formatAlignment: boolean,
      toneAlignment: boolean,
      reasoning: string
    },
    overallAssessment: string
  }
}
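
The per-dimension breakdown in analyzeStepResult can be inspected programmatically. A minimal sketch, assuming a scorer and a testRun payload like those shown in the examples below:

const result = await scorer.run(testRun);

console.log(result.score); // e.g. 0.85
console.log(result.analyzeStepResult.intentAlignment.primaryIntent);

// Collect any requirements the response did not fulfill
const unmet = result.analyzeStepResult.requirementsFulfillment.requirements
  .filter((r) => !r.isFulfilled)
  .map((r) => r.requirement);
console.log(unmet);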

Scoring Details

Scorer configuration

You can customize the Prompt Alignment Scorer by adjusting the scale parameter and evaluation mode to fit your scoring needs.

const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-4o-mini",
  options: {
    scale: 10, // Score from 0-10 instead of 0-1
    evaluationMode: "both", // 'user', 'system', or 'both' (default)
  },
});

Multi-Dimensional Analysis

Prompt Alignment evaluates responses across four key dimensions with weighted scoring that adapts based on the evaluation mode:

User Mode ('user')

Evaluates alignment with user prompts only:

  1. Intent Alignment (40% weight) - Whether the response addresses the user's core request
  2. Requirements Fulfillment (30% weight) - If all user requirements are met
  3. Completeness (20% weight) - Whether the response is comprehensive for user needs
  4. Response Appropriateness (10% weight) - If format and tone match user expectations

System Mode ('system')

Evaluates compliance with system guidelines only:

  1. Intent Alignment (35% weight) - Whether the response follows system behavioral guidelines
  2. Requirements Fulfillment (35% weight) - If all system constraints are respected
  3. Completeness (15% weight) - Whether the response adheres to all system rules
  4. Response Appropriateness (15% weight) - If format and tone match system specifications

Both Mode ('both' - default)

Combines evaluation of both user and system alignment:

  • User alignment: 70% of final score (using user mode weights)
  • System compliance: 30% of final score (using system mode weights)
  • Provides balanced assessment of user satisfaction and system adherence

Scoring Formula

User Mode:

Weighted Score = (intent_score × 0.4) + (requirements_score × 0.3) +
(completeness_score × 0.2) + (appropriateness_score × 0.1)
Final Score = Weighted Score × scale

System Mode:

Weighted Score = (intent_score × 0.35) + (requirements_score × 0.35) +
(completeness_score × 0.15) + (appropriateness_score × 0.15)
Final Score = Weighted Score × scale

Both Mode (default):

User Score = (user dimensions with user weights)
System Score = (system dimensions with system weights)
Weighted Score = (User Score × 0.7) + (System Score × 0.3)
Final Score = Weighted Score × scale
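
As a worked example with hypothetical per-dimension scores (plain arithmetic following the formulas above, not library code):

// Hypothetical per-dimension scores for one response (each on a 0-1 scale)
const userDims = { intent: 0.9, requirements: 0.8, completeness: 0.7, appropriateness: 1.0 };
const systemDims = { intent: 0.6, requirements: 0.9, completeness: 0.8, appropriateness: 0.7 };

// User mode weights: 0.4 / 0.3 / 0.2 / 0.1
const userScore =
  userDims.intent * 0.4 +
  userDims.requirements * 0.3 +
  userDims.completeness * 0.2 +
  userDims.appropriateness * 0.1; // 0.36 + 0.24 + 0.14 + 0.10 = 0.84

// System mode weights: 0.35 / 0.35 / 0.15 / 0.15
const systemScore =
  systemDims.intent * 0.35 +
  systemDims.requirements * 0.35 +
  systemDims.completeness * 0.15 +
  systemDims.appropriateness * 0.15; // 0.21 + 0.315 + 0.12 + 0.105 = 0.75

// Both mode: 70% user, 30% system, multiplied by scale (1 by default)
const finalScore = (userScore * 0.7 + systemScore * 0.3) * 1; // 0.588 + 0.225 = 0.813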

Weight Distribution Rationale:

  • User Mode: Prioritizes intent (40%) and requirements (30%) for user satisfaction
  • System Mode: Balances behavioral compliance (35%) and constraints (35%) equally
  • Both Mode: 70/30 split ensures user needs are primary while maintaining system compliance

Score Interpretation

  • 0.9-1.0 = Excellent alignment across all dimensions
  • 0.8-0.9 = Very good alignment with minor gaps
  • 0.7-0.8 = Good alignment but missing some requirements or completeness
  • 0.6-0.7 = Moderate alignment with noticeable gaps
  • 0.4-0.6 = Poor alignment with significant issues
  • 0.0-0.4 = Very poor alignment, response doesn't address the prompt effectively

When to Use Each Mode

User Mode ('user') - Use when:

  • Evaluating customer service responses for user satisfaction
  • Testing content generation quality from user perspective
  • Measuring how well responses address user questions
  • Focusing purely on request fulfillment without system constraints

System Mode ('system') - Use when:

  • Auditing AI safety and compliance with behavioral guidelines
  • Ensuring agents follow brand voice and tone requirements
  • Validating adherence to content policies and constraints
  • Testing system-level behavioral consistency

Both Mode ('both' - default, recommended) - Use when:

  • Comprehensive evaluation of overall AI agent performance
  • Balancing user satisfaction with system compliance
  • Production monitoring where both user and system requirements matter
  • Holistic assessment of prompt-response alignment

Common Use Cases

Code Generation Evaluation

Ideal for evaluating:

  • Programming task completion
  • Code quality and completeness
  • Adherence to coding requirements
  • Format specifications (functions, classes, etc.)
// Example: API endpoint creation
const codePrompt =
  "Create a REST API endpoint with authentication and rate limiting";
// Scorer evaluates: intent (API creation), requirements (auth + rate limiting),
// completeness (full implementation), format (code structure)
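
A sketch of running such an evaluation end to end, assuming a scorer configured as shown earlier; the assistant's response text here is a hypothetical stand-in:

const result = await scorer.run({
  input: [{ role: "user", content: codePrompt }],
  output: {
    role: "assistant",
    text: "// Express route with JWT authentication and a rate-limiting middleware ...",
  },
});

// requirementsFulfillment tracks "authentication" and "rate limiting"
// individually, each with its own isFulfilled flag and reasoning
console.log(result.analyzeStepResult.requirementsFulfillment);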

Instruction Following Assessment

Perfect for:

  • Task completion verification
  • Multi-step instruction adherence
  • Requirement compliance checking
  • Educational content evaluation
// Example: Multi-requirement task
const taskPrompt =
  "Write a Python class with initialization, validation, error handling, and documentation";
// Scorer tracks each requirement individually and provides detailed breakdown

Content Format Validation

Useful for:

  • Format specification compliance
  • Style guide adherence
  • Output structure verification
  • Response appropriateness checking
// Example: Structured output
const formatPrompt =
  "Explain the differences between let and const in JavaScript using bullet points";
// Scorer evaluates content accuracy AND format compliance

Agent Response Quality

Measure how well your AI agents follow user instructions:

const agent = new Agent({
  name: "CodingAssistant",
  instructions:
    "You are a helpful coding assistant. Always provide working code examples.",
  model: "openai/gpt-4o",
});

// Evaluate comprehensive alignment (default)
const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-4o-mini",
  options: { evaluationMode: "both" }, // Evaluates both user intent and system guidelines
});

// Evaluate just user satisfaction
const userScorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-4o-mini",
  options: { evaluationMode: "user" }, // Focus only on user request fulfillment
});

// Evaluate system compliance
const systemScorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-4o-mini",
  options: { evaluationMode: "system" }, // Check adherence to system instructions
});

const result = await scorer.run(agentRun);

Prompt Engineering Optimization

Test different prompts to improve alignment:

const prompts = [
  "Write a function to calculate factorial",
  "Create a Python function that calculates factorial with error handling for negative inputs",
  "Implement a factorial calculator in Python with: input validation, error handling, and docstring",
];

// Compare alignment scores to find the best prompt
// (createTestRun is an illustrative helper that pairs a prompt with a response)
for (const prompt of prompts) {
  const result = await scorer.run(createTestRun(prompt, response));
  console.log(`Prompt alignment: ${result.score}`);
}

Multi-Agent System Evaluation

Compare different agents or models:

const agents = [agent1, agent2, agent3];
const testPrompts = [...]; // Array of test prompts

for (const agent of agents) {
  let totalScore = 0;
  for (const prompt of testPrompts) {
    const response = await agent.run(prompt);
    const evaluation = await scorer.run({ input: prompt, output: response });
    totalScore += evaluation.score;
  }
  console.log(`${agent.name} average alignment: ${totalScore / testPrompts.length}`);
}

Examples

Basic Configuration

import { createPromptAlignmentScorerLLM } from "@mastra/evals";

const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-4o",
});

// Evaluate a code generation task
const result = await scorer.run({
  input: [
    {
      role: "user",
      content:
        "Write a Python function to calculate factorial with error handling",
    },
  ],
  output: {
    role: "assistant",
    text: `def factorial(n):
    if n < 0:
        raise ValueError("Factorial not defined for negative numbers")
    if n == 0:
        return 1
    return n * factorial(n-1)`,
  },
});
// Result: { score: 0.95, reason: "Excellent alignment - function addresses intent, includes error handling..." }

Custom Configuration Examples

// Configure scale and evaluation mode
const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-4o",
  options: {
    scale: 10, // Score from 0-10 instead of 0-1
    evaluationMode: "both", // 'user', 'system', or 'both' (default)
  },
});

// User-only evaluation - focus on user satisfaction
const userScorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-4o",
  options: { evaluationMode: "user" },
});

// System-only evaluation - focus on compliance
const systemScorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-4o",
  options: { evaluationMode: "system" },
});

const result = await scorer.run(testRun);
// Result: { score: 8.5, reason: "Score: 8.5 out of 10 - Good alignment with both user intent and system guidelines..." }

Format-Specific Evaluation

// Evaluate bullet point formatting
const result = await scorer.run({
  input: [
    {
      role: "user",
      content: "List the benefits of TypeScript in bullet points",
    },
  ],
  output: {
    role: "assistant",
    text: "TypeScript provides static typing, better IDE support, and enhanced code reliability.",
  },
});
// Result: Lower appropriateness score due to format mismatch (paragraph vs bullet points)

Excellent alignment example

In this example, the response fully addresses the user's prompt with all requirements met.

src/example-excellent-prompt-alignment.ts
import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/llm";

const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-4o-mini",
});

const inputMessages = [
  {
    role: "user",
    content:
      "Write a Python function to calculate factorial with error handling for negative numbers",
  },
];

const outputMessage = {
  text: `def factorial(n):
    """Calculate factorial of a number."""
    if n < 0:
        raise ValueError("Factorial not defined for negative numbers")
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)`,
};

const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
});

console.log(result);

Excellent alignment output

The output receives a high score because it perfectly addresses the intent, fulfills all requirements, and uses appropriate format.

{
  score: 0.95,
  reason: 'The score is 0.95 because the response perfectly addresses the primary intent of creating a factorial function and fulfills all requirements including Python implementation, error handling for negative numbers, and proper documentation. The code format is appropriate and the implementation is complete.'
}

Partial alignment example

In this example, the response addresses the core intent but misses some requirements or has format issues.

src/example-partial-prompt-alignment.ts
import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/llm";

const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-4o-mini",
});

const inputMessages = [
  {
    role: "user",
    content: "List the benefits of TypeScript in bullet points",
  },
];

const outputMessage = {
  text: "TypeScript provides static typing, better IDE support, and enhanced code reliability through compile-time error checking.",
};

const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
});

console.log(result);

Partial alignment output

The output receives a lower score because while the content is accurate, it doesn't follow the requested format (bullet points).

{
  score: 0.75,
  reason: 'The score is 0.75 because the response addresses the intent of explaining TypeScript benefits and provides accurate information, but fails to use the requested bullet point format, resulting in lower appropriateness scoring.'
}

Poor alignment example

In this example, the response fails to address the user's specific requirements.

src/example-poor-prompt-alignment.ts
import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/llm";

const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-4o-mini",
});

const inputMessages = [
  {
    role: "user",
    content:
      "Write a Python class with initialization, validation, error handling, and documentation",
  },
];

const outputMessage = {
  text: `class Example:
    def __init__(self, value):
        self.value = value`,
};

const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
});

console.log(result);

Poor alignment output

The output receives a low score because it only partially fulfills the requirements, missing validation, error handling, and documentation.

{
  score: 0.35,
  reason: 'The score is 0.35 because while the response addresses the basic intent of creating a Python class with initialization, it fails to include validation, error handling, and documentation as specifically requested, resulting in incomplete requirement fulfillment.'
}

Evaluation Mode Examples

User Mode - Focus on User Prompt Only

Evaluates how well the response addresses the user's request, ignoring system instructions:

src/example-user-mode.ts
const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-4o-mini",
  options: { evaluationMode: "user" },
});

const result = await scorer.run({
  input: {
    inputMessages: [
      {
        role: "user",
        content: "Explain recursion with an example",
      },
    ],
    systemMessages: [
      {
        role: "system",
        content: "Always provide code examples in Python",
      },
    ],
  },
  output: {
    text: "Recursion is when a function calls itself. For example: factorial(5) = 5 * factorial(4)",
  },
});
// Scores high for addressing user request, even without Python code

System Mode - Focus on System Guidelines Only

Evaluates compliance with system behavioral guidelines and constraints:

src/example-system-mode.ts
const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-4o-mini",
  options: { evaluationMode: "system" },
});

const result = await scorer.run({
  input: {
    systemMessages: [
      {
        role: "system",
        content:
          "You are a helpful assistant. Always be polite, concise, and provide examples.",
      },
    ],
    inputMessages: [
      {
        role: "user",
        content: "What is machine learning?",
      },
    ],
  },
  output: {
    text: "Machine learning is a subset of AI where computers learn from data. For example, spam filters learn to identify unwanted emails by analyzing patterns in previously marked spam.",
  },
});
// Evaluates politeness, conciseness, and example provision

Both Mode - Combined Evaluation (Default)

Evaluates both user intent fulfillment and system compliance with weighted scoring (70% user, 30% system):

src/example-both-mode.ts
const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-4o-mini",
  options: { evaluationMode: "both" }, // This is the default
});

const result = await scorer.run({
  input: {
    systemMessages: [
      {
        role: "system",
        content:
          "Always provide code examples when explaining programming concepts",
      },
    ],
    inputMessages: [
      {
        role: "user",
        content: "Explain how to reverse a string",
      },
    ],
  },
  output: {
    text: `To reverse a string, you can iterate through it backwards. Here's an example in Python:

def reverse_string(s):
    return s[::-1]

# Usage: reverse_string("hello") returns "olleh"`,
  },
});
// High score for both addressing the user's request AND following system guidelines

Comparison with Other Scorers

Aspect        Prompt Alignment                        Answer Relevancy                Faithfulness
Focus         Multi-dimensional prompt adherence      Query-response relevance        Context groundedness
Evaluation    Intent, requirements, completeness,     Semantic similarity to query    Factual consistency with context
              format
Use Case      General prompt following                Information retrieval           RAG/context-based systems
Dimensions    4 weighted dimensions                   Single relevance dimension      Single faithfulness dimension