Answer Similarity Scorer

The createAnswerSimilarityScorer() function creates a scorer that evaluates how similar an agent's output is to a ground truth answer. This scorer is specifically designed for CI/CD testing scenarios where you have expected answers and want to ensure consistency over time.

For usage examples, see the Answer Similarity Examples.

Parameters

model: LanguageModel
The language model used to evaluate semantic similarity between outputs and ground truth.

options: AnswerSimilarityOptions
Configuration options for the scorer.

AnswerSimilarityOptions

requireGroundTruth: boolean = true
Whether to require ground truth for evaluation. If false, missing ground truth returns score 0.

semanticThreshold: number = 0.8
Weight for semantic matches vs exact matches (0-1).

exactMatchBonus: number = 0.2
Additional score bonus for exact matches (0-1).

missingPenalty: number = 0.15
Penalty per missing key concept from ground truth.

contradictionPenalty: number = 1.0
Penalty for contradictory information. High value ensures wrong answers score near 0.

extraInfoPenalty: number = 0.05
Mild penalty for extra information not present in ground truth (capped at 0.2).

scale: number = 1
Score scaling factor.
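
To make the defaults above easier to scan, here is a minimal sketch that writes them out as a typed object. The interface name AnswerSimilarityOptionsSketch and the helper resolveOptions are hypothetical, local to this example, and are not exports of the library; only the field names and default values come from the list above.

```typescript
// Local mirror of the documented option fields (assumption: not the library's exported type).
interface AnswerSimilarityOptionsSketch {
  requireGroundTruth: boolean;
  semanticThreshold: number;
  exactMatchBonus: number;
  missingPenalty: number;
  contradictionPenalty: number;
  extraInfoPenalty: number;
  scale: number;
}

// Defaults copied from the parameter list above.
const DEFAULT_OPTIONS: AnswerSimilarityOptionsSketch = {
  requireGroundTruth: true,
  semanticThreshold: 0.8,
  exactMatchBonus: 0.2,
  missingPenalty: 0.15,
  contradictionPenalty: 1.0,
  extraInfoPenalty: 0.05,
  scale: 1,
};

// Hypothetical helper: overlay caller-supplied overrides on the documented defaults.
function resolveOptions(
  overrides: Partial<AnswerSimilarityOptionsSketch> = {},
): AnswerSimilarityOptionsSketch {
  return { ...DEFAULT_OPTIONS, ...overrides };
}

// Example: a stricter configuration that penalizes missing concepts more
// heavily and reports scores on a 0-100 scale.
const strict = resolveOptions({ missingPenalty: 0.25, scale: 100 });
console.log(strict);
```

An object shaped like this would presumably be passed as the options parameter next to model when calling createAnswerSimilarityScorer().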

This function returns an instance of the MastraScorer class. The .run() method accepts the same input as other scorers (see the MastraScorer reference), but requires ground truth to be provided in the run object.

.run() Returns

runId: string
The id of the run (optional).

score: number
Similarity score between 0 and 1 (or between 0 and scale if a custom scale is used). Higher scores indicate closer similarity to the ground truth.

reason: string
Human-readable explanation of the score with actionable feedback.

preprocessStepResult: object
Extracted semantic units from output and ground truth.

analyzeStepResult: object
Detailed analysis of matches, contradictions, and extra information.

preprocessPrompt: string
The prompt used for semantic unit extraction.

analyzePrompt: string
The prompt used for similarity analysis.

generateReasonPrompt: string
The prompt used for generating the explanation.
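
Because the score range depends on scale, a fixed threshold such as 0.8 only makes sense after normalizing. The sketch below illustrates one way to do that. The type ScoredResultSketch and the helper meetsThreshold are hypothetical names for this example; the type mirrors only three of the documented fields and is not the library's own type.

```typescript
// Partial local mirror of the documented .run() result (assumption: the other
// documented fields are omitted here for brevity).
interface ScoredResultSketch {
  runId?: string;
  score: number;
  reason: string;
}

// Hypothetical helper: compare a score against a 0-1 threshold, first dividing
// by the scale the scorer was configured with.
function meetsThreshold(
  result: ScoredResultSketch,
  threshold: number,
  scale: number = 1,
): boolean {
  if (scale <= 0) {
    throw new RangeError("scale must be positive");
  }
  return result.score / scale >= threshold;
}

// Made-up result values, for illustration only.
const onHundredScale: ScoredResultSketch = { score: 85, reason: "All key concepts matched." };
const onUnitScale: ScoredResultSketch = { score: 0.75, reason: "One key concept missing." };

console.log(meetsThreshold(onHundredScale, 0.8, 100)); // true
console.log(meetsThreshold(onUnitScale, 0.8)); // false
```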

Usage with runExperiment

This scorer is designed for use with runExperiment for CI/CD testing:

import { runExperiment } from "@mastra/core/scores";
import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/llm";

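// `model` (a LanguageModel) and `myAgent` are assumed to be defined elsewhere;
// `expect` comes from your test framework.
const scorer = createAnswerSimilarityScorer({ model });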

await runExperiment({
  data: [
    {
      input: "What is the capital of France?",
      groundTruth: "Paris is the capital of France",
    },
  ],
  scorers: [scorer],
  target: myAgent,
  onItemComplete: ({ scorerResults }) => {
    // Assert similarity score meets threshold
    expect(scorerResults["Answer Similarity Scorer"].score).toBeGreaterThan(
      0.8,
    );
  },
});

Key Features

Semantic Analysis: Uses an LLM to extract and compare semantic units rather than relying on simple string matching
Contradiction Detection: Identifies factually incorrect information and scores it near 0
Flexible Matching: Supports exact, semantic, partial, and missing match types
CI/CD Ready: Designed for automated testing with ground truth comparison
Actionable Feedback: Provides specific explanations of what matched and what needs improvement

Scoring Algorithm

The scorer uses a multi-step process:

Extract: Breaks down output and ground truth into semantic units
Analyze: Compares units and identifies matches, contradictions, and gaps
Score: Calculates weighted similarity with penalties for contradictions
Reason: Generates human-readable explanation

Score calculation: max(0, base_score - contradiction_penalty - missing_penalty - extra_info_penalty) × scale
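
The formula above can be sketched as a small function. This is an illustration under stated assumptions, not the library's implementation: the base score is taken as an input because its derivation from exact and semantic matches is not spelled out here, and the way each penalty is accumulated (once per missing concept, once per contradiction, once per extra item up to the 0.2 cap) is an assumption based on the option descriptions. The names combineScore, PenaltyWeights, and AnalysisCounts are hypothetical.

```typescript
// Penalty settings, using the documented default values.
interface PenaltyWeights {
  missingPenalty: number;
  contradictionPenalty: number;
  extraInfoPenalty: number;
  scale: number;
}

// Hypothetical summary of the analyze step: how many items fell in each bucket.
interface AnalysisCounts {
  missing: number;
  contradictions: number;
  extras: number;
}

const DEFAULT_WEIGHTS: PenaltyWeights = {
  missingPenalty: 0.15,
  contradictionPenalty: 1.0,
  extraInfoPenalty: 0.05,
  scale: 1,
};

// The extra-information penalty is documented as capped at 0.2.
const EXTRA_INFO_CAP = 0.2;

function combineScore(
  baseScore: number,
  counts: AnalysisCounts,
  weights: PenaltyWeights = DEFAULT_WEIGHTS,
): number {
  // Documented: the penalty applies per missing key concept.
  const missing = counts.missing * weights.missingPenalty;
  // Assumption: the penalty applies once per contradiction found.
  const contradiction = counts.contradictions * weights.contradictionPenalty;
  // Assumption: the penalty applies per extra item, up to the documented cap.
  const extra = Math.min(counts.extras * weights.extraInfoPenalty, EXTRA_INFO_CAP);
  // max(0, base_score - contradiction_penalty - missing_penalty - extra_info_penalty) × scale
  return Math.max(0, baseScore - contradiction - missing - extra) * weights.scale;
}

// A single contradiction with the default penalty of 1.0 drives the score to 0.
console.log(combineScore(0.9, { missing: 0, contradictions: 1, extras: 0 })); // 0

// One missing concept plus many extras: the extra-info penalty stops at the cap.
console.log(combineScore(1, { missing: 1, contradictions: 0, extras: 10 }));
```

With the default contradictionPenalty of 1.0, any base score at or below 1 is clamped to 0 as soon as one contradiction is counted, which matches the stated goal that wrong answers score near 0.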
