Answer Similarity Scorer
The createAnswerSimilarityScorer() function creates a scorer that evaluates how similar an agent’s output is to a ground truth answer. This scorer is specifically designed for CI/CD testing scenarios where you have expected answers and want to ensure consistency over time.
For usage examples, see the Answer Similarity Examples.
Parameters
- model: The language model used to extract and compare semantic units.
- options: Optional AnswerSimilarityOptions object with the following fields:
  - requireGroundTruth: Whether ground truth must be provided in the run object.
  - semanticThreshold: Threshold used when deciding whether two semantic units count as a semantic match.
  - exactMatchBonus: Additional credit awarded for exact matches.
  - missingPenalty: Penalty applied for key information missing from the output.
  - contradictionPenalty: Penalty applied when the output contradicts the ground truth.
  - extraInfoPenalty: Penalty applied for extra information not present in the ground truth.
  - scale: Multiplier applied to the final score.
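For example, a minimal construction sketch, assuming an AI SDK model from @ai-sdk/openai; the option values shown are illustrative assumptions, not documented defaults:

```typescript
import { openai } from '@ai-sdk/openai';
import { createAnswerSimilarityScorer } from '@mastra/evals/scorers/llm';

// Option values below are illustrative assumptions, not library defaults.
const scorer = createAnswerSimilarityScorer({
  model: openai('gpt-4o-mini'),
  options: {
    missingPenalty: 0.15,      // subtract 0.15 per missing key concept
    contradictionPenalty: 1.0, // contradictions drive the score toward 0
    extraInfoPenalty: 0.05,    // mild penalty for unrelated extra detail
    scale: 1,                  // keep final scores in the 0-1 range
  },
});
```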
This function returns an instance of the MastraScorer class. The .run() method accepts the same input as other scorers (see the MastraScorer reference), but requires ground truth to be provided in the run object.
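A minimal sketch of a direct .run() call with ground truth; the exact shape of the run record is defined in the MastraScorer reference, so treat the flat fields below as a simplified assumption:

```typescript
// Simplified sketch; see the MastraScorer reference for the full run-input shape.
const result = await scorer.run({
  input: 'What is the capital of France?',
  output: 'Paris is the capital of France.',
  groundTruth: 'Paris is the capital of France',
});

console.log(result.score);  // similarity score between 0 and the configured scale
console.log(result.reason); // human-readable explanation of the result
```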
.run() Returns
- runId: Identifier for the scorer run.
- score: The similarity score, between 0 and the configured scale.
- reason: Human-readable explanation of the score.
- preprocessStepResult: The semantic units extracted from the output and the ground truth.
- analyzeStepResult: The comparison results, including matches, contradictions, and gaps.
- preprocessPrompt: The prompt sent to the LLM for the extraction step.
- analyzePrompt: The prompt sent to the LLM for the analysis step.
- generateReasonPrompt: The prompt sent to the LLM to generate the explanation.
Usage with runExperiment
This scorer is designed for use with runExperiment for CI/CD testing:
```typescript
import { runExperiment } from '@mastra/core/scores';
import { createAnswerSimilarityScorer } from '@mastra/evals/scorers/llm';

// `model`, `myAgent`, and `expect` are assumed to come from your model setup,
// agent definition, and test framework respectively.
const scorer = createAnswerSimilarityScorer({ model });

await runExperiment({
  data: [
    {
      input: "What is the capital of France?",
      groundTruth: "Paris is the capital of France"
    }
  ],
  scorers: [scorer],
  target: myAgent,
  onItemComplete: ({ scorerResults }) => {
    // Assert the similarity score meets the required threshold
    expect(scorerResults['Answer Similarity Scorer'].score).toBeGreaterThan(0.8);
  }
});
```
Key Features
- Semantic Analysis: Uses LLM to extract and compare semantic units rather than simple string matching
- Contradiction Detection: Identifies factually incorrect information and scores it near 0
- Flexible Matching: Supports exact, semantic, partial, and missing match types
- CI/CD Ready: Designed for automated testing with ground truth comparison
- Actionable Feedback: Provides specific explanations of what matched and what needs improvement
Scoring Algorithm
The scorer uses a multi-step process:
- Extract: Breaks down output and ground truth into semantic units
- Analyze: Compares units and identifies matches, contradictions, and gaps
- Score: Calculates weighted similarity with penalties for contradictions
- Reason: Generates human-readable explanation
Score calculation: max(0, base_score - contradiction_penalty - missing_penalty - extra_info_penalty) × scale
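As a rough illustration of that calculation (the function and variable names below are assumptions made for the sketch, not the scorer's internals):

```typescript
// Hypothetical sketch of the published formula; not the library's implementation.
function finalScore(
  baseScore: number,
  contradictionPenalty: number,
  missingPenalty: number,
  extraInfoPenalty: number,
  scale: number,
): number {
  const penalized = baseScore - contradictionPenalty - missingPenalty - extraInfoPenalty;
  return Math.max(0, penalized) * scale;
}

// A strong semantic match (0.9) with one missing concept (0.15 penalty),
// no contradictions, no extra info, on a 0-1 scale:
console.log(finalScore(0.9, 0, 0.15, 0, 1)); // 0.75
```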