Answer Relevancy Scorer
The createAnswerRelevancyScorer() function accepts a single options object with the following properties:
Parameters
model: LanguageModel
Configuration for the model used to evaluate relevancy.

uncertaintyWeight: number = 0.3
Weight given to 'unsure' verdicts in scoring (0-1).

scale: number = 1
Maximum score value.
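For instance, a scorer that down-weights "unsure" verdicts can be created as follows (a minimal sketch; the import path and model string match the example at the end of this page, and the option values are illustrative):

import { createAnswerRelevancyScorer } from "@mastra/evals/scorers/prebuilt";

// Illustrative configuration: lower weight for "unsure" verdicts, default 0-1 scale.
const scorer = createAnswerRelevancyScorer({
  model: "openai/gpt-4o",
  uncertaintyWeight: 0.2,
  scale: 1,
});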
This function returns an instance of the MastraScorer class. The .run() method accepts the same input as other scorers (see the MastraScorer reference), but the return value includes LLM-specific fields as documented below.
.run() Returns
runId: string
The id of the run (optional).

score: number
Relevancy score (0 to scale, default 0-1).

preprocessPrompt: string
The prompt sent to the LLM for the preprocess step (optional).

preprocessStepResult: object
Object with extracted statements: { statements: string[] }

analyzePrompt: string
The prompt sent to the LLM for the analyze step (optional).

analyzeStepResult: object
Object with results: { results: Array<{ result: 'yes' | 'unsure' | 'no', reason: string }> }

generateReasonPrompt: string
The prompt sent to the LLM for the reason step (optional).

reason: string
Explanation of the score.
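Taken together, a .run() result can be pictured as roughly the following shape (an illustrative TypeScript sketch assembled from the fields above, not a type exported by the library):

// Illustrative only; see the MastraScorer reference for the authoritative types.
interface AnswerRelevancyRunResult {
  runId?: string;
  score: number; // 0 to `scale`, 0-1 by default
  preprocessPrompt?: string;
  preprocessStepResult: { statements: string[] };
  analyzePrompt?: string;
  analyzeStepResult: {
    results: Array<{ result: "yes" | "unsure" | "no"; reason: string }>;
  };
  generateReasonPrompt?: string;
  reason: string;
}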
Scoring Details
The scorer evaluates relevancy through query-answer alignment, considering completeness and detail level, but not factual correctness.
Scoring Process
- Statement Preprocessing: breaks the output into meaningful statements while preserving context.
- Relevance Analysis: each statement is evaluated against the query and given one of three verdicts:
  - "yes": full weight for direct matches
  - "unsure": partial weight (default: 0.3) for approximate matches
  - "no": zero weight for irrelevant content
- Score Calculation:
  ((direct + uncertainty * partial) / total_statements) * scale
  where direct is the count of "yes" verdicts, partial is the count of "unsure" verdicts, and uncertainty is the uncertaintyWeight option. A worked example follows this list.
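The helper below mirrors that formula for illustration (a hypothetical function, not part of @mastra/evals). For a response that produced two "yes", one "unsure", and one "no" verdict with the default settings, the score is (2 + 0.3 * 1) / 4 = 0.575:

// Hypothetical helper that mirrors the documented formula.
type Verdict = "yes" | "unsure" | "no";

function relevancyScore(verdicts: Verdict[], uncertaintyWeight = 0.3, scale = 1): number {
  const direct = verdicts.filter((v) => v === "yes").length;
  const partial = verdicts.filter((v) => v === "unsure").length;
  return ((direct + uncertaintyWeight * partial) / verdicts.length) * scale;
}

relevancyScore(["yes", "yes", "unsure", "no"]); // (2 + 0.3 * 1) / 4 = 0.575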
Score Interpretation
A relevancy score between 0 and 1 (assuming the default scale of 1):
- 1.0: The response fully answers the query with relevant and focused information.
- 0.7–0.9: The response mostly answers the query but may include minor unrelated content.
- 0.4–0.6: The response partially answers the query, mixing relevant and unrelated information.
- 0.1–0.3: The response includes minimal relevant content and largely misses the intent of the query.
- 0.0: The response is entirely unrelated and does not answer the query.
Example
Evaluate agent responses for relevancy across different scenarios:
src/example-answer-relevancy.ts
import { runEvals } from "@mastra/core/evals";
import { createAnswerRelevancyScorer } from "@mastra/evals/scorers/prebuilt";
import { myAgent } from "./agent";

const scorer = createAnswerRelevancyScorer({ model: "openai/gpt-4o" });

const result = await runEvals({
  data: [
    {
      input: "What are the health benefits of regular exercise?",
    },
    {
      input: "What should a healthy breakfast include?",
    },
    {
      input: "What are the benefits of meditation?",
    },
  ],
  scorers: [scorer],
  target: myAgent,
  onItemComplete: ({ scorerResults }) => {
    console.log({
      score: scorerResults[scorer.id].score,
      reason: scorerResults[scorer.id].reason,
    });
  },
});

console.log(result.scores);
For more details on runEvals, see the runEvals reference.
To add this scorer to an agent, see the Scorers overview guide.