Scorers overview
Scorers are evaluation tools that measure the quality, accuracy, or performance of AI-generated outputs. They provide an automated way to assess whether your agents, workflows, or language models are producing the desired results by analyzing their responses against specific criteria.
Scores are numerical values (typically between 0 and 1) that quantify how well an output meets your evaluation criteria. These scores enable you to objectively track performance, compare different approaches, and identify areas for improvement in your AI systems.
Evaluation pipeline
Mastra scorers follow a flexible four-step pipeline that allows for simple to complex evaluation workflows:
- preprocess (Optional): Prepare or transform input/output data for evaluation
- analyze (Optional): Perform evaluation analysis and gather insights
- generateScore (Required): Convert analysis into a numerical score
- generateReason (Optional): Generate explanations or justifications for the score
This modular structure enables both simple single-step evaluations and complex multi-stage analysis workflows, allowing you to build evaluations that match your specific needs.
When to use each step
preprocess step - Use when content needs preparation or transformation before evaluation:
- Extracting specific elements from complex data structures
- Cleaning or normalizing text before analysis
- Parsing multiple claims that need individual evaluation
- Filtering content to focus evaluation on relevant sections
analyze step - Use when you need structured evaluation analysis:
- Gathering insights that inform the scoring decision
- Breaking down complex evaluation criteria into components
- Performing detailed analysis that generateScore will use
- Collecting evidence or reasoning data for transparency
generateScore step - Always required for converting analysis to scores:
- Simple scenarios: Direct scoring of input/output pairs
- Complex scenarios: Converting detailed analysis results into numerical scores
- Applying business logic and weighting to analysis results
- The only step that produces the final numerical score
generateReason step - Use when explanations are important:
- Users need to understand why a score was assigned
- Debugging and transparency are critical
- Compliance or auditing requires explanations
- Providing actionable feedback for improvement
To learn how to create your own Scorers, see Creating Custom Scorers.
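To make the four steps concrete, here is a minimal sketch of a custom scorer that checks whether required keywords appear in an output. The import path, step signatures, and result field names are assumptions for illustration only; see the Creating Custom Scorers guide for the actual API.

```typescript
// Hedged sketch: the import path, builder signature, and step argument shapes
// below are assumptions for illustration, not the confirmed API.
// Consult the Creating Custom Scorers guide before relying on them.
import { createScorer } from "@mastra/core/scores";

const REQUIRED_KEYWORDS = ["refund", "shipping", "warranty"]; // hypothetical criteria

export const keywordCoverageScorer = createScorer({
  name: "keyword-coverage",
  description: "Scores how many required keywords appear in the output",
})
  // preprocess (optional): normalize the output text before analysis
  .preprocess(({ run }) => ({
    text: String(run.output ?? "").toLowerCase(),
  }))
  // analyze (optional): gather structured insights for the scoring step
  .analyze(({ results }) => {
    const { text } = results.preprocessStepResult;
    const matched = REQUIRED_KEYWORDS.filter((keyword) => text.includes(keyword));
    return { matched, total: REQUIRED_KEYWORDS.length };
  })
  // generateScore (required): convert the analysis into a 0-1 score
  .generateScore(({ results }) => {
    const { matched, total } = results.analyzeStepResult;
    return total === 0 ? 1 : matched.length / total;
  })
  // generateReason (optional): explain why this score was assigned
  .generateReason(({ results, score }) => {
    const { matched, total } = results.analyzeStepResult;
    return `Matched ${matched.length} of ${total} required keywords (score: ${score}).`;
  });
```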
Installation
To access Mastra's scorers feature, install the `@mastra/evals` package:

```bash
npm install @mastra/evals@latest
```
Live evaluations
Live evaluations allow you to automatically score AI outputs in real-time as your agents and workflows operate. Instead of running evaluations manually or in batches, scorers run asynchronously alongside your AI systems, providing continuous quality monitoring.
Adding scorers to agents
You can add built-in scorers to your agents to automatically evaluate their outputs. See the full list of built-in scorers for all available options.
```typescript
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
import {
  createAnswerRelevancyScorer,
  createToxicityScorer
} from "@mastra/evals/scorers/llm";

export const evaluatedAgent = new Agent({
  // ...
  scorers: {
    relevancy: {
      scorer: createAnswerRelevancyScorer({ model: openai("gpt-4o-mini") }),
      sampling: { type: "ratio", rate: 0.5 }
    },
    safety: {
      scorer: createToxicityScorer({ model: openai("gpt-4o-mini") }),
      sampling: { type: "ratio", rate: 1 }
    }
  }
});
```
Adding scorers to workflow steps
You can also add scorers to individual workflow steps to evaluate outputs at specific points in your process:
```typescript
import { createWorkflow, createStep } from "@mastra/core/workflows";
import { z } from "zod";
import { customStepScorer } from "../scorers/custom-step-scorer";

const contentStep = createStep({
  // ...
  scorers: {
    customStepScorer: {
      scorer: customStepScorer(),
      sampling: {
        type: "ratio",
        rate: 1, // Score every step execution
      }
    }
  },
});

export const contentWorkflow = createWorkflow({ ... })
  .then(contentStep)
  .commit();
```
How live evaluations work
Asynchronous execution: Live evaluations run in the background without blocking your agent responses or workflow execution. This ensures your AI systems maintain their performance while still being monitored.
Sampling control: The `sampling.rate` parameter (0-1) controls what percentage of outputs get scored:
- `1.0`: Score every single response (100%)
- `0.5`: Score half of all responses (50%)
- `0.1`: Score 10% of responses
- `0.0`: Disable scoring
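As a quick reference, these rates map onto the `sampling` object shown in the agent and workflow examples above (the `type: "ratio"` shape is taken from those examples; the constant names below are just illustrative):

```typescript
// Sampling configurations in the same shape used when registering a scorer
const scoreEverything = { type: "ratio", rate: 1.0 }; // score every response
const scoreHalf = { type: "ratio", rate: 0.5 };       // score half of responses
const spotCheck = { type: "ratio", rate: 0.1 };       // score 10% of responses
const scoringOff = { type: "ratio", rate: 0 };        // disable scoring
```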
Automatic storage: All scoring results are automatically stored in the `mastra_scorers` table in your configured database, allowing you to analyze performance trends over time.
Testing scorers locally
Mastra provides the `mastra dev` CLI command for testing your scorers locally. The playground includes a scorers section where you can run individual scorers against test inputs and view detailed results.
For more details, see the Local Dev Playground docs.
Next steps
- Learn how to create your own scorers in the Creating Custom Scorers guide
- Explore built-in scorers in the Off-the-shelf Scorers section
- Test scorers with the Local Dev Playground
- See example scorers in the Examples Overview section