We're launching scorers in Mastra—our new version of evals—and they're now available in the Playground.
Evals are how you productionize agents. Without them, you're flying blind—you have no idea if your changes are making things better or worse. This is especially true for use cases with clear ground truth, where you need to know that your agent is giving correct answers, not just plausible-sounding ones.
Why We Built Scorers
Evals were our first implementation for measuring output quality. They worked, but we learned a lot from watching how people used them. Users wanted to run evals on individual workflow steps. Businesses needed custom evals tailored to specific use cases and KPIs.
Scorers are the result of those learnings. We built them to give users more control over the eval pipeline. Scorers are simpler, more capable, and better integrated with the rest of Mastra.
You're probably wondering: why the term "scorers"? "Evaluator" felt overly academic, so we went with "scorers" because that's what they do—they score things.
How Scorers Work
A scorer belongs to an agent or workflow step. An agent can have multiple scorers. Each scorer runs asynchronously after your agent responds, evaluating the output without blocking the response:
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
import {
  createAnswerRelevancyScorer,
  createBiasScorer
} from "@mastra/evals/scorers/llm";

export const customerSupportAgent = new Agent({
  name: "CustomerSupport",
  instructions: "You are a helpful customer support agent",
  model: openai("gpt-4o"),
  scorers: {
    relevancy: {
      scorer: createAnswerRelevancyScorer({ model: openai("gpt-4o") }),
      sampling: { type: "ratio", rate: 0.5 }
    },
    bias: {
      scorer: createBiasScorer({ model: openai("gpt-4o") }),
      sampling: { type: "ratio", rate: 1 }
    }
  }
});
The sampling.rate controls what percentage of outputs get scored. Set it to 1 to score everything, 0.5 for half, etc. This lets you balance evaluation coverage with cost.
The Scoring Pipeline
Scorers can have up to four steps, though only generateScore is required:
import { createScorer } from "@mastra/core/scores";

// Updated API
export const customScorer = createScorer({
  name: "My Scorer",
  description: "Evaluates something specific",
}).preprocess(({ run }) => {
  const { input, output } = run
  return processedData
}).analyze(({ results }) => {
  const { preprocessStepResult } = results
  return analyzedData
}).generateScore(({ results }) => {
  const { analyzeStepResult } = results
  return 0.85
}).generateReason(({ score, results }) => {
  const { analyzeStepResult, preprocessStepResult } = results
  return "Explanation of why this scored 0.85"
})
Each step serves a specific purpose: the preprocess step prepares data when you need to extract or transform complex inputs (like breaking down long responses into individual claims). The analyze step performs the core evaluation logic, gathering insights that inform scoring (like detecting bias in each claim). The generateScore step converts analysis results into a numerical score using deterministic logic. The generateReason step provides human-readable explanations for debugging or compliance.
Built-in Scorers
We ship several scorers out of the box. Here are two examples:
The bias scorer detects discriminatory language or unfair stereotypes. It extracts opinions from the text, evaluates each one, and returns the proportion that contain bias:
const biasScorer = createBiasScorer({
  model: openai("gpt-4o"),
  scale: 1 // Score range 0-1
});
The answer relevancy scorer checks whether responses actually address the user's question. It breaks the output into statements and evaluates how many directly answer the query:
const relevancyScorer = createAnswerRelevancyScorer({
  model: openai("gpt-4o"),
  uncertaintyWeight: 0.3 // Weight for partially relevant statements
});
See the full list at /docs/scorers/off-the-shelf-scorers.
Creating Custom Scorers
You're not limited to our built-in scorers. You can create custom scorers using the createScorer function, mixing functions and LLM prompts as needed for your evaluation logic.
Scorers follow a flexible four-step pipeline (preprocess, analyze, generateScore, generateReason), where only generateScore is required. Each step can either be a function for deterministic logic or an LLM prompt for nuanced evaluation.
The interesting part is how we handle LLM scorers. LLMs are terrible at producing consistent numerical scores—ask the same model to rate something from 0-1 five times and you'll get five different numbers. So we have LLMs output structured data instead, then use a deterministic generateScore function to convert that into a number. This gives you the nuance of LLM evaluation with the consistency of code.
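To make that concrete, here's a rough sketch of a custom scorer built this way. The scorer name, prompt, output schema, and the judge option used to configure the model are illustrative placeholders rather than copies from our docs:

// Illustrative sketch: the analyze step asks the LLM for structured data,
// and generateScore converts it into a number deterministically.
import { openai } from "@ai-sdk/openai";
import { createScorer } from "@mastra/core/scores";
import { z } from "zod";

export const claimSupportScorer = createScorer({
  name: "Claim Support",
  description: "Scores how many claims in the answer are supported",
  // Assumption: the model used by prompt-based steps is configured here
  judge: {
    model: openai("gpt-4o"),
    instructions: "You are a strict fact checker.",
  },
})
  .analyze({
    // LLM step: return structured claims instead of a raw score
    createPrompt: ({ run }) =>
      `List each factual claim in the answer below and whether it is supported.\n\nAnswer:\n${run.output.text}`,
    outputSchema: z.object({
      claims: z.array(
        z.object({ text: z.string(), supported: z.boolean() })
      ),
    }),
  })
  .generateScore(({ results }) => {
    // Deterministic step: the same analysis always yields the same score
    const claims = results.analyzeStepResult.claims;
    if (claims.length === 0) return 1;
    return claims.filter((c) => c.supported).length / claims.length;
  });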
We'll be publishing a detailed post on custom scorers very soon. For now, check out our docs.
Scorers in the Playground
The Playground has a new Scorers tab where you can view all scoring results from your agents and workflows. When you add scorers to your agents or workflow steps and run them in the Playground, the results appear there automatically.
Run mastra dev and navigate to the Scorers tab to try it out.
How We Built This
Under the hood, we use Mastra workflows to run the scoring pipeline. Each step (preprocess, analyze, generateScore, generateReason) is a workflow step. This gives us async execution, error handling, and all the other workflow benefits for free.
The scoring results are stored in the mastra_scorers table in your configured database. This happens automatically when you have storage configured.
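As a rough sketch, that configuration might look like the following, assuming the LibSQL storage adapter from @mastra/libsql and a local database file (both are illustrative choices; any supported storage adapter works the same way):

// Illustrative sketch: scorer results are persisted once the Mastra instance
// has storage configured. The adapter and file path are placeholders.
import { Mastra } from "@mastra/core/mastra";
import { LibSQLStore } from "@mastra/libsql";
import { customerSupportAgent } from "./agents/customer-support"; // example path

export const mastra = new Mastra({
  agents: { customerSupportAgent },
  // Scoring results land in the mastra_scorers table of this database
  storage: new LibSQLStore({ url: "file:./mastra.db" }),
});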
How to migrate from evals
If you're using evals, here's what changes with scorers:
The main difference is the API. Instead of extending Metric classes and managing a separate MastraAgentJudge class, you use the unified createScorer function that makes the scoring pipeline explicit.
Step 1: Replace the class-based approach
// Old: Extend Metric class
export class GlutenCheckerMetric extends Metric {
  async measure(output) {
    // evaluation logic
  }
}

// New: Use createScorer function
export const glutenCheckerScorer = createScorer({
  name: 'Gluten Checker',
  description: 'Check if the output contains any gluten',
})
Step 2: Move your evaluation logic to the pipeline
// Old: Split across Metric and MastraAgentJudge classes
export class GlutenCheckerMetric extends Metric {
  async measure(output) {
    const analysis = await this.judge.evaluate(output);
    const score = this.calculateScore(analysis.isGlutenFree);
    const reason = await this.judge.getReason(analysis);
    return { score, info: { ...analysis, reason } };
  }
}

// Your judge class
class GlutenCheckerJudge extends MastraAgentJudge {
  async evaluate(output) {
    const result = await this.agent.generate(
      prompt,
      { output: schema }
    );

    return result.object;
  }

  async getReason(args) {
    const result = await this.agent.generate(
      reasonPrompt,
      { output: schema }
    );

    return result.object.reason;
  }
}

// New: Consolidated into pipeline steps
export const glutenCheckerScorer = createScorer(...)
  .analyze({
    // Move your judge.evaluate() logic here
    createPrompt: ({ run }) =>
      generateGlutenPrompt({ output: run.output.text }),

    // Provide the output schema of the prompt
    outputSchema: z.object({
      isGlutenFree: z.boolean(),
      glutenSources: z.array(z.string())
    }),
  })
  .generateScore(({ results }) => {
    // Move your calculateScore() logic here
    return results.analyzeStepResult.isGlutenFree ? 1 : 0;
  })
  .generateReason({
    // Move your judge.getReason() logic here
    createPrompt: ({ results }) =>
      generateReasonPrompt({
        isGlutenFree: results.analyzeStepResult.isGlutenFree,
        glutenSources: results.analyzeStepResult.glutenSources
      })
  })
Step 3: Update your agent configuration
// Old: evals property with metric instances
export const chefAgent = new Agent({
  name: 'chef-agent',
  instructions: 'You are Michel, an experienced home chef...',
  model: openai('gpt-4o'),
  evals: {
    glutenChecker: new GlutenCheckerMetric(openai('gpt-4o')),
  },
});

// New: scorers property with sampling configuration
export const chefAgent = new Agent({
  name: 'chef-agent',
  instructions: 'You are Michel, an experienced home chef...',
  model: openai('gpt-4o'),
  scorers: {
    glutenChecker: {
      scorer: glutenCheckerScorer,
      sampling: { type: 'ratio', rate: 1 },
    },
  },
});
To see the complete implementation of both approaches:
- Evals approach: Custom Eval Example
- New scorers approach: Custom Scorer Example
Both examples implement the same gluten checker logic, so you can compare the old and new approaches side by side.
We're not deprecating evals immediately. They'll continue to work for existing projects. But scorers are where we're focusing development going forward. The migration is straightforward—the core evaluation logic stays the same; you're just wrapping it differently.
Coming soon: golden answers and more
We're adding golden answers soon. You'll be able to define reference outputs that scorers compare against. This is useful for regression testing and tracking drift over time.
We’ll also be adding command line support, conversational evals, and more.
Getting Started
Upgrade to the latest Mastra version:
pnpm add @mastra/core@latest @mastra/evals@latest
Add scorers to your agents, run mastra dev, and check the Scorers tab in the Playground.
If you have questions or run into issues, we're on Discord and GitHub.