We're launching scorers in Mastra—our new version of evals—and they're now available in the Playground.
Evals are how you productionize agents. Without them, you're flying blind—you have no idea if your changes are making things better or worse. This is especially true for use cases with clear ground truth, where you need to know that your agent is giving correct answers, not just plausible-sounding ones.
Why We Built Scorers
Evals were our first implementation for measuring output quality. They worked, but we learned a lot from watching how people used them. Users wanted to run evals on individual workflow steps. Businesses needed custom evals tailored to specific use cases and KPIs.
Scorers are the result of those learnings. We built them to give users more control over the eval pipeline. Scorers are simpler, more capable, and better integrated with the rest of Mastra.
You're probably wondering: why the term "scorers"? "Evaluator" felt overly academic, so we went with "scorers" because that's what they do—they score things.
How Scorers Work
A scorer belongs to an agent or workflow step. An agent can have multiple scorers. Each scorer runs asynchronously after your agent responds, evaluating the output without blocking the response:
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
import {
  createAnswerRelevancyScorer,
  createBiasScorer
} from "@mastra/evals/scorers/llm";

export const customerSupportAgent = new Agent({
  name: "CustomerSupport",
  instructions: "You are a helpful customer support agent",
  model: openai("gpt-4o"),
  scorers: {
    relevancy: {
      scorer: createAnswerRelevancyScorer({ model: openai("gpt-4o") }),
      sampling: { type: "ratio", rate: 0.5 }
    },
    bias: {
      scorer: createBiasScorer({ model: openai("gpt-4o") }),
      sampling: { type: "ratio", rate: 1 }
    }
  }
});
The sampling.rate controls what percentage of outputs get scored. Set it to 1 to score everything, 0.5 for half, etc. This lets you balance evaluation coverage with cost.
The Scoring Pipeline
Scorers can have up to four steps, though only generateScore is required:
import { createScorer } from "@mastra/core/scores";

// Updated API
export const customScorer = createScorer({
  name: "My Scorer",
  description: "Evaluates something specific",
}).preprocess(({ run }) => {
  const { input, output } = run
  return processedData
}).analyze(({ results }) => {
  const { preprocessStepResult } = results
  return analyzedData
}).generateScore(({ results }) => {
  const { analyzeStepResult } = results
  return 0.85
}).generateReason(({ score, results }) => {
  const { analyzeStepResult, preprocessStepResult } = results
  return "Explanation of why this scored 0.85"
})
Each step serves a specific purpose: the preprocess step prepares data when you need to extract or transform complex inputs (like breaking down long responses into individual claims). The analyze step performs the core evaluation logic, gathering insights that inform scoring (like detecting bias in each claim). The generateScore step converts analysis results into a numerical score using deterministic logic. The generateReason step provides human-readable explanations for debugging or compliance.
Built-in Scorers
We ship several scorers out of the box. Here are two examples:
The bias scorer detects discriminatory language or unfair stereotypes. It extracts opinions from the text, evaluates each one, and returns the proportion that contain bias:
const biasScorer = createBiasScorer({
  model: openai("gpt-4o"),
  scale: 1 // Score range 0-1
});
The answer relevancy scorer checks whether responses actually address the user's question. It breaks the output into statements and evaluates how many directly answer the query:
const relevancyScorer = createAnswerRelevancyScorer({
  model: openai("gpt-4o"),
  uncertaintyWeight: 0.3 // Weight for partially relevant statements
});
See the full list at /docs/scorers/off-the-shelf-scorers.
Creating Custom Scorers
You're not limited to our built-in scorers. You can create custom scorers using the createScorer function, mixing functions and LLM prompts as needed for your evaluation logic.
Scorers follow a flexible four-step pipeline (preprocess, analyze, generateScore, generateReason), where only generateScore is required. Each step can either be a function for deterministic logic or an LLM prompt for nuanced evaluation.
The interesting part is how we handle LLM scorers. LLMs are terrible at producing consistent numerical scores—ask the same model to rate something from 0-1 five times and you'll get five different numbers. So we have LLMs output structured data instead, then use a deterministic generateScore function to convert that into a number. This gives you the nuance of LLM evaluation with the consistency of code.
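To make that concrete, here's a rough sketch of a custom scorer built this way. The scorer name, prompt, output schema, and the judge option used to configure the model are illustrative placeholders rather than copies from our docs:

// Illustrative sketch: the analyze step asks the LLM for structured data,
// and generateScore converts it into a number deterministically.
import { openai } from "@ai-sdk/openai";
import { createScorer } from "@mastra/core/scores";
import { z } from "zod";

export const claimSupportScorer = createScorer({
  name: "Claim Support",
  description: "Scores how many claims in the answer are supported",
  // Assumption: the model used by prompt-based steps is configured here
  judge: {
    model: openai("gpt-4o"),
    instructions: "You are a strict fact checker.",
  },
})
  .analyze({
    // LLM step: return structured claims instead of a raw score
    createPrompt: ({ run }) =>
      `List each factual claim in the answer below and whether it is supported.\n\nAnswer:\n${run.output.text}`,
    outputSchema: z.object({
      claims: z.array(
        z.object({ text: z.string(), supported: z.boolean() })
      ),
    }),
  })
  .generateScore(({ results }) => {
    // Deterministic step: the same analysis always yields the same score
    const claims = results.analyzeStepResult.claims;
    if (claims.length === 0) return 1;
    return claims.filter((c) => c.supported).length / claims.length;
  });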
We'll be publishing a detailed post on custom scorers very soon. For now, check out our docs.
Scorers in the Playground
The Playground has a new Scorers tab where you can view all scoring results from your agents and workflows. When you add scorers to your agents or workflow steps and run them in the Playground, the results appear there automatically.
Run mastra dev and navigate to the Scorers tab to try it out.
How We Built This
Under the hood, we use Mastra workflows to run the scoring pipeline. Each step (preprocess, analyze, generateScore, generateReason) is a workflow step. This gives us async execution, error handling, and all the other workflow benefits for free.
The scoring results are stored in the mastra_scorers table in your configured database. This happens automatically when you have storage configured.
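As a rough sketch, that configuration might look like the following, assuming the LibSQL storage adapter from @mastra/libsql and a local database file (both are illustrative choices; any supported storage adapter works the same way):

// Illustrative sketch: scorer results are persisted once the Mastra instance
// has storage configured. The adapter and file path are placeholders.
import { Mastra } from "@mastra/core/mastra";
import { LibSQLStore } from "@mastra/libsql";
import { customerSupportAgent } from "./agents/customer-support"; // example path

export const mastra = new Mastra({
  agents: { customerSupportAgent },
  // Scoring results land in the mastra_scorers table of this database
  storage: new LibSQLStore({ url: "file:./mastra.db" }),
});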
How to migrate from evals
If you're using evals, here's what changes with scorers:
The main difference is the API. Instead of extending Metric classes and managing a separate MastraAgentJudge class, you use the unified createScorer function that makes the scoring pipeline explicit.
Step 1: Replace the class-based approach
// Old: Extend Metric class
export class GlutenCheckerMetric extends Metric {
  async measure(output) {
    // evaluation logic
  }
}

// New: Use createScorer function
export const glutenCheckerScorer = createScorer({
  name: 'Gluten Checker',
  description: 'Check if the output contains any gluten',
})
Step 2: Move your evaluation logic to the pipeline
// Old: Split across Metric and MastraAgentJudge classes
export class GlutenCheckerMetric extends Metric {
  async measure(output) {
    const analysis = await this.judge.evaluate(output);
    const score = this.calculateScore(analysis.isGlutenFree);
    const reason = await this.judge.getReason(analysis);
    return { score, info: { ...analysis, reason } };
  }
}

// Your judge class
class GlutenCheckerJudge extends MastraAgentJudge {
  async evaluate(output) {
    const result = await this.agent.generate(
      prompt,
      { output: schema }
    );

    return result.object;
  }

  async getReason(args) {
    const result = await this.agent.generate(
      reasonPrompt,
      { output: schema }
    );

    return result.object.reason;
  }
}

// New: Consolidated into pipeline steps
export const glutenCheckerScorer = createScorer(...)
  .analyze({
    // Move your judge.evaluate() logic here
    createPrompt: ({ run }) =>
      generateGlutenPrompt({ output: run.output.text }),

    // Provide the output schema of the prompt
    outputSchema: z.object({
      isGlutenFree: z.boolean(),
      glutenSources: z.array(z.string())
    }),
  })
  .generateScore(({ results }) => {
    // Move your calculateScore() logic here
    return results.analyzeStepResult.isGlutenFree ? 1 : 0;
  })
  .generateReason({
    // Move your judge.getReason() logic here
    createPrompt: ({ results }) =>
      generateReasonPrompt({
        isGlutenFree: results.analyzeStepResult.isGlutenFree,
        glutenSources: results.analyzeStepResult.glutenSources
      })
  })
Step 3: Update your agent configuration
// Old: evals property with metric instances
export const chefAgent = new Agent({
  name: 'chef-agent',
  instructions: 'You are Michel, an experienced home chef...',
  model: openai('gpt-4o'),
  evals: {
    glutenChecker: new GlutenCheckerMetric(openai('gpt-4o')),
  },
});

// New: scorers property with sampling configuration
export const chefAgent = new Agent({
  name: 'chef-agent',
  instructions: 'You are Michel, an experienced home chef...',
  model: openai('gpt-4o'),
  scorers: {
    glutenChecker: {
      scorer: glutenCheckerScorer,
      sampling: { type: 'ratio', rate: 1 },
    },
  },
});
To see the complete implementation of both approaches:
- Evals approach: Custom Eval Example
- New scorers approach: Custom Scorer Example
Both examples implement the same gluten checker logic, so you can compare the old and new approaches side by side.
We're not deprecating evals immediately. They'll continue to work for existing projects. But scorers are where we're focusing development going forward. The migration is straightforward—the core evaluation logic stays the same; you're just wrapping it differently.
Coming soon: golden answers and more
We're adding golden answers soon. You'll be able to define reference outputs that scorers compare against. This is useful for regression testing and tracking drift over time.
We’ll also be adding command line support, conversational evals, and more.
Getting Started
Upgrade to the latest Mastra version:
pnpm add @mastra/core@latest @mastra/evals@latest
Add scorers to your agents, run mastra dev, and check the Scorers tab in the Playground.
If you have questions or run into issues, we're on Discord and GitHub.