
Creating scorers

Mastra provides two approaches for creating custom scorers:

Code scorers use programmatic logic and algorithms. They’re ideal for deterministic evaluations, performance-critical scenarios, and cases where you have clear algorithmic criteria.

LLM scorers use language models as judges. They’re perfect for subjective evaluations, complex criteria that are difficult to code algorithmically, and cases where human-like judgment is needed.

Code-based scorers

Code scorers use createScorer to build evaluation logic with programmatic algorithms. They’re ideal for deterministic evaluations, performance-critical scenarios, and cases where you have clear algorithmic criteria or need integration with existing libraries.

Code scorers follow Mastra’s three-step evaluation pipeline:

  • an optional extract step for preprocessing complex data
  • a required analyze step for core evaluation and scoring
  • and an optional reason step for generating explanations.

For the complete API reference, see createScorer, and for a detailed explanation of the pipeline, see evaluation process.
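Put together, a code scorer has the shape sketched below. This is a minimal sketch only (the scorer name and placeholder bodies are hypothetical); the keyword-coverage example in the following sections fills each step in with real logic.

import { createScorer } from "@mastra/core/scores";

// Minimal sketch of the three-step shape; each step is documented below.
export const mySketchScorer = createScorer({
  name: "My Scorer",
  description: "Sketch of the three-step pipeline",
  // Optional: preprocess input/output into whatever the analyze step needs
  extract: async ({ input, output }) => {
    return { results: { /* preprocessed data */ } };
  },
  // Required: compute the numerical score (plus any intermediate results)
  analyze: async ({ input, output, extractStepResult }) => {
    return { score: 1, results: {} };
  },
  // Optional: turn the score and results into a human-readable explanation
  reason: async ({ score, analyzeStepResult }) => {
    return { reason: `Scored ${score} because ...` };
  },
});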

Extract Step

This optional step preprocesses input/output data when you need to evaluate multiple distinct elements, filter content, or focus analysis on specific parts of complex data.

  • Receives:
    • input: User messages (when used with agents) or workflow step input (when used with workflow steps)
    • output: Agent’s response (when used with agents) or workflow step output (when used with workflow steps)
    • runtimeContext: Runtime context from the agent or workflow step being evaluated
  • Must return: { results: any }
  • Data flow: The results value is passed to the analyze step as extractStepResult
src/mastra/scorers/keyword-coverage-scorer.ts
import { createScorer } from "@mastra/core/scores";
import keywordExtractor from "keyword-extractor";

export const keywordCoverageScorer = createScorer({
  name: "Keyword Coverage",
  description: "Evaluates how well the output covers keywords from the input",
  // Step 1: Extract keywords from input and output
  extract: async ({ input, output }) => {
    const inputText = input?.map(i => i.content).join(", ") || "";
    const outputText = output.text;

    const extractKeywords = (text: string) => {
      return keywordExtractor.extract(text);
    };

    const inputKeywords = new Set(extractKeywords(inputText));
    const outputKeywords = new Set(extractKeywords(outputText));

    return {
      results: {
        inputKeywords,
        outputKeywords,
      },
    };
  },
  // ... analyze and reason steps
});

Analyze Step

This step is required for every scorer. It performs the core evaluation and generates the numerical score.

  • Receives: Everything from extract step, plus:
    • extractStepResult: Results from the extract step (if extract step was defined)
  • Must return: { score: number, results?: any }
  • Data flow: The score and optional results are passed to the reason step
src/mastra/scorers/keyword-coverage-scorer.ts
export const keywordCoverageScorer = createScorer({
  // ... name, description, extract step
  // Step 2: Analyze keyword coverage and calculate score
  analyze: async ({ input, output, extractStepResult }) => {
    const { inputKeywords, outputKeywords } = extractStepResult.results;

    if (inputKeywords.size === 0) {
      return { score: 1, results: { coverage: 1, matched: 0, total: 0 } };
    }

    const matchedKeywords = [...inputKeywords].filter(keyword =>
      outputKeywords.has(keyword)
    );

    const coverage = matchedKeywords.length / inputKeywords.size;

    return {
      score: coverage,
      results: {
        coverage,
        matched: matchedKeywords.length,
        total: inputKeywords.size,
        matchedKeywords,
      },
    };
  },
  // ... reason step
});

Reason Step

This optional step generates human-readable explanations for scores, useful for actionable feedback, debugging transparency, or compliance documentation.

  • Receives: Everything from analyze step, plus:
    • score: The numerical score (0-1) calculated by the analyze step
    • analyzeStepResult: Results from the analyze step (contains the score and any additional results)
  • Must return: { reason: string }
src/mastra/scorers/keyword-coverage-scorer.ts
export const keywordCoverageScorer = createScorer({
  // ... name, description, extract and analyze steps
  // Step 3: Generate explanation for the score
  reason: async ({ score, analyzeStepResult, extractStepResult }) => {
    const { matched, total, matchedKeywords } = analyzeStepResult.results;
    const { inputKeywords } = extractStepResult.results;

    const percentage = Math.round(score * 100);
    const missedKeywords = [...inputKeywords].filter(
      keyword => !matchedKeywords.includes(keyword)
    );

    let reason = `The output achieved ${percentage}% keyword coverage (${matched}/${total} keywords).`;

    if (matchedKeywords.length > 0) {
      reason += ` Covered keywords: ${matchedKeywords.join(", ")}.`;
    }

    if (missedKeywords.length > 0) {
      reason += ` Missing keywords: ${missedKeywords.join(", ")}.`;
    }

    return { reason };
  },
});
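With the scorer defined, it can be attached to an agent so responses are scored as they are produced. The wiring below is a rough sketch: the agent itself is hypothetical, and the scorers map with a sampling setting follows the shape used in Mastra's scorer overview, so confirm the exact option names against the API reference for your version.

import { openai } from "@ai-sdk/openai";
import { Agent } from "@mastra/core/agent";
import { keywordCoverageScorer } from "../scorers/keyword-coverage-scorer";

// Hypothetical agent; the scorers/sampling shape is an assumption to verify.
export const supportAgent = new Agent({
  name: "support-agent",
  instructions: "Answer user questions concisely.",
  model: openai("gpt-4o"),
  scorers: {
    keywordCoverage: {
      scorer: keywordCoverageScorer,
      sampling: { type: "ratio", rate: 1 }, // score every run
    },
  },
});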


LLM-based scorers

LLM scorers use createLLMScorer to build evaluations that leverage language models as judges. They’re perfect for subjective evaluations that require understanding context, complex criteria that are difficult to code algorithmically, natural language understanding tasks, and cases where human-like judgment is needed.

LLM scorers follow the same evaluation pipeline as code scorers with an additional calculateScore function:

  • an optional extract step where the LLM processes input/output and returns structured data
  • a required analyze step where the LLM performs evaluation and returns structured analysis
  • a required calculateScore function that converts the LLM's analysis into a numerical score
  • and an optional reason step where the LLM generates human-readable explanations

The calculateScore function leverages the best of both approaches: LLMs excel at qualitative analysis and understanding, while deterministic functions ensure precise and consistent numerical scoring.

For the complete API reference, see createLLMScorer, and for a detailed explanation of the pipeline, see evaluation process.
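Assembled, an LLM scorer looks like the sketch below. It is a sketch only (the scorer name and the one-dimension rating rubric are made up for illustration); the tone and quality scorers in the following sections show concrete judge configurations, prompts, and schemas.

import { openai } from "@ai-sdk/openai";
import { z } from "zod";
import { createLLMScorer } from "@mastra/core/scores";

// Sketch of the createLLMScorer shape; each piece is documented below.
export const mySketchLLMScorer = createLLMScorer({
  name: "My LLM Scorer",
  description: "Sketch of the LLM scorer pipeline",
  // Shared judge configuration used by every LLM step
  judge: {
    model: openai("gpt-4o"),
    instructions: "You are an expert evaluator.",
  },
  // Required: the LLM returns structured analysis defined by outputSchema
  analyze: {
    description: "Rate the output",
    outputSchema: z.object({ rating: z.number().min(1).max(5) }),
    createPrompt: ({ run }) => `Rate this output from 1-5: ${run.output.text}`,
  },
  // Required: deterministic conversion of the analysis into a 0-1 score
  calculateScore: ({ run }) => (run.analyzeStepResult.rating - 1) / 4,
  // Optional: the LLM explains the score
  reason: {
    createPrompt: ({ run }) => `Explain why the output scored ${run.score}.`,
  },
});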

Judge Configuration

All LLM scorer steps share this required configuration that defines the model and system instructions.

  • Configuration: judge object containing:
    • model: The LLM model instance for evaluation
    • instructions: System prompt that guides the LLM’s behavior
src/mastra/scorers/tone-scorer.ts
import { openai } from "@ai-sdk/openai";
import { createLLMScorer } from "@mastra/core/scores";

export const toneScorer = createLLMScorer({
  name: 'Tone Scorer',
  description: 'Evaluates the tone of the output',
  // Shared judge configuration
  judge: {
    model: openai('gpt-4o'),
    instructions: 'You are an expert in analyzing text tone and communication style.',
  },
  // ... other steps
});

Extract Step

This optional step uses an LLM to preprocess input/output data when you need to evaluate multiple distinct elements, filter content, or focus analysis on specific parts of complex data.

  • Configuration: { description, outputSchema, createPrompt }
  • Data flow: The structured output (defined by outputSchema) is passed to the analyze step as extractStepResult
src/mastra/scorers/content-scorer.ts
export const contentScorer = createLLMScorer({
  // ... judge configuration
  extract: {
    description: 'Extract key themes and topics from the content',
    outputSchema: z.object({
      themes: z.array(z.string()),
      topics: z.array(z.string()),
      keyPhrases: z.array(z.string())
    }),
    createPrompt: ({ run }) => `
      Analyze this content and extract:
      1. Main themes (3-5 high-level concepts)
      2. Specific topics mentioned
      3. Key phrases that capture the essence

      Content: ${run.output.text}

      Return a JSON object with themes, topics, and keyPhrases arrays.
    `,
  },
  // ... other steps
});

Analyze Step

This required step uses an LLM to perform the core evaluation and return structured analysis that will be converted to a numerical score.

  • Configuration: { description, outputSchema, createPrompt }
  • Data flow: The structured output is passed to the calculateScore function and then to the reason step
src/mastra/scorers/quality-scorer.ts
export const qualityScorer = createLLMScorer({
  // ... judge configuration
  analyze: {
    description: 'Evaluate content quality across multiple dimensions',
    outputSchema: z.object({
      clarity: z.number().min(1).max(5),
      accuracy: z.number().min(1).max(5),
      completeness: z.number().min(1).max(5),
      relevance: z.number().min(1).max(5)
    }),
    createPrompt: ({ run }) => `
      Evaluate this content on a scale of 1-5 for:
      - Clarity: How clear and understandable is it?
      - Accuracy: How factually correct does it appear?
      - Completeness: How thorough is the response?
      - Relevance: How well does it address the input?

      Input: ${run.input.map(i => i.content).join(', ')}
      Output: ${run.output.text}

      Return a JSON object with numeric scores for each dimension.
    `,
  },
  // ... other steps
});

Calculate Score Step

This required function converts the LLM’s structured analysis into a numerical score, providing deterministic scoring logic since LLMs aren’t reliable for consistent numerical outputs.

  • Configuration: calculateScore function that receives { run } and returns a number
  • Data flow: Converts the analyze step’s structured output into a numerical score (0-1 range)
src/mastra/scorers/quality-scorer.ts
export const qualityScorer = createLLMScorer({
  // ... previous steps
  calculateScore: ({ run }) => {
    const { clarity, accuracy, completeness, relevance } = run.analyzeStepResult;

    // Calculate weighted average (scale of 1-5 to 0-1)
    const weights = { clarity: 0.3, accuracy: 0.3, completeness: 0.2, relevance: 0.2 };
    const weightedSum =
      (clarity * weights.clarity) +
      (accuracy * weights.accuracy) +
      (completeness * weights.completeness) +
      (relevance * weights.relevance);

    // Convert from 1-5 scale to 0-1 scale
    return (weightedSum - 1) / 4;
  },
  // ... other steps
});
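For example, with hypothetical ratings of clarity 4, accuracy 5, completeness 3, and relevance 4, the weighted sum is (4 × 0.3) + (5 × 0.3) + (3 × 0.2) + (4 × 0.2) = 4.1, which the final conversion maps to (4.1 - 1) / 4 = 0.775 on the 0-1 scale.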

Reason Step

This optional step uses an LLM to generate human-readable explanations for scores, useful for actionable feedback, debugging transparency, or compliance documentation.

  • Configuration: { description, createPrompt }
  • Data flow: Receives all previous step results and the score, and returns a string explanation
src/mastra/scorers/quality-scorer.ts
export const qualityScorer = createLLMScorer({
  // ... previous steps
  reason: {
    createPrompt: ({ run }) => {
      const { clarity, accuracy, completeness, relevance } = run.analyzeStepResult;
      const percentage = Math.round(run.score * 100);

      return `
        The content received a ${percentage}% quality score based on:
        - Clarity: ${clarity}/5
        - Accuracy: ${accuracy}/5
        - Completeness: ${completeness}/5
        - Relevance: ${relevance}/5

        Provide a brief explanation of what contributed to this score.
      `;
    },
  },
});
