DocsEvalsCustom Evals

Create your own Eval

Creating your own eval is as easy as creating a new function. You simply create a class that extends the Metric class and implement the measure method.

Basic example

Here is a very basic example of a custom eval that checks if the output contains a certain keyword. This is a simplified version of our own keyword coverage eval.

src/mastra/evals/keyword-coverage.ts
import { Metric, type MetricResult } from '@mastra/core/eval';
 
interface KeywordCoverageResult extends MetricResult {
  info: {
    totalKeywords: number;
    matchedKeywords: number;
  };
}
 
export class KeywordCoverageMetric extends Metric {
  private referenceKeywords: Set<string>;
 
  constructor(keywords: string[]) {
    super();
    this.referenceKeywords = new Set(keywords);
  }
 
  async measure(input: string, output: string): Promise<KeywordCoverageResult> {
    // Handle empty strings case
    if (!input && !output) {
      return {
        score: 1,
        info: {
          totalKeywords: 0,
          matchedKeywords: 0,
        },
      };
    }
 
    const matchedKeywords = [...this.referenceKeywords].filter(k => output.includes(k));
    const totalKeywords = this.referenceKeywords.size;
    const coverage = totalKeywords > 0 ? matchedKeywords.length / totalKeywords : 0;
 
    return {
      score: coverage,
      info: {
        totalKeywords: this.referenceKeywords.size,
        matchedKeywords: matchedKeywords.length,
      },
    };
  }
}

Creating a custom LLM-Judge

A custom LLM judge can provide more targeted and meaningful evaluations for your use case. For example, if you’re building a medical Q&A system, you might want to evaluate not just answer relevancy but also medical accuracy and safety considerations.

Let’s create an example to make sure our Chef Michel is giving complete recipe information to the user.

We’ll start with creating the judge agent. You can put it all in one file but we prefer splitting it into a separate file to keep things readable.

src/mastra/evals/recipe-completeness/metricJudge.ts
import { type LanguageModel } from '@mastra/core/llm';
import { MastraAgentJudge } from '@mastra/evals/judge';
import { z } from 'zod';
 
import { RECIPE_COMPLETENESS_INSTRUCTIONS, generateCompletenessPrompt, generateReasonPrompt } from './prompts';
 
export class RecipeCompletenessJudge extends MastraAgentJudge {
  constructor(model: LanguageModel) {
    super('Recipe Completeness', RECIPE_COMPLETENESS_INSTRUCTIONS, model);
  }
 
  async evaluate(
    input: string,
    output: string,
  ): Promise<{
    missing: string[];
    verdict: string;
  }> {
    const completenessPrompt = generateCompletenessPrompt({ input, output });
    const result = await this.agent.generate(completenessPrompt, {
      output: z.object({
        missing: z.array(z.string()),
        verdict: z.string(),
      }),
    });
 
    return result.object;
  }
 
  async getReason(args: {
    input: string;
    output: string;
    missing: string[];
    verdict: string;
  }): Promise<string> {
    const prompt = generateReasonPrompt(args);
    const result = await this.agent.generate(prompt, {
      output: z.object({
        reason: z.string(),
      }),
    });
 
    return result.object.reason;
  }
}
src/mastra/evals/recipe-completeness/index.ts
import { Metric, type MetricResult } from '@mastra/core/eval';
import { type LanguageModel } from '@mastra/core/llm';
 
import { RecipeCompletenessJudge } from './metricJudge';
 
export interface RecipeCompletenessMetricOptions {
  scale?: number;
}
 
export interface MetricResultWithInfo extends MetricResult {
  info: {
    reason: string;
    missing: string[];
  };
}
 
export class RecipeCompletenessMetric extends Metric {
  private judge: RecipeCompletenessJudge;
  private scale: number;
  constructor(model: LanguageModel, { scale = 1 }: RecipeCompletenessMetricOptions = {}) {
    super();
 
    this.judge = new RecipeCompletenessJudge(model);
    this.scale = scale;
  }
 
  async measure(input: string, output: string): Promise<MetricResultWithInfo> {
    const { verdict, missing } = await this.judge.evaluate(input, output);
    const score = this.calculateScore({ verdict });
    const reason = await this.judge.getReason({
      input,
      output,
      verdict,
      missing,
    });
 
    return {
      score,
      info: {
        missing,
        reason,
      },
    };
  }
 
  private calculateScore(verdict: { verdict: string }): number {
    return verdict.verdict.toLowerCase() === 'incomplete' ? 0 : 1;
  }
}
src/mastra/agents/chefAgent.ts
import { openai } from '@ai-sdk/openai';
import { Agent } from '@mastra/core/agent';
 
import { RecipeCompletenessMetric } from '../evals';
 
export const chefAgent = new Agent({
  name: 'chef-agent',
  instructions:
    'You are Michel, a practical and experienced home chef' +
    'You help people cook with whatever ingredients they have available.',
  model: openai('gpt-4o-mini'),
  evals: {
    recipeCompleteness: new RecipeCompletenessMetric(openai('gpt-4o-mini')),
  },
});

You can now use the RecipeCompletenessMetric in your project. See the full example here.