Running Evals
Evals are automated tests that assess LLM outputs using model-graded, rule-based, and statistical methods. Each eval returns a normalized score between 0 and 1 that can be logged and compared. Evals can be customized with your own prompts and scoring functions.
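As a minimal sketch of the scoring idea, here is a simple rule-based eval that returns a normalized 0-1 score. The `keywordCoverage` function and `EvalResult` type are illustrative, not part of any particular framework:

```typescript
// Rule-based eval sketch: score an LLM answer by keyword coverage,
// normalized to the 0-1 range described above.
type EvalResult = { name: string; score: number };

function keywordCoverage(output: string, expectedKeywords: string[]): EvalResult {
  const hits = expectedKeywords.filter((kw) =>
    output.toLowerCase().includes(kw.toLowerCase()),
  ).length;
  return {
    name: "keyword-coverage",
    // 1 means every expected keyword appeared in the output.
    score: expectedKeywords.length === 0 ? 1 : hits / expectedKeywords.length,
  };
}

// Example usage:
const result = keywordCoverage(
  "Paris is the capital of France.",
  ["paris", "france"],
);
console.log(result); // { name: "keyword-coverage", score: 1 }
```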
Eval suites run in the cloud, but since they are tests, it makes sense to store them in your codebase. Because LLMs are non-deterministic, you might not get a 100% pass rate every time.
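One way to handle that non-determinism, sketched below with a placeholder `scoreAnswer` eval and Vitest-style assertions, is to run an eval several times and assert that the average score clears a threshold rather than expecting every run to pass:

```typescript
import { describe, it, expect } from "vitest";

// Placeholder eval that returns a normalized 0-1 score; in a real suite
// this would call your model and scorer of choice.
async function scoreAnswer(_prompt: string): Promise<number> {
  return 0.9;
}

describe("answer quality", () => {
  it("clears a 0.8 average score across runs", async () => {
    const runs = 5;
    const scores = await Promise.all(
      Array.from({ length: runs }, () =>
        scoreAnswer("What is the capital of France?"),
      ),
    );
    const average = scores.reduce((sum, s) => sum + s, 0) / runs;
    // Assert on an average threshold instead of a 100% pass rate.
    expect(average).toBeGreaterThanOrEqual(0.8);
  });
});
```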
Mastra recommends running evals with Braintrust's eval framework, autoevals. Braintrust has a free tier that should be enough for most use cases.
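As a rough sketch of what that looks like with autoevals, using its model-graded `Factuality` scorer and assuming an OpenAI API key is available in the environment (check the autoevals docs for the current API):

```typescript
import { Factuality } from "autoevals";

// Model-graded scorers call an LLM under the hood, so an API key
// (e.g. OPENAI_API_KEY) is assumed to be set in the environment.
async function main() {
  const result = await Factuality({
    input: "Which country has the highest population?",
    output: "People's Republic of China",
    expected: "China",
  });

  // Like other evals, the result carries a normalized score between 0 and 1.
  console.log(result.score);
}

main();
```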
Other open-source eval frameworks: