Testing your agents with evals
We just released a new evals API called Scorers, with a more ergonomic API and more metadata stored for error analysis, and more flexibility to evaluate data structures. It’s fairly simple to migrate, but we will continue to support the existing Evals API.
While traditional software tests have clear pass/fail conditions, AI outputs are non-deterministic — they can vary with the same input. Evals help bridge this gap by providing quantifiable metrics for measuring agent quality.
Evals are automated tests that evaluate Agents outputs using model-graded, rule-based, and statistical methods. Each eval returns a normalized score between 0-1 that can be logged and compared. Evals can be customized with your own prompts and scoring functions.
Evals can be run in the cloud, capturing real-time results. But evals can also be part of your CI/CD pipeline, allowing you to test and monitor your agents over time.
Types of Evals
There are different kinds of evals, each serving a specific purpose. Here are some common types:
- Textual Evals: Evaluate accuracy, reliability, and context understanding of agent responses
- Classification Evals: Measure accuracy in categorizing data based on predefined categories
- Prompt Engineering Evals: Explore impact of different instructions and input formats
Installation
To access Mastra’s evals feature install the @mastra/evals package.
npm install @mastra/evals@latestGetting Started
Evals need to be added to an agent. Here’s an example using the summarization, content similarity, and tone consistency metrics:
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
import { SummarizationMetric } from "@mastra/evals/llm";
import {
  ContentSimilarityMetric,
  ToneConsistencyMetric,
} from "@mastra/evals/nlp";
 
const model = openai("gpt-4o");
 
export const myAgent = new Agent({
  name: "ContentWriter",
  instructions: "You are a content writer that creates accurate summaries",
  model,
  evals: {
    summarization: new SummarizationMetric(model),
    contentSimilarity: new ContentSimilarityMetric(),
    tone: new ToneConsistencyMetric(),
  },
});You can view eval results in the Mastra dashboard when using mastra dev.
Beyond Automated Testing
While automated evals are valuable, high-performing AI teams often combine them with:
- A/B Testing: Compare different versions with real users
- Human Review: Regular review of production data and traces
- Continuous Monitoring: Track eval metrics over time to detect regressions
Understanding Eval Results
Each eval metric measures a specific aspect of your agent’s output. Here’s how to interpret and improve your results:
Understanding Scores
For any metric:
- Check the metric documentation to understand the scoring process
- Look for patterns in when scores change
- Compare scores across different inputs and contexts
- Track changes over time to spot trends
Improving Results
When scores aren’t meeting your targets:
- Check your instructions - Are they clear? Try making them more specific
- Look at your context - Is it giving the agent what it needs?
- Simplify your prompts - Break complex tasks into smaller steps
- Add guardrails - Include specific rules for tricky cases
Maintaining Quality
Once you’re hitting your targets:
- Monitor stability - Do scores remain consistent?
- Document what works - Keep notes on successful approaches
- Test edge cases - Add examples that cover unusual scenarios
- Fine-tune - Look for ways to improve efficiency
See Textual Evals for more info on what evals can do.
For more info on how to create your own evals, see the Custom Evals guide.
For running evals in your CI pipeline, see the Running in CI guide.