Running Scorers in CI
Running scorers in your CI pipeline provides quantifiable metrics for measuring agent quality over time. The runEvals function processes multiple test cases through your agent or workflow and returns aggregate scores.
Basic Setup
You can use any testing framework that supports ESM modules, such as Vitest, Jest, or Mocha.
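With Vitest, for example, a minimal config might raise the default 5-second test timeout, since evaluation runs that call an LLM can be slow. This is a sketch; the timeout value is an assumption you should tune to your agent:

```typescript
// vitest.config.ts — minimal sketch; the 60s timeout is illustrative,
// tune it to how long your agent calls actually take.
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    testTimeout: 60_000, // eval runs that hit an LLM can exceed the 5s default
  },
});
```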
Creating Test Cases
Use runEvals to evaluate your agent against multiple test cases. The function accepts an array of data items, each containing an input and optional groundTruth for scorer validation.
src/mastra/agents/weather-agent.test.ts
import { describe, it, expect } from 'vitest';
import { runEvals } from '@mastra/core/evals';
import { weatherAgent } from './weather-agent';
import { locationScorer } from '../scorers/location-scorer';

describe('Weather Agent Tests', () => {
  it('should correctly extract locations from queries', async () => {
    const result = await runEvals({
      data: [
        {
          input: 'weather in Berlin',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'DE' }
        },
        {
          input: 'weather in Berlin, Maryland',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'US' }
        },
        {
          input: 'weather in Berlin, Russia',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'RU' }
        },
      ],
      target: weatherAgent,
      scorers: [locationScorer]
    });

    // Assert the aggregate score meets the expected threshold
    expect(result.scores['location-accuracy']).toBe(1);
    expect(result.summary.totalItems).toBe(3);
  });
});
Understanding Results
The runEvals function returns an object with:
- scores: Average scores for each scorer across all test cases
- summary.totalItems: Total number of test cases processed
{
  scores: {
    'location-accuracy': 1.0, // Average score across all items
    'another-scorer': 0.85
  },
  summary: {
    totalItems: 3
  }
}
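To make the aggregation concrete, here is a plain-TypeScript sketch of how per-scorer averages like those above can be derived from per-item results. This is an illustration only; the type and function names are hypothetical, not Mastra APIs:

```typescript
// Hypothetical shape for one evaluated item's scores (illustration only).
type ItemResult = { scores: Record<string, number> };

// Average each scorer's score across all items, mirroring the aggregate
// `scores` object returned by runEvals.
function aggregateScores(items: ItemResult[]): Record<string, number> {
  const totals: Record<string, { sum: number; count: number }> = {};
  for (const item of items) {
    for (const [scorer, score] of Object.entries(item.scores)) {
      const t = (totals[scorer] ??= { sum: 0, count: 0 });
      t.sum += score;
      t.count += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(totals).map(([scorer, t]) => [scorer, t.sum / t.count]),
  );
}

// Three items that each scored 1 average out to 1.
const avg = aggregateScores([
  { scores: { 'location-accuracy': 1 } },
  { scores: { 'location-accuracy': 1 } },
  { scores: { 'location-accuracy': 1 } },
]);
// avg['location-accuracy'] === 1
```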
Multiple Test Scenarios
Create separate test cases for different evaluation scenarios:
src/mastra/agents/weather-agent.test.ts
describe('Weather Agent Tests', () => {
  const locationScorer = createScorer({ /* ... */ });

  it('should handle location disambiguation', async () => {
    const result = await runEvals({
      data: [
        { input: 'weather in Berlin', groundTruth: { /* ... */ } },
        { input: 'weather in Berlin, Maryland', groundTruth: { /* ... */ } },
      ],
      target: weatherAgent,
      scorers: [locationScorer]
    });

    expect(result.scores['location-accuracy']).toBe(1);
  });

  it('should handle typos and misspellings', async () => {
    const result = await runEvals({
      data: [
        { input: 'weather in Berln', groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'DE' } },
        { input: 'weather in Parris', groundTruth: { expectedLocation: 'Paris', expectedCountry: 'FR' } },
      ],
      target: weatherAgent,
      scorers: [locationScorer]
    });

    expect(result.scores['location-accuracy']).toBe(1);
  });
});
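Exact-match assertions such as toBe(1) can be brittle when the agent is nondeterministic. One option, sketched below with a hypothetical helper (not part of Mastra), is to gate CI on a minimum average score instead:

```typescript
// Hypothetical helper, not a Mastra API: throw (failing the CI job) when
// any scorer's average score falls below a minimum threshold.
function assertScoresAbove(
  scores: Record<string, number>,
  threshold: number,
): void {
  for (const [scorer, score] of Object.entries(scores)) {
    if (score < threshold) {
      throw new Error(`Scorer "${scorer}" averaged ${score}, below ${threshold}`);
    }
  }
}

// Passes for the example result shown earlier; the 0.8 threshold is illustrative.
assertScoresAbove({ 'location-accuracy': 1.0, 'another-scorer': 0.85 }, 0.8);
```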
Next Steps
- Learn about creating custom scorers
- Explore built-in scorers
- Read the runEvals API reference