Running Scorers in CI

Running scorers in your CI pipeline provides quantifiable metrics for measuring agent quality over time. The runEvals function processes multiple test cases through your agent or workflow and returns aggregate scores.

Basic Setup

You can use any testing framework that supports ESM modules, such as Vitest, Jest, or Mocha.
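For example, a minimal Vitest configuration might raise the default test timeout, since agent runs often involve slow model calls. This is a sketch; the 60-second value is an assumption to tune for your pipeline:

```typescript
// vitest.config.ts — a minimal sketch; the 60s timeout is an assumption
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    testTimeout: 60_000, // agent/LLM calls can be slow in CI
  },
});
```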

Creating Test Cases

Use runEvals to evaluate your agent against multiple test cases. The function accepts an array of data items, each containing an input and optional groundTruth for scorer validation.

src/mastra/agents/weather-agent.test.ts
```typescript
import { describe, it, expect } from 'vitest';
import { createScorer, runEvals } from '@mastra/core/evals';
import { weatherAgent } from './weather-agent';
import { locationScorer } from '../scorers/location-scorer';

describe('Weather Agent Tests', () => {
  it('should correctly extract locations from queries', async () => {
    const result = await runEvals({
      data: [
        {
          input: 'weather in Berlin',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'DE' },
        },
        {
          input: 'weather in Berlin, Maryland',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'US' },
        },
        {
          input: 'weather in Berlin, Russia',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'RU' },
        },
      ],
      target: weatherAgent,
      scorers: [locationScorer],
    });

    // Assert the aggregate score and the number of items processed
    expect(result.scores['location-accuracy']).toBe(1);
    expect(result.summary.totalItems).toBe(3);
  });
});
```

Understanding Results

The runEvals function returns an object with:

  • scores: Average scores for each scorer across all test cases
  • summary.totalItems: Total number of test cases processed
```typescript
{
  scores: {
    'location-accuracy': 1.0, // Average score across all items
    'another-scorer': 0.85
  },
  summary: {
    totalItems: 3
  }
}
```
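To make the aggregation concrete, each scorer's aggregate value is the mean of its per-item scores. The helper below is a hypothetical sketch of that calculation, not part of the Mastra API:

```typescript
// Hypothetical sketch: aggregate scores are the mean of per-item scores.
type ItemScores = Record<string, number>;

function aggregateScores(items: ItemScores[]): Record<string, number> {
  const totals: Record<string, { sum: number; count: number }> = {};
  for (const item of items) {
    for (const [scorer, score] of Object.entries(item)) {
      const entry = (totals[scorer] ??= { sum: 0, count: 0 });
      entry.sum += score;
      entry.count += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(totals).map(([scorer, { sum, count }]) => [scorer, sum / count]),
  );
}

const perItem: ItemScores[] = [
  { 'location-accuracy': 1 },
  { 'location-accuracy': 1 },
  { 'location-accuracy': 1 },
];
console.log(aggregateScores(perItem)); // → { 'location-accuracy': 1 }
```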

Multiple Test Scenarios

Create separate test cases for different evaluation scenarios:

src/mastra/agents/weather-agent.test.ts
```typescript
describe('Weather Agent Tests', () => {
  const locationScorer = createScorer({ /* ... */ });

  it('should handle location disambiguation', async () => {
    const result = await runEvals({
      data: [
        { input: 'weather in Berlin', groundTruth: { /* ... */ } },
        { input: 'weather in Berlin, Maryland', groundTruth: { /* ... */ } },
      ],
      target: weatherAgent,
      scorers: [locationScorer],
    });

    expect(result.scores['location-accuracy']).toBe(1);
  });

  it('should handle typos and misspellings', async () => {
    const result = await runEvals({
      data: [
        { input: 'weather in Berln', groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'DE' } },
        { input: 'weather in Parris', groundTruth: { expectedLocation: 'Paris', expectedCountry: 'FR' } },
      ],
      target: weatherAgent,
      scorers: [locationScorer],
    });

    expect(result.scores['location-accuracy']).toBe(1);
  });
});
```
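Because LLM-backed agents are non-deterministic, exact-match assertions like `toBe(1)` can be brittle in CI; a minimum-threshold check is often more robust. The helper below is a hypothetical sketch of such a gate, not part of the Mastra API:

```typescript
// Hypothetical helper: fail CI when an aggregate score falls below a threshold.
function assertScoreAtLeast(
  scores: Record<string, number>,
  scorerName: string,
  threshold: number,
): void {
  const score = scores[scorerName];
  if (score === undefined) {
    throw new Error(`No score recorded for scorer "${scorerName}"`);
  }
  if (score < threshold) {
    throw new Error(`Scorer "${scorerName}" scored ${score}, below threshold ${threshold}`);
  }
}

// Passes: the average meets the bar
assertScoreAtLeast({ 'location-accuracy': 0.9 }, 'location-accuracy', 0.8);
```

In a Vitest test you could get the same effect with `expect(result.scores['location-accuracy']).toBeGreaterThanOrEqual(0.8)`.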

Next Steps