runEvals

The runEvals function enables batch evaluation of agents and workflows by running multiple test cases against scorers concurrently. This is essential for systematic testing, performance analysis, and validation of AI systems.

Usage example

import { runEvals } from '@mastra/core/evals'
import { myAgent } from './agents/my-agent'
import { myScorer1, myScorer2 } from './scorers'

const result = await runEvals({
  target: myAgent,
  data: [
    { input: 'What is machine learning?' },
    { input: 'Explain neural networks' },
    { input: 'How does AI work?' },
  ],
  scorers: [myScorer1, myScorer2],
  targetOptions: { maxSteps: 5 },
  concurrency: 2,
  onItemComplete: ({ item, targetResult, scorerResults }) => {
    console.log(`Completed: ${item.input}`)
    console.log(`Scores:`, scorerResults)
  },
})

console.log(`Average scores:`, result.scores)
console.log(`Processed ${result.summary.totalItems} items`)

Parameters

target:

Agent | Workflow
The agent or workflow to evaluate.

data:

RunEvalsDataItem[]
Array of test cases with input data and optional ground truth.

scorers:

MastraScorer[] | AgentScorerConfig | WorkflowScorerConfig
Scorers to use. A flat array applies all scorers to the raw output. For agents, an `AgentScorerConfig` object separates agent-level and trajectory scorers. For workflows, a `WorkflowScorerConfig` object specifies scorers for the workflow, individual steps, and trajectory.

targetOptions?:

AgentExecutionOptions | WorkflowRunOptions
Options forwarded to the target during execution. For agents: options passed to agent.generate() (e.g. maxSteps, modelSettings, instructions). For workflows: options passed to run.start() (e.g. perStep, outputOptions, initialState).

concurrency?:

number
= 1
Number of test cases to run concurrently.

onItemComplete?:

function
Callback function called after each test case completes. Receives item, target result, and scorer results.
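To make the three accepted `scorers` shapes concrete, here is a minimal sketch of each (the scorer variables are illustrative, not part of the API):

```typescript
// Flat array: every scorer is applied to the raw target output
scorers: [qualityScorer, relevancyScorer]

// AgentScorerConfig: separate agent-level and trajectory scorers
scorers: {
  agent: [qualityScorer],
  trajectory: [trajectoryScorer],
}

// WorkflowScorerConfig: workflow-, step-, and trajectory-level scorers
scorers: {
  workflow: [qualityScorer],
  steps: { 'my-step': [stepScorer] },
  trajectory: [trajectoryScorer],
}
```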

Data item structure

input:

string | string[] | CoreMessage[] | any
Input data for the target. For agents: messages or strings. For workflows: workflow input data.

groundTruth?:

any
Expected or reference output for comparison during scoring.

expectedTrajectory?:

TrajectoryExpectation
Expected trajectory configuration for trajectory scoring. Includes expected steps, ordering, efficiency budgets, blacklists, and tool failure tolerance. Passed to trajectory scorers as `run.expectedTrajectory`. Overrides the static defaults in scorer constructors.

requestContext?:

RequestContext
Request context passed to the target during execution.

tracingContext?:

TracingContext
Tracing context for observability and debugging.

startOptions?:

WorkflowRunOptions
Per-item workflow run options (e.g. initialState, perStep, outputOptions). Merged on top of targetOptions, so per-item values take precedence. Only applicable when the target is a workflow.
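Putting these fields together, a single data item for an agent target might look like this (the values and tool name are illustrative):

```typescript
// One test case: input plus optional reference output and expected trajectory
const item = {
  input: 'What is the weather in London?',
  groundTruth: 'It is currently raining in London.',
  expectedTrajectory: {
    steps: [{ stepType: 'tool_call', name: 'weatherTool' }],
  },
}
```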

Agent scorer configuration

For agents, use AgentScorerConfig to separate agent-level scorers from trajectory scorers:

agent?:

MastraScorer[]
Scorers that receive the raw agent output (MastraDBMessage[]). Use for evaluating response quality, content, etc.

trajectory?:

MastraScorer[]
Scorers that receive a pre-extracted Trajectory object. When storage is configured, the pipeline extracts a hierarchical trajectory from observability traces (including nested tool calls and model generations). Otherwise, it falls back to extracting tool calls from agent messages.

Workflow scorer configuration

For workflows, use WorkflowScorerConfig to specify scorers at different levels:

workflow?:

MastraScorer[]
Scorers to evaluate the entire workflow output.

steps?:

Record<string, MastraScorer[]>
Object mapping step IDs to arrays of scorers for evaluating individual step outputs.

trajectory?:

MastraScorer[]
Scorers that receive a pre-extracted Trajectory from the workflow execution. When storage is configured, the pipeline extracts a hierarchical trajectory from observability traces (including nested agent runs and tool calls within workflow steps). Otherwise, it falls back to extracting step results from the workflow output.

Returns

scores:

Record<string, any>
Average scores across all test cases, organized by scorer name.

summary:

object
Summary information about the experiment execution.

summary.totalItems:

number
Total number of test cases processed.
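Assuming two scorers named quality and relevancy ran over three items, the returned object has roughly this shape (the numbers are illustrative):

```typescript
// Sketch of a runEvals result: per-scorer averages plus a run summary
const result = {
  scores: {
    quality: 0.85, // average across all test cases
    relevancy: 0.92,
  },
  summary: {
    totalItems: 3,
  },
}
```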

Examples

Agent evaluation

import { createScorer, runEvals } from '@mastra/core/evals'

const myScorer = createScorer({
  id: 'my-scorer',
  description: "Check if the agent's response contains the ground truth",
  type: 'agent',
}).generateScore(({ run }) => {
  const response = run.output[0]?.content || ''
  const expectedResponse = run.groundTruth
  return response.includes(expectedResponse) ? 1 : 0
})

const result = await runEvals({
  target: chatAgent,
  data: [
    {
      input: 'What is AI?',
      groundTruth: 'AI is a field of computer science that creates intelligent machines.',
    },
    {
      input: 'How does machine learning work?',
      groundTruth: 'Machine learning uses algorithms to learn patterns from data.',
    },
  ],
  scorers: [myScorer],
  concurrency: 3,
})

Agent trajectory evaluation

Use AgentScorerConfig to evaluate both the agent response and its tool-calling trajectory:

import { runEvals } from '@mastra/core/evals'
import { createTrajectoryAccuracyScorerCode } from '@mastra/evals/scorers/code/trajectory'

const trajectoryScorer = createTrajectoryAccuracyScorerCode()

const result = await runEvals({
  target: chatAgent,
  data: [
    {
      input: 'What is the weather in London?',
      expectedTrajectory: {
        steps: [{ stepType: 'tool_call', name: 'weatherTool' }],
      },
    },
  ],
  scorers: {
    // agent: [responseQualityScorer], // Optional: add agent-level scorers
    trajectory: [trajectoryScorer],
  },
})

// result.scores.agent — average agent-level scores
// result.scores.trajectory — average trajectory scores

Agent with targetOptions

Pass execution options like maxSteps or modelSettings to customize agent behavior during evaluation:

const result = await runEvals({
  target: chatAgent,
  data: [{ input: 'Summarize this article' }, { input: 'Translate to French' }],
  scorers: [relevancyScorer],
  targetOptions: {
    maxSteps: 5,
    modelSettings: { temperature: 0 },
  },
})

Workflow evaluation

const workflowResult = await runEvals({
  target: myWorkflow,
  data: [
    { input: { query: 'Process this data', priority: 'high' } },
    { input: { query: 'Another task', priority: 'low' } },
  ],
  scorers: {
    workflow: [outputQualityScorer],
    steps: {
      'validation-step': [validationScorer],
      'processing-step': [processingScorer],
    },
  },
  onItemComplete: ({ item, targetResult, scorerResults }) => {
    console.log(`Workflow completed for: ${item.input.query}`)
    if (scorerResults.workflow) {
      console.log('Workflow scores:', scorerResults.workflow)
    }
    if (scorerResults.steps) {
      console.log('Step scores:', scorerResults.steps)
    }
  },
})

Workflow trajectory evaluation

Add trajectory scoring to workflow evaluations to validate step execution order:

const workflowResult = await runEvals({
  target: myWorkflow,
  data: [
    {
      input: { query: 'Process this data' },
      expectedTrajectory: {
        steps: [
          { stepType: 'workflow_step', name: 'validate' },
          { stepType: 'workflow_step', name: 'process' },
          { stepType: 'workflow_step', name: 'output' },
        ],
      },
    },
  ],
  scorers: {
    workflow: [outputQualityScorer],
    steps: {
      validate: [validationScorer],
    },
    trajectory: [trajectoryScorer],
  },
})

// workflowResult.scores.trajectory — workflow trajectory scores

Workflow with per-item startOptions

Use startOptions on individual data items to customize each workflow run. Per-item values take precedence over targetOptions:

const result = await runEvals({
  target: myWorkflow,
  data: [
    {
      input: { query: 'hello' },
      startOptions: { initialState: { counter: 1 } },
    },
    {
      input: { query: 'world' },
      startOptions: { initialState: { counter: 2 } },
    },
  ],
  scorers: [outputQualityScorer],
  targetOptions: { perStep: true },
})