runEvals

The runEvals function enables batch evaluation of agents and workflows by running multiple test cases against scorers concurrently. This is essential for systematic testing, performance analysis, and validation of AI systems.

Usage example

import { runEvals } from '@mastra/core/evals'
import { myAgent } from './agents/my-agent'
import { myScorer1, myScorer2 } from './scorers'

const result = await runEvals({
  target: myAgent,
  data: [
    { input: 'What is machine learning?' },
    { input: 'Explain neural networks' },
    { input: 'How does AI work?' },
  ],
  scorers: [myScorer1, myScorer2],
  targetOptions: { maxSteps: 5 },
  concurrency: 2,
  onItemComplete: ({ item, targetResult, scorerResults }) => {
    console.log(`Completed: ${item.input}`)
    console.log(`Scores:`, scorerResults)
  },
})

console.log(`Average scores:`, result.scores)
console.log(`Processed ${result.summary.totalItems} items`)

Parameters

target:

Agent | Workflow
The agent or workflow to evaluate.

data:

RunEvalsDataItem[]
Array of test cases with input data and optional ground truth.

scorers:

MastraScorer[] | AgentScorerConfig | WorkflowScorerConfig
Scorers to use. A flat array applies all scorers to the raw output. For agents, an `AgentScorerConfig` object separates agent-level and trajectory scorers. For workflows, a `WorkflowScorerConfig` object specifies scorers for the workflow, individual steps, and trajectory.

targetOptions?:

AgentExecutionOptions | WorkflowRunOptions
Options forwarded to the target during execution. For agents: options passed to agent.generate() (e.g. maxSteps, modelSettings, instructions). For workflows: options passed to run.start() (e.g. perStep, outputOptions, initialState).

concurrency?:

number
= 1
Number of test cases to run concurrently.

onItemComplete?:

function
Callback function called after each test case completes. Receives item, target result, and scorer results.
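To make the three accepted `scorers` shapes concrete, here is a minimal sketch of each (the scorer variables are illustrative, not part of the API):

```typescript
// Flat array: every scorer is applied to the raw target output
scorers: [qualityScorer, relevancyScorer]

// AgentScorerConfig: separate agent-level and trajectory scorers
scorers: {
  agent: [qualityScorer],
  trajectory: [trajectoryScorer],
}

// WorkflowScorerConfig: workflow-, step-, and trajectory-level scorers
scorers: {
  workflow: [qualityScorer],
  steps: { 'my-step': [stepScorer] },
  trajectory: [trajectoryScorer],
}
```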

Data item structure

input:

string | string[] | CoreMessage[] | any
Input data for the target. For agents: messages or strings. For workflows: workflow input data.

groundTruth?:

any
Expected or reference output for comparison during scoring.

expectedTrajectory?:

TrajectoryExpectation
Expected trajectory configuration for trajectory scoring. Includes expected steps, ordering, efficiency budgets, blacklists, and tool failure tolerance. Passed to trajectory scorers as `run.expectedTrajectory`. Overrides the static defaults in scorer constructors.

requestContext?:

RequestContext
Request context passed to the target during execution.

tracingContext?:

TracingContext
Tracing context for observability and debugging.

startOptions?:

WorkflowRunOptions
Per-item workflow run options (e.g. initialState, perStep, outputOptions). Merged on top of targetOptions, so per-item values take precedence. Only applicable when the target is a workflow.
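Putting these fields together, a single data item for an agent target might look like this (the values and tool name are illustrative):

```typescript
// One test case: input plus optional reference output and expected trajectory
const item = {
  input: 'What is the weather in London?',
  groundTruth: 'It is currently raining in London.',
  expectedTrajectory: {
    steps: [{ stepType: 'tool_call', name: 'weatherTool' }],
  },
}
```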

Agent scorer configuration

For agents, use AgentScorerConfig to separate agent-level scorers from trajectory scorers:

agent?:

MastraScorer[]
Scorers that receive the raw agent output (MastraDBMessage[]). Use for evaluating response quality, content, etc.

trajectory?:

MastraScorer[]
Scorers that receive a pre-extracted Trajectory object. When storage is configured, the pipeline extracts a hierarchical trajectory from observability traces (including nested tool calls and model generations). Otherwise, it falls back to extracting tool calls from agent messages.

Workflow scorer configuration

For workflows, use WorkflowScorerConfig to specify scorers at different levels:

workflow?:

MastraScorer[]
Scorers to evaluate the entire workflow output.

steps?:

Record<string, MastraScorer[]>
Object mapping step IDs to arrays of scorers for evaluating individual step outputs.

trajectory?:

MastraScorer[]
Scorers that receive a pre-extracted Trajectory from the workflow execution. When storage is configured, the pipeline extracts a hierarchical trajectory from observability traces (including nested agent runs and tool calls within workflow steps). Otherwise, it falls back to extracting step results from the workflow output.

Returns

scores:

Record<string, any>
Average scores across all test cases, organized by scorer name.

summary:

object
Summary information about the experiment execution.

summary.totalItems:

number
Total number of test cases processed.
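Assuming two scorers named quality and relevancy ran over three items, the returned object has roughly this shape (the numbers are illustrative):

```typescript
// Sketch of a runEvals result: per-scorer averages plus a run summary
const result = {
  scores: {
    quality: 0.85, // average across all test cases
    relevancy: 0.92,
  },
  summary: {
    totalItems: 3,
  },
}
```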

Examples

Agent evaluation

import { createScorer, runEvals } from '@mastra/core/evals'

const myScorer = createScorer({
  id: 'my-scorer',
  description: "Check if the agent's response contains the ground truth",
  type: 'agent',
}).generateScore(({ run }) => {
  const response = run.output[0]?.content || ''
  const expectedResponse = run.groundTruth
  return response.includes(expectedResponse) ? 1 : 0
})

const result = await runEvals({
  target: chatAgent,
  data: [
    {
      input: 'What is AI?',
      groundTruth: 'AI is a field of computer science that creates intelligent machines.',
    },
    {
      input: 'How does machine learning work?',
      groundTruth: 'Machine learning uses algorithms to learn patterns from data.',
    },
  ],
  scorers: [myScorer],
  concurrency: 3,
})

Agent trajectory evaluation

Use AgentScorerConfig to evaluate both the agent response and its tool-calling trajectory:

import { runEvals } from '@mastra/core/evals'
import { createTrajectoryAccuracyScorerCode } from '@mastra/evals/scorers/code/trajectory'

const trajectoryScorer = createTrajectoryAccuracyScorerCode()

const result = await runEvals({
  target: chatAgent,
  data: [
    {
      input: 'What is the weather in London?',
      expectedTrajectory: {
        steps: [{ stepType: 'tool_call', name: 'weatherTool' }],
      },
    },
  ],
  scorers: {
    // agent: [responseQualityScorer], // Optional: add agent-level scorers
    trajectory: [trajectoryScorer],
  },
})

// result.scores.agent — average agent-level scores
// result.scores.trajectory — average trajectory scores

Agent with targetOptions

Pass execution options like maxSteps or modelSettings to customize agent behavior during evaluation:

const result = await runEvals({
  target: chatAgent,
  data: [{ input: 'Summarize this article' }, { input: 'Translate to French' }],
  scorers: [relevancyScorer],
  targetOptions: {
    maxSteps: 5,
    modelSettings: { temperature: 0 },
  },
})

Workflow evaluation

const workflowResult = await runEvals({
  target: myWorkflow,
  data: [
    { input: { query: 'Process this data', priority: 'high' } },
    { input: { query: 'Another task', priority: 'low' } },
  ],
  scorers: {
    workflow: [outputQualityScorer],
    steps: {
      'validation-step': [validationScorer],
      'processing-step': [processingScorer],
    },
  },
  onItemComplete: ({ item, targetResult, scorerResults }) => {
    console.log(`Workflow completed for: ${item.input.query}`)
    if (scorerResults.workflow) {
      console.log('Workflow scores:', scorerResults.workflow)
    }
    if (scorerResults.steps) {
      console.log('Step scores:', scorerResults.steps)
    }
  },
})

Workflow trajectory evaluation

Add trajectory scoring to workflow evaluations to validate step execution order:

const workflowResult = await runEvals({
  target: myWorkflow,
  data: [
    {
      input: { query: 'Process this data' },
      expectedTrajectory: {
        steps: [
          { stepType: 'workflow_step', name: 'validate' },
          { stepType: 'workflow_step', name: 'process' },
          { stepType: 'workflow_step', name: 'output' },
        ],
      },
    },
  ],
  scorers: {
    workflow: [outputQualityScorer],
    steps: {
      validate: [validationScorer],
    },
    trajectory: [trajectoryScorer],
  },
})

// workflowResult.scores.trajectory — workflow trajectory scores

Workflow with per-item startOptions

Use startOptions on individual data items to customize each workflow run. Per-item values take precedence over targetOptions:

const result = await runEvals({
  target: myWorkflow,
  data: [
    {
      input: { query: 'hello' },
      startOptions: { initialState: { counter: 1 } },
    },
    {
      input: { query: 'world' },
      startOptions: { initialState: { counter: 2 } },
    },
  ],
  scorers: [outputQualityScorer],
  targetOptions: { perStep: true },
})