runEvals
The runEvals function batch-evaluates agents and workflows: it runs multiple test cases against a target concurrently and scores each result with the scorers you provide. This is useful for systematic testing, performance analysis, and validation of AI systems.
Usage example
import { runEvals } from '@mastra/core/evals'
import { myAgent } from './agents/my-agent'
import { myScorer1, myScorer2 } from './scorers'
const result = await runEvals({
target: myAgent,
data: [
{ input: 'What is machine learning?' },
{ input: 'Explain neural networks' },
{ input: 'How does AI work?' },
],
scorers: [myScorer1, myScorer2],
targetOptions: { maxSteps: 5 },
concurrency: 2,
onItemComplete: ({ item, targetResult, scorerResults }) => {
console.log(`Completed: ${item.input}`)
console.log(`Scores:`, scorerResults)
},
})
console.log(`Average scores:`, result.scores)
console.log(`Processed ${result.summary.totalItems} items`)
Parameters
target: The agent or workflow to evaluate.
data: An array of test case items to run against the target.
scorers: The scorers to apply to each run, either as an array or as a scorer configuration object (see below).
targetOptions?: Execution options passed to the target on every run, such as maxSteps or modelSettings for agents.
concurrency?: The number of data items to evaluate in parallel.
onItemComplete?: A callback invoked after each item finishes, receiving the item, the target result, and the scorer results.
Data item structure
input: The input passed to the target for this test case.
groundTruth?: A reference answer made available to scorers via run.groundTruth.
expectedTrajectory?: The expected sequence of steps (tool calls or workflow steps) used by trajectory scorers.
requestContext?: Request context forwarded to the target for this run.
tracingContext?: Tracing context forwarded to the target for this run.
startOptions?: Workflow start options for this item; per-item values take precedence over targetOptions.
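For example, a data item that exercises both response and trajectory scoring might look like the following sketch (the groundTruth value and the weatherTool name are illustrative assumptions):
const item = {
  input: 'What is the weather in London?',
  // Assumed reference answer for scorers that read run.groundTruth
  groundTruth: 'It is currently raining in London.',
  expectedTrajectory: {
    steps: [{ stepType: 'tool_call', name: 'weatherTool' }],
  },
}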
Agent scorer configuration
For agents, use AgentScorerConfig to separate agent-level scorers from trajectory scorers:
agent?: Scorers applied to the agent's final response.
trajectory?: Scorers applied to the agent's tool-calling trajectory.
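For example (a minimal sketch; responseQualityScorer and trajectoryScorer stand in for your own scorers):
const result = await runEvals({
  target: myAgent,
  data: [{ input: 'What is AI?' }],
  scorers: {
    agent: [responseQualityScorer], // scores the agent's final response
    trajectory: [trajectoryScorer], // scores the tool-calling trajectory
  },
})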
Workflow scorer configuration
For workflows, use WorkflowScorerConfig to specify scorers at different levels:
workflow?: Scorers applied to the workflow's final output.
steps?: A map of step IDs to scorers applied to the corresponding step results.
trajectory?: Scorers applied to the workflow's step execution order.
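For example (a minimal sketch; the step ID and scorers are placeholders):
const result = await runEvals({
  target: myWorkflow,
  data: [{ input: { query: 'Process this data' } }],
  scorers: {
    workflow: [outputQualityScorer], // scores the workflow's final output
    steps: { 'validation-step': [validationScorer] }, // scores an individual step's result
    trajectory: [trajectoryScorer], // scores the step execution order
  },
})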
Returns
scores: Average scores across all items, keyed by scorer (or by level when a scorer configuration object is used).
summary: Summary statistics for the evaluation run.
summary.totalItems: The total number of data items processed.
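For example, with the array form of scorers you can read the aggregated results directly (a minimal sketch reusing the agent and scorer from the usage example above):
const { scores, summary } = await runEvals({
  target: myAgent,
  data: [{ input: 'What is machine learning?' }],
  scorers: [myScorer1],
})
console.log(scores) // average score per scorer across all items
console.log(summary.totalItems) // 1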
Examples
Agent evaluation
import { createScorer, runEvals } from '@mastra/core/evals'
import { chatAgent } from './agents/chat-agent'
const myScorer = createScorer({
id: 'my-scorer',
description: "Check if Agent's response contains ground truth",
type: 'agent',
}).generateScore(({ run }) => {
const response = run.output[0]?.content || ''
const expectedResponse = run.groundTruth
return response.includes(expectedResponse) ? 1 : 0
})
const result = await runEvals({
target: chatAgent,
data: [
{
input: 'What is AI?',
groundTruth: 'AI is a field of computer science that creates intelligent machines.',
},
{
input: 'How does machine learning work?',
groundTruth: 'Machine learning uses algorithms to learn patterns from data.',
},
],
scorers: [myScorer],
concurrency: 3,
})
Agent trajectory evaluation
Use AgentScorerConfig to evaluate both the agent response and its tool-calling trajectory:
import { runEvals } from '@mastra/core/evals'
import { createTrajectoryAccuracyScorerCode } from '@mastra/evals/scorers/code/trajectory'
const trajectoryScorer = createTrajectoryAccuracyScorerCode()
const result = await runEvals({
target: chatAgent,
data: [
{
input: 'What is the weather in London?',
expectedTrajectory: {
steps: [{ stepType: 'tool_call', name: 'weatherTool' }],
},
},
],
scorers: {
// agent: [responseQualityScorer], // Optional: add agent-level scorers
trajectory: [trajectoryScorer],
},
})
// result.scores.agent — average agent-level scores
// result.scores.trajectory — average trajectory scores
Agent with targetOptions
Pass execution options like maxSteps or modelSettings to customize agent behavior during evaluation:
const result = await runEvals({
target: chatAgent,
data: [{ input: 'Summarize this article' }, { input: 'Translate to French' }],
scorers: [relevancyScorer],
targetOptions: {
maxSteps: 5,
modelSettings: { temperature: 0 },
},
})
Workflow evaluation
const workflowResult = await runEvals({
target: myWorkflow,
data: [
{ input: { query: 'Process this data', priority: 'high' } },
{ input: { query: 'Another task', priority: 'low' } },
],
scorers: {
workflow: [outputQualityScorer],
steps: {
'validation-step': [validationScorer],
'processing-step': [processingScorer],
},
},
onItemComplete: ({ item, targetResult, scorerResults }) => {
console.log(`Workflow completed for: ${item.input.query}`)
if (scorerResults.workflow) {
console.log('Workflow scores:', scorerResults.workflow)
}
if (scorerResults.steps) {
console.log('Step scores:', scorerResults.steps)
}
},
})
Workflow trajectory evaluation
Add trajectory scoring to workflow evaluations to validate step execution order:
const workflowResult = await runEvals({
target: myWorkflow,
data: [
{
input: { query: 'Process this data' },
expectedTrajectory: {
steps: [
{ stepType: 'workflow_step', name: 'validate' },
{ stepType: 'workflow_step', name: 'process' },
{ stepType: 'workflow_step', name: 'output' },
],
},
},
],
scorers: {
workflow: [outputQualityScorer],
steps: {
validate: [validationScorer],
},
trajectory: [trajectoryScorer],
},
})
// workflowResult.scores.trajectory — workflow trajectory scores
Workflow with per-item startOptions
Use startOptions on individual data items to customize each workflow run. Per-item values take precedence over targetOptions:
const result = await runEvals({
target: myWorkflow,
data: [
{
input: { query: 'hello' },
startOptions: { initialState: { counter: 1 } },
},
{
input: { query: 'world' },
startOptions: { initialState: { counter: 2 } },
},
],
scorers: [outputQualityScorer],
targetOptions: { perStep: true },
})
Related
- createScorer() - Create custom scorers for experiments
- MastraScorer - Learn about scorer structure and methods
- Trajectory Accuracy - Built-in trajectory evaluation scorers
- Scorer Utilities - Helper functions for extracting trajectory data
- Custom Scorers - Guide to building evaluation logic
- Scorers Overview - Understanding scorer concepts