# runExperiment

The `runExperiment` function enables batch evaluation of agents and workflows by running multiple test cases against scorers concurrently. This is essential for systematic testing, performance analysis, and validation of AI systems.
## Usage Example

```typescript
import { runExperiment } from '@mastra/core/scores';
import { myAgent } from './agents/my-agent';
import { myScorer1, myScorer2 } from './scorers';

const result = await runExperiment({
  target: myAgent,
  data: [
    { input: "What is machine learning?" },
    { input: "Explain neural networks" },
    { input: "How does AI work?" }
  ],
  scorers: [myScorer1, myScorer2],
  concurrency: 2,
  onItemComplete: ({ item, targetResult, scorerResults }) => {
    console.log(`Completed: ${item.input}`);
    console.log(`Scores:`, scorerResults);
  }
});

console.log(`Average scores:`, result.scores);
console.log(`Processed ${result.summary.totalItems} items`);
```
## Parameters

- `target` (`Agent | Workflow`): The agent or workflow to evaluate.
- `data` (`RunExperimentDataItem[]`): Array of test cases with input data and optional ground truth.
- `scorers` (`MastraScorer[] | WorkflowScorerConfig`): Array of scorers for agents, or a configuration object for workflows specifying scorers for the entire workflow and for individual steps.
- `concurrency?` (`number`, default: `1`): Number of test cases to run concurrently.
- `onItemComplete?` (`function`): Callback invoked after each test case completes. Receives the item, the target result, and the scorer results.
### Data Item Structure

Each entry in the `data` array is a `RunExperimentDataItem` with the following fields (see the sketch after the list):

- `input` (`string | string[] | CoreMessage[] | any`): Input data for the target. For agents: messages or strings. For workflows: workflow input data.
- `groundTruth?` (`any`): Expected or reference output for comparison during scoring.
- `runtimeContext?` (`RuntimeContext`): Runtime context to pass to the target during execution.
- `tracingContext?` (`TracingContext`): Tracing context for observability and debugging.
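A minimal sketch of a `data` array combining these fields; the `RuntimeContext` import path and the `locale` key are assumptions for illustration:

```typescript
import { RuntimeContext } from '@mastra/core/runtime-context'; // assumed import path

// Hypothetical runtime context passed through to the target during execution
const runtimeContext = new RuntimeContext();
runtimeContext.set('locale', 'en-US'); // 'locale' is a placeholder key

const data = [
  {
    input: "What is machine learning?",                          // agent input as a plain string
    groundTruth: "Machine learning learns patterns from data.",  // reference output for scorers
    runtimeContext,
  },
  { input: "Explain neural networks" }, // groundTruth and contexts are optional
];
```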
### Workflow Scorer Configuration

For workflows, you can specify scorers at different levels using `WorkflowScorerConfig` (sketched below the list):

- `workflow?` (`MastraScorer[]`): Array of scorers to evaluate the entire workflow output.
- `steps?` (`Record<string, MastraScorer[]>`): Object mapping step IDs to arrays of scorers for evaluating individual step outputs.
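As a sketch, a `WorkflowScorerConfig` passed to `scorers` might look like this; the scorer variables and the step ID are placeholders:

```typescript
// Placeholder scorers; see createScorer() for building real ones
const scorers = {
  // Scores the final output of the whole workflow
  workflow: [outputQualityScorer],
  // Scores the output of individual steps, keyed by step ID
  steps: {
    'validation-step': [validationScorer],
  },
};
```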
## Returns

`runExperiment` resolves to an object with the following fields:

- `scores` (`Record<string, any>`): Average scores across all test cases, organized by scorer name.
- `summary` (`object`): Summary information about the experiment execution.
- `summary.totalItems` (`number`): Total number of test cases processed.
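Reading the returned result might look like the sketch below; it assumes a scorer registered under the name `My Scorer`, as in the agent example further down:

```typescript
const result = await runExperiment({ target: myAgent, data, scorers: [myScorer] });

// Average score per scorer, keyed by scorer name ('My Scorer' is assumed here)
console.log(result.scores['My Scorer']);

// Number of test cases that were processed
console.log(result.summary.totalItems);
```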
## Examples

### Agent Evaluation

```typescript
import { runExperiment, createScorer } from '@mastra/core/scores';
import { chatAgent } from './agents/chat-agent'; // assumed path to an agent defined in your project

// Custom scorer that checks whether the agent's response contains the ground truth
const myScorer = createScorer({
  name: 'My Scorer',
  description: "Check if Agent's response contains ground truth",
  type: 'agent'
}).generateScore(({ run }) => {
  const response = run.output[0]?.content || '';
  const expectedResponse = run.groundTruth;
  return response.includes(expectedResponse) ? 1 : 0;
});

const result = await runExperiment({
  target: chatAgent,
  data: [
    {
      input: "What is AI?",
      groundTruth: "AI is a field of computer science that creates intelligent machines."
    },
    {
      input: "How does machine learning work?",
      groundTruth: "Machine learning uses algorithms to learn patterns from data."
    }
  ],
  scorers: [myScorer],
  concurrency: 3
});
```
### Workflow Evaluation

```typescript
const workflowResult = await runExperiment({
  target: myWorkflow,
  data: [
    { input: { query: "Process this data", priority: "high" } },
    { input: { query: "Another task", priority: "low" } }
  ],
  scorers: {
    workflow: [outputQualityScorer],
    steps: {
      'validation-step': [validationScorer],
      'processing-step': [processingScorer]
    }
  },
  onItemComplete: ({ item, targetResult, scorerResults }) => {
    console.log(`Workflow completed for: ${item.input.query}`);
    if (scorerResults.workflow) {
      console.log('Workflow scores:', scorerResults.workflow);
    }
    if (scorerResults.steps) {
      console.log('Step scores:', scorerResults.steps);
    }
  }
});
```
## Related
- createScorer() - Create custom scorers for experiments
- MastraScorer - Learn about scorer structure and methods
- Custom Scorers - Guide to building evaluation logic
- Scorers Overview - Understanding scorer concepts