
runExperiment

The runExperiment function enables batch evaluation of agents and workflows: it runs multiple test cases against a target and evaluates each result with one or more scorers, with configurable concurrency. This makes it useful for systematic testing, performance analysis, and validation of AI systems.

Usage Example

import { runExperiment } from "@mastra/core/scores";
import { myAgent } from "./agents/my-agent";
import { myScorer1, myScorer2 } from "./scorers";

const result = await runExperiment({
  target: myAgent,
  data: [
    { input: "What is machine learning?" },
    { input: "Explain neural networks" },
    { input: "How does AI work?" },
  ],
  scorers: [myScorer1, myScorer2],
  concurrency: 2,
  onItemComplete: ({ item, targetResult, scorerResults }) => {
    console.log(`Completed: ${item.input}`);
    console.log(`Scores:`, scorerResults);
  },
});

console.log(`Average scores:`, result.scores);
console.log(`Processed ${result.summary.totalItems} items`);

Parameters

target: Agent | Workflow
The agent or workflow to evaluate.

data: RunExperimentDataItem[]
Array of test cases with input data and optional ground truth.

scorers: MastraScorer[] | WorkflowScorerConfig
Array of scorers for agents, or a configuration object for workflows that specifies scorers for the workflow as a whole and for individual steps.

concurrency?: number = 1
Number of test cases to run concurrently.

onItemComplete?: function
Callback invoked after each test case completes. Receives the item, the target result, and the scorer results.

Data Item Structure

input: string | string[] | CoreMessage[] | any
Input data for the target. For agents: messages or strings. For workflows: workflow input data.

groundTruth?: any
Expected or reference output to compare against during scoring.

runtimeContext?: RuntimeContext
Runtime context to pass to the target during execution.

tracingContext?: TracingContext
Tracing context for observability and debugging.
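
As a sketch of how these fields fit together in a data array (the RuntimeContext import path and the "temperature" key are assumptions for illustration, not part of this reference):

import { RuntimeContext } from "@mastra/core/runtime-context"; // assumed import path

// groundTruth is only read by scorers; runtimeContext is forwarded to the target.
const runtimeContext = new RuntimeContext();
runtimeContext.set("temperature", 0.2); // hypothetical key, for illustration

const data = [
  {
    input: "What is machine learning?",
    groundTruth: "Machine learning trains models to find patterns in data.",
    runtimeContext,
  },
  { input: "Explain neural networks" }, // all fields besides input are optional
];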

Workflow Scorer Configuration

For workflows, you can specify scorers at different levels using WorkflowScorerConfig:

workflow?: MastraScorer[]
Array of scorers that evaluate the entire workflow output.

steps?: Record<string, MastraScorer[]>
Object mapping step IDs to arrays of scorers that evaluate individual step outputs.
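
The Workflow Evaluation example below shows this configuration inside a full runExperiment call; as a standalone sketch (scorer names and step IDs are placeholders):

const workflowScorers = {
  // Applied to the workflow's final output
  workflow: [outputQualityScorer],
  // Applied to the output of a specific step, keyed by step ID
  steps: {
    "validation-step": [validationScorer],
  },
};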

Returns

scores: Record<string, any>
Average scores across all test cases, keyed by scorer name.

summary: object
Summary information about the experiment run.

summary.totalItems: number
Total number of test cases processed.
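
A minimal sketch of consuming the return value, reusing the agent and scorer from the usage example above (the exact shape of each per-scorer entry is not specified here, so values are logged generically):

const result = await runExperiment({
  target: myAgent,
  data: [{ input: "What is machine learning?" }],
  scorers: [myScorer1],
});

console.log(`Processed ${result.summary.totalItems} items`);
// result.scores is keyed by scorer name; each entry holds that scorer's averaged result
for (const [scorerName, averaged] of Object.entries(result.scores)) {
  console.log(scorerName, averaged);
}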

Examples

Agent Evaluation

import { runExperiment, createScorer } from "@mastra/core/scores";
import { chatAgent } from "./agents/chat-agent"; // adjust to your agent's location

// Custom scorer that checks whether the agent's response contains the ground truth
const myScorer = createScorer({
  name: "My Scorer",
  description: "Check if Agent's response contains ground truth",
  type: "agent",
}).generateScore(({ run }) => {
  const response = run.output[0]?.content || "";
  const expectedResponse = run.groundTruth;
  return response.includes(expectedResponse) ? 1 : 0;
});

const result = await runExperiment({
  target: chatAgent,
  data: [
    {
      input: "What is AI?",
      groundTruth:
        "AI is a field of computer science that creates intelligent machines.",
    },
    {
      input: "How does machine learning work?",
      groundTruth:
        "Machine learning uses algorithms to learn patterns from data.",
    },
  ],
  scorers: [myScorer],
  concurrency: 3,
});
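
Because each data item's groundTruth is passed through to the scorer's run, a custom scorer like the one above can compare the agent's response against the expected answer without any extra wiring.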

Workflow Evaluation

const workflowResult = await runExperiment({
  target: myWorkflow,
  data: [
    { input: { query: "Process this data", priority: "high" } },
    { input: { query: "Another task", priority: "low" } },
  ],
  scorers: {
    workflow: [outputQualityScorer],
    steps: {
      "validation-step": [validationScorer],
      "processing-step": [processingScorer],
    },
  },
  onItemComplete: ({ item, targetResult, scorerResults }) => {
    console.log(`Workflow completed for: ${item.input.query}`);
    if (scorerResults.workflow) {
      console.log("Workflow scores:", scorerResults.workflow);
    }
    if (scorerResults.steps) {
      console.log("Step scores:", scorerResults.steps);
    }
  },
});