# runExperiment

The `runExperiment` function enables batch evaluation of agents and workflows by running multiple test cases against scorers concurrently. This is essential for systematic testing, performance analysis, and validation of AI systems.
## Usage Example

```typescript
import { runExperiment } from '@mastra/core/scores';
import { myAgent } from './agents/my-agent';
import { myScorer1, myScorer2 } from './scorers';

const result = await runExperiment({
  target: myAgent,
  data: [
    { input: "What is machine learning?" },
    { input: "Explain neural networks" },
    { input: "How does AI work?" }
  ],
  scorers: [myScorer1, myScorer2],
  concurrency: 2,
  onItemComplete: ({ item, targetResult, scorerResults }) => {
    console.log(`Completed: ${item.input}`);
    console.log(`Scores:`, scorerResults);
  }
});

console.log(`Average scores:`, result.scores);
console.log(`Processed ${result.summary.totalItems} items`);
```
## Parameters

- `target` (`Agent | Workflow`): The agent or workflow to evaluate.
- `data` (`RunExperimentDataItem[]`): Array of test cases with input data and optional ground truth.
- `scorers` (`MastraScorer[] | WorkflowScorerConfig`): Array of scorers for agents, or a configuration object for workflows specifying scorers for the entire workflow and for individual steps.
- `concurrency?` (`number`, default: `1`): Number of test cases to run concurrently.
- `onItemComplete?` (`function`): Callback invoked after each test case completes. Receives the item, the target result, and the scorer results.
### Data Item Structure

Each entry in the `data` array is a `RunExperimentDataItem` with the following fields (see the sketch after the list):

- `input` (`string | string[] | CoreMessage[] | any`): Input data for the target. For agents: messages or strings. For workflows: workflow input data.
- `groundTruth?` (`any`): Expected or reference output for comparison during scoring.
- `runtimeContext?` (`RuntimeContext`): Runtime context to pass to the target during execution.
- `tracingContext?` (`TracingContext`): Tracing context for observability and debugging.
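A minimal sketch of a `data` array combining these fields; the `RuntimeContext` import path and the `locale` key are assumptions for illustration:

```typescript
import { RuntimeContext } from '@mastra/core/runtime-context'; // assumed import path

// Hypothetical runtime context passed through to the target during execution
const runtimeContext = new RuntimeContext();
runtimeContext.set('locale', 'en-US'); // 'locale' is a placeholder key

const data = [
  {
    input: "What is machine learning?",                          // agent input as a plain string
    groundTruth: "Machine learning learns patterns from data.",  // reference output for scorers
    runtimeContext,
  },
  { input: "Explain neural networks" }, // groundTruth and contexts are optional
];
```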
### Workflow Scorer Configuration

For workflows, you can specify scorers at different levels using `WorkflowScorerConfig` (sketched below the list):

- `workflow?` (`MastraScorer[]`): Array of scorers to evaluate the entire workflow output.
- `steps?` (`Record<string, MastraScorer[]>`): Object mapping step IDs to arrays of scorers for evaluating individual step outputs.
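As a sketch, a `WorkflowScorerConfig` passed to `scorers` might look like this; the scorer variables and the step ID are placeholders:

```typescript
// Placeholder scorers; see createScorer() for building real ones
const scorers = {
  // Scores the final output of the whole workflow
  workflow: [outputQualityScorer],
  // Scores the output of individual steps, keyed by step ID
  steps: {
    'validation-step': [validationScorer],
  },
};
```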
## Returns

`runExperiment` resolves to an object with the following fields:

- `scores` (`Record<string, any>`): Average scores across all test cases, organized by scorer name.
- `summary` (`object`): Summary information about the experiment execution.
- `summary.totalItems` (`number`): Total number of test cases processed.
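Reading the returned result might look like the sketch below; it assumes a scorer registered under the name `My Scorer`, as in the agent example further down:

```typescript
const result = await runExperiment({ target: myAgent, data, scorers: [myScorer] });

// Average score per scorer, keyed by scorer name ('My Scorer' is assumed here)
console.log(result.scores['My Scorer']);

// Number of test cases that were processed
console.log(result.summary.totalItems);
```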
## Examples

### Agent Evaluation

```typescript
import { runExperiment, createScorer } from '@mastra/core/scores';
import { chatAgent } from './agents/chat-agent'; // assumed path to an agent defined in your project

// Custom scorer that checks whether the agent's response contains the ground truth
const myScorer = createScorer({
  name: 'My Scorer',
  description: "Check if Agent's response contains ground truth",
  type: 'agent'
}).generateScore(({ run }) => {
  const response = run.output[0]?.content || '';
  const expectedResponse = run.groundTruth;
  return response.includes(expectedResponse) ? 1 : 0;
});

const result = await runExperiment({
  target: chatAgent,
  data: [
    {
      input: "What is AI?",
      groundTruth: "AI is a field of computer science that creates intelligent machines."
    },
    {
      input: "How does machine learning work?",
      groundTruth: "Machine learning uses algorithms to learn patterns from data."
    }
  ],
  scorers: [myScorer],
  concurrency: 3
});
```
### Workflow Evaluation

```typescript
const workflowResult = await runExperiment({
  target: myWorkflow,
  data: [
    { input: { query: "Process this data", priority: "high" } },
    { input: { query: "Another task", priority: "low" } }
  ],
  scorers: {
    workflow: [outputQualityScorer],
    steps: {
      'validation-step': [validationScorer],
      'processing-step': [processingScorer]
    }
  },
  onItemComplete: ({ item, targetResult, scorerResults }) => {
    console.log(`Workflow completed for: ${item.input.query}`);
    if (scorerResults.workflow) {
      console.log('Workflow scores:', scorerResults.workflow);
    }
    if (scorerResults.steps) {
      console.log('Step scores:', scorerResults.steps);
    }
  }
});
```
## Related
- createScorer() - Create custom scorers for experiments
- MastraScorer - Learn about scorer structure and methods
- Custom Scorers - Guide to building evaluation logic
- Scorers Overview - Understanding scorer concepts