Running experiments
Added in: @mastra/core@1.4.0
An experiment runs every item in a dataset through a target (an agent, a workflow, or a scorer) and then optionally scores the outputs. Use a scorer as the target when you want to evaluate an LLM judge itself. Results are persisted to storage so you can compare runs across different prompts, models, or code changes.
Basic experiment
Call startExperiment() with a target and scorers:
import { mastra } from '../index'
const dataset = await mastra.datasets.get({ id: 'translation-dataset-id' })
const summary = await dataset.startExperiment({
  name: 'gpt-5.1-baseline',
  targetType: 'agent',
  targetId: 'translation-agent',
  scorers: ['accuracy', 'fluency'],
})
console.log(summary.status) // 'completed' | 'failed'
console.log(summary.succeededCount) // number of items that ran successfully
console.log(summary.failedCount) // number of items that failed
startExperiment() blocks until all items finish. For fire-and-forget execution, see async experiments.
Studio
You can also run experiments in Studio. After you've added a dataset item, open it, select Run Experiment, and configure the target, scorers, and options.
After running an experiment, the Experiments tab shows all runs for that dataset (with status, counts, and timestamps). Select an experiment to see per-item results, scores, and execution traces.
In the Experiments tab, select Compare and choose two or more experiments to compare their scores and results side by side.
Experiment targets
You can point an experiment at a registered agent, workflow, or scorer.
Registered agent
Point to an agent registered on your Mastra instance:
const summary = await dataset.startExperiment({
  name: 'agent-v2-eval',
  targetType: 'agent',
  targetId: 'translation-agent',
  scorers: ['accuracy'],
})
Each item's input is passed directly to agent.generate(), so it must be a string, string[], or CoreMessage[].
Registered workflow
Point to a workflow registered on your Mastra instance:
const summary = await dataset.startExperiment({
  name: 'workflow-eval',
  targetType: 'workflow',
  targetId: 'translation-workflow',
  scorers: ['accuracy'],
})
The workflow receives each item's input as its trigger data.
Registered scorer
Point to a scorer to evaluate an LLM judge against ground truth:
const summary = await dataset.startExperiment({
  name: 'judge-accuracy-eval',
  targetType: 'scorer',
  targetId: 'accuracy',
})
The scorer receives each item's input and groundTruth. LLM-based judges can drift over time as underlying models change, so it's important to periodically realign them against known-good labels. A dataset gives you a stable benchmark to detect that drift.
Scoring results
Scorers automatically run after each item's target execution. Pass scorer instances or registered scorer IDs:
Scorer IDs:
// Reference scorers registered on the Mastra instance
const summary = await dataset.startExperiment({
  name: 'with-registered-scorers',
  targetType: 'agent',
  targetId: 'translation-agent',
  scorers: ['accuracy', 'fluency'],
})
Scorer instances:
import { createAnswerRelevancyScorer } from '@mastra/evals/scorers/prebuilt'
const relevancy = createAnswerRelevancyScorer({ model: 'openai/gpt-5-mini' })
const summary = await dataset.startExperiment({
  name: 'with-scorer-instances',
  targetType: 'agent',
  targetId: 'translation-agent',
  scorers: [relevancy],
})
Each item's results include per-scorer scores:
for (const item of summary.results) {
  console.log(item.itemId, item.output)
  for (const score of item.scores) {
    console.log(`  ${score.scorerName}: ${score.score} - ${score.reason}`)
  }
}
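To compare runs at a glance, the per-item scores can be folded into one mean per scorer. The following is a minimal sketch, not part of the Mastra API: `meanScoreByScorer` and the `ItemResult` type are hypothetical names, and the sketch assumes `score` may be missing (`null`) for items where a scorer did not produce a value.

```typescript
// Structural types covering only the fields used on this page.
// Assumption: `score` can be null when a scorer produced no value.
type ItemScore = { scorerName: string; score: number | null }
type ItemResult = { itemId: string; scores: ItemScore[] }

// Average each scorer's score across all items, skipping missing scores.
function meanScoreByScorer(results: ItemResult[]): Record<string, number> {
  const totals = new Map<string, { sum: number; count: number }>()
  for (const item of results) {
    for (const { scorerName, score } of item.scores) {
      if (typeof score !== 'number' || Number.isNaN(score)) continue
      const entry = totals.get(scorerName) ?? { sum: 0, count: 0 }
      entry.sum += score
      entry.count += 1
      totals.set(scorerName, entry)
    }
  }
  const means: Record<string, number> = {}
  totals.forEach((entry, name) => {
    means[name] = entry.sum / entry.count
  })
  return means
}
```

Passing `summary.results` to this helper would yield an object keyed by scorer name, which is convenient for logging next to the experiment name.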
Visit the Scorers overview for details on available and custom scorers.
Tool mocks
When an experiment runs an agent that calls side-effecting tools, you can make the run deterministic by attaching static tool mocks to individual dataset items. During the experiment, a mocked tool returns its declared output instead of executing. Tools that have no mock on the item run live.
Mocks live on the dataset item, so they version with the row and travel with the test case. Each mock declares a tool name, the arguments it expects, and the output to return:
await dataset.addItem({
  input: 'What is the weather in Seattle?',
  toolMocks: [
    {
      toolName: 'getWeather',
      args: { city: 'Seattle' },
      output: { temperature: 60, conditions: 'rainy' },
    },
  ],
})
Tool mocks are supported for agent targets only.
Matching and consumption
Arguments are matched strictly: object key order is ignored, array order is significant, and there is no type coercion. A mock is served only when the agent calls the tool with arguments that deep-equal the mock's args.
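The matching rules can be illustrated with a small deep-equality function. This is a sketch of the rules as described here, not Mastra's actual implementation; `argsMatch` is a hypothetical name.

```typescript
// Deep equality following the rules above: object key order is ignored,
// array order is significant, and values are never coerced.
function argsMatch(a: unknown, b: unknown): boolean {
  if (a === b) return true
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) {
    return false // differing primitives, or a primitive compared with an object
  }
  const aIsArray = Array.isArray(a)
  if (aIsArray !== Array.isArray(b)) return false
  if (aIsArray) {
    const xs = a as unknown[]
    const ys = b as unknown[]
    return xs.length === ys.length && xs.every((x, i) => argsMatch(x, ys[i]))
  }
  const objA = a as Record<string, unknown>
  const objB = b as Record<string, unknown>
  const keysA = Object.keys(objA)
  if (keysA.length !== Object.keys(objB).length) return false
  return keysA.every(
    key => Object.prototype.hasOwnProperty.call(objB, key) && argsMatch(objA[key], objB[key]),
  )
}
```

Under these rules `{ city: 'Seattle', units: 'F' }` matches `{ units: 'F', city: 'Seattle' }`, while `[1, 2]` does not match `[2, 1]` and `{ n: 1 }` does not match `{ n: '1' }`.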
When an item declares several mocks for the same tool and arguments, they are consumed in order — the first call gets the first mock, the next call gets the second, and so on. Ordering is tracked per (toolName, args) group and is independent across different arguments.
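Ordered consumption can be pictured as one queue per (toolName, args) group. The sketch below is illustrative only and does not reflect Mastra internals: `createMockServer` and `canonical` are hypothetical helpers, and the error messages simply reuse the failure codes named on this page. It assumes `serve` is called only for tools that have at least one mock on the item.

```typescript
type ToolMock = { toolName: string; args: unknown; output: unknown }

// Serialize args with sorted object keys so key order does not affect the group key.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return '[' + value.map(canonical).join(',') + ']'
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>
    const parts = Object.keys(obj)
      .sort()
      .map(key => JSON.stringify(key) + ':' + canonical(obj[key]))
    return '{' + parts.join(',') + '}'
  }
  return String(JSON.stringify(value))
}

// Build one FIFO queue of outputs per (toolName, args) group.
function createMockServer(mocks: ToolMock[]) {
  const queues = new Map<string, unknown[]>()
  for (const mock of mocks) {
    const key = mock.toolName + ':' + canonical(mock.args)
    const queue = queues.get(key) ?? []
    queue.push(mock.output)
    queues.set(key, queue)
  }
  return function serve(toolName: string, args: unknown): unknown {
    const queue = queues.get(toolName + ':' + canonical(args))
    if (!queue) throw new Error('TOOL_MOCK_MISMATCH')
    if (queue.length === 0) throw new Error('TOOL_MOCK_EXHAUSTED')
    return queue.shift()
  }
}
```

Because each group has its own queue, a call with different arguments in between does not disturb the order in which a repeated call receives its mocks.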
Matching mode
By default each mock matches strictly on its args. Set matchArgs: 'ignore' to match on the tool name only — the mock's args are not compared and the next unconsumed mock for that tool is served regardless of how the agent called it:
const subAgentMock = {
  toolName: 'agent-balanceAgent',
  args: { prompt: 'look up the balance for YJ' },
  output: { text: "YJ's balance is $100." },
  matchArgs: 'ignore',
}
This is useful when a tool's arguments are noisy or generated by the model. The most common case is mocking a sub-agent's response: a delegated sub-agent is exposed to the parent as a tool named agent-<name>, and its arguments include an LLM-authored prompt plus runtime-injected fields. Mocking agent-<name> returns the canned response in place of running the sub-agent and its inner tools. When you create a mock from a trace, sub-agent delegation calls are derived with matchArgs: 'ignore' automatically; you can change it to 'strict' to pin the exact arguments.
Failures
A mocked tool call fails the item when the arguments do not match or all matching mocks have been consumed:
- TOOL_MOCK_MISMATCH: the tool was called with arguments that no mock matches.
- TOOL_MOCK_EXHAUSTED: every matching mock has already been consumed.
When a mocked tool is mis-called, the agent run is aborted immediately, so the model cannot go on to call any further tools — including unmocked, side-effecting tools that would otherwise run live. These failures are deterministic, so they are not retried. Mocks that are declared but never used do not fail the item — they are reported as unconsumed.
While an item has mocks, the agent's tools execute sequentially so repeated (toolName, args) mocks are consumed in the provider's call order. This serialization applies only to items that declare mocks.
Diagnostics
Each item result carries a toolMockReport describing what the run did with the item's mocks:
for (const item of summary.results) {
  const report = item.toolMockReport
  if (!report) continue
  console.log(report.served) // mocks matched and returned
  console.log(report.unconsumed) // mocks declared but never used
  console.log(report.liveCalls) // unmocked tools that ran live
  console.log(report.failure) // the mismatch/exhausted failure, if any
}
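The reports can also be scanned programmatically, for example to flag items whose mocks went unused or whose agent called tools live. The sketch below rests on an assumption: this page names the four report fields but not their types, so `served`, `unconsumed`, and `liveCalls` are treated as arrays and `failure` as an optional value. `findMockProblems` is a hypothetical helper.

```typescript
// Assumed shapes: the field names come from this page, the types are a guess.
type ToolMockReport = {
  served: unknown[]
  unconsumed: unknown[]
  liveCalls: unknown[]
  failure?: unknown
}
type ItemWithReport = { itemId: string; toolMockReport?: ToolMockReport }

// Collect one human-readable line per notable condition in each item's report.
function findMockProblems(results: ItemWithReport[]): string[] {
  const problems: string[] = []
  for (const item of results) {
    const report = item.toolMockReport
    if (!report) continue // item declared no mocks
    if (report.failure) problems.push(`${item.itemId}: mock failure`)
    if (report.unconsumed.length > 0) {
      problems.push(`${item.itemId}: ${report.unconsumed.length} unconsumed mock(s)`)
    }
    if (report.liveCalls.length > 0) {
      problems.push(`${item.itemId}: ${report.liveCalls.length} live call(s)`)
    }
  }
  return problems
}
```

Whether live calls count as a problem depends on the test case; drop that check if an item intentionally leaves some tools unmocked.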
In Studio, edit a dataset item to author tool mocks as a JSON array, and open an experiment result to see the same report.
Limitations
- No tool span for mocked calls. A mocked call returns its output before the tool executes, so it does not create a tool span. Trajectory scorers backed by stored traces may therefore not see mocked tool calls. Trajectory extraction that falls back to the agent's message output still sees them, so trajectory scoring can differ depending on your observability configuration.
- Storage support. Tool mocks and tool mock reports are persisted by the LibSQL, PostgreSQL, MongoDB, and Spanner adapters. The MySQL adapter does not support them and rejects writes that carry tool mocks or a tool mock report so the feature never silently runs tools live.
Async experiments
startExperiment() blocks until every item completes. For long-running datasets, use startExperimentAsync() to start the experiment in the background:
const { experimentId, status } = await dataset.startExperimentAsync({
  name: 'large-dataset-run',
  targetType: 'agent',
  targetId: 'translation-agent',
  scorers: ['accuracy'],
})
console.log(experimentId) // UUID
console.log(status) // 'pending'
Poll for completion using getExperiment():
let experiment = await dataset.getExperiment({ experimentId })
while (experiment.status === 'pending' || experiment.status === 'running') {
  await new Promise(resolve => setTimeout(resolve, 5000))
  experiment = await dataset.getExperiment({ experimentId })
}
console.log(experiment.status) // 'completed' | 'failed'
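The polling loop above runs until the experiment leaves the 'pending' or 'running' state, however long that takes. A bounded variant is sketched below. `waitForExperiment` is a hypothetical helper, not a Mastra API, and the default interval and timeout are arbitrary choices.

```typescript
// Poll `fetchExperiment` until the status is neither 'pending' nor 'running',
// or throw once another wait would go past the deadline.
async function waitForExperiment<T extends { status: string }>(
  fetchExperiment: () => Promise<T>,
  options: { intervalMs?: number; timeoutMs?: number } = {},
): Promise<T> {
  const intervalMs = options.intervalMs ?? 5000
  const timeoutMs = options.timeoutMs ?? 10 * 60 * 1000
  const deadline = Date.now() + timeoutMs
  for (;;) {
    const experiment = await fetchExperiment()
    if (experiment.status !== 'pending' && experiment.status !== 'running') {
      return experiment
    }
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`Experiment still ${experiment.status} after ${timeoutMs} ms`)
    }
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs))
  }
}
```

With the objects from the example above, a call could look like `waitForExperiment(() => dataset.getExperiment({ experimentId }), { timeoutMs: 30 * 60 * 1000 })`.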
Configuration options
Concurrency
Control how many items run in parallel (default: 5):
const summary = await dataset.startExperiment({
  targetType: 'agent',
  targetId: 'translation-agent',
  maxConcurrency: 10,
})
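To make the effect of a concurrency limit concrete, here is a generic bounded-parallelism sketch. It is not how Mastra schedules items internally (this page does not describe that); `mapWithConcurrency` is a hypothetical helper that runs at most `maxConcurrency` workers at once and keeps results in input order.

```typescript
// Run `worker` over `items` with at most `maxConcurrency` calls in flight.
// Note: a rejected worker rejects the whole call; a real experiment runner
// would more likely record the failure per item and continue.
async function mapWithConcurrency<T, R>(
  items: T[],
  maxConcurrency: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  let next = 0
  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++
      results[index] = await worker(items[index], index)
    }
  }
  const laneCount = Math.max(1, Math.min(maxConcurrency, items.length))
  await Promise.all(Array.from({ length: laneCount }, () => runLane()))
  return results
}
```

A higher limit shortens wall-clock time but raises the chance of hitting provider rate limits, which is the usual reason to tune this setting.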
Timeouts and retries
Set a per-item timeout (in milliseconds) and retry count:
const summary = await dataset.startExperiment({
  targetType: 'agent',
  targetId: 'translation-agent',
  itemTimeout: 30_000, // 30 seconds per item
  maxRetries: 2, // retry failed items up to 2 times
})
Retries use exponential backoff. Abort errors are never retried.
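The retry behavior described above can be sketched generically. This is an illustration under assumptions, not Mastra's implementation: the base delay, the doubling factor, and detecting aborts by the error name 'AbortError' are all guesses, and `withRetries` is a hypothetical helper.

```typescript
// Retry `attempt` up to `maxRetries` extra times, doubling the delay each time.
// Errors named 'AbortError' are rethrown immediately and never retried.
async function withRetries<T>(
  attempt: () => Promise<T>,
  maxRetries: number,
  baseDelayMs = 1000, // assumed value; the source does not state the base delay
): Promise<T> {
  for (let retry = 0; ; retry++) {
    try {
      return await attempt()
    } catch (error) {
      const aborted =
        typeof error === 'object' &&
        error !== null &&
        (error as { name?: unknown }).name === 'AbortError'
      if (aborted || retry >= maxRetries) throw error
      // Delays grow as base, 2 * base, 4 * base, ...
      await new Promise<void>(resolve => setTimeout(resolve, baseDelayMs * 2 ** retry))
    }
  }
}
```

With `maxRetries: 2`, an item is attempted at most three times in total: the initial run plus two retries.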
Aborting an experiment
Pass an AbortSignal to cancel a running experiment:
const controller = new AbortController()
// Cancel after 60 seconds
setTimeout(() => controller.abort(), 60_000)
const summary = await dataset.startExperiment({
  targetType: 'agent',
  targetId: 'translation-agent',
  signal: controller.signal,
})
Remaining items are marked as skipped in the summary.
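In the example above the 60-second timer stays scheduled even if the experiment finishes early. One way to tidy that up is to wrap the pattern in a small helper that clears the timer once the run settles. `withDeadline` is a hypothetical helper built only on the standard `AbortController` and timer APIs.

```typescript
// Run `run` with an AbortSignal that fires after `ms` milliseconds,
// and clear the timer as soon as the run settles.
async function withDeadline<T>(
  ms: number,
  run: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), ms)
  try {
    return await run(controller.signal)
  } finally {
    clearTimeout(timer) // no stray timer once the run has finished
  }
}
```

With the dataset from the example above, a call might look like `withDeadline(60_000, signal => dataset.startExperiment({ targetType: 'agent', targetId: 'translation-agent', signal }))`.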
Pinning a dataset version
Run against a specific snapshot of the dataset:
const summary = await dataset.startExperiment({
  targetType: 'agent',
  targetId: 'translation-agent',
  version: 3, // use items from dataset version 3
})
Viewing results
Listing experiments
const { experiments, pagination } = await dataset.listExperiments({
  page: 0,
  perPage: 10,
})
for (const exp of experiments) {
  console.log(`${exp.name} - ${exp.status} (${exp.succeededCount}/${exp.totalItems})`)
}
Experiment details
const experiment = await dataset.getExperiment({
  experimentId: 'exp-abc-123',
})
console.log(experiment.status)
console.log(experiment.startedAt)
console.log(experiment.completedAt)
Item-level results
const { results, pagination } = await dataset.listExperimentResults({
  experimentId: 'exp-abc-123',
  page: 0,
  perPage: 50,
})
for (const result of results) {
  console.log(result.itemId, result.output, result.error)
}
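For experiments with more items than fit on one page, the pages can be walked in a loop. The sketch below deliberately avoids the `pagination` object, whose fields are not described on this page, and stops when a page comes back shorter than `perPage`. `collectAllPages` is a hypothetical helper; pages are zero-based, matching the `page: 0` used above.

```typescript
// Fetch pages 0, 1, 2, ... until a page returns fewer than `perPage` entries.
async function collectAllPages<T>(
  fetchPage: (page: number, perPage: number) => Promise<T[]>,
  perPage = 50,
): Promise<T[]> {
  const all: T[] = []
  for (let page = 0; ; page++) {
    const batch = await fetchPage(page, perPage)
    all.push(...batch)
    if (batch.length < perPage) return all // short (or empty) page: nothing left
  }
}
```

A call could look like `collectAllPages(async (page, perPage) => (await dataset.listExperimentResults({ experimentId, page, perPage })).results)`. When the total is an exact multiple of `perPage`, this costs one extra request that returns an empty page.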
Understanding the summary
startExperiment() returns an ExperimentSummary with counts and per-item results:
- completedWithErrors is true when the experiment finished but some items failed.
- Items cancelled via signal appear in skippedCount.
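In a CI job, the summary fields can be reduced to a single pass or fail decision. The sketch below shows one possible policy, not a recommendation from the Mastra docs: `SummaryLike` lists only fields mentioned on this page, and `gateOnSummary` is a hypothetical helper that treats any failed or skipped item as a failure.

```typescript
// Only the summary fields named on this page; the real type has more.
type SummaryLike = {
  status: string
  succeededCount: number
  failedCount: number
  skippedCount: number
  completedWithErrors: boolean
}

// One possible gate: pass only if the run completed with no failed or skipped items.
function gateOnSummary(summary: SummaryLike): { ok: boolean; message: string } {
  const counts =
    `${summary.succeededCount} succeeded, ` +
    `${summary.failedCount} failed, ` +
    `${summary.skippedCount} skipped`
  if (summary.status !== 'completed') {
    return { ok: false, message: `Experiment ${summary.status}: ${counts}` }
  }
  if (summary.completedWithErrors || summary.failedCount > 0) {
    return { ok: false, message: `Completed with errors: ${counts}` }
  }
  if (summary.skippedCount > 0) {
    return { ok: false, message: `Completed with skipped items: ${counts}` }
  }
  return { ok: true, message: `All items succeeded: ${counts}` }
}
```

A script could log `message` and set a non-zero exit code when `ok` is false.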
Visit the startExperiment reference for the full parameter and return type documentation.