dataset.startExperiment()
Added in: @mastra/core@1.4.0
Runs an experiment on the dataset and waits for completion. Executes all items against a target (agent, workflow, or scorer) with optional scoring.
Usage exampleDirect link to Usage example
import { Mastra } from "@mastra/core";
const mastra = new Mastra({ /* storage config */ });
const dataset = await mastra.datasets.get({ id: "dataset-id" });
// Run against a registered agent with scorers
const summary = await dataset.startExperiment({
targetType: "agent",
targetId: "my-agent",
scorers: ["accuracy", "relevancy"],
maxConcurrency: 10,
});
console.log(`${summary.succeededCount}/${summary.totalItems} succeeded`);
console.log(`Status: ${summary.status}`);
ParametersDirect link to Parameters
targetType?:
'agent' | 'workflow' | 'scorer'
Type of registered target to run items against. Use with `targetId`.
targetId?:
string
ID of the registered target. Use with `targetType`.
scorers?:
(MastraScorer | string)[]
Scorers to evaluate each result. Pass `MastraScorer` instances or registered scorer IDs.
name?:
string
Display name for the experiment.
description?:
string
Description of the experiment.
metadata?:
Record<string, unknown>
Arbitrary metadata for the experiment.
version?:
number
Pin to a specific dataset version. Defaults to the latest version.
maxConcurrency?:
number
Maximum concurrent item executions. Defaults to `5`.
signal?:
AbortSignal
AbortSignal for cancelling the experiment.
itemTimeout?:
number
Per-item execution timeout in milliseconds.
maxRetries?:
number
Maximum retries per item on failure. Defaults to `0` (no retries). Abort errors are never retried.
ReturnsDirect link to Returns
result:
Promise<ExperimentSummary>
Summary of the completed experiment.
ExperimentSummary
experimentId:
string
Unique ID of the experiment.
status:
'pending' | 'running' | 'completed' | 'failed'
Final status of the experiment.
totalItems:
number
Total number of items in the dataset.
succeededCount:
number
Number of items that succeeded.
failedCount:
number
Number of items that failed.
skippedCount:
number
Number of items skipped (e.g., due to abort).
completedWithErrors:
boolean
`true` if the run completed but some items failed.
startedAt:
Date
When the experiment started.
completedAt:
Date
When the experiment completed.
results:
ItemWithScores[]
All item results with their scores.
ItemWithScores
itemId:
string
ID of the dataset item.
itemVersion:
number
Dataset version of the item when executed.
input:
unknown
Input data passed to the target.
output:
unknown | null
Output from the target, or `null` if failed.
groundTruth:
unknown | null
Expected output from the dataset item.
error:
{ message: string; stack?: string; code?: string } | null
Structured error if execution failed.
startedAt:
Date
When item execution started.
completedAt:
Date
When item execution completed.
retryCount:
number
Number of retry attempts.
scores:
ScorerResult[]
Results from all scorers for this item.
ScorerResult
scorerId:
string
ID of the scorer.
scorerName:
string
Display name of the scorer.
score:
number | null
Computed score, or `null` if the scorer failed.
reason:
string | null
Reason/explanation for the score.
error:
string | null
Error message if the scorer failed.