dataset.startExperiment()
Added in: @mastra/core@1.4.0
Runs an experiment on the dataset and waits for completion. Executes all items against a target (agent, workflow, or scorer) with optional scoring.
Usage example
import { Mastra } from '@mastra/core'

const mastra = new Mastra({
  /* storage config */
})

const dataset = await mastra.datasets.get({ id: 'dataset-id' })

// Run against a registered agent with a flat scorer list
const summary = await dataset.startExperiment({
  targetType: 'agent',
  targetId: 'my-agent',
  scorers: ['accuracy', 'relevancy'],
  maxConcurrency: 10,
})

// Note: accuracyScorer, toolOrderScorer, and the other scorer variables used
// below are MastraScorer instances assumed to be defined elsewhere.

// Or pass the same categorised shape accepted by runEvals
const summary2 = await dataset.startExperiment({
  targetType: 'agent',
  targetId: 'my-agent',
  scorers: {
    agent: [accuracyScorer],
    trajectory: [toolOrderScorer],
  },
})

// For workflow targets, score individual steps with their own scorers
const summary3 = await dataset.startExperiment({
  targetType: 'workflow',
  targetId: 'my-workflow',
  scorers: {
    workflow: [overallScorer],
    steps: {
      'fetch-data': [fetchScorer],
      transform: [transformScorer],
    },
    trajectory: [executionPathScorer],
  },
})

console.log(`${summary.succeededCount}/${summary.totalItems} succeeded`)
console.log(`Status: ${summary.status}`)
console.log(`${summary2.succeededCount}/${summary2.totalItems} succeeded`)
console.log(`Status: ${summary2.status}`)
Parameters
targetType?:
'agent' | 'workflow' | 'scorer'
Type of registered target to run items against. Use with `targetId`.
targetId?:
string
ID of the registered target. Use with `targetType`.
scorers?:
(MastraScorer | string)[] | AgentScorerConfig | WorkflowScorerConfig
Scorers to evaluate each result. Accepts a flat array of `MastraScorer` instances or registered scorer IDs, or the same categorised config shape used by `runEvals` (`AgentScorerConfig` / `WorkflowScorerConfig`). Trajectory scorers (`type: "trajectory"`) automatically receive a pre-extracted `Trajectory` as their output regardless of which form is used. For workflow targets, per-step scorers can be passed via `scorers: { steps: { stepId: [...] } }` and run against each step's output; their results carry the originating `stepId` and keep `targetScope: "span"` (matching `runEvals`).
name?:
string
Display name for the experiment.
description?:
string
Description of the experiment.
metadata?:
Record<string, unknown>
Arbitrary metadata for the experiment.
version?:
number
Pin to a specific dataset version. Defaults to the latest version.
maxConcurrency?:
number
Maximum concurrent item executions. Defaults to `5`.
signal?:
AbortSignal
AbortSignal for cancelling the experiment.
itemTimeout?:
number
Per-item execution timeout in milliseconds.
maxRetries?:
number
Maximum retries per item on failure. Defaults to `0` (no retries). Abort errors are never retried.
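The run-control parameters above (`maxConcurrency`, `signal`, `itemTimeout`, `maxRetries`) are often set together. The following is a minimal sketch of one way to build them, not part of the Mastra API: `RunControlOptions` and `buildRunControl` are hypothetical names introduced for illustration, and the timeout and retry values are arbitrary. It relies only on the standard `AbortController`.

```typescript
// Local mirror of the run-control parameters documented above.
// Assumption: this is an illustrative type, not one exported by @mastra/core.
interface RunControlOptions {
  maxConcurrency?: number
  signal?: AbortSignal
  itemTimeout?: number
  maxRetries?: number
}

// Hypothetical helper: returns options whose signal aborts automatically
// after `deadlineMs`, plus functions to abort early or to clear the timer.
function buildRunControl(deadlineMs: number): {
  options: RunControlOptions
  cancel: () => void
  dispose: () => void
} {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), deadlineMs)

  return {
    options: {
      maxConcurrency: 5, // same as the documented default
      signal: controller.signal,
      itemTimeout: 30_000, // 30 seconds per item (illustrative value)
      maxRetries: 2, // illustrative value; the documented default is 0
    },
    // Abort immediately and stop the deadline timer.
    cancel: () => {
      clearTimeout(timer)
      controller.abort()
    },
    // Stop the deadline timer without aborting (call once the run has finished).
    dispose: () => clearTimeout(timer),
  }
}
```

Under these assumptions, the options could be spread into the call, for example `dataset.startExperiment({ targetType: 'agent', targetId: 'my-agent', ...options })`, with `dispose()` called after the promise settles. Because abort errors are never retried, calling `cancel()` should not trigger further retry attempts.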
Returns
result:
Promise<ExperimentSummary>
Summary of the completed experiment.
ExperimentSummary
experimentId:
string
Unique ID of the experiment.
status:
'pending' | 'running' | 'completed' | 'failed'
Final status of the experiment.
totalItems:
number
Total number of items in the dataset.
succeededCount:
number
Number of items that succeeded.
failedCount:
number
Number of items that failed.
skippedCount:
number
Number of items skipped (e.g., due to abort).
completedWithErrors:
boolean
`true` if the run completed but some items failed.
startedAt:
Date
When the experiment started.
completedAt:
Date
When the experiment completed.
results:
ItemWithScores[]
All item results with their scores.
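The count fields above can be combined into a one-line report. This is a minimal sketch: `SummaryCounts` is a local type that mirrors only the documented fields it needs, and `describeSummary` is a hypothetical helper, not a Mastra function.

```typescript
// Local subset of the documented ExperimentSummary fields (illustrative type).
interface SummaryCounts {
  status: 'pending' | 'running' | 'completed' | 'failed'
  totalItems: number
  succeededCount: number
  failedCount: number
  skippedCount: number
  completedWithErrors: boolean
}

// Hypothetical helper: formats the counts and the pass rate as a single line.
function describeSummary(s: SummaryCounts): string {
  // Guard against division by zero for an empty dataset.
  const pct = s.totalItems === 0 ? '0.0' : ((s.succeededCount * 100) / s.totalItems).toFixed(1)
  const flag = s.completedWithErrors ? ' (completed with errors)' : ''
  return (
    `${s.status}${flag}: ${s.succeededCount}/${s.totalItems} succeeded (${pct}%), ` +
    `${s.failedCount} failed, ${s.skippedCount} skipped`
  )
}
```

Since the helper depends only on structural typing, the object returned by `startExperiment` could be passed to it directly, assuming its fields match the shapes documented here.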
ItemWithScores
itemId:
string
ID of the dataset item.
itemVersion:
number
Dataset version of the item when executed.
input:
unknown
Input data passed to the target.
output:
unknown | null
Output from the target, or `null` if failed.
groundTruth:
unknown | null
Expected output from the dataset item.
error:
{ message: string; stack?: string; code?: string } | null
Structured error if execution failed.
startedAt:
Date
When item execution started.
completedAt:
Date
When item execution completed.
retryCount:
number
Number of retry attempts.
scores:
ScorerResult[]
Results from all scorers for this item.
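Because `error` is `null` for successful items, failed items can be picked out of `results` with a null check. The sketch below assumes only the fields documented above; `ItemResultLike` and `collectFailures` are hypothetical names used for illustration.

```typescript
// Local subset of the documented ItemWithScores fields (illustrative types).
interface ItemError {
  message: string
  stack?: string
  code?: string
}

interface ItemResultLike {
  itemId: string
  error: ItemError | null
  retryCount: number
}

// Hypothetical helper: returns one readable line per failed item.
function collectFailures(results: ItemResultLike[]): string[] {
  const lines: string[] = []
  for (const r of results) {
    if (r.error === null) continue // item succeeded
    const code = r.error.code ? ` [${r.error.code}]` : ''
    lines.push(`${r.itemId}${code}: ${r.error.message} (retries: ${r.retryCount})`)
  }
  return lines
}
```

Under these assumptions, `collectFailures(summary.results)` would list the items to inspect when `completedWithErrors` is `true`.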
ScorerResult
scorerId:
string
ID of the scorer.
scorerName:
string
Display name of the scorer.
score:
number | null
Computed score, or `null` if the scorer failed.
reason:
string | null
Reason/explanation for the score.
error:
string | null
Error message if the scorer failed.
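Since `score` is `null` when a scorer fails, any aggregate over scores needs to skip those entries rather than treat them as zero. The following is a sketch of a per-scorer mean under that assumption; `ScoreLike`, `ItemLike`, and `averageByScorer` are hypothetical names, not part of the Mastra API.

```typescript
// Local subsets of the documented ScorerResult and ItemWithScores fields.
interface ScoreLike {
  scorerId: string
  score: number | null
}

interface ItemLike {
  scores: ScoreLike[]
}

interface ScorerAverage {
  mean: number | null // null when no item produced a score for this scorer
  scored: number // results with a numeric score
  failed: number // results where score was null
}

// Hypothetical helper: mean score per scorerId, ignoring null scores.
function averageByScorer(items: ItemLike[]): Map<string, ScorerAverage> {
  const totals = new Map<string, { sum: number; scored: number; failed: number }>()
  for (const item of items) {
    for (const r of item.scores) {
      const entry = totals.get(r.scorerId) ?? { sum: 0, scored: 0, failed: 0 }
      if (r.score === null) {
        entry.failed += 1
      } else {
        entry.sum += r.score
        entry.scored += 1
      }
      totals.set(r.scorerId, entry)
    }
  }

  const averages = new Map<string, ScorerAverage>()
  totals.forEach((entry, scorerId) => {
    averages.set(scorerId, {
      mean: entry.scored > 0 ? entry.sum / entry.scored : null,
      scored: entry.scored,
      failed: entry.failed,
    })
  })
  return averages
}
```

Reporting `failed` alongside `mean` keeps scorer errors visible instead of letting them silently shrink the sample.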