# Running experiments

**Added in:** `@mastra/core@1.4.0`

An experiment runs every item in a dataset through a target (an agent, a workflow, or a scorer) and then optionally scores the outputs. Use a scorer as the target when you want to evaluate an LLM judge itself. Results are persisted to storage so you can compare runs across different prompts, models, or code changes.

## Basic experiment

Call [`startExperiment()`](https://mastra.ai/reference/datasets/startExperiment) with a target and scorers:

```typescript
import { mastra } from '../index'

const dataset = await mastra.datasets.get({ id: 'translation-dataset-id' })

const summary = await dataset.startExperiment({
  name: 'gpt-5.1-baseline',
  targetType: 'agent',
  targetId: 'translation-agent',
  scorers: ['accuracy', 'fluency'],
})

console.log(summary.status) // 'completed' | 'failed'
console.log(summary.succeededCount) // number of items that ran successfully
console.log(summary.failedCount) // number of items that failed
```

`startExperiment()` blocks until all items finish. For fire-and-forget execution, see [async experiments](#async-experiments).

## Studio

You can also run experiments in [Studio](https://mastra.ai/docs/studio/overview). After you've added a dataset item, open it, select **Run Experiment**, and configure the target, scorers, and options.

After running an experiment, the **Experiments** tab shows all runs for that dataset, with status, counts, and timestamps. Select an experiment to see per-item results, scores, and execution traces.

In the **Experiments** tab, select **Compare** and choose two or more experiments to compare their scores and results side by side.

## Experiment targets

You can point an experiment at a registered agent, workflow, or scorer.
### Registered agent

Point to an agent registered on your Mastra instance:

```typescript
const summary = await dataset.startExperiment({
  name: 'agent-v2-eval',
  targetType: 'agent',
  targetId: 'translation-agent',
  scorers: ['accuracy'],
})
```

Each item's `input` is passed directly to `agent.generate()`, so it must be a `string`, `string[]`, or `CoreMessage[]`.

### Registered workflow

Point to a workflow registered on your Mastra instance:

```typescript
const summary = await dataset.startExperiment({
  name: 'workflow-eval',
  targetType: 'workflow',
  targetId: 'translation-workflow',
  scorers: ['accuracy'],
})
```

The workflow receives each item's `input` as its trigger data.

### Registered scorer

Point to a scorer to evaluate an LLM judge against ground truth:

```typescript
const summary = await dataset.startExperiment({
  name: 'judge-accuracy-eval',
  targetType: 'scorer',
  targetId: 'accuracy',
})
```

The scorer receives each item's `input` and `groundTruth`. LLM-based judges can drift over time as underlying models change, so it's important to periodically realign them against known-good labels. A dataset gives you a stable benchmark to detect that drift.

## Scoring results

Scorers automatically run after each item's target execution.
Pass scorer instances or registered scorer IDs:

**Scorer IDs**:

```typescript
// Reference scorers registered on the Mastra instance
const summary = await dataset.startExperiment({
  name: 'with-registered-scorers',
  targetType: 'agent',
  targetId: 'translation-agent',
  scorers: ['accuracy', 'fluency'],
})
```

**Scorer instances**:

```typescript
import { createAnswerRelevancyScorer } from '@mastra/evals/scorers/prebuilt'

const relevancy = createAnswerRelevancyScorer({ model: 'openai/gpt-5-mini' })

const summary = await dataset.startExperiment({
  name: 'with-scorer-instances',
  targetType: 'agent',
  targetId: 'translation-agent',
  scorers: [relevancy],
})
```

Each item's results include per-scorer scores:

```typescript
for (const item of summary.results) {
  console.log(item.itemId, item.output)
  for (const score of item.scores) {
    console.log(`  ${score.scorerName}: ${score.score} — ${score.reason}`)
  }
}
```

> **Info:** Visit the [Scorers overview](https://mastra.ai/docs/evals/overview) for details on available and custom scorers.

## Async experiments

`startExperiment()` blocks until every item completes.
For long-running datasets, use [`startExperimentAsync()`](https://mastra.ai/reference/datasets/startExperimentAsync) to start the experiment in the background:

```typescript
const { experimentId, status } = await dataset.startExperimentAsync({
  name: 'large-dataset-run',
  targetType: 'agent',
  targetId: 'translation-agent',
  scorers: ['accuracy'],
})

console.log(experimentId) // UUID
console.log(status) // 'pending'
```

Poll for completion using [`getExperiment()`](https://mastra.ai/reference/datasets/getExperiment):

```typescript
let experiment = await dataset.getExperiment({ experimentId })

while (experiment.status === 'pending' || experiment.status === 'running') {
  await new Promise(resolve => setTimeout(resolve, 5000))
  experiment = await dataset.getExperiment({ experimentId })
}

console.log(experiment.status) // 'completed' | 'failed'
```

## Configuration options

### Concurrency

Control how many items run in parallel (default: 5):

```typescript
const summary = await dataset.startExperiment({
  targetType: 'agent',
  targetId: 'translation-agent',
  maxConcurrency: 10,
})
```

### Timeouts and retries

Set a per-item timeout (in milliseconds) and a retry count:

```typescript
const summary = await dataset.startExperiment({
  targetType: 'agent',
  targetId: 'translation-agent',
  itemTimeout: 30_000, // 30 seconds per item
  maxRetries: 2, // retry failed items up to 2 times
})
```

Retries use exponential backoff. Abort errors are never retried.

### Aborting an experiment

Pass an `AbortSignal` to cancel a running experiment:

```typescript
const controller = new AbortController()

// Cancel after 60 seconds
setTimeout(() => controller.abort(), 60_000)

const summary = await dataset.startExperiment({
  targetType: 'agent',
  targetId: 'translation-agent',
  signal: controller.signal,
})
```

Remaining items are marked as skipped in the summary.
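The abort pattern above generalizes to a small helper that enforces an overall deadline on any abortable call and cleans up the timer afterwards. This is a minimal sketch; `withDeadline` and its parameters are illustrative, not part of the Mastra API:

```typescript
// Hypothetical helper (not part of the Mastra API): run any abortable
// async call with an overall deadline. The timer is cleared in `finally`
// so a fast completion doesn't leave a stray timeout behind.
async function withDeadline<T>(
  run: (signal: AbortSignal) => Promise<T>,
  deadlineMs: number,
): Promise<T> {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), deadlineMs)
  try {
    return await run(controller.signal)
  } finally {
    clearTimeout(timer)
  }
}
```

You could then write `withDeadline(signal => dataset.startExperiment({ targetType: 'agent', targetId: 'translation-agent', signal }), 60_000)`; items still pending when the deadline fires are marked as skipped, as described above.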
### Pinning a dataset version

Run against a specific snapshot of the dataset:

```typescript
const summary = await dataset.startExperiment({
  targetType: 'agent',
  targetId: 'translation-agent',
  version: 3, // use items from dataset version 3
})
```

## Viewing results

### Listing experiments

```typescript
const { experiments, pagination } = await dataset.listExperiments({
  page: 0,
  perPage: 10,
})

for (const exp of experiments) {
  console.log(`${exp.name} — ${exp.status} (${exp.succeededCount}/${exp.totalItems})`)
}
```

### Experiment details

```typescript
const experiment = await dataset.getExperiment({
  experimentId: 'exp-abc-123',
})

console.log(experiment.status)
console.log(experiment.startedAt)
console.log(experiment.completedAt)
```

### Item-level results

```typescript
const { results, pagination } = await dataset.listExperimentResults({
  experimentId: 'exp-abc-123',
  page: 0,
  perPage: 50,
})

for (const result of results) {
  console.log(result.itemId, result.output, result.error)
}
```

## Understanding the summary

`startExperiment()` returns an `ExperimentSummary` with counts and per-item results:

- `completedWithErrors` is `true` when the experiment finished but some items failed.
- Items cancelled via `signal` appear in `skippedCount`.

> **Info:** Visit the [`startExperiment` reference](https://mastra.ai/reference/datasets/startExperiment) for the full parameter and return type documentation.

## Related

- [Datasets overview](https://mastra.ai/docs/evals/datasets/overview)
- [Scorers overview](https://mastra.ai/docs/evals/overview)
- [`startExperiment` reference](https://mastra.ai/reference/datasets/startExperiment)
- [`listExperimentResults` reference](https://mastra.ai/reference/datasets/listExperimentResults)
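When comparing experiments, it can be handy to collapse the per-item scores from a summary into one average per scorer. The sketch below assumes the per-item `scores` shape shown in the scoring examples (`scorerName`, `score`); the `ItemResult` and `ScoreEntry` types are illustrative stand-ins, not exported Mastra types:

```typescript
// Illustrative types mirroring the per-item result shape shown earlier (assumed).
type ScoreEntry = { scorerName: string; score: number }
type ItemResult = { itemId: string; scores: ScoreEntry[] }

// Average each scorer's score across all item results in a summary.
function meanScores(results: ItemResult[]): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {}
  for (const item of results) {
    for (const { scorerName, score } of item.scores) {
      const s = (sums[scorerName] ??= { total: 0, count: 0 })
      s.total += score
      s.count += 1
    }
  }
  return Object.fromEntries(
    Object.entries(sums).map(([name, { total, count }]) => [name, total / count]),
  )
}
```

Running this over two summaries (for example, a baseline and a candidate run) gives a quick side-by-side view of each scorer's average, similar to what Studio's **Compare** view surfaces.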