# Running Experiments

**Added in:** `@mastra/core@1.4.0`

An experiment runs every item in a dataset through a target (an agent, a workflow, or a scorer) and then optionally scores the outputs. Use a scorer as the target when you want to evaluate an LLM judge itself. Results are persisted to storage so you can compare runs across different prompts, models, or code changes.

## Basic experiment

Call [`startExperiment()`](https://mastra.ai/reference/datasets/startExperiment) with a target and scorers:

```typescript
import { mastra } from "../index";

const dataset = await mastra.datasets.get({ id: "translation-dataset-id" });

const summary = await dataset.startExperiment({
  name: "gpt-5.1-baseline",
  targetType: "agent",
  targetId: "translation-agent",
  scorers: ["accuracy", "fluency"],
});

console.log(summary.status); // 'completed' | 'failed'
console.log(summary.succeededCount); // number of items that ran successfully
console.log(summary.failedCount); // number of items that failed
```

`startExperiment()` blocks until all items finish. For fire-and-forget execution, see [async experiments](#async-experiments).

## Experiment targets

You can point an experiment at a registered agent, workflow, or scorer.

### Registered agent

Point to an agent registered on your Mastra instance:

```typescript
const summary = await dataset.startExperiment({
  name: "agent-v2-eval",
  targetType: "agent",
  targetId: "translation-agent",
  scorers: ["accuracy"],
});
```

Each item's `input` is passed directly to `agent.generate()`, so it must be a `string`, `string[]`, or `CoreMessage[]`.

### Registered workflow

Point to a workflow registered on your Mastra instance:

```typescript
const summary = await dataset.startExperiment({
  name: "workflow-eval",
  targetType: "workflow",
  targetId: "translation-workflow",
  scorers: ["accuracy"],
});
```

The workflow receives each item's `input` as its trigger data.

### Registered scorer

Point to a scorer to evaluate an LLM judge against ground truth:

```typescript
const summary = await dataset.startExperiment({
  name: "judge-accuracy-eval",
  targetType: "scorer",
  targetId: "accuracy",
});
```

The scorer receives each item's `input` and `groundTruth`. LLM-based judges can drift over time as underlying models change, so it's important to periodically realign them against known-good labels. A dataset gives you a stable benchmark to detect that drift.

## Scoring results

Scorers automatically run after each item's target execution. Pass scorer instances or registered scorer IDs:

**Scorer IDs**:

```typescript
// Reference scorers registered on the Mastra instance
const summary = await dataset.startExperiment({
  name: "with-registered-scorers",
  targetType: "agent",
  targetId: "translation-agent",
  scorers: ["accuracy", "fluency"],
});
```

**Scorer instances**:

```typescript
import { createAnswerRelevancyScorer } from "@mastra/evals/scorers/prebuilt";

const relevancy = createAnswerRelevancyScorer({ model: "openai/gpt-4.1-nano" });

const summary = await dataset.startExperiment({
  name: "with-scorer-instances",
  targetType: "agent",
  targetId: "translation-agent",
  scorers: [relevancy],
});
```

Each item's results include per-scorer scores:

```typescript
for (const item of summary.results) {
  console.log(item.itemId, item.output);
  for (const score of item.scores) {
    console.log(`  ${score.scorerName}: ${score.score} — ${score.reason}`);
  }
}
```

> **Info:** Visit the [Scorers overview](https://mastra.ai/docs/evals/overview) for details on available and custom scorers.
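To compare runs across prompts or models at a glance, you can aggregate the per-item scores into per-scorer averages yourself. A minimal sketch, assuming only the `summary.results` and `item.scores` shapes shown above; the aggregation itself is not part of the API:

```typescript
// Average each scorer's score across all items in the summary.
// Assumes the `results` / `scores` shape from the example above;
// adjust field names if your version of @mastra/core differs.
const totals = new Map<string, { sum: number; count: number }>();

for (const item of summary.results) {
  for (const score of item.scores) {
    const entry = totals.get(score.scorerName) ?? { sum: 0, count: 0 };
    entry.sum += score.score;
    entry.count += 1;
    totals.set(score.scorerName, entry);
  }
}

for (const [scorerName, { sum, count }] of totals) {
  console.log(`${scorerName}: ${(sum / count).toFixed(3)} (n=${count})`);
}
```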
## Async experiments

`startExperiment()` blocks until every item completes. For long-running datasets, use [`startExperimentAsync()`](https://mastra.ai/reference/datasets/startExperimentAsync) to start the experiment in the background:

```typescript
const { experimentId, status } = await dataset.startExperimentAsync({
  name: "large-dataset-run",
  targetType: "agent",
  targetId: "translation-agent",
  scorers: ["accuracy"],
});

console.log(experimentId); // UUID
console.log(status); // 'pending'
```

Poll for completion using [`getExperiment()`](https://mastra.ai/reference/datasets/getExperiment):

```typescript
let experiment = await dataset.getExperiment({ experimentId });

while (experiment.status === "pending" || experiment.status === "running") {
  await new Promise(resolve => setTimeout(resolve, 5000));
  experiment = await dataset.getExperiment({ experimentId });
}

console.log(experiment.status); // 'completed' | 'failed'
```

## Configuration options

### Concurrency

Control how many items run in parallel (default: 5):

```typescript
const summary = await dataset.startExperiment({
  targetType: "agent",
  targetId: "translation-agent",
  maxConcurrency: 10,
});
```

### Timeouts and retries

Set a per-item timeout (in milliseconds) and retry count:

```typescript
const summary = await dataset.startExperiment({
  targetType: "agent",
  targetId: "translation-agent",
  itemTimeout: 30_000, // 30 seconds per item
  maxRetries: 2, // retry failed items up to 2 times
});
```

Retries use exponential backoff. Abort errors are never retried.

### Aborting an experiment

Pass an `AbortSignal` to cancel a running experiment:

```typescript
const controller = new AbortController();

// Cancel after 60 seconds
setTimeout(() => controller.abort(), 60_000);

const summary = await dataset.startExperiment({
  targetType: "agent",
  targetId: "translation-agent",
  signal: controller.signal,
});
```

Remaining items are marked as skipped in the summary.

### Pinning a dataset version

Run against a specific snapshot of the dataset:

```typescript
const summary = await dataset.startExperiment({
  targetType: "agent",
  targetId: "translation-agent",
  version: 3, // use items from dataset version 3
});
```

## Viewing results

### Listing experiments

```typescript
const { experiments, pagination } = await dataset.listExperiments({
  page: 0,
  perPage: 10,
});

for (const exp of experiments) {
  console.log(`${exp.name} — ${exp.status} (${exp.succeededCount}/${exp.totalItems})`);
}
```

### Experiment details

```typescript
const experiment = await dataset.getExperiment({
  experimentId: "exp-abc-123",
});

console.log(experiment.status);
console.log(experiment.startedAt);
console.log(experiment.completedAt);
```

### Item-level results

```typescript
const { results, pagination } = await dataset.listExperimentResults({
  experimentId: "exp-abc-123",
  page: 0,
  perPage: 50,
});

for (const result of results) {
  console.log(result.itemId, result.output, result.error);
}
```

## Understanding the summary

`startExperiment()` returns an `ExperimentSummary` with counts and per-item results:

- `completedWithErrors` is `true` when the experiment finished but some items failed.
- Items cancelled via `signal` appear in `skippedCount`.

> **Info:** Visit the [`startExperiment` reference](https://mastra.ai/reference/datasets/startExperiment) for the full parameter and return type documentation.
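In practice you will usually branch on these fields after a run finishes. A minimal sketch, assuming the summary fields described above (`status`, `completedWithErrors`, and the item counts):

```typescript
const summary = await dataset.startExperiment({
  targetType: "agent",
  targetId: "translation-agent",
  scorers: ["accuracy"],
});

// The run never completed at all.
if (summary.status === "failed") {
  throw new Error("Experiment failed to complete");
}

// The run finished, but some items errored or were skipped via `signal`.
if (summary.completedWithErrors) {
  console.warn(
    `Succeeded: ${summary.succeededCount}, failed: ${summary.failedCount}, skipped: ${summary.skippedCount}`,
  );
}
```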
## Related

- [Datasets overview](https://mastra.ai/docs/observability/datasets/overview)
- [Scorers overview](https://mastra.ai/docs/evals/overview)
- [`startExperiment` reference](https://mastra.ai/reference/datasets/startExperiment)
- [`listExperimentResults` reference](https://mastra.ai/reference/datasets/listExperimentResults)