Change, Run, and Compare with Experiments in Mastra Studio

Run test cases against agents and workflows, score the results, and track quality over time.

Yujohn Nattrass · Mar 3, 2026 · 5 min read

Experiments, paired with our recently shipped datasets, are a big step forward in Mastra Studio’s observability capabilities, giving you the ability to measure agent accuracy and improve output over time.

This post covers the experiment workflow: how you run them, what you learn from them, and how they fit into your development process.

The iteration loop

When developing agentic systems you’re frequently tweaking prompts, swapping models, or adding tools. Experiments let you measure whether those changes made things better or worse.

Each item in a dataset has an input and optionally a ground truth (the expected output). An experiment runs each item through a target and scores the result. Every time you change something, you run a new experiment and compare it against previous runs. You can even run multiple experiments per dataset, so it's easy to test different agent configurations or model swaps against the same datasets.
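As a mental model, an item and the simplest possible grading step can be sketched in a few lines of TypeScript. These shapes are illustrative, not the exact SDK types:

```typescript
// Illustrative shapes; not the exact SDK types.
interface DatasetItem {
  input: string;
  groundTruth?: string; // the expected output, optional
}

// The simplest possible "scorer": exact match against ground truth.
function exactMatch(item: DatasetItem, output: string): number {
  return item.groundTruth !== undefined && item.groundTruth === output ? 1 : 0;
}

const item: DatasetItem = { input: "Goodbye", groundTruth: "Adiós" };
```

An experiment is essentially this loop run over every item in the dataset, with real scorers in place of `exactMatch`.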

Running an experiment

Experiments are run from the Dataset view in Studio. You pick a target type (agent or workflow), optionally add scorers to grade the output, then run the experiment.

During the experiment, Studio polls for results and updates status indicators and counts in real time.

Each experiment tracks its status from pending through running to completed, with aggregate counts of succeeded, failed, and skipped items. Per-item results include the full input and output, error details if something failed, and execution metadata including timing, retry counts, trace IDs, and token usage.

You can also run experiments using the startExperiment() API:

```typescript
const summary = await dataset.startExperiment({
  name: "gpt-4.1-baseline",
  targetType: "agent",
  targetId: "translation-agent",
  scorers: ["answer-similarity-scorer", "content-similarity-scorer"]
});
```

The summary includes the same output you'd see in Studio:

```typescript
{
  experimentId: "07c5efee-71be-49da-b329-cedcf5511f8c",
  status: "completed",
  totalItems: 6,
  succeededCount: 6,
  failedCount: 0,
  skippedCount: 0,
  completedWithErrors: false,
  startedAt: "2026-03-02T14:49:24.992Z",
  completedAt: "2026-03-02T14:49:34.568Z",
  results: [
    {
      itemId: "1c83e29a-6854-4490-a328-e09a69afb2c0",
      itemVersion: 1,
      input: "Goodbye",
      output: {
        text: "Adiós",
        usage: { inputTokens: 44, outputTokens: 3, totalTokens: 47 }
      },
      groundTruth: "Adiós",
      error: null,
      retryCount: 0,
      scores: [
        {
          scorerId: "answer-similarity-scorer",
          scorerName: "Answer Similarity Scorer",
          score: 1,
          reason: "The score is 1/1 because the output matches the ground truth exactly."
        },
        {
          scorerId: "content-similarity-scorer",
          scorerName: "Content Similarity Scorer",
          score: 1,
          reason: "Exact match."
        }
      ]
    }
    // ... remaining items
  ]
}
```

Scoring

We recommend adding scorers to grade experiment output. @mastra/evals includes built-in scorers for things like relevancy, faithfulness, hallucination, toxicity, bias, content similarity, and completeness. Each per-item result includes the score and reasoning.

Scorers come in two flavors:

  • Code-based scorers like content-similarity-scorer use string comparison algorithms and don't need a model.
  • LLM-based scorers like answer-relevancy-scorer require a model, which you set when creating the scorer instance.
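For intuition, a code-based scorer is just a deterministic function of the output and the ground truth. Here's a sketch using bigram Dice similarity; this is illustrative only, not the actual algorithm behind content-similarity-scorer:

```typescript
// Collect the set of two-character sequences in a string.
function bigrams(s: string): Set<string> {
  const grams = new Set<string>();
  for (let i = 0; i < s.length - 1; i++) grams.add(s.slice(i, i + 2));
  return grams;
}

// Dice coefficient over bigrams: 1 for identical strings, 0 for no overlap.
// A sketch of a code-based similarity score; not the library's algorithm.
function diceSimilarity(a: string, b: string): number {
  if (a === b) return 1;
  const ga = bigrams(a);
  const gb = bigrams(b);
  if (ga.size === 0 || gb.size === 0) return 0;
  let overlap = 0;
  for (const g of ga) if (gb.has(g)) overlap++;
  return (2 * overlap) / (ga.size + gb.size);
}
```

Because nothing here calls a model, code-based scorers are fast, free, and deterministic, which is why they make good defaults for CI-style runs.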

You can add multiple scorers to a single experiment. The following example creates an inline relevancy scorer using gpt-4.1-nano alongside a registered content-similarity-scorer:

```typescript
import { createAnswerRelevancyScorer } from "@mastra/evals/scorers/llm";
import { openai } from "@ai-sdk/openai";

const relevancy = createAnswerRelevancyScorer({ model: openai("gpt-4.1-nano") });

const summary = await dataset.startExperiment({
  name: "with-custom-scorers",
  targetType: "agent",
  targetId: "translation-agent",
  scorers: [relevancy, "content-similarity-scorer"]
});
```

To learn more, visit the built-in scorers docs.

Comparing experiments

This is where things get super interesting. Say you've run experiment A with your current prompt. You tweak the prompt and run experiment B against the same dataset. Studio lets you compare the two side by side, showing per-scorer averages, deltas, and per-item score progressions so you can see exactly what changed, for better or worse.

You can also compare experiments using the compareExperiments() API. For example, experiment A is the baseline (before) and experiment B is the new run (after changes were made):

```typescript
const comparison = await mastra.datasets.compareExperiments({
  experimentIds: [experimentAId, experimentBId],
  baselineId: experimentAId // experiment A is the baseline
});
```

The output includes every item with its input, ground truth, and per-experiment output and scores:

```typescript
{
  baselineId: "1f7e7576-cecf-4fcd-898d-133beafa955f",
  items: [
    {
      itemId: "1c83e29a-6854-4490-a328-e09a69afb2c0",
      input: "Goodbye",
      groundTruth: "Adiós",
      results: {
        "1f7e7576-cecf-4fcd-898d-133beafa955f": {
          output: { text: "Adiós", usage: { totalTokens: 47 } },
          scores: { "answer-similarity-scorer": 1, "content-similarity-scorer": null }
        },
        "07c5efee-71be-49da-b329-cedcf5511f8c": {
          output: { text: "Adiós", usage: { totalTokens: 47 } },
          scores: { "answer-similarity-scorer": 1, "content-similarity-scorer": 0 }
        }
      }
    }
    // ... remaining items
  ]
}
```
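Given that shape, computing per-item score deltas against the baseline is straightforward. A sketch over the structure above (the types are illustrative, not the SDK's; null scores are skipped rather than treated as zero):

```typescript
// Illustrative types mirroring the comparison output shape.
interface ComparisonItem {
  itemId: string;
  results: Record<string, { scores: Record<string, number | null> }>;
}

// Delta of each scorer between a candidate experiment and the baseline.
function scoreDeltas(
  item: ComparisonItem,
  baselineId: string,
  candidateId: string,
): Record<string, number> {
  const base = item.results[baselineId]?.scores ?? {};
  const cand = item.results[candidateId]?.scores ?? {};
  const deltas: Record<string, number> = {};
  for (const [scorerId, baseScore] of Object.entries(base)) {
    const candScore = cand[scorerId];
    // Only compare when both runs actually produced a score.
    if (baseScore !== null && candScore != null) {
      deltas[scorerId] = candScore - baseScore;
    }
  }
  return deltas;
}
```

Studio does this aggregation for you in the comparison view; the point is that the API output carries everything needed to do it yourself.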

Experiments in CI

Because both startExperiment() and compareExperiments() are programmatic, you can wire them into CI as a quality gate: run an experiment on every push, compare scores against a known baseline, and fail the build if anything regressed.
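The gate itself is just arithmetic over per-scorer averages. A minimal sketch; the tolerance and the score shape here are assumptions, not part of the API:

```typescript
// Per-scorer average scores for one run, e.g. { "answer-similarity-scorer": 0.92 }.
type ScorerAverages = Record<string, number>;

// Returns scorer IDs whose average dropped below the baseline by more than
// the tolerance (a small allowance for run-to-run noise).
function findRegressions(
  baseline: ScorerAverages,
  candidate: ScorerAverages,
  tolerance = 0.02,
): string[] {
  return Object.keys(baseline).filter(
    (scorerId) => (candidate[scorerId] ?? 0) < baseline[scorerId] - tolerance,
  );
}
```

In a pipeline you'd build these averages from two startExperiment() summaries and exit non-zero when the returned array is non-empty.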

Configuration

startExperiment() accepts the following configuration options:

  • maxConcurrency - number of items to run in parallel (default: 5)
  • itemTimeout - per-item execution timeout in milliseconds
  • maxRetries - retry failed items with exponential backoff (default: 0, no retries). Abort errors are never retried.
  • signal - pass an AbortSignal to cancel a running experiment
  • version - pin to a specific dataset version (default: latest)
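Putting those options together, a run pinned to a dataset version with tighter limits might look like this. The values are illustrative, and `dataset` is your dataset handle:

```typescript
// Illustrative values for the options listed above.
const controller = new AbortController();

const options = {
  name: "pinned-and-bounded",
  targetType: "agent" as const,
  targetId: "translation-agent",
  scorers: ["content-similarity-scorer"],
  maxConcurrency: 2,         // run at most two items in parallel
  itemTimeout: 30_000,       // time out any single item after 30 seconds
  maxRetries: 2,             // retry failed items with exponential backoff
  signal: controller.signal, // controller.abort() cancels the whole run
  version: 3,                // pin to dataset version 3 instead of latest
};

// const summary = await dataset.startExperiment(options);
```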

Errors are isolated per-item, so one failure doesn't crash the run. See the running experiments docs for full details.

Get started
