Change, Run, and Compare with Experiments in Mastra Studio

Run test cases against agents and workflows, score the results, and track quality over time.

Yujohn Nattrass · Mar 3, 2026 · 5 min read

Experiments, paired with our recently shipped datasets, are a big step forward in Mastra Studio’s observability capabilities, giving you the ability to measure agent accuracy and improve output over time.

This post covers the experiment workflow: how you run them, what you learn from them, and how they fit into your development process.

The iteration loop

When developing agentic systems you’re frequently tweaking prompts, swapping models, or adding tools. Experiments let you measure whether those changes made things better or worse.

Each item in a dataset has an input and optionally a ground truth (the expected output). An experiment runs each item through a target and scores the result. Every time you change something, you run a new experiment and compare it against previous runs. You can even run multiple experiments per dataset, so it's easy to test different agent configurations or model swaps against the same datasets.
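As a mental model, an item and the simplest possible grading step can be sketched in a few lines of TypeScript. These shapes are illustrative, not the exact SDK types:

```typescript
// Illustrative shapes; not the exact SDK types.
interface DatasetItem {
  input: string;
  groundTruth?: string; // the expected output, optional
}

// The simplest possible "scorer": exact match against ground truth.
function exactMatch(item: DatasetItem, output: string): number {
  return item.groundTruth !== undefined && item.groundTruth === output ? 1 : 0;
}

const item: DatasetItem = { input: "Goodbye", groundTruth: "Adiós" };
```

An experiment is essentially this loop run over every item in the dataset, with real scorers in place of `exactMatch`.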

Running an experiment

Experiments are run from the Dataset view in Studio. You pick a target type (agent or workflow), optionally add scorers to grade the output, then run the experiment.

During the experiment, Studio polls for results and updates status indicators and counts in real time.

Each experiment tracks its status from pending through running to completed, with aggregate counts of succeeded, failed, and skipped items. Per-item results include the full input and output, error details if something failed, and execution metadata including timing, retry counts, trace IDs, and token usage.

You can also run experiments using the startExperiment() API:

```typescript
const summary = await dataset.startExperiment({
  name: "gpt-4.1-baseline",
  targetType: "agent",
  targetId: "translation-agent",
  scorers: ["answer-similarity-scorer", "content-similarity-scorer"]
});
```

The summary includes the same output you'd see in Studio:

```typescript
{
  experimentId: "07c5efee-71be-49da-b329-cedcf5511f8c",
  status: "completed",
  totalItems: 6,
  succeededCount: 6,
  failedCount: 0,
  skippedCount: 0,
  completedWithErrors: false,
  startedAt: "2026-03-02T14:49:24.992Z",
  completedAt: "2026-03-02T14:49:34.568Z",
  results: [
    {
      itemId: "1c83e29a-6854-4490-a328-e09a69afb2c0",
      itemVersion: 1,
      input: "Goodbye",
      output: {
        text: "Adiós",
        usage: { inputTokens: 44, outputTokens: 3, totalTokens: 47 }
      },
      groundTruth: "Adiós",
      error: null,
      retryCount: 0,
      scores: [
        {
          scorerId: "answer-similarity-scorer",
          scorerName: "Answer Similarity Scorer",
          score: 1,
          reason: "The score is 1/1 because the output matches the ground truth exactly."
        },
        {
          scorerId: "content-similarity-scorer",
          scorerName: "Content Similarity Scorer",
          score: 1,
          reason: "Exact match."
        }
      ]
    }
    // ... remaining items
  ]
}
```

Scoring

We recommend adding scorers to grade experiment output. @mastra/evals includes built-in scorers for things like relevancy, faithfulness, hallucination, toxicity, bias, content similarity, and completeness. Each per-item result includes the score and reasoning.

Scorers come in two flavors:

  • Code-based scorers like content-similarity-scorer use string comparison algorithms and don't need a model.
  • LLM-based scorers like answer-relevancy-scorer require a model, which you set when creating the scorer instance.
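For intuition, a code-based scorer is just a deterministic function of the output and the ground truth. Here's a sketch using bigram Dice similarity; this is illustrative only, not the actual algorithm behind content-similarity-scorer:

```typescript
// Collect the set of two-character sequences in a string.
function bigrams(s: string): Set<string> {
  const grams = new Set<string>();
  for (let i = 0; i < s.length - 1; i++) grams.add(s.slice(i, i + 2));
  return grams;
}

// Dice coefficient over bigrams: 1 for identical strings, 0 for no overlap.
// A sketch of a code-based similarity score; not the library's algorithm.
function diceSimilarity(a: string, b: string): number {
  if (a === b) return 1;
  const ga = bigrams(a);
  const gb = bigrams(b);
  if (ga.size === 0 || gb.size === 0) return 0;
  let overlap = 0;
  for (const g of ga) if (gb.has(g)) overlap++;
  return (2 * overlap) / (ga.size + gb.size);
}
```

Because nothing here calls a model, code-based scorers are fast, free, and deterministic, which is why they make good defaults for CI-style runs.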

You can add multiple scorers to a single experiment. The following example creates an inline relevancy scorer using gpt-4.1-nano alongside a registered content-similarity-scorer:

```typescript
import { createAnswerRelevancyScorer } from "@mastra/evals/scorers/llm";
import { openai } from "@ai-sdk/openai";

const relevancy = createAnswerRelevancyScorer({ model: openai("gpt-4.1-nano") });

const summary = await dataset.startExperiment({
  name: "with-custom-scorers",
  targetType: "agent",
  targetId: "translation-agent",
  scorers: [relevancy, "content-similarity-scorer"]
});
```

To learn more, visit the built-in scorers docs.

Comparing experiments

This is where things get super interesting. Say you've run experiment A with your current prompt. You tweak the prompt and run experiment B against the same dataset. Studio lets you compare the two side by side, showing per-scorer averages, deltas, and per-item score progressions so you can see exactly what changed, for better or worse.

You can also compare experiments using the compareExperiments() API. For example, experiment A is the baseline (before) and experiment B is the new run (after changes were made):

```typescript
const comparison = await mastra.datasets.compareExperiments({
  experimentIds: [experimentAId, experimentBId],
  baselineId: experimentAId // experiment A is the baseline
});
```

The output includes every item with its input, ground truth, and per-experiment output and scores:

```typescript
{
  baselineId: "1f7e7576-cecf-4fcd-898d-133beafa955f",
  items: [
    {
      itemId: "1c83e29a-6854-4490-a328-e09a69afb2c0",
      input: "Goodbye",
      groundTruth: "Adiós",
      results: {
        "1f7e7576-cecf-4fcd-898d-133beafa955f": {
          output: { text: "Adiós", usage: { totalTokens: 47 } },
          scores: { "answer-similarity-scorer": 1, "content-similarity-scorer": null }
        },
        "07c5efee-71be-49da-b329-cedcf5511f8c": {
          output: { text: "Adiós", usage: { totalTokens: 47 } },
          scores: { "answer-similarity-scorer": 1, "content-similarity-scorer": 0 }
        }
      }
    }
    // ... remaining items
  ]
}
```
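Given that shape, computing per-item score deltas against the baseline is straightforward. A sketch over the structure above (the types are illustrative, not the SDK's; null scores are skipped rather than treated as zero):

```typescript
// Illustrative types mirroring the comparison output shape.
interface ComparisonItem {
  itemId: string;
  results: Record<string, { scores: Record<string, number | null> }>;
}

// Delta of each scorer between a candidate experiment and the baseline.
function scoreDeltas(
  item: ComparisonItem,
  baselineId: string,
  candidateId: string,
): Record<string, number> {
  const base = item.results[baselineId]?.scores ?? {};
  const cand = item.results[candidateId]?.scores ?? {};
  const deltas: Record<string, number> = {};
  for (const [scorerId, baseScore] of Object.entries(base)) {
    const candScore = cand[scorerId];
    // Only compare when both runs actually produced a score.
    if (baseScore !== null && candScore != null) {
      deltas[scorerId] = candScore - baseScore;
    }
  }
  return deltas;
}
```

Studio does this aggregation for you in the comparison view; the point is that the API output carries everything needed to do it yourself.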

Experiments in CI

Because both startExperiment() and compareExperiments() are programmatic, you can wire them into CI as a quality gate: run an experiment on every push, compare scores against a known baseline, and fail the build if anything regressed.
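The gate itself is just arithmetic over per-scorer averages. A minimal sketch; the tolerance and the score shape here are assumptions, not part of the API:

```typescript
// Per-scorer average scores for one run, e.g. { "answer-similarity-scorer": 0.92 }.
type ScorerAverages = Record<string, number>;

// Returns scorer IDs whose average dropped below the baseline by more than
// the tolerance (a small allowance for run-to-run noise).
function findRegressions(
  baseline: ScorerAverages,
  candidate: ScorerAverages,
  tolerance = 0.02,
): string[] {
  return Object.keys(baseline).filter(
    (scorerId) => (candidate[scorerId] ?? 0) < baseline[scorerId] - tolerance,
  );
}
```

In a pipeline you'd build these averages from two startExperiment() summaries and exit non-zero when the returned array is non-empty.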

Configuration

startExperiment() accepts the following configuration options:

  • maxConcurrency - number of items to run in parallel (default: 5)
  • itemTimeout - per-item execution timeout in milliseconds
  • maxRetries - retry failed items with exponential backoff (default: 0, no retries). Abort errors are never retried.
  • signal - pass an AbortSignal to cancel a running experiment
  • version - pin to a specific dataset version (default: latest)
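Putting those options together, a run pinned to a dataset version with tighter limits might look like this. The values are illustrative, and `dataset` is your dataset handle:

```typescript
// Illustrative values for the options listed above.
const controller = new AbortController();

const options = {
  name: "pinned-and-bounded",
  targetType: "agent" as const,
  targetId: "translation-agent",
  scorers: ["content-similarity-scorer"],
  maxConcurrency: 2,         // run at most two items in parallel
  itemTimeout: 30_000,       // time out any single item after 30 seconds
  maxRetries: 2,             // retry failed items with exponential backoff
  signal: controller.signal, // controller.abort() cancels the whole run
  version: 3,                // pin to dataset version 3 instead of latest
};

// const summary = await dataset.startExperiment(options);
```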

Errors are isolated per-item, so one failure doesn't crash the run. See the running experiments docs for full details.

Get started
