Last September, we shipped scorers, a way to evaluate your agents. Scorers let you measure the quality of an agent’s responses, but they need data to measure against.
Now we’re shipping datasets. Define versioned test cases for your agents and workflows to track, measure, and improve response quality across prompt, model, and code changes. Upload data in Mastra Studio or create datasets programmatically via the API.
Why datasets?
Agents produce non-deterministic output. The same prompt can return different results depending on the model, temperature, or context window. That makes it hard to tell whether a prompt tweak, model swap, or code change actually improved performance or introduced a regression.
For example, a translation agent might receive "Hello" as input, with "Hola" stored as the ground truth. If the agent returns "Hola", the scorer records a strong match; if it returns something incorrect or incomplete, the score reflects that. Each time you change the prompt or model, you run the same test cases and see exactly how the results shift.
Without datasets, you’re relying on memory and manual checks. With datasets and scorers together, you have a consistent baseline and a way to measure the impact of every change.
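To make the baseline idea concrete, here is a minimal, self-contained sketch of how a dataset item plus a scorer yields a comparable score across runs. The `DatasetItem` shape and `exactMatchScore` function are illustrative stand-ins, not Mastra's actual types or scorers:

```typescript
// Hypothetical item shape for illustration; Mastra's real types differ.
interface DatasetItem {
  input: string;
  groundTruth: { translation: string };
}

// A trivial exact-match scorer: 1 if the output equals the ground
// truth, 0 otherwise. Real scorers (e.g. content similarity) return
// graded values rather than a binary match.
function exactMatchScore(item: DatasetItem, output: string): number {
  return output === item.groundTruth.translation ? 1 : 0;
}

const item: DatasetItem = {
  input: "Hello",
  groundTruth: { translation: "Hola" },
};

console.log(exactMatchScore(item, "Hola")); // 1 (strong match)
console.log(exactMatchScore(item, "Bonjour")); // 0 (miss)
```

Because the same items and the same scorer run every time, any change in the scores is attributable to the change you made, not to shifting test cases.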
Creating datasets
To use datasets, configure a storage adapter in your main Mastra instance. Mastra currently supports LibSQL and PostgreSQL, with more storage adapters planned.
```typescript
// src/mastra/index.ts

import { Mastra } from "@mastra/core";
import { LibSQLStore } from "@mastra/libsql";

export const mastra = new Mastra({
  // ...
  storage: new LibSQLStore({
    id: "my-store",
    url: "file:./mastra.db",
  }),
});
```

Once storage is set up, you can create datasets in two ways:
- In Mastra Studio: create a dataset, give it a name and description, then upload .csv or .json files or add single items manually.
- Programmatically: using datasets.create() and datasets.addItems().
```typescript
// src/mastra/datasets/seed.ts

const dataset = await mastra.datasets.create({
  name: "spanish-translations",
  description: "English to Spanish translation pairs",
});

await dataset.addItems({
  items: [
    {
      input: "Hello",
      groundTruth: { translation: "Hola" },
    },
    // ...
  ],
});
```

Schemas are optional but recommended. Define the shape of your inputs and ground truth with Zod or JSON Schema, and Mastra validates each item on insert. Invalid items are rejected automatically.
Run experiments
An experiment runs each item in a dataset through a target and scores the output. The target can be an agent, a workflow, or a custom function.
When you pair a dataset with scorers, each run compares the output against the ground truth, giving you a consistent and repeatable measure of performance or accuracy. You can run experiments from Mastra Studio, or programmatically using datasets.startExperiment():
```typescript
const result = await dataset.startExperiment({
  targetType: "agent",
  targetId: "translation-agent",
  name: "translation-baseline",
  description: "Baseline experiment for Spanish translations",
  scorers: ["content-similarity-scorer"],
});
```

The result includes a summary with per-item scores and outputs. Change the prompt, run the experiment again, and compare results.
- startExperiment() runs synchronously, which makes it suitable for CI.
- startExperimentAsync() starts the experiment in the background, which suits longer-running experiments.
The API manages concurrency, per-item timeouts, and retries with exponential backoff. See the startExperiment and startExperimentAsync docs for full details.
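The change-and-compare loop can be sketched in plain TypeScript. The `ItemResult` shape and `compareRuns` helper below are illustrative, not part of Mastra's API:

```typescript
interface ItemResult {
  itemId: string;
  score: number;
}

// Pair up per-item scores from a baseline run and a follow-up run and
// report the delta per item, so regressions surface as negative numbers.
function compareRuns(baseline: ItemResult[], next: ItemResult[]) {
  const nextById = new Map(next.map((r) => [r.itemId, r.score]));
  return baseline.map((b) => ({
    itemId: b.itemId,
    delta: (nextById.get(b.itemId) ?? 0) - b.score,
  }));
}

const deltas = compareRuns(
  [
    { itemId: "hello", score: 0.8 },
    { itemId: "bye", score: 1.0 },
  ],
  [
    { itemId: "hello", score: 0.9 },
    { itemId: "bye", score: 0.7 },
  ],
);
console.log(deltas); // "hello" improved, "bye" regressed
```

In CI, a check like this over the per-item scores from two synchronous runs is enough to fail the build when any item regresses past a threshold.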
Automatic dataset versioning
Every add, update, or delete automatically creates a new dataset version. No manual version management is required.
In Mastra Studio, you can browse version history to see how items changed over time and view experiment results for any previous version of the dataset.
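Conceptually, automatic versioning behaves like an append-only log: every mutation snapshots the items under the next version number, so any past version can be read back exactly as it was. A minimal illustration, not Mastra's actual storage model:

```typescript
type Item = { input: string; groundTruth: { translation: string } };

// Each mutation freezes a full snapshot under a new version number.
// Version 0 is the empty initial state.
class VersionedDataset {
  private versions: Item[][] = [[]];

  addItem(item: Item): number {
    const latest = this.versions[this.versions.length - 1];
    this.versions.push([...latest, item]);
    return this.versions.length - 1; // the new version number
  }

  getVersion(version: number): Item[] {
    return this.versions[version];
  }
}

const ds = new VersionedDataset();
const v1 = ds.addItem({ input: "Hello", groundTruth: { translation: "Hola" } });
const v2 = ds.addItem({ input: "Goodbye", groundTruth: { translation: "Adiós" } });
console.log(v1, v2, ds.getVersion(v1).length, ds.getVersion(v2).length);
// 1 2 1 2
```

Because old snapshots are never mutated, an experiment pinned to a version always sees the same items, which is what makes results reproducible later.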
Pin experiments by version
When you run an experiment, you can pin it to a specific dataset version, so you know exactly which test cases produced which results.
```typescript
const result = await dataset.startExperiment({
  // ...
  version: 4,
});
```

Months later, you can reproduce the same experiment against the same data. Datasets evolve over time, and without version pinning they can feel like a moving target. With version pinning, every experiment is a reproducible snapshot of the dataset from a specific point in time.
What’s coming next?
We’re expanding datasets beyond manual test case creation. Soon, you’ll be able to build datasets directly from production traces, turning real agent interactions into curated datasets instead of hand-writing them.
We’re also building richer tooling in Mastra Studio for browsing and editing datasets, reviewing experiment results, and comparing runs side by side to spot regressions quickly.
Error analysis is another focus. Subject matter experts will be able to review outputs and annotate results at the item level, making evaluation more collaborative and structured.
Datasets are the foundation. From here, we’re building a tighter feedback loop between production traffic, structured testing, and regression detection.
Get started
For more information, see the documentation at mastra.ai/docs/observability/datasets/overview.
