Evals with memory

Agents that use memory in thread scope — including observational memory — require a thread ID at run time. When an eval invokes the agent without one, you'll see:

ObservationalMemory (scope: 'thread') requires a threadId, but none was found in RequestContext or MessageList.

This page covers the three working patterns for running Mastra evals against memory-enabled agents, what each path supports, and which one to pick. A complete runnable repro for all three approaches lives in examples/evals-with-memory.

When to use which approach

| Goal | Approach |
| --- | --- |
| One shared conversation across every item | runEvals with global targetOptions.memory |
| One independent thread per item, simple CI loop | runEvals per item |
| Per-item threads driven by a stored Dataset | dataset.startExperiment with an inline task |

Pre-seeding RequestContext with MastraMemory is not a supported way to drive memory into an agent. Thread resolution reads args.memory.thread. RequestContext.MastraMemory is populated by prepare-memory-step after the agent has already resolved its thread.

Shared thread with runEvals

runEvals accepts targetOptions, which is forwarded to agent.generate(). Passing memory: { thread, resource } runs every data item against the same thread — useful for testing recall across a multi-turn conversation.

src/mastra/agents/support-agent.test.ts
import { runEvals } from '@mastra/core/evals'
import { supportAgent } from './support-agent'
import { recallScorer } from '../scorers/recall-scorer'

const memory = await supportAgent.getMemory()
await memory!.createThread({ threadId: 'eval-thread', resourceId: 'ci-user' })

const result = await runEvals({
  target: supportAgent,
  scorers: [recallScorer],
  targetOptions: {
    memory: { thread: 'eval-thread', resource: 'ci-user' },
  },
  data: [
    { input: 'My order number is 12345' },
    { input: 'What is my order number?', groundTruth: '12345' },
  ],
})

targetOptions is global per call. There is no per-item override on RunEvalsDataItem today.

Per-item threads with runEvals

When each data item needs its own thread (the common CI shape), call runEvals once per item with a unique targetOptions.memory and aggregate the scores yourself.

src/mastra/agents/support-agent.test.ts
import { randomUUID } from 'node:crypto'
import { runEvals } from '@mastra/core/evals'
import { supportAgent } from './support-agent'
import { recallScorer } from '../scorers/recall-scorer'

const memory = await supportAgent.getMemory()
const resourceId = 'ci-user'

const items = [
  { input: 'Cats are mammals', groundTruth: 'mammals' },
  { input: 'Dogs are mammals too', groundTruth: 'mammals' },
]

// `runEvals` returns `{ scores: Record<string, number>; summary: { totalItems } }`.
const scores: number[] = []
for (const item of items) {
  const threadId = `eval-${randomUUID()}`
  await memory!.createThread({ threadId, resourceId, title: item.input })

  const result = await runEvals({
    target: supportAgent,
    scorers: [recallScorer],
    targetOptions: { memory: { thread: threadId, resource: resourceId } },
    data: [item],
  })

  scores.push(result.scores[recallScorer.id])
}

const average = scores.reduce((a, b) => a + b, 0) / scores.length

Note: Create the thread before running the eval. Observational memory in thread scope reads from a record that must already exist.
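
The averaging step in the loop above divides by scores.length, which produces NaN if any run returns no value for the scorer or if the item list is empty. The following is a minimal sketch of a more defensive aggregation. It assumes only the result shape described in the code comment above (`scores: Record<string, number>`). The `averageScore` helper and the `EvalResult` type are hypothetical names for illustration, not part of Mastra.

```typescript
// Only the `scores` field of the runEvals result is needed here.
// This type is an assumption based on the shape noted in the example above.
type EvalResult = { scores: Record<string, number> }

// Hypothetical helper (not part of Mastra): average one scorer across
// several runs, ignoring runs where that scorer has no finite value.
// Returns undefined when no run produced a usable score.
function averageScore(results: EvalResult[], scorerId: string): number | undefined {
  const values = results
    .map(r => r.scores[scorerId])
    .filter((v): v is number => typeof v === 'number' && Number.isFinite(v))

  if (values.length === 0) return undefined

  return values.reduce((a, b) => a + b, 0) / values.length
}
```

To use it, collect each per-item `result` into an array instead of pushing a single number, then call `averageScore(results, recallScorer.id)`. Returning undefined rather than NaN lets a CI assertion distinguish "no scores were produced" from "the score was low".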

Dataset experiments with an inline task

dataset.startExperiment({ target: agent }) does not forward a memory option to the agent — only requestContext. To run a stored dataset against a memory-enabled agent, use an inline task function and stash { threadId, resourceId } in each item's metadata. The scorer pipeline still runs as normal.

src/mastra/evals/dataset-experiment.ts
import { randomUUID } from 'node:crypto'
import { mastra } from '../index'
import { supportAgent } from '../agents/support-agent'
import { recallScorer } from '../scorers/recall-scorer'

const memory = await supportAgent.getMemory()
const resourceId = 'ci-user'

const items = [
  { input: 'Cats are mammals', groundTruth: 'mammals', thread: `ds-${randomUUID()}` },
  { input: 'Dogs are mammals too', groundTruth: 'mammals', thread: `ds-${randomUUID()}` },
]

for (const it of items) {
  await memory!.createThread({ threadId: it.thread, resourceId, title: it.input })
}

const dataset = await mastra.datasets.create({
  name: 'support-recall',
  description: 'Per-item memory via inline task + item metadata',
})

await dataset.addItems({
  items: items.map(it => ({
    input: it.input,
    groundTruth: it.groundTruth,
    metadata: { threadId: it.thread, resourceId },
  })),
})

const summary = await dataset.startExperiment({
  scorers: [recallScorer],
  task: async ({ input, metadata }) => {
    const { threadId, resourceId: rid } = (metadata ?? {}) as {
      threadId: string
      resourceId: string
    }
    const result = await supportAgent.generate(input as string, {
      memory: { thread: threadId, resource: rid },
    })
    return result.text
  },
})

The inline task receives the item's metadata, so each row can drive its own thread without changing the agent or any scorer.
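
The inline task above casts `metadata` with `as`, so an item stored without a threadId reaches the agent and fails with the thread ID error quoted at the top of this page. The following sketch validates the metadata first, so a bad row fails with a message that names the missing field. `parseThreadMetadata` and `ThreadMetadata` are hypothetical names for illustration, not part of Mastra, and the sketch assumes metadata arrives as an untyped value.

```typescript
// The thread coordinates the inline task needs from each dataset item.
type ThreadMetadata = { threadId: string; resourceId: string }

// Hypothetical helper (not part of Mastra): narrow untyped item metadata
// to ThreadMetadata, or throw an error that names the missing field.
function parseThreadMetadata(metadata: unknown): ThreadMetadata {
  if (typeof metadata !== 'object' || metadata === null) {
    throw new Error('Dataset item has no metadata; expected { threadId, resourceId }')
  }

  const { threadId, resourceId } = metadata as Record<string, unknown>

  if (typeof threadId !== 'string' || threadId.length === 0) {
    throw new Error('Dataset item metadata is missing a non-empty threadId')
  }
  if (typeof resourceId !== 'string' || resourceId.length === 0) {
    throw new Error('Dataset item metadata is missing a non-empty resourceId')
  }

  return { threadId, resourceId }
}
```

Inside the task, `const { threadId, resourceId } = parseThreadMetadata(metadata)` would replace the cast, leaving the rest of the function unchanged.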

Note: See the runEvals reference and Dataset reference for full configuration.