Evals with memory

Agents that use memory in thread scope — including observational memory — require a thread ID at run time. When an eval invokes the agent without one, you'll see:

ObservationalMemory (scope: 'thread') requires a threadId, but none was found in RequestContext or MessageList.

This page covers the three working patterns for running Mastra evals against memory-enabled agents, what each path supports, and which one to pick. A complete runnable repro for all three approaches lives in examples/evals-with-memory.

When to use which approach

Goal	Approach
One shared conversation across every item	`runEvals` with global `targetOptions.memory`
One independent thread per item, focused CI loop	`runEvals` per item
Per-item threads driven by a stored `Dataset`	`dataset.startExperiment` with an inline task

Pre-seeding RequestContext with MastraMemory isn't a supported way to pass memory to an agent. Thread resolution reads args.memory.thread; RequestContext.MastraMemory is only populated by prepare-memory-step, after the agent has already resolved its thread.

Shared thread with `runEvals`

runEvals accepts targetOptions, which is forwarded to agent.generate(). Passing memory: { thread, resource } runs every data item against the same thread — useful for testing recall across a multi-turn conversation.

src/mastra/agents/support-agent.test.ts
import { runEvals } from '@mastra/core/evals'
import { supportAgent } from './support-agent'
import { recallScorer } from '../scorers/recall-scorer'

const memory = await supportAgent.getMemory()
await memory!.createThread({ threadId: 'eval-thread', resourceId: 'ci-user' })

const result = await runEvals({
  target: supportAgent,
  scorers: [recallScorer],
  targetOptions: {
    memory: { thread: 'eval-thread', resource: 'ci-user' },
  },
  data: [
    { input: 'My order number is 12345' },
    { input: 'What is my order number?', groundTruth: '12345' },
  ],
})

targetOptions applies to every item in a single runEvals call. RunEvalsDataItem has no per-item override today.

Per-item threads with `runEvals`

When each data item needs its own thread (the common CI shape), call runEvals once per item with a unique targetOptions.memory and aggregate the scores yourself.

src/mastra/agents/support-agent.test.ts
import { randomUUID } from 'node:crypto'
import { runEvals } from '@mastra/core/evals'
import { supportAgent } from './support-agent'
import { recallScorer } from '../scorers/recall-scorer'

const memory = await supportAgent.getMemory()
const resourceId = 'ci-user'

const items = [
  { input: 'Cats are mammals', groundTruth: 'mammals' },
  { input: 'Dogs are mammals too', groundTruth: 'mammals' },
]

// `runEvals` returns `{ scores: Record<string, number>; summary: { totalItems } }`.
const scores: number[] = []
for (const item of items) {
  const threadId = `eval-${randomUUID()}`
  await memory!.createThread({ threadId, resourceId, title: item.input })

  const result = await runEvals({
    target: supportAgent,
    scorers: [recallScorer],
    targetOptions: { memory: { thread: threadId, resource: resourceId } },
    data: [item],
  })

  scores.push(result.scores[recallScorer.id])
}

const average = scores.reduce((a, b) => a + b, 0) / scores.length
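The averaging line above yields NaN when no item produced a score, or when `result.scores[recallScorer.id]` is undefined for any item. A minimal sketch of a more defensive aggregation, assuming per-item scores are plain numbers as the comment in the example states. `summarizeScores` and `ScoreSummary` are hypothetical names for illustration, not part of `@mastra/core`.

```typescript
// Hypothetical helper (not a Mastra API): summarizes per-item eval scores.
type ScoreSummary = {
  count: number // items that produced a usable score
  skipped: number // items whose score was missing or not a finite number
  mean: number | null // null when there is nothing to average
  min: number | null // worst-scoring item, useful as a CI gate
}

function summarizeScores(values: Array<number | undefined>): ScoreSummary {
  // A missing scorer key reads as undefined; drop those and any NaN/Infinity.
  const valid = values.filter(
    (v): v is number => typeof v === 'number' && Number.isFinite(v),
  )

  if (valid.length === 0) {
    return { count: 0, skipped: values.length, mean: null, min: null }
  }

  const total = valid.reduce((a, b) => a + b, 0)
  return {
    count: valid.length,
    skipped: values.length - valid.length,
    mean: total / valid.length,
    min: Math.min(...valid),
  }
}
```

With this in place, the loop's `scores` array could be passed to `summarizeScores(scores)`, and a CI check might fail the run when `skipped > 0` or when `min` drops below a threshold you choose.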

Note: Create the thread before running the eval. Observational memory in thread scope reads from a record that must already exist.

Dataset experiments with an inline task

dataset.startExperiment({ target: agent }) doesn't forward a memory option to the agent — only requestContext. To run a stored dataset against a memory-enabled agent, use an inline task function and stash { threadId, resourceId } in each item's metadata. The scorer pipeline still runs as normal.

src/mastra/evals/dataset-experiment.ts
import { randomUUID } from 'node:crypto'
import { mastra } from '../index'
import { supportAgent } from '../agents/support-agent'
import { recallScorer } from '../scorers/recall-scorer'

const memory = await supportAgent.getMemory()
const resourceId = 'ci-user'

const items = [
  { input: 'Cats are mammals', groundTruth: 'mammals', thread: `ds-${randomUUID()}` },
  { input: 'Dogs are mammals too', groundTruth: 'mammals', thread: `ds-${randomUUID()}` },
]

for (const it of items) {
  await memory!.createThread({ threadId: it.thread, resourceId, title: it.input })
}

const dataset = await mastra.datasets.create({
  name: 'support-recall',
  description: 'Per-item memory via inline task + item metadata',
})

await dataset.addItems({
  items: items.map(it => ({
    input: it.input,
    groundTruth: it.groundTruth,
    metadata: { threadId: it.thread, resourceId },
  })),
})

const summary = await dataset.startExperiment({
  scorers: [recallScorer],
  task: async ({ input, metadata }) => {
    const { threadId, resourceId: rid } = (metadata ?? {}) as {
      threadId: string
      resourceId: string
    }
    const result = await supportAgent.generate(input as string, {
      memory: { thread: threadId, resource: rid },
    })
    return result.text
  },
})

The inline task receives the item's metadata, so each row can drive its own thread without changing the agent or any scorer. Visit runEvals reference and Dataset reference for full configuration.
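The inline task above casts `metadata` with `as`, so a row that lacks `threadId` would only fail later, inside the agent, with the thread-scope error shown at the top of this page. A minimal sketch of a runtime check that fails earlier with a clearer message, assuming metadata has the `{ threadId, resourceId }` shape used in this example. `readThreadMetadata` and `ThreadMetadata` are hypothetical names, not part of Mastra.

```typescript
// Hypothetical helper (not a Mastra API): validates per-item thread metadata.
type ThreadMetadata = { threadId: string; resourceId: string }

function readThreadMetadata(metadata: unknown): ThreadMetadata {
  if (typeof metadata !== 'object' || metadata === null) {
    throw new Error('Dataset item has no metadata; expected { threadId, resourceId }')
  }

  const { threadId, resourceId } = metadata as Record<string, unknown>

  // Reject missing, non-string, or empty values before the agent is invoked.
  if (typeof threadId !== 'string' || threadId.length === 0) {
    throw new Error('Dataset item metadata is missing a string threadId')
  }
  if (typeof resourceId !== 'string' || resourceId.length === 0) {
    throw new Error('Dataset item metadata is missing a string resourceId')
  }

  return { threadId, resourceId }
}
```

Inside the task, the cast could then become `const { threadId, resourceId: rid } = readThreadMetadata(metadata)`, leaving the `supportAgent.generate` call unchanged.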
