Observational Memory

Added in: @mastra/memory@1.1.0

Observational Memory (OM) is Mastra's memory system for long-running, long-context agents. Two background agents, an Observer and a Reflector, watch your agent's conversations and maintain a dense observation log that replaces raw message history as it grows.

Quick Start

Enable observationalMemory in the memory options when creating your agent:

src/mastra/agents/agent.ts
import { Memory } from '@mastra/memory'
import { Agent } from '@mastra/core/agent'

export const agent = new Agent({
  name: 'my-agent',
  instructions: 'You are a helpful assistant.',
  model: 'openai/gpt-5-mini',
  memory: new Memory({
    options: {
      observationalMemory: true,
    },
  }),
})

That's it. The agent now has humanlike long-term memory that persists across conversations. Setting observationalMemory: true uses google/gemini-2.5-flash by default. To use a different model or customize thresholds, pass a config object instead:

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'deepseek/deepseek-reasoner',
    },
  },
})

See configuration options for full API details.

note

OM currently only supports @mastra/pg, @mastra/libsql, and @mastra/mongodb storage adapters. It uses background agents for managing memory. When using observationalMemory: true, the default model is google/gemini-2.5-flash. When passing a config object, a model must be explicitly set.

Benefits

  • Prompt caching: OM's context is stable: observations append over time rather than being dynamically retrieved each turn. This keeps the prompt prefix cacheable, which reduces costs.
  • Compression: Raw message history and tool results get compressed into a dense observation log. Smaller context means faster responses and longer coherent conversations.
  • Zero context rot: The agent sees relevant information instead of noisy tool calls and irrelevant tokens, so the agent stays on task over long sessions.

How It Works

You don't remember every word of every conversation you've ever had. You observe what happened subconsciously, then your brain reflects, reorganizing, combining, and condensing it into long-term memory. OM works the same way.

Every time an agent responds, it sees a context window containing its system prompt, recent message history, and any injected context. The context window is finite, and even models with large token limits perform worse when the window is full. This causes two problems:

  • Context rot: the more raw message history an agent carries, the worse it performs.
  • Context waste: most of that history contains tokens no longer needed to keep the agent on task.

OM solves both problems by compressing old context into dense observations.

Observations

When message history tokens exceed a threshold (default: 30,000), the Observer creates observations, which are concise notes about what happened:

Date: 2026-01-15
- 🔴 12:10 User is building a Next.js app with Supabase auth, due in 1 week (meaning January 22nd 2026)
- 🔴 12:10 App uses server components with client-side hydration
- 🟡 12:12 User asked about middleware configuration for protected routes
- 🔴 12:15 User stated the app name is "Acme Dashboard"

Compression is typically 5–40×. The Observer also tracks a current task and suggested response so the agent picks up where it left off.

Example: an agent using Playwright MCP might see 50,000+ tokens per page snapshot. With OM, the Observer watches the interaction and creates a few hundred tokens of observations about what was on the page and what actions were taken. The agent stays on task without carrying every raw snapshot.
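
As a rough worked example of the 5–40× figure, the sketch below estimates how many observation tokens one batch of raw message tokens might compress to. The helper name `estimateObservationTokens` is hypothetical and not part of Mastra's API, and the ratios are the typical range quoted above rather than a measurement.

```typescript
// Hypothetical helper (not a Mastra API): estimates observation size for a
// batch of raw tokens, using the typical 5-40x compression range quoted above.
interface TokenRange {
  min: number; // strongest compression in the quoted range (40x)
  max: number; // weakest compression in the quoted range (5x)
}

function estimateObservationTokens(rawTokens: number): TokenRange {
  if (rawTokens < 0) throw new RangeError('rawTokens must be non-negative');
  return {
    min: Math.round(rawTokens / 40),
    max: Math.round(rawTokens / 5),
  };
}

// One default-sized batch of message history (30,000 tokens).
const batchEstimate = estimateObservationTokens(30_000);
console.log(batchEstimate); // { min: 750, max: 6000 }
```

Under these assumptions, a 30,000-token batch would leave somewhere between 750 and 6,000 tokens of observations in the context window.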

Reflections

When observations exceed their threshold (default: 40,000 tokens), the Reflector condenses them, combining related items and reflecting on patterns.

The result is a three-tier system:

  1. Recent messages: Exact conversation history for the current task
  2. Observations: A log of what the Observer has seen
  3. Reflections: Condensed observations when memory becomes too long
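
One way to picture the three tiers is as three lists that are concatenated into the context window. The sketch below is illustrative only: the `MemoryTiers` type, the `buildContext` function, and the ordering (most condensed material first, raw messages last) are assumptions, not Mastra's internal representation or prompt format.

```typescript
// Illustrative model only: these types and buildContext are assumptions,
// not Mastra's internal representation.
interface MemoryTiers {
  reflections: string[];    // condensed observations (most compressed)
  observations: string[];   // Observer log entries not yet condensed
  recentMessages: string[]; // exact recent conversation history
}

function buildContext(systemPrompt: string, tiers: MemoryTiers): string {
  // Assumed ordering: slow-changing material first so the prompt prefix
  // stays stable, raw recent messages last.
  return [
    systemPrompt,
    ...tiers.reflections,
    ...tiers.observations,
    ...tiers.recentMessages,
  ].join('\n');
}

const exampleContext = buildContext('You are a helpful assistant.', {
  reflections: ['User is building "Acme Dashboard" with Next.js and Supabase auth.'],
  observations: ['12:12 User asked about middleware for protected routes.'],
  recentMessages: ['user: How do I redirect unauthenticated users?'],
});
console.log(exampleContext);
```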

Models

The Observer and Reflector run in the background. Any model that works with Mastra's model routing (e.g. openai/..., google/..., deepseek/...) can be used.

When using observationalMemory: true, the default model is google/gemini-2.5-flash. When passing a config object, a model must be explicitly set.

We recommend google/gemini-2.5-flash: it works well for both observation and reflection, and its 1M token context window gives the Reflector headroom.

We've also tested deepseek, qwen3, and glm-4.7 for the Observer. For the Reflector, make sure the model's context window can fit all observations. Note that Claude 4.5 models currently don't work well as the Observer or Reflector.

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'deepseek/deepseek-reasoner',
    },
  },
})

See model configuration for using different models per agent.

Scopes

Thread scope (default)

Each thread has its own observations. This scope is well tested and works well as a general purpose memory system, especially for long horizon agentic use-cases.

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      scope: 'thread',
    },
  },
})

Thread scope requires a valid threadId to be provided when calling the agent. If threadId is missing, Observational Memory throws an error. This prevents multiple threads from silently sharing a single observation record, which can cause database deadlocks.

Resource scope (experimental)

Observations are shared across all threads for a resource (typically a user). This enables cross-conversation memory.

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      scope: 'resource',
    },
  },
})

Resource scope works, but it is marked experimental until we prove task adherence and continuity across multiple simultaneous threads. For now, you may need to adjust your system prompt to prevent one thread from continuing work that another thread started but has not finished.

This happens because, in resource scope, every thread sees the observations from all threads for the resource.

Depending on your use case, this may not be a problem.
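
To contrast the two scopes, the sketch below models which observation record a call would read and write. It is a hypothetical illustration, not Mastra's implementation: `observationRecordKey`, the key format, and the error messages are invented here, and the requirement that resource scope needs a `resourceId` is an assumption.

```typescript
// Hypothetical sketch of the scoping rule; not Mastra's implementation.
type OmScope = 'thread' | 'resource';

interface CallIds {
  threadId?: string;
  resourceId?: string;
}

// Returns the key of the observation record a call would read and write.
function observationRecordKey(scope: OmScope, ids: CallIds): string {
  if (scope === 'thread') {
    if (!ids.threadId) {
      // Mirrors the documented behavior: thread scope without a threadId is an error.
      throw new Error('thread scope requires a threadId');
    }
    return `thread:${ids.threadId}`;
  }
  if (!ids.resourceId) {
    // Assumption: resource scope needs a resourceId to group threads.
    throw new Error('resource scope requires a resourceId');
  }
  return `resource:${ids.resourceId}`;
}

// Two threads that belong to the same user.
const idsA: CallIds = { threadId: 't1', resourceId: 'user-1' };
const idsB: CallIds = { threadId: 't2', resourceId: 'user-1' };

console.log(observationRecordKey('thread', idsA) === observationRecordKey('thread', idsB));     // false
console.log(observationRecordKey('resource', idsA) === observationRecordKey('resource', idsB)); // true
```

In thread scope the two threads keep separate records, while in resource scope they share one, which is why work started in one thread can surface in another.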

warning

In resource scope, unobserved messages across all threads are processed together. For users with many existing threads, this can be slow. Use thread scope for existing apps.

Token Budgets

OM uses token thresholds to decide when to observe and reflect. See token budget configuration for details.

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      observation: {
        // when to run the Observer (default: 30,000)
        messageTokens: 30_000,
      },
      reflection: {
        // when to run the Reflector (default: 40,000)
        observationTokens: 40_000,
      },
      // let message history borrow from observation budget
      // requires bufferTokens: false (temporary limitation)
      shareTokenBudget: false,
    },
  },
})
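
The threshold logic can be sketched as a pair of comparisons. This is a simplified model under stated assumptions: `pendingMemoryWork` is a hypothetical helper rather than a Mastra API, "exceed" is read as strictly greater than, and the sketch ignores async buffering and `shareTokenBudget`.

```typescript
// Hypothetical decision helper; mirrors the two thresholds in the config
// above but is not part of Mastra's API.
interface TokenBudgets {
  messageTokens: number;     // Observer threshold (default 30,000)
  observationTokens: number; // Reflector threshold (default 40,000)
}

interface MemoryWork {
  observe: boolean; // message history is over its budget
  reflect: boolean; // observation log is over its budget
}

const defaultBudgets: TokenBudgets = {
  messageTokens: 30_000,
  observationTokens: 40_000,
};

function pendingMemoryWork(
  currentMessageTokens: number,
  currentObservationTokens: number,
  budgets: TokenBudgets = defaultBudgets,
): MemoryWork {
  return {
    // Assumption: "exceed" means strictly greater than the threshold.
    observe: currentMessageTokens > budgets.messageTokens,
    reflect: currentObservationTokens > budgets.observationTokens,
  };
}

console.log(pendingMemoryWork(31_000, 12_000)); // { observe: true, reflect: false }
console.log(pendingMemoryWork(8_000, 41_500));  // { observe: false, reflect: true }
```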

Async Buffering

Without async buffering, the Observer runs synchronously when the message threshold is reached: the agent pauses mid-conversation while the Observer LLM call completes. With async buffering (enabled by default), observations are pre-computed in the background as the conversation grows. When the threshold is hit, buffered observations activate instantly with no pause.

How it works

As the agent converses, message tokens accumulate. At regular intervals (bufferTokens), a background Observer call runs without blocking the agent. Each call produces a "chunk" of observations that's stored in a buffer.

When message tokens reach the messageTokens threshold, buffered chunks activate: their observations move into the active observation log, and the corresponding raw messages are removed from the context window. The agent never pauses.

Buffered observations also include continuation hints (a suggested next response and the current task) so the main agent maintains conversational continuity after activation shrinks the context window.

If the agent produces messages faster than the Observer can process them, a blockAfter safety threshold forces a synchronous observation as a last resort. Buffered activation still preserves a minimum remaining context (the smaller of ~1k tokens or the configured retention floor).

Reflection works similarly: the Reflector runs in the background when observations reach a fraction of the reflection threshold.
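
The observation side of this flow can be approximated with a toy simulation that counts how many background Observer calls run before buffered chunks activate. Everything in it is hypothetical (`simulateBuffering`, `BufferSimConfig`, and the field names), the Observer is modeled as finishing instantly, and the `blockAfter` fallback is left out.

```typescript
// Toy simulation of the buffering flow described above. All names are
// hypothetical and the background Observer is modeled as instantaneous.
interface BufferSimConfig {
  messageTokens: number; // activation threshold
  bufferEvery: number;   // absolute token interval between background Observer runs
}

interface BufferSimResult {
  chunksBuffered: number;         // background Observer calls made so far
  activatedAtTurn: number | null; // index of the turn that triggered activation
}

function simulateBuffering(turnTokens: number[], cfg: BufferSimConfig): BufferSimResult {
  let total = 0;          // message tokens accumulated so far
  let lastBufferedAt = 0; // token position covered by the latest buffered chunk
  let chunks = 0;

  for (let i = 0; i < turnTokens.length; i++) {
    total += turnTokens[i] ?? 0;
    // Each time another bufferEvery tokens accumulate, buffer one chunk.
    while (total - lastBufferedAt >= cfg.bufferEvery) {
      lastBufferedAt += cfg.bufferEvery;
      chunks++;
    }
    // Once the threshold is reached, the buffered chunks activate.
    if (total >= cfg.messageTokens) {
      return { chunksBuffered: chunks, activatedAtTurn: i };
    }
  }
  return { chunksBuffered: chunks, activatedAtTurn: null };
}

// Eight turns of 4,000 tokens each, buffering every 6,000 tokens.
const simResult = simulateBuffering(new Array<number>(8).fill(4_000), {
  messageTokens: 30_000,
  bufferEvery: 6_000,
});
console.log(simResult); // { chunksBuffered: 5, activatedAtTurn: 7 }
```

In this run, five chunks are already buffered when the eighth turn crosses 30,000 tokens, so activation only has to swap them in rather than wait for an Observer call.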

Settings

| Setting | Default | What it controls |
| --- | --- | --- |
| observation.bufferTokens | 0.2 | How often to buffer. 0.2 means every 20% of messageTokens; with the default 30k threshold, that's roughly every 6k tokens. Can also be an absolute token count (e.g. 5000). |
| observation.bufferActivation | 0.8 | How aggressively to clear the message window on activation. 0.8 means remove enough messages to keep only 20% of messageTokens remaining. Lower values keep more message history. |
| observation.blockAfter | 1.2 | Safety threshold as a multiplier of messageTokens. At 1.2, synchronous observation is forced at 36k tokens (1.2 × 30k). Only matters if buffering can't keep up. |
| reflection.bufferActivation | 0.5 | When to start background reflection. 0.5 means reflection begins when observations reach 50% of the observationTokens threshold. |
| reflection.blockAfter | 1.2 | Safety threshold for reflection, same logic as observation. |
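
To make the arithmetic behind these defaults concrete, the sketch below converts the fractional settings into absolute token counts. `resolveBufferThresholds` and the output field names are hypothetical, and treating `bufferTokens` values below 1 as fractions (and larger values as absolute counts) is an assumption about how the two forms are told apart.

```typescript
// Hypothetical helper: converts the fractional settings in the table into
// absolute token counts. Input field names mirror the table; the function
// itself is not part of Mastra's API.
interface BufferSettings {
  messageTokens: number;
  observationTokens: number;
  observation: { bufferTokens: number; bufferActivation: number; blockAfter: number };
  reflection: { bufferActivation: number; blockAfter: number };
}

const defaultBufferSettings: BufferSettings = {
  messageTokens: 30_000,
  observationTokens: 40_000,
  observation: { bufferTokens: 0.2, bufferActivation: 0.8, blockAfter: 1.2 },
  reflection: { bufferActivation: 0.5, blockAfter: 1.2 },
};

function resolveBufferThresholds(s: BufferSettings) {
  const { messageTokens, observationTokens, observation, reflection } = s;
  return {
    // Assumption: values below 1 are fractions of messageTokens,
    // larger values are absolute token counts.
    bufferEvery:
      observation.bufferTokens < 1
        ? Math.round(observation.bufferTokens * messageTokens)
        : observation.bufferTokens,
    // Message tokens left in the window after buffered chunks activate.
    retainedAfterActivation: Math.round((1 - observation.bufferActivation) * messageTokens),
    // Message token count that forces a synchronous observation.
    observationBlockAt: Math.round(observation.blockAfter * messageTokens),
    // Observation token count at which background reflection starts.
    reflectionStartAt: Math.round(reflection.bufferActivation * observationTokens),
    // Observation token count that forces a synchronous reflection.
    reflectionBlockAt: Math.round(reflection.blockAfter * observationTokens),
  };
}

console.log(resolveBufferThresholds(defaultBufferSettings));
// bufferEvery: 6000, retainedAfterActivation: 6000, observationBlockAt: 36000,
// reflectionStartAt: 20000, reflectionBlockAt: 48000
```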

Disabling

To disable async buffering and use synchronous observation/reflection instead:

const memory = new Memory({
  options: {
    observationalMemory: {
      model: 'google/gemini-2.5-flash',
      observation: {
        bufferTokens: false,
      },
    },
  },
})

Setting bufferTokens: false disables both observation and reflection async buffering. See async buffering configuration for the full API.

note

Async buffering is not supported with scope: 'resource'. It is automatically disabled in resource scope.

Migrating existing threads

No manual migration needed. OM reads existing messages and observes them lazily when thresholds are exceeded.

  • Thread scope: The first time a thread exceeds observation.messageTokens, the Observer processes the backlog.
  • Resource scope: All unobserved messages across all threads for a resource are processed together. For users with many existing threads, this could take significant time.

Viewing in Mastra Studio

Mastra Studio shows OM status in real time in the memory tab: token usage, which model is running, current observations, and reflection history.

Comparing OM with other memory features

  • Message history: High-fidelity record of the current conversation
  • Working memory: Small, structured state (JSON or markdown) for user preferences, names, goals
  • Semantic Recall: RAG-based retrieval of relevant past messages

If you're using working memory to store conversation summaries or ongoing state that grows over time, OM is a better fit. Working memory is for small, structured data; OM is for long-running event logs. OM also manages message history automatically: the messageTokens setting controls how much raw history remains before observation runs.

In practical terms, OM replaces both working memory and message history, and has greater accuracy (and lower cost) than Semantic Recall.