Agent Memory

Agents in Mastra have a sophisticated memory system that stores conversation history and contextual information. This memory system supports both traditional message storage and vector-based semantic search, enabling agents to maintain state across interactions and retrieve relevant historical context.

Threads and Resources

In Mastra, you can organize conversations by a thread_id. This allows the system to maintain context and retrieve historical messages that belong to the same discussion.

Mastra also supports the concept of a resource_id, which typically represents the user involved in the conversation, ensuring that the agent’s memory and context are correctly associated with the right entity.

This separation allows you to manage multiple conversations (threads) for a single user or even share conversation context across users if needed.

import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
 
const agent = new Agent({
  name: "Project Manager",
  instructions: "You are a project manager. You are responsible for managing the project and the team.",
  model: openai("gpt-4o-mini"),
});
 
await agent.stream("When will the project be completed?", {
  threadId: "project_123",
  resourceId: "user_123",
});
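For example, the same resource can own several independent threads. A minimal sketch using the agent above (the thread IDs here are hypothetical):

// Two separate conversations for the same user: each thread keeps its own history.
await agent.stream("Kick off the website redesign.", {
  threadId: "project_redesign", // hypothetical thread ID
  resourceId: "user_123",
});
 
await agent.stream("What's blocking the mobile launch?", {
  threadId: "project_mobile", // a different thread for the same user
  resourceId: "user_123",
});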

Managing Conversation Context

The key to getting good responses from LLMs is feeding them the right context.

Mastra has a Memory API that stores and manages conversation history and contextual information, persisting it through a configurable storage backend (more on this later).

The Memory API uses two main mechanisms to maintain context in conversations: recent message history and semantic search.

Recent Message History

By default, Memory keeps track of the 40 most recent messages in a conversation. You can customize this with the lastMessages setting:

const memory = new Memory({
  options: {
    lastMessages: 5, // Keep 5 most recent messages
  },
});
 
// When the user asks this question, the agent will see the last 10 messages.
await agent.stream("Can you summarize the search feature requirements?", {
  memoryOptions: {
    lastMessages: 10,
  },
});

Semantic Search

Semantic search is enabled by default in Mastra. While FastEmbed (bge-small-en-v1.5) and LibSQL are included by default, you can use any embedder (like OpenAI or Cohere) and vector database (like PostgreSQL, Pinecone, or Chroma) that fits your needs.

This allows your agent to find and recall relevant information from earlier in the conversation:

const memory = new Memory({
  options: {
    semanticRecall: {
      topK: 10, // Include 10 most relevant past messages
      messageRange: 2, // Messages before and after each result
    },
  },
});
 
// Example: User asks about a past feature discussion
await agent.stream("What did we decide about the search feature last week?", {
  memoryOptions: {
    lastMessages: 10,
    semanticRecall: {
      topK: 3,
      messageRange: 2,
    },
  },
});

When semantic search is used:

  1. The message is converted to a vector embedding
  2. Similar messages are found using vector similarity search
  3. Surrounding context is included based on messageRange
  4. All relevant context is provided to the agent
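As a rough sketch of how these settings combine (an approximation based on the comments above, not an exact guarantee): each match is expanded by messageRange messages on both sides, so topK matches can contribute up to topK * (1 + 2 * messageRange) historical messages.

// With topK: 3 and messageRange: 2, as in the per-call options above:
const topK = 3;
const messageRange = 2;
const maxRecalled = topK * (1 + 2 * messageRange); // up to 15 messages,
// fewer if ranges overlap or a match sits near the start or end of the thread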

You can also customize the vector database and embedder:

import { openai } from '@ai-sdk/openai';
import { PgVector } from '@mastra/pg';
 
const memory = new Memory({
  // Use a different vector database (libsql is default)
  vector: new PgVector({
    connectionString: "postgresql://user:pass@localhost:5432/db",
  }),
  // Or a different embedder (fastembed is default)
  embedder: openai.embedding('text-embedding-3-small'),
});

Memory Configuration

The Mastra memory system is highly configurable and supports multiple storage backends. By default, it uses LibSQL for storage and vector search, and FastEmbed for embeddings.

Basic Configuration

For most use cases, you can use the default configuration:

import { Memory } from "@mastra/memory";
 
const memory = new Memory();
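To put the default configuration to work, attach it to an agent, following the same pattern used later in this guide (the agent name and instructions here are only placeholders):

import { openai } from "@ai-sdk/openai";
import { Agent } from "@mastra/core/agent";
import { Memory } from "@mastra/memory";
 
const agent = new Agent({
  name: "Support Agent", // placeholder
  instructions: "You are a helpful support agent.",
  model: openai("gpt-4o-mini"),
  memory: new Memory(), // default LibSQL storage, LibSQL vector search, and FastEmbed embeddings
});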

Custom Configuration

For more control, you can customize the storage backend, vector database, and memory options:

import { Memory } from "@mastra/memory";
import { PostgresStore, PgVector } from "@mastra/pg";
 
const memory = new Memory({
  storage: new PostgresStore({
    host: "localhost",
    port: 5432,
    user: "postgres",
    database: "postgres",
    password: "postgres",
  }),
  vector: new PgVector({
    connectionString: "postgresql://user:pass@localhost:5432/db",
  }),
  options: {
    // Number of recent messages to include (false to disable)
    lastMessages: 10,
    // Configure vector-based semantic search
    semanticRecall: {
      topK: 3, // Number of semantic search results
      messageRange: 2, // Messages before and after each result
    },
  },
});

Overriding Memory Settings

When you initialize a Mastra instance with memory configuration, all agents will automatically use these memory settings when you call their stream() or generate() methods. You can override these default settings for individual calls:

// Use default memory settings from Memory configuration
const response1 = await agent.generate("What were we discussing earlier?", {
  resourceId: "user_123",
  threadId: "thread_456",
});
 
// Override memory settings for this specific call
const response2 = await agent.generate("What were we discussing earlier?", {
  resourceId: "user_123",
  threadId: "thread_456",
  memoryOptions: {
    lastMessages: 5, // Only inject 5 recent messages
    semanticRecall: {
      topK: 2, // Only get 2 semantic search results
      messageRange: 1, // Context around each result
    },
  },
});

Configuring Memory for Different Use Cases

You can adjust memory settings based on your agent’s needs:

// Customer support agent with minimal context
await agent.stream("What are your store hours?", {
  threadId,
  resourceId,
  memoryOptions: {
    lastMessages: 5, // Quick responses need minimal conversation history
    semanticRecall: false, // no need to search through earlier messages
  },
});
 
// Project management agent with extensive context
await agent.stream("Update me on the project status", {
  threadId,
  resourceId,
  memoryOptions: {
    lastMessages: 50, // Maintain longer conversation history across project discussions
    semanticRecall: {
      topK: 5, // Find more relevant project details
      messageRange: 3, // Number of messages before and after each result
    },
  },
});

Storage Options

Mastra currently supports several storage backends:

LibSQL Storage

import { DefaultStorage } from "@mastra/core/storage";
 
const storage = new DefaultStorage({
  config: {
    url: "file:example.db",
  },
});

PostgreSQL Storage

import { PostgresStore } from "@mastra/pg";
 
const storage = new PostgresStore({
  host: "localhost",
  port: 5432,
  user: "postgres",
  database: "postgres",
  password: "postgres",
});

Upstash KV Storage

import { UpstashStore } from "@mastra/upstash";
 
const storage = new UpstashStore({
  url: "http://localhost:8089",
  token: "your_token",
});

Vector Search

Mastra supports semantic search through vector embeddings. When configured with a vector store, agents can find relevant historical messages based on semantic similarity. To enable vector search:

  1. Configure a vector store (LibSQL is used by default; PostgreSQL is shown here):

import { PgVector } from "@mastra/pg";
 
const vector = new PgVector({
  connectionString: "postgresql://user:pass@localhost:5432/db",
});
 
const memory = new Memory({ vector });

  2. Configure embedding options:

const memory = new Memory({
  vector,
  embedder: openai.embedding("text-embedding-3-small"),
});

  3. Enable vector search in memory configuration options:

const memory = new Memory({
  vector,
  embedder,
 
  options: {
    semanticRecall: {
      topK: 3, // Number of similar messages to find
      messageRange: 2, // Context around each result
    },
  },
});

Using Memory in Agents

Once configured, the memory system is automatically used by agents. Here’s how to use it:

import { Mastra } from "@mastra/core";
 
// Initialize Mastra with memory
const mastra = new Mastra({
  agents: { myAgent },
  memory,
});
 
// Memory is automatically used in agent interactions
const response = await myAgent.generate(
  "What were we discussing earlier about performance?",
  {
    resourceId: "user_123",
    threadId: "thread_456",
  },
);

The memory system will automatically:

  1. Store all messages in the configured storage backend
  2. Create vector embeddings for semantic search (if configured)
  3. Inject relevant historical context into new conversations
  4. Maintain conversation threads and context

Manually Managing Threads

While threads are automatically managed when using agent methods, you can also manually manage threads using the memory API directly. This is useful for advanced use cases like:

  • Creating threads before starting conversations
  • Managing thread metadata
  • Explicitly saving or retrieving messages
  • Cleaning up old threads

Here’s how to manually work with threads:

import { Memory } from "@mastra/memory";
import { PostgresStore } from "@mastra/pg";
 
// Initialize memory
const memory = new Memory({
  storage: new PostgresStore({
    host: "localhost",
    port: 5432,
    user: "postgres",
    database: "postgres",
    password: "postgres",
  }),
});
 
// Create a new thread
const thread = await memory.createThread({
  resourceId: "user_123",
  title: "Project Discussion",
  metadata: {
    project: "mastra",
    topic: "architecture",
  },
});
 
// Manually save messages to a thread
await memory.saveMessages({
  messages: [
    {
      id: "msg_1",
      threadId: thread.id,
      role: "user",
      content: "What's the project status?",
      createdAt: new Date(),
      type: "text",
    },
  ],
});
 
// Get messages from a thread with various filters
const messages = await memory.query({
  threadId: thread.id,
  selectBy: {
    last: 10, // Get last 10 messages
    vectorSearchString: "performance", // Find messages about performance
  },
});
 
// Get thread by ID
const existingThread = await memory.getThreadById({
  threadId: "thread_123",
});
 
// Get all threads for a resource
const threads = await memory.getThreadsByResourceId({
  resourceId: "user_123",
});
 
// Update thread metadata
await memory.updateThread({
  id: thread.id,
  title: "Updated Project Discussion",
  metadata: {
    status: "completed",
  },
});
 
// Delete a thread and all its messages
await memory.deleteThread(thread.id);

Note that in most cases, you won’t need to manage threads manually since the agent’s generate() and stream() methods handle thread management automatically. Manual thread management is primarily useful for advanced use cases or when you need more fine-grained control over the conversation history.

Working Memory

Working memory is a powerful feature that allows agents to maintain persistent information across conversations, even with minimal context. This is particularly useful for remembering user preferences, personal details, or any other contextual information that should persist throughout interactions.

Inspired by the working memory concept from the MemGPT whitepaper, our implementation improves upon it in several key ways:

  • No extra roundtrips or tool calls required
  • Full support for streaming messages
  • Seamless integration with the agent’s natural response flow

How It Works

Working memory operates through a system of XML tags and automatic updates:

  1. Template Structure: Define what information should be remembered using XML tags. The Memory class comes with a comprehensive default template for user information, or you can create your own template to match your specific needs.

  2. Automatic Updates: The Memory class injects special instructions into the agent’s system prompt that tell it to:

    • Store relevant information by including <working_memory>...</working_memory> tags in its responses
    • Update information proactively when anything changes
    • Maintain the XML structure while updating values
    • Keep this process invisible to users

  3. Memory Management: The system:

    • Extracts working memory blocks from agent responses
    • Stores them for future use
    • Injects working memory into the system prompt on the next agent call

The agent is instructed to be proactive about storing information: if there is any doubt about whether something might be useful later, it should be stored. This helps maintain conversation context even when using very small context windows.

Basic Usage

import { openai } from "@ai-sdk/openai";
 
const agent = new Agent({
  name: "Customer Service",
  instructions:
    "You are a helpful customer service agent. Remember customer preferences and past interactions.",
  model: openai("gpt-4o-mini"),
 
  memory: new Memory({
    storage: new DefaultStorage({ config: { url: ":memory:" } }),
    options: {
      workingMemory: {
        enabled: true, // enables working memory
      },
      lastMessages: 5, // Only keep recent context
    },
  }),
});

Working memory becomes particularly powerful when combined with specialized system prompts. For example, you could create a TODO list manager that maintains state even though it only has access to the previous message:

const todoAgent = new Agent({
  name: "TODO Manager",
  instructions:
    "You are a TODO list manager. Update the todo list in working memory whenever tasks are added, completed, or modified.",
  model: openai("gpt-4o-mini"),
  memory: new Memory({
    storage: new DefaultStorage({ config: { url: ":memory:" } }),
    options: {
      workingMemory: {
        enabled: true,
 
        // Optional XML-like template that encourages the agent to store specific kinds of info.
        // If you leave this out, a default template will be used.
        template: `<todos>
  <in-progress></in-progress>
  <pending></pending>
  <completed></completed>
</todos>`,
      },
      lastMessages: 1, // Only keep the last message in context
    },
  }),
});
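A minimal sketch of how this plays out across turns (the thread and resource IDs are hypothetical): even though only the last message is kept in context, the todo list survives because it lives in working memory rather than in message history.

// First turn: the agent records the new task in its <todos> working memory.
await todoAgent.stream("Add 'write release notes' to my list.", {
  threadId: "todos_123", // hypothetical IDs
  resourceId: "user_123",
});
 
// A later turn: only the previous message is in context (lastMessages: 1),
// but the injected working memory still contains the full list.
await todoAgent.stream("What's still pending?", {
  threadId: "todos_123",
  resourceId: "user_123",
});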

Handling Memory Updates in Streaming

When an agent responds, it includes working memory updates directly in its response stream. These updates appear as XML blocks in the text:

// Raw agent response stream:
Let me help you with that! <working_memory><user><first_name>John</first_name>...</user></working_memory> Based on your question...

To prevent these memory blocks from being visible to users while still allowing the system to process them, use the maskStreamTags utility:

import { maskStreamTags } from "@mastra/core/utils";
 
// Basic usage - just mask the working_memory tags
for await (const chunk of maskStreamTags(
  response.textStream,
  "working_memory",
)) {
  process.stdout.write(chunk);
}
 
// Without masking: "Let me help you! <working_memory>...</working_memory> Based on..."
// With masking: "Let me help you! Based on..."

You can also hook into memory update events:

const maskedStream = maskStreamTags(response.textStream, "working_memory", {
  onStart: () => showLoadingSpinner(),
  onEnd: () => hideLoadingSpinner(),
  onMask: (chunk) => console.debug(chunk),
});

The maskStreamTags utility:

  • Removes content between specified XML tags in a streaming response
  • Optionally provides lifecycle callbacks for memory updates
  • Handles tags that might be split across stream chunks