Yes, you can use RAG for agent memory

Jul 17, 2025

At Mastra, we recently implemented the LongMemEval benchmark to measure, and then drastically improve, long-term memory (semantic recall) as well as working memory.

With the changes we shipped, we've reached state-of-the-art performance (80% accuracy) and pushed a number of improvements into a vNext version of memory.

We also (accidentally) spent $8k and burned a few billion tokens to run this benchmark.

Let’s dig in.

Backstory

We recently noticed that Mastra Memory was missing something crucial: systematic evaluation at scale.

The wake-up call came when we moved our "working memory" feature from the system prompt to user messages. The change seemed logical but we later discovered this "fix" created new problems.

Agents would randomly get stuck in loops calling tools multiple times, or worse, they'd respond directly to the injected memory as if it were a user message.

Before shipping additional fixes, we realized that we needed a way to measure conclusively whether new changes actually improve the accuracy of memory.

Enter LongMemEval

LongMemEval is a benchmark designed to test how well AI systems handle long-term memory across multiple conversations.

The benchmark dataset contains 500 questions. Each question has about 50 unique conversations attached to it, and one (or more) of those conversations contains the answer to the question.

Other companies like Zep have published their LongMemEval results.

One (Zep) went so far as to claim developers should "stop using RAG for agent memory" (spoiler: our results say otherwise). In fact, our results — using a general purpose framework — are better than theirs: 80% accuracy compared to 72%.

Iteration 0: Initial results

To recap, Mastra has two main memory features: working memory, where we remember characteristics about the user such as their name, age, and preferences, and semantic recall, a RAG-based feature where we store chat history in a vector DB and then retrieve messages based on semantic similarity to the user’s query.
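For reference, here's a minimal sketch of enabling both features on a Memory instance. The lastMessages and workingMemory options mirror the config shown later in this post; the semanticRecall shape is illustrative shorthand, so check the Mastra memory docs for the exact API:

import { Memory } from "@mastra/memory";

// Minimal sketch: both memory features enabled on one instance.
// The semanticRecall option shape below is illustrative shorthand.
const memory = new Memory({
  provider: myProvider,
  options: {
    lastMessages: 10, // recent history kept in context
    workingMemory: { enabled: true }, // persistent facts about the user
    semanticRecall: { topK: 2 }, // RAG over stored messages in a vector DB
  },
});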

Initial results using these features were... disappointing:

  • Working memory alone: 20% accuracy
  • Semantic recall: 65% accuracy
  • Combined: 67% accuracy

Iteration 1: Tailored Templates

While thinking through the problem, it occurred to us that working memory is designed to track specific information for your application. In our benchmark we were using generic working memory templates, which is not how the feature was designed to be used. We needed to know: are generic templates the reason long-term working memory performs so poorly?

To test this we added two new benchmark configurations to the list:

  1. Working Memory Tailored - Custom templates for each benchmark question
  2. Combined Tailored - Semantic recall + working memory with custom templates

We wrote a script to generate custom templates for each question type in the benchmark, passing the question (not the answer) into an LLM and asking it to output a template to track this type of information. Our intention was to simulate a developer crafting templates for their specific use case (e.g. a workout assistant that is told to track food intake, workout preferences, etc).
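As an illustration, the template-generation step looked roughly like the sketch below. This is a simplified reconstruction using the OpenAI Node SDK, not the exact script we ran; the prompt wording and function name are ours for illustration:

import OpenAI from "openai";

const openai = new OpenAI();

// Illustrative sketch: generate a tailored working memory template from a
// benchmark question (never the answer), simulating a developer writing a
// template for their specific use case.
async function generateTemplate(question: string): Promise<string> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content:
          "You design working memory templates for an AI assistant. " +
          "Given a user question, output a short markdown template listing the " +
          "kinds of information the assistant should track to answer questions " +
          "like it. Do not answer the question itself.",
      },
      { role: "user", content: question },
    ],
  });

  return response.choices[0].message.content ?? "";
}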

Result: "tailored" working memory increased the score from 20% to 35%. Progress!

Iteration 2: Smarter Memory Updates (The vNext Tool)

This result still wasn't good enough. We designed working memory for short-term recall, but it still seemed wrong that it would score so low for long-term recall.

When we ingest benchmark data into Mastra memory, conversations are processed one after another, from oldest to newest. This lets the agent build up working memory as each new conversation builds on the last, simulating what would happen in the real world.

I was surprised to discover that, when a conversation took a completely different direction from prior conversations, the agent would decide that the stored working memory was no longer relevant.

For simplicity, we designed our working memory tool to replace the entire contents of working memory every time the agent wanted to add new information. It worked fine during manual testing, but it was setting the LLM up for failure when it had to work across conversations spanning thousands of messages.

So we developed a more granular API for the updateWorkingMemory tool, requiring an updateReason as well as an optional searchString for more targeted updates.

In addition, we gave agents more fine-grained instructions, only allowing wholesale replacements when working memory doesn't span multiple conversations. If the agent tries to remove "irrelevant" information from a previous conversation, a message is returned to it explaining that it's only allowed to clarify existing memories or add new ones, and that the new data has been appended instead.

// Old approach - complete replacement every time
updateWorkingMemory({ newMemory: "User likes hiking" });

// vNext approach - contextual updates
updateWorkingMemory({
  newMemory: "User likes hiking",
  updateReason: "append-new-memory", // or "clarify-existing-memory" or "replace-irrelevant-memory"
  // searchString: "hobbies", // optional, for targeted updates
});

Result: Working memory hit 57.34% with tailored templates, and 72% when combined with semantic recall!

We are shipping this as vNext memory, which you can enable in the latest Mastra version like so:

import { Memory } from "@mastra/memory";

const memory = new Memory({
  provider: myProvider,
  options: {
    lastMessages: 10,
    workingMemory: {
      enabled: true,
      template: `Track user preferences, habits, and key personal details.
Remember: name, occupation, hobbies, dietary preferences, workout routines.`,
      version: "vnext", // Enable the improved/experimental tool
    },
  },
});

Iteration 3: Time is Hard

At this point we were on par with Zep (with a specific memory configuration), a pretty good result. However, going over the data one last time, we realized a specific category, temporal-reasoning, was scoring abnormally low.

We started reading through prepared data, questions, and answers and found two unexpected behaviors related to time:

  1. All messages were being inserted in memory with current timestamps instead of their original dates from the benchmark data
  2. The agent thought "today" was actually when we ran the benchmark rather than using the question_date field from LongMemEval

The fix required correcting the timestamps as well as adding the question date to the system prompt: Today's date is ${question_date}.
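In sketch form, the two fixes looked something like this (conversation, saveMessage, and question_date are hypothetical placeholders, not Mastra's actual ingestion API):

// 1. Preserve the original timestamps from the benchmark data when saving
//    messages, instead of stamping them with the time of the benchmark run.
//    saveMessage is a hypothetical helper standing in for our ingestion code.
for (const message of conversation.messages) {
  await saveMessage({
    ...message,
    createdAt: new Date(message.originalTimestamp), // not new Date()
  });
}

// 2. Anchor the agent's notion of "today" to the benchmark's question date.
const instructions = `You are a helpful assistant.
Today's date is ${question_date}.`;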

Fixing these brought combined-tailored to 74%.

Iteration 4: Better Formatting Breakthrough

Now we were dying to know: if this improvement was so easy to find/fix, perhaps there were other problems we were overlooking. Could our score go even higher?

We looked through some failed benchmark questions where the answer was in context and the LLM should have known it, and started examining the data structure.

Recalled messages were being presented to the LLM as a flat list sorted by date, with no additional annotations or context.

Hoping we could help the LLM reason better about time, we restructured the semantic recall message formatting to (see the sketch after this list):

  1. Group recalled messages by "Year, Month, Day"
  2. Add time labels to each message (e.g., "2:19 PM")
  3. Clarify if each message was from the current conversation or from a previous, separate conversation
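Here's an illustrative sketch of that formatting; the function and field names are ours, and the real Mastra output differs in its details:

// Illustrative sketch of the grouped recall formatting (not Mastra's exact output).
type RecalledMessage = {
  role: "user" | "assistant";
  content: string;
  createdAt: Date;
  fromCurrentConversation: boolean;
};

function formatRecalledMessages(messages: RecalledMessage[]): string {
  // 1. Group recalled messages by "Year, Month, Day".
  const byDay = new Map<string, RecalledMessage[]>();
  for (const message of messages) {
    const day = message.createdAt.toLocaleDateString("en-US", {
      year: "numeric",
      month: "long",
      day: "numeric",
    });
    byDay.set(day, [...(byDay.get(day) ?? []), message]);
  }

  return [...byDay.entries()]
    .map(([day, msgs]) => {
      const lines = msgs.map((m) => {
        // 2. Add a time label to each message, e.g. "2:19 PM".
        const time = m.createdAt.toLocaleTimeString("en-US", {
          hour: "numeric",
          minute: "2-digit",
        });
        // 3. Clarify whether the message came from the current conversation
        //    or from a previous, separate one.
        const source = m.fromCurrentConversation
          ? "current conversation"
          : "previous conversation";
        return `  [${time}, ${source}] ${m.role}: ${m.content}`;
      });
      return `${day}:\n${lines.join("\n")}`;
    })
    .join("\n\n");
}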

So what happened?

The Results: RAG Is Very Much Alive

With improved formatting and different topK values for semantic recall:

  • topK 2: 63.41% (default setting)
  • topK 5: 73.98%
  • topK 10: 78.59%
  • topK 20: 80%

We hit 80% with semantic recall (RAG) alone, 8 points higher than the Zep result behind the claim that RAG doesn't work for agent memory! Even with a low topK of 5, our result is still higher and one of the best reported results.
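If you want to trade more recalled context for accuracy, topK is a memory configuration knob. Here's a sketch of bumping it (the exact semanticRecall option shape may differ from what's shown; see the Mastra docs):

import { Memory } from "@mastra/memory";

// Sketch: a higher topK recalls more messages per query, which raised our
// benchmark accuracy from 63% (topK 2) to 80% (topK 20) at the cost of more
// tokens per request. Option shape is illustrative; check the Mastra docs.
const memory = new Memory({
  provider: myProvider,
  options: {
    semanticRecall: {
      topK: 20, // number of semantically similar messages to retrieve
    },
  },
});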

📊 Final Benchmark Results

Semantic Recall

Dataset: longmemeval_s
Model: gpt-4o
Memory Config: semantic-recall topK 2
────────────────────────────────────────────────────
Accuracy by Question Type:
  knowledge-update
    52.6% [███████████░░░░░░░░░] (41/78)
  multi-session
    27.1% [█████░░░░░░░░░░░░░░░] (36/133)
  single-session-assistant
    96.4% [███████████████████░] (54/56)
  single-session-preference
    73.3% [███████████████░░░░░] (22/30)
  single-session-user
    95.7% [███████████████████░] (67/70)
  temporal-reasoning
    35.3% [███████░░░░░░░░░░░░░] (47/133)

Overall Accuracy: 63.41%

Dataset: longmemeval_s
Model: gpt-4o
Memory Config: semantic-recall topK 5
────────────────────────────────────────────────────
Accuracy by Question Type:
  knowledge-update
    75.6% [███████████████░░░░░] (59/78)
  multi-session
    55.6% [███████████░░░░░░░░░] (74/133)
  single-session-assistant
    100.0% [████████████████████] (56/56)
  single-session-preference
    56.7% [███████████░░░░░░░░░] (17/30)
  single-session-user
    94.3% [███████████████████░] (66/70)
  temporal-reasoning
    61.7% [████████████░░░░░░░░] (82/133)

Overall Accuracy: 73.98%

Dataset: longmemeval_s
Model: gpt-4o
Memory Config: semantic-recall topK 10
────────────────────────────────────────────────────
Accuracy by Question Type:
  knowledge-update
    85.9% [█████████████████░░░] (67/78)
  multi-session
    71.4% [██████████████░░░░░░] (95/133)
  single-session-assistant
    100.0% [████████████████████] (56/56)
  single-session-preference
    50.0% [██████████░░░░░░░░░░] (15/30)
  single-session-user
    94.3% [███████████████████░] (66/70)
  temporal-reasoning
    69.9% [██████████████░░░░░░] (93/133)

Overall Accuracy: 78.59%

Dataset: longmemeval_s
Model: gpt-4o
Memory Config: semantic-recall topK 20
────────────────────────────────────────────────────
Accuracy by Question Type:
  knowledge-update
    84.6% [█████████████████░░░] (66/78)
  multi-session
    76.7% [███████████████░░░░░] (102/133)
  single-session-assistant
    100.0% [████████████████████] (56/56)
  single-session-preference
    46.7% [█████████░░░░░░░░░░░] (14/30)
  single-session-user
    97.1% [███████████████████░] (68/70)
  temporal-reasoning
    75.2% [███████████████░░░░░] (100/133)

Overall Accuracy: 80.05%

Working Memory

These results are shared for reference, but it’s clear now there’s no reason to use working memory to improve long-term memory performance. Working memory is a context engineering tool that guides and personalizes agent behavior.

Dataset: longmemeval_s
Model: gpt-4o
Memory Config: working-memory
────────────────────────────────────────────────────
Accuracy by Question Type:
  knowledge-update
    53.8% [███████████░░░░░░░░░] (42/78)
  multi-session
    24.8% [█████░░░░░░░░░░░░░░░] (33/133)
  single-session-assistant
    26.8% [█████░░░░░░░░░░░░░░░] (15/56)
  single-session-preference
    23.3% [█████░░░░░░░░░░░░░░░] (7/30)
  single-session-user
    54.3% [███████████░░░░░░░░░] (38/70)
  temporal-reasoning
    29.3% [██████░░░░░░░░░░░░░░] (39/133)

Overall Accuracy: 35.40%

Dataset: longmemeval_s
Model: gpt-4o
Memory Config: working-memory-tailored
────────────────────────────────────────────────────
Accuracy by Question Type:
  knowledge-update
    67.9% [██████████████░░░░░░] (53/78)
  multi-session
    47.4% [█████████░░░░░░░░░░░] (63/133)
  single-session-assistant
    67.9% [██████████████░░░░░░] (38/56)
  single-session-preference
    36.7% [███████░░░░░░░░░░░░░] (11/30)
  single-session-user
    82.9% [█████████████████░░░] (58/70)
  temporal-reasoning
    41.4% [████████░░░░░░░░░░░░] (55/133)

Overall Accuracy: 57.34%

Latency

You might wonder why we focused on accuracy but not latency. As a framework, Mastra supports multiple storage and vector database backends like PostgreSQL, Pinecone, ChromaDB, and LibSQL, each with its own latency characteristics.

Here's what happens during a typical semantic recall (RAG) operation in Mastra:

  1. Vector search for relevant messages (1 request)
  2. Retrieve full messages from storage (1 request)
  3. Pass results to the LLM (1 request)
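Conceptually, that round trip looks something like the sketch below. The dependency names (vectorSearch, getMessagesByIds, generate) are hypothetical stand-ins for whichever vector DB, storage backend, and LLM client you've configured:

// Hypothetical sketch of one semantic recall round trip; not Mastra's internals.
type RecallDeps = {
  vectorSearch: (query: string, topK: number) => Promise<{ id: string }[]>;
  getMessagesByIds: (ids: string[]) => Promise<string[]>;
  generate: (args: { system: string; prompt: string }) => Promise<string>;
};

async function recallAndRespond(deps: RecallDeps, userQuery: string, topK: number) {
  // 1. Vector search for relevant messages (one request to the vector DB).
  const hits = await deps.vectorSearch(userQuery, topK);

  // 2. Retrieve the full messages from storage (one request to the storage backend).
  const messages = await deps.getMessagesByIds(hits.map((h) => h.id));

  // 3. Pass the recalled context plus the user's query to the LLM (one request).
  return deps.generate({
    system: `Relevant messages from memory:\n${messages.join("\n")}`,
    prompt: userQuery,
  });
}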

This puts us in a similar latency range as Zep in terms of the number of requests we need to make to achieve the result.

Other solutions have achieved higher scores at the cost of doubling the number of API calls in order to rerank. Our accuracy rates come without those additional round trips, but we may explore re-ranking in the future to increase accuracy (and we’ll update the benchmarks accordingly).

Sidenote: how we spent $8k to run these benchmarks

Did I accidentally spend $8k on OpenAI credits in 3 days running memory benchmarks? Yes. Did I burn through 3.8 BILLION tokens from a single laptop? Also yes.

I wrote a script to squeeze out as many tokens per second as I could so we could compare an exhaustive list of different memory configurations. I ran the data preparation and benchmark steps hundreds of times over a few days while debugging and testing various theories, and that cost way more than I thought it would.

So we're working to make sure these benchmark runs are extremely cheap. Why? Because we want to run them on every memory-related code change. The only way to do that sustainably is to drive costs down while maintaining (or preferably increasing) accuracy. This creates a powerful alignment: our engineering incentive to reduce memory benchmark costs directly translates to better, more cost-effective memory features for you. When we optimize to save ourselves money on testing, you benefit from those same optimizations in production.

We're Just Getting Started

Now that we can measure, and we know RAG is this effective for long-term agent memory, there is a lot we can do to continue improving it. We're exploring features like a maxTokens setting for semantic recall, GraphRAG, conversation summaries, and other types of long-term memory like episodic and archival memories.

Try It Yourself

The benchmark setup is available at github.com/mastra-ai/mastra/tree/main/explorations/longmemeval. Note: the prepared data and results files aren't available in the GitHub repo due to file size, but we'll be sharing them soon.
