What are durable AI agents? How state persistence and checkpoints survive a restart

Put simply, an agent is a language model in a loop. Each response is fed back to the model as input, so it can react to a mistake it made, set new subtasks, and keep going without you giving additional instructions. Give it tools like web search, code execution, and file edits, and it can take a loose goal and grind on it for a while.

That "for a while" part is the basis of durable AI agents.

Let's break down what a durable or long-running agent means, what the challenges are in making such an agent, and how we implemented long-running AI agents in Mastra.

What is a long-running or durable AI agent?

A long-running agent makes progress on a goal across many sessions and many machines, over hours or days, recovering from failures and resuming from the exact point it stopped.

There are a few ways you can think of long-running as a concept.

An agent could execute for a long time. That could be multiple hours or even days.
It could think for a long time before responding to a query, which we saw when reasoning effort became a customizable option.
Then we have agents with persistent memory, where the agent picks up where you left off, keeps accumulating memories, and continues a task without asking for the same context over and over again.

It's not fair to call these agents chatbots anymore.

An AI travel agent, for instance, can plan the itinerary, wait for a manager to approve the expense, then finish the booking on a different server than the one it started on.

But even though it all sounds cool on paper, long-running agents have three problems:

Context is finite, so the loop eventually runs out of room.
State is not persistent, so a restart wipes everything the loop held in memory.
The agent cannot reliably check its own work, so it needs a record outside itself to know what actually happened.

What keeps an agent running for days?

An agent can run for days only when the part holding its memory is separate from the part doing the work. The durable setups being built now split the job three ways:

The model loop decides the next move. It keeps nothing of its own, so you can kill it and start a fresh one between steps.
The sandbox is where tools run and files get written. It can crash or be swapped anytime, because it holds no truth either.
The durable record is the log of what happened. It is the one source of truth that a fresh model loop reads to pick up where the last one stopped.

Any one part can die or get upgraded without ending the run, which makes a long-running agent less one program running for a week and more a relay of short-lived processes passing a record between them.
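A minimal sketch of that relay, assuming an in-memory array in place of real storage. DurableRecord, runOneStep, and runRelay are hypothetical names made up for illustration, not part of Mastra. The point is only that each worker reads the record, does one step, appends the result, and keeps nothing else.

```typescript
// Hypothetical sketch: none of these names come from Mastra or any library.
type LogEntry = { step: number; output: string };

// The durable record: an append-only log. An in-memory array stands in
// for a database table here.
class DurableRecord {
  private entries: LogEntry[] = [];

  append(entry: LogEntry): void {
    this.entries.push(entry);
  }

  read(): readonly LogEntry[] {
    return this.entries;
  }
}

// A short-lived worker. It holds no state of its own: it reads the record,
// does exactly one step, appends the result, and returns.
function runOneStep(record: DurableRecord, totalSteps: number): boolean {
  const done = record.read().length;
  if (done >= totalSteps) return false; // nothing left to do
  record.append({ step: done + 1, output: `result of step ${done + 1}` });
  return true;
}

// The relay: every iteration models a fresh process that knows only the record.
function runRelay(record: DurableRecord, totalSteps: number): number {
  let workersUsed = 0;
  while (runOneStep(record, totalSteps)) workersUsed++;
  return workersUsed;
}
```

If one worker finishes step 1 and then dies, calling runRelay on the same record continues at step 2, because the position in the run is derived from the record and not from anything the dead worker held.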

This is why the durable record is the most important part here. The rest of this piece is about that record: why context windows cannot be durable, what goes into the record, and how a fresh process reads it back.

Why is a context window not a persistence layer?

	Context window	Persistence layer
What it holds	recent tokens the model can attend to	every step, output, and pause for the run
How long it lasts	one model call	until you delete it
Survives a process restart	no	yes
What happens on overflow	older detail is summarized or dropped	nothing is dropped; the record is append-only

People reach for the context window as if it were the agent's memory. But it fails as a persistence layer because:

The window is finite and lossy. When it fills, the agent compacts history into a summary, and compaction throws away the detail the agent needs later.
A fresh session starts blank. The loop holds its progress in memory, so when the worker restarts, the in-memory loop comes back knowing nothing about the run that was halfway done.
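To make the difference concrete, here is a small sketch under loud assumptions: the compact function is a deliberately naive stand-in for real summarization, and ChatMessage, compact, and simulate are hypothetical names. It feeds the same messages into a bounded window and into an append-only log.

```typescript
// Hypothetical sketch: the "compaction" here is a naive placeholder.
type ChatMessage = { id: number; text: string };

// A context window with a fixed budget. On overflow, the oldest messages
// collapse into a single lossy summary line.
function compact(window: ChatMessage[], maxMessages: number): ChatMessage[] {
  if (window.length <= maxMessages) return window;
  const dropped = window.length - (maxMessages - 1);
  const summary: ChatMessage = { id: -1, text: `[summary of ${dropped} earlier messages]` };
  return [summary, ...window.slice(dropped)];
}

// Feed the same messages to both stores and return what each one kept.
function simulate(texts: string[], maxMessages: number) {
  const log: ChatMessage[] = []; // append-only record: never trimmed
  let context: ChatMessage[] = []; // what the model can still see
  texts.forEach((text, i) => {
    const msg: ChatMessage = { id: i + 1, text };
    log.push(msg);
    context = compact([...context, msg], maxMessages);
  });
  return { log, context };
}
```

With a budget of three messages and four inputs, the first message survives only in the log. If that message held an order ID the agent needs later, the window has already lost it.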

What are the three granularities of persistence?

The right way to think about persistence is to ask what the smallest unit of work is that you refuse to lose. I've found there to be three answers, depending on what the agent is doing.

	Turn-level	Step-level	Mid-step
Unit of progress	one interaction	one workflow step	a pause inside a step
What is stored	history and working memory	each step's recorded output	the suspended step's state and payload
What triggers a save	each message	each step completion	the step suspending
Typical use	chat assistant	API pipeline	approval gate
Lost on a crash	the current turn	the in-flight step only	nothing

1. Turn-level persistence keeps a conversation alive between messages

The unit is one turn, a single back-and-forth with the user, and progress is driven by the person sending the next message.

Between turns, the agent saves the conversation and what it has learned, so it still knows what you said three turns ago and what it picked up about you last week. Losing a turn is cheap, since you just say it again, which is why a plain chat assistant only needs this layer.

2. Step-level persistence records each completed step in a workflow

The unit here is one step inside a single run: a node in a workflow that the engine drives on its own, with no human waiting between steps. The moment a step finishes, its output is written down, so a crash three steps in restarts from the last good step instead of the top.

A single turn can kick off one of these runs, so step-level sits underneath turn-level rather than beside it. What makes it a different problem is the cost of losing a step, because a step that charged a card or called a paid API cannot simply be rerun. So an API pipeline that charges, then reserves inventory, then emails a receipt records each step as it completes and never re-charges on a retry.
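The charge-once behavior can be sketched as a lookup before every step. This is a simplified illustration, not how Mastra implements it: a Map stands in for the durable step table, and Step, runStep, and runPipeline are hypothetical names.

```typescript
// Hypothetical sketch: a Map stands in for the durable step table.
type StepFn = () => string;
type Step = { id: string; run: StepFn };

function runStep(record: Map<string, string>, step: Step): string {
  const recorded = record.get(step.id);
  if (recorded !== undefined) return recorded; // finished earlier: reuse, never rerun
  const output = step.run(); // the side effect happens here
  record.set(step.id, output); // write the result down immediately
  return output;
}

// Run every step in order. A retry calls this again with the same record.
function runPipeline(record: Map<string, string>, steps: Step[]): string[] {
  return steps.map((step) => runStep(record, step));
}
```

If the reserve step throws on the first run, a second call with the same record skips the charge and continues from the reservation. One gap remains in a sketch this small: a crash after the side effect but before the write would still rerun the step, which is why payment calls are usually also given an idempotency key.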

3. Mid-step persistence holds a pause inside a single step

The unit here is a pause inside one step that holds its own state and waits for input that might arrive days later, sitting at zero compute the entire time.

An approval gate needs this, because the run should cost nothing while a human decides, and it should wake up exactly where it parked when they finally click approve.
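A rough sketch of such a gate, assuming the parked step can be reduced to a JSON string. ParkedStep, suspendForApproval, and resumeFromApproval are hypothetical names, not Mastra's API. Returning a string models "written to storage, process exits"; nothing runs between the two calls.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not Mastra's API.
type ParkedStep = {
  stepId: string;
  status: "suspended";
  suspendPayload: { reason: string };
  state: { orderId: string; amountUsd: number }; // what the step needs when it wakes up
};
type Decision = { approved: boolean; approver: string };

// First half of the step: do the work up to the gate, then park.
function suspendForApproval(orderId: string, amountUsd: number): string {
  const parked: ParkedStep = {
    stepId: "human-approval",
    status: "suspended",
    suspendPayload: { reason: "expense needs manager approval" },
    state: { orderId, amountUsd },
  };
  return JSON.stringify(parked);
}

// Second half: any process, at any later time, wakes the step from the stored JSON.
function resumeFromApproval(stored: string, decision: Decision): string {
  const parked = JSON.parse(stored) as ParkedStep;
  const { orderId, amountUsd } = parked.state;
  return decision.approved
    ? `booked ${orderId} for $${amountUsd} (approved by ${decision.approver})`
    : `cancelled ${orderId} (rejected by ${decision.approver})`;
}
```

Because the resume half reads everything it needs from the stored string, it does not matter whether the decision arrives in five seconds or five days, or on which machine.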

How does a long-running agent survive a crash?

Addy Osmani put it well: the agent is amnesiac, but the filesystem is not.

To put that into practice, progress has to be written down in units. And the useful unit of progress is "the step." Record each completed step's result the moment it finishes, and a failure restarts from the last good step instead of the top.

Mastra writes that record as a checkpoint (more on that in the next section). But for context, a checkpoint is a snapshot of the whole run taken automatically on every pause and keyed by a run ID. It holds each step's status and recorded output, the payloads for any step that paused to wait on a human, and the run's original input, all as plain JSON so it reloads on any machine.

Recovery after the process dies is then almost boring. You have a run ID, and you load the run by it.

// Brand-new process: nothing from the original run is in memory.
const wf = mastra.getWorkflow('orderWorkflow');
const state = await wf.getWorkflowRunById(runId);
const reader = createWorkflowStateReader(state);

// Find the step the run is parked on, e.g. { stepId: 'human-approval', ... }.
const suspended = reader.getSuspendedStep();
if (!suspended) throw new Error(`Run ${runId} has no suspended step`);

// Bind a run to the SAME id, then hand it the human decision.
const run = await wf.createRun({ runId });
await run.resume({
  step: suspended.stepId,
  resumeData: { approved: true, approver: 'manager-jane' },
});

I ran the snippet above from a fresh Mastra instance that shared only the database file, with nothing from the first process in memory. The run came back suspended at the approval step, still holding the reservation computed before the pause, and the resume drove it to completion.

The agent didn't have to "remember" anything, because it could rebuild the full execution state up to that point from just the ID.

How does checkpointing work in Mastra, and what gets serialized?

Now let's go a little deeper into the checkpointing aspect of Mastra. A checkpoint is a serializable picture of a run's entire state, written automatically when the run suspends. There's no manual intervention in this case. The run suspends, the snapshot lands in a storage table, and the process is free to exit.

A snapshot captures the whole run

I ran an order workflow that reserves inventory, waits for a manager to approve, then ships, and suspended it at the approval. Reading the snapshot straight back out of storage by run ID, here is what it holds:

Per-step status, one of success, suspended, failed, waiting, running, or paused.
Each completed step's recorded output, so the reservation ID computed before the pause is sitting right there.
The execution path through the graph, as a serialized step graph.
The suspend and resume payloads for any paused step.
The remaining retry attempts for each step.
The run's original input and the shared workflow state.

The whole snapshot for that suspended run serialized to 1,059 bytes of JSON, with the per-step map reading reserve-inventory: success and human-approval: suspended, and a suspended-paths entry pointing at the approval step.
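To picture how those pieces fit together, here is a hypothetical TypeScript shape for such a snapshot. The field names, the sample values, and the ship-order step ID are assumptions made for illustration; this is not Mastra's actual storage schema.

```typescript
// Hypothetical shape for illustration only. Field names are assumptions,
// not Mastra's actual storage schema.
type StepStatus = "success" | "suspended" | "failed" | "waiting" | "running" | "paused";
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

interface StepEntry {
  status: StepStatus;
  output?: Json; // recorded result of a completed step
  suspendPayload?: Json; // why the step parked
  resumePayload?: Json; // the input that woke it up
  retriesLeft: number;
}

interface RunSnapshot {
  runId: string;
  input: Json; // the run's original input
  sharedState: Json; // workflow-level state
  stepGraph: string[]; // execution path, simplified to an ordered list
  steps: Record<string, StepEntry>;
  suspendedPaths: string[]; // which steps the run is parked on
}

// Made-up sample values, mirroring the order workflow described above.
const exampleSnapshot: RunSnapshot = {
  runId: "run-123",
  input: { sku: "widget", quantity: 2 },
  sharedState: {},
  stepGraph: ["reserve-inventory", "human-approval", "ship-order"],
  steps: {
    "reserve-inventory": { status: "success", output: { reservationId: "res-1" }, retriesLeft: 3 },
    "human-approval": { status: "suspended", suspendPayload: { reason: "needs manager approval" }, retriesLeft: 3 },
  },
  suspendedPaths: ["human-approval"],
};

// One line per step, in graph order; steps with no entry have not started.
function describeSteps(snapshot: RunSnapshot): string[] {
  return snapshot.stepGraph.map((id) => {
    const entry = snapshot.steps[id];
    return `${id}: ${entry ? entry.status : "not started"}`;
  });
}
```

Everything in the shape is a plain value, so the whole object survives a JSON round trip unchanged.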

Every step entry carries its own four timestamps

Mastra has a per-step record, which is interesting because it tells the engine what state a step is in without rerunning the step.

Let me explain. When the run suspends at a step, the entry's status reads suspended, and it carries a startedAt and a suspendedAt timestamp with a suspend payload describing why the step stopped.

After I resumed the step, that same entry reads:

{
  "status": "success",
  "startedAt": 1782382177053,
  "suspendedAt": 1782382177053,
  "resumedAt": 1782382177060,
  "endedAt": 1782382177060,
  "resumePayload": { "approved": true, "approver": "manager-jane" }
}

The four timestamps here, startedAt, suspendedAt, resumedAt, and endedAt, are a full audit trail for that single step, and the resume payload is the human decision, captured as plain data the moment it arrived.
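As a small worked example, the four timestamps are enough to split a step's lifetime into time spent working and time spent parked. The sketch below assumes the values are epoch milliseconds, which is what they look like; StepTimestamps and auditStep are hypothetical names.

```typescript
// Hypothetical helper; assumes the timestamps are epoch milliseconds.
interface StepTimestamps {
  startedAt: number;
  suspendedAt?: number;
  resumedAt?: number;
  endedAt?: number;
}

interface StepAudit {
  totalMs: number; // start to finish
  waitingMs: number; // parked, waiting for input
  activeMs: number; // everything else
}

function auditStep(t: StepTimestamps): StepAudit | null {
  if (t.endedAt === undefined) return null; // still running or still parked
  const waitingMs =
    t.suspendedAt !== undefined && t.resumedAt !== undefined
      ? t.resumedAt - t.suspendedAt
      : 0;
  const totalMs = t.endedAt - t.startedAt;
  return { totalMs, waitingMs, activeMs: totalMs - waitingMs };
}
```

For the entry above, resumedAt minus suspendedAt is 1782382177060 - 1782382177053 = 7 ms, and endedAt minus startedAt is also 7 ms, so essentially the whole lifetime of that step was spent waiting for the approval.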

The decision could've taken five days for all I know, and the agent would continue right where it left off.

The whole snapshot has to convert to JSON

Everything in the snapshot is serialized as JSON. So if a step needs a 50 MB document, the agent should only keep the document's ID in its state and fetch the actual document when it's required.

Keep step state small and made of plain values, and the snapshot stays small, fast to write, and safe to reload anywhere.
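One way to apply this, sketched with hypothetical names (StepState, loadDocument, isPlainJson) and a Map standing in for object storage: keep only the document ID in step state, fetch the body on demand, and check that the state is made of plain JSON values before it goes anywhere near a snapshot.

```typescript
// Hypothetical sketch: a Map stands in for object storage.
type StepState = { documentId: string; pageCount: number };

// Fetch the heavy payload only when a step actually needs it.
function loadDocument(state: StepState, store: Map<string, string>): string {
  const body = store.get(state.documentId);
  if (body === undefined) throw new Error(`document ${state.documentId} not found`);
  return body;
}

// True only for values that survive JSON serialization unchanged:
// null, strings, booleans, finite numbers, arrays, and plain objects.
function isPlainJson(value: unknown): boolean {
  if (value === null) return true;
  const kind = typeof value;
  if (kind === "string" || kind === "boolean") return true;
  if (kind === "number") return Number.isFinite(value as number);
  if (Array.isArray(value)) return value.every((item) => isPlainJson(item));
  if (kind === "object") {
    const proto = Object.getPrototypeOf(value);
    if (proto !== Object.prototype && proto !== null) return false; // Date, Map, class instances
    const obj = value as Record<string, unknown>;
    return Object.keys(obj).every((key) => isPlainJson(obj[key]));
  }
  return false; // undefined, function, symbol, bigint
}
```

A state like { documentId: "doc-123", pageCount: 40 } passes the check and serializes to a few dozen bytes. A state holding a Date, a Map, or the document body itself either fails the check or bloats every snapshot.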

Why can't a long-running agent trust its own work?

It's like letting a student grade their own papers. If you let an agent verify its own work, the agent is unlikely to find any mistakes, because it was the one that went through the decision-making process.

The way out is to separate the doing from the judging. A worker produces the result, and a judge, a different model or at least a fresh prompt holding the explicit done conditions, checks whether the work was actually done as per the requirements.

Cursor's coding agents work similarly. There are planners that emit tasks, workers that execute, and judges that verify.

This setup puts all the weight on the done conditions that you write before an agent starts working.

Checkpointing and verification reinforce each other here. The judge reads the worker's recorded output from the durable record and writes its verdict back into the same record, so the decision to advance or retry survives a restart along with everything else.
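A minimal sketch of that loop, assuming the record is a simple list of attempts. WorkerFn, JudgeFn, runWithJudge, and the confirmation-number done condition are all hypothetical, and in a real system both roles would be model calls instead of plain functions.

```typescript
// Hypothetical sketch: worker and judge are plain functions standing in
// for model calls, and an array stands in for the durable record.
type Verdict = { verdict: "pass" | "fail"; reason: string };
type Attempt = { output: string } & Verdict;
type WorkerFn = (attemptNumber: number, feedback: string | null) => string;
type JudgeFn = (output: string) => Verdict;

// An explicit done condition, written before the work starts.
const confirmationJudge: JudgeFn = (output) =>
  output.indexOf("confirmation #") !== -1
    ? { verdict: "pass", reason: "has a confirmation number" }
    : { verdict: "fail", reason: "missing confirmation number" };

function runWithJudge(
  record: Attempt[],
  worker: WorkerFn,
  judge: JudgeFn,
  maxAttempts: number,
): Attempt | null {
  // Start from however many attempts the record already holds, so a
  // restarted process neither redoes judged work nor forgets a verdict.
  for (let n = record.length; n < maxAttempts; n++) {
    const last = record[record.length - 1];
    if (last && last.verdict === "pass") return last;
    const output = worker(n + 1, last ? last.reason : null);
    record.push({ output, ...judge(output) }); // verdict is recorded with the output
  }
  const final = record[record.length - 1];
  return final && final.verdict === "pass" ? final : null;
}
```

Calling runWithJudge a second time with the same record returns the stored passing attempt without invoking the worker again, which is the restart behavior described above.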

The two states of a long-running AI agent

A long-running agent has to remember two different things.

Execution state answers "where in the work are we." It is the step status, the recorded output, and the pause payload in the checkpoint, scoped to one run and keyed by its run ID.

Memory answers "what does the agent know," and it is the conversation history and the facts the agent carries, scoped to a user and shared across every session that user has, so something learned in the app today is there over email next week.

They are not interchangeable, and a long-running agent needs both at once: the execution state to resume the task and the memory to stay the same agent across tasks.
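The difference in scope is easiest to see as two stores with two different keys. This is an illustrative sketch with hypothetical names (AgentStores, ExecutionState, UserMemory), using in-memory maps where a real setup would use storage tables.

```typescript
// Hypothetical sketch: two in-memory maps stand in for two storage tables.
type ExecutionState = { runId: string; userId: string; currentStep: string };
type UserMemory = { facts: string[] };

class AgentStores {
  private executionByRun = new Map<string, ExecutionState>(); // keyed by run ID
  private memoryByUser = new Map<string, UserMemory>(); // keyed by user ID

  startRun(runId: string, userId: string, firstStep: string): void {
    this.executionByRun.set(runId, { runId, userId, currentStep: firstStep });
  }

  remember(userId: string, fact: string): void {
    const memory = this.memoryByUser.get(userId) ?? { facts: [] };
    memory.facts.push(fact);
    this.memoryByUser.set(userId, memory);
  }

  // What a fresh process needs to continue one run: two lookups, two keys.
  load(runId: string): { currentStep: string; facts: string[] } | null {
    const execution = this.executionByRun.get(runId);
    if (!execution) return null;
    const memory = this.memoryByUser.get(execution.userId);
    return { currentStep: execution.currentStep, facts: memory ? memory.facts : [] };
  }
}
```

Two runs for the same user load different current steps but the same facts, which is the split the section describes: execution state belongs to the run, memory belongs to the user.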

Wrapping up

If I had to put it all together, I'd say a long-running agent is simply a durable record that short-lived processes take turns with.

The entire concept revolves around keeping the data out of the agentic loop. Because of this architecture:

A filled context window or a crash is just a reload, because the detail and the last good step are already in the record.
A premature "done" gets caught, because a judge writes its verdict to the record before the run moves on.
The agent stays itself across sessions, because what it knows lives in the record, not the process.

With Mastra you get execution, checkpointing, resuming runs, verification, and memory. You build the agent you need, and the architecture handles the rest behind the scenes.