For the last decade as an engineer, I’ve been deep in the weeds of distributed systems and platform engineering, building clusters on AWS, GCP, and Azure to reliably run highly variable PaaS workloads.
Now, as I’m building Mastra, one thing is clear to me. As AI agents become increasingly central to modern software systems, we're facing a new challenge.
How do we apply the last decade of DevOps to manage, deploy, and operate AI agents at scale?
Just as DevOps transformed how we build and run traditional software, we’re learning how to reliably run AI agents in production.
More autonomy, more problems
Traditional DevOps practices were built around (mostly) deterministic systems - code that behaves the same way given the same inputs.
But AI agents introduce new layers of complexity: they learn, adapt, and make decisions autonomously. This fundamental difference requires us to rethink our operational practices.
Consider a simple deployment pipeline. In a traditional DevOps setup, we can test specific code paths and can be reasonably confident about how our application will behave.
With AI agents, we need to validate not just code paths but decision-making patterns, interaction models, and learning behaviors.
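To make that contrast concrete, here's a minimal sketch in TypeScript (the `runAgent` and `scoreResponse` functions are hypothetical stand-ins, not any particular framework's API): the deterministic path gets a single exact assertion, while the agent gets a sampled, threshold-based check.

```typescript
// Traditional code path: same input, same output, so one assertion is enough.
function applyDiscount(cents: number): number {
  return Math.round(cents * 0.9);
}
console.assert(applyDiscount(1000) === 900, "deterministic path failed");

// Agent behavior: sample many runs and assert on the distribution instead.
// `runAgent` and `scoreResponse` are hypothetical stand-ins for your agent
// call and your eval/grader.
async function checkAgentBehavior(
  runAgent: (input: string) => Promise<string>,
  scoreResponse: (output: string) => number, // 0..1 quality score
  input: string,
  samples = 20,
  minMeanScore = 0.8,
): Promise<boolean> {
  let total = 0;
  for (let i = 0; i < samples; i++) {
    total += scoreResponse(await runAgent(input));
  }
  return total / samples >= minMeanScore; // pass/fail is statistical, not exact
}
```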
The areas of AI Ops
Okay, so what should we be thinking about?
Monitoring agent decision-making
Traditional metrics like CPU usage and response time are still important, but AI Ops introduces new dimensions of monitoring.
Let’s say you have two decision paths, and the agent returns a good response 90% of the time down one path, but only 40% of the time down the other.
You need to understand not just the success rates along both paths, but also when the LLM system starts sending more traffic down one path than usual.
In other words, we need to track decision quality, interaction patterns, and learning trajectories. Should we add "Continued Learning" to "CI/CD"?
This means building new types of observability systems that can understand and validate agent behaviors, not just performance metrics.
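As a rough sketch of what that could look like (all names below are hypothetical, not a real product API): record which path each request took and whether the response was judged good, then watch both the per-path success rate and the share of traffic each path receives.

```typescript
type PathStats = { count: number; good: number };

class DecisionPathMonitor {
  private stats = new Map<string, PathStats>();
  private total = 0;

  // Call once per agent run with the path taken and whether the response
  // was judged "good" (by an eval, a user signal, etc.).
  record(path: string, wasGood: boolean): void {
    const s = this.stats.get(path) ?? { count: 0, good: 0 };
    s.count += 1;
    if (wasGood) s.good += 1;
    this.stats.set(path, s);
    this.total += 1;
  }

  // Success rate along a single path (the 90% vs. 40% question).
  successRate(path: string): number {
    const s = this.stats.get(path);
    return s && s.count > 0 ? s.good / s.count : 0;
  }

  // Share of traffic going down a path (the "more than usual" question).
  trafficShare(path: string): number {
    const s = this.stats.get(path);
    return s && this.total > 0 ? s.count / this.total : 0;
  }

  // Flag paths whose traffic share has drifted from a recorded baseline.
  driftedPaths(baseline: Record<string, number>, tolerance = 0.1): string[] {
    return Object.keys(baseline).filter(
      (path) => Math.abs(this.trafficShare(path) - baseline[path]) > tolerance,
    );
  }
}
```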
Moving beyond blue-green deploys
Rolling out updates to AI agents requires more sophisticated deployment strategies than traditional blue-green deployments.
Teams want to move quickly, but they are scared because non-deterministic systems need more testing than traditional software does.
We need patterns for safely introducing new agent behaviors, testing them in production, and rolling them back if they don't meet our criteria.
This might mean running multiple versions of agents in parallel and carefully routing traffic between them based on behavioral metrics.
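Here's one way such a setup might be sketched, with illustrative names rather than any specific tool: keep a stable version and a candidate version live, route a weighted slice of traffic to the candidate, and grow or shrink that slice based on behavioral metrics rather than error rates alone.

```typescript
interface AgentVersion {
  name: string;
  handle: (input: string) => Promise<string>;
  qualityScore: () => number; // rolling behavioral metric, 0..1
}

class BehavioralCanaryRouter {
  constructor(
    private stable: AgentVersion,
    private candidate: AgentVersion,
    private candidateWeight = 0.1, // start with 10% of traffic
  ) {}

  // Route a single request, then adjust the split based on behavior.
  async route(input: string): Promise<string> {
    const useCandidate = Math.random() < this.candidateWeight;
    const version = useCandidate ? this.candidate : this.stable;
    const output = await version.handle(input);

    // Widen or shrink the canary based on behavioral metrics, not just errors.
    if (this.candidate.qualityScore() >= this.stable.qualityScore()) {
      this.candidateWeight = Math.min(1, this.candidateWeight + 0.01);
    } else {
      this.candidateWeight = Math.max(0, this.candidateWeight - 0.05);
    }
    return output;
  }
}
```

The interesting design choice is what `qualityScore` measures: for agents it has to be a behavioral signal (eval scores, user acceptance, tool-call success), not just 5xx rates.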
It might even mean spinning up synthetic clusters that re-run production queries 5 or 10 or 30 times to see not just what happened in reality, but the whole distribution of likely scenarios.
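A minimal sketch of that replay idea, assuming hypothetical `runAgent` and `scoreResponse` functions: re-run a single logged production query many times and summarize the spread of eval scores, not just the one answer that actually shipped.

```typescript
// Re-run one logged production query N times against a candidate agent and
// summarize the distribution of eval scores it produces.
async function replayDistribution(
  runAgent: (input: string) => Promise<string>,
  scoreResponse: (output: string) => number, // 0..1, from an eval/grader
  loggedQuery: string,
  runs = 30,
): Promise<{ mean: number; min: number; max: number }> {
  const scores: number[] = [];
  for (let i = 0; i < runs; i++) {
    scores.push(scoreResponse(await runAgent(loggedQuery)));
  }
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  return { mean, min: Math.min(...scores), max: Math.max(...scores) };
}
```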
Feature flagging is about to go to a whole new level.
Enabling human-in-the-loop agentic governance
Governance today mainly focuses on access control and audit logs. AI agents require a whole new layer of governance.
Our colleagues who have been working in the self-driving space have gotten used to having a human observer in the loop as a foundational piece of the puzzle, but we have not yet developed the control primitives for this in most AI applications.
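One possible shape for such a control primitive, sketched with illustrative names: the agent proposes an action, low-risk actions proceed automatically, and anything above a risk threshold waits for a human decision.

```typescript
// A sketch of a human-in-the-loop gate: the agent proposes an action, and
// anything above a risk threshold is parked until a human approves or
// rejects it. All names here are illustrative.
type ProposedAction = { id: string; description: string; risk: number };

class ApprovalGate {
  private pending = new Map<string, (approved: boolean) => void>();

  // Returns true if the action may run; high-risk actions wait for a human.
  async authorize(action: ProposedAction, riskThreshold = 0.7): Promise<boolean> {
    if (action.risk < riskThreshold) return true; // auto-approve low risk
    return new Promise<boolean>((resolve) => {
      this.pending.set(action.id, resolve);
      // In a real system this would notify a reviewer (Slack, dashboard, ...).
      console.log(`Action ${action.id} awaiting human review: ${action.description}`);
    });
  }

  // Called by the human reviewer's UI or CLI.
  resolve(actionId: string, approved: boolean): void {
    this.pending.get(actionId)?.(approved);
    this.pending.delete(actionId);
  }
}
```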
Baby steps towards AI Ops
How much do we know about what's required to do good AI Ops today? I'll be very honest: very little. But if you're looking for a starting point, here are some pieces I can think of:
Tracking stateful systems to reverse-engineer interpretability
Code and configuration often don't capture the full state of an agent. RAG results depend on data pipelines. Memory matters.
We're starting to see new tools and practices for tracking the evolution of agent behaviors over time. What's known as LLM interpretability will take on more importance.
There is some software engineer right now, somewhere, who is working on a system to answer questions like "why did the autonomous killer drone decide to detonate?"
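As a sketch of what that kind of tracking might need to capture (field names are illustrative), a decision trace has to record not just the code and config that were deployed, but the retrieved documents, memory, and tool calls that shaped each output.

```typescript
// The kind of trace you'd need to answer "why did the agent do that?" after
// the fact. Field names are illustrative, not a standard schema.
interface DecisionTrace {
  traceId: string;
  timestamp: string;
  promptVersion: string;        // which prompt/config was live
  modelId: string;              // which model served the request
  input: string;                // what the user or upstream system asked
  retrievedDocs: string[];      // RAG results that were in context
  memorySnapshot: unknown;      // relevant agent memory at decision time
  toolCalls: { name: string; args: unknown; result: unknown }[];
  finalOutput: string;
}

// Persist traces somewhere queryable; a console sink stands in here.
function recordTrace(trace: DecisionTrace): void {
  console.log(JSON.stringify(trace));
}
```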
Testing
Evals, unit tests, and workflow-level integration tests are a good start. But teams need more data.
We're starting to see teams build simulation environments where they can validate agent behaviors across a wide range of scenarios, using real and synthetic data.
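A minimal sketch of such a suite, with hypothetical names: each scenario pairs an input (real or synthetic) with its own eval and bar, and the suite passes only if enough scenarios clear theirs.

```typescript
// A scenario suite: run the agent against many scenarios, score each with
// its own eval, and gate on the aggregate pass rate. Names are illustrative.
interface Scenario {
  name: string;
  input: string;
  evaluate: (output: string) => number; // 0..1 score
  minScore: number;
}

async function runScenarioSuite(
  runAgent: (input: string) => Promise<string>,
  scenarios: Scenario[],
  requiredPassRate = 0.9,
): Promise<{ passed: boolean; passRate: number }> {
  let passes = 0;
  for (const scenario of scenarios) {
    const output = await runAgent(scenario.input);
    if (scenario.evaluate(output) >= scenario.minScore) passes += 1;
  }
  const passRate = passes / scenarios.length;
  return { passed: passRate >= requiredPassRate, passRate };
}
```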
Monitoring and Alerting
Our monitoring systems need to understand normal vs. abnormal agent behaviors. This goes beyond simple thresholds - we need systems that can detect when agents are drifting from expected behavioral patterns.
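One simple sketch of that idea, with illustrative names and numbers: compare a rolling window of some behavioral signal (eval scores, tool-call rates, refusal rates) against a recorded baseline and alert when the window drifts too far from it.

```typescript
// Drift detection that goes beyond a fixed threshold: alert when the rolling
// mean of a behavioral signal moves too many standard deviations away from a
// recorded baseline. Numbers and names are illustrative.
class BehaviorDriftAlert {
  private window: number[] = [];

  constructor(
    private baselineMean: number,   // e.g. mean eval score over last month
    private baselineStdDev: number,
    private windowSize = 200,
    private maxStdDevs = 3,
  ) {}

  // Feed one observation; returns true when the window has drifted enough
  // to warrant an alert.
  observe(value: number): boolean {
    this.window.push(value);
    if (this.window.length > this.windowSize) this.window.shift();
    if (this.window.length < this.windowSize) return false; // not enough data

    const mean = this.window.reduce((a, b) => a + b, 0) / this.window.length;
    return Math.abs(mean - this.baselineMean) > this.maxStdDevs * this.baselineStdDev;
  }
}
```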
Incident Response
When an AI agent starts behaving unexpectedly, traditional debugging tools may not be enough unless your rollbacks are really good. Does your on-call team know how to do prompt engineering?
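If rollbacks are going to be the answer, they need to be cheap. A sketch of one approach, with hypothetical names: version the prompt and model config together, keep history, and make rollback a pointer flip that on-call can run without editing prompts live.

```typescript
// Version prompt and model config together so rollback is a one-step
// operation during an incident. Names are illustrative.
interface AgentConfig {
  version: string;
  model: string;
  systemPrompt: string;
}

class ConfigRegistry {
  private history: AgentConfig[] = [];

  deploy(config: AgentConfig): void {
    this.history.push(config);
  }

  active(): AgentConfig | undefined {
    return this.history[this.history.length - 1];
  }

  // Roll back to the previous known-good config during an incident.
  rollback(): AgentConfig | undefined {
    if (this.history.length > 1) this.history.pop();
    return this.active();
  }
}
```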
Looking forward
As we deploy more sophisticated AI agents, AI Ops will continue to evolve.
Even in our YC batch, we're already seeing demand for specialized tools for agent monitoring, behavior testing, and deployment management. But the field is still in its early stages, and there's so much room for innovation!
We're excited to see what the future holds.