For the last decade as an engineer, I’ve been deep in the weeds of distributed systems and platform engineering, building clusters on AWS, GCP, and Azure to reliably run highly variable PaaS workloads.
Now, as I’m building Mastra, one thing is clear to me. As AI agents become increasingly central to modern software systems, we're facing a new challenge.
How do we apply the last decade of DevOps to manage, deploy, and operate AI agents at scale?
Just as DevOps transformed how we build and run traditional software, we’re learning how to reliably run AI agents in production.
More autonomy, more problems
Traditional DevOps practices were built around (mostly) deterministic systems - code that behaves the same way given the same inputs.
But AI agents introduce new layers of complexity: they learn, adapt, and make decisions autonomously. This fundamental difference requires us to rethink our operational practices.
Consider a simple deployment pipeline. In a traditional DevOps setup, we can test specific code paths and can be reasonably confident about how our application will behave.
With AI agents, we need to validate not just code paths but decision-making patterns, interaction models, and learning behaviors.
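To make that contrast concrete, here's a minimal sketch in TypeScript (the `runAgent` and `scoreResponse` functions are hypothetical stand-ins, not any particular framework's API): the deterministic path gets a single exact assertion, while the agent gets a sampled, threshold-based check.

```typescript
// Traditional code path: same input, same output, so one assertion is enough.
function applyDiscount(cents: number): number {
  return Math.round(cents * 0.9);
}
console.assert(applyDiscount(1000) === 900, "deterministic path failed");

// Agent behavior: sample many runs and assert on the distribution instead.
// `runAgent` and `scoreResponse` are hypothetical stand-ins for your agent
// call and your eval/grader.
async function checkAgentBehavior(
  runAgent: (input: string) => Promise<string>,
  scoreResponse: (output: string) => number, // 0..1 quality score
  input: string,
  samples = 20,
  minMeanScore = 0.8,
): Promise<boolean> {
  let total = 0;
  for (let i = 0; i < samples; i++) {
    total += scoreResponse(await runAgent(input));
  }
  return total / samples >= minMeanScore; // pass/fail is statistical, not exact
}
```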
The areas of AI Ops
Okay, so what should we be thinking about?
Monitoring agent decision-making
Traditional metrics like CPU usage and response time are still important, but AI Ops introduces new dimensions of monitoring.
Let’s say you have two decision paths, and the agent returns a good response 90% of the time down one path, but only 40% of the time down the other.
You need to understand not just the success rates along both paths, but also when the LLM system starts sending more traffic down one path than usual.
In other words, we need to track decision quality, interaction patterns, and learning trajectories. Should we add "Continued Learning" to "CI/CD"?
This means building new types of observability systems that can understand and validate agent behaviors, not just performance metrics.
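As a rough sketch of what that could look like (all names below are hypothetical, not a real product API): record which path each request took and whether the response was judged good, then watch both the per-path success rate and the share of traffic each path receives.

```typescript
type PathStats = { count: number; good: number };

class DecisionPathMonitor {
  private stats = new Map<string, PathStats>();
  private total = 0;

  // Call once per agent run with the path taken and whether the response
  // was judged "good" (by an eval, a user signal, etc.).
  record(path: string, wasGood: boolean): void {
    const s = this.stats.get(path) ?? { count: 0, good: 0 };
    s.count += 1;
    if (wasGood) s.good += 1;
    this.stats.set(path, s);
    this.total += 1;
  }

  // Success rate along a single path (the 90% vs. 40% question).
  successRate(path: string): number {
    const s = this.stats.get(path);
    return s && s.count > 0 ? s.good / s.count : 0;
  }

  // Share of traffic going down a path (the "more than usual" question).
  trafficShare(path: string): number {
    const s = this.stats.get(path);
    return s && this.total > 0 ? s.count / this.total : 0;
  }

  // Flag paths whose traffic share has drifted from a recorded baseline.
  driftedPaths(baseline: Record<string, number>, tolerance = 0.1): string[] {
    return Object.keys(baseline).filter(
      (path) => Math.abs(this.trafficShare(path) - baseline[path]) > tolerance,
    );
  }
}
```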
Moving beyond blue-green deploys
Rolling out updates to AI agents requires more sophisticated deployment strategies than traditional blue-green deployments.
Teams want to move quickly, but they are scared because non-deterministic systems need more testing than traditional software does.
We need patterns for safely introducing new agent behaviors, testing them in production, and rolling them back if they don't meet our criteria.
This might mean running multiple versions of agents in parallel and carefully routing traffic between them based on behavioral metrics.
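Here's one way such a setup might be sketched, with illustrative names rather than any specific tool: keep a stable version and a candidate version live, route a weighted slice of traffic to the candidate, and grow or shrink that slice based on behavioral metrics rather than error rates alone.

```typescript
interface AgentVersion {
  name: string;
  handle: (input: string) => Promise<string>;
  qualityScore: () => number; // rolling behavioral metric, 0..1
}

class BehavioralCanaryRouter {
  constructor(
    private stable: AgentVersion,
    private candidate: AgentVersion,
    private candidateWeight = 0.1, // start with 10% of traffic
  ) {}

  // Route a single request, then adjust the split based on behavior.
  async route(input: string): Promise<string> {
    const useCandidate = Math.random() < this.candidateWeight;
    const version = useCandidate ? this.candidate : this.stable;
    const output = await version.handle(input);

    // Widen or shrink the canary based on behavioral metrics, not just errors.
    if (this.candidate.qualityScore() >= this.stable.qualityScore()) {
      this.candidateWeight = Math.min(1, this.candidateWeight + 0.01);
    } else {
      this.candidateWeight = Math.max(0, this.candidateWeight - 0.05);
    }
    return output;
  }
}
```

The interesting design choice is what `qualityScore` measures: for agents it has to be a behavioral signal (eval scores, user acceptance, tool-call success), not just 5xx rates.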
It might even mean spinning up synthetic clusters that re-run production queries 5 or 10 or 30 times to see not just what happened in reality, but the whole distribution of likely scenarios.
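A minimal sketch of that replay idea, assuming hypothetical `runAgent` and `scoreResponse` functions: re-run a single logged production query many times and summarize the spread of eval scores, not just the one answer that actually shipped.

```typescript
// Re-run one logged production query N times against a candidate agent and
// summarize the distribution of eval scores it produces.
async function replayDistribution(
  runAgent: (input: string) => Promise<string>,
  scoreResponse: (output: string) => number, // 0..1, from an eval/grader
  loggedQuery: string,
  runs = 30,
): Promise<{ mean: number; min: number; max: number }> {
  const scores: number[] = [];
  for (let i = 0; i < runs; i++) {
    scores.push(scoreResponse(await runAgent(loggedQuery)));
  }
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  return { mean, min: Math.min(...scores), max: Math.max(...scores) };
}
```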
Feature flagging is about to go to a whole new level.
Enabling human-in-the-loop agentic governance
Governance today mainly focuses on access control and audit logs. AI agents require a whole new layer of governance.
Our colleagues who have been working in the self-driving space have gotten used to having a human observer in the loop as a foundational piece of the puzzle, but we have not yet developed the control primitives for this in most AI applications.
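One possible shape for such a control primitive, sketched with illustrative names: the agent proposes an action, low-risk actions proceed automatically, and anything above a risk threshold waits for a human decision.

```typescript
// A sketch of a human-in-the-loop gate: the agent proposes an action, and
// anything above a risk threshold is parked until a human approves or
// rejects it. All names here are illustrative.
type ProposedAction = { id: string; description: string; risk: number };

class ApprovalGate {
  private pending = new Map<string, (approved: boolean) => void>();

  // Returns true if the action may run; high-risk actions wait for a human.
  async authorize(action: ProposedAction, riskThreshold = 0.7): Promise<boolean> {
    if (action.risk < riskThreshold) return true; // auto-approve low risk
    return new Promise<boolean>((resolve) => {
      this.pending.set(action.id, resolve);
      // In a real system this would notify a reviewer (Slack, dashboard, ...).
      console.log(`Action ${action.id} awaiting human review: ${action.description}`);
    });
  }

  // Called by the human reviewer's UI or CLI.
  resolve(actionId: string, approved: boolean): void {
    this.pending.get(actionId)?.(approved);
    this.pending.delete(actionId);
  }
}
```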
Baby steps towards AI Ops
How much do we know about what's required to do good AI Ops today? I'll be very honest: very little. But if you're looking for a starting point, here are some pieces I can think of:
Tracking stateful systems to reverse-engineer interpretability
Code and configuration often don't capture the full state of an agent. RAG results depend on data pipelines. Memory matters.
We're starting to see new tools and practices for tracking the evolution of agent behaviors over time. What's known as LLM interpretability will take on more importance.
There is some software engineer right now, somewhere, who is working on a system to answer questions like "why did the autonomous killer drone decide to detonate?"
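As a sketch of what that kind of tracking might need to capture (field names are illustrative), a decision trace has to record not just the code and config that were deployed, but the retrieved documents, memory, and tool calls that shaped each output.

```typescript
// The kind of trace you'd need to answer "why did the agent do that?" after
// the fact. Field names are illustrative, not a standard schema.
interface DecisionTrace {
  traceId: string;
  timestamp: string;
  promptVersion: string;        // which prompt/config was live
  modelId: string;              // which model served the request
  input: string;                // what the user or upstream system asked
  retrievedDocs: string[];      // RAG results that were in context
  memorySnapshot: unknown;      // relevant agent memory at decision time
  toolCalls: { name: string; args: unknown; result: unknown }[];
  finalOutput: string;
}

// Persist traces somewhere queryable; a console sink stands in here.
function recordTrace(trace: DecisionTrace): void {
  console.log(JSON.stringify(trace));
}
```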
Testing
Evals, unit tests, and workflow-level integration tests are a good start. But teams need more data.
We're starting to see teams build simulation environments where they can validate agent behaviors across a wide range of scenarios, using real and synthetic data.
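A minimal sketch of such a suite, with hypothetical names: each scenario pairs an input (real or synthetic) with its own eval and bar, and the suite passes only if enough scenarios clear theirs.

```typescript
// A scenario suite: run the agent against many scenarios, score each with
// its own eval, and gate on the aggregate pass rate. Names are illustrative.
interface Scenario {
  name: string;
  input: string;
  evaluate: (output: string) => number; // 0..1 score
  minScore: number;
}

async function runScenarioSuite(
  runAgent: (input: string) => Promise<string>,
  scenarios: Scenario[],
  requiredPassRate = 0.9,
): Promise<{ passed: boolean; passRate: number }> {
  let passes = 0;
  for (const scenario of scenarios) {
    const output = await runAgent(scenario.input);
    if (scenario.evaluate(output) >= scenario.minScore) passes += 1;
  }
  const passRate = passes / scenarios.length;
  return { passed: passRate >= requiredPassRate, passRate };
}
```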
Monitoring and Alerting
Our monitoring systems need to understand normal vs. abnormal agent behaviors. This goes beyond simple thresholds - we need systems that can detect when agents are drifting from expected behavioral patterns.
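One simple sketch of that idea, with illustrative names and numbers: compare a rolling window of some behavioral signal (eval scores, tool-call rates, refusal rates) against a recorded baseline and alert when the window drifts too far from it.

```typescript
// Drift detection that goes beyond a fixed threshold: alert when the rolling
// mean of a behavioral signal moves too many standard deviations away from a
// recorded baseline. Numbers and names are illustrative.
class BehaviorDriftAlert {
  private window: number[] = [];

  constructor(
    private baselineMean: number,   // e.g. mean eval score over last month
    private baselineStdDev: number,
    private windowSize = 200,
    private maxStdDevs = 3,
  ) {}

  // Feed one observation; returns true when the window has drifted enough
  // to warrant an alert.
  observe(value: number): boolean {
    this.window.push(value);
    if (this.window.length > this.windowSize) this.window.shift();
    if (this.window.length < this.windowSize) return false; // not enough data

    const mean = this.window.reduce((a, b) => a + b, 0) / this.window.length;
    return Math.abs(mean - this.baselineMean) > this.maxStdDevs * this.baselineStdDev;
  }
}
```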
Incident Response
When an AI agent starts behaving unexpectedly, traditional debugging tools may not be enough unless your rollbacks are really good. Does your on-call team know how to do prompt engineering?
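If rollbacks are going to be the answer, they need to be cheap. A sketch of one approach, with hypothetical names: version the prompt and model config together, keep history, and make rollback a pointer flip that on-call can run without editing prompts live.

```typescript
// Version prompt and model config together so rollback is a one-step
// operation during an incident. Names are illustrative.
interface AgentConfig {
  version: string;
  model: string;
  systemPrompt: string;
}

class ConfigRegistry {
  private history: AgentConfig[] = [];

  deploy(config: AgentConfig): void {
    this.history.push(config);
  }

  active(): AgentConfig | undefined {
    return this.history[this.history.length - 1];
  }

  // Roll back to the previous known-good config during an incident.
  rollback(): AgentConfig | undefined {
    if (this.history.length > 1) this.history.pop();
    return this.active();
  }
}
```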
Looking forward
As we deploy more sophisticated AI agents, AI Ops will continue to evolve.
Even in our YC batch, we're already seeing demand for specialized tools for agent monitoring, behavior testing, and deployment management. But the field is still in its early stages, and there's so much room for innovation!
We're excited to see what the future holds.