AI Agents Hour — March 2026 to February 2026

March 12, 2026

#71

Meta Acquires Moltbook, OpenAI Releases GPT-5.4, TypeScript Is #1 on GitHub (This Week in AI)

A lot happened in eight days. Meta acquired Moltbook, a social network built entirely for AI agents, not humans. OpenAI dropped GPT-5.4 Thinking and GPT-5.4 Pro, Codex got forks for multi-agent workflows and Windows support, and there are rumblings of OpenAI building a GitHub alternative. Anthropic fired back hard — multi-agent PR code review for Claude Code, while loops via /loop, the Claude Marketplace, and a way to pull your context from other AI tools.

March 10, 2026

#70

The Biggest Threat to AI Agents (with Ismail Pelaseyed)

Ismail Pelaseyed from Superagent is back on AI Agents Hour, and this time he's talking about something most builders aren't thinking about yet — supply chain attacks on AI agents. Guardrails protect against what you tell your agent to do. But what about everything your agent reads, fetches, and installs on its own? That's the gap Brin is built to fill.

March 4, 2026

#69

Missile Strikes Disrupt AWS and Claude, Anthropic Banned from US Government, Cloudflare vs Vercel

This week in AI saw geopolitical turmoil, major funding news, and a shift in software development. Missile strikes in the UAE and Bahrain disrupted AWS and Claude services. Meanwhile, after Anthropic banned its models from autonomous weapons and mass surveillance, the Trump administration banned Anthropic from government contracts—posing a major supply chain risk. On the same day, Sam Altman secured a deal with the Department of War as OpenAI announced a $110 billion funding round, highlighting a sharp contrast in approaches.

March 1, 2026

#68

How to Build Reliable AI Agents with Datasets, Experiments, and Error Analysis

Yujohn from Mastra explains why datasets and experiments are essential for building production-grade AI agents. If you're building an agent, you need a way to verify it's working correctly before and after you make changes. Datasets provide that baseline. You create a collection of test cases (ground truth) that represent the scenarios your agent should handle. Then you run experiments: pass each test case through your agent and measure the results. This is error analysis in practice. You start by identifying where your agent fails, then build scorers to quantify those failure modes over time. Smaller teams often ship first and add datasets later, once they have user feedback. Larger teams need them earlier. But eventually, every production agent needs this.

The demo shows how Mastra makes this accessible. You can create datasets through the UI, add items manually or import from CSV, and run experiments with a single click. The results show you exactly what went wrong: which tool calls failed, what the agent output was, and how it compared to ground truth. You can also compare experiments side by side to see if your prompt tweaks actually improved things. And because all the data lives in your own database, you can write your own agents to analyze the results, dig into traces, and iterate.

The SDK makes it easy to integrate into CI/CD: run experiments on pull requests, gate deployments on eval scores, or just collect data from production and curate datasets later.
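The dataset-then-experiment loop described above can be sketched in a few lines. This is a generic illustration, not Mastra's actual API: the `TestCase` shape, the stand-in `runAgent`, and the exact-match scorer are all assumptions for the example.

```typescript
// Generic sketch of dataset-driven experiments (NOT Mastra's actual API).
type TestCase = { input: string; expected: string };

// A dataset is just a collection of ground-truth cases.
const dataset: TestCase[] = [
  { input: "refund order #123", expected: "refund_tool" },
  { input: "what's your return policy?", expected: "faq_tool" },
];

// Stand-in for the agent under test: returns the tool it would call.
function runAgent(input: string): string {
  return input.includes("refund") ? "refund_tool" : "faq_tool";
}

// A scorer quantifies one failure mode; here, exact match on tool choice.
function score(actual: string, expected: string): number {
  return actual === expected ? 1 : 0;
}

// An experiment runs every case and aggregates the scores, giving a
// baseline to compare before/after a prompt or model change.
function runExperiment(cases: TestCase[]) {
  const results = cases.map((c) => {
    const actual = runAgent(c.input);
    return { ...c, actual, score: score(actual, c.expected) };
  });
  const mean = results.reduce((s, r) => s + r.score, 0) / results.length;
  return { results, mean };
}

console.log(runExperiment(dataset).mean); // 1 when every case passes
```

In CI, the same loop becomes a gate: fail the pull request if `mean` drops below the score of the last accepted experiment.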

February 27, 2026

#67

A Coding Agent That Never Compacts

Abhi walks through Mastra Code, a new open-source coding agent with observational memory that compresses context without losing it.

February 25, 2026

#66

AI NEWS: Stripe's Minions, Distillation Attacks on Claude, Cloudflare's Code Mode

Shane and Abhi break down the biggest AI news from the past few days. Anthropic identified industrial-scale distillation attacks on Claude by DeepSeek, Moonshot AI, and MiniMax. Anthropic also released a groundbreaking report analyzing millions of AI agent interactions using Claude. Stripe is shipping 1,300+ AI-generated PRs per week with their Minions system. Code Mode for MCP is becoming a standard part of the MCP ecosystem, and we cover skills benchmarks, trajectory explorer for agent traces, Vercel AI Gateway video support, and more.

February 24, 2026

#82

Sazabi: AI-Native Observability for Fast-Moving Teams (with Sherwood Callaway)

In this episode, Shane and Abhi sit down with Sherwood Callaway, founder of Sazabi, an AI-native observability platform designed for engineering teams that move fast. Sherwood shares his journey from building infrastructure and observability teams at Brex to realizing that modern development tools are moving at light speed, while observability tooling hasn't kept pace. While AI agents can ship thousands of lines of code per day, teams are still debugging production with the same tools they've been using for years: Datadog, Sentry, manual dashboards, and manual incident triage.

Sazabi takes a radically different approach to observability, centered on three core principles:

1. Less is More — Debugging an incident is as simple as asking a question: "Why is production down?" The best UI for observability is chat.

2. Logs Are All You Need — The "three pillars of observability" (logs, metrics, traces) is outdated dogma. With AI, you can accomplish everything using just logs. Logs are events, metrics are aggregated events, and traces are collections of start/end events. Logs can do it all.

3. Monitoring as We Know It is Dead — Sazabi replaces static monitors with agentic anomaly detection. Think of it as a team of staff engineers constantly watching your app for issues, investigating problems, and only escalating what matters.

In this conversation, we dive into the gap between modern development and modern observability, and why the idea that "logs are all you need" is both controversial and, in Sherwood's view, correct. We also explore how Sazabi uses AI agents for root cause analysis (RCA), the philosophy behind simplifying observability for all engineers, and the company's current status.
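The "logs are events" claim can be made concrete: a metric is an aggregation over events, and a trace span is a start/end pair joined on a correlation id. The sketch below is illustrative only; the `LogEvent` shape and field names are assumptions, not Sazabi's schema.

```typescript
// Illustrative sketch of deriving metrics and traces from plain log events.
// Field names (ts, span, phase) are invented for this example.
type LogEvent = {
  ts: number;              // epoch millis
  msg: string;
  span?: string;           // correlation id linking start/end pairs
  phase?: "start" | "end";
};

const logs: LogEvent[] = [
  { ts: 1000, msg: "request", span: "a", phase: "start" },
  { ts: 1042, msg: "request", span: "a", phase: "end" },
  { ts: 1100, msg: "request", span: "b", phase: "start" },
  { ts: 1250, msg: "request", span: "b", phase: "end" },
];

// A "metric" is just an aggregation over events: request count, rates, etc.
const requestCount = logs.filter((e) => e.phase === "start").length; // 2

// A "trace span" is a start/end event pair joined on its correlation id.
function spanDurations(events: LogEvent[]): Map<string, number> {
  const starts = new Map<string, number>();
  const durations = new Map<string, number>();
  for (const e of events) {
    if (!e.span) continue;
    if (e.phase === "start") starts.set(e.span, e.ts);
    else if (e.phase === "end" && starts.has(e.span)) {
      durations.set(e.span, e.ts - starts.get(e.span)!);
    }
  }
  return durations;
}

console.log(spanDurations(logs)); // a → 42ms, b → 150ms
```

The counterargument to this view is usually cost and cardinality at scale, which is part of why the claim is controversial.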

February 24, 2026

#65

How to Orchestrate Coding Agents with Conductor, with Charlie Holtz

Shane and Abhi welcome Charlie Holtz from Conductor to AI Agents Hour. Charlie shares how frustration with managing multiple Claude Code instances led to building Conductor. They discuss Conductor's July 2025 launch as the first agent orchestration Mac app, early design choices, and its impact on the market.

February 20, 2026

#64

AI NEWS - Something Big is Happening: Gemini 3.1 Pro, GPT-5.3-Spark, and Anthropic $30B fundraise

It's time for another AI News roundup with Shane and Abhi! This week was absolutely massive. Matt Shumer's viral article about AI automation, which describes his own job being automated in real time, has reached 84 million views. Anthropic raised $30 billion at a $380B valuation (one of the largest private raises in tech history). Claude Sonnet 4.6 launched with a 1M token context window. And the Chinese model tsunami is real: Qwen 3.5, GLM 5.0, MiniMax M2.5 (nearly Opus-level at 1/8 the cost), and DeepSeek v4 rumors.

February 12, 2026

#63

Observational Memory: The Human-Inspired Memory System for AI Agents, with Tyler Barnes

Tyler Barnes, founding engineer at Mastra, introduces Observational Memory, a new memory system for AI agents that achieves state-of-the-art results on LongMemEval with a completely stable context window. Unlike semantic recall (which uses RAG and invalidates prompt caching), Observational Memory compresses conversations into dense observations while maintaining a stable, fully cacheable context. The result: 94.87% accuracy on LongMemEval with GPT-5 mini, the highest score recorded by any memory system to date. In this conversation, Tyler explains how the system works, why it outperforms raw context, and how you can integrate it into your agents in under 20 minutes. We also dive into the research, the benchmarks, and what's next for Observational Memory.
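The caching point is the key mechanism: if observations are only ever appended and older context is never rewritten, the prompt prefix stays byte-stable and cache hits survive. The sketch below illustrates that shape only; it is not Mastra's implementation, and the `observe` compressor, `MAX_RECENT` threshold, and context layout are all invented for the example (a real system would use an LLM to write the dense observations).

```typescript
// Illustrative append-only memory sketch (NOT Mastra's implementation).
// Older messages are compressed into observations appended at the tail,
// so the rendered prompt's prefix never changes — unlike RAG-style recall,
// which splices retrieved chunks into the prompt and breaks caching.
type Context = { observations: string[]; recentMessages: string[] };

const MAX_RECENT = 4; // invented threshold for the example

// Stand-in compressor: a real system would have an LLM write dense notes.
function observe(messages: string[]): string {
  return `Observed ${messages.length} msgs: ${messages.join(" | ")}`;
}

function addMessage(ctx: Context, msg: string): Context {
  const recent = [...ctx.recentMessages, msg];
  if (recent.length <= MAX_RECENT) return { ...ctx, recentMessages: recent };
  // Compress the oldest half into one observation, appended AFTER the
  // existing ones — earlier observations are never rewritten.
  const toCompress = recent.slice(0, MAX_RECENT / 2);
  return {
    observations: [...ctx.observations, observe(toCompress)],
    recentMessages: recent.slice(MAX_RECENT / 2),
  };
}

// Rendered prompt: prior observations form a stable, cacheable prefix.
function render(ctx: Context): string {
  return [...ctx.observations, ...ctx.recentMessages].join("\n");
}

let ctx: Context = { observations: [], recentMessages: [] };
for (const m of ["m1", "m2", "m3", "m4", "m5"]) ctx = addMessage(ctx, m);
// After 5 messages: one observation covering m1 and m2; recent = m3, m4, m5
```

Because growth happens only at the tail, the token count of the prefix is also bounded and predictable, which is what "completely stable context window" refers to.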

February 10, 2026

#62

AI News: Model Wars - Opus 4.6 vs GPT-5.3-Codex + Seedance 2.0 Redefines AI Video

Shane and Abhi cover the top AI stories. This week was absolutely massive! Anthropic aired Super Bowl ads mocking OpenAI's decision to put ads in ChatGPT; Opus 4.6 and GPT-5.3-Codex launched within 15 minutes of each other; and then ClawHub dropped a bombshell: 11.9% of the entire marketplace is malware. We cover everything: Anthropic's competitive jabs, the model war benchmarks, Claude's 1M token context, OpenAI's Frontier platform, the security crisis that's reshaping how people think about agent marketplaces, Kimi K2.5's domination on OpenRouter, ElevenLabs' $500M raise at an $11B valuation, and the explosion of AI video generation tools. Plus: Perplexity's Model Council, Roblox 4D generation, Mistral's Voxtral Transcribe 2, and why Swyx finally admits evals actually help.

February 5, 2026

#61

Running 100 AI Agents in Parallel: Superset Cofounder Kiet Ho

Shane and Abhi welcome Kiet Ho, cofounder of Superset, to discuss how Superset evolved from simple git worktree management into a full-featured tool with file editing, automation, cloud support, and multi-agent orchestration.