Model fallbacks: Your safety net for production AI

Oct 1, 2025

Every AI provider has outages. OpenAI goes down. Anthropic hits capacity. Google rate-limits you. Models get deprecated with little warning.

Yet most AI applications are still built with a single point of failure: one model, one provider.

We just shipped model fallbacks, so your application stays up even when one provider goes down.

How it works

When you configure an agent, you can now specify fallback models with retry logic:

Primary model: GPT-5
├── Try 3 times
├── If fails → Claude Sonnet 4.5
│   └── Try 2 times
│   └── If fails → Claude Opus 4.1
│       └── Try 2 times
│       └── If fails → Return error

The system tries your primary model first. If it gets a 500 error, rate limit, or timeout, it automatically switches to your first fallback. If that fails too, it moves to the next. Each model gets its own retry count before moving on.
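In configuration terms, that chain might look something like the sketch below. The model ID strings are illustrative placeholders; the full, runnable example is in the next section:

import { Agent } from '@mastra/core';
import { openai, anthropic } from '@mastra/llm';

// Sketch of the chain from the diagram above; model IDs are placeholders.
const agent = new Agent({
  name: 'fallback-demo',
  instructions: 'You are a helpful assistant.',
  model: [
    { model: openai('gpt-5'), maxRetries: 3 },                // primary
    { model: anthropic('claude-sonnet-4.5'), maxRetries: 2 }, // first fallback
    { model: anthropic('claude-opus-4.1'), maxRetries: 2 },   // last fallback: its error is the one you see
  ],
});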

Your users never know the difference. The response comes back with the same format, just from a different model.

How we implemented it

Under the hood, each model in your fallback chain maintains its own state. When a request fails, we preserve the error context and try the next model with the same input. The retry logic is configurable per model—some models might deserve more retries than others.

import { Agent } from '@mastra/core';
import { openai, anthropic } from '@mastra/llm';

const agent = new Agent({
  name: 'weather-assistant',
  instructions: 'You are a weather assistant.',
  model: [
    {
      // Primary model, with its own retry budget
      model: openai('gpt-4o-mini'),
      maxRetries: 5,
    },
    {
      // Fallback model, tried only if the primary's retries are exhausted
      model: anthropic('claude-3-5-sonnet'),
      maxRetries: 1,
    },
  ],
});

// The agent will automatically try models in order
const response = await agent.streamVNext("What's the weather like?");

Non-terminal models throw errors to trigger fallback; only the last model in the chain returns an error stream. This keeps error propagation clean through the chain while preserving streaming compatibility.
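For intuition, here is a simplified sketch of that control flow. It is illustrative only, not Mastra's actual implementation; the run callback stands in for whatever invokes a single model:

// Illustrative fallback loop, not the real @mastra/core internals.
type ChainEntry = { id: string; maxRetries: number };

async function callWithFallbacks(
  chain: ChainEntry[],
  run: (modelId: string) => Promise<string>, // hypothetical single-model call
): Promise<string> {
  let lastError: unknown;
  for (const { id, maxRetries } of chain) {
    // Each model gets its own retry budget before we move on.
    for (let attempt = 0; attempt < maxRetries; attempt++) {
      try {
        return await run(id); // success: the caller sees the same response shape
      } catch (err) {
        lastError = err; // preserve the error context before the next attempt or model
      }
    }
    // Non-terminal models fall through to the next entry here.
  }
  // The last model in the chain is the one whose failure reaches the caller.
  throw lastError;
}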

What's next

Model fallbacks open up interesting possibilities. What if instead of sequential fallback, you could parallelize model calls and return the fastest one? Run GPT-5 and Claude Opus simultaneously, return whichever responds first. Lower latency at higher cost.
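The racing idea is easy to prototype outside the framework. A minimal sketch, assuming you already have two functions that each call one provider (nothing here is a shipped Mastra API):

// Hypothetical racing sketch: fire two model calls in parallel, keep the fastest.
async function raceModels(
  prompt: string,
  callGpt: (p: string) => Promise<string>,    // assumed wrapper around provider A
  callClaude: (p: string) => Promise<string>, // assumed wrapper around provider B
): Promise<string> {
  // Promise.race settles with whichever promise settles first, including a
  // rejection, so a production version would still want per-call retries.
  return Promise.race([callGpt(prompt), callClaude(prompt)]);
}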

Or what about load balancing across models based on their current response times? Route traffic away from models that are slow, toward ones that are fast.

Let us know what you think; we're always trying to improve.


Model fallbacks are available now in @mastra/core 0.17.0+. See PR #7126 for implementation details.