We recently built a tool compatibility layer that reduced tool calling error rates from 15% to 3% for 12 OpenAI, Anthropic, and Google Gemini models, across a set of 30 property types and constraints.

Background
Mastra is a TypeScript agent framework. When a user defines a tool that accepts inputs, they need to pass either a Zod schema (think `z.string().min(5)`) or a JSON schema.
We then transform the schema from Zod into JSON Schema and feed it to the model (for examples of Zod constraints, see the Appendix).
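As a rough illustration (not Mastra's internal code), that conversion with the zod-to-json-schema package looks something like this; the input schema here is made up for the example:

```ts
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

// A hypothetical tool input: a string that must be at least 5 characters long.
const inputSchema = z.object({
  query: z.string().min(5),
});

// Convert the Zod schema into JSON Schema before handing it to the model.
const jsonSchema = zodToJsonSchema(inputSchema);
// => { type: "object", properties: { query: { type: "string", minLength: 5 } }, required: ["query"], ... }
```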
People in the Mastra community were raising issues when they tried to use certain models with tools from an MCP server. The same tool call would fail with some models and succeed with others, which made us think something was happening on the model side.
We saw this quite often with the OpenAI reasoning models, but various errors would happen with other model providers as well.
Scoping the problem
To better understand the problem, we started by building a test suite with a range of schema properties and constraints (like unions, nullable fields, and array length) and running it against each of our providers and models. We ended up with essentially a green / yellow / red grid of models vs. constraints.
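A simplified sketch of how those test cases were shaped (the names here are illustrative, not our actual suite):

```ts
import { z } from "zod";

// Each case pairs a constraint name with a Zod schema for a single tool input.
const constraintCases = {
  stringMin: z.string().min(5),
  arrayMin: z.array(z.string()).min(5),
  nullable: z.string().nullable(),
  unionPrimitives: z.union([z.string(), z.number()]),
} as const;

// For every model we recorded whether a tool call respected the constraint,
// silently ignored it, or errored outright: the green / yellow / red grid.
type Result = "pass" | "ignored" | "error";
type Grid = Record<string /* model id */, Record<keyof typeof constraintCases, Result>>;
```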
Some things we noticed:
- OpenAI models, if they didn't support a property, would throw an error with a message like "invalid schema for tool X"
- Google Gemini models wouldn't explicitly fail, but they would silently ignore properties like the length of a string, or minimum length of an array.
- Anthropic models performed quite well; the majority of the ones we tested did not error at all.
- DeepSeek and Llama weren't the best at tool usage and would occasionally refuse to call the tool (even with short, extremely explicit prompts: "YOUR ONLY JOB IS TO CALL TOOL X, PLEASE PLEASE DO IT"). They would also sometimes pass a value that did not respect the schema.
The things that seemed within our control to fix were the schema errors and the ignored schema constraints. Improving tool calling itself would be up to the provider, or the client could implement some retry functionality.
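For completeness, a client-side retry wrapper is the usual way to paper over flaky tool calling; a minimal sketch (callTool is a placeholder, not a Mastra API):

```ts
// Re-invoke a flaky async call a few times before giving up.
async function withRetries<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Usage with a hypothetical tool-calling function:
// const result = await withRetries(() => callTool("getWeather", { city: "Paris" }));
```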
Testing approaches to feeding the models schema constraints
We started thinking about how to solve this problem. There are only a few options for transforming tool definitions, i.e., a few places where we can feed information to the LLM: the tool description, the input schema, or the prompt itself.
We started with the input schema: sometimes we could transform it into a shape the LLM would accept (for example, changing nullable fields to optional). This helped, but only covered a few edge cases. Choosing between JSON Schema formats also helped: when transforming between JSON Schema (used by MCP servers) and Zod (used in our framework), you can select an output target to produce a version of the JSON Schema spec that is more compatible with certain models.
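As an example of that first kind of transform (a simplified sketch under our own assumptions, not the exact implementation), turning nullable properties into optional ones in the generated JSON Schema looks roughly like this:

```ts
type JsonSchemaProperty = { type?: string | string[]; [key: string]: unknown };

interface ObjectSchema {
  properties: Record<string, JsonSchemaProperty>;
  required?: string[];
}

// If a property allows null (e.g. type: ["string", "null"]), drop the null variant
// and remove the property from `required` so the model treats it as optional instead.
function nullableToOptional(schema: ObjectSchema): ObjectSchema {
  for (const [name, prop] of Object.entries(schema.properties)) {
    if (Array.isArray(prop.type) && prop.type.includes("null")) {
      const remaining = prop.type.filter((t) => t !== "null");
      prop.type = remaining.length === 1 ? remaining[0] : remaining;
      schema.required = schema.required?.filter((r) => r !== name);
    }
  }
  return schema;
}
```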
The next thing we tried was injecting the tool schema constraints into the LLM prompt in cases where the tool call was likely to fail. This worked, and our tests for the popular model providers (OpenAI, Google, Anthropic) passed!
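Conceptually, the injection looked something like this (a simplified sketch; the helper and wording are illustrative):

```ts
// Append a plain-language summary of the constraints to the system prompt so the
// model still sees them even when the provider strips them from the tool schema.
function withConstraintHints(systemPrompt: string, constraints: Record<string, string>): string {
  const hints = Object.entries(constraints)
    .map(([field, rule]) => `- "${field}" must be ${rule}`)
    .join("\n");
  return `${systemPrompt}\n\nWhen calling tools, make sure the arguments satisfy these constraints:\n${hints}`;
}

// e.g. withConstraintHints(prompt, { stringUrl: "a valid URI", tags: "an array with at least 5 items" });
```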
The solution
Our problem was now fixed, but this felt like an impedance mismatch: the prompt just didn't seem like the right place to put schema constraints. We had also only tested with short prompts, and we were somewhat worried about what would happen with longer ones.
But it did work, so we gained confidence that this was a solvable problem. We continued experimenting and next tried passing the schema constraints into the tool instructions. Just like injecting them into the agent prompt, this worked well: now our entire test suite for OpenAI, Anthropic, and Google models was properly following the tool schema constraints.
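The same idea, one level down (again just a sketch): append the constraint summary to the tool's own description instead of the agent prompt.

```ts
// Attach constraint hints to the tool description so they travel with the tool definition.
function withDescriptionHints<T extends { description: string }>(tool: T, hints: string[]): T {
  const hintBlock = hints.map((h) => `- ${h}`).join("\n");
  return { ...tool, description: `${tool.description}\nArgument constraints:\n${hintBlock}` };
}
```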
After this, we tried to go a level deeper and added the constraints directly to the property description. This felt like the right place: everything is contained in the tool property itself, so nothing spills over into the tool or agent definitions.
Take, for example, a string that needs to be a URL. In the JSON Schema spec you can mark a string with the `uri` format, which you would define as `z.string().url()` in Zod. Before our changes, this is the payload that would get sent to a model like o3-mini:
```json
{
  "parameters": {
    "type": "object",
    "properties": {
      "stringUrl": {
        "type": "string",
        "format": "uri"
      }
    },
    "required": [
      "stringUrl"
    ]
  }
}
```
The problem is that o3-mini, like many other models, will either ignore the `format` property or throw an error when given it.
After our changes, this is the payload that we send:
```json
{
  "parameters": {
    "type": "object",
    "properties": {
      "stringUrl": {
        "type": "string",
        "description": "{\"url\":true}"
      }
    },
    "required": [
      "stringUrl"
    ]
  }
}
```
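A simplified sketch of the kind of transform that produces this payload (the keyword list and the serialized description format here are illustrative; the real compatibility layer covers many more constraint types per provider):

```ts
// JSON Schema keywords that some models reject or silently ignore.
const UNSUPPORTED_KEYWORDS = ["format", "minLength", "maxLength", "minItems", "maxItems", "pattern"];

type Property = Record<string, unknown> & { description?: string };

// Strip unsupported keywords from each property and serialize them into the
// property's description, so the constraint still reaches the model.
function moveConstraintsToDescription(properties: Record<string, Property>): Record<string, Property> {
  for (const prop of Object.values(properties)) {
    const moved: Record<string, unknown> = {};
    for (const key of UNSUPPORTED_KEYWORDS) {
      if (key in prop) {
        moved[key] = prop[key];
        delete prop[key];
      }
    }
    if (Object.keys(moved).length > 0) {
      prop.description = [prop.description, JSON.stringify(moved)].filter(Boolean).join(" ");
    }
  }
  return properties;
}
```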
Why this matters
We all remember the bad old days of web development, when browser compatibility was a major concern (IE8 😅).
Our hope is that the framework layer will provide a shim for model tool interoperability so that teams don't have to refactor their entire codebase if they want to switch (or just test out!) a different model provider.
Even though we saw this because of users wanting better MCP support, it actually works for all tools — not just ones being pulled in from an MCP server.
Before and after
We tested 30 property types and constraints on those types, passing a tool with each input schema to the LLM to see whether it would handle it properly.
NOTES:
- For DeepSeek and Meta models, the tests performed fairly inconsistently: sometimes much better, sometimes much worse. Adding retries for the calls greatly improved performance with or without the compatibility layer.
- Some Zod properties like `z.never()`, `z.undefined()`, and `z.tuple()` were omitted from this test as they either don't make sense for the schema of a tool or are extremely rare. The majority of models don't handle these properties, so we throw a clear error when a model doesn't support these field types.
Here are the results of our tests before and after the MCP Tool Compat layer was applied.

Provider | Model | Pass rate before | Pass rate after |
---|---|---|---|
Anthropic | claude-3.5-haiku | 96.67% | 100% |
Anthropic | claude-3.5-sonnet | 100% | 100% |
Anthropic | claude-3.7-sonnet | 100% | 100% |
Anthropic | claude-opus-4 | 100% | 100% |
Anthropic | claude-sonnet-4 | 100% | 100% |
DeepSeek | deepseek-chat-v3-0324 | 60.00% | 86.67% |
Google | gemini-2.0-flash-lite-001 | 73.33% | 96.67% |
Google | gemini-2.5-flash-preview | 86.67% | 96.67% |
Google | gemini-2.5-pro-preview | 90.00% | 96.67% |
Meta Llama | llama-4-maverick | 86.67% | 90.00% |
OpenAI | gpt-4.1 | 96.67% | 100% |
OpenAI | gpt-4.1-mini | 96.67% | 100% |
OpenAI | gpt-4o | 96.67% | 100% |
OpenAI | gpt-4o-mini | 93.33% | 100% |
OpenAI | o3-mini | 73.33% | 100% |
OpenAI | o4-mini | 73.33% | 100% |
Using it
The MCP Tool Compatibility Layer is now available in Mastra in all versions after 0.9.4 (including our 0.10.0 release). If we're missing a model, we've made it really easy to add new model provider coverage; check out the source code on GitHub for the OpenAI layer as an example.
Please open a GitHub issue or drop into our Discord if you have questions or comments about how we're doing this, want to extend it, or want to implement something similar in a project that isn't using Mastra.
Here is a link to the implementation.
We can't wait to see what you'll build!
Appendix
Google sheet with comprehensive test results. The models that did not error were not included in this. We hand-selected popular models; it's likely that there is a wide range of performance among the models we did not include here.
Property Name | Zod Definition |
---|---|
String types | |
string | z.string() |
stringMin | z.string().min(5) |
stringMax | z.string().max(10) |
stringEmail | z.string().email() |
stringEmoji | z.string().emoji() |
stringUrl | z.string().url() |
stringUuid | z.string().uuid() |
stringCuid | z.string().cuid() |
stringRegex | z.string().regex(/^test-/) |
Number types | |
number | z.number() |
numberGt | z.number().gt(3) |
numberLt | z.number().lt(1) |
numberGte | z.number().gte(5) |
numberLte | z.number().lte(1) |
numberMultipleOf | z.number().multipleOf(2) |
numberInt | z.number().int() |
Array types | |
array | z.array(z.string()) |
arrayMin | z.array(z.string()).min(5) |
arrayMax | z.array(z.string()).max(5) |
Object types | |
object | z.object({ foo: z.string(), bar: z.number() }) |
objectNested | z.object({ user: z.object({ name: z.string().min(5), age: z.number().gte(18) }) }) |
Optional and nullable | |
optional | z.string().optional() |
nullable | z.string().nullable() |
Enums | |
enum | z.enum(['A', 'B', 'C']) |
nativeEnum | z.nativeEnum(TestEnum) |
Union types | |
unionPrimitives | z.union([z.string(), z.number()]) |
unionObjects | z.union([ z.object({ amount: z.number(), name: z.string() }), z.object({ type: z.string(), permissions: z.array(z.string()) }) ]) |
Uncategorized types | |
default | z.string().default('test') |
anyOptional | z.any().optional() |
any | z.any() |
Unsupported types | |
intersection | z.intersection(z.string().min(1), z.string().max(4)) |
never | z.never() as any |
null | z.null() |
tuple | z.tuple([z.string(), z.number(), z.boolean()]) |
undefined | z.undefined() |
Unsupported types were omitted from the test, but they are handled in the tool compatibility layer so that we can throw a clear error if the model does not support the type.