Tool Call Accuracy Scorer Examples

Mastra provides two tool call accuracy scorers:

Code-based scorer for deterministic evaluation
LLM-based scorer for semantic evaluation

Installation


npm install @mastra/evals

For complete API documentation and configuration options, see Tool Call Accuracy Scorers.

Code-Based Scorer Examples

The code-based scorer provides deterministic, binary scoring (0 or 1) based on exact tool matching.

Import


import { createToolCallAccuracyScorerCode } from "@mastra/evals/scorers/code";
import { createAgentTestRun, createUIMessage, createToolInvocation } from "@mastra/evals/scorers/utils";

Correct tool selection

src/example-correct-tool.ts


const scorer = createToolCallAccuracyScorerCode({ 
  expectedTool: 'weather-tool' 
});
 
// Simulate LLM input and output with tool call
const inputMessages = [
  createUIMessage({ 
    content: 'What is the weather like in New York today?', 
    role: 'user', 
    id: 'input-1' 
  })
];
 
const output = [
  createUIMessage({
    content: 'Let me check the weather for you.',
    role: 'assistant',
    id: 'output-1',
    toolInvocations: [
      createToolInvocation({
        toolCallId: 'call-123',
        toolName: 'weather-tool',
        args: { location: 'New York' },
        result: { temperature: '72°F', condition: 'sunny' },
        state: 'result'
      })
    ]
  })
];
 
const run = createAgentTestRun({ inputMessages, output });
const result = await scorer.run(run);
 
console.log(result.score); // 1
console.log(result.preprocessStepResult?.correctToolCalled); // true

Strict mode evaluation

Passes only if the expected tool is the only tool called:

src/example-strict-mode.ts


const strictScorer = createToolCallAccuracyScorerCode({ 
  expectedTool: 'weather-tool',
  strictMode: true
});
 
// Multiple tools called - fails in strict mode
const output = [
  createUIMessage({
    content: 'Let me help you with that.',
    role: 'assistant',
    id: 'output-1',
    toolInvocations: [
      createToolInvocation({
        toolCallId: 'call-1',
        toolName: 'search-tool',
        args: {},
        result: {},
        state: 'result',
      }),
      createToolInvocation({
        toolCallId: 'call-2',
        toolName: 'weather-tool',
        args: { location: 'New York' },
        result: { temperature: '20°C' },
        state: 'result',
      })
    ]
  })
];
 
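// Build the run; inputMessages is the user message array from the first example
const run = createAgentTestRun({ inputMessages, output });
const result = await strictScorer.run(run);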
console.log(result.score); // 0 - fails because multiple tools were called
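
The lenient and strict pass conditions described above can be sketched as a small standalone function. This is only an illustration of the documented behavior, not Mastra's implementation; `scoreSingleTool` and its parameters are hypothetical names, and the sketch works on a plain list of tool names rather than on a real run object.

```typescript
// Hypothetical sketch of the documented pass conditions; not the library's source.
function scoreSingleTool(
  actualTools: string[],
  expectedTool: string,
  strictMode: boolean = false,
): 0 | 1 {
  if (strictMode) {
    // Strict: the expected tool must be the only tool called.
    return actualTools.length === 1 && actualTools[0] === expectedTool ? 1 : 0;
  }
  // Lenient: the expected tool must appear somewhere among the calls.
  return actualTools.some((toolName) => toolName === expectedTool) ? 1 : 0;
}

// Same tool names as the strict mode example above.
const strictExampleCalls = ["search-tool", "weather-tool"];
console.log(scoreSingleTool(strictExampleCalls, "weather-tool"));       // 1
console.log(scoreSingleTool(strictExampleCalls, "weather-tool", true)); // 0
```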

Tool order validation

Validates that tools are called in a specific sequence:

src/example-order-validation.ts


const orderScorer = createToolCallAccuracyScorerCode({
  expectedTool: 'auth-tool', // ignored when order is specified
  expectedToolOrder: ['auth-tool', 'fetch-tool'],
  strictMode: true // no extra tools allowed
});
 
const output = [
  createUIMessage({
    content: 'I will authenticate and fetch the data.',
    role: 'assistant',
    id: 'output-1',
    toolInvocations: [
      createToolInvocation({
        toolCallId: 'call-1',
        toolName: 'auth-tool',
        args: { token: 'abc123' },
        result: { authenticated: true },
        state: 'result'
      }),
      createToolInvocation({
        toolCallId: 'call-2',
        toolName: 'fetch-tool',
        args: { endpoint: '/data' },
        result: { data: ['item1'] },
        state: 'result'
      })
    ]
  })
];
 
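// Build the run; inputMessages is a user message array built as in the first example
const run = createAgentTestRun({ inputMessages, output });
const result = await orderScorer.run(run);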
console.log(result.score); // 1 - correct order

Flexible order mode

Allows extra tools as long as expected tools maintain relative order:

src/example-flexible-order.ts


const flexibleOrderScorer = createToolCallAccuracyScorerCode({
  expectedTool: 'auth-tool',
  expectedToolOrder: ['auth-tool', 'fetch-tool'],
  strictMode: false // allows extra tools
});
 
const output = [
  createUIMessage({
    content: 'Performing comprehensive operation.',
    role: 'assistant',
    id: 'output-1',
    toolInvocations: [
      createToolInvocation({
        toolCallId: 'call-1',
        toolName: 'auth-tool',
        args: { token: 'abc123' },
        result: { authenticated: true },
        state: 'result'
      }),
      createToolInvocation({
        toolCallId: 'call-2',
        toolName: 'log-tool', // Extra tool - OK in flexible mode
        args: { message: 'Starting fetch' },
        result: { logged: true },
        state: 'result'
      }),
      createToolInvocation({
        toolCallId: 'call-3',
        toolName: 'fetch-tool',
        args: { endpoint: '/data' },
        result: { data: ['item1'] },
        state: 'result'
      })
    ]
  })
];
 
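// Build the run; inputMessages is a user message array built as in the first example
const run = createAgentTestRun({ inputMessages, output });
const result = await flexibleOrderScorer.run(run);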
console.log(result.score); // 1 - auth-tool comes before fetch-tool
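
The difference between the two order modes can be sketched without the library. The function below is a minimal illustration, assuming that strict mode requires the calls to match the expected sequence exactly and that flexible mode requires the expected tools to appear in order with any other calls in between. `scoreToolOrder` is a hypothetical name, not part of `@mastra/evals`.

```typescript
// Hypothetical sketch of the two order-checking modes; not the library's source.
function scoreToolOrder(
  actualTools: string[],
  expectedOrder: string[],
  strictMode: boolean,
): 0 | 1 {
  if (strictMode) {
    // Strict (assumed): the calls must match the expected sequence exactly.
    if (actualTools.length !== expectedOrder.length) return 0;
    return actualTools.every((toolName, i) => toolName === expectedOrder[i]) ? 1 : 0;
  }
  // Flexible: the expected tools must appear as a subsequence of the calls.
  let nextExpected = 0; // index of the next expected tool still to be matched
  for (const toolName of actualTools) {
    if (nextExpected < expectedOrder.length && toolName === expectedOrder[nextExpected]) {
      nextExpected++;
    }
  }
  return nextExpected === expectedOrder.length ? 1 : 0;
}

// Same tool names as the flexible order example above.
const observedOrder = ["auth-tool", "log-tool", "fetch-tool"];
console.log(scoreToolOrder(observedOrder, ["auth-tool", "fetch-tool"], false)); // 1
console.log(scoreToolOrder(observedOrder, ["auth-tool", "fetch-tool"], true));  // 0
```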

LLM-Based Scorer Examples

The LLM-based scorer uses AI to evaluate whether tool selections are appropriate for the user’s request.

Import


import { createToolCallAccuracyScorerLLM } from "@mastra/evals/scorers/llm";
import { openai } from "@ai-sdk/openai";

Basic LLM evaluation

src/example-llm-basic.ts


const llmScorer = createToolCallAccuracyScorerLLM({
  model: openai('gpt-4o-mini'),
  availableTools: [
    { 
      name: 'weather-tool', 
      description: 'Get current weather information for any location' 
    },
    { 
      name: 'calendar-tool', 
      description: 'Check calendar events and scheduling' 
    },
    { 
      name: 'search-tool', 
      description: 'Search the web for general information' 
    }
  ]
});
 
const inputMessages = [
  createUIMessage({ 
    content: 'What is the weather like in San Francisco today?', 
    role: 'user', 
    id: 'input-1' 
  })
];
 
const output = [
  createUIMessage({
    content: 'Let me check the current weather for you.',
    role: 'assistant',
    id: 'output-1',
    toolInvocations: [
      createToolInvocation({
        toolCallId: 'call-123',
        toolName: 'weather-tool',
        args: { location: 'San Francisco', date: 'today' },
        result: { temperature: '68°F', condition: 'foggy' },
        state: 'result'
      })
    ]
  })
];
 
const run = createAgentTestRun({ inputMessages, output });
const result = await llmScorer.run(run);
 
console.log(result.score); // 1.0 - appropriate tool usage
console.log(result.reason); // "The agent correctly used the weather-tool to address the user's request for weather information."

Handling inappropriate tool usage

src/example-llm-inappropriate.ts


const inputMessages = [
  createUIMessage({ 
    content: 'What is the weather in Tokyo?', 
    role: 'user', 
    id: 'input-1' 
  })
];
 
const inappropriateOutput = [
  createUIMessage({
    content: 'Let me search for that information.',
    role: 'assistant', 
    id: 'output-1',
    toolInvocations: [
      createToolInvocation({
        toolCallId: 'call-456',
        toolName: 'search-tool', // Less appropriate than weather-tool
        args: { query: 'Tokyo weather' },
        result: { results: ['Tokyo weather data...'] },
        state: 'result'
      })
    ]
  })
];
 
const run = createAgentTestRun({ inputMessages, output: inappropriateOutput });
const result = await llmScorer.run(run);
 
console.log(result.score); // 0.5 - partially appropriate
console.log(result.reason); // "The agent used search-tool when weather-tool would have been more appropriate for a direct weather query."

Evaluating clarification requests

The LLM scorer recognizes when agents appropriately ask for clarification:

src/example-llm-clarification.ts


const vagueInput = [
  createUIMessage({ 
    content: 'I need help with something', 
    role: 'user', 
    id: 'input-1' 
  })
];
 
const clarificationOutput = [
  createUIMessage({
    content: "I'd be happy to help! Could you please provide more details about what you need assistance with?",
    role: 'assistant',
    id: 'output-1',
    // No tools called - asking for clarification instead
  })
];
 
const run = createAgentTestRun({ 
  inputMessages: vagueInput, 
  output: clarificationOutput 
});
const result = await llmScorer.run(run);
 
console.log(result.score); // 1.0 - appropriate to ask for clarification
console.log(result.reason); // "The agent appropriately asked for clarification rather than calling tools with insufficient information."

Comparing Both Scorers

Here’s an example using both scorers on the same data:

src/example-comparison.ts


import { createToolCallAccuracyScorerCode as createCodeScorer } from '@mastra/evals/scorers/code';
import { createToolCallAccuracyScorerLLM as createLLMScorer } from '@mastra/evals/scorers/llm';
import { openai } from '@ai-sdk/openai';
 
// Setup both scorers
const codeScorer = createCodeScorer({
  expectedTool: 'weather-tool',
  strictMode: false
});
 
const llmScorer = createLLMScorer({
  model: openai('gpt-4o-mini'),
  availableTools: [
    { name: 'weather-tool', description: 'Get weather information' },
    { name: 'search-tool', description: 'Search the web' }
  ]
});
 
// Test data
const run = createAgentTestRun({
  inputMessages: [
    createUIMessage({ 
      content: 'What is the weather?', 
      role: 'user', 
      id: 'input-1' 
    })
  ],
  output: [
    createUIMessage({
      content: 'Let me find that information.',
      role: 'assistant',
      id: 'output-1',
      toolInvocations: [
        createToolInvocation({
          toolCallId: 'call-1',
          toolName: 'search-tool',
          args: { query: 'weather' },
          result: { results: ['weather data'] },
          state: 'result'
        })
      ]
    })
  ]
});
 
// Run both scorers
const codeResult = await codeScorer.run(run);
const llmResult = await llmScorer.run(run);
 
console.log('Code Scorer:', codeResult.score); // 0 - wrong tool
console.log('LLM Scorer:', llmResult.score);   // 0.3 - partially appropriate
console.log('LLM Reason:', llmResult.reason);   // Explains why search-tool is less appropriate
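
When both scorers run over many test cases, it can help to label where they agree and where they diverge. The helper below is a hypothetical sketch: `compareScores` is not part of the library, and the `0.7` pass threshold for the LLM score is an arbitrary assumption that should be tuned per project.

```typescript
// Hypothetical helper; the 0.7 default threshold is an assumption, not a library value.
function compareScores(
  codeScore: number,
  llmScore: number,
  llmThreshold: number = 0.7,
): string {
  const codePass = codeScore === 1;            // code scorer is binary
  const llmPass = llmScore >= llmThreshold;    // LLM scorer is graded 0.0 to 1.0
  if (codePass && llmPass) return "both pass";
  if (!codePass && !llmPass) return "both fail";
  return codePass ? "code pass, LLM fail" : "code fail, LLM pass";
}

// Scores taken from the comparison example above.
console.log(compareScores(0, 0.3)); // both fail
```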

Configuration Options

Code-Based Scorer Options


// Standard mode - passes if expected tool is called
const lenientScorer = createCodeScorer({ 
  expectedTool: 'search-tool',
  strictMode: false
});
 
// Strict mode - only passes if exactly one tool is called
const strictScorer = createCodeScorer({ 
  expectedTool: 'search-tool',
  strictMode: true
});
 
// Order checking with strict mode
const strictOrderScorer = createCodeScorer({
  expectedTool: 'step1-tool',
  expectedToolOrder: ['step1-tool', 'step2-tool', 'step3-tool'],
  strictMode: true // no extra tools allowed
});

LLM-Based Scorer Options


// Basic configuration
const basicLLMScorer = createLLMScorer({
  model: openai('gpt-4o-mini'),
  availableTools: [
    { name: 'tool1', description: 'Description 1' },
    { name: 'tool2', description: 'Description 2' }
  ]
});
 
// With a different model
const customModelScorer = createLLMScorer({
  model: openai('gpt-4'), // More powerful model for complex evaluations
  availableTools: [...]
});

Understanding the Results

Code-Based Scorer Results


{
  runId: string,
  preprocessStepResult: {
    expectedTool: string,
    actualTools: string[],
    strictMode: boolean,
    expectedToolOrder?: string[],
    hasToolCalls: boolean,
    correctToolCalled: boolean,
    correctOrderCalled: boolean | null,
    toolCallInfos: ToolCallInfo[]
  },
  score: number // Always 0 or 1
}
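
The fields above are enough to produce a short failure message in test output. The sketch below restates the listed shape as an interface and adds a hypothetical helper, `explainCodeResult`. `ToolCallInfo` is not defined in this document, so it is typed as `unknown` here, and the sample object is invented for illustration.

```typescript
// Shape restated from the listing above; ToolCallInfo is typed as unknown
// because its definition is not shown in this document.
interface CodeScorerResult {
  runId: string;
  preprocessStepResult: {
    expectedTool: string;
    actualTools: string[];
    strictMode: boolean;
    expectedToolOrder?: string[];
    hasToolCalls: boolean;
    correctToolCalled: boolean;
    correctOrderCalled: boolean | null;
    toolCallInfos: unknown[];
  };
  score: number;
}

// Hypothetical helper: turn a result into a one-line explanation.
function explainCodeResult(result: CodeScorerResult): string {
  const step = result.preprocessStepResult;
  if (result.score === 1) return "pass";
  if (!step.hasToolCalls) return "fail: no tools were called";
  if (step.correctOrderCalled === false) {
    const expected = (step.expectedToolOrder ?? []).join(" -> ");
    return `fail: expected order ${expected}, got ${step.actualTools.join(" -> ")}`;
  }
  return `fail: expected ${step.expectedTool}, got ${step.actualTools.join(", ")}`;
}

// Invented sample result for illustration only.
const sampleCodeResult: CodeScorerResult = {
  runId: "run-1",
  preprocessStepResult: {
    expectedTool: "weather-tool",
    actualTools: ["search-tool"],
    strictMode: false,
    hasToolCalls: true,
    correctToolCalled: false,
    correctOrderCalled: null,
    toolCallInfos: [],
  },
  score: 0,
};
console.log(explainCodeResult(sampleCodeResult)); // fail: expected weather-tool, got search-tool
```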

LLM-Based Scorer Results


{
  runId: string,
  score: number,  // 0.0 to 1.0
  reason: string, // Human-readable explanation
  analyzeStepResult: {
    evaluations: Array<{
      toolCalled: string,
      wasAppropriate: boolean,
      reasoning: string
    }>,
    missingTools?: string[]
  }
}
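
The per-tool evaluations can be flattened into a list of findings for logs or reports. The sketch below restates the shape above as an interface and adds a hypothetical helper, `summarizeLLMResult`. The sample values, including the reasoning text, are invented for illustration and are not real scorer output.

```typescript
// Shape restated from the listing above.
interface LLMScorerResult {
  runId: string;
  score: number;
  reason: string;
  analyzeStepResult: {
    evaluations: Array<{
      toolCalled: string;
      wasAppropriate: boolean;
      reasoning: string;
    }>;
    missingTools?: string[];
  };
}

// Hypothetical helper: list inappropriate calls and missing tools.
function summarizeLLMResult(result: LLMScorerResult): string[] {
  const findings: string[] = [];
  for (const evaluation of result.analyzeStepResult.evaluations) {
    if (!evaluation.wasAppropriate) {
      findings.push(`inappropriate: ${evaluation.toolCalled} (${evaluation.reasoning})`);
    }
  }
  // missingTools is optional, so fall back to an empty list.
  for (const missing of result.analyzeStepResult.missingTools ?? []) {
    findings.push(`missing: ${missing}`);
  }
  return findings;
}

// Invented sample result for illustration only.
const sampleLLMResult: LLMScorerResult = {
  runId: "run-1",
  score: 0.5,
  reason: "search-tool was used where weather-tool fit better.",
  analyzeStepResult: {
    evaluations: [
      { toolCalled: "search-tool", wasAppropriate: false, reasoning: "a dedicated weather tool was available" },
    ],
    missingTools: ["weather-tool"],
  },
};
console.log(summarizeLLMResult(sampleLLMResult));
// [ 'inappropriate: search-tool (a dedicated weather tool was available)', 'missing: weather-tool' ]
```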

When to Use Each Scorer

Use Code-Based Scorer For:

Unit testing
CI/CD pipelines
Regression testing
Exact tool matching requirements
Tool sequence validation

Use LLM-Based Scorer For:

Production evaluation
Quality assurance
User intent alignment
Context-aware evaluation
Handling edge cases
