Tool Call Accuracy Scorers

Mastra provides two tool call accuracy scorers for evaluating whether an LLM selects the correct tools from available options:

  1. Code-based scorer - Deterministic evaluation using exact tool matching
  2. LLM-based scorer - Semantic evaluation using AI to assess appropriateness

For usage examples, see the Tool Call Accuracy Examples.

Code-Based Tool Call Accuracy Scorer

The createToolCallAccuracyScorerCode() function from @mastra/evals/scorers/code provides deterministic binary scoring based on exact tool matching and supports both strict and lenient evaluation modes, as well as tool calling order validation.

Parameters

expectedTool: string
The name of the tool that should be called for the given task. Ignored when expectedToolOrder is provided.

strictMode: boolean
Controls evaluation strictness. In single tool mode, only an exact single tool call is accepted. In order checking mode, the called tools must match the expected order exactly, with no extra tools allowed.

expectedToolOrder: string[]
Array of tool names in the expected calling order. When provided, enables order checking mode and the expectedTool parameter is ignored.

This function returns an instance of the MastraScorer class. See the MastraScorer reference for details on the .run() method and its input/output.

Evaluation Modes

The code-based scorer operates in two distinct modes; a short sketch after the mode descriptions shows how each scores the same call sequence:

Single Tool Mode

When expectedToolOrder is not provided, the scorer evaluates single tool selection:

  • Standard Mode (strictMode: false): Returns 1 if the expected tool is called, regardless of other tools
  • Strict Mode (strictMode: true): Returns 1 only if exactly one tool is called and it matches the expected tool

Order Checking Mode

When expectedToolOrder is provided, the scorer validates tool calling sequence:

  • Strict Order (strictMode: true): Tools must be called in exactly the specified order with no extra tools
  • Flexible Order (strictMode: false): Expected tools must appear in correct relative order (extra tools allowed)
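
To make the difference concrete, here is an illustrative sketch of how each mode would score the same hypothetical call sequence. The tool names and the agent's calls are examples only; the outcomes follow the rules above.

import { createToolCallAccuracyScorerCode } from "@mastra/evals/scorers/code";

// Suppose the agent calls: search-tool, then calendar-tool, then weather-tool.

// Single tool mode
const standard = createToolCallAccuracyScorerCode({ expectedTool: "weather-tool" });
// would score 1 for this sequence: weather-tool was called; extra tools are ignored
const strictSingle = createToolCallAccuracyScorerCode({ expectedTool: "weather-tool", strictMode: true });
// would score 0 for this sequence: more than one tool was called

// Order checking mode
const flexibleOrder = createToolCallAccuracyScorerCode({
  expectedTool: "search-tool", // ignored when expectedToolOrder is provided
  expectedToolOrder: ["search-tool", "weather-tool"],
});
// would score 1 for this sequence: the expected tools appear in the correct relative order
const strictOrder = createToolCallAccuracyScorerCode({
  expectedTool: "search-tool",
  expectedToolOrder: ["search-tool", "weather-tool"],
  strictMode: true,
});
// would score 0 for this sequence: calendar-tool is an extra tool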

Examples

import { createToolCallAccuracyScorerCode } from "@mastra/evals/scorers/code";

// Single tool validation
const scorer = createToolCallAccuracyScorerCode({
  expectedTool: "weather-tool",
});

// Strict single tool (no other tools allowed)
const strictScorer = createToolCallAccuracyScorerCode({
  expectedTool: "calculator-tool",
  strictMode: true,
});

// Tool order validation
const orderScorer = createToolCallAccuracyScorerCode({
  expectedTool: "search-tool", // ignored when order is specified
  expectedToolOrder: ["search-tool", "weather-tool"],
  strictMode: true, // exact match required
});
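
Each scorer returned here is a MastraScorer, so evaluation uses the same .run() call described in the MastraScorer reference. A minimal sketch, assuming agentRun holds an agent run payload in the shape that reference defines:

const result = await scorer.run(agentRun);
console.log(result.score); // 1 if "weather-tool" was called, otherwise 0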

LLM-Based Tool Call Accuracy Scorer

The createToolCallAccuracyScorerLLM() function from @mastra/evals/scorers/llm uses an LLM to evaluate whether the tools called by an agent are appropriate for the given user request, providing semantic evaluation rather than exact matching.

Parameters

model: MastraLanguageModel
The LLM model to use for evaluating tool appropriateness.

availableTools: Array<{name: string, description: string}>
List of available tools with their descriptions for context.

Features

The LLM-based scorer provides:

  • Semantic Evaluation: Understands context and user intent
  • Appropriateness Assessment: Distinguishes between "helpful" and "appropriate" tools
  • Clarification Handling: Recognizes when agents appropriately ask for clarification
  • Missing Tool Detection: Identifies tools that should have been called
  • Reasoning Generation: Provides explanations for scoring decisions

Evaluation Process

  1. Extract Tool Calls: Identifies tools mentioned in agent output
  2. Analyze Appropriateness: Evaluates each tool against user request
  3. Generate Score: Calculates score based on appropriate vs total tool calls
  4. Generate Reasoning: Provides human-readable explanation
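
Step 3 implies the score is essentially a ratio of appropriate tool calls to total tool calls. A sketch of that assumed calculation, not necessarily the library's exact rule:

// Assumed scoring rule inferred from step 3 above; not taken from the library source.
// Applies when the agent called at least one tool (clarification-only responses are
// handled separately, per the Features list above).
function toolAccuracyScore(appropriateCalls: number, totalCalls: number): number {
  return appropriateCalls / totalCalls;
}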

Examples

import { createToolCallAccuracyScorerLLM } from "@mastra/evals/scorers/llm";
import { openai } from "@ai-sdk/openai";

const llmScorer = createToolCallAccuracyScorerLLM({
  model: openai("gpt-4o-mini"),
  availableTools: [
    {
      name: "weather-tool",
      description: "Get current weather information for any location",
    },
    {
      name: "search-tool",
      description: "Search the web for information",
    },
    {
      name: "calendar-tool",
      description: "Check calendar events and scheduling",
    },
  ],
});

const result = await llmScorer.run(agentRun);
console.log(result.score); // 0.0 to 1.0
console.log(result.reason); // Explanation of the score

Choosing Between Scorers

Use the Code-Based Scorer When:

  • You need deterministic, reproducible results
  • You want to test exact tool matching
  • You need to validate specific tool sequences
  • Speed and cost are priorities (no LLM calls)
  • You're running automated tests

Use the LLM-Based Scorer When:

  • You need semantic understanding of appropriateness
  • Tool selection depends on context and intent
  • You want to handle edge cases like clarification requests
  • You need explanations for scoring decisions
  • You're evaluating production agent behavior

Scoring Details

Code-Based Scoring

  • Binary scores: Always returns 0 or 1
  • Deterministic: Same input always produces same output
  • Fast: No external API calls

LLM-Based Scoring

  • Fractional scores: Returns values between 0.0 and 1.0
  • Context-aware: Considers user intent and appropriateness
  • Explanatory: Provides reasoning for scores
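
In practice, this difference shows up in how results are asserted: code-based scores can be compared for exact equality, while LLM-based scores are usually checked against a threshold. A small sketch (the 0.8 threshold is an arbitrary example, not a library default):

function assertToolSelection(codeScore: number, llmScore: number, threshold = 0.8): void {
  // The code-based scorer returns 0 or 1, so exact comparison is safe
  if (codeScore !== 1) {
    throw new Error("Expected tool was not selected");
  }
  // The LLM-based scorer returns a fraction between 0.0 and 1.0
  if (llmScore < threshold) {
    throw new Error(`Tool selection scored ${llmScore}, below threshold ${threshold}`);
  }
}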

Use Cases

Code-Based Scorer Use Cases

  • Unit Testing: Verify specific tool selection behavior
  • Regression Testing: Ensure tool selection doesn't change
  • Workflow Validation: Check tool sequences in multi-step processes
  • CI/CD Pipelines: Fast, deterministic validation

LLM-Based Scorer Use Cases

  • Quality Assurance: Evaluate production agent behavior
  • A/B Testing: Compare different agent implementations
  • User Intent Alignment: Ensure tools match user needs
  • Edge Case Handling: Evaluate clarification and error scenarios