# Tool Call Accuracy Scorers

Mastra provides two tool call accuracy scorers for evaluating whether an LLM selects the correct tools from the available options:

1. **Code-based scorer** - Deterministic evaluation using exact tool matching
2. **LLM-based scorer** - Semantic evaluation using AI to assess appropriateness

## Choosing Between Scorers

### Use the Code-Based Scorer When:

- You need **deterministic, reproducible** results
- You want to test **exact tool matching**
- You need to validate **specific tool sequences**
- Speed and cost are priorities (no LLM calls)
- You're running automated tests

### Use the LLM-Based Scorer When:

- You need **semantic understanding** of appropriateness
- Tool selection depends on **context and intent**
- You want to handle **edge cases** like clarification requests
- You need **explanations** for scoring decisions
- You're evaluating **production agent behavior**

## Code-Based Tool Call Accuracy Scorer

The `createToolCallAccuracyScorerCode()` function from `@mastra/evals/scorers/prebuilt` provides deterministic, binary scoring based on exact tool matching. It supports both strict and lenient evaluation modes, as well as tool calling order validation.

### Parameters

**expectedTool** (`string`): The name of the tool that should be called for the given task. Ignored when `expectedToolOrder` is provided.

**strictMode** (`boolean`): Controls evaluation strictness. In single tool mode, only an exact single tool call is accepted. In order checking mode, tools must match the expected order exactly, with no extra tools allowed.

**expectedToolOrder** (`string[]`): Array of tool names in the expected calling order. When provided, enables order checking mode and the `expectedTool` parameter is ignored.

This function returns an instance of the MastraScorer class. See the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer) for details on the `.run()` method and its input/output.
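As a quick illustration of how these parameters interact, the sketch below (with hypothetical tool names) creates one scorer in single tool mode and one in order checking mode; passing `expectedToolOrder` switches the mode and causes `expectedTool` to be ignored. Both modes are described in detail in the next section.

```typescript
import { createToolCallAccuracyScorerCode } from "@mastra/evals/scorers/prebuilt";

// Single tool mode: only expectedTool (and optionally strictMode) is set
const singleToolScorer = createToolCallAccuracyScorerCode({
  expectedTool: "weather-tool",
  strictMode: false,
});

// Order checking mode: expectedToolOrder selects the mode, expectedTool is ignored
const orderCheckingScorer = createToolCallAccuracyScorerCode({
  expectedTool: "auth-tool", // ignored
  expectedToolOrder: ["auth-tool", "fetch-tool"],
  strictMode: true, // no extra tools allowed
});
```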
### Evaluation Modes

The code-based scorer operates in two distinct modes:

#### Single Tool Mode

When `expectedToolOrder` is not provided, the scorer evaluates single tool selection:

- **Standard Mode (strictMode: false)**: Returns `1` if the expected tool is called, regardless of other tools
- **Strict Mode (strictMode: true)**: Returns `1` only if exactly one tool is called and it matches the expected tool

#### Order Checking Mode

When `expectedToolOrder` is provided, the scorer validates the tool calling sequence:

- **Strict Order (strictMode: true)**: Tools must be called in exactly the specified order with no extra tools
- **Flexible Order (strictMode: false)**: Expected tools must appear in the correct relative order (extra tools allowed)

## Code-Based Scoring Details

- **Binary scores**: Always returns 0 or 1
- **Deterministic**: Same input always produces same output
- **Fast**: No external API calls

### Code-Based Scorer Options

```typescript
// createCodeScorer is an alias for createToolCallAccuracyScorerCode
// (see the import in the comparison example below)

// Standard mode - passes if expected tool is called
const lenientScorer = createCodeScorer({
  expectedTool: "search-tool",
  strictMode: false,
});

// Strict mode - only passes if exactly one tool is called
const strictScorer = createCodeScorer({
  expectedTool: "search-tool",
  strictMode: true,
});

// Order checking with strict mode
const strictOrderScorer = createCodeScorer({
  expectedTool: "step1-tool", // ignored when expectedToolOrder is provided
  expectedToolOrder: ["step1-tool", "step2-tool", "step3-tool"],
  strictMode: true, // no extra tools allowed
});
```

### Code-Based Scorer Results

```typescript
{
  runId: string,
  preprocessStepResult: {
    expectedTool: string,
    actualTools: string[],
    strictMode: boolean,
    expectedToolOrder?: string[],
    hasToolCalls: boolean,
    correctToolCalled: boolean,
    correctOrderCalled: boolean | null,
    toolCallInfos: ToolCallInfo[]
  },
  score: number // Always 0 or 1
}
```
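In a test, both the score and the preprocess fields shown above can be asserted on. The snippet below is a minimal, hypothetical helper; it assumes an agent test run built with `createAgentTestRun`, as in the examples that follow.

```typescript
import { createToolCallAccuracyScorerCode } from "@mastra/evals/scorers/prebuilt";

const scorer = createToolCallAccuracyScorerCode({ expectedTool: "search-tool" });

// Hypothetical helper: report why a run passed or failed
async function reportToolAccuracy(run: Parameters<typeof scorer.run>[0]) {
  const result = await scorer.run(run);

  if (result.score === 1) {
    console.log("Expected tool was called:", result.preprocessStepResult?.expectedTool);
  } else {
    console.log("Tools actually called:", result.preprocessStepResult?.actualTools);
  }

  return result;
}
```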
## Code-Based Scorer Examples

The code-based scorer provides deterministic, binary scoring (0 or 1) based on exact tool matching.

### Correct tool selection

```typescript
const scorer = createToolCallAccuracyScorerCode({
  expectedTool: "weather-tool",
});

// Simulate LLM input and output with a tool call
const inputMessages = [
  createTestMessage({
    content: "What is the weather like in New York today?",
    role: "user",
    id: "input-1",
  }),
];

const output = [
  createTestMessage({
    content: "Let me check the weather for you.",
    role: "assistant",
    id: "output-1",
    toolInvocations: [
      createToolInvocation({
        toolCallId: "call-123",
        toolName: "weather-tool",
        args: { location: "New York" },
        result: { temperature: "72°F", condition: "sunny" },
        state: "result",
      }),
    ],
  }),
];

const run = createAgentTestRun({ inputMessages, output });
const result = await scorer.run(run);

console.log(result.score); // 1
console.log(result.preprocessStepResult?.correctToolCalled); // true
```

### Strict mode evaluation

Only passes if exactly one tool is called:

```typescript
const strictScorer = createToolCallAccuracyScorerCode({
  expectedTool: "weather-tool",
  strictMode: true,
});

// Multiple tools called - fails in strict mode
const output = [
  createTestMessage({
    content: "Let me help you with that.",
    role: "assistant",
    id: "output-1",
    toolInvocations: [
      createToolInvocation({
        toolCallId: "call-1",
        toolName: "search-tool",
        args: {},
        result: {},
        state: "result",
      }),
      createToolInvocation({
        toolCallId: "call-2",
        toolName: "weather-tool",
        args: { location: "New York" },
        result: { temperature: "20°C" },
        state: "result",
      }),
    ],
  }),
];

const run = createAgentTestRun({ inputMessages, output }); // inputMessages as in the previous example
const result = await strictScorer.run(run);

console.log(result.score); // 0 - fails because multiple tools were called
```

### Tool order validation

Validates that tools are called in a specific sequence:

```typescript
const orderScorer = createToolCallAccuracyScorerCode({
  expectedTool: "auth-tool", // ignored when order is specified
  expectedToolOrder: ["auth-tool", "fetch-tool"],
  strictMode: true, // no extra tools allowed
});

const output = [
  createTestMessage({
    content: "I will authenticate and fetch the data.",
    role: "assistant",
    id: "output-1",
    toolInvocations: [
      createToolInvocation({
        toolCallId: "call-1",
        toolName: "auth-tool",
        args: { token: "abc123" },
        result: { authenticated: true },
        state: "result",
      }),
      createToolInvocation({
        toolCallId: "call-2",
        toolName: "fetch-tool",
        args: { endpoint: "/data" },
        result: { data: ["item1"] },
        state: "result",
      }),
    ],
  }),
];

const run = createAgentTestRun({ inputMessages, output }); // inputMessages as in the earlier examples
const result = await orderScorer.run(run);

console.log(result.score); // 1 - correct order
```

### Flexible order mode

Allows extra tools as long as expected tools maintain relative order:

```typescript
const flexibleOrderScorer = createToolCallAccuracyScorerCode({
  expectedTool: "auth-tool",
  expectedToolOrder: ["auth-tool", "fetch-tool"],
  strictMode: false, // allows extra tools
});

const output = [
  createTestMessage({
    content: "Performing comprehensive operation.",
    role: "assistant",
    id: "output-1",
    toolInvocations: [
      createToolInvocation({
        toolCallId: "call-1",
        toolName: "auth-tool",
        args: { token: "abc123" },
        result: { authenticated: true },
        state: "result",
      }),
      createToolInvocation({
        toolCallId: "call-2",
        toolName: "log-tool", // Extra tool - OK in flexible mode
        args: { message: "Starting fetch" },
        result: { logged: true },
        state: "result",
      }),
      createToolInvocation({
        toolCallId: "call-3",
        toolName: "fetch-tool",
        args: { endpoint: "/data" },
        result: { data: ["item1"] },
        state: "result",
      }),
    ],
  }),
];

const run = createAgentTestRun({ inputMessages, output }); // inputMessages as in the earlier examples
const result = await flexibleOrderScorer.run(run);

console.log(result.score); // 1 - auth-tool comes before fetch-tool
```
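For contrast, a run in which the expected tools appear in the wrong order scores 0 in either order-checking mode. A minimal sketch, reusing the `orderScorer`, `inputMessages`, and test helpers from the examples above:

```typescript
// Wrong order: fetch-tool is called before auth-tool
const outOfOrderOutput = [
  createTestMessage({
    content: "Fetching the data first, then authenticating.",
    role: "assistant",
    id: "output-1",
    toolInvocations: [
      createToolInvocation({
        toolCallId: "call-1",
        toolName: "fetch-tool",
        args: { endpoint: "/data" },
        result: { data: [] },
        state: "result",
      }),
      createToolInvocation({
        toolCallId: "call-2",
        toolName: "auth-tool",
        args: { token: "abc123" },
        result: { authenticated: true },
        state: "result",
      }),
    ],
  }),
];

const outOfOrderRun = createAgentTestRun({ inputMessages, output: outOfOrderOutput });
const outOfOrderResult = await orderScorer.run(outOfOrderRun);

console.log(outOfOrderResult.score); // 0 - expected auth-tool before fetch-tool
```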
## LLM-Based Tool Call Accuracy Scorer

The `createToolCallAccuracyScorerLLM()` function from `@mastra/evals/scorers/prebuilt` uses an LLM to evaluate whether the tools called by an agent are appropriate for the given user request, providing semantic evaluation rather than exact matching.

### Parameters

**model** (`MastraModelConfig`): The LLM model to use for evaluating tool appropriateness.

**availableTools** (`Array<{name: string, description: string}>`): List of available tools with their descriptions for context.

### Features

The LLM-based scorer provides:

- **Semantic Evaluation**: Understands context and user intent
- **Appropriateness Assessment**: Distinguishes between "helpful" and "appropriate" tools
- **Clarification Handling**: Recognizes when agents appropriately ask for clarification
- **Missing Tool Detection**: Identifies tools that should have been called
- **Reasoning Generation**: Provides explanations for scoring decisions

### Evaluation Process

1. **Extract Tool Calls**: Identifies tools mentioned in agent output
2. **Analyze Appropriateness**: Evaluates each tool against the user request
3. **Generate Score**: Calculates the score based on appropriate vs. total tool calls
4. **Generate Reasoning**: Provides a human-readable explanation

## LLM-Based Scoring Details

- **Fractional scores**: Returns values between 0.0 and 1.0
- **Context-aware**: Considers user intent and appropriateness
- **Explanatory**: Provides reasoning for scores

### LLM-Based Scorer Options

```typescript
// Basic configuration
const basicLLMScorer = createLLMScorer({
  model: "openai/gpt-5.1",
  availableTools: [
    { name: "tool1", description: "Description 1" },
    { name: "tool2", description: "Description 2" },
  ],
});

// With different model
const customModelScorer = createLLMScorer({
  model: "openai/gpt-5", // More powerful model for complex evaluations
  availableTools: [...],
});
```

### LLM-Based Scorer Results

```typescript
{
  runId: string,
  score: number,   // 0.0 to 1.0
  reason: string,  // Human-readable explanation
  analyzeStepResult: {
    evaluations: Array<{
      toolCalled: string,
      wasAppropriate: boolean,
      reasoning: string
    }>,
    missingTools?: string[]
  }
}
```

## LLM-Based Scorer Examples

The LLM-based scorer uses AI to evaluate whether tool selections are appropriate for the user's request.

### Basic LLM evaluation

```typescript
const llmScorer = createToolCallAccuracyScorerLLM({
  model: "openai/gpt-5.1",
  availableTools: [
    {
      name: "weather-tool",
      description: "Get current weather information for any location",
    },
    {
      name: "calendar-tool",
      description: "Check calendar events and scheduling",
    },
    {
      name: "search-tool",
      description: "Search the web for general information",
    },
  ],
});

const inputMessages = [
  createTestMessage({
    content: "What is the weather like in San Francisco today?",
    role: "user",
    id: "input-1",
  }),
];

const output = [
  createTestMessage({
    content: "Let me check the current weather for you.",
    role: "assistant",
    id: "output-1",
    toolInvocations: [
      createToolInvocation({
        toolCallId: "call-123",
        toolName: "weather-tool",
        args: { location: "San Francisco", date: "today" },
        result: { temperature: "68°F", condition: "foggy" },
        state: "result",
      }),
    ],
  }),
];

const run = createAgentTestRun({ inputMessages, output });
const result = await llmScorer.run(run);

console.log(result.score); // 1.0 - appropriate tool usage
console.log(result.reason); // "The agent correctly used the weather-tool to address the user's request for weather information."
```
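The scorer can also flag tools that arguably should have been called. In the hypothetical sketch below, the agent answers the same weather question from memory without calling any tool; the exact score and reason depend on the judge model, but `analyzeStepResult.missingTools` is where a missed `weather-tool` would be reported. It reuses the `llmScorer`, `inputMessages`, and test helpers from the example above.

```typescript
const noToolOutput = [
  createTestMessage({
    content: "It's probably sunny in San Francisco this time of year.",
    role: "assistant",
    id: "output-1",
    // No tool invocations, even though weather-tool was available and relevant
  }),
];

const noToolRun = createAgentTestRun({ inputMessages, output: noToolOutput });
const noToolResult = await llmScorer.run(noToolRun);

console.log(noToolResult.score); // Low score expected - the agent guessed instead of using weather-tool
console.log(noToolResult.analyzeStepResult?.missingTools); // e.g. ["weather-tool"]
```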
### Handling inappropriate tool usage

```typescript
const inputMessages = [
  createTestMessage({
    content: "What is the weather in Tokyo?",
    role: "user",
    id: "input-1",
  }),
];

const inappropriateOutput = [
  createTestMessage({
    content: "Let me search for that information.",
    role: "assistant",
    id: "output-1",
    toolInvocations: [
      createToolInvocation({
        toolCallId: "call-456",
        toolName: "search-tool", // Less appropriate than weather-tool
        args: { query: "Tokyo weather" },
        result: { results: ["Tokyo weather data..."] },
        state: "result",
      }),
    ],
  }),
];

const run = createAgentTestRun({ inputMessages, output: inappropriateOutput });
const result = await llmScorer.run(run);

console.log(result.score); // 0.5 - partially appropriate
console.log(result.reason); // "The agent used search-tool when weather-tool would have been more appropriate for a direct weather query."
```

### Evaluating clarification requests

The LLM scorer recognizes when agents appropriately ask for clarification:

```typescript
const vagueInput = [
  createTestMessage({
    content: "I need help with something",
    role: "user",
    id: "input-1",
  }),
];

const clarificationOutput = [
  createTestMessage({
    content: "I'd be happy to help! Could you please provide more details about what you need assistance with?",
    role: "assistant",
    id: "output-1",
    // No tools called - asking for clarification instead
  }),
];

const run = createAgentTestRun({ inputMessages: vagueInput, output: clarificationOutput });
const result = await llmScorer.run(run);

console.log(result.score); // 1.0 - appropriate to ask for clarification
console.log(result.reason); // "The agent appropriately asked for clarification rather than calling tools with insufficient information."
```

## Comparing Both Scorers

Here's an example using both scorers on the same data:

```typescript
import {
  createToolCallAccuracyScorerCode as createCodeScorer,
  createToolCallAccuracyScorerLLM as createLLMScorer,
} from "@mastra/evals/scorers/prebuilt";

// Set up both scorers
const codeScorer = createCodeScorer({
  expectedTool: "weather-tool",
  strictMode: false,
});

const llmScorer = createLLMScorer({
  model: "openai/gpt-5.1",
  availableTools: [
    { name: "weather-tool", description: "Get weather information" },
    { name: "search-tool", description: "Search the web" },
  ],
});

// Test data
const run = createAgentTestRun({
  inputMessages: [
    createTestMessage({
      content: "What is the weather?",
      role: "user",
      id: "input-1",
    }),
  ],
  output: [
    createTestMessage({
      content: "Let me find that information.",
      role: "assistant",
      id: "output-1",
      toolInvocations: [
        createToolInvocation({
          toolCallId: "call-1",
          toolName: "search-tool",
          args: { query: "weather" },
          result: { results: ["weather data"] },
          state: "result",
        }),
      ],
    }),
  ],
});

// Run both scorers
const codeResult = await codeScorer.run(run);
const llmResult = await llmScorer.run(run);

console.log("Code Scorer:", codeResult.score); // 0 - wrong tool
console.log("LLM Scorer:", llmResult.score); // 0.3 - partially appropriate
console.log("LLM Reason:", llmResult.reason); // Explains why search-tool is less appropriate
```

## Related

- [Answer Relevancy Scorer](https://mastra.ai/reference/evals/answer-relevancy)
- [Completeness Scorer](https://mastra.ai/reference/evals/completeness)
- [Faithfulness Scorer](https://mastra.ai/reference/evals/faithfulness)
- [Custom Scorers](https://mastra.ai/docs/evals/custom-scorers)