Gates and verdicts

Gates and verdicts add severity semantics to runEvals. Gates are scorers that must score 1.0: they are hard requirements, and missing one fails the run. Thresholds set acceptable bounds (usually a minimum) on tracked metrics. The verdict summarizes the outcome as passed, scored, or failed.

When to use gates and verdicts

Enforce hard requirements in CI (e.g., "agent must call the right tool")
Track quality metrics with minimum thresholds (e.g., "faithfulness above 0.7")
Get a single verdict signal (passed, scored, or failed) from an eval run without writing custom assertion logic
Separate "must pass" gates from "nice to have" tracked metrics

Quickstart

src/evals/weather-eval.ts
import { runEvals } from '@mastra/core/evals'
import { checks } from '@mastra/evals/checks'
import { weatherAgent } from '../agents'
import { faithfulnessScorer } from '../scorers'

const result = await runEvals({
  data: [{ input: 'What is the weather in Brooklyn?' }],
  target: weatherAgent,

  // Gates: must all score 1.0 or the run fails
  gates: [checks.calledTool('get_weather'), checks.noToolErrors()],

  // Scorers: tracked with optional thresholds
  scorers: [
    { scorer: faithfulnessScorer, threshold: 0.7 },
    checks.includes('Brooklyn'), // no threshold = tracked only
  ],
})

console.log(result.verdict) // 'passed' | 'scored' | 'failed'

How verdicts work

The verdict is computed from gates and thresholds after all data items are processed:

failed: At least one gate averaged below 1.0 across data items
scored: All gates passed, but at least one threshold scorer missed its threshold
passed: All gates scored 1.0 and all thresholds were met

When no gates or threshold-bearing scorers are provided, the verdict field is omitted and runEvals behaves as it did before gates and verdicts were introduced.
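
The rules above can be sketched as a small pure function. This only illustrates the decision order and is not Mastra's implementation; `GateSummary`, `ThresholdSummary`, and `computeVerdict` are hypothetical names introduced for this example.

```typescript
// Hypothetical shapes for illustration; not the types exported by Mastra.
type Verdict = 'passed' | 'scored' | 'failed'

interface GateSummary {
  id: string
  averageScore: number // average across all data items
}

interface ThresholdSummary {
  id: string
  passed: boolean // whether the average score met the threshold
}

function computeVerdict(
  gates: GateSummary[],
  thresholds: ThresholdSummary[],
): Verdict | undefined {
  // No gates and no thresholds: nothing to judge, so no verdict.
  if (gates.length === 0 && thresholds.length === 0) return undefined

  // Gates are checked first: any gate averaging below 1.0 fails the run.
  if (gates.some(g => g.averageScore < 1)) return 'failed'

  // All gates passed; a missed threshold downgrades the verdict to 'scored'.
  if (thresholds.some(t => !t.passed)) return 'scored'

  return 'passed'
}

console.log(
  computeVerdict(
    [{ id: 'called-tool', averageScore: 1 }],
    [{ id: 'faithfulness', passed: false }],
  ),
) // 'scored'
```

Note that a failed gate takes precedence: a run with a failed gate is reported as failed even if every threshold was met.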

Gates

Gates are scorers passed via the gates field. They run before regular scorers on each data item. A gate must average a score of 1.0 across all data items to pass.

src/evals/tool-gate.ts
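import { runEvals } from '@mastra/core/evals'
import { checks } from '@mastra/evals/checks'
// Local modules, following the quickstart layout (qualityScorer's path is assumed)
import { weatherAgent } from '../agents'
import { qualityScorer } from '../scorers'

const result = await runEvals({
  data: [{ input: 'What is the weather?' }],
  target: weatherAgent,
  gates: [checks.calledTool('get_weather')],
  scorers: [qualityScorer],
})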

// result.gateResults: [{ id: 'check-called-tool', passed: true, score: 1 }]

Any scorer works as a gate. Quick Checks are a natural fit because they return binary 1/0 scores. See the runEvals() reference for the full parameter and return type documentation.
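
Because a gate is judged on its average score, a single failing data item is enough to fail the gate. A minimal sketch of that arithmetic, assuming the per-item scores have been collected into an array; `gatePassed` is a hypothetical helper, and the handling of an empty array is an assumption of this example, not documented behavior.

```typescript
// Hypothetical helper: a gate passes only if its average score is 1.0.
function gatePassed(itemScores: number[]): boolean {
  // Assumption for this sketch: with no data items, nothing has passed.
  if (itemScores.length === 0) return false

  const total = itemScores.reduce((sum, score) => sum + score, 0)
  const average = total / itemScores.length
  return average === 1
}

console.log(gatePassed([1, 1, 1])) // true
console.log(gatePassed([1, 0, 1])) // false: the average is about 0.67
```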

Gate-only runs

scorers is optional when at least one gate is provided. This is useful for deterministic CI checks where you only care about pass/fail gates and don't need to track any quality metrics.

src/evals/gate-only.ts
import { runEvals } from '@mastra/core/evals'
import { checks } from '@mastra/evals/checks'
import { weatherAgent } from '../agents'

const result = await runEvals({
  data: [{ input: 'What is the weather in Brooklyn?' }],
  target: weatherAgent,
  gates: [checks.calledTool('get_weather'), checks.noToolErrors()],
})

You must provide at least one scorer or gate; a run with neither throws an error.

Thresholds

Wrap a scorer in { scorer, threshold } to set pass/fail bounds. The threshold is compared against the scorer's average score across all data items.

A threshold can be:

A number, treated as a minimum (an average score at or above it passes): { scorer, threshold: 0.7 }
An object with min and/or max, for range-based checks: { scorer, threshold: { max: 0.3 } }

Use max for scorers where a high score is bad (e.g., hallucination, toxicity). Use { min, max } when the score should fall within a specific band.
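
The two threshold forms can be normalized into one bounds check. The following is a sketch under stated assumptions, not Mastra's code: the `Threshold` type and `meetsThreshold` helper are hypothetical, and treating max as inclusive is an assumption (the text above only states that min is inclusive).

```typescript
// Hypothetical type mirroring the two threshold forms described above.
type Threshold = number | { min?: number; max?: number }

function meetsThreshold(averageScore: number, threshold: Threshold): boolean {
  // A bare number is shorthand for a minimum.
  const bounds: { min?: number; max?: number } =
    typeof threshold === 'number' ? { min: threshold } : threshold

  // A score at or above min passes.
  if (bounds.min !== undefined && averageScore < bounds.min) return false
  // Assumption: a score at or below max passes.
  if (bounds.max !== undefined && averageScore > bounds.max) return false
  return true
}

console.log(meetsThreshold(0.85, 0.7)) // true
console.log(meetsThreshold(0.1, { max: 0.3 })) // true
console.log(meetsThreshold(0.9, { min: 0.3, max: 0.8 })) // false
```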

src/evals/threshold-example.ts
import { runEvals } from '@mastra/core/evals'
// Assumed local modules, following the quickstart layout; adjust to your project
import { myAgent } from '../agents'
import { faithfulnessScorer, hallucinationScorer, verbosityScorer, toneScorer } from '../scorers'

const result = await runEvals({
  data: [{ input: 'Explain quantum computing' }],
  target: myAgent,
  scorers: [
    { scorer: faithfulnessScorer, threshold: 0.7 }, // min threshold (number shorthand)
    { scorer: hallucinationScorer, threshold: { max: 0.3 } }, // max threshold: a high score is bad
    { scorer: verbosityScorer, threshold: { min: 0.3, max: 0.8 } }, // range threshold
    toneScorer, // bare scorer, no threshold: tracked only
  ],
})

// result.thresholdResults:
// [
//   { id: 'faithfulness', passed: true, averageScore: 0.85, threshold: 0.7 },
//   { id: 'hallucination', passed: true, averageScore: 0.1, threshold: { max: 0.3 } },
//   { id: 'verbosity', passed: false, averageScore: 0.9, threshold: { min: 0.3, max: 0.8 } },
// ]

A bare scorer (no threshold) still shows up in result.scores but doesn't affect the verdict.

Using verdicts in CI

The verdict gives a single signal for CI pipelines:

src/evals/ci-check.ts
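import { runEvals } from '@mastra/core/evals'
import { checks } from '@mastra/evals/checks'
// Assumed local modules (hypothetical paths); adjust to your project
import { myAgent } from '../agents'
import { faithfulnessScorer } from '../scorers'
import { testDataset } from './dataset'

const result = await runEvals({
  data: testDataset,
  target: myAgent,
  gates: [checks.calledTool('search'), checks.noToolErrors()],
  scorers: [{ scorer: faithfulnessScorer, threshold: 0.7 }],
})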

if (result.verdict === 'failed') {
  console.error(
    'Gate failures:',
    result.gateResults?.filter(g => !g.passed),
  )
  process.exit(1)
}

if (result.verdict === 'scored') {
  console.warn(
    'Threshold misses:',
    result.thresholdResults?.filter(t => !t.passed),
  )
}
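
The script above exits non-zero only on failed and merely warns on scored. If a pipeline should also reject threshold misses, the mapping from verdict to exit code can be made explicit. A minimal sketch, assuming a hypothetical `exitCodeFor` helper and a `strict` flag that are not part of Mastra:

```typescript
type Verdict = 'passed' | 'scored' | 'failed'

// Hypothetical helper: map a verdict to a process exit code.
// In strict mode, a missed threshold ('scored') also fails the build.
function exitCodeFor(verdict: Verdict | undefined, strict = false): number {
  if (verdict === 'failed') return 1
  if (verdict === 'scored' && strict) return 1
  // 'passed', a tolerated 'scored', or an omitted verdict all count as success.
  return 0
}

console.log(exitCodeFor('failed')) // 1
console.log(exitCodeFor('scored')) // 0
console.log(exitCodeFor('scored', true)) // 1
```

In a Node.js script, the result could be assigned with `process.exitCode = exitCodeFor(result.verdict, true)`, which lets pending output flush before the process ends.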

Related

Quick Checks: Zero-LLM micro-scorers that work well as gates
runEvals() reference: Full API documentation
Built-in scorers: LLM-based and code-based scorers
Running evals in CI: CI integration patterns