Trajectory accuracy scorers
Mastra provides two trajectory accuracy scorers for evaluating whether an agent or workflow follows an expected sequence of actions:
- Code-based scorer - Deterministic evaluation using exact step matching and ordering
- LLM-based scorer - Semantic evaluation using AI to assess trajectory quality and appropriateness
Both scorers work with agents and workflows. The runEvals pipeline automatically extracts trajectories, so scorers receive a Trajectory object directly.
Trajectory extraction
The runEvals pipeline uses two extraction strategies, depending on whether observability storage is configured:
Trace-based extraction (preferred)
When the target's Mastra instance has storage configured, the pipeline fetches the full execution trace from the observability store and calls extractTrajectoryFromTrace(). This produces a hierarchical trajectory with nested children, capturing the complete execution tree — including nested agent runs, tool calls within workflow steps, and model generations.
For example, a workflow that calls an agent, which in turn calls tools, produces:
```
workflow_run
└─ workflow_step (validate-input)
└─ workflow_step (process-data)
   └─ agent_run (my-agent)
      └─ model_generation
      └─ tool_call (search)
      └─ model_generation
      └─ tool_call (summarize)
└─ workflow_step (save-result)
```
Fallback extraction
When storage is not available, the pipeline falls back to:
- Agents: `extractTrajectory()` — Extracts `ToolCallStep` entries from `toolInvocations` in the agent's message output. Produces a flat list of tool calls.
- Workflows: `extractWorkflowTrajectory()` — Extracts `WorkflowStepStep` entries from `stepResults`. Produces a flat list of workflow steps.
These fallbacks don't capture nested execution or non-tool-call spans.
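A minimal sketch of the agent fallback path, assuming a simplified message shape (the real `extractTrajectory()` works on Mastra's actual message types):

```typescript
// Sketch of the agent fallback: read toolInvocations off the agent's
// message output and emit a flat list of tool-call steps.
// The ToolInvocation shape here is a simplified assumption.
interface ToolInvocation {
  toolName: string
  args?: Record<string, unknown>
  result?: unknown
}

interface FlatToolCallStep {
  stepType: 'tool_call'
  name: string
  toolArgs?: Record<string, unknown>
  toolResult?: unknown
}

// No nesting and no model_generation spans — just the tool calls, in order.
function extractFlatTrajectory(toolInvocations: ToolInvocation[]): FlatToolCallStep[] {
  return toolInvocations.map((inv) => ({
    stepType: 'tool_call' as const,
    name: inv.toolName,
    toolArgs: inv.args,
    toolResult: inv.result,
  }))
}
```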
Trajectory types
Trajectory steps use a discriminated union on stepType. Each step type has specific properties:
ToolCallStep
Represents an agent tool call.
- `stepType`: `'tool_call'`
- `name`: the name of the tool that was called
- `toolArgs?`: the arguments passed to the tool
- `toolResult?`: the result returned by the tool
- `success?`: whether the tool call succeeded
- `durationMs?`: execution duration in milliseconds
- `metadata?`: additional metadata for the step
- `children?`: nested child steps
WorkflowStepStep
Represents a workflow step execution.
- `stepType`: `'workflow_step'`
- `name`: the name of the workflow step
- `stepId?`: the step's identifier
- `status?`: the step's execution status
- `output?`: the step's output
- `durationMs?`: execution duration in milliseconds
- `metadata?`: additional metadata for the step
- `children?`: nested child steps
Other step types
The discriminated union includes these additional step types:
| Step type | Key properties |
|---|---|
| `mcp_tool_call` | `toolArgs`, `toolResult`, `mcpServer`, `success` |
| `model_generation` | `modelId`, `promptTokens`, `completionTokens`, `finishReason` |
| `agent_run` | `agentId` |
| `workflow_run` | `workflowId`, `status` |
| `workflow_conditional` | `conditionCount`, `selectedSteps` |
| `workflow_parallel` | `branchCount`, `parallelSteps` |
| `workflow_loop` | `loopType`, `totalIterations` |
| `workflow_sleep` | `durationMs`, `sleepType` |
| `workflow_wait_event` | `eventName`, `eventReceived` |
| `processor_run` | `processorId` |
All step types share the base properties `name`, `durationMs`, `metadata`, and `children`.
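A condensed sketch of a few variants of this union in TypeScript (field names follow the tables above, but the exact types exported by `@mastra/evals` may differ):

```typescript
// Sketch of the trajectory step union — field names follow the property
// tables above; the library's exported type definitions may differ.
interface BaseStep {
  name: string
  durationMs?: number
  metadata?: Record<string, unknown>
  children?: TrajectoryStep[]
}

interface ToolCallStep extends BaseStep {
  stepType: 'tool_call'
  toolArgs?: Record<string, unknown>
  toolResult?: unknown
  success?: boolean
}

interface WorkflowStepStep extends BaseStep {
  stepType: 'workflow_step'
  stepId?: string
  status?: string
  output?: unknown
}

interface ModelGenerationStep extends BaseStep {
  stepType: 'model_generation'
  modelId?: string
  promptTokens?: number
  completionTokens?: number
  finishReason?: string
}

type TrajectoryStep = ToolCallStep | WorkflowStepStep | ModelGenerationStep

// Narrowing on stepType gives access to that variant's fields:
function describe(step: TrajectoryStep): string {
  if (step.stepType === 'tool_call') return `tool ${step.name}`
  if (step.stepType === 'model_generation') return `model ${step.modelId ?? step.name}`
  return `step ${step.name}`
}
```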
Expected steps
When defining expected trajectories, use ExpectedStep instead of the full TrajectoryStep discriminated union. ExpectedStep is a discriminated union that mirrors TrajectoryStep — when you specify a stepType, you get autocomplete for that variant's fields (e.g., toolArgs for tool_call, modelId for model_generation). All variant-specific fields are optional, so you only assert against what you care about.
Omit stepType entirely to match any step by name only.
- `name`: the step name to match
- `stepType?`: optional step type discriminator
- (variant fields)?: optional variant-specific fields, auto-compared when present
- `children?`: optional nested expectations for the step's children
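One way to picture the matching semantics — an illustrative sketch, not the library's actual comparator:

```typescript
// Illustrative sketch of how an expected step might match an actual step:
// name must match, stepType is checked only when specified, and any
// variant fields present on the expectation are deep-compared.
type Step = { name: string; stepType?: string } & Record<string, unknown>

function matchesExpected(actual: Step, expected: Step): boolean {
  if (actual.name !== expected.name) return false
  if (expected.stepType && actual.stepType !== expected.stepType) return false
  // Compare only the fields the expectation asserts (skip name/stepType/children)
  for (const [key, value] of Object.entries(expected)) {
    if (key === 'name' || key === 'stepType' || key === 'children') continue
    if (JSON.stringify(actual[key]) !== JSON.stringify(value)) return false
  }
  return true
}
```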
Simple expected steps
```typescript
const steps: ExpectedStep[] = [
  // Match by name only (any step type)
  { name: 'search' },
  // Match by name and step type (autocomplete for tool_call fields)
  { name: 'search', stepType: 'tool_call' },
  // Match with specific toolArgs (auto-compared when present)
  { name: 'search', stepType: 'tool_call', toolArgs: { query: 'weather' } },
  // Match a model generation step by model ID
  { name: 'gpt-4o', stepType: 'model_generation', modelId: 'gpt-4o' },
]
```
Nested expectations
Each expected step can include a children config with its own evaluation rules. This lets you set different ordering or comparison rules at each level of the hierarchy.
```typescript
const scorer = createTrajectoryScorerCode({
  defaults: {
    ordering: 'strict',
    steps: [
      { name: 'validate-input', stepType: 'workflow_step' },
      {
        name: 'research-agent',
        stepType: 'agent_run',
        children: {
          // Sub-agent can call tools in any order
          ordering: 'unordered',
          steps: [
            { name: 'search', stepType: 'tool_call' },
            { name: 'summarize', stepType: 'tool_call' },
          ],
        },
      },
      { name: 'save-result', stepType: 'workflow_step' },
    ],
  },
})
```
In this example, the parent workflow requires strict ordering of its steps, but the nested research-agent allows its tool calls in any order.
Choosing between scorers
Use the code-based scorer when:
- You need deterministic, reproducible results
- You have a known expected trajectory to compare against
- You want to validate exact step sequences
- Speed and cost are priorities (no LLM calls)
- You are running automated tests in CI/CD
Use the LLM-based scorer when:
- You need semantic understanding of whether steps were appropriate
- The optimal trajectory is not predetermined (evaluate based on task requirements)
- You want to detect unnecessary, redundant, or missing steps
- You need explanations for scoring decisions
- You are evaluating production agent behavior
Code-based trajectory accuracy scorer
The createTrajectoryAccuracyScorerCode() function from @mastra/evals/scorers/prebuilt provides deterministic scoring based on step matching and ordering against an expected trajectory.
Parameters
- `expectedTrajectory?` — the expected trajectory to compare against; if omitted, it can be supplied per dataset item
- `comparisonOptions?` — comparison settings such as `strictOrder`
This function returns an instance of the MastraScorer class. See the MastraScorer reference for details on the .run() method and its input/output.
Expected trajectory sources
The code-based scorer resolves `expectedTrajectory` from two sources, in order of priority:
- Constructor option — A static trajectory passed when creating the scorer. Used for all dataset items.
- Dataset item — An `expectedTrajectory` field on the dataset item, passed through the `runEvals` pipeline. Allows different expected trajectories per item.
```typescript
// Static: same expected trajectory for all items
const scorer = createTrajectoryAccuracyScorerCode({
  expectedTrajectory: {
    steps: [
      { stepType: 'tool_call', name: 'search' },
      { stepType: 'tool_call', name: 'summarize' },
    ],
  },
})
```

```typescript
// Per-item: each dataset item has its own expectedTrajectory
const scorer = createTrajectoryAccuracyScorerCode()

await runEvals({
  target: myAgent,
  scorers: { trajectory: [scorer] },
  data: [
    {
      input: 'Search and summarize weather',
      expectedTrajectory: {
        steps: [
          { stepType: 'tool_call', name: 'search' },
          { stepType: 'tool_call', name: 'summarize' },
        ],
      },
    },
    {
      input: 'Just search for weather',
      expectedTrajectory: {
        steps: [{ stepType: 'tool_call', name: 'search' }],
      },
    },
  ],
})
```
Evaluation modes
The code-based scorer operates in two modes based on strictOrder:
Strict mode (strictOrder: true)
Requires an exact match. The actual steps must match the expected steps in the same order with no extra or missing steps. Returns 1.0 for an exact match and 0.0 otherwise.
Relaxed mode (strictOrder: false, default)
Allows extra steps. Expected steps must appear in the correct relative order. The score is calculated based on how many expected steps were matched, with optional penalties for extra or repeated steps.
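The two modes can be sketched over flat lists of step names. This is an illustration only — the real comparator also checks step types and data fields, and its penalty weights are internal; the 0.25 extra-step penalty below is an assumed placeholder:

```typescript
// Illustrative sketch of the two evaluation modes. The real scorer's
// penalty weights are internal; the 0.25 value here is a placeholder.
function scoreTrajectory(actual: string[], expected: string[], strictOrder: boolean): number {
  if (strictOrder) {
    // Exact match: same steps, same order, no extras → 1.0, else 0.0
    return actual.length === expected.length && actual.every((s, i) => s === expected[i]) ? 1 : 0
  }
  // Relaxed: expected steps must appear as a subsequence of the actual steps
  let matched = 0
  for (const step of actual) {
    if (step === expected[matched]) matched++
  }
  const base = matched / expected.length
  const extras = actual.length - matched
  const penalty = extras > 0 ? 0.25 : 0 // illustrative penalty for extra steps
  return Math.max(0, base - penalty)
}
```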
Code-based scoring details
- Continuous scores: Returns values between 0.0 and 1.0 in relaxed mode; binary (0 or 1) in strict mode
- Deterministic: Same input always produces the same output
- Fast: No external API calls
Code-based scorer results
```typescript
{
  runId: string,
  preprocessStepResult: {
    actualTrajectory: Trajectory,
    expectedTrajectory: Trajectory,
    comparison: {
      score: number,
      matchedSteps: number,
      totalExpectedSteps: number,
      totalActualSteps: number,
      missingSteps: string[],
      extraSteps: string[],
      outOfOrderSteps: string[],
      repeatedSteps: string[]
    },
    actualStepNames: string[],
    expectedStepNames: string[]
  },
  score: number
}
```
Code-based scorer examples
Agent trajectory with strict ordering
Validates that an agent follows an exact sequence of tool calls:
```typescript
import { createTrajectoryAccuracyScorerCode } from '@mastra/evals/scorers/prebuilt'
import { runEvals } from '@mastra/core/evals'

const scorer = createTrajectoryAccuracyScorerCode({
  expectedTrajectory: {
    steps: [
      { stepType: 'tool_call', name: 'auth-tool' },
      { stepType: 'tool_call', name: 'fetch-tool' },
    ],
  },
  comparisonOptions: { strictOrder: true },
})

const result = await runEvals({
  target: myAgent,
  scorers: { trajectory: [scorer] },
  data: [{ input: 'Get my data' }],
})

console.log(result.scores.trajectory['trajectory-accuracy']) // 1.0
```
Agent trajectory with relaxed ordering
Allows extra steps as long as expected steps appear in the correct relative order:
```typescript
const scorer = createTrajectoryAccuracyScorerCode({
  expectedTrajectory: {
    steps: [
      { stepType: 'tool_call', name: 'search-tool' },
      { stepType: 'tool_call', name: 'summarize-tool' },
    ],
  },
  comparisonOptions: { strictOrder: false },
})

// Agent called search-tool → log-tool → summarize-tool
// The extra log-tool is allowed in relaxed mode
// score: 0.75 — all expected steps matched, small penalty for extra step
```
Workflow trajectory
Evaluates a workflow's execution path:
```typescript
import { createTrajectoryAccuracyScorerCode } from '@mastra/evals/scorers/prebuilt'
import { runEvals } from '@mastra/core/evals'

const scorer = createTrajectoryAccuracyScorerCode({
  expectedTrajectory: {
    steps: [
      { stepType: 'workflow_step', name: 'validate-input' },
      { stepType: 'workflow_step', name: 'process-data' },
      { stepType: 'workflow_step', name: 'save-result' },
    ],
  },
})

const result = await runEvals({
  target: myWorkflow,
  scorers: { trajectory: [scorer] },
  data: [{ input: { data: 'test' } }],
})

console.log(result.scores.trajectory['trajectory-accuracy'])
```
Comparing step data
Validates not just the step names but also step-specific data. For tool calls, this compares toolArgs and toolResult. For workflow steps, this compares output.
```typescript
const scorer = createTrajectoryAccuracyScorerCode({
  expectedTrajectory: {
    steps: [
      {
        stepType: 'tool_call',
        name: 'search-tool',
        toolArgs: { query: 'weather in NYC' },
      },
    ],
  },
})

// Data fields like toolArgs are auto-compared when present on expected steps
```
LLM-based trajectory accuracy scorer
The createTrajectoryAccuracyScorerLLM() function from @mastra/evals/scorers/prebuilt uses an LLM to evaluate whether an agent's or workflow's trajectory was appropriate, efficient, and complete.
Parameters
- `model` — the model configuration used to run the evaluation
- `expectedTrajectory?` — optional expected trajectory; when omitted, the scorer evaluates against the task requirements instead
Features
The LLM-based scorer provides:
- Task-aware evaluation: Assesses whether each step was necessary given the user's request
- Ordering assessment: Evaluates whether steps were taken in a logical order
- Missing step detection: Identifies steps that should have been taken
- Redundancy detection: Flags unnecessary or repeated steps
- Reasoning generation: Provides human-readable explanations for scoring decisions
Evaluation process
- Receive trajectory: Gets a pre-extracted `Trajectory` object from the pipeline
- Analyze steps: Evaluates each step for necessity and ordering using the LLM
- Generate score: Calculates score weighted as 60% necessity, 30% ordering, minus 10% missing penalty
- Generate reasoning: Provides a human-readable explanation
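As a rough illustration of that weighting, here is one plausible reading in TypeScript. The exact formula (including how the missing-step penalty scales) is internal to the scorer; this sketch treats the remaining 10% as a completeness share that is forfeited when any expected step is missing:

```typescript
// Illustrative arithmetic for the weighted score described above:
// 60% step necessity, 30% ordering, and a 10% completeness share
// forfeited when any expected step is missing. One plausible reading
// of the weighting, not the scorer's actual implementation.
interface StepEvaluation {
  wasNecessary: boolean
  wasInOrder: boolean
}

function weightedScore(stepEvaluations: StepEvaluation[], missingSteps: string[]): number {
  if (stepEvaluations.length === 0) return 0
  const necessity = stepEvaluations.filter((e) => e.wasNecessary).length / stepEvaluations.length
  const ordering = stepEvaluations.filter((e) => e.wasInOrder).length / stepEvaluations.length
  const completeness = missingSteps.length === 0 ? 0.1 : 0
  return 0.6 * necessity + 0.3 * ordering + completeness
}
```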
LLM-based scoring details
- Fractional scores: Returns values between 0.0 and 1.0
- Context-aware: Considers user intent and task requirements
- Explanatory: Provides reasoning for scores
- Flexible: Works with or without an expected trajectory
LLM-based scorer options
```typescript
// Evaluate based on task requirements (no expected trajectory)
const openScorer = createTrajectoryAccuracyScorerLLM({
  model: { provider: 'openai', name: 'gpt-5.4' },
})

// Evaluate against a static expected trajectory
const guidedScorer = createTrajectoryAccuracyScorerLLM({
  model: { provider: 'openai', name: 'gpt-5.4' },
  expectedTrajectory: {
    steps: [
      { stepType: 'tool_call', name: 'search-tool' },
      { stepType: 'tool_call', name: 'summarize-tool' },
    ],
  },
})
```
LLM-based scorer results
```typescript
{
  runId: string,
  preprocessStepResult: {
    actualTrajectory: Trajectory,
    actualTrajectoryFormatted: string,
    expectedTrajectoryFormatted?: string,
    hasSteps: boolean
  },
  analyzeStepResult: {
    stepEvaluations: Array<{
      stepName: string,
      wasNecessary: boolean,
      wasInOrder: boolean,
      reasoning: string
    }>,
    missingSteps?: string[],
    extraSteps?: string[],
    overallAssessment: string
  },
  score: number,
  reason: string
}
```
Unified trajectory scorer
The createTrajectoryScorerCode() function from @mastra/evals/scorers/prebuilt provides a multi-dimensional trajectory evaluation that checks accuracy, efficiency, blacklisted tools, and tool failure patterns in a single pass.
Parameters
- `defaults?` — default trajectory expectations (steps, ordering, budgets, blacklists) applied to every dataset item
- `weights?` — custom weights for the four scoring dimensions
Scoring behavior
The unified scorer evaluates four dimensions:
- Accuracy — Matches actual steps against expected steps (if `steps` is configured). Uses the `ordering` mode.
- Efficiency — Checks step budgets (`maxSteps`, `maxTotalTokens`, `maxTotalDurationMs`) and redundant calls (`noRedundantCalls`).
- Blacklist — Checks for forbidden tools or sequences. Any violation immediately results in a score of 0.0 regardless of other dimensions.
- Tool failures — Detects retry patterns, fallback patterns, and argument correction patterns.
The final score is a weighted combination of active dimensions, normalized by which dimensions are active. Default weights are accuracy 0.4, efficiency 0.3, tool failures 0.2, blacklist 0.1, but you can customize them via the weights option. Blacklist violations override everything to 0. When nested evaluations are present, the score is 70% top-level and 30% nested average.
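As an illustrative sketch of that weighted combination (not the library's actual implementation), renormalizing the default weights over whichever dimensions are active looks like this:

```typescript
// Illustrative sketch of combining dimension scores with the default
// weights. Only active dimensions contribute, and their weights are
// renormalized; blacklist violations override everything to 0.
const DEFAULT_WEIGHTS = { accuracy: 0.4, efficiency: 0.3, toolFailures: 0.2, blacklist: 0.1 }

type Dimension = keyof typeof DEFAULT_WEIGHTS

function combineScores(dims: Partial<Record<Dimension, number>>, blacklistViolated: boolean): number {
  if (blacklistViolated) return 0 // any blacklist violation forces 0.0
  let weightSum = 0
  let total = 0
  for (const [name, weight] of Object.entries(DEFAULT_WEIGHTS)) {
    const value = dims[name as Dimension]
    if (value === undefined) continue // inactive dimension: skip it
    weightSum += weight
    total += weight * value
  }
  return weightSum > 0 ? total / weightSum : 0
}
```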
Unified scorer results
```typescript
{
  runId: string,
  preprocessStepResult: {
    accuracy?: TrajectoryComparisonResult,
    efficiency?: TrajectoryEfficiencyResult,
    blacklist?: TrajectoryBlacklistResult,
    toolFailures?: ToolFailureAnalysisResult,
    nested?: NestedEvaluationResult[],
  },
  score: number,
  reason: string
}
```
Per-item expectations
Each dataset item can override the defaults with its own expectedTrajectory. This lets you vary expectations per prompt:
```typescript
import { createTrajectoryScorerCode } from '@mastra/evals/scorers/prebuilt'
import { runEvals } from '@mastra/core/evals'

// Default blacklist applies to all items
const scorer = createTrajectoryScorerCode({
  defaults: {
    blacklistedTools: ['deleteAll'],
    maxSteps: 5,
  },
})

const result = await runEvals({
  target: myAgent,
  scorers: { trajectory: [scorer] },
  data: [
    {
      input: 'Search for weather',
      expectedTrajectory: {
        steps: [{ stepType: 'tool_call', name: 'search' }],
        maxSteps: 2,
      },
    },
    {
      input: 'Search and summarize',
      expectedTrajectory: {
        steps: [
          { stepType: 'tool_call', name: 'search' },
          { stepType: 'tool_call', name: 'summarize' },
        ],
      },
    },
  ],
})
```
Example: efficiency and blacklist
```typescript
import { createTrajectoryScorerCode } from '@mastra/evals/scorers/prebuilt'

const scorer = createTrajectoryScorerCode({
  defaults: {
    blacklistedTools: ['escalate', 'admin-override'],
    blacklistedSequences: [['escalate', 'admin-override']],
    maxSteps: 10,
    noRedundantCalls: true,
    maxRetriesPerTool: 2,
  },
  // Customize how dimensions contribute to the final score
  weights: {
    accuracy: 0.5, // prioritize step accuracy
    efficiency: 0.3,
    toolFailures: 0.1,
    blacklist: 0.1,
  },
})
```
Using trajectory scorers with runEvals
Trajectory scorers are configured under the trajectory key in the scorer config. The runEvals pipeline handles trajectory extraction automatically.
Agent trajectory evaluation
```typescript
import { runEvals } from '@mastra/core/evals'
import { createTrajectoryAccuracyScorerCode } from '@mastra/evals/scorers/prebuilt'

const trajectoryScorer = createTrajectoryAccuracyScorerCode({
  expectedTrajectory: {
    steps: [
      { stepType: 'tool_call', name: 'search' },
      { stepType: 'tool_call', name: 'format' },
    ],
  },
})

const result = await runEvals({
  target: myAgent,
  scorers: {
    agent: [qualityScorer], // receives raw MastraDBMessage[] output
    trajectory: [trajectoryScorer], // receives pre-extracted Trajectory
  },
  data: [{ input: 'Find and format the data' }],
})

// result.scores.agent['quality'] — agent-level score
// result.scores.trajectory['trajectory-accuracy'] — trajectory score
```
Workflow trajectory evaluation
```typescript
import { runEvals } from '@mastra/core/evals'
import { createTrajectoryAccuracyScorerCode } from '@mastra/evals/scorers/prebuilt'

const workflowTrajectoryScorer = createTrajectoryAccuracyScorerCode({
  expectedTrajectory: {
    steps: [
      { stepType: 'workflow_step', name: 'validate' },
      { stepType: 'workflow_step', name: 'process' },
      { stepType: 'workflow_step', name: 'notify' },
    ],
  },
})

const result = await runEvals({
  target: myWorkflow,
  scorers: {
    workflow: [outputScorer], // receives workflow output
    trajectory: [workflowTrajectoryScorer], // receives pre-extracted Trajectory from step results
  },
  data: [{ input: { userId: '123' } }],
})

// result.scores.workflow['output-quality'] — workflow-level score
// result.scores.trajectory['trajectory-accuracy'] — trajectory score
```
Related
- runEvals reference — Pipeline that extracts trajectories and passes them to scorers
- MastraScorer reference — Base scorer interface
- Scorer utils — Utility functions including `extractTrajectory` and `compareTrajectories`