Built-in Scorers

Mastra provides a comprehensive set of built-in scorers for evaluating AI outputs. These scorers are optimized for common evaluation scenarios and are ready to use in your agents and workflows.

To create your own scorers, please see the Custom Scorers documentation.

Available scorersDirect link to Available scorers

Accuracy and reliabilityDirect link to Accuracy and reliability

These scorers evaluate how correct, truthful, and complete your agent's answers are:

answer-relevancy: Evaluates how well responses address the input query (0-1, higher is better)
answer-similarity: Compares agent outputs against ground truth answers for CI/CD testing using semantic analysis (0-1, higher is better)
faithfulness: Measures how accurately responses represent provided context (0-1, higher is better)
hallucination: Detects factual contradictions and unsupported claims (0-1, lower is better)
completeness: Checks if responses include all necessary information (0-1, higher is better)
content-similarity: Measures textual similarity using character-level matching (0-1, higher is better)
textual-difference: Measures textual differences between strings (0-1, higher means more similar)
tool-call-accuracy: Evaluates whether the LLM selects the correct tool from available options (0-1, higher is better)
prompt-alignment: Measures how well agent responses align with user prompt intent, requirements, completeness, and format (0-1, higher is better)

Context qualityDirect link to Context quality

These scorers evaluate the quality and relevance of context used in generating responses:

context-precision: Evaluates context relevance and ranking using Mean Average Precision, rewarding early placement of relevant context (0-1, higher is better)
context-relevance: Measures context utility with nuanced relevance levels, usage tracking, and missing context detection (0-1, higher is better)

tip Context Scorer Selection

Use Context Precision when context ordering matters and you need standard IR metrics (ideal for RAG ranking evaluation)
Use Context Relevance when you need detailed relevance assessment and want to track context usage and identify gaps

Both context scorers support:

Static context: Pre-defined context arrays
Dynamic context extraction: Extract context from runs using custom functions (ideal for RAG systems, vector databases, etc.)

Output qualityDirect link to Output quality

These scorers evaluate adherence to format, style, and safety requirements:

tone-consistency: Measures consistency in formality, complexity, and style (0-1, higher is better)
toxicity: Detects harmful or inappropriate content (0-1, lower is better)
bias: Detects potential biases in the output (0-1, lower is better)
keyword-coverage: Assesses technical terminology usage (0-1, higher is better)