Textual Evals
Textual evals use an LLM-as-judge methodology to evaluate agent outputs. This approach leverages language models to assess various aspects of text quality, similar to how a teaching assistant might grade assignments using a rubric.
Each eval focuses on specific quality aspects and returns a score between 0 and 1, providing quantifiable metrics for non-deterministic AI outputs.
Mastra provides several built-in eval metrics for assessing agent outputs, but you are not limited to these: you can also define your own evals.
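To make the scoring contract concrete, here is a minimal sketch of a custom eval metric. The `MetricResult` and `EvalMetric` shapes and the class name below are illustrative assumptions for this sketch, not Mastra's exact API; the essential contract is an async `measure` call that returns a score between 0 and 1.

```typescript
// Illustrative shape of an eval metric: the interface and names are
// assumptions for this sketch, not Mastra's exact API.
interface MetricResult {
  score: number; // normalized to the 0..1 range
}

interface EvalMetric {
  measure(input: string, output: string): Promise<MetricResult>;
}

// A toy custom metric: scores 1 when the output is non-empty and within
// a maximum length, scaling down proportionally as it overruns.
class LengthLimitMetric implements EvalMetric {
  constructor(private maxLength: number) {}

  async measure(_input: string, output: string): Promise<MetricResult> {
    if (output.length === 0) return { score: 0 };
    return { score: Math.min(1, this.maxLength / output.length) };
  }
}
```

Because a custom metric reports on the same 0-to-1 scale as the built-in evals, its results stay directly comparable across metrics.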
Why Use Textual Evals?
Textual evals help ensure your agent:
- Produces accurate and reliable responses
- Uses context effectively
- Follows output requirements
- Maintains consistent quality over time
Available Metrics
Accuracy and Reliability
These metrics evaluate how correct, truthful, and complete your agent’s answers are:
- `hallucination`: Detects facts or claims not present in provided context
- `faithfulness`: Measures how accurately responses represent provided context
- `content-similarity`: Evaluates consistency of information across different phrasings
- `completeness`: Checks if responses include all necessary information
- `answer-relevancy`: Assesses how well responses address the original query
- `textual-difference`: Measures textual differences between strings
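Of the metrics above, textual difference is the one that needs no LLM judge at all: a normalized edit distance already yields a 0-to-1 score. The sketch below is a simplified stand-in, not Mastra's implementation.

```typescript
// Levenshtein edit distance via single-row dynamic programming.
function editDistance(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0]; // holds dp[i-1][j-1]
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j]; // dp[i-1][j], needed as prev on the next step
      dp[j] = Math.min(
        dp[j] + 1, // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// Similarity score in [0, 1]: 1 means the strings are identical.
function textualSimilarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - editDistance(a, b) / longest;
}
```

Normalizing by the longer string's length keeps the score in the same 0-to-1 range as the judge-based metrics.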
Understanding Context
These metrics evaluate how well your agent uses provided context:
- `context-position`: Analyzes where context appears in responses
- `context-precision`: Evaluates whether context chunks are grouped logically
- `context-relevancy`: Measures use of appropriate context pieces
- `contextual-recall`: Assesses completeness of context usage
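In the LLM-as-judge setup, a context metric like contextual recall is typically computed in two stages: the judge returns a verdict per context chunk, and those verdicts are aggregated into a single score. The verdict shape and aggregation below are illustrative assumptions, with hard-coded verdicts standing in for real judge calls.

```typescript
// One verdict per provided context chunk: did the response use it?
// In practice, each verdict would come from an LLM judge.
interface ChunkVerdict {
  chunk: string;
  used: boolean;
}

// Recall score: fraction of context chunks reflected in the output.
function contextualRecallScore(verdicts: ChunkVerdict[]): number {
  if (verdicts.length === 0) return 0;
  const used = verdicts.filter((v) => v.used).length;
  return used / verdicts.length;
}
```

Aggregating boolean verdicts into a ratio is what turns a fuzzy judgment into the quantifiable 0-to-1 metric described above.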
Output Quality
These metrics evaluate adherence to format and style requirements:
- `tone`: Measures consistency in formality, complexity, and style
- `toxicity`: Detects harmful or inappropriate content
- `bias`: Detects potential biases in the output
- `prompt-alignment`: Checks adherence to explicit instructions like length restrictions, formatting requirements, or other constraints
- `summarization`: Evaluates information retention and conciseness
- `keyword-coverage`: Assesses technical terminology usage
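Keyword coverage can be approximated without a judge as the fraction of expected terms that appear in the output. This is a simplified sketch of the idea, not Mastra's implementation.

```typescript
// Fraction of expected keywords present in the response (case-insensitive).
function keywordCoverage(response: string, keywords: string[]): number {
  if (keywords.length === 0) return 1; // nothing required, full coverage
  const haystack = response.toLowerCase();
  const matched = keywords.filter((k) => haystack.includes(k.toLowerCase()));
  return matched.length / keywords.length;
}
```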