Using Evaluation Metrics in Mastra
Evaluation metrics help measure and validate AI model outputs across different dimensions. Here's a comprehensive guide to Mastra's evaluation suite. All of these metrics output a score between 0 and 1 by default; metrics that accept a scale option report their score on that range instead.
We are starting with some NLP-based metrics, and will add additional LLM-as-judge metrics soon.
Core Evaluation Metrics
Model Configuration
First, set up the model configuration used by the LLM-based metrics:
import { ModelConfig } from "@mastra/core";
const model: ModelConfig = {
  provider: "OPEN_AI",
  model: "gpt-4",
  apiKey: process.env.OPENAI_API_KEY,
};
Answer Relevancy
Answer relevancy evaluates if responses address queries appropriately:
import { AnswerRelevancyMetric } from "@mastra/evals/llm";
const metric = new AnswerRelevancyMetric(model, {
  uncertaintyWeight: 0.3,
  scale: 10,
});
const result = await metric.measure({
  input: "What is the capital of France?",
  output: "Paris is the capital of France.",
});
This metric uses an LLM to judge how well responses address queries, scoring yes/no/unsure verdicts with uncertainty weighting. The score is calculated as (relevancyCount / totalVerdicts) * scale, where "unsure" verdicts count as the uncertaintyWeight (0.3 in the example above).
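As a hypothetical illustration of that formula (not library code), three verdicts of yes, unsure, and no with the configuration above would score like this:
// Hypothetical illustration of the answer relevancy formula.
// "yes" counts as 1, "unsure" as the uncertaintyWeight (0.3), "no" as 0.
const verdicts = ["yes", "unsure", "no"];
const uncertaintyWeight = 0.3;
const scale = 10;
const relevancyCount = verdicts.reduce(
  (sum, v) => sum + (v === "yes" ? 1 : v === "unsure" ? uncertaintyWeight : 0),
  0,
); // 1 + 0.3 + 0 = 1.3
const score = (relevancyCount / verdicts.length) * scale; // (1.3 / 3) * 10 ≈ 4.33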
Completeness
Completeness measures how thoroughly a response covers the key elements from the input:
import { CompletenessMetric } from "@mastra/evals/nlp";
const metric = new CompletenessMetric();
const result = await metric.measure({
  input: "Explain the water cycle: evaporation, condensation, precipitation",
  output:
    "Water evaporates from surfaces, forms clouds through condensation, and returns as precipitation",
});
Specifically, this metric extracts and compares key elements (nouns, verbs, topics) between input and output using NLP. Score represents the ratio of matched elements to total elements, with intelligent partial matching for longer words.
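The scoring itself is a plain ratio. Here is a rough sketch with hypothetical element lists; the actual extraction and partial matching are handled by the library:
// Rough sketch of the completeness ratio (element lists are hypothetical).
const inputElements = ["water", "cycle", "evaporation", "condensation", "precipitation"];
const matchedElements = ["water", "condensation", "precipitation"]; // elements found in the output
const score = matchedElements.length / inputElements.length; // 3 / 5 = 0.6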
Content Similarity
This metric performs direct string comparison using a similarity library, with configurable case and whitespace sensitivity. Perfect matches score 1.0, with decreasing scores based on string differences.
import { ContentSimilarityMetric } from "@mastra/evals/nlp";
const metric = new ContentSimilarityMetric({
  ignoreCase: true,
  ignoreWhitespace: true,
});
const result = await metric.measure({
  input: "The quick brown fox",
  output: "the Quick Brown fox",
});
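With ignoreCase and ignoreWhitespace enabled, both strings are normalized before comparison, so the example above would score 1.0. A minimal sketch of the normalization idea (not the library's internals):
// Sketch of the normalization implied by the options above.
const normalize = (s: string) => s.toLowerCase().replace(/\s+/g, " ").trim();
const isMatch = normalize("The quick brown fox") === normalize("the Quick Brown fox"); // true → perfect match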
Context Position
Context position evaluates how well the model uses ordered context. Earlier positions are weighted more heavily (weight = 1/position).
The final score is the ratio of weighted relevant items to maximum possible weighted sum.
import { ContextPositionMetric } from "@mastra/evals/llm";
const metric = new ContextPositionMetric(model, {
  scale: 10,
});
const result = await metric.measure({
  input: "Summarize the events",
  output: "First came A, then B, finally C",
  context: [
    "A occurred in the morning",
    "B happened at noon",
    "C took place in the evening",
  ],
});
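As a worked example of the position weighting (hypothetical relevance verdicts, not library output): if the first and third context items are judged relevant, the calculation looks like this:
// Hypothetical worked example of the 1/position weighting described above.
const relevance = [true, false, true]; // which context items were judged relevant
const weights = relevance.map((_, i) => 1 / (i + 1)); // [1, 0.5, 0.333...]
const weightedRelevant = relevance.reduce(
  (sum, isRelevant, i) => sum + (isRelevant ? weights[i] : 0),
  0,
); // 1.333...
const maxPossible = weights.reduce((a, b) => a + b, 0); // 1.833...
const score = (weightedRelevant / maxPossible) * 10; // ≈ 7.27 on a scale of 10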
Context Precision
Context precision measures accurate use of provided context by calculating precision at each relevant position in the response. The score is the normalized sum of precision values at relevant positions, divided by the number of relevant items.
import { ContextPrecisionMetric } from "@mastra/evals/llm";
const metric = new ContextPrecisionMetric(model, {
  scale: 10,
});
const result = await metric.measure({
  input: "What did the research find?",
  output: "The study found significant improvements",
  context: [
    "Research showed 45% improvement",
    "Results were statistically significant",
  ],
});
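The precision-at-position calculation works roughly like this (hypothetical relevance verdicts; in practice they come from the LLM judge):
// Hypothetical illustration of precision at each relevant position.
const relevant = [true, false, true]; // verdict per context item, in order
let relevantSoFar = 0;
let precisionSum = 0;
relevant.forEach((isRelevant, i) => {
  if (isRelevant) {
    relevantSoFar++;
    precisionSum += relevantSoFar / (i + 1); // precision at this position
  }
});
const totalRelevant = relevant.filter(Boolean).length; // 2
const score = (precisionSum / totalRelevant) * 10; // (1/1 + 2/3) / 2 * 10 ≈ 8.33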
Difference
The textual difference metric calculates detailed differences between the input and output texts:
import { TextualDifferenceMetric } from "@mastra/evals/nlp";
const metric = new TextualDifferenceMetric();
const result = await metric.measure({
  input: "Original text version",
  output: "Modified text version",
});
// Provides ratio, number of changes, and length differences
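The similarity ratio can be thought of as a sequence-matcher style ratio, 2 × matches / total characters. A rough sketch with hypothetical numbers (the library's exact algorithm may differ):
// Rough sketch of a sequence-matcher style ratio (numbers are hypothetical).
const matchingChars = 15;   // characters the two strings have in common
const totalChars = 21 + 21; // combined length of input and output
const ratio = (2 * matchingChars) / totalChars; // ≈ 0.71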
Keyword Coverage
This metric extracts and compares keywords between input and output using a keyword extraction library. The score is simply the number of matched keywords divided by the total keywords from the input.
import { KeywordCoverageMetric } from "@mastra/evals/nlp";
const metric = new KeywordCoverageMetric();
const result = await metric.measure({
  input: "Explain photosynthesis: chlorophyll, sunlight, glucose",
  output: "Plants use chlorophyll to convert sunlight into glucose",
});
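The scoring is a plain ratio. A sketch with hypothetical keyword sets (the actual keyword extraction is done by the library):
// Sketch of the keyword coverage ratio with hypothetical keyword sets.
const inputKeywords = new Set(["photosynthesis", "chlorophyll", "sunlight", "glucose"]);
const outputKeywords = new Set(["plants", "chlorophyll", "sunlight", "glucose"]);
const matched = [...inputKeywords].filter((k) => outputKeywords.has(k)).length; // 3
const score = matched / inputKeywords.size; // 3 / 4 = 0.75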
Prompt Alignment
This metric uses an LLM to check adherence to specific instructions with binary yes/no scoring. The final score is the ratio of followed instructions to total instructions, scaled to the configured range.
import { PromptAlignmentMetric } from "@mastra/evals/llm";
const metric = new PromptAlignmentMetric(model, {
  instructions: [
    "Use formal language",
    "Include specific examples",
    "Stay under 100 words",
  ],
  scale: 10,
});
const result = await metric.measure({
  input: "Describe quantum computing",
  output: "Quantum computing uses quantum bits...",
});
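With three instructions and a scale of 10, the scoring looks like this (the yes/no verdicts here are hypothetical):
// Hypothetical illustration of the prompt alignment ratio.
const followedInstruction = [true, true, false]; // one verdict per instruction
const followed = followedInstruction.filter(Boolean).length; // 2
const score = (followed / followedInstruction.length) * 10; // (2 / 3) * 10 ≈ 6.67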
Tone Consistency
This metric analyzes sentiment consistency either between input and output or within the sentences of a single text. The score is based on sentiment difference or variance, with 1.0 indicating perfect consistency.
import { ToneConsistencyMetric } from "@mastra/evals/nlp";
const metric = new ToneConsistencyMetric();
const result = await metric.measure({
  input: "Write a positive product review",
  output: "This product exceeded my expectations!",
});
// Measures sentiment stability and alignment
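Conceptually, the closer the two sentiment values are, the higher the score. A rough sketch with hypothetical sentiment values (not the library's sentiment analysis):
// Rough sketch: closer sentiment values give a higher consistency score.
const inputSentiment = 0.8;  // hypothetical sentiment of the input
const outputSentiment = 0.7; // hypothetical sentiment of the output
const score = 1 - Math.abs(inputSentiment - outputSentiment); // 0.9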
Combining Metrics
Combine metrics for a more comprehensive assessment:
import { ModelConfig } from "@mastra/core";
import {
  AnswerRelevancyMetric,
  ContextPrecisionMetric,
  PromptAlignmentMetric,
} from "@mastra/evals/llm";
import {
  CompletenessMetric,
  ContentSimilarityMetric,
  TextualDifferenceMetric,
  KeywordCoverageMetric,
  ToneConsistencyMetric,
} from "@mastra/evals/nlp";
async function evaluateResponse({
  input,
  output,
  context,
  instructions,
}: {
  input: string;
  output: string;
  context?: string[];
  instructions?: string[];
}) {
  const model: ModelConfig = {
    provider: "OPEN_AI",
    model: "gpt-4",
    apiKey: process.env.OPENAI_API_KEY,
  };
  const metrics = [
    new AnswerRelevancyMetric(model),
    new CompletenessMetric(),
    new ContentSimilarityMetric(),
    new ContextPrecisionMetric(model),
    new TextualDifferenceMetric(),
    new KeywordCoverageMetric(),
    new PromptAlignmentMetric(model, { instructions: instructions || [] }),
    new ToneConsistencyMetric(),
  ];
  const results = await Promise.all(
    metrics.map((metric) =>
      metric.measure({
        input,
        output,
        context,
      }),
    ),
  );
  return results;
}
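A minimal usage sketch of the helper above (inputs are illustrative):
// Illustrative call to the evaluateResponse helper defined above.
const scores = await evaluateResponse({
  input: "Describe quantum computing",
  output: "Quantum computing uses quantum bits to perform parallel computations...",
  context: ["Qubits can represent multiple states simultaneously"],
  instructions: ["Use formal language", "Stay under 100 words"],
});
scores.forEach((result, i) => console.log(`Metric ${i}: ${result.score}`));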