Contextual Recall Evaluation
We just released Scorers, a new evals API with a more ergonomic interface, more metadata stored for error analysis, and more flexibility in the data structures it can evaluate. Migrating is fairly simple, and we will continue to support the existing Evals API.
Use ContextualRecallMetric to evaluate how well the response incorporates relevant information from the provided context. The metric accepts a query and a response, and returns a score and an info object containing a reason.
Installation
npm install @mastra/evals
High recall example
In this example, the response includes all the information from the context. Every element is accurately recalled and expressed in the output, resulting in a perfect recall score.
import { openai } from "@ai-sdk/openai";
import { ContextualRecallMetric } from "@mastra/evals/llm";
const metric = new ContextualRecallMetric(openai("gpt-4o-mini"), {
  context: [
    "Product features include cloud sync.",
    "Offline mode is available.",
    "Supports multiple devices."
  ]
});
const query = "What are the key features of the product?";
const response = "The product features cloud synchronization, offline mode support, and the ability to work across multiple devices.";
const result = await metric.measure(query, response);
console.log(result);
High recall output
The output receives a perfect score because all context elements are present in the response. Each feature mentioned in the context is accurately recalled and integrated, with no missing or extraneous information.
{
  score: 1,
  info: {
    reason: 'The score is 1 because all elements of the expected output are fully supported by the corresponding nodes in retrieval context, specifically node(s) that detail cloud synchronization, offline mode support, and multi-device functionality.'
  }
}
Mixed recall example
In this example, the response recalls only one of the context statements and also introduces unrelated content. The unmentioned context elements reduce the overall recall score.
import { openai } from "@ai-sdk/openai";
import { ContextualRecallMetric } from "@mastra/evals/llm";
const metric = new ContextualRecallMetric(openai("gpt-4o-mini"), {
  context: [
    "Python is a high-level programming language.",
    "Python emphasizes code readability.",
    "Python supports multiple programming paradigms.",
    "Python is widely used in data science."
  ]
});
const query = "What are Python's key characteristics?";
const response = "Python is a high-level programming language. It is also a type of snake.";
const result = await metric.measure(query, response);
console.log(result);
Mixed recall output
The output receives a score of 0.25 because only one of the four context statements is recalled; the remaining statements go unmentioned, and the unrelated claim about snakes is unsupported by the context.
{
  score: 0.25,
  info: {
    reason: "The score is 0.25 because while the sentence 'Python is a high-level programming language' aligns with node 1 in the retrieval context, the lack of mention of other relevant information from nodes 2, 3, and 4 indicates significant gaps in the overall context."
  }
}
Low recall example
In this example, the response uses none of the relevant context. The information it contains is unsupported by the provided statements, resulting in the lowest possible recall score.
import { openai } from "@ai-sdk/openai";
import { ContextualRecallMetric } from "@mastra/evals/llm";
const metric = new ContextualRecallMetric(openai("gpt-4o-mini"), {
  context: [
    "The solar system has eight planets.",
    "Mercury is closest to the Sun.",
    "Venus is the hottest planet.",
    "Mars is called the Red Planet."
  ]
});
const query = "Tell me about the solar system.";
const response = "Jupiter is the largest planet in the solar system.";
const result = await metric.measure(query, response);
console.log(result);
Low recall output
The output receives a score of 0 because the response introduces information that is absent from the context and ignores every detail that was provided. None of the context items are incorporated into the answer.
{
  score: 0,
  info: {
    reason: "The score is 0 because the output lacks any relevant information from the node(s) in retrieval context, failing to address key aspects such as the number of planets, Mercury's position, Venus's temperature, and Mars's nickname."
  }
}
Metric configuration
You can create a ContextualRecallMetric instance by providing a context array of background information relevant to the response. You can also configure optional parameters such as scale to define the scoring range.
const metric = new ContextualRecallMetric(openai("gpt-4o-mini"), {
  context: [""],
  scale: 1 // optional; maximum score (default: 1)
});
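For example, assuming scale stretches the final score so that results fall between 0 and scale, setting it to 10 would report perfect recall as 10:

// Sketch: assumes scale maps the 0-1 score onto a 0-10 range
const scaledMetric = new ContextualRecallMetric(openai("gpt-4o-mini"), {
  context: ["Product features include cloud sync."],
  scale: 10
});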
See ContextualRecallMetric for a full list of configuration options.
Understanding the results
ContextualRecallMetric returns a result in the following shape:
{
  score: number,
  info: {
    reason: string
  }
}
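As a usage sketch, building on the examples above, you can read the fields directly from the result returned by measure:

const { score, info } = await metric.measure(query, response);
console.log(score);       // e.g. 1
console.log(info.reason); // explanation of how the score was derived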
Recall score
A recall score between 0 and 1 (a helper for bucketing scores into these bands is sketched after the list):
- 1.0: Perfect recall – all context information used.
- 0.7–0.9: High recall – most context information used.
- 0.4–0.6: Mixed recall – some context information used.
- 0.1–0.3: Low recall – little context information used.
- 0.0: No recall – no context information used.
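As an illustration, here is a small helper (hypothetical, not part of @mastra/evals) that maps a score onto the bands above:

// Hypothetical helper; assumes the default scale of 1
function recallBand(score: number): string {
  if (score === 1) return "perfect recall";
  if (score >= 0.7) return "high recall";
  if (score >= 0.4) return "mixed recall";
  if (score >= 0.1) return "low recall";
  return "no recall";
}

recallBand(result.score); // e.g. "low recall" for 0.25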
Recall info
An explanation for the score, with details including:
- Information incorporation.
- Missing context.
- Response completeness.
- Overall recall quality.