Faithfulness Scorer

The createFaithfulnessScorer() function evaluates how factually accurate an LLM's output is relative to the provided context. It extracts claims from the output and verifies them against that context, making it useful for measuring the reliability of RAG pipeline responses.

Parameters

The createFaithfulnessScorer() function accepts a single options object with the following properties:

model: LanguageModel
Configuration for the model used to evaluate faithfulness.

context: string[]
Array of context chunks against which the output's claims will be verified.

scale: number (default: 1)
The maximum score value. The final score will be normalized to this scale.

This function returns an instance of the MastraScorer class. The .run() method accepts the same input as other scorers (see the MastraScorer reference), but the return value includes LLM-specific fields as documented below.
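
A minimal construction sketch using only the options documented above; the model identifier matches the example later on this page, and the context strings are illustrative:

import { createFaithfulnessScorer } from "@mastra/evals/scorers/prebuilt";

const scorer = createFaithfulnessScorer({
  model: "openai/gpt-4o",
  context: [
    "The Tesla Model 3 is a battery-electric sedan.",
    "Customer deliveries began in 2017.",
  ],
  scale: 1,
});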

.run() Returns

runId: string
The id of the run (optional).

preprocessStepResult: string[]
Array of claims extracted from the output.

preprocessPrompt: string
The prompt sent to the LLM for the preprocess step (optional).

analyzeStepResult: object
Object with verdicts: { verdicts: Array<{ verdict: 'yes' | 'no' | 'unsure', reason: string }> }

analyzePrompt: string
The prompt sent to the LLM for the analyze step (optional).

score: number
A score between 0 and the configured scale, representing the proportion of claims that are supported by the context.

reason: string
A detailed explanation of the score, including which claims were supported, contradicted, or marked as unsure.

generateReasonPrompt: string
The prompt sent to the LLM for the generateReason step (optional).
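
A sketch of reading these fields from a run result; it assumes result was already returned by the scorer's .run() method (see the MastraScorer reference for the exact call payload):

// Assumes: const result = await scorer.run(/* payload per the MastraScorer reference */);
const { score, reason, preprocessStepResult, analyzeStepResult } = result;

console.log(score);                // e.g. 0.75 on the default scale
console.log(reason);               // explanation produced in the generateReason step
console.log(preprocessStepResult); // claims extracted from the output

// Each verdict pairs a claim judgment with the evaluator's reasoning
for (const { verdict, reason: verdictReason } of analyzeStepResult.verdicts) {
  console.log(verdict, verdictReason);
}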

Scoring Details

The scorer evaluates faithfulness through claim verification against provided context.

Scoring Process

  1. Analyzes claims and context:
    • Extracts all claims (factual and speculative)
    • Verifies each claim against context
    • Assigns one of three verdicts:
      • "yes" - claim supported by context
      • "no" - claim contradicts context
      • "unsure" - claim unverifiable
  2. Calculates faithfulness score:
    • Counts supported claims
    • Divides by total claims
    • Scales to configured range

Final score: (supported_claims / total_claims) * scale
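
As a concrete arithmetic check, the sketch below mirrors this formula outside the scorer; faithfulnessScore is a hypothetical helper, not part of the library's API:

// Hypothetical helper reproducing (supported_claims / total_claims) * scale.
// Returning 0 for zero claims is an assumption; the scorer may handle that case differently.
function faithfulnessScore(
  verdicts: Array<"yes" | "no" | "unsure">,
  scale = 1,
): number {
  if (verdicts.length === 0) return 0;
  const supported = verdicts.filter((v) => v === "yes").length;
  return (supported / verdicts.length) * scale;
}

// 3 of 4 claims supported => 0.75 with the default scale of 1
console.log(faithfulnessScore(["yes", "yes", "no", "yes"]));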

Score interpretation

With the default scale of 1, the faithfulness score falls between 0 and 1 and can be read as follows:

  • 1.0: All claims are accurate and directly supported by the context.
  • 0.7–0.9: Most claims are correct, with minor additions or omissions.
  • 0.4–0.6: Some claims are supported, but others are unverifiable.
  • 0.1–0.3: Most of the content is inaccurate or unsupported.
  • 0.0: All claims are false or contradict the context.
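
One way to act on the score is to gate low-faithfulness responses, sketched below with an arbitrary threshold of 0.7 and a result assumed to come from .run():

// Illustrative threshold; tune it for your own pipeline.
const MIN_FAITHFULNESS = 0.7;

if (result.score < MIN_FAITHFULNESS) {
  console.warn(`Low faithfulness (${result.score}): ${result.reason}`);
}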

Example

Evaluate agent responses for faithfulness to provided context:

src/example-faithfulness.ts
import { runEvals } from "@mastra/core/evals";
import { createFaithfulnessScorer } from "@mastra/evals/scorers/prebuilt";
import { myAgent } from "./agent";

// Context is typically populated from agent tool calls or RAG retrieval
const scorer = createFaithfulnessScorer({
  model: "openai/gpt-4o",
});

const result = await runEvals({
  data: [
    {
      input: "Tell me about the Tesla Model 3.",
    },
    {
      input: "What are the key features of this electric vehicle?",
    },
  ],
  scorers: [scorer],
  target: myAgent,
  onItemComplete: ({ scorerResults }) => {
    console.log({
      score: scorerResults[scorer.id].score,
      reason: scorerResults[scorer.id].reason,
    });
  },
});

console.log(result.scores);

For more details on runEvals, see the runEvals reference.

To add this scorer to an agent, see the Scorers overview guide.