Faithfulness Scorer

The createFaithfulnessScorer() function evaluates how factually accurate an LLM's output is relative to the provided context. It extracts claims from the output and verifies them against that context, making it useful for measuring the reliability of RAG pipeline responses.

Parameters

The createFaithfulnessScorer() function accepts a single options object with the following properties:

model: LanguageModel
Configuration for the model used to evaluate faithfulness.

context: string[]
Array of context chunks against which the output's claims will be verified.

scale: number (default: 1)
The maximum score value. The final score will be normalized to this scale.

This function returns an instance of the MastraScorer class. The .run() method accepts the same input as other scorers (see the MastraScorer reference), but the return value includes LLM-specific fields as documented below.
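
A minimal construction sketch using only the options documented above; the model identifier matches the example later on this page, and the context strings are illustrative:

import { createFaithfulnessScorer } from "@mastra/evals/scorers/prebuilt";

const scorer = createFaithfulnessScorer({
  model: "openai/gpt-4o",
  context: [
    "The Tesla Model 3 is a battery-electric sedan.",
    "Customer deliveries began in 2017.",
  ],
  scale: 1,
});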

.run() Returns

runId: string
The id of the run (optional).

preprocessStepResult: string[]
Array of claims extracted from the output.

preprocessPrompt: string
The prompt sent to the LLM for the preprocess step (optional).

analyzeStepResult: object
Object with verdicts: { verdicts: Array<{ verdict: 'yes' | 'no' | 'unsure', reason: string }> }

analyzePrompt: string
The prompt sent to the LLM for the analyze step (optional).

score: number
A score between 0 and the configured scale, representing the proportion of claims that are supported by the context.

reason: string
A detailed explanation of the score, including which claims were supported, contradicted, or marked as unsure.

generateReasonPrompt: string
The prompt sent to the LLM for the generateReason step (optional).
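
A sketch of reading these fields from a run result; it assumes result was already returned by the scorer's .run() method (see the MastraScorer reference for the exact call payload):

// Assumes: const result = await scorer.run(/* payload per the MastraScorer reference */);
const { score, reason, preprocessStepResult, analyzeStepResult } = result;

console.log(score);                // e.g. 0.75 on the default scale
console.log(reason);               // explanation produced in the generateReason step
console.log(preprocessStepResult); // claims extracted from the output

// Each verdict pairs a claim judgment with the evaluator's reasoning
for (const { verdict, reason: verdictReason } of analyzeStepResult.verdicts) {
  console.log(verdict, verdictReason);
}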

Scoring Details

The scorer evaluates faithfulness through claim verification against provided context.

Scoring Process

  1. Analyzes claims and context:
    • Extracts all claims (factual and speculative)
    • Verifies each claim against context
    • Assigns one of three verdicts:
      • "yes" - claim supported by context
      • "no" - claim contradicts context
      • "unsure" - claim unverifiable
  2. Calculates faithfulness score:
    • Counts supported claims
    • Divides by total claims
    • Scales to configured range

Final score: (supported_claims / total_claims) * scale
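
As a concrete arithmetic check, the sketch below mirrors this formula outside the scorer; faithfulnessScore is a hypothetical helper, not part of the library's API:

// Hypothetical helper reproducing (supported_claims / total_claims) * scale.
// Returning 0 for zero claims is an assumption; the scorer may handle that case differently.
function faithfulnessScore(
  verdicts: Array<"yes" | "no" | "unsure">,
  scale = 1,
): number {
  if (verdicts.length === 0) return 0;
  const supported = verdicts.filter((v) => v === "yes").length;
  return (supported / verdicts.length) * scale;
}

// 3 of 4 claims supported => 0.75 with the default scale of 1
console.log(faithfulnessScore(["yes", "yes", "no", "yes"]));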

Score interpretation

With the default scale of 1, the faithfulness score falls between 0 and 1 and can be read as follows:

  • 1.0: All claims are accurate and directly supported by the context.
  • 0.7–0.9: Most claims are correct, with minor additions or omissions.
  • 0.4–0.6: Some claims are supported, but others are unverifiable.
  • 0.1–0.3: Most of the content is inaccurate or unsupported.
  • 0.0: All claims are false or contradict the context.
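
One way to act on the score is to gate low-faithfulness responses, sketched below with an arbitrary threshold of 0.7 and a result assumed to come from .run():

// Illustrative threshold; tune it for your own pipeline.
const MIN_FAITHFULNESS = 0.7;

if (result.score < MIN_FAITHFULNESS) {
  console.warn(`Low faithfulness (${result.score}): ${result.reason}`);
}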

Example

Evaluate agent responses for faithfulness to provided context:

src/example-faithfulness.ts
import { runEvals } from "@mastra/core/evals";
import { createFaithfulnessScorer } from "@mastra/evals/scorers/prebuilt";
import { myAgent } from "./agent";

// Context is typically populated from agent tool calls or RAG retrieval
const scorer = createFaithfulnessScorer({
  model: "openai/gpt-4o",
});

const result = await runEvals({
  data: [
    {
      input: "Tell me about the Tesla Model 3.",
    },
    {
      input: "What are the key features of this electric vehicle?",
    },
  ],
  scorers: [scorer],
  target: myAgent,
  onItemComplete: ({ scorerResults }) => {
    console.log({
      score: scorerResults[scorer.id].score,
      reason: scorerResults[scorer.id].reason,
    });
  },
});

console.log(result.scores);

For more details on runEvals, see the runEvals reference.

To add this scorer to an agent, see the Scorers overview guide.