
Faithfulness

This example demonstrates how to use Mastra’s Faithfulness metric to evaluate how accurately a response reflects the provided context, that is, whether each claim in the response is supported by that context.

Overview

The example shows how to:

  1. Configure the Faithfulness metric
  2. Evaluate factual accuracy
  3. Analyze faithfulness scores
  4. Handle different accuracy levels

Setup

Environment Setup

Make sure to set up your environment variables:

.env
OPENAI_API_KEY=your_api_key_here
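
A missing key usually shows up only when the first model call fails, so a small guard at startup can save some debugging. The sketch below is one possible way to do that. `requireEnv` is a hypothetical helper, not part of Mastra or the AI SDK, and it takes the environment as a parameter so it can be checked without touching `process.env`.

```typescript
// Hypothetical helper (not part of Mastra): fail fast when a required
// environment variable is missing or empty.
function requireEnv(
  env: Record<string, string | undefined>,
  name: string,
): string {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Typical call in a Node.js entry point:
//   requireEnv(process.env, 'OPENAI_API_KEY');
```

Calling it before constructing any metric turns a confusing authentication failure into an explicit error message.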

Dependencies

Import the necessary dependencies:

src/index.ts
import { openai } from '@ai-sdk/openai';
import { FaithfulnessMetric } from '@mastra/evals/llm';

Example Usage

High Faithfulness Example

Evaluate a response where all claims are supported by context:

src/index.ts
const context1 = [
  'The Tesla Model 3 was launched in 2017.',
  'It has a range of up to 358 miles.',
  'The base model accelerates 0-60 mph in 5.8 seconds.',
];

const metric1 = new FaithfulnessMetric(openai('gpt-4o-mini'), {
  context: context1,
});

const query1 = 'Tell me about the Tesla Model 3.';
const response1 =
  'The Tesla Model 3 was introduced in 2017. It can travel up to 358 miles on a single charge and the base version goes from 0 to 60 mph in 5.8 seconds.';

console.log('Example 1 - High Faithfulness:');
console.log('Context:', context1);
console.log('Query:', query1);
console.log('Response:', response1);

const result1 = await metric1.measure(query1, response1);
console.log('Metric Result:', {
  score: result1.score,
  reason: result1.info.reason,
});
// Example Output:
// Metric Result: { score: 1, reason: 'All claims are supported by the context.' }

Mixed Faithfulness Example

Evaluate a response with some unsupported claims:

src/index.ts
const context2 = [
  'Python was created by Guido van Rossum.',
  'The first version was released in 1991.',
  'Python emphasizes code readability.',
];

const metric2 = new FaithfulnessMetric(openai('gpt-4o-mini'), {
  context: context2,
});

const query2 = 'What can you tell me about Python?';
const response2 =
  'Python was created by Guido van Rossum and released in 1991. It is the most popular programming language today and is used by millions of developers worldwide.';

console.log('Example 2 - Mixed Faithfulness:');
console.log('Context:', context2);
console.log('Query:', query2);
console.log('Response:', response2);

const result2 = await metric2.measure(query2, response2);
console.log('Metric Result:', {
  score: result2.score,
  reason: result2.info.reason,
});
// Example Output:
// Metric Result: { score: 0.5, reason: 'Only half of the claims are supported by the context.' }
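
To see where a score of 0.5 could come from, it helps to break the response into individual claims. The sketch below assumes the score behaves like the fraction of claims supported by the context, which matches the example output above but is an assumption about the metric rather than its actual implementation (the real metric uses the model to extract and judge claims). `ClaimVerdict`, `supportedFraction`, and the hand-labeled verdicts are hypothetical names and data introduced only for illustration.

```typescript
// Assumption: faithfulness is modeled as supported claims / total claims.
// The real metric relies on an LLM judge to extract and verify claims.
type Verdict = 'supported' | 'unsupported';

interface ClaimVerdict {
  claim: string;
  verdict: Verdict;
}

function supportedFraction(verdicts: ClaimVerdict[]): number {
  // Returning 0 for an empty list is a choice made for this sketch.
  if (verdicts.length === 0) return 0;
  const supported = verdicts.filter((v) => v.verdict === 'supported').length;
  return supported / verdicts.length;
}

// Hand-labeled claims from the Python response, judged against its context.
const pythonVerdicts: ClaimVerdict[] = [
  { claim: 'Python was created by Guido van Rossum.', verdict: 'supported' },
  { claim: 'Python was released in 1991.', verdict: 'supported' },
  { claim: 'Python is the most popular programming language today.', verdict: 'unsupported' },
  { claim: 'Python is used by millions of developers worldwide.', verdict: 'unsupported' },
];

console.log(supportedFraction(pythonVerdicts)); // 0.5
```

Two of the four claims appear in the context, so the fraction is 0.5. The popularity and usage claims may well be true, but the context does not support them, which is what the metric measures.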

Low Faithfulness Example

Evaluate a response that contradicts context:

src/index.ts
const context3 = [
  'Mars is the fourth planet from the Sun.',
  'It has a thin atmosphere of mostly carbon dioxide.',
  'Two small moons orbit Mars: Phobos and Deimos.',
];

const metric3 = new FaithfulnessMetric(openai('gpt-4o-mini'), {
  context: context3,
});

const query3 = 'What do we know about Mars?';
const response3 =
  'Mars is the third planet from the Sun. It has a thick atmosphere rich in oxygen and nitrogen, and is orbited by three large moons.';

console.log('Example 3 - Low Faithfulness:');
console.log('Context:', context3);
console.log('Query:', query3);
console.log('Response:', response3);

const result3 = await metric3.measure(query3, response3);
console.log('Metric Result:', {
  score: result3.score,
  reason: result3.info.reason,
});
// Example Output:
// Metric Result: { score: 0, reason: 'The response contradicts the context.' }

Understanding the Results

The metric provides:

  1. A faithfulness score between 0 and 1:

    • 1.0: Perfect faithfulness - all claims supported by context
    • 0.7-0.9: High faithfulness - most claims supported
    • 0.4-0.6: Mixed faithfulness - some claims unsupported
    • 0.1-0.3: Low faithfulness - most claims unsupported
    • 0.0: No faithfulness - claims contradict context
  2. Detailed reason for the score, including analysis of:

    • Claim verification
    • Factual accuracy
    • Contradictions
    • Overall faithfulness
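
When scores feed into logs or dashboards, it can be convenient to turn them into the labels listed above. The following is a minimal sketch of such a mapping. `faithfulnessBand` and the `FaithfulnessBand` type are hypothetical names, not part of `@mastra/evals`. The listed ranges leave small gaps (for example 0.95 or 0.65), so this sketch assumes each band extends up to the start of the next one.

```typescript
// Hypothetical helper (not part of @mastra/evals): map a faithfulness
// score to the descriptive bands listed above.
type FaithfulnessBand = 'perfect' | 'high' | 'mixed' | 'low' | 'none';

function faithfulnessBand(score: number): FaithfulnessBand {
  if (!Number.isFinite(score) || score < 0 || score > 1) {
    throw new RangeError(`Score must be between 0 and 1, got ${score}`);
  }
  if (score === 1) return 'perfect';
  if (score >= 0.7) return 'high'; // assumption: covers 0.7 up to, but not including, 1
  if (score >= 0.4) return 'mixed'; // assumption: covers 0.4 up to, but not including, 0.7
  if (score > 0) return 'low'; // assumption: covers anything above 0 and below 0.4
  return 'none';
}

console.log(faithfulnessBand(1)); // perfect
console.log(faithfulnessBand(0.5)); // mixed
console.log(faithfulnessBand(0)); // none
```

The three calls correspond to the scores from the high, mixed, and low faithfulness examples.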
