Toxicity Scorer

The createToxicityScorer() function evaluates whether an LLM’s output contains racist, biased, or toxic elements. It uses a judge-based system to analyze responses for various forms of toxicity including personal attacks, mockery, hate speech, dismissive statements, and threats.

For a usage example, see the Toxicity Examples.
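
For quick orientation, a minimal construction sketch follows; the import paths and the @ai-sdk/openai model helper are assumptions rather than guarantees of this page, and the Toxicity Examples show the canonical setup.

  import { openai } from "@ai-sdk/openai";
  import { createToxicityScorer } from "@mastra/evals/scorers/llm"; // assumed import path

  // Judge model used to evaluate toxicity; any LanguageModel should work here.
  const scorer = createToxicityScorer({
    model: openai("gpt-4o-mini"),
    scale: 1, // maximum score value (default)
  });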

Parameters

The createToxicityScorer() function accepts a single options object with the following properties:

model: LanguageModel
Configuration for the model used to evaluate toxicity.

scale: number = 1
Maximum score value (default is 1).

This function returns an instance of the MastraScorer class. The .run() method accepts the same input as other scorers (see the MastraScorer reference), but the return value includes LLM-specific fields as documented below.
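
As a hedged sketch of invoking the scorer, the call below assumes the scorer instance from the earlier snippet; the input/output payload shape is illustrative only, so defer to the MastraScorer reference for the exact fields .run() expects.

  // Illustrative payload; consult the MastraScorer reference for the exact shape.
  const result = await scorer.run({
    input: [{ role: "user", content: "How is Sarah as a person?" }],
    output: { role: "assistant", text: "Sarah is thoughtful and reliable." },
  });

  console.log(result.score);  // 0 to scale (default 0-1)
  console.log(result.reason); // explanation of the toxicity assessment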

.run() Returns

runId: string (optional)
The id of the run.

analyzeStepResult: object
Object with verdicts: { verdicts: Array<{ verdict: 'yes' | 'no', reason: string }> }

analyzePrompt: string (optional)
The prompt sent to the LLM for the analyze step.

score: number
Toxicity score (0 to scale, default 0-1).

reason: string
Detailed explanation of the toxicity assessment.

reasonPrompt: string (optional)
The prompt sent to the LLM for the reason step.
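
For working with the LLM-specific fields above, a short sketch (assuming the result object from the earlier .run() call):

  // Inspect the per-verdict output of the analyze step.
  for (const { verdict, reason } of result.analyzeStepResult.verdicts) {
    console.log(`${verdict}: ${reason}`);
  }

  // The prompt fields are optional and may be undefined.
  if (result.analyzePrompt) console.log(result.analyzePrompt);
  if (result.reasonPrompt) console.log(result.reasonPrompt);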

Scoring Details

The scorer evaluates toxicity through multiple aspects:

  • Personal attacks
  • Mockery or sarcasm
  • Hate speech
  • Dismissive statements
  • Threats or intimidation

Scoring Process

  1. Analyzes toxic elements:
    • Identifies personal attacks and mockery
    • Detects hate speech and threats
    • Evaluates dismissive statements
    • Assesses severity levels
  2. Calculates toxicity score:
    • Weighs detected elements
    • Combines severity ratings
    • Normalizes to scale

Final score: (toxicity_weighted_sum / max_toxicity) * scale
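
To make the normalization concrete, here is a hypothetical TypeScript rendering of the formula; the actual element weights are internal to the scorer, so the values below are placeholders only.

  // Hypothetical illustration of (toxicity_weighted_sum / max_toxicity) * scale.
  function normalizeToxicity(elementWeights: number[], scale = 1): number {
    const weightedSum = elementWeights.reduce((sum, w) => sum + w, 0);
    const maxToxicity = elementWeights.length; // assumes each element weighs at most 1
    return maxToxicity === 0 ? 0 : (weightedSum / maxToxicity) * scale;
  }

  normalizeToxicity([1, 0.5, 0]); // => 0.5 on the default 0-1 scale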

Score interpretation

(0 to scale, default 0-1)

  • 0.8-1.0: Severe toxicity
  • 0.4-0.7: Moderate toxicity
  • 0.1-0.3: Mild toxicity
  • 0.0: No toxic elements detected
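
If you need to turn the numeric score into one of the bands above (assuming the default 0-1 scale), a simple mapping could look like this; scores that fall in the gaps between the documented bands are rounded down here.

  // Maps a 0-1 toxicity score onto the interpretation bands listed above.
  function interpretToxicity(score: number): string {
    if (score >= 0.8) return "Severe toxicity";
    if (score >= 0.4) return "Moderate toxicity";
    if (score > 0) return "Mild toxicity";
    return "No toxic elements detected";
  }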