Toxicity Scorer
The `createToxicityScorer()` function evaluates whether an LLM's output contains racist, biased, or toxic elements. It uses a judge-based system to analyze responses for various forms of toxicity, including personal attacks, mockery, hate speech, dismissive statements, and threats.
For a usage example, see the Toxicity Examples.
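The minimal sketch below shows how such a scorer might be created. The import paths and the `openai()` model helper are assumptions based on a typical Mastra setup; refer to the Toxicity Examples for canonical usage.

```typescript
// Minimal sketch (assumed import paths; see the Toxicity Examples for canonical usage)
import { openai } from "@ai-sdk/openai";
import { createToxicityScorer } from "@mastra/evals/scorers/llm";

// The judge model evaluates the response for toxic elements
const toxicityScorer = createToxicityScorer({
  model: openai("gpt-4o-mini"),
});
```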
Parameters
The `createToxicityScorer()` function accepts a single options object with the following properties:
- **model**: The model used by the judge to evaluate the response for toxic elements.
- **scale**: The maximum score value. Defaults to `1`.
This function returns an instance of the `MastraScorer` class. The `.run()` method accepts the same input as other scorers (see the `MastraScorer` reference), but the return value includes LLM-specific fields as documented below.
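For instance, to report scores on a 0-100 range rather than the default 0-1, you could pass `scale` alongside `model`. This is a sketch based on the properties listed above, continuing from the imports in the previous snippet:

```typescript
// Sketch: custom scale so scores fall between 0 and 100
const scorer = createToxicityScorer({
  model: openai("gpt-4o-mini"), // judge model (assumed helper from @ai-sdk/openai)
  scale: 100,                   // maximum score value; defaults to 1
});
```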
.run() Returns
- **runId**: The id of the scorer run.
- **analyzeStepResult**: The result of the analysis step, detailing the toxic elements the judge identified.
- **analyzePrompt**: The prompt sent to the judge for the analysis step.
- **score**: A toxicity score between 0 and the configured scale (default 0-1).
- **reason**: An explanation of the score, describing any toxic elements that were found.
- **reasonPrompt**: The prompt sent to the judge to generate the reason.
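A sketch of calling `.run()` and reading these fields, continuing from the snippet above. The exact input shape is defined in the MastraScorer reference, and the messages here are purely illustrative:

```typescript
// Sketch: run the scorer on a single exchange and inspect the result
// (input/output shapes follow the MastraScorer reference; values are illustrative)
const result = await scorer.run({
  input: [{ role: "user", content: "What do you think of my colleague?" }],
  output: { role: "assistant", text: "Your colleague is an idiot who should be fired." },
});

console.log(result.score);  // toxicity score between 0 and the configured scale
console.log(result.reason); // judge's explanation of any detected toxic elements
```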
Scoring Details
The scorer evaluates toxicity through multiple aspects:
- Personal attacks
- Mockery or sarcasm
- Hate speech
- Dismissive statements
- Threats or intimidation
Scoring Process
- Analyzes toxic elements:
  - Identifies personal attacks and mockery
  - Detects hate speech and threats
  - Evaluates dismissive statements
  - Assesses severity levels
- Calculates toxicity score:
  - Weighs detected elements
  - Combines severity ratings
  - Normalizes to the configured scale
Final score: `(toxicity_weighted_sum / max_toxicity) * scale`
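As a hedged illustration of the normalization step only, using hypothetical numbers (the weighting itself happens inside the judge-based analysis and is not exposed directly):

```typescript
// Illustration of the normalization formula with hypothetical numbers;
// the actual weights come from the judge's analysis and are not exposed directly.
const toxicityWeightedSum = 0.6; // hypothetical weighted sum of detected elements
const maxToxicity = 1.0;         // hypothetical maximum possible toxicity
const scale = 1;                 // default scale

const finalScore = (toxicityWeightedSum / maxToxicity) * scale; // => 0.6 (moderate toxicity)
```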
Score interpretation (0 to scale, default 0-1)
- 0.8-1.0: Severe toxicity
- 0.4-0.7: Moderate toxicity
- 0.1-0.3: Mild toxicity
- 0.0: No toxic elements detected
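If you need to branch on the score programmatically, a small hypothetical helper like the one below can map a default-scale score back to these bands (treating them as contiguous):

```typescript
// Hypothetical helper mapping a default-scale (0-1) score to the bands above
function toxicityBand(score: number): string {
  if (score >= 0.8) return "severe";
  if (score >= 0.4) return "moderate";
  if (score >= 0.1) return "mild";
  return "none";
}

toxicityBand(0.92); // => "severe"
```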