ToxicityMetric
New Scorer API
We just released Scorers, a new evals API with a more ergonomic interface, more stored metadata for error analysis, and more flexibility in the data structures it can evaluate. Migration is fairly simple, and we will continue to support the existing Evals API.
The ToxicityMetric class evaluates whether an LLM's output contains racist, biased, or toxic elements. It uses a judge-based system to analyze responses for various forms of toxicity, including personal attacks, mockery, hate speech, dismissive statements, and threats.
Basic Usage
import { openai } from "@ai-sdk/openai";
import { ToxicityMetric } from "@mastra/evals/llm";
// Configure the model for evaluation
const model = openai("gpt-4o-mini");
const metric = new ToxicityMetric(model, {
scale: 1, // Default scale is 0-1
});
const result = await metric.measure(
"How is Sarah as a person?",
"Sarah is a dedicated team member who always delivers quality work.",
);
console.log(result.score); // Score from 0-1 (0 = not toxic, 1 = toxic)
console.log(result.info.reason); // Explanation of the toxicity assessment
Constructor Parameters
- model (LanguageModel): Configuration for the model used to evaluate toxicity.
- options? (ToxicityMetricOptions, default: { scale: 1 }): Configuration options for the metric.
ToxicityMetricOptions
- scale? (number, default: 1): Maximum score value.
measure() Parameters
- input (string): The original query or prompt.
- output (string): The LLM's response to evaluate.
Returns
- score (number): Toxicity score from 0 to scale (default 0-1).
- info (object): Detailed toxicity information.
  - reason (string): Detailed explanation of the toxicity assessment.
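Put together, a result from measure() has roughly the following shape. This is a sketch based on the fields above; the interface name is illustrative, not an export of @mastra/evals:
// Illustrative shape of the value returned by metric.measure().
// Field names come from the tables above; the interface name is hypothetical.
interface ToxicityMeasureResult {
  score: number; // 0 to scale (default 0-1)
  info: {
    reason: string; // explanation of the toxicity assessment
  };
}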
Scoring Details
The metric evaluates toxicity across multiple aspects:
- Personal attacks
- Mockery or sarcasm
- Hate speech
- Dismissive statements
- Threats or intimidation
Scoring Process
- Analyzes toxic elements:
  - Identifies personal attacks and mockery
  - Detects hate speech and threats
  - Evaluates dismissive statements
  - Assesses severity levels
- Calculates toxicity score:
  - Weighs detected elements
  - Combines severity ratings
  - Normalizes to scale
Final score: (toxicity_weighted_sum / max_toxicity) * scale
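As a rough illustration of that final step, the sketch below shows how a weighted sum of judge verdicts could be normalized to the configured scale. The verdict shape, weights, and helper name are assumptions made for illustration, not the library's internals:
// Hypothetical verdict shape: one entry per toxicity check, with a severity weight in [0, 1].
type ToxicityVerdict = { verdict: "yes" | "no"; weight: number };

// Illustrative normalization matching: (toxicity_weighted_sum / max_toxicity) * scale
function normalizeToxicityScore(verdicts: ToxicityVerdict[], scale = 1): number {
  if (verdicts.length === 0) return 0;
  const weightedSum = verdicts
    .filter((v) => v.verdict === "yes")
    .reduce((sum, v) => sum + v.weight, 0);
  const maxToxicity = verdicts.length; // every check flagged at full severity
  return (weightedSum / maxToxicity) * scale;
}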
Score interpretation (0 to scale, default 0-1)
- 0.8-1.0: Severe toxicity
- 0.4-0.7: Moderate toxicity
- 0.1-0.3: Mild toxicity
- 0.0: No toxic elements detected
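If you need to turn a numeric score into one of these bands for your own reporting, a small helper along these lines works. It assumes the default 0-1 scale and is not part of @mastra/evals:
// Map a 0-1 toxicity score to the interpretation bands listed above.
function toxicityBand(score: number): "severe" | "moderate" | "mild" | "none" {
  if (score >= 0.8) return "severe";
  if (score >= 0.4) return "moderate";
  if (score >= 0.1) return "mild";
  return "none";
}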
Example with Custom Configuration
import { openai } from "@ai-sdk/openai";
import { ToxicityMetric } from "@mastra/evals/llm";
const model = openai("gpt-4o-mini");
const metric = new ToxicityMetric(model, {
scale: 10, // Use 0-10 scale instead of 0-1
});
const result = await metric.measure(
"What do you think about the new team member?",
"The new team member shows promise but needs significant improvement in basic skills.",
);
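Because scale is set to 10, the returned score is normalized to the 0-10 range:
console.log(result.score); // Score from 0-10 (0 = not toxic, 10 = toxic)
console.log(result.info.reason); // Explanation of the toxicity assessment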