Context Relevance Scorer
The `createContextRelevanceScorerLLM()` function creates a scorer that evaluates how relevant and useful the provided context was for generating agent responses. It uses weighted relevance levels and applies penalties for unused high-relevance context and missing information.
Parameters
- `model`: The language model used to evaluate context relevance (for example, `openai('gpt-4o')`).
- `options`: Scorer options. Provide either `context` (a static array of context strings) or `contextExtractor` (a function that derives context from the run input and output), along with optional `penalties` and `scale` settings.
:::note
Either `context` or `contextExtractor` must be provided. If both are provided, `contextExtractor` takes precedence.
:::
.run() Returns
- `score`: A number between 0 and `scale` (0-1 by default); higher values indicate more relevant, better-utilized context.
- `reason`: A human-readable explanation of how the score was determined.
Scoring Details
Weighted Relevance Scoring
Context Relevance uses a weighted scoring algorithm that considers:
- Relevance Levels: Each context piece is classified with a weighted value:
  - `high` = 1.0 (directly addresses the query)
  - `medium` = 0.7 (supporting information)
  - `low` = 0.3 (tangentially related)
  - `none` = 0.0 (completely irrelevant)
- Usage Detection: Tracks whether relevant context was actually used in the response
- Penalties Applied (configurable via `penalties` options):
  - Unused High-Relevance: `unusedHighRelevanceContext` penalty per unused high-relevance context piece (default: 0.1)
  - Missing Context: Up to `maxMissingContextPenalty` for identified missing information (default: 0.5)
Scoring Formula
```
Base Score = Σ(relevance_weights) / (num_contexts × 1.0)
Usage Penalty = count(unused_high_relevance) × unusedHighRelevanceContext
Missing Penalty = min(count(missing_context) × missingContextPerItem, maxMissingContextPenalty)
Final Score = max(0, Base Score - Usage Penalty - Missing Penalty) × scale
```
Default Values:
- `unusedHighRelevanceContext` = 0.1 (10% penalty per unused high-relevance context piece)
- `missingContextPerItem` = 0.15 (15% penalty per missing context item)
- `maxMissingContextPenalty` = 0.5 (maximum 50% penalty for missing context)
- `scale` = 1
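As a worked illustration of the formula with the default values, the sketch below recomputes the final score for a hypothetical run: three context pieces rated `high`, `medium`, and `none`, with the high-relevance piece left unused and one missing-context item identified. The numbers are invented for illustration; only the formula and defaults come from the documentation above.

```typescript
// Illustrative only: recomputes the documented formula with the default penalty values.
const weights = { high: 1.0, medium: 0.7, low: 0.3, none: 0.0 } as const;

const ratings: (keyof typeof weights)[] = ['high', 'medium', 'none']; // per-context relevance levels
const unusedHighRelevanceCount = 1; // high-relevance pieces not used in the response
const missingContextCount = 1;      // missing-context items identified by the judge

const baseScore =
  ratings.reduce((sum, level) => sum + weights[level], 0) / ratings.length; // (1.0 + 0.7 + 0.0) / 3 ≈ 0.567
const usagePenalty = unusedHighRelevanceCount * 0.1;              // unusedHighRelevanceContext
const missingPenalty = Math.min(missingContextCount * 0.15, 0.5); // missingContextPerItem, capped by maxMissingContextPenalty
const finalScore = Math.max(0, baseScore - usagePenalty - missingPenalty) * 1; // scale = 1

console.log(finalScore.toFixed(3)); // ≈ 0.317
```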
Score Interpretation
- 0.9-1.0 = Excellent relevance with minimal gaps
- 0.7-0.8 = Good relevance with some unused or missing context
- 0.4-0.6 = Mixed relevance with significant gaps
- 0.0-0.3 = Poor relevance or mostly irrelevant context
Difference from Context Precision
| Aspect | Context Relevance | Context Precision |
|---|---|---|
| Algorithm | Weighted levels with penalties | Mean Average Precision (MAP) |
| Relevance | Multiple levels (high/medium/low/none) | Binary (yes/no) |
| Position | Not considered | Critical (rewards early placement) |
| Usage | Tracks and penalizes unused context | Not considered |
| Missing | Identifies and penalizes gaps | Not evaluated |
Usage Examples
Basic Configuration
```typescript
import { openai } from '@ai-sdk/openai';
import { createContextRelevanceScorerLLM } from '@mastra/evals/scorers/llm';

const scorer = createContextRelevanceScorerLLM({
  model: openai('gpt-4o'),
  options: {
    context: ['Einstein won the Nobel Prize for his work on the photoelectric effect'],
    scale: 1,
  },
});
```
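To evaluate a response, call `.run()` on the scorer; it returns the `score` and `reason` fields described above. The payload shape below (an `inputMessages` list plus an output message list) is an assumption inferred from the `contextExtractor` example further down; adapt it to your agent integration.

```typescript
// Hypothetical run payload; the exact shape depends on how the scorer is wired into your agent.
const result = await scorer.run({
  input: {
    inputMessages: [
      { id: '1', role: 'user', content: 'What did Einstein win the Nobel Prize for?' },
    ],
  },
  output: [
    { id: '2', role: 'assistant', content: 'Einstein won the Nobel Prize for his work on the photoelectric effect.' },
  ],
});

console.log(result.score);  // 0-1 with the default scale
console.log(result.reason); // explanation of the relevance assessment
```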
Custom Penalty Configuration
```typescript
const scorer = createContextRelevanceScorerLLM({
  model: openai('gpt-4o'),
  options: {
    context: ['Context information...'],
    penalties: {
      unusedHighRelevanceContext: 0.05, // Lower penalty for unused context
      missingContextPerItem: 0.2, // Higher penalty per missing item
      maxMissingContextPenalty: 0.4, // Lower maximum penalty cap
    },
    scale: 2, // Double the final score
  },
});
```
Dynamic Context Extraction
```typescript
const scorer = createContextRelevanceScorerLLM({
  model: openai('gpt-4o'),
  options: {
    contextExtractor: (input, output) => {
      // Extract context based on the query
      const userQuery = input?.inputMessages?.[0]?.content || '';
      if (userQuery.includes('Einstein')) {
        return [
          'Einstein won the Nobel Prize for the photoelectric effect',
          'He developed the theory of relativity',
        ];
      }
      return ['General physics information'];
    },
    penalties: {
      unusedHighRelevanceContext: 0.15,
    },
  },
});
```
Usage Patterns
Content Generation Evaluation
Best for evaluating context quality in:
- Chat systems where context usage matters
- RAG pipelines needing nuanced relevance assessment
- Systems where missing context affects quality
Context Selection Optimization
Use when optimizing for:
- Comprehensive context coverage
- Effective context utilization
- Identifying context gaps
Related
- Context Precision Scorer - Evaluates context ranking using MAP
- Faithfulness Scorer - Measures answer groundedness in context
- Custom Scorers - Creating your own evaluation metrics