# Context Precision Scorer

The `createContextPrecisionScorer()` function creates a scorer that evaluates how relevant and how well positioned retrieved context pieces are for generating the expected output. It uses **Mean Average Precision (MAP)** to reward systems that place relevant context earlier in the sequence.

It is especially useful for these use cases:

**RAG System Evaluation**

Ideal for evaluating retrieved context in RAG pipelines where:

- Context ordering matters for model performance
- You need to measure retrieval quality beyond simple relevance
- Early relevant context is more valuable than later relevant context

**Context Window Optimization**

Use when optimizing context selection for:

- Limited context windows
- Token budget constraints
- Multi-step reasoning tasks

## Parameters

**model** (`MastraModelConfig`): The language model to use for evaluating context relevance

**options** (`ContextPrecisionMetricOptions`): Configuration options for the scorer

**Note**: Either `context` or `contextExtractor` must be provided. If both are provided, `contextExtractor` takes precedence.

## .run() Returns

**score** (`number`): Mean Average Precision score between 0 and `scale` (default 0-1)

**reason** (`string`): Human-readable explanation of the context precision evaluation

## Scoring Details

### Mean Average Precision (MAP)

Context Precision uses **Mean Average Precision** to evaluate both relevance and positioning:

1. **Context Evaluation**: Each context piece is classified as relevant or irrelevant for generating the expected output
2. **Precision Calculation**: For each relevant context piece at zero-based position `i`, precision = `relevant_items_so_far / (i + 1)`
3. **Average Precision**: Sum the precision values and divide by the total number of relevant items
4. **Final Score**: Multiply by the scale factor and round to 2 decimals

### Scoring Formula

```text
MAP = (Σ Precision@k) / R

Where:
- Precision@k = (relevant items in positions 1...k) / k
- R = total number of relevant items
- Precision is only calculated at positions where relevant items appear
```

### Score Interpretation

- **0.9-1.0**: Excellent precision - all relevant context appears early in the sequence
- **0.7-0.8**: Good precision - most relevant context is well-positioned
- **0.4-0.6**: Moderate precision - relevant context is mixed with irrelevant context
- **0.1-0.3**: Poor precision - little relevant context, or it is poorly positioned
- **0.0**: No relevant context found

### Reason analysis

The `reason` field explains:

- Which context pieces were deemed relevant or irrelevant
- How positioning affected the MAP calculation
- The specific relevance criteria used in the evaluation

### Optimization insights

Use the results to:

- **Improve retrieval**: Filter out irrelevant context before ranking
- **Optimize ranking**: Ensure relevant context appears early
- **Tune chunk size**: Balance context detail against relevance precision
- **Evaluate embeddings**: Test different embedding models for better retrieval

### Example Calculation

Given the context `[relevant, irrelevant, relevant, irrelevant]`:

- Position 0: Relevant → Precision = 1/1 = 1.0
- Position 1: Skipped (irrelevant)
- Position 2: Relevant → Precision = 2/3 ≈ 0.67
- Position 3: Skipped (irrelevant)

MAP = (1/1 + 2/3) / 2 = 0.833… ≈ **0.83**
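To make the calculation concrete, here is a minimal standalone sketch of the MAP computation described above. The `meanAveragePrecision` helper and its `verdicts` input are illustrative only; they stand in for the scorer's internal LLM relevance judgments and are not part of the Mastra API.

```typescript
// Minimal sketch of the MAP calculation described above.
// `verdicts` stands in for the per-context relevance judgments
// (true = relevant to the expected output, false = irrelevant).
function meanAveragePrecision(verdicts: boolean[], scale = 1): number {
  let relevantSoFar = 0;
  let precisionSum = 0;

  verdicts.forEach((isRelevant, index) => {
    if (!isRelevant) return; // precision is only sampled at relevant positions
    relevantSoFar += 1;
    precisionSum += relevantSoFar / (index + 1); // Precision@k with k = index + 1
  });

  if (relevantSoFar === 0) return 0; // no relevant context found
  // average precision, scaled and rounded to 2 decimals
  return Math.round((precisionSum / relevantSoFar) * scale * 100) / 100;
}

// Reproduces the worked example: [relevant, irrelevant, relevant, irrelevant]
console.log(meanAveragePrecision([true, false, true, false])); // 0.83
```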
## Scorer configuration

### Dynamic context extraction

```typescript
const scorer = createContextPrecisionScorer({
  model: "openai/gpt-5.1",
  options: {
    contextExtractor: (input, output) => {
      // Extract context dynamically based on the query
      const query = input?.inputMessages?.[0]?.content || "";
      // Example: retrieve from a vector database
      const searchResults = vectorDB.search(query, { limit: 10 });
      return searchResults.map((result) => result.content);
    },
    scale: 1,
  },
});
```

### Large context evaluation

```typescript
const scorer = createContextPrecisionScorer({
  model: "openai/gpt-5.1",
  options: {
    context: [
      // Simulate retrieved documents from a vector database
      "Document 1: Highly relevant content...",
      "Document 2: Somewhat related content...",
      "Document 3: Tangentially related...",
      "Document 4: Not relevant...",
      "Document 5: Highly relevant content...",
      // ... up to dozens of context pieces
    ],
  },
});
```

## Example

Evaluate RAG system context retrieval precision for different queries:

```typescript
import { runEvals } from "@mastra/core/evals";
import { createContextPrecisionScorer } from "@mastra/evals/scorers/prebuilt";
import { myAgent } from "./agent";

const scorer = createContextPrecisionScorer({
  model: "openai/gpt-4o",
  options: {
    contextExtractor: (input, output) => {
      // Extract context from the agent's retrieved documents
      return output.metadata?.retrievedContext || [];
    },
  },
});

const result = await runEvals({
  data: [
    {
      input: "How does photosynthesis work in plants?",
    },
    {
      input: "What are the mental and physical benefits of exercise?",
    },
  ],
  scorers: [scorer],
  target: myAgent,
  onItemComplete: ({ scorerResults }) => {
    console.log({
      score: scorerResults[scorer.id].score,
      reason: scorerResults[scorer.id].reason,
    });
  },
});

console.log(result.scores);
```

For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).

To add this scorer to an agent, see the [Scorers overview](https://mastra.ai/docs/evals/overview) guide.

## Comparison with Context Relevance

Choose the right scorer for your needs:

| Use Case                 | Context Relevance    | Context Precision         |
| ------------------------ | -------------------- | ------------------------- |
| **RAG evaluation**       | When usage matters   | When ranking matters      |
| **Context quality**      | Nuanced levels       | Binary relevance          |
| **Missing detection**    | ✓ Identifies gaps    | ✗ Not evaluated           |
| **Usage tracking**       | ✓ Tracks utilization | ✗ Not considered          |
| **Position sensitivity** | ✗ Position agnostic  | ✓ Rewards early placement |

## Related

- [Answer Relevancy Scorer](https://mastra.ai/reference/evals/answer-relevancy) - Evaluates whether answers address the question
- [Faithfulness Scorer](https://mastra.ai/reference/evals/faithfulness) - Measures how grounded answers are in the provided context
- [Custom Scorers](https://mastra.ai/docs/evals/custom-scorers) - Creating your own evaluation metrics