# Response Caching

Response caching skips the LLM call and replays a previously cached response when an agent receives an identical request. Use it to reduce latency and avoid paying for repeated calls.

Caching is implemented as the [`ResponseCache`](https://mastra.ai/reference/processors/response-cache) input processor. Mastra doesn't provide an agent-level option. To enable caching, register the processor explicitly. This keeps the API surface small while Mastra collects feedback; per-call overrides flow through `RequestContext`.

## When to use response caching

Reach for it when the same request shape repeats across users or sessions, for example prompt templates, suggested-prompt buttons, agentic search re-asks, or guardrail LLMs that classify the same input over and over.

Skip it when calls trigger external side effects through tools, since cache hits replay tool calls without re-executing them.

## Quickstart

Add a `ResponseCache` to the agent's `inputProcessors` and pass any `MastraServerCache` as the backend. For development, `InMemoryServerCache` works out of the box:

```typescript
import { Agent } from '@mastra/core/agent'
import { InMemoryServerCache } from '@mastra/core/cache'
import { ResponseCache } from '@mastra/core/processors'

const cache = new InMemoryServerCache()

export const searchAgent = new Agent({
  name: 'Search Agent',
  instructions: 'You answer questions concisely.',
  model: 'openai/gpt-5',
  inputProcessors: [new ResponseCache({ cache, ttl: 600 })], // 10 minutes
})
```

The first call runs the LLM normally and writes the response to the cache. Subsequent calls with an identical resolved prompt return the cached response without invoking the LLM.

## Per-call overrides via RequestContext

Per-call config flows through `RequestContext`. Use `ResponseCache.context()` to build a fresh context, or `ResponseCache.applyContext()` to merge into one you already have:

```typescript
import { ResponseCache } from '@mastra/core/processors'
import { RequestContext } from '@mastra/core/request-context'

// Fresh context with the override
await agent.stream('hello', {
  requestContext: ResponseCache.context({ key: 'custom-key', bust: true }),
})

// Or merge into an existing context
const ctx = new RequestContext()
ctx.set('caller-meta', { userId: 'u-123' })
ResponseCache.applyContext(ctx, { bust: true })
await agent.stream('hello', { requestContext: ctx })
```

Three fields are overridable per call:

- `key`: String or function. Overrides the auto-derived cache key for this request only.
- `scope`: String or `null`. Overrides the tenant/user scope for this request only. `null` opts out of scoping.
- `bust`: Boolean. Skips the cache read but still writes on completion (useful for "force refresh" buttons).

`cache`, `ttl`, and `agentId` stay on the constructor. They're instance-level concerns and not safe to vary per call.

## Tenant scoping

By default, `ResponseCache` looks up `MASTRA_RESOURCE_ID_KEY` on the request context and uses it as the cache scope. This means an agent that already populates the resource id (e.g. via memory) gets per-user isolation automatically. Two users never see each other's cached responses.

Override explicitly when you need a different scope:

```typescript
new Agent({
  // ...
  inputProcessors: [
    new ResponseCache({
      cache,
      scope: 'org-123', // explicit tenant scope
    }),
  ],
})
```

Pass `scope: null` to deliberately share entries across all callers. Only use this for known-public, non-personalized content.

## Custom cache backend

`ResponseCache` accepts any `MastraServerCache`. For production, use `RedisCache` from `@mastra/redis`:

```typescript
import { Agent } from '@mastra/core/agent'
import { ResponseCache } from '@mastra/core/processors'
import { RedisCache } from '@mastra/redis'

const cache = new RedisCache({ url: process.env.REDIS_URL })

export const agent = new Agent({
  name: 'Cached Agent',
  instructions: '...',
  model: 'openai/gpt-5',
  inputProcessors: [new ResponseCache({ cache })],
})
```

For a custom backend, extend `MastraServerCache` and implement its abstract methods (the processor only calls `get` and `set`).
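As a minimal sketch, assuming promise-based `get(key)` / `set(key, value)` signatures and a no-argument base constructor (check the `MastraServerCache` reference for the actual abstract surface), a hypothetical two-tier backend could look like:

```typescript
import { MastraServerCache } from '@mastra/core/cache'

// Hypothetical two-tier backend: an in-process map in front of a shared store.
// The `get`/`set` signatures here are assumptions; implement whatever other
// abstract methods the base class declares.
export class TwoTierCache extends MastraServerCache {
  private local = new Map<string, unknown>()

  async get(key: string): Promise<unknown | null> {
    // Serve hot entries from the in-process map first...
    if (this.local.has(key)) return this.local.get(key) ?? null
    // ...then fall back to a shared store (e.g. Redis) here.
    return null
  }

  async set(key: string, value: unknown): Promise<void> {
    this.local.set(key, value)
    // Write through to the shared store here.
  }
}
```

Because the processor only reads and writes through `get` and `set`, implementing those two is enough for `ResponseCache` hits and writes to work against this backend.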
## How caching is implemented

`ResponseCache` hooks into `processLLMRequest` (cache lookup, short-circuits on hit) and `processLLMResponse` (cache write on completion). Both run inside the agentic loop _after_ memory has loaded and earlier input processors have transformed the prompt, so the cache key is derived from the resolved `LanguageModelV2Prompt` Mastra is about to send to the model. Each step in an agentic tool loop is cached independently.

## What's in the cache key

When you don't supply `key`, the processor derives one deterministically from the inputs that change the LLM's response at this step:

- `agentId`
- `stepNumber` (so each step in a tool loop has its own cache entry)
- `scope`
- model identity (`provider`, `modelId`, spec version)
- the resolved `prompt` (post-memory + post-processors)

Any change to these inputs automatically invalidates the cache.

### Customize the cache key

Pass `key` as a function on the constructor or per-call to derive your own cache key from any subset of those inputs. The function receives the same inputs the deterministic hash would have consumed and returns a string (or a `Promise<string>`):

```typescript
import { ResponseCache, buildResponseCacheKey } from '@mastra/core/processors'

await agent.stream(input, {
  requestContext: ResponseCache.context({
    // Cache only on the model id and the resolved prompt tail; ignore
    // step number, scope, etc.
    key: ({ model, prompt }) =>
      `qa:${model.modelId}:${JSON.stringify(prompt).slice(-200)}`,
  }),
})

// Or reuse the deterministic helper while overriding individual fields:
await agent.stream(input, {
  requestContext: ResponseCache.context({
    key: inputs => buildResponseCacheKey({ ...inputs, scope: 'global' }),
  }),
})
```

If the function throws, the processor falls back to the default key derivation so the call still benefits from caching.

## How cache hits work

When the processor finds a cache hit, it short-circuits the LLM call by returning the cached chunks from `processLLMRequest`. The agentic loop synthesizes a stream from those chunks instead of calling the model. `agent.generate()` collects them into a `FullOutput`; `agent.stream()` returns a `MastraModelOutput` whose chunks come from the cached buffer, so consumers iterating `fullStream` or awaiting `text`, `usage`, and `finishReason` see the cached values.

Cache writes happen after the response completes. Failed runs (errors, tripwire activations) aren't cached, so the next call retries cleanly.

## Related

- [`ResponseCache` reference](https://mastra.ai/reference/processors/response-cache)
- [Processors](https://mastra.ai/docs/agents/processors)
- [Guardrails](https://mastra.ai/docs/agents/guardrails)
- [Agent.stream()](https://mastra.ai/reference/streaming/agents/stream)
- [Agent.generate()](https://mastra.ai/reference/agents/generate)