Running Evals
Evals are automated tests that assess LLM outputs using model-graded, rule-based, and statistical methods. Each eval returns a normalized score between 0 and 1 that can be logged and compared. Evals can be customized with your own prompts and scoring functions.
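As a minimal sketch of the scoring idea, here is a simple rule-based eval that returns a normalized 0-1 score. The `keywordCoverage` function and `EvalResult` type are illustrative, not part of any particular framework:

```typescript
// Rule-based eval sketch: score an LLM answer by keyword coverage,
// normalized to the 0-1 range described above.
type EvalResult = { name: string; score: number };

function keywordCoverage(output: string, expectedKeywords: string[]): EvalResult {
  const hits = expectedKeywords.filter((kw) =>
    output.toLowerCase().includes(kw.toLowerCase()),
  ).length;
  return {
    name: "keyword-coverage",
    // 1 means every expected keyword appeared in the output.
    score: expectedKeywords.length === 0 ? 1 : hits / expectedKeywords.length,
  };
}

// Example usage:
const result = keywordCoverage(
  "Paris is the capital of France.",
  ["paris", "france"],
);
console.log(result); // { name: "keyword-coverage", score: 1 }
```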
Eval suites run in the cloud, but since they are tests, it makes sense to store them in your codebase. Because LLMs are non-deterministic, you might not get a 100% pass rate every time.
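One way to handle that non-determinism, sketched below with a placeholder `scoreAnswer` eval and Vitest-style assertions, is to run an eval several times and assert that the average score clears a threshold rather than expecting every run to pass:

```typescript
import { describe, it, expect } from "vitest";

// Placeholder eval that returns a normalized 0-1 score; in a real suite
// this would call your model and scorer of choice.
async function scoreAnswer(_prompt: string): Promise<number> {
  return 0.9;
}

describe("answer quality", () => {
  it("clears a 0.8 average score across runs", async () => {
    const runs = 5;
    const scores = await Promise.all(
      Array.from({ length: runs }, () =>
        scoreAnswer("What is the capital of France?"),
      ),
    );
    const average = scores.reduce((sum, s) => sum + s, 0) / runs;
    // Assert on an average threshold instead of a 100% pass rate.
    expect(average).toBeGreaterThanOrEqual(0.8);
  });
});
```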
Mastra recommends running evals with Braintrust's eval framework, autoevals. Braintrust has a free tier that should be enough for most use cases.
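As a rough sketch of what that looks like with autoevals, using its model-graded `Factuality` scorer and assuming an OpenAI API key is available in the environment (check the autoevals docs for the current API):

```typescript
import { Factuality } from "autoevals";

// Model-graded scorers call an LLM under the hood, so an API key
// (e.g. OPENAI_API_KEY) is assumed to be set in the environment.
async function main() {
  const result = await Factuality({
    input: "Which country has the highest population?",
    output: "People's Republic of China",
    expected: "China",
  });

  // Like other evals, the result carries a normalized score between 0 and 1.
  console.log(result.score);
}

main();
```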
Other open-source eval frameworks: