Master AI Evaluation: Build and Run Evals with Mastra
2025 is seeing explosive growth in AI applications, but how do you know if they're actually performing well? This hands-on workshop will teach you how to build and run comprehensive evaluation frameworks for your AI systems.
Evaluating AI systems is crucial for ensuring reliability, safety, and performance at scale. Join Mastra.ai to learn practical strategies for implementing evals that give you confidence in your AI deployments. You'll learn how to use evals to assess your AI application and AI agent capabilities, detect potential issues, and maintain high standards of quality.
Get hands-on experience with essential eval strategies including:
Implementing LLM-as-judge evaluation frameworks
Setting up automated evaluation pipelines
Creating targeted test cases for your specific use cases
Monitoring and analyzing eval results through AI ops
This workshop is perfect for anyone building an AI Agent or AI application. Basic familiarity with JavaScript is recommended, and participants should have a code editor ready. You'll walk away with working eval implementations and practical knowledge you can immediately apply to your AI projects.
Don't just deploy AI—deploy it with confidence. Join us for this practical, hands-on session where you'll build real evaluation frameworks that you can start using right away.
Workshop Transcript
All right, everybody, thank you for joining. I'm sure we'll have a lot more people coming in over the next minute or so; we'll let a few more funnel in and then we'll start talking about evals. All right, let me share my screen and we'll kick this off. Hey everyone, welcome to the workshop.
Hopefully everyone can see my screen. In the last workshop I shared the wrong screen first and it took a while for someone to tell me, so drop a note in the chat if you can't see it. Let's get started and talk about the goal of this workshop. Thanks, everyone, for registering. Our goal is to build and run evals, and you should learn a lot about evals along the way. We'll also show you how to use them within Mastra itself, so even if you're not using Mastra you should get a lot out of this, but you'll hopefully also leave knowing it's really easy to do if you are.
As for what we're going to learn today: we'll go over an overview of evals in AI applications, learn how to leverage off-the-shelf evals with your AI agents, learn about using an LLM as a judge in your evals, write some custom evals and tie those to business results, and then build some test suites as well. Let's go through who we are.
I'll go first. I'm Shane Thomas, chief product officer and founder at Mastra. I was originally in product and engineering at Gatsby and at Netlify, I built a product called Audiofeed, and I've been doing open source for over 15 years, starting with contributions to Drupal of all things. You can follow me on Twitter or connect with me on LinkedIn. Obby, go ahead.
Hey everyone, I'm Obby. I was a principal engineer at Gatsby, where I built Gatsby Cloud, worked at Netlify as well, and now I'm CTO here at Mastra. I've never seen snowfall, which is a running joke in this Mastra universe we're creating up in San Francisco, because I was born in Los Angeles and we don't see snow over there. And what do I want? 13k stars on GitHub. We already got 13, thanks.
A little backstory on the snow: our base example in Mastra is a weather agent, and we use it in all these workshops and demos. Somehow it came up one day, when it was very cold where I'm from, Sioux Falls, South Dakota, that Obby has never seen snowfall, which I've seen a lot of.
Okay, so let's talk about whether you need evals. Greg Brockman has said that evals are surprisingly often all you need, and we're going to talk about why. I have a really good image here from Hamel; if you haven't been following his stuff and you're interested in evals, he has a lot of really good thoughts. I agree with most of them, not all, but he's very good at breaking down the eval space, and there are links at the end to some of his blog posts on evals; I'd highly recommend them if you want further reading. In this diagram you can see that you're essentially building a cycle around evals: you integrate evals into your product, then you go through the loop of running the evals, improving, running them again, improving again. You want to shrink that feedback loop between improving your evals and improving your overall AI results, because our goal is to improve those results over time. We want to make sure that as new models come out, as we change our system prompts, as we pull in more data or give our agents or applications more tools or more autonomy, things are actually getting better and not going backwards.
If any of you use Cursor or Windsurf, you might notice that sometimes it feels like it got worse and sometimes it feels like it got better. That's because all this LLM stuff is non-deterministic. It's very hard to test every possible use case, but we should be testing at least some of them, making sure we're getting better over time so it's less likely we ship regressions where things worked great for a user one week and not the next. And that's done through quick feedback cycles and increasing how fast you can iterate.
Let's talk a little bit about types of evals. First there are unit tests, sometimes tied into something called behavior tests: you have an input and an output, and you want the output to match an expected result, verifying that the LLM responds correctly across a wide range of prompts. Then there are metric-based evals, which we'll cover today; these are more quantitative measurements. I like to call them off-the-shelf evals: you can pull them in and use them for specific use cases. You probably don't want to use too many of them, but you do want to tie them to making sure things aren't regressing over time. And then there's the idea of judges, someone actually judging the output. That can be human evaluation, which is typically the most expensive but usually the best, or LLM-as-judge, where you use an LLM to judge the response. You may have to do some calibration and provide examples, but over time an LLM judge can reduce how much human evaluation you need. It doesn't mean you can get rid of humans entirely, but you may not need to rely on them quite as much.
Unit testing here is similar to traditional software unit tests. A lot of what you've done in traditional software development hasn't really changed; it's just a little different, much like when DevOps came along and you had to learn some new principles but it was still software development at the end of the day. AI is very similar. During unit tests you'll often use an LLM to generate test cases, often called synthetic data. Unlike traditional unit testing, your pass rate might vary a bit because the LLM is non-deterministic, so you set the bar for the pass rate you want and make sure you don't drop below it over time, knowing there will be some variance since you're never going to get 100% predictability. You'll typically want to run these in CI, or as part of a process that lets you validate frequently that things aren't going backwards.
Here are some really simple examples of what this might look like. Say I want to build an agent that gets the weather (we just talked about how Obby's never seen snow). Your sample data might contain a bunch of different ways someone could ask for the weather: "What's the weather in the city?", "Can you tell me the weather?", and so on. Then you'd have a test that asks: did the response contain the city name, was the correct weather tool actually called, was the tool status 200, and did it return actual results? You can imagine scripting that so when you run the test suite, all those variations run and all the data is validated.
Another example task could be "go research a sales lead." You might have an array of input/expected-output pairs: "Where did John Smith work before Apple?" Maybe he worked at Amazon, so you expect "Amazon" to be in the response, or to be the exact response, depending on how you structure your prompts and whether you're using structured output. "When did Jane Smith start at Google?" You're looking for a specific date to be there. So you might run tests asserting that the response contains the correct factual data and that the correct tool was called, whether that's a lookup in your database, a web search, or a LinkedIn API; whatever your tool is, you want to make sure the agent actually called the right one for these types of prompts.
You can see how you can piece together and run these kinds of tests on different parts of your application. Another important thing to think about with evals is what's often called LLM ops or AI ops: it's not just running tests, it's also making sure you can debug things as you go. You want tracing, which means that any time there's a request to one of your LLM calls or agents, you can look through all the steps that happened and see the inputs and outputs of every step. That helps you debug and makes it easier to iterate and improve your application.
We have a whole bunch of different metrics in Mastra, and there are a lot of other types you can find out in the wild. These are typically quantitative metrics. They might be NLP-based (natural language processing) or LLM-based, meaning you use an LLM as a judge for some of them. You should be selective about the metrics you choose. Obby and I did a video where we debated whether off-the-shelf metrics are useful or not, and the answer is: it depends on your use case. They can be useful, especially over long periods where you're watching things trend up or down; for any individual code push they may not mean much, but in aggregate over time they definitely can. So monitor those results over time, look for significant changes, and figure out why results are moving one way or the other.
When we talk about judges, there's the human judge and the LLM as judge. Human judges could be users of your product. If you use ChatGPT or similar products, you'll see a thumbs up or thumbs down: was this a good response or not? They're actually feeding that information back as evals into the training of their models. At a smaller scale you might do something similar: if you give something to a user and they say it was good, you know it was a good enough response. Typically users only respond at the extremes: a thumbs up if it was really good, a thumbs down if it was pretty bad, and nothing at all if it's in the middle. You can use that as eval data to make sure that as you improve things, those kinds of responses aren't going backwards.
You can also use experts. We're currently in Y Combinator, and there's a company in our batch working with financial data that has to apply very specific tax law when answering questions. They have someone with that background validate that whatever their application returns is right, and they spend a lot of time and money making sure it's providing factual information. In that case the judge is an expert, and of course that feedback flows into the test cases.
LLM as judge means you use an LLM to grade the output, because it's more automatable. You don't just need pass or fail; typically you can use some kind of scale. You'll also probably want to use the best LLM available for this: whatever the newest, most capable (and usually most expensive) model is, that's your judge, because it has the most comprehensive ability to reason through the task. But you do need to calibrate the LLM judge with human feedback and provided examples, making sure that over time it responds very similarly to what your expert or your user would actually say.
So let's talk a little bit about Mastra. Mastra is an open source AI agent framework for TypeScript. It has tools, memory, and tracing for agents. It has state-machine-based workflows, so you can build agentic workflows that can pause and resume when you need a human to interact with things; you can have that human in the loop. It has evals, storage, and RAG (retrieval-augmented generation) pipelines so you can pull data in from anywhere. We also have a nice local development playground, which you'll get to see a little of today. (And yes, we're fumbling with the screen-annotation setting; we'll figure that out in a second and just keep going.)
The idea is that Mastra is opinionated but flexible; that was really our goal when we built it. It's an open source framework, so we want to have opinions so you don't have to make a lot of decisions right away, but we don't want to lock you into those decisions, so you can of course swap things out and replace pieces as you go. We really want to empower you to get further faster: hit the ground running, ship something quickly, then swap in different pieces as your application gets more sophisticated.
So we're going to create some evals. We'll talk about what tone consistency, answer relevancy, and hallucination evals are; Obby is apparently going to talk about why off-the-shelf evals are trash, as I alluded to; we'll talk about building a judge and bringing it all together; and then we'll get excited about a few things. I'm going to stop sharing and hand it over to Obby.
Hello again. Not much time, but we'll do it. Okay, so this is a Mastra project. I'm not going to go through the setup process; we have a workshop covering the basics, and I'd attend or review that one before going any deeper.
I want to say that doing evals is kind of experimental territory, but it's also very much scientific: it's the method of iteration, trying hypotheses out, and making them work. We'll probably do an evals part two, because what I want to show you today is how to get into the iteration loop where you can execute the scientific method: try something, see if it changes anything, build some heuristics. I'll go from simple to more complex, but in reality there's so much more to learn; even we're learning every day. Okay, disclaimers aside, let's get into it.
First I want to talk about NLP metrics. Natural language metrics have been used forever, and there are so many of them off the shelf. If your product cares about how things are expressed in language, those are really nice evals to run. Maybe your LLM responses come across as mean-spirited and that means you should change your prompt a little; maybe you're trying to control tone. This tone consistency metric, as an example, measures tone using an NLP sentiment-analysis library. Setting up off-the-shelf metrics in Mastra is easy: we have a @mastra/evals package, so you just install it, and within the NLP section of that library there are many others: completeness, content similarity, keyword coverage, textual difference.
Now, these only matter if you're dealing with language; if you mostly care about the actions your agent takes, maybe they're less important. But let's keep going with the setup. You create a new tone consistency metric, and here I have a simple agent for a fake company called Quantum Tech; it's a customer support assistant bot. The cool thing about evals, like everything in Mastra, is that it's all JavaScript, so if we want to write a test on this, we can. In this test I have a bunch of test cases. We're using Vitest, by the way, which is pretty fast, but you could use anything you want: Jest, Mocha, whatever the new flavor is, because it's just JavaScript. I have a bunch of queries, and this is usually how an eval goes: you have inputs and outputs, very much like a unit test. "What quantum computing services do you offer?", "Can you explain quantum entanglement?", "It keeps crashing." These are all support-type questions, and the agent should maintain a consistent tone across them. So first we take the agent we set up with that system prompt, generate a response from it for each query, and then measure the response with our tone consistency metric. Metrics in Mastra are very simple: they have a function called measure, which takes the input (the original thing we asked) and the text that came back, measures them, and gives us the sentiment scores from the NLP library. Then it's up to you, as the developer, to figure out what that means to you.
So let's run this thing; I think I have it skipped. (Sorry, you were just seeing my editor for a second; my bad, this is what I wanted to show.) Let's run the tests; I'll skip the other suite, we'll come back to it soon. So it runs the tone test, and the first problem you'll see is that this test isn't really deterministic, because generate is going to produce a different response every time. And look, something failed! That's actually useful; let's inspect it all and I'll give you the takeaway. For the first case, the query was "What quantum computing services do you offer?" Obviously all the content is made up, because there is no Quantum Tech, but that's not what we're testing here; we're testing tone. It gave a pretty lengthy response, and the tone info says the response sentiment is zero, meaning neutral, while the reference sentiment differs a bit. Once you have that difference, an algorithm in the library converts it into a score; you can go look at how it populates the score. I'm no sentiment-analysis expert, but for this example that's fine. It scored about 0.92, which I guess is good, because as the interpreter of the score it's up to you to decide whether that's good or not. Maybe you want tone to always be neutral, or always positive; then you'd tweak the prompt. Most of the cases look the same, but the one that failed had a negative tone.
Looking at it, it doesn't obviously read as negative, but that's how it scored. The trend with these kinds of metrics is: if you want to run evals in CI, rather than testing that things are correct, test that things are not wrong, and especially not badly wrong. Because if the scores change on every run, you can't treat the result as a fixed score; you have to think of it as a range: maybe it passed, maybe it didn't, but you can at least say it didn't fail outright. I know that's not a satisfying thing to say, but it's kind of the reality of what this is. Let's go to another one. That was NLP-based, and I think NLP-based metrics aren't that great, because even when you get a score, as the interpreter you don't know the reasoning behind it unless you're a sentiment-analysis expert or go read up on it.
That's why the next one is LLM as judge. LLM as judge means you pass a model to the metric; you're essentially making the metric into an agent under the hood, because it can do a lot of things, but the scoring is still driven by logic within the metric. Answer relevancy works like this: given the input and the response, the judge decides whether the answer is relevant to the question. Now, again, is that even valuable to test? I don't know; it could be. It's a good intro example, though, and I'm going to keep poking at that question throughout. Same kind of deal as last time: I put a lot of comments here because I want to show how you could improve this later. You'll also see we're using o3-mini for our model, because I didn't want to use too expensive a model for the workshop. Answer relevancy has an uncertainty weight you can use to weight the responses, and you'll see the scores bounce around a bit. Same type of setup: we're Tech Gadget Inc, we ask it some questions, and we run the tests: the same JavaScript testing situation, three queries, generate then measure. Even with an LLM as judge, the metric still exposes a measure function with inputs and outputs, but this time we also get back a reason.
I'm really curious about the reasons; a reason may not be something we like to hear, and we'll have to figure out how to fix it, but that's what makes it interesting. So let's unskip this one, skip the other, and run the tests again. This one might take longer, because there are two generate calls per case: the agent call you're making, plus the metric's own call to the judge. I probably should have added a progress bar. And because we're using o3-mini there's some advanced reasoning going on, so it's not the fastest model; we can try a different model after this run. It is very common to want the best models for your evals, but you're probably thinking as you watch this: won't this get expensive? There are a lot of LLM calls just to verify things are working. To be quite honest, yes. Depending on your situation, you have to weigh how important quality and accuracy are versus the cost of running these tests, and that determines how often you actually run them: every time you commit code, once a day, once a week. You have to be really thoughtful about that. And they can time out, as we just saw; I can add a timeout too. But here's the thing: this example is trivial, it's just showing things off. If you had to test a real agent that does a lot of work, how much are you willing to pay for testing, if every run costs, say, five bucks? We have an internal agent we run evals on, and one day it went off the rails and we were paying $5 every 30 minutes without knowing where it was coming from. We were asking, who's spending all this money? Oh, it was the evals.
I just say this to keep you all informed; poking the thought, as I like to say. Let's get back to the results and go through the questions and answers. "How do I reset my factory settings?" Obviously it made the answer up again, but that's fine, and the judge says the answer is relevant, because based on the question it is; I'd agree with that one. Is it valuable? I don't know. "When will the new Tech Gadget be released?" scored above 50%, which isn't bad; the scale is up to you, but above 50% in this trivial example is probably okay. We could keep going on this, but it's not that valuable, so let's do one more, even more interesting, example of off-the-shelf evals: hallucination.
This one's going to be a treat. The mechanics are the same: hallucination is a metric you get from our evals library, and you use the same type of model. In this case the inputs include the context of what your agent should know; here I've just put in a bunch of text as different key facts. The thing is, agents are known to hallucinate a lot, and that's why context is so important: the more context you have, the less it will hallucinate, so you have to give your agents guardrails around what they can do. But is it really valuable to test for hallucinations? Potentially, but it comes down to how they're scored, and my point up to now is that off-the-shelf evals are trash because, as I keep saying, everything is "it depends." So let's test this hallucination one and see how well it actually does. Spoiler alert: I found a bug in this off-the-shelf eval. We got the logic from a paper, so maybe we just didn't implement it properly, but I'll show it to you. Everything is set up the same way: create the hallucination metric, pass it a model, give it the inputs it needs, and the test looks the same as the others.
That's how you'd test it, so let's run it. Same kind of deal: you generate the response from your agent, pass it into measure, and get inputs, outputs, scores, and reasons. As much as I'm hating on these evals, the mechanics are super easy, so once you start doing your own evals, at least you have good mechanics to push forward with. This will take a while again, so Shane, add some color while we wait; I probably won't wait for all of them.
Sure. This goes back to the point that off-the-shelf evals can be important if you know what you're actually measuring. At the end of the day, think about the business results you want to accomplish and how to measure them; some of these metrics might fit into that, and especially over time they can tell you that as you improve models, change your agents' system prompts, or add more data through RAG, things aren't going backwards. But there's a tendency to see all these off-the-shelf evals and add every one of them; it'd be pretty easy to grab them all, and then you almost have too much data, which is almost worse than no data at all, because it's hard to make decisions when you're flooded with information. So be selective: have a real reason for each eval and know what it's trying to measure. Then it's much more meaningful to whatever business results you're actually trying to drive, because my guess is that if you're here, you're building something and want to measure whether it's working, and you don't want to spend time on information that won't actually help you accomplish that. Did I do a good enough job stalling, Obby?
Yeah. I also switched the model to GPT-4o because I was too lazy to wait; point across. As you can see it's running, so let's go through one of these. The question was: "How has Quantum Tech's technology evolved since their first quantum computer?"
Now, once again, this is a question that goes outside the context it was given; we gave it some really janky context, just some basic facts, so it shouldn't answer beyond that. And here's where there's room for disagreement: if I gave you this list of context and the answer didn't say anything wrong about that list, is that a hallucination? But what if it adds things that weren't in the context: is that hallucinating? We often see this in our own lives using Claude, or Windsurf, or Cursor, or any of these tools: you can fact-check the things you know, but sometimes you don't know whether what it's telling you is wrong. So hallucination tests are good, but you have to have a lot of context. As we can see, this one was marked good because it didn't say anything that contradicted the context, but it also added stuff about scalability and a roadmap that I never said; it doesn't have context for that. So this could be a false positive, and that's the bug I found in this hallucination metric: you have to make sure it only looks at the context. But even with that bug resolved, is this a valuable eval? I don't know, it depends, and that's the worst answer in software engineering. Okay, let's move on to something maybe more valuable: doing it yourself and making your own metrics.
doing it yourself: making your own metrics. So, LLM-as-judge. What does that mean in practice? At least in the Mastra framework, we have a primitive called the Mastra agent judge. Kind of wordy, but kind of cool-sounding if you ask me. What it is is just a Mastra agent whose job is to evaluate anything you want. Usually when people say LLM-as-judge, they don't really consider that you can manipulate this thing as much as you want, just like any other agent you're building. And what we realized, and this is the negative and the positive of all this, is that you will probably spend a lot of time building your judge. It maybe won't be as expensive as a human judge, but you'll put a lot of time and effort into it, because that's truly the thing that will evaluate whether you're doing well. Another point about judges before we continue: a judge's purpose is to evaluate things and give you reasons for why certain responses were good and why certain responses were bad. The score is not some trophy, right? The score is there so you can take the responses that were really good and put them back into your system prompt. That way you're adding the context we've been talking about: every good example goes back into the system prompt, and even bad examples go back into the system prompt, or any other prompting you're going to do. Because as we've learned, the more context and knowledge you give this thing, the better it performs. The problem is that if you're starting at time zero, you have no context or knowledge from anybody, and you can't make it up. I mean, we always try, right? That's what testing with synthetic data is all about. But if you can get it from your users, and you have an iteration cycle where you're judging stuff, or maybe a human expert is judging, it all goes back in. It's a circle; it goes right back into the agent. Okay, I'll get off my high horse and let's keep
going. So, in the same way as agents, judges have system prompts as well. We call them instructions, but it's a system prompt. For my example I made a customer quality judge: what it's supposed to do is judge the responses a customer support bot gives to a person needing help. Once again, not the craziest example, but it's a test harness, right? It gets you into an iteration loop. You could follow what I've done (I published the code, it's on GitHub), put your own stuff in there, and get into an iteration loop with your own code and see where it takes you. So my judge here is judging responses based on correctness, and here's a tip: do not use numerical scores as the way you judge something. It's funny, because LLMs don't really do well with numerical assignments, which is weird. That's from my experience and from what I've been reading, so don't take it as absolute truth, but the judge also does better when you use textual representations. For example, I made a rubric of poor, good, excellent. To me those mean numbers, because I'm a human and I like numbers, but to an LLM, text is better, so poor, good, and excellent will in theory be a better assignment. I only learned that from doing this and failing, from it actually judging things wrong, and then having to think about it. Once again, you're going to be working on your judge a lot. So for correctness, poor/good/excellent, I gave it examples of what a poor response looks like, a good one, and an excellent one. This can evolve over time as I collect other poor examples, and that will come back into my judge and make it better. Completeness is all cookie-cutter stuff, but it's cool, and clarity is the same kind of deal. All right, that's the prompt.
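As a concrete illustration of the textual-rubric idea, here's a minimal plain-TypeScript sketch. The criterion names follow the talk, but the example texts and the grade-to-number mapping are made up for illustration: the judge grades in words, and numbers only come back in afterwards when you need to aggregate.

```typescript
// LLMs tend to grade more reliably against labels like "poor"/"good"/
// "excellent" than against raw numbers, so the rubric is expressed as text.
type Grade = "poor" | "good" | "excellent";

interface Criterion {
  name: string;
  description: string;
  examples: Record<Grade, string>; // one example answer per grade
}

const rubric: Criterion[] = [
  {
    name: "correctness",
    description: "Is the answer factually right for the customer's question?",
    examples: {
      poor: "Quotes a refund policy that contradicts the docs.",
      good: "States the right policy but omits the 30-day deadline.",
      excellent: "States the right policy, including the 30-day deadline.",
    },
  },
  // completeness and clarity would follow the same shape
];

// Map textual grades back to numbers only when you aggregate or chart.
const gradeToScore: Record<Grade, number> = { poor: 0, good: 0.5, excellent: 1 };

function average(grades: Grade[]): number {
  return grades.reduce((sum, g) => sum + gradeToScore[g], 0) / grades.length;
}
```

The grade examples per criterion are exactly the "here's what poor/good/excellent looks like" examples the judge prompt carries, and they can grow over time as bad cases are discovered.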
So how do you set up a judge? You can extend the agent judge, because the agent judge is what runs all the off-the-shelf evals; if you want to do your own thing, you just extend it and pass your model in. There are two methods you need to implement. Evaluate takes the input, such as the customer query, and the agent's response, and this is where it's up to you: how do you evaluate? You use your agent and make a quality-evaluation prompt, something like "please evaluate these things," and then, based on the system prompt where I established my grading system, I want it to return the result in a specific structure. Then, using our agent generate function through Mastra, you can assemble this into an object. (I don't have TypeScript checking on, so please ignore that, but it'll run.) And then there are some other methods; this is just JavaScript, so I made helper methods, the way I like to write code. The only required things I had to implement were evaluate and getReason. As I was mentioning earlier, scores and reasons are the things that keep you from going insane, so you also need a reasoning prompt: explain the reasoning, explain the structure, this is how I want you to do it, what's the overall score, give me some recommendations on how to improve. These all come from the previous evaluation, and then, given all the evaluation data, it needs to turn it into something I can reason about. Once again, you have to iterate a lot here too, and it depends on what kind of team you're on. The wording is very much a personal thing; we can recommend what we like to do, but everyone has preferences. Lastly, we do the same kind of structuring, and I added some more methods because I like that. We'll come back for questions, but let's look at this in the wild, in my actual test.
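To make the shape of those two required methods concrete, here is a plain-TypeScript sketch of the pattern. The class and helper names are made up, and the model call is stubbed; in Mastra the judge would extend the agent-judge primitive and these would be async calls through the agent's generate function.

```typescript
type Grade = "poor" | "good" | "excellent";

interface Evaluation {
  correctness: Grade;
  completeness: Grade;
  clarity: Grade;
}

// Stand-in for a structured-output call to the judge's model. In the real
// judge this would be an async agent.generate call returning this shape.
function callJudgeModel(prompt: string): Evaluation {
  return { correctness: "good", completeness: "excellent", clarity: "good" };
}

class CustomerQualityJudge {
  // evaluate: build a prompt from the customer query and the bot's response,
  // then ask the judge model to grade it against the rubric in the
  // judge's system prompt.
  evaluate(query: string, response: string): Evaluation {
    const prompt =
      "Evaluate this support response against the rubric.\n" +
      `Query: ${query}\nResponse: ${response}`;
    return callJudgeModel(prompt);
  }

  // getReason: turn raw grades into an explanation the team can act on;
  // this is what keeps the score from being just a trophy.
  getReason(e: Evaluation): string {
    return `correctness: ${e.correctness}, completeness: ${e.completeness}, clarity: ${e.clarity}`;
  }
}

const judge = new CustomerQualityJudge();
const verdict = judge.evaluate(
  "How do I reset my password?",
  "Click 'Forgot password' on the login page."
);
```

The real version also carries the reasoning prompt and the structured-output schema; this sketch only shows how evaluate and getReason divide the work.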
So the judge is one part, but judges can produce multiple metrics if you want. I made a one-to-one metric for this judge, but you could technically make multiple metrics from a single judge. The cool thing about Mastra is that everything is very extendable and JavaScript-friendly, so if you're really good at JavaScript, or understand the language well, you can do whatever you want, because you can just extend Metric. There are required methods to implement, like measure, which we were looking at before, and that's pretty much all you have to implement. So in our case: we evaluate, we are a judge, we get the scores. I like helper methods that help me do stuff, so I get the reason from our getReason method, and then I return it here. And I could build any number of metrics from a singular judge. It's up to you; I think it's pretty cool if you want to explore that.
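Here is a hedged sketch of that one-judge-to-one-metric wiring in plain TypeScript. The names are illustrative and the judge is stubbed; in Mastra the metric would extend the framework's Metric class and measure would be async.

```typescript
type Grade = "poor" | "good" | "excellent";
const gradeToScore: Record<Grade, number> = { poor: 0, good: 0.5, excellent: 1 };

// Stand-in for the judge built earlier: grade a response and say why.
// In Mastra this would be an async call into your agent judge.
function judgeResponse(input: string, output: string): { grade: Grade; reason: string } {
  return { grade: "good", reason: "Correct but missing one detail." };
}

// measure is the one required method: take the input/output pair, delegate
// to the judge, and return a numeric score plus the judge's reasoning.
class CustomerQualityMetric {
  measure(input: string, output: string): { score: number; info: { reason: string } } {
    const { grade, reason } = judgeResponse(input, output);
    return { score: gradeToScore[grade], info: { reason } };
  }
}

const metric = new CustomerQualityMetric();
const qualityResult = metric.measure("Where is my order?", "It shipped yesterday.");
```

A second metric, say completeness, would delegate to the same judge but pull a different grade out of the evaluation, which is the multiple-metrics-per-judge idea.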
Just one caveat: don't do a lot of different types of evaluations in one judge, because your mileage may vary. Okay, let me wrap this all up real quick. We have our quality metric test, same as everything else I was showing you: test cases for each one, and we measure them. Sorry, I'm going fast because we're running out of time, but it's just metric.measure. I'll run it. What was the last one we ran? This one; let's run it. What's cool, since I wrote this, is that I created the rubric, right? Seeing your rubric actually get implemented makes you feel some type of way. The other thing: if you're looking at that and thinking, wow, that was a lot of code just to have some LLM judge the outputs of these tests, it kind of is. But think of it this way: over time you can rely less on the human expert providing feedback, which takes a lot of time, can be expensive, and is really challenging to get sometimes, and rely more on the LLM to do more of it. Again, I don't think it'll ever completely replace the human, because you want that human feedback in the loop, but it is worth spending a little time improving your judge, because you can automate a lot of stuff you otherwise couldn't. As you can see from the results here, it broke down the response and gave me the reasons I was looking for: I was judging on correctness, plus examples of how to improve, completeness, etc. So it's cool; it's what I wanted. Now the question is, do I agree, as the human expert? I'm the one who created this monster. Do I agree with its reasoning? I do, personally, but it's up to you, and if you don't, then you've got to go tweak it, right? This is like Frankenstein's monster about to go judge all your stuff, so you
probably should get it tuned up a little bit. Okay, so finally, we have all these metrics, and I'm going to put them all together into a customer support agent. For the sake of time, once again, this is all the stuff we just went through. So this is the next part: when we put it all together, you can run metrics and evals in test files, which is cool, but that's probably for emergency scenarios, where you're testing for failures, not successes. What we think is that you should just be eval-ing in production, as things are happening. The way you do that is you pass your evals to your agent, like this, and for any generate call that happens, in the background you will be collecting evals, with whatever metrics you want.
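The "pass your evals to your agent" wiring looks roughly like this. This is a sketch based on the pattern described in the workshop: the import paths, the model helper, and the metric name are assumptions here, so check the current Mastra docs for the exact API.

```typescript
// Assumed import paths; verify against the Mastra docs for your version.
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
import { ToneConsistencyMetric } from "@mastra/evals/nlp";

export const supportAgent = new Agent({
  name: "quantum-support",
  instructions: "You are a customer support agent for QuantumTech.",
  model: openai("gpt-4o"),
  // With evals registered, each generate call gets scored in the background,
  // and the results show up in the playground / tracing UI.
  evals: {
    tone: new ToneConsistencyMetric(),
    // hallucination, answer relevancy, and the custom quality metric
    // from earlier would be registered here the same way.
  },
});
```

Nothing else changes about how the agent is called; the scoring just happens alongside normal traffic.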
I'll show you a demo of this. This starts the Mastra playground, which is dope, by the way. You can see my QuantumTech customer support agent right here; let's chat with it. What if I ask it something like "when QuantumTech..."? Cool. Obviously I didn't know what to expect there, but let's see how it evaled across all the evals we set up. So, in Mastra Dev and in Mastra Cloud, spoiler alert, you can do tracing. One second, I think I have to refresh this. Okay, here it is; this is our tracing UI. That was the stream I just did (there's a little lag on my side), and we can see everything that happened. Also, my agent has memory enabled; I added Mastra memory. We're going to have a workshop on that, so this might be unclear to you now, but through this you can see everything that gets called, and this is what goes into the database. What's cool about tracing is that sometimes you're evaluating whether an agent did something or not, and it could claim it did. It could say, oh yeah, I looked up the menu at the restaurant and they're having chicken piccata today, and then you look at the trace and that fool didn't even call the restaurant API; it just made it up, right? So that's one line of defense: as a human, you can look at traces. But you can also look at your evals after the run has happened. As you can see: this one's tone, this one's hallucination, this one's customer quality, and you can see the input, the output that came back, and the score it got. This is cool for many reasons. Hamel makes a joke that people spend too much time making dashboards for their evals, so we just gave you one so you don't have to build it yourself. This allows you to track it
over time; I can run more prompts and we can start seeing trends. Another cool thing, the last one, which I'll wrap up with: in our agent we have this instructions section, and here you can do prompt enhancement, which takes your eval results and runs them through a prompt judge we've written that will improve your prompt. So I'm just going to blindly take whatever evals we had and enhance this prompt; it's going to pull all the evals in, and we'll just see what happens, and then the workshop will open for questions. As you can see, it changed the prompt. Whether you think that's a good prompt or not is up to you; I do, personally, so I'm going to use it, but that doesn't guarantee it will score better on the evals; we don't know yet. In our version history here, I can see this is the newest version, so I'll make it active, and you can see the analysis. Like I was saying, we have reasons here too, right? It's a prompt judge, so it gives you a reason for why it made that choice. If you agree, you agree. Okay, cool, so now let's use it. Refresh the page, and you can see our instructions have changed. "When Quant..." I mean, I'm not making the question any better, but we'll see. Okay, the evals should come in shortly; there should be about two of them. Tone consistency came right away because it's fast, right? It doesn't talk to an agent. But if I keep spamming refresh... okay, now we have more, and answer relevancy will come in last. I didn't look at the scores before, but I know my metric, the one that I wrote, has gotten better, so it must be because of me. Hallucination is still zero, which is good. Tone consistency, I don't know if it went up, but if you were watching and paying close attention you would probably know. Actually, I can check. Duh, these are average scores: we actually got worse on some of them somehow. That's the frustrating thing about all this, but in some ways we got better, and in the thing we actually cared about, we got a lot better. That's why off-the-shelf evals are trash in some ways. And that's it; that's all I will share. We'll open up for questions. Yeah, we'll go through just a couple of quick slides and then we'll maybe have time for a few questions, so feel free to drop some in the chat, and we'll
try to get to them if we have time. Let me share my screen quick. Okay, so where do we go from here? Just to reiterate, all the slides will be shared, the GitHub repo with Obby's code will be shared, and we'll make sure we share the recording as well, so you'll get an email in a couple of hours at whatever address you signed up with, and you'll be able to see all of this. All right, you can see the slides, Obby? We're good. So, we have the evals docs; they give you an overview of evals and how you can use them if you're using Mastra. Here's some further reading; I'd consider these foundational posts on evals. If you went through this and it felt like we were moving really fast, or it was way over your head, everyone's at a different point in learning all this stuff, so I'd refer you to some of these posts. To be honest, even some of these were over my head when I first started reading them, so just know it takes time to figure this stuff out. At some level evals are just tests, but they're different enough that they can seem very confusing. So again: evals are tests, but they do require some time and effort to learn. Hopefully you all enjoyed this. I'm asking you to help us out: if you can go to our GitHub and star it, if you haven't already, that would be fantastic. We're trying to get to 13,000 stars, as Obby said, which would give Mastra more stars than, I think, any other related JavaScript AI project out there right now. We also have a waitlist for Mastra Cloud; it's mastra.ai slash waitlist, or slash cloud-beta, I think either way works. You can join it and we'll give you early access to our cloud product, which we're releasing very soon. And, which I didn't know was going to happen today, we have a launch on Y Combinator's Launches page, so if you want to go upvote that, it would also be very appreciated. Obby, maybe you can drop that in the chat if we have it handy somewhere. With that, here are some links: there's the GitHub, here's our website, and join our Discord. If you're getting into this stuff and have questions, we have a lot of people in our Discord helping each other out. We're there almost every day, and a lot of other people from the team are there too, so it's a good place to get support and help if you get stuck along the way. We have more events coming up: we'll be talking about RAG, how to build your first agent, and eventually we'll have sessions on memory and tools and all of these AI concepts. And then connect with us on Twitter, find us on LinkedIn as well,
and connect with us there. And yeah, any questions? I see, Obby, you mentioned a few. Yeah, I'll take one that wasn't answered: what is AI ops, and are evals considered a part of it? AI ops is a made-up word for DevOps with AI on the side, so think DevOps plus AI stuff. On a traditional engineering team, who would be responsible for paging and all of that when an eval is going wrong, or when your agent starts going off the rails and all your evals start trending in a way you don't want? Eval responses and scores are just another data point a DevOps person would be monitoring, so it's just DevOps. Good question. And then there are a few others: how to estimate the LLM costs? That's a good question, and it's going to be really hard, because it depends. One, it depends on your application; a lot of your cost is going to be driven by user usage. If you have an application that's more enterprise, it's only for a few users, and you don't have that much usage, you're not going to pay that much in LLM costs. I'm assuming you're talking about the actual LLM costs, not the cost of using Mastra. But if you have something with very high usage, if you're building ChatGPT or Perplexity or something like that, you're going to have a ton of costs, which could be a good thing as long as the economics work and you're making more from your users than it's costing you. There is definitely a lot of math that goes into it. I would look at the LLM models; they publish costs per token, and you can start to do this math, but it is a calculation and it's not easy to understand today. It will get easier over time; there are more and more tools coming out to help you calculate costs, but just know you're going to have to spend some time in a spreadsheet, think through some of those things, and model some of the possibilities out. If you're asking about actual Mastra costs, the open-source product is free; you can use it and deploy it anywhere, and our cloud product is going to be pretty cheap, so you won't have a lot of costs there. Most of your costs will be the LLM. The Mastra basics workshop we did? We can send that out; we'll put it in the chat. And here's the calendar; you can subscribe to it to get all the events. Next Thursday we're going to run "Build and Deploy Your First Agent," so you can learn how to build a very simple Mastra agent and actually get it deployed. All right everybody, I appreciate you all coming today. Find us if you have questions; we're around, and we hope to see you at a future event. Have a great day.
Peace.

