Master AI Evaluation: Build and Run Evals with Mastra
2025 is seeing explosive growth in AI applications, but how do you know if they're actually performing well? This hands-on workshop will teach you how to build and run comprehensive evaluation frameworks for your AI systems.
Evaluating AI systems is crucial for ensuring reliability, safety, and performance at scale. Join Mastra.ai to learn practical strategies for implementing evals that give you confidence in your AI deployments. You'll learn how to use evals to assess your AI application and AI agent capabilities, detect potential issues, and maintain high standards of quality.
Get hands-on experience with essential eval strategies including:
Implementing LLM-as-judge evaluation frameworks
Setting up automated evaluation pipelines
Creating targeted test cases for your specific use cases
Monitoring and analyzing eval results through AI ops
This workshop is perfect for anyone building an AI Agent or AI application. Basic familiarity with JavaScript is recommended, and participants should have a code editor ready. You'll walk away with working eval implementations and practical knowledge you can immediately apply to your AI projects.
Don't just deploy AI—deploy it with confidence. Join us for this practical, hands-on session where you'll build real evaluation frameworks that you can start using right away.
Workshop Transcript
Hey everyone, thanks for joining us today. Going to get started real soon. Just got to figure out how to live stream this um to where it needs to be. So
um yeah, just give me a sec. I'll figure it out. While we're waiting for a bit, why don't you uh enter in the chat where you're from, where you're calling from or you're joining from. I mean, um I see some people already saying what up.
Hello. I am calling from or doing this stream from Europe actually. Usually I do this in San Francisco, but I'm actually in Belgium right now. Uh visiting Ward if you all know him. And tomorrow, and I
don't know if y'all came to the last workshop, that was the workflows one. Uh, tomorrow headed over to Tony in Greece because he's getting married in case anyone wanted to know. Oh, we got people from Finland. I'm going there
soon. Got people from Colombia. Wow. Gal, hi from Munich. Sick. Hey, another
Belgium guy. Donavvel. Um, Phoenix, Texas. Washington DC.
Oakland. Congrats to Tony. Thank you. I guess I'll let him know what she said.
Chicago, San Diego. Hell yeah. Oo, Canada. Central New York. Toronto,
India. Man, we're going to have a great worldwide workshop today. I might have to give up on the live stream because I don't have permission to live stream on Zoom. I don't know if
Shane did that on purpose, but uh maybe he did. Let's see. Um, we'll start in two minutes if not. So, I apologize for the delay here.
I say we get this show on the road. Maybe I can cancel the live stream that was happening and unschedule it or something. I don't know. Whatever. We live stream every day. So if one thing's
messed up, what can you do? Let's unschedule it maybe. Well, I say we get the show on the road. Okay,
forget about it. Let me just unschedule this so there's no confusion. Just gonna go with it. All right,
cool. So, yeah, welcome to the workshop. Today, we're going to talk about evals.
I have some slides, I have some code, and we're going to write some code and draw some diagrams and stuff. I want to demystify evals for everyone. Honestly, it's a very hot topic, but funnily enough, not a lot of people actually do them. So maybe the people
here will learn something and then you know they'll start doing it. So I'll share my screen and we shall get to it. All right, let me do this.
And here we go. Oops. Should start at the beginning. My bad. Okay. So, yeah, welcome. If this is
your first Mastra workshop, welcome — I'll give a slight introduction. Evals is actually a very advanced topic, so if this is your first workshop, you should stick around, of course, but there are also intro workshops. Like I said earlier, we do live streams every day. They're really fun, and you can come ask questions — maybe that's another avenue for y'all to learn. I am solo today because, if anyone didn't know, there's the AI.Engineer conference in San Francisco, which has been epic. Sam is over there doing his thing, and Shane is taking some well-deserved PTO hanging out with his family, but the show must go on as always, so I'm here to teach everyone. So yeah, we're Mastra — we are the TypeScript framework for building AI agents. And our workshop goal today is: build and run evals with Mastra. Pretty simple, I would say. I'm going to check on the chat and stuff while we go, because I'm solo, so I have to see what's going on. I already see a question here: where are the live streams? They're on Twitter and such if you follow us — we'll get to that in a second. So yeah, I'm kind of balancing both, but it's all good. So,
what are we going to learn today? We're going to get an overview of evals in AI applications. I am very opinionated, so you're going to hear some of my opinions, and hopefully you agree with them. We're going to talk about off-the-shelf evals and how you can use them — are they worth using? We're going to talk about how to use LLM-as-a-judge, which is kind of a strategy in your evals. We're going to talk about how to write custom evals that are connected to business results, and then how you can build and run these as live test suites and things like that. And we'll talk about whether to do this in CI or not — that's another big question people ask. So, who am I? I'm Abhi. I'm one of the founders and CTO of Mastra. Previously I worked at a company called Gatsby — Gatsby.js — where I was a principal engineer. We got acquired by Netlify, so I also worked at Netlify and made moves over there as well. Fun fact about me: I've never seen snow fall from the sky. I was born and raised in Southern California, and as Southern Californians do, we go to the mountains where there's fake snow, so I still haven't seen it. I originally wanted 13K stars, but we're actually on our way to 14K. If you want to connect with me, you can find me at Abhi, and we do the live streams like I said — they're streamed from both the Mastra account, my account, and Shane's account; we just do a simultaneous stream. So there's always a way to interact with us every day, and then we do the workshops as well. You can also reach out on LinkedIn if you want. I'm not a LinkedIn guy, but a lot of people are, and a lot of people came because I was inviting people on LinkedIn, so that's a really cool place to connect, too. All right. So, the question is: do you
need evals? Well, you know, from Greg Brockman: "evals are surprisingly often all you need." I don't know about that 100%, but you do need some type of eval. Let's actually walk through this little diagram here. In all the software engineering we do, I think everyone is trying to get to a positive feedback loop, or to adjust when the feedback loop is negative. (Sorry, the chat just distracted me — I'll check it periodically instead of getting distracted by it.) We are always trying to have a positive reinforcement loop, right? That's at least how we can move forward and feel good about engineering. For example, when we all started learning software development, I'm sure it was hard in some cases — when there's a bug, you don't really know where to go, because you don't have a quick loop to get you on the right path. Front-end stuff in the beginning was really hard, but it had the quickest eval loop, because you could refresh the browser, the CSS changed, and you felt like, oh snap, I'm doing something — when you change the color to blue, you can visually see it. In the back-end world, unit tests are a good eval loop: you write code, you run the test, it fails, and that negative signal lets you write more code and improve it, right? That's the whole loop. With agents, though — in this whole AI space in general — things are not deterministic, so you don't necessarily know if you're doing something right. It may work for you but not for your friend, or it works on my machine but not in prod, because the prompts are different, or the inputs are different, etc. You can often see this already when you're using Windsurf or Cursor or Copilot, or even Lovable versus Bolt.new versus whatever — people are having different experiences using the same tools, and that's just an interesting point, right? So if everybody's having a different experience and you can't really test for 100% certainty, what are you supposed to do? Technically, if you don't do anything about this, you can release something that will probably just not
work, you know — or maybe it works for a bit, the model changes, and then you get screwed again, right? So, let's walk through this. You have these LLM invocations, and every invocation of the LLM emits traces. If you're using Mastra, for sure — I can't really speak for everything else, but most frameworks do, right? If you're doing this on your own, though, you're going to have to figure out how to make sure you have traces through the whole system. Now, traces help you understand what happened in an execution. For an LLM, the same prompt may or may not
yield the same execution timeline — that's why tracing is very important. Logging and tracing: logging is more for me, I think — I like to see the things that are happening as they happen — while traces decorate what happened with telemetry data (we use OpenTelemetry). Unit tests are not the same as the unit tests we may all know: in this world, unit tests test behaviors, but even then — and this is a thing we're going to get back to — you're not testing for success, you're testing for absolute failure. For example, say you write a unit test that asserts the score should be 70%. Ten runs come in at 70%, but one run scores 69.5. Technically that is a failure in our old unit-testing world of how we think about things. But in this world, that's not a failure, because success is a range of values. So maybe a unit test for exact success is not the right thing.
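To make that concrete, here's a minimal sketch of what testing for a failure floor can look like — the scoring heuristic here is made up purely for illustration, standing in for whatever metric you actually use:

```ts
import { describe, it, expect } from 'vitest';

// Hypothetical heuristic scorer standing in for a real eval metric:
// it just checks how many expected terms show up in the response.
function scoreResponse(response: string, expectedTerms: string[]): number {
  const hits = expectedTerms.filter((term) =>
    response.toLowerCase().includes(term.toLowerCase()),
  );
  return hits.length / expectedTerms.length;
}

describe('weather agent output', () => {
  it('never falls below the failure floor', () => {
    const response = 'The weather in Paris today is 72°F and sunny.';
    const score = scoreResponse(response, ['Paris', '°F']);

    // Not expect(score).toBe(0.7) — success here is a range of values.
    // Assert a floor instead: below it, something has gravely gone wrong.
    expect(score).toBeGreaterThanOrEqual(0.5);
  });
});
```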
So essentially, you want to automate these things over time and create a way to evaluate your agent and what it does. And then, even past that, once you have enough data — hundreds of thousands of data points — you might be able to make specialized models, called SLMs (small language models), if that's what you need to do, or you can do fine-tuning to make this whole thing better.
There are just so many things you can do once you have eval data, but most of us aren't there yet — because how many people have 100k data points already? And when you're just starting your project, what are you supposed to do, right? So this is something that is important, but it's hard to see the importance if you don't have a good loop to do it. So, what is the goal of evals? It's to improve your AI application's results over time. And "over time" is the key phrase here, because you're not getting it on day one. But then when should you do it, if not on day one? There's this whole question: should I be doing evals? I heard they're not really good until you have enough data, etc., etc. So I'll walk through why you should do them — at least, that's my opinion: you should do them, and you should figure out how to get to a quick feedback cycle. Then you can start iterating. So, there are different types of evals.
Unit tests: you're doing input/output and verifying responses over a range of scores, or testing for absolute failure — in the sense that you're just testing for the danger case (I'll get into that more). Metrics-based evals — we typically like to call them code evals — take no reasoning from an LLM; it's a purely quantitative or qualitative measurement. You're writing a function over the inputs and outputs and creating the heuristic yourself, and that is totally fine; we've been doing that for many years now, right? Lastly, the interesting one: judges, or what some people call evaluators. These provide feedback and actually give you a grade, and given that, you can do a lot of things. Right now, there are two types of judges. You can have a human judge, which is usually a user or an expert — we'll talk about why that's important — or you can have an LLM as a judge, where you use reasoning models to evaluate the inputs and outputs of the response. Pretty cool stuff — hard to grasp in the beginning, but definitely a pillar of this AI application stuff that we do. All right, let's keep going.
So, unit testing: these end up being very similar to software tests. You could even generate test cases using what we call synthetic data — aka just making it up — and you can test different cases, very much like how we write unit tests today. Sometimes a bug happens in production and you take that failure case into your test suite, so you know that one won't happen again. It's a little harder in this world, but it's still the same concept — those principles don't change. Your pass rate is super variable because LLMs are non-deterministic, so you have to set the bar on what is okay for your application. Some applications are okay with a passing rate of 60%, and that's totally fine — totally chill. Some are even more stringent: their evals are way rougher, but their system prompts are a lot more structured, etc., and they want a passing rate of 70%, 80%, and so on. I just don't know if you can get to 100%, truly. But I guess we'll find out over the next couple of months and years — or maybe AGI will just solve all of our problems, so who knows. You
can typically run this in CI just like anything else. It is expensive, though — especially if you're using an LLM to generate test cases — and if you're doing this on every pull request, think about the token use. This is different in some ways: it's not necessarily free; you have to pay for compute and tokens if that's what you're doing. But you can also do this differently: because an agent is, for us, made up of workflows, tools, networks, memory, etc., you can take those pieces of the puzzle and unit test them in the traditional software sense. The LLM is going to execute that tool or workflow or whatever, but you can test those pieces in isolation, much like we've always done. Cool. Keep going here. So here are some
examples. So you have a task like get the weather and your sample inputs, right? Your inputs are what is the weather in X city? Can you tell me the weather?
"Weather in…" — and sometimes you spell weather wrong, like w-h-e-t-h-e-r, in a city. These are all supposed to be different inputs that you could possibly imagine from your user. And your test is: well, I'm expecting the LLM to format the response to contain the city name. Uh oh, my screen sharing was paused. I
don't know why. What happened? Uh, what happened there? Sorry. Give me a second here.
I don't know why it was paused. You can still see it. Okay, that's great. Sorry if anyone in the chat is my screen share paused cuz Zoom just like
zoomed out of it. So my bad if it is. I don't know why.
Looks fine. Okay, that's weird. The cursor is not moving. Okay, let me
stop sharing real quick and share again. Okay — and are we back? If you can, just give me a signal in the chat. Do you see my cursor moving? All good. Okay. Sorry about that. I don't
know what happened, but cool, let me get back into this. So, I was saying that the response may need to be: hey, I want the city name in the response, and I want to know that my weather tool was called, right? I need to know if the LLM executed the weather tool. I want to make sure the tool — the API — had a status of 200, and I want to know that it's going to be in Fahrenheit versus Celsius, right? Cool. So, that's an example.
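As a rough sketch of what those checks might look like as a test — the run shape and the `runWeatherAgent` helper here are hypothetical, not a specific Mastra API:

```ts
import { describe, it, expect } from 'vitest';

// Hypothetical shape of a captured run — adapt this to whatever your traces give you.
interface WeatherRun {
  text: string;
  toolCalls: { name: string; status: number }[];
}

// Assumed helper wrapping your agent; stubbed here so the sketch stands alone.
// In a real suite this would call agent.generate(query) and read the trace.
async function runWeatherAgent(query: string): Promise<WeatherRun> {
  return {
    text: 'The weather in Paris is 72°F and sunny.',
    toolCalls: [{ name: 'weatherTool', status: 200 }],
  };
}

const inputs = [
  'What is the weather in Paris?',
  'Can you tell me the weather in Paris?',
  'whether in Paris', // users misspell things — test for that too
];

describe('get-the-weather task', () => {
  it.each(inputs)('handles "%s"', async (query) => {
    const run = await runWeatherAgent(query);
    const weatherCall = run.toolCalls.find((call) => call.name === 'weatherTool');

    expect(run.text).toContain('Paris'); // city name shows up in the response
    expect(weatherCall).toBeDefined(); // the LLM actually executed the weather tool
    expect(weatherCall?.status).toBe(200); // the underlying API call succeeded
    expect(run.text).toMatch(/°F|Fahrenheit/); // Fahrenheit, not Celsius
  });
});
```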
Another, more complicated example is: research a sales lead. Now, the types of inputs for this: "Where did John Smith work before Apple?" Or someone just types in "Amazon," or a company name — these are all realistic things users do. "When did Jane Smith start at Google?" That's cool. If someone just puts in a year, that's interesting, right? We can all imagine this, because we've been software engineers — we know users do whatever they want. So it's kind of hard to come up with these inputs, which is why synthetic data is interesting. And sometimes you can just take the inputs coming into your production application and turn them into tests as well, much like we do with errors and stuff. And then the same type of result examples: you want to make sure that if John Smith worked at this place, the answer has factual data, or correctness. If you're going to search a person or a company, like I just said, you want to make sure the tool calls are good — that they actually get executed when you want them to. And sometimes it's really hard to get an LLM to call a tool when you desperately need it, so these types of tests ahead of time are good. I'm going to keep going — I also need to speed up because we've had a lot of technical difficulties, so I'll try to go a little faster. So, tracing. Why is
tracing important? Much like in other software applications, traces allow you to see the input and output of everything that's happening. This makes it easier to debug when things go wrong. Also, it's kind of cool to see if you're
doing a very complex task, it's pretty cool to see all the things the LLM does — it kind of seems like magic. Okay, metrics. These are the different types of things you can test on.
In Mastra, we support a lot of metrics that you can just import and use. They can be quantitative: if you're trying to do some type of scoring on the response — say you have a structured object and you want to take the keys and run some function over them — you totally can. It's just code. There's also NLP, natural language processing, which is kind of the predecessor to all this: if you want to score tone or other kinds of positive or negative connotations, you can totally do that. But you've got to be selective about what you're doing. There is no silver bullet where you import a thing and it just always works — it's really based on what you're building. Maybe within the same category of applications — say you're building a support bot — there are some shared evals or metrics that are relevant across those application types. And the most important thing is to monitor: you need to know where you're scoring today, and then as users come and volume comes, you've got to see how the score changes over time. This is very much like a scientific experiment, you know.
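Since a code metric really is just a function over the input and output, here's a minimal sketch of a custom one — assuming the `Metric` base class from `@mastra/core/eval`; the keyword heuristic itself is made up for illustration:

```ts
import { Metric, type MetricResult } from '@mastra/core/eval';

// A purely quantitative, no-LLM metric: what fraction of required
// keywords appear in the response. You define the heuristic yourself.
export class KeywordCoverageMetric extends Metric {
  constructor(private required: string[]) {
    super();
  }

  async measure(input: string, output: string): Promise<MetricResult> {
    const found = this.required.filter((keyword) =>
      output.toLowerCase().includes(keyword.toLowerCase()),
    );
    return {
      score: found.length / this.required.length,
      info: { found, missing: this.required.filter((k) => !found.includes(k)) },
    };
  }
}
```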
Cool, let's keep going. So, there are two types of judges: a human judge, and an LLM as a judge.
Human judges are expensive, because usually a human judge is an expert in the field you're working in. For example, before Mastra, Shane, Sam, and I were trying to build a CRM, yet none of us had worked in a sales role or ever used Salesforce in our lives. So if we were to continue building an AI CRM, we would have had to create LLMs as judges eventually — but probably, in the beginning, hire some consultants to make sure the sales agent works the way the sales playbook goes. It's actually pretty easy once you have that person, though, because they can say whether something is good or not, and they can give proper feedback that's relevant and contextual, which feeds back into the system prompt and maybe exposes new tools you need to make, etc. LLM-as-a-judge lets an LLM do the judging, which means the judge has the same non-deterministic characteristics as your agent. So you're going to have to pump that thing up, and probably spend as much time on the judge as on your agent itself. But it's cheaper in some ways, because you don't have the upfront cost of human labor, and if you can really figure it out, you can get pretty good results. You have to calibrate it, though. And even judges may require their own judge,
right? So this gets kind of meta, but it's important. Okay, let's talk about Mastra real quick. We are an open-source framework — we help you build AI agents and everything around that space. What's included? We have agents, tools, memory, and tracing. We have workflows with suspend and resume and all those different capabilities. We have eval support as well. We have storage, RAG, and a local dev playground — pretty much all the things you need to make moves in this space. And so now I want to talk about the eval loop and how quickly you can get into it. With that, I'm going to stop sharing this, and I'm going
to share my Excalidraw, which is not very exciting, but we will draw some stuff. So, let me see here. Where is my screen? Here it is. Okay.
So, first off: this whole eval loop is super important, like we said, because you want to make sure you get better over time. First, I'm just going to draw a box here — that's the agent. Let me make this extra large. This is an agent.
You deploy this to production — wherever it lives, however you want to deploy it; maybe you put a UI in front of it, whatever. What's happening is this agent is going to get inputs in the form of user prompts, right? And within an execution, a lot of things could happen. You may have a bunch of tool calls — maybe it's doing web research and stuff — and those tool calls are calling other functions in your application. So let's call these functions that are executing, and they have a chance of failing or not. Of course, you could also call workflows, especially if you're using Mastra — this is very helpful. Those are deterministic code executions, but the LLM still has to choose to call them. And some things may have already been synced into memory, in which case maybe there's no tool call at all — it's just based on context. If this is your application, then what's going to happen is it yields a result. The output could be many things: it could be text — and I don't even want to go into the multimodal thing today, because that makes things even crazier — but the input is not always text, and the output is not always text. It could be objects, it could be videos, it could be images, it could be voice or sound, right? Everything is possible these days.
And the input could also be any number of these things, right? So that's the game we're playing. Now say you're at day zero: you just shipped this architecture, you're collecting your inputs and outputs, and maybe on day one, given the volume, things are looking pretty good. You have zero evals at this point, because who cares — you're just trying to ship stuff and iterate. And what are you actually doing at this point, when you don't write evals and you're looking at these inputs? Well, many people aren't even saving the inputs and outputs, so you wouldn't even know how it's performing. Let's say you did — let's say you're using Mastra, so you're storing all this stuff. At this point, you are the first evaluator, right? Because you're looking at the data and you're like, "oh man, this sucks," or "it's going well." And if you're the human expert on your own product, that's even better, because maybe you can push off evals for a bit and you're the one who can iterate. Usually this is how development works with evals too: you start thinking about what the eval will be, you test inputs and outputs — maybe ones you're not an expert on, but you know what they should be — and you start doing some human eval stuff. Actually, most people are doing human evals when they test their projects: if you're in a playground or in chat and you see something unexpected, you're doing an eval. If I ask a weather agent, "hey, what's the weather in Brussels?" and it's freaking raining and it says it's sunny — I'm a human eval. It's definitely raining right now; it's not sunny. Boom, I can already do that. But here's the problem: at scale, that's not going to
work. So what you want to do — and this also costs money, by the way — is add evals to your agent. We call these live evals, as opposed to running them in CI. After the output is sent — because you don't want to block on returning results to the user — you do an asynchronous thing: after the run, you run a set of evals. Maybe on day zero you just run an eval for every input and output you get, and you're collecting data at that point. But then maybe you want to sample in the future, just like Honeycomb or the other tools we use, with some sampling rate. That way it's happening off the execution path, but you're still collecting live data. Now, all this live data goes into a database — let me zoom out of here. All these evals get written to the database; let's say we're using Postgres or something. If you're using Mastra, this happens automatically — but hey, if you're not, that's cool, you're just going to have to put the data somewhere. So this is your eval results table. Your eval results will contain the input, the output, and whatever score, however you wanted to score it. And then the next thing you're going to have to do — the next phase of this — is called annotation. Say you now have a day's worth of data, a bunch of data in here. Now you can run
something called annotation. Annotation is essentially you making a judgment — let me draw a better view of this. Let's say this is a table of all the data you have, like a spreadsheet. Some people put their eval data in Google Sheets, so I wouldn't put it past anyone to do that — far worse has been done. So now you're in this thing and you want to pick good and bad examples. Say this first one I mark good, and this next one is bad, right? And when you're annotating, you want to say why: good because…, bad because…. So now what you've done is create another table of things, which are your annotations. Let's say we have another DB table — I'll just put it in the same Postgres. Now we have our eval annotations: some extra commentary on what happened, because these rows do have the input, the output, and, let's say, a score on each row. You want to annotate the really good responses and the really shitty ones, because what this does is
complete the loop. From here, you have these annotations that can then be fed back — and sorry, it doesn't look like a cool loop on the diagram, but it is; let me put an arrow on this. You want to feed this back: maybe you're changing your system instructions, maybe you're encoding these examples into your system instructions, or maybe you learned some stuff that means you need to tweak the tools and workflows and other things to make them better. So, as you can see: the input comes in, the agent does its thing, the output goes out; asynchronously I'm running my evals, I'm storing them, then I'm annotating them, and then I'm feeding the learnings back into the LLM.
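A rough sketch of the live-eval half of that loop — the agent config, `SAMPLE_RATE`, and `saveEvalResult` are assumed stand-ins for your own agent, sampling policy, and Postgres writer (if you attach evals to a Mastra agent, this storage step happens for you):

```ts
import { Agent } from '@mastra/core/agent';
import { openai } from '@ai-sdk/openai';
import { ToneConsistencyMetric } from '@mastra/evals/nlp';

const agent = new Agent({
  name: 'support-agent',
  instructions: 'You are a helpful support agent.', // placeholder instructions
  model: openai('gpt-4o'),
});

const metric = new ToneConsistencyMetric();
const SAMPLE_RATE = 1.0; // day zero: eval everything; dial this down later

// Assumed stand-in for a Postgres writer into your eval_results table.
async function saveEvalResult(row: {
  input: string; output: string; score: number; info: unknown; at: Date;
}) {
  console.log('eval result', row); // swap for an INSERT in real life
}

export async function handleUserPrompt(input: string): Promise<string> {
  const response = await agent.generate(input);

  // Fire-and-forget: the eval runs off the execution path,
  // so the user never waits on it.
  if (Math.random() < SAMPLE_RATE) {
    void metric
      .measure(input, response.text)
      .then(({ score, info }) =>
        saveEvalResult({ input, output: response.text, score, info, at: new Date() }),
      )
      .catch((err) => console.error('live eval failed', err));
  }

  return response.text; // the user gets their answer immediately
}
```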
Cool. So that's the eval loop, at least as we see it. There are also other things you could do, and there are definitely products out there that help you run an eval loop — LangSmith, Langfuse, LangWatch, Langtrace, Lang-everything these days. And Mastra helps you do this, too. Okay, let's go back to the slideshow.
Almost to the code part — and I'm running out of time, so I really need to speed up here. Oh, we're at the code part. Great. So first, we're going to create some evals. I've already pre-prepped these, Emeril Lagasse-style, right? I'm here with the code. I'll show you different ones, and I'll show you a custom judge, and then we can do some questions — I think that's probably where we're at. So yeah, let's get on with it. Let me actually share my code. Here it is. We'll start with tone.
Let me zoom this in a little bit more. For each one of these, I've created an agent and I've created a test; we'll run the test and I'll walk through everything here. So here it's a very simple agent — and let me look at the chat for a sec. We have a question here: are evals added into the agent's prompt? We'll get into that, so hang tight. Okay — I have a simple agent here: "You are an assistant that provides information about QuantumTech, a quantum computing company. Be detailed and informative in your responses." Technically this agent doesn't really do much, right? And it's using GPT-4o. This is all using Mastra. We have these things called off-the-shelf evals,
which are either NLP metrics or LLM-as-a-judge metrics that you can use. Off-the-shelf is pretty good for getting familiar with evals, and tone consistency may be important to some of you — especially if you're doing customer service or whatever. So you can set up this tone consistency metric. It's literally calling an NLP function from node-nlp that does sentiment analysis on the response, and that's it. All you've got to do next — if you wanted to do this live, you could pass evals in here, and it would do exactly that asynchronous loop I described. But in this case I'm not going to, because we have a test that we're going
to run. So let's go to the test. This is my tone consistency test. I have three different test cases, and I want to know what the tone would be on each. First one: "What quantum computing services do you offer?" Then: "Can you explain quantum entanglement in simple terms?" And then: "I'm having some trouble. It keeps crashing." So I don't want to skip this.
For each test, I'm going to use my agent — I imported it right from here, my tone consistency agent — and call generate on it. Our agents have two methods, generate and stream; I chose generate because I'm not streaming anything right now. I pass these queries in, and they give me a response as text, because that's what I'm doing right now. Then I take my metric, which we defined here, and call measure on it — measure is going to give me a score. So I take the query and the response, and I measure tone consistency on it. So that's what I'm doing.
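Paraphrased from the repo, the body of each test case looks roughly like this — the agent import path and variable names are assumed:

```ts
import { ToneConsistencyMetric } from '@mastra/evals/nlp';
import { toneConsistencyAgent } from './agents'; // assumed path to the agent shown above

const metric = new ToneConsistencyMetric();

const queries = [
  'What quantum computing services do you offer?',
  'Can you explain quantum entanglement in simple terms?',
  "I'm having some trouble. It keeps crashing.",
];

for (const query of queries) {
  // generate() is one of the two agent methods (generate and stream);
  // we're not streaming anything here, so generate it is.
  const response = await toneConsistencyAgent.generate(query);

  // measure() scores the input/output pair and returns { score, info }.
  const { score, info } = await metric.measure(query, response.text);
  console.log({ query, score, info });
}
```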
By the way, there will be a recording sent out after this, if you were wondering — so thanks for the question. It's hard to do these things by yourself; I didn't realize, because it's hard to keep track of everything. But we press on. So, let me get into my terminal here.
I will run — also, if you don't use Warp, you should; it's a great terminal, and they should sponsor me, but they don't, so it's okay. I'm going to run npm run test on the agent's tone test. Yep, let me just go into that folder. Actually, this Zoom thing is pissing me off; I've got to move this. Okay. So, cd into the folder, and we're going to npm run test on the tone test.ts. So, it's running. I don't have any debug logs or anything, so we have to wait and see what's going on — it has to generate a response for each test, and then it has to call the tone consistency metric. It logs stuff; I'll go through it in a second. This repo — oh, the terminal is not
showing up. My bad. Or is it? Okay. Okay. Okay. Uh, let me stop
sharing. I don't know why. Can you guys see my terminal now?
Let me double check the chat here. Yep. Okay. Sorry. Back
back in it. And yeah, this repo will be available after as well. So yeah, sorry.
Sorry about that — I'm spazzing, I guess. Okay, let's go through what happened. I guess you saw the IDE — I just ran npm run test, so you didn't really miss out on much except me fumbling. Here you can see the query: "What quantum computing services do you offer?" "QuantumTech offers a range of blah blah blah," and then it gives me all this stuff. Honestly, whether that's true or not — it's not true, obviously, because how would the LLM know what QuantumTech is? I just made that up. But we're not testing for that; we're testing for tone, right? So this is all the stuff it said, and then we have the tone
info. Response sentiment is zero, reference sentiment 0.4 — if you really have to know the underlying NLP values for it to make a difference. The tone score is 95%. I don't know how it actually determined that, because these NLP things are a little black-boxy, but if you trust it, that's pretty good. And I might be hinting at why I don't really like off-the-shelf evals, but that's just me. The next one, entanglement — same kind of thing: 98%. Wow, the tone was super consistent in this one. Okay, sure. And then the last one — they're saying some stuff, but the tone score is 79%, maybe because the response was too robotic, right? So, is this a good metric to use? Maybe — maybe if you just want to play around with evals, this is good. I don't think it's actually good, personally. But we press on. So,
that was tone. Let's do another one that's a little more interesting: answer relevancy. I have two different versions of this that we'll go through. First, same type of thing — I'm creating an agent down here. It doesn't do anything yet, but it's an answer relevancy agent, and it's using GPT-4o. I could use a better reasoning model, too, and you might have to — we do in a different example — because the model's capabilities for reasoning about the inputs and outputs are important. That's why these things get expensive: reasoning models cost more. Just saying. But in this case: "You're a helpful customer support agent for Tech Gadget Inc. You should answer questions about our products and services. If you don't know the answer, admit it and offer to escalate to a human." Sure. So we have answer
relevancy. This is actually an LLM-as-a-judge metric, so I can pass in a model — I'm using o3-mini here. And you can give it certain properties: scale, meaning I want the score it outputs to be between zero and one, and uncertainty weight, which lets you add a little uncertainty — some partial credit — right? Because none of this is exact; it's all very random.
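Setting that up looks roughly like this — the option values are the ones from the demo, and the import path is Mastra's eval package as I understand it:

```ts
import { openai } from '@ai-sdk/openai';
import { AnswerRelevancyMetric } from '@mastra/evals/llm';

// LLM-as-a-judge metric: pass the judge model plus scoring options.
const answerRelevancy = new AnswerRelevancyMetric(openai('o3-mini'), {
  scale: 1,               // scores come back between 0 and 1
  uncertaintyWeight: 0.3, // how much credit partially relevant answers get
});
```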
Let me check the chat here real quick. One quick question on the last one: was it comparing the tone between the three queries relative to each other? No — it was doing each one individually, just as an iteration: this one, and this one, and this one. It's purely taking the text, passing it through the NLP tone function, and getting a value. So you really have to interpret that — which is why it's not a great metric, because how do you really interpret tone from a text score? Or maybe I'm the only one who struggles with that. Anyway: this answer relevancy is an LLM-judge-type metric. You pass in a model, and under the hood it takes the input and output — you can give it some context — and it says, hey, is this answer even relevant? In this case there's nothing else going on; it's just using the model's baked-in data. Let's go look at the tests real quick. Same kind of thing — I've got
three different questions here, and it's going to ask: is the answer relevant to the question? Because what if I said, "hey, when will the Tech Gadget X300 be released?" and it tells me the X200 was released — I asked for the X300, bro, so what are you doing, right? A lot of people use this answer relevancy metric, by the way, and I think it's pretty good, but you have to tweak that uncertainty weight. Once again, it's very custom to what you're doing. But this is another good one if you want to explore what LLM-as-a-judge means and how it works. So, same kind of deal though:
we're going to generate, we're going to measure, and we're going to print it out. Cool. So, let's do that — let me not skip this, because that would be dumb — and let's run the answer relevancy test, test.ts. This one will take a little bit longer, so if you've got some questions, we can take them. It's using o3-mini, so it should
take some time here. But I guess we're one done, two to go — or maybe no, we still have three to go. So, if you have any questions, just drop them in the chat. And yeah, we'll have to just sit here
idly together while it does its thing. Oh, we have a failure. Great. That's
good. I wish I had some knock-knock jokes or elevator music or something. Oh, we got the second one. Now, let's go for the third. Where are
we going to be in the third one? Come on. Yo, speed it up.
Oh, and it timed out — I guess it would have taken longer than 30 seconds. I don't really care about that last one, because we have enough here to work with. So, let's go through the first one.
"How do I reset my Tech Gadget X200 to factory settings?" The response — it made up some stuff. We're not really testing hallucination here, but just saying. And I logged the reason right here — I probably should have put a space between these — but with LLM as a judge, you get the input, the output, and a score. It judged this a score of one, which is good, because the scale is 0 to 1 — so I guess it's 100%; we're killing it. And then it gives the reason, right? The score is one because every part of the answer directly addresses the user's inquiry by explaining each step — even though the steps may not be real, right? But there isn't any extraneous information. Okay, cool.
Second one: "When will the new Tech Gadget X300 be released?" "I'm sorry, but I don't have the exact release date for the Tech Gadget X300… However, I can escalate your…" — you know, this is that part of the system prompt: if you don't know, escalate. The funny question, though, is: how did it know about the X200 if the X200 doesn't exist? Maybe it just didn't feel comfortable giving me an answer, right? So the score is 33% — interesting, or 0.33. It says: while the response acknowledges the query — so it's somewhat relevant, because it acknowledged the query — it fails to provide release information, so the answer is not truly relevant. And then this one: "do you
offer international shipping?" Yes, it does, blah blah blah. The score is 67%, because the first part of the response clearly answers that international shipping is offered, which is directly relevant — however, the additional information about shipping costs and all the other stuff is not. So answer relevancy is very strict, right? And some people do want their responses to be brief; you don't want paragraphs of unnecessary stuff. Also — if you're in Cursor and you ask the agent to do something and it touches 17 other files, I don't know if that's very answer-relevant, right? There's a gap between what it did and what you wanted it to do. And as you can see in the test cases here, I was just checking for everything greater than zero, right? So not too big a deal. Cool. There are some questions — let me
go here. I have a question from Colin Matthews: any opinions on zero-to-one scoring versus an abstract number — adding for positive behavior, subtracting for negative behavior, or other patterns? I'm going to talk about that in a bit with the custom stuff I'm building toward, so I'll get to that in a sec. And then Morett asks: in a case like this, where it didn't know about a certain gadget, would it make sense to write an eval which checks if the agent called a certain tool, or would the strategy be to test if the answer
mentions the gadget, regardless of whether it came from a tool or not? Yes, that's what you would want. So let me go back to answer relevancy and give the agent a tool that gives it information. Oops — what am I doing? Okay, I have a tool here called get releases, so that should make things better. I'm going to change the system prompt to be actually good — or better, you know — and I'm going to put the tools in. Then we're going to run the test again. But let me also trim the test cases down, because I don't want that thing to time out — let's just keep this much. And we're going to run this again.
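Sketched with Mastra's `createTool`, that tool-plus-agent change might look like this — the tool name comes from the demo, but the schema and release data are made up:

```ts
import { createTool } from '@mastra/core/tools';
import { z } from 'zod';

// Hypothetical tool: gives the support agent real release data to draw on,
// so answers stop coming purely from the model's baked-in knowledge.
export const getReleases = createTool({
  id: 'get-releases',
  description: 'Look up release information for Tech Gadget Inc. products',
  inputSchema: z.object({
    product: z.string().describe('Product name, e.g. "X300"'),
  }),
  execute: async ({ context }) => {
    const releases: Record<string, string> = {
      X200: 'released May 2023',        // made-up data for illustration
      X300: 'release date not announced',
    };
    return {
      product: context.product,
      release: releases[context.product] ?? 'unknown product',
    };
  },
});

// Then hand it to the agent alongside the improved system prompt:
// new Agent({ ..., tools: { getReleases } })
```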
I might not have time for hallucination, but it's all the same kind of business, right? Actually, I'll go through it real quick, and then we'll get into custom evals — and we might have to go a little over, given all the technical stuff, so I apologize for that. This will be recorded, though. I'm going to keep going until the content's over, but if you have to leave, that's totally fine. While this is running, I'll start the next one so we can save some time.
So, we'll go to hallucination. Same kind of business, right? I create an agent — it doesn't do much — and I create a hallucination metric, which is an LLM as a judge. You pass in a model; I used GPT-4o because I didn't want to wait so long, since the last one took forever. And this one is interesting. Hallucination is one of the off-the-shelf evals that I actually love, because we're always testing for hallucination — don't we all get pissed when this thing hallucinates on us? I'd recommend you at least do this one and tweak it. If you all don't know, we have an MCP docs server that you can install into Cursor and Windsurf and then use to write code. But in the beginning, half the time it just hallucinated about what information was necessary to write that code — so we actually use this hallucination metric ourselves. How it really works: you give it a scale, but you also want to give it context, and this context is super important for the measurement. Given the inputs and outputs, and given the context: is what it's responding factually true relative to the context you gave the measurement? So that's cool — we'll run that in a sec.
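Construction looks roughly like this — the context strings are invented placeholders; in practice they'd be the source material you want answers grounded in:

```ts
import { openai } from '@ai-sdk/openai';
import { HallucinationMetric } from '@mastra/evals/llm';

// The context array is what the judge checks claims against —
// this is the part that makes or breaks the measurement.
const hallucination = new HallucinationMetric(openai('gpt-4o'), {
  scale: 1,
  context: [
    'Tech Gadget Inc. released the X200 in May 2023.',        // made-up facts,
    'The X300 release date has not been announced publicly.', // for illustration
  ],
});

// A higher score means more of the output is unsupported by,
// or contradicts, the supplied context.
```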
Okay — back to the answer relevancy rerun: it looks like they all passed this time. Ain't that interesting? Well, yeah, because I gave it the right information — it could make tool calls, you know. Even though, for example, I didn't add any information about the factory-reset question, it knows about the release dates. And this time it did better at answer relevancy on the international shipping question, too. Dude, okay — it was 33% before, and right now it's 0.75, and I didn't change anything
about that. That's why unit tests like this are whack: you shouldn't be testing for success, you should be testing for absolute failure. If the score can jump around in a range like that — and obviously this is a hypothetical situation; I'm putting the problem in a box right now — but dude, the fact that it could swing like that means you should only be testing for "less than X," because falling below that floor means something has gravely gone wrong. Hope that makes sense. And now let's run the
hallucination test here. So, same kind of deal as all the other tests. I have some test cases. I have a lot of test
cases here. Let me get rid of some so this goes faster. And then same kind of deal. Hallucination measurement. Um, so
I'll do that. Okay. While this is running, I'll move to the next one. So, we're more efficient here.
So, I was leading this whole workshop up to this, because my opinion is that off-the-shelf is whack. I don't think people should rely on them — in some cases they're good, maybe hallucination — but as we saw: is it really, truly testing what you want to test? I personally think people should create their own custom metrics based on their business. You're the one writing the application, right? So you're probably the best one to know, and to write, this metric. How do you do it? The first thing you need is a judge. I have one here, and my judge is kind of hefty. Here's how you do it — and we have docs on this, but I'll show you now. You can import the Mastra agent judge from the evals package, and you have to write instructions, just like an agent. As you can see, I wrote a particular kind of instructions: I don't like the LLM using numbers — maybe because I studied mathematics in college, I don't know.
I just don't feel right about it calculating numbers for me, but I do feel right about it giving something a qualitative label. So in this case I have different dimensions that I care about: correctness, completeness, clarity, empathy — which is hard to gauge, honestly; a lot of models don't even pass emotional intelligence tests, but let's not even go there — and actionability. And within those, I want to grade them as poor, good, or excellent. And here I have some examples that, in a real scenario, would have come from annotation — or could come from annotation. A poor one is like: "your warranty is valid for 5 years" — but it's actually 2 years. "Your device comes with a 2-year warranty" — that's good; it's factually correct. "Your device is covered by a 2-year manufacturer warranty, which includes parts, labor…" — it gives more information, so that's an excellent response. And it's the same for all the other dimensions: I just have examples of what poor, good, and excellent would mean for each category. Then we have an evaluate function, where you create an eval prompt: "based on the customer query and the response from my agent, please evaluate the dimensions using only these ratings: poor, good, excellent." I want a structured object back with the ratings, the examples, and the recommendations. Then I pass these into my judge agent — this is still an agent; an LLM as a judge is built from the agent primitive, which is why we called it an agent here.
Then you pass in this eval prompt, which has the inputs and outputs, plus your structure, and it gives you back a ratings object. And this is where — if you read a lot of white papers on these types of evals — there's a normalization step after you collect this information, where you take the ratings and convert them into numbers, because now you can put your quantitative opinion on it. For me, I said: poor is 2.5, good is 7.5, and excellent is 10. Now I can actually put some numbers behind it — and I'm the one who made up the numbers, which is exactly why they matter to me. And then, if you want a reason, you take the input, the output, and the scores, and you say: hey, these are all the scores for all the dimensions, and this is my overall score — please give me the reason it was scored this way. That's how you finally generate a reason, and you can format it. And this is my convert-to-rating function: if it's less than 5, it's poor; less than 8, good; and so on. So you can convert back and forth between these.
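Condensed a lot, the judge looks something like this — the dimensions and the number mapping are the ones from the demo, but the Zod schema, prompt text, and score math are my own sketch of it:

```ts
import { z } from 'zod';
import { openai } from '@ai-sdk/openai';
import { MastraAgentJudge } from '@mastra/evals/judge';

const rating = z.enum(['poor', 'good', 'excellent']);
const ratingsSchema = z.object({
  correctness: rating,
  completeness: rating,
  clarity: rating,
  empathy: rating,
  actionability: rating,
});

// Normalization: made-up numbers — which is exactly why they're meaningful to me.
const RATING_TO_SCORE = { poor: 2.5, good: 7.5, excellent: 10 } as const;

export class CustomerQualityJudge extends MastraAgentJudge {
  constructor() {
    super(
      'Customer Quality',
      'You evaluate customer support responses. Rate each dimension only as poor, good, or excellent. Never output numbers.',
      openai('gpt-4o'),
    );
  }

  async evaluate(query: string, response: string) {
    // Ask for a structured object back — qualitative labels, no free-form numbers.
    const result = await this.agent.generate(
      `Customer query: ${query}\nAgent response: ${response}\nEvaluate correctness, completeness, clarity, empathy, and actionability.`,
      { output: ratingsSchema },
    );

    // Convert the qualitative labels into a 0–1 score after the fact.
    const values = Object.values(result.object).map((r) => RATING_TO_SCORE[r]);
    const score = values.reduce((a, b) => a + b, 0) / values.length / 10;
    return { ratings: result.object, score };
  }
}
```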
I know we have one minute left, so I'm going to run this one — but first I've got to go check on that hallucination run. So, one of them failed, probably because of a timeout. Great, that's okay. Let's just go through one here: "How has QuantumTech evolved since their quantum computer?" "QuantumTech has made significant strides in the development of blah blah blah." Let's look at the reason — where is it — here. The score is one because all statements in the output are either unsupported or contradict the context. And for hallucination, one is not good — I believe one means it hallucinated — which is probably true; I just don't have time to go back and show you the context. Let me answer some questions and I'll run the
last one. Okay: "How did you structure your docs for RAG with MCP? Any specific approach that worked better?" Sorry — not really relevant to this workshop, but that's okay. We structured it by path and then by topic, and you can actually look at our docs MCP server for inspiration if that's useful to you. Next one: "Is there a significant difference between using the MastraAgentJudge class and a regular Mastra agent as an evaluator?" No — it's a convenience. And honestly, this is evals v1; I'm cooking evals v2 right now, which will be way cooler than this. But this is what we have today, so there isn't a big difference — maybe some syntactic sugar and TypeScript help — and it will get better and more cohesive. Okay, we're going to do the final, drop-the-mic demo. I have an agent here — or sorry, I showed you the judge first. This is my metric.
I just made a customer quality metric that takes the judge and orchestrates everything. So Ray, you asked: could you do this yourself? Totally, dude — you could totally do this with a regular agent. We just made these classes for convenience, to give structure to it. For example, I created this measure method that exists on the class, calculating a score — it's convenience stuff to give you a framework around it. And then I have my customer test. I'm not going to run these per se, but I do have an agent in here that I want to run — I think it's this one: agents, customer support. Here it is. So, here's my customer support agent. It has some BS system prompt; it has company context, because I'm doing hallucination; and it also has my customer quality metric. In this case, I put the evals on the agent itself, so when I test it, it's going to do that eval run asynchronously.
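Wiring the evals onto the agent itself looks roughly like this — the names, instructions, and custom-metric import path are placeholders for the demo's actual files:

```ts
import { Agent } from '@mastra/core/agent';
import { openai } from '@ai-sdk/openai';
import { ToneConsistencyMetric } from '@mastra/evals/nlp';
import { CustomerQualityMetric } from './metrics/customer-quality'; // assumed path to the custom metric

// Evals attached to the agent run asynchronously after each generate/stream
// call, and the results show up in the playground's evals tab.
export const customerSupportAgent = new Agent({
  name: 'customer-support-agent',
  instructions: 'You are a support agent for QuantumTech. ...', // placeholder system prompt
  model: openai('gpt-4o'),
  evals: {
    toneConsistency: new ToneConsistencyMetric(),
    customerQuality: new CustomerQualityMetric(),
  },
});
```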
So I'm going to run npm run dev. This will start — oh, I already have it running, that's my bad. Here, just pretend you didn't see that. And it's starting up the dev
server. We have a dev server: when you run mastra dev, it takes that Mastra class — for example, here — and turns it into a Hono server, and then we serve a playground from it, which is pretty cool. If I go to it, you'll see I have my agent here, and I'm just going to ask it some BS questions and see how it goes. Like, "when is QuantumTech going to…" — actually, let me think about this. "Why
would stop reading right now, you know, but for the sake of quantum tech and our customer support metric, we will wait for it. And so this is the response right now in Ma Playground, you have this eval tab. You're going to see the tone consistency showed up right away cuz that shit's for free. Actually,
you're writing a function and you can already see that the score was 67. Now, remember I talked about that spreadsheet of data. Well, we technically gave you a spreadsheet here in the playground that stores what happened. So, you have the time stamp, the instructions. So, like
this was the system prompt at the time that you ran this inputs, outputs, and the score. As we know, NLP or these uh NLP metrics don't have reasoning. Makes sense. There's no LLM involved. Now, in
the background, there are probably some more metrics finishing. Yep, I called it. So, let's look at answer relevancy first. And here's another good point: if I change my system prompt and I learn things, I still want to see the evolution of my system prompt, right? I have the data for it. Also, what if I regress? I need to know what the system prompt was at a certain time — this is where a prompt CMS comes in handy. I'm not saying we have one, or will have one, but maybe we will. And then you can see the score here. The score doesn't really matter, because it's all made up — it's like Whose Line Is It Anyway. Hallucination, obviously, took things that are not in the context; that's fine. But let's actually look at my customer quality
metric. So: correctness — rating excellent. "QuantumTech is a company focused on developing quantum…" — yep, that's true, that's true. Completeness — I can see it going through each dimension that I created; I'm the one who cares about these. You can see why it was a complete answer, the rating was good, and here are some examples from the response and how to improve it. The cool thing is, I wanted to know how to improve it — even if it might be wrong, it's at least a good first step toward improving something. It says: "the response could more directly address the customer's specific complaint" — so I can go back to the system prompt and tweak things. Clarity: cool, it got excellent. Empathy: good, whatever. And then I converted those — I normalized the qualitative labels I wanted into a quantitative score: 88%. Now, we know this was all BS, right? But 88% — pretty good. And if I had a real agent that I really cared about — and a lot of you do — you could then feel good about the metric. That's the whole point: these actionable, improvement-oriented reasons mean you can iterate now. I can go back to my system prompt — look how shitty this one is, right? — I can change it, make it better, run the same input again in the playground — or even in production — and start collecting all this data, and go from there. Okay, that's all I have time for. I'll answer some questions for those who want to stick around, and then we're good. Thank you so much for attending. So,
uh yeah, questions in the chat, please. I'm here. Way harder to do this by yourself. So, thanks for having the
patience and, you know, listening to me talk for an hour. Any questions? Thank you. Oh, Craig said, "Thank you. Great stuff. I now have a working eval." Sick, dude. Wow. I did something
important today. Yes. Any other questions or I'll just let everyone go.
Um, going once, going twice, going three times. Okay. Well, that was it then. Thank you so much for attending. I will see you next time. Um,
I will share — I think in the event — actually, I'll just share it now, why not? I'll put it in the chat here. It's public, so it's chilling. Evals — there it is. And we'll also, I assume, send the recording as well. Wait, I need to send that to everyone. Sorry. There you go. So that's the code that we walked
through. Oh, we have a question, if you want to stick around for this, from Colin Matthews: any tips for tracing issues if evals drop and more than one underlying element changed? Oh boy, that's a tough one. In this case, you need to make sure the elements that changed have their own tests, because you want to be able to bisect this, right? You want to be able to narrow it down to specific points. But then, say the model changed — and this has happened to some of our customers: they changed the model, and all their evals went out of whack, because the first couple of days after a model release, things are not so peachy, right? So they needed to know: is it the tools, is it my code, is it things I wrote, or is it the model? You have to have tests around everything, which is a big ask right now, because a lot of us are still figuring this thing out — but it's a good tip to start getting into the mindset of what could go wrong in the trace, and whether you have coverage on all of it. Thanks for the question, Colin. That's a really good question.
All right, everyone. I think that will be it, unless you have another one. I could be here all day — I'm just kidding.
Um I'm gonna go now. Thank you so much. Have a great day. Catch you on live streams. We do it every day. Catch you
at the next workshop. Uh really appreciate it. Bye.
