
WTF is the "Eval Loop" - Evals Deep Dive, AI News, and Mastra Docs with Paul

July 7, 2025

Today we talk AI News, we get some Mastra Docs updates from Paul, and we do a deep dive on Evals. In our deep dive we talk about common questions when starting to think about Evals and discuss the upcoming Mastra Evals roadmap.

Guests in this episode

Paul Scanlon


Mastra

Episode Transcript

3:32

What up? What is up? How's it going, dude?

4:13

I'm doing well. How you doing? Pretty good. Pretty good. Got my Celsius here. Getting ready for the day.

4:20

Yeah. I kind of got a little echo coming from your end. I see you. I see you have a a stack of books, you know. That's

4:28

That's good. I do as well. I'm using this microphone.

4:35

Yeah. I mean, maybe check your input. I don't know. Can you hear me pretty well? Hello.

4:41

Oh, that's a little better. Hello. Hello. Hello.

4:47

Is that better? I think so. We'll go with it for now. Roll.

4:54

If you're in the chat, let us know. Let us know how the audio sounds, if it's from Obby's microphone. We can always work on fixing it, I guess. But welcome, all you admirals of AI. This is AI

5:04

Agents Hour. I'm Shane. I'm here with Obby. What up? Yeah. and we're going to chat some AI

5:10

today like we do every Monday. If this is the first time you've tuned in, you know, we do this pretty much every week. Sometimes more, sometimes, you know, almost never less. And we have a pretty good episode today.

5:23

We're going to be talking AI news like we always do. We're going to bring Paul in from the Mastra team to talk about some docs updates, and then we're going to go deep on evals, because evals are a hot topic right now. Has been for a while. Yeah. Well, before we do that, how was

5:40

your Fourth of July? Uh, well, it was great, but I did dislocate my shoulder. Oh, how'd that happen? So, my left shoulder is

5:52

pretty weak and I was tubing. My friend sent me sky-high on the boat and I landed, tried to hold on, popped my shoulder out, stayed on the tube, shook it out and it popped back in and it was fine. But I've dislocated the shoulder like six times in my life, so I've gotten good at it. I

6:15

don't think it fully dislocated because that'd be kind of hard to pop back in. It was like a partial dislocation. I could just like shake it back in and then it would pop back into place. So, it's a little sore. But besides that, it was an amazing weekend. Got to

6:27

spend some time with family and friends. And did you see fireworks? Uh, yeah. I didn't go to any specific

6:33

show, but where we're at, there was just like a ton of people around us, so we could just like observe everyone else's fireworks. We We did uh shoot off a few. What about you? How was your fourth, dude? It was chill, but SF fourth is way

6:46

weaker than LA. for sure. Um because shots fired.

6:51

Shots fired, dude. Because there aren't that many people just doing fireworks illegally everywhere. And that's what makes the Fourth of July fun. But there was a bunch. I went up the hill here and I saw a

7:04

bunch uh near Bloom Saloon and stuff. So, you had a good vantage point, but then they just ended and in LA they go off all night. So, uh it's just a different different vibe.

7:16

Yeah. Yeah, some someday you'll have to come check out some South Dakota fireworks. We we go pretty hard here in South Dakota. I can only imagine if it's legal how crazy the fireworks would be. I mean, it's not legal in cities. Well,

7:29

I mean, people still do it all over, but it's not technically legal in city limits. But where I was was outside of any big city limits, so people are just shooting them off all over. Hell yeah. Well, should we get into it? Yeah, let's do it. So, this is a live

7:47

show. That means if you're watching this on LinkedIn or YouTube or X or wherever you may be watching, you can leave us a comment. We try to respond to most of them. So, let us know what you think as we go along uh through some of this

8:00

stuff. The first thing we're going to talk about is we're going to do some AI news. So, why don't we just get into that? And yeah, let me share my screen and we'll

8:12

talk about the first bit. So I thought this was kind of interesting. You know, xAI raised five billion in debt and then an additional five billion in equity. So 10 billion, with a B, total.

8:29

So that's a lot of money. It obviously seems to be all around building out data centers. They want to build the largest data center, right, and the idea is that if you have the most compute, you will have the best AI. I mean, I don't think Grok has the best AI today, obviously; I

8:48

don't think anyone does, but it is kind of interesting to see if the bet's going to pay off, because if that holds true they're going to be well positioned to potentially have some of the best at some point. Yeah. Plus, you know, raising more money allows you to get more talent and, you know, it's like an arms

9:07

race now. So, yeah. If anything, last week we saw the arms race for talent is getting insane. Yeah. So, dude, anyone who was doing ML and stuff years ago, and then

9:20

they were all sad like, oh man, I don't know about my life. You bet you'd be happy today. Yeah. Yeah. ML is the new sporting event.

9:32

You can be a professional athlete or just a professional machine learning engineer. Yeah, dude. You make the same amount of money. You give your kids a choice, you know, like, hey, do you want

9:43

to be a basketball player or just be a freaking nerd and make big bucks? Our time has finally come. Unfortunately, I'm not from an ML background, but still, the nerds among us, which I identify as. Just never gonna get professional at anything.

10:01

All right, so we got a question here from Spacework Dev. Quick question: any chance that, when using Mastra networks, let's say after the network decided which agent is relevant, we can make it stick with that agent for the next message and not use the routing again?

10:18

I mean, the answer is no today. Um, but if you wanted to open up a GitHub issue, we can take a look at that feature request. Just a reminder, agent networks have two modes. One mode

10:30

is just a generate or stream call. So, it goes to the routing agent once. It makes its decision on what it's going to route to and then returns the response to the user, or you, or whatever. And then you go

10:44

to the next one. The second is loop. Loop just goes in a loop until it figures out the problem, but in that loop it goes back to the routing agent each time to decide where to go next. Conceptually, the difference between the two modes looks something like the sketch below.
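A rough, hypothetical sketch of the two modes in plain TypeScript, not the Mastra API; routeOnce, routeInLoop, and the router callbacks are invented names purely to illustrate the routing behavior being described.

```ts
// Conceptual sketch only -- not the Mastra API. Names are made up.
type Agent = { name: string; run: (input: string) => Promise<string> };

// Mode 1: generate/stream -- the routing agent decides once, the chosen
// agent answers, and the call returns.
async function routeOnce(router: (input: string) => Promise<Agent>, input: string) {
  const agent = await router(input);
  return agent.run(input);
}

// Mode 2: loop -- after every step we go back to the routing agent until it
// decides the task is done. "Sticking" with one agent, as the question asks,
// would mean skipping this re-routing step on later turns.
async function routeInLoop(
  router: (history: string[]) => Promise<Agent | "done">,
  input: string,
  maxSteps = 10
) {
  const history: string[] = [input];
  for (let i = 0; i < maxSteps; i++) {
    const next = await router(history);
    if (next === "done") break;
    history.push(await next.run(history.join("\n")));
  }
  return history;
}
```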

10:56

So, I mean, we could potentially do this, but you should open up a GitHub issue. Yeah, absolutely. Yeah, I think that's a good idea. That's a good place to start and, yeah, we can have a discussion and

11:06

figure out how we could best do something like that. It'd also be cool to know, you know, a little bit more about the use case of what you're trying to build. I kind of get it, but it's always good to have some background context on what you're building. So that helps us decide what features to build and what other, you know,

11:24

similar types of users would want similar things. All right. So that was the first news article and the first question. That's good. Let's go to the next one. And this one actually probably was around last week,

11:38

but we didn't really talk about it much. So I figured we'd spend a little time talking about it today. Gemini released their CLI. So, you know, if you're familiar with Claude

11:52

Code, that has a CLI. Gemini released their CLI, which they're touting as, you know, your open-source AI agent. So, you can basically bring Gemini right into your terminal. I think the whole idea is for it to be similar to what you can get out

12:06

with Claude Code. Pretty cool. Yeah. Have you... I haven't

12:14

had a chance to try it yet. I have been using Claude Code a lot, but have you had a chance to try it out yet? Yeah, I tried it out. It's pretty sick.

12:20

Pretty fast, too. Um, and it has a lot of stars. It blew up.

12:26

So, people like it. It's gotten good reception from maybe people who are Gemini fans, but even non-Gemini fans. Yeah. Do they have the I don't see the the GitHub link. Where's that at? I got

12:38

to find that. Oh, right here. Contribute to this project. I'm just pulling it up. Yeah.

12:45

Wow. 55,000 stars in two weeks. Yeah. Pretty crazy.

12:51

Dang. That's pretty cool. Yeah. And it's very it's very useful utility. So, it's pretty cool.

12:57

Yeah. So, if you are listening, let us know if you tried it out. How does it compare to Claude Code? I do think that having the large context could help, you know, that's nice. One other

13:08

benefit of Gemini is it's typically cheaper than Claude, but also, you know, I do wonder with those large contexts if it starts to add up over time. But it is nice to have the larger context windows as an option. Yeah, I don't know. I didn't do any coding tasks with it though. I should have. Maybe I'll try after, but um, I don't

13:32

necessarily use Gemini for coding tasks. I usually use it for things that need like a really big-ass context window. Um, yeah, I've been hearing some people starting to use Gemini more, but again, a lot of this stuff kind of depends on your preferences.

13:51

It's like Coke or Pepsi, right? It's becoming like that, you know? I do think, obviously, some models are better at certain things. That is true. I think in general Sonnet still outperforms most

14:03

other models on coding tasks, but I think it somewhat depends on what you're doing, because I've seen people have good luck with all the major models depending on what they're trying to build. Yeah, I think, what was it, Karpathy was like, these are all like utility companies, so, you know, makes sense. We're

14:22

all just, you know, maybe like before it was like based on where you live, right? Like what utility company you use, but now it's like based on your choice. Yeah. And I think there's always going to be like some that shine at, you know,

14:35

certain tasks better than others. But we do have another... this is more of a comment. Be great to have something like agent simulation testing like the one from LangWatch.

14:49

That would be great, wouldn't it? Yes, it's a great idea. I can confirm that would also be great. That's a pretty cool feature. You should open a GitHub issue. Uh,

15:00

yes, the things that you'd like to see in Mastra. Yeah, that's a good callout. Yeah, if you just pop open a GitHub issue, you know, obviously we can't add everything, but we do read all of that, and it all gets factored in when we're planning out our roadmap and what features we want to add. And the other thing that's kind of nice

15:17

about having a GitHub issue is if more people chime in that they want that, it's it's a good indication, right, of of a feature that many people would like to see. So, yeah. All right, moving right along.

15:34

Next up, this one. I don't know why it's... give me one second to open it up. Apparently, I lost the tab, but we'll pull it up here.

15:56

Maybe we can even... oh, I might not be able to share the audio, but that's okay. We'll just read it. So, Cursor released some updates. This

16:08

came out a couple days ago, right before the 4th. Cursor 1.2.

16:14

So, there's agent to-dos, which is kind of interesting because, by the looks of it, if you're watching this little video, you can build a to-do list. And then it essentially allows you the ability to structure the plan. One of the great things about Claude Code is it actually builds its own to-do list, right? And then you can kind of see it as it goes through it. But

16:38

sometimes you want to control the to-do list. You don't want to let the agent control the to-do list. I know I personally, often, when I would be building features for myself, I would make a very detailed to-do list and then just churn through all those things, right? Yep. Now I can just take that to-do list, pump it into Cursor, and let it do its

16:56

thing. Um, looks like you can queue up messages, uh, memories. I'm curious to know more about that. PR indexing and search, improved embedding, faster tab

17:08

completions, agent merge conflicts, and background agent improvements. So, a whole bunch of things came out. I can attest that the update is really good because I'm a Cursor guy now. And you're a convert. You were a Cursor user, then you converted, then you converted

17:26

back. Yeah. Um, agent to-dos is really cool if you are a control freak, you know. So,

17:33

like that's good. The background agents have improved. I've been playing around with that a lot. Tab completion, you can

17:40

like tab to victory so quickly now. Uh way quicker than last week. And I'm pretty new here, too. I can see the

17:46

sharp difference given I just converted, right? So, I can see how much they've already improved. Code search is a lot faster. I haven't tried memories though yet. I don't really care about that right now. So, um yeah, dude. It's sick.

17:59

I should have been on it sooner, but maybe this was Destiny. I had to come through this way. You had to go through the path. All the

18:06

roads lead back. Um, I can see how people use Claude Code versus this, but I can also see how it's just so easy to use Cursor on web and everything, like it's a dream. So I think this is now story time, because we were hanging out.

18:23

It was last week, I guess. Yeah. Time with the holiday is kind of weird, but yeah, it was last week and you were just firing off background tasks to Cursor in your Safari browser on your iPhone as we were having a beer at the bar.

18:42

Yeah, because I was using background agents in Cursor just to play with it. And then I saw this tweet last week that they have a PWA for Cursor, and I was like, oh, you can use this in the browser. So then on our way to where we were going, I think we were going to happy hour, I just started

19:00

launching things off, and I've still been following up with those tasks. I didn't implement any of the code they generated, but it got me to think about where I wanted it to go without even having to do the work, because I was hanging out with you, and then later on I was like, "Oh, here are all the ideas I had." And then it went and

19:19

explored them for me. So, it was so sick. I love it. I still use it. I'm like firing off some stuff right now

19:24

before this live stream. The funny thing is, I've worked with you for a long time. So you used to, you know, often come up with an idea and kind of pave the rough path and then hand it off to someone to explore the path and report back, and

19:43

then together you'd often decide how to actually move forward. But now it's like you can just send it off as soon as you think about it. You don't have to wait for someone to say, "Okay, I got some time to explore this." You can let an AI agent

19:56

explore the path, decide if it's the right approach, or at least learn a little bit from their approach and say, "Okay, that's either cool or not." And yeah, it's like vibe coding. It's literally vibe coding features, or product features, that you know you're going to throw away, but you're

20:13

talking to the agent, and you're not necessarily giving it all the context it needs; you're talking to it very conceptually, and then it does a pretty good job of finding context within the repo and kind of making some stuff happen. Even if the code is not going to be accepted, it probably saved me like 20 hours, and I

20:32

just did it while drinking with you which is amazing. It's It's like the the new way to do like throwaway prototypes. It's like Yeah. Like like we know we're going to throw this prototype away. I just want to see what it would look like and see how it

20:43

feels. Yeah. And maybe it doesn't even fully work, but it gets you close and then Yeah. We're just going to throw it away and

20:50

pay, you know, 20 bucks or 50 bucks in compute, or whatever it costs, right? Yep. Yeah. That's pretty cool. Yeah. All right. One more news story here

21:03

and then we're going to bring in Paul for the last news story because he might have some opinions. Cool. But this one is, you know, we like to always just talk about these. I don't know if I have any insights into it or

21:15

if you do, but it's nice to talk about any new open-weights or open-source models. So, Baidu has released ERNIE 4.5, and for those of you, you know, most of you probably know this, so I'll just be reiterating things you know, but if you see these model names, DeepSeek's are named the same way: the first number is the number

21:39

of parameters, so 300B is 300 billion parameters. The second number, if it says A47B, that's how many parameters are active at one time when it's doing inference. But it talks a little bit about its pre-training. It used a multimodal mixture-of-experts pre-training approach.
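A quick, hypothetical illustration of reading these MoE model names (the convention just described, not an official spec); the parseMoEName helper is made up for the example.

```ts
// Reading open-weight MoE model names like "ERNIE-4.5-300B-A47B":
// the first number is total parameters, the A-number is how many
// parameters are active per token during inference.
function parseMoEName(name: string) {
  const m = name.match(/(\d+)B-A(\d+)B/);
  if (!m) return null;
  return { totalParamsB: Number(m[1]), activeParamsB: Number(m[2]) };
}

console.log(parseMoEName("ERNIE-4.5-300B-A47B")); // { totalParamsB: 300, activeParamsB: 47 }
```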

22:03

It talks about how they scaled the infrastructure and then how they did post-training, and then of course it has some benchmarks. The interesting thing is that the 300B model surpassed DeepSeek's 671B model on 22 of the 28 benchmarks. Now, there are 47 billion active parameters in this model compared to DeepSeek's

22:29

37 billion active parameters, but it is, you know, fairly impressive, right, as far as open-source models are concerned. Yeah, totally, it compares pretty favorably. Baidu's AI studio, they're Chinese by the way, of course they are, and their logo is a paw print, which also supports my theory that all Chinese AI studios will be like

22:58

animals and all American ones will be buttholes. If you haven't seen the meme about how every AI logo looks like a butthole, once you see it, you can't unsee it. Can't unsee it. You know, if there's one thing that you

23:17

want to tune in to us for, it's our conversation about AI logos, what and what they represent or look like. So, I like these benchmarks, though. Um, well, you know, benchmarks.

23:29

Yeah. Yeah. To be honest, there's benchmark overload, because there are new benchmarks coming out constantly, and ultimately, if you're developing a model, you also usually develop certain benchmarks that make your model look good. It's not always the case, but

23:48

obviously a lot of these are just standard benchmarks that have been run before. So that's good. You want to see them, you know, fare well on different benchmarks.

23:59

But ultimately, you know, I think it comes down to you'd want to test this and see how it actually compares for the types of tasks that you have your, you know, applications or agents actually doing. There's so many models coming out. I want to build one maybe like a little like hello world one.

24:17

Yeah. You want you want to train your own like small small model? Yeah, maybe on MRA or something. That'd

24:23

be actually a good live stream, too. I don't know how to do it. We'd have to figure it out. Yeah, we need we need one of those like

24:29

Andy can come help. Yeah, maybe Andy or, yes, some other machine learning person can come help us do some fine-tuning at least. That'd be cool, too.

24:41

We could fine-tune a Qwen model or something. I actually saw Professor Andy on Friday, or last week. Yeah, someday. We need to get him. We'll bring Andy back in for, what do we

24:54

call the segment? Uh, Builders Learn ML. Yeah. He's actually been doing some cool stuff. He could teach us some stuff.

25:00

Yeah. Yeah. Last time we talked about reinforcement learning and things like that. So we'll get Andy back on. He'll teach us non-ML peeps some ML.

25:14

Yeah. All right. And I'm gonna bring Paul on for this next one here. And he doesn't

25:20

know what he's getting into, but welcome. What up, Paul? Hello. How's it going?

25:28

It's going. It's going. Happy Monday. Good to see you. Monday. Yeah, good to see you. I'm glad you're

25:33

bringing me in for a news item to have an opinion on rather than the previous conversation about American AI company logos, because I don't have an opinion on that. Dang it, we missed out on a funny opportunity there. But yeah, so this one came across, I think, about last week and was interesting in my opinion

25:58

because I think I have a point of view on it, but I'm curious, you know, what your point of view is, Paul and Obby. So Cloudflare released Pay Per Crawl. It enables content owners to charge AI crawlers for access.

26:18

So if you read this um the idea is you know all these uh model companies are taking content from the internet using it for their training and the authors of that content are not getting any kind of royalty or any kind of payment for basically providing the training data for all these models. So the idea is what if you

26:40

could... what if there was an option for you, when one of these model company crawlers comes to your site, to say: you can have this data for this price if you want to crawl it. And then, through some magic behind-the-scenes action, you get paid whatever your price is for that content. They can decide if they want to pay that price or not, or they can move

27:02

on. I'll leave it at that. I have strong opinions on this, but I'm curious what you all think. And if you're in the chat, let us know. What do you think of this? Is this a good idea?

27:14

Go for it, Paul. Oh, I'm not sure which which side to come down on until I've heard your your point of view. And I don't want to I don't want to bias you on on what I think about it. Um

27:27

well, I guess first and foremost, so long as it doesn't change the reader's experience. And perhaps there is an opportunity here to improve the reader's experience, because it might mean that we would reduce the amount of ads that need to be on a site, which might help fund content publications. And I guess you

27:46

also need to consider the different types of content publications. Some aren't commercial. My blog, for instance, I don't earn any money from that. I just do it because I like to write, and so it really doesn't bother me. The whole

27:58

idea of me writing and putting my stuff out there is so that it can be shared. But that's not always the case, and there are obviously completely commercial aspects to content publication. And, well, none of these business models involve giving content away for free. They monetize it in

28:18

various various ways. One being ads. And if LLMs are scraping content, they're not going to be susceptible to ads. Then I think being able to monetize it in a different way is a good thing. And I

28:31

think if we don't, how are these content publications able to survive? Which ultimately means we're potentially shutting down the freedom of sharing information. So I think it's a small thing to pay for. No one's probably going to be impacted by it. And it means that things that are in place and live

28:49

can remain in place and live. It's very poetic in terms of the response. Um, for me, if you think about a utopian world of agents and stuff, you know, every agent will have a Ramp card with a budget. If it's crawling the web and hits some paywall, whether that's the New York Times or whatever, or any paywall, if it needs to buy something to accomplish its task,

29:15

I can see a world where that is true, right? I'm pretty sure we all can. The problem with something like this, and I think Cloudflare did that with like the password or the bot protection and all these different types of services that they have, is they take over the content experience, right? Have

29:32

you ever seen the "you're not a robot" Cloudflare check on every website, you know, where they try to verify you? Um, so if you're down with that, that's cool. You can have that. Um, but then, yeah, what if you're

29:45

not making any money from your content? Are you now gonna get money from it because you wrote something for free that is then, you know, sold? Yeah. It's just complicated, dude. But

29:57

the mechanism is tight and I believe in the mechanism. Like, you should be able to buy stuff that is for sale or that paywalls you. You have the choice to go through the paywall or not. I don't have any other political opinions about

30:12

the content laws or anything. Yeah. So here's my opinion. So, I am typically an optimist. So, I'm gonna say

30:18

that, like, it's easy to be a critic, but I don't believe this shit's gonna work. And here's why. If you look at just the content web of what people are posting, most people do not pay for content online. Some do, and there

30:34

are some publications that have made it work, but in general, most people, I mean, do not typically pay for content. When you go to a search engine, you expect that content to be there and if it's not there, you don't ever read it. You probably don't ever find it. I think

30:47

the same is going to be true with ChatGPT. If someone wants to charge, you know, a crawler $4 to crawl their content, they're just going to say, "No, I'll just get the content from the next person that didn't charge, that has similar types of content, and train on that." And then when you ask ChatGPT a question, you're going to

31:06

get that other person's content, not the one who wanted to charge the $4 for the crawler to access their stuff. That's my opinion. I wouldn't mind being wrong, because I do want content creators to be monetized for the creation of their content. I think we all do. You know, it is a lot of work. If anyone's ever written anything, it's not easy to do, but

31:28

especially if you want it to be good. It's like that's a lot of work to actually put together high quality content. And one of the challenges with AI is it just takes your content and summarizes it often. Doesn't actually like deliver the direct content. But I

31:41

would say a lot of these models, you know, do provide sources now. They're getting better at providing information on where they got that information. So it does drive people back to your website, just like search engines often did. You know, of course, every business is going to allow their content

31:58

to be crawled freely, but we see quite a bit of traffic coming from ChatGPT on the Mastra website, for instance, right? So, you know, ChatGPT is recommending Mastra for some question, and they're citing the source, and then people are clicking through that source and getting back to Mastra. And I think that if you turn that off, even if you

32:15

want to charge a nominal amount, which you think is a small amount, I think the inconvenience of it and the amount will most likely lead to you just getting ignored. So that's my take. I don't think it's going to work. I would love to be wrong. Isn't that the lens of a small content creator, though? Like, the people who

32:35

charge paywalls are like big publications, or if you're on Medium or something, or you have a paywall. So yes, great. That's actually a good point. But here's the thing. They're not going to use Cloudflare's thing to do

32:47

that. If you're a big content creator, you're going to do a deal directly with the model company, right? Like, once you hit that level. So maybe there's this middle ground of people that are big enough but can't pay the lawyers to do the deal with the big model company. Maybe there's some middle ground where some people do it, but it's also a huge

33:04

risk if you do that as a kind of mid-tier publisher and then nobody that uses these AI tools ever gets your stuff. Yeah. I'm pretty sure these models were trained on a pro version of the New York Times, right? Like, they weren't just crawling and stealing information that was behind paywalls to train their

33:23

stuff. Maybe they were. Or maybe it's like the no-name creators that are the ones getting jacked.

33:28

Yeah. They put it out for free. There have been some legal cases where, you know, like Reddit made a bunch of money from their content with some model providers. So the bigger companies are already getting paid

33:40

because they're bringing suits, you know, lawsuits, into action and, yeah, going through the process. But anyways, for this specific thing, I don't think it's going to work. Maybe something like it emerges. I would love to see it. I don't believe it's going to work. I don't think it's going to help the

33:59

little guys out. Maybe the mid-tier players can get some value out of it. I don't think the large people, the large publishers are going to care because they're going to do their own thing. That's my take.

34:11

Yeah, they should have like an HTTP header for agents, and then anybody could do whatever they want when an agent crawls your site. That should just be a standard, and then everyone can figure out their own approach. They could figure out what to do.
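A hypothetical sketch of that "header for agents" idea, not Cloudflare's actual Pay Per Crawl implementation: the header names, crawler list, and price are invented, and HTTP 402 (the reserved "Payment Required" status) is used as the signal.

```ts
// Hypothetical "charge the crawler" handler -- just the shape of the idea,
// not a real standard or a real product's API.
import { createServer } from "node:http";

const CRAWL_PRICE_USD = 0.01; // what we'd ask per crawled page (invented number)

const server = createServer((req, res) => {
  const ua = String(req.headers["user-agent"] ?? "");
  const isAICrawler = /GPTBot|ClaudeBot|Google-Extended/i.test(ua);

  if (isAICrawler && !req.headers["x-crawl-payment-token"]) {
    // 402 Payment Required: tell the crawler what access would cost.
    res.writeHead(402, { "x-crawl-price-usd": String(CRAWL_PRICE_USD) });
    res.end("Payment required to crawl this content.");
    return;
  }

  // Human readers (and crawlers that presented a payment token) get the page.
  res.writeHead(200, { "content-type": "text/html" });
  res.end("<h1>The article</h1>");
});

server.listen(3000);
```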

34:29

All right. Well, that was the AI news. We do have uh we do have a couple comments here. So from Vamp

34:37

I used to do a bunch of Drupal videos and stuff. I started my open-source career doing Drupal stuff. But Vamp, if you are looking into Crew AI, you should just look into Mastra. If you were doing

34:48

Drupal, it means you know some PHP, and you maybe did some JavaScript and some jQuery on the front end. TypeScript is closer to JavaScript than Python is. So, you should just use Mastra, but cool to see you. And yeah, it's definitely a throwback. It was like 14 years ago when I started that, I think. But

35:11

I think a lot of people can probably relate to this, though. You may be a web developer coming from a different area of expertise. You know, we kind of joke that we're not machine learning experts, because we're not. We were web devs, right? I think that's our background, everyone on here's background. So, I

35:29

think a lot of people can relate to that. All right. Will you guys be doing any events or talks in the Bay Area anytime soon? Yep. That's all we could say, I guess.

35:41

Yeah. What What do we got on the What do we got on the docket? I mean, we we're kind of all over, but I don't know if what we have actually planned in the ne in the near future. I think the biggest thing is maybe MCP

35:52

night when we get approved or not. Yeah, we might be. Yeah, we might be at we will be at MCP night. We may be

35:59

speaking at MCP Night. So that'll be one event from that perspective. We're at like every event you could think of, you know. We're there in some capacity. Yeah. If there's a big

36:11

AI event, you will very likely see someone from Mastra with some books, you know. Yeah, we bring these everywhere. So if you want a book, find the big AI events and you will find us. If you see this book at a meetup that you're at, we were there.

36:31

We just missed you or something. Yeah. Although I have heard people say they uh just randomly find books, you know, at some of these like leftovers that didn't get picked up at some of these events.

36:43

So, we may or may not have planted those. Most likely we did. Yeah, but no one will know. All right, and with that,

36:55

we are gonna... even Paul's got a copy. Everyone. I traveled with it. I'm going to start leaving it in various Airbnbs. Hey, if you're watching this, just leave us a comment here in the chat and I will find a way to get you a book, if you want a book. So

37:11

yeah, if you want a book, if you're one of the people watching this right now and you leave a comment live, I will find a way to get you a book. Now, the caveat being there are quite a few countries we can't ship to. So we can ship to quite a

37:24

few countries, but there are many that we can't. So yeah, I'll send you a book if I can. Otherwise, you can of course always go to this link here and get the digital copy. So if you just want to read it, see what it's about, it's not all about Mastra. It is actually just about general

37:45

building AI agents and AI applications. It does show a little Mastra code, but it's mostly just illustrative, for example purposes, of how to think about things. All right, Paul, we wanted to bring you on to talk about docs. So what do you have for us today?

38:02

Um, a couple of things actually, mainly around playground and deployment. Now, it's not to say that we've seen anyone having difficulties with this, but when I was looking over the docs, it was something that I wanted to make sure was absolutely crystal clear. And I think I've mentioned this once or twice before: Mastra is almost too

38:22

flexible. Like you can deploy it anywhere. Um and it's sort of coming out in the docs the more I look into this.

38:29

I'm like, all right, yeah, you could do it that way as well. Um, so I wanted to just show what I would consider to be the happy path for deploying a Mastra application, but we will also cover some other bits and pieces in the docs if you don't deploy to Mastra Cloud. Um, so let me get some screens going here. Uh,

38:51

let's see. How's my zoom, Shane? Up a little bit? Yeah. Whatever you think. One more click

39:05

is what I always recommend. That's all right. Yeah, that's all right. Cool. So, in the docs we have

39:12

or we've always had actually, but it's just been revamped, is our Mastra Cloud section. It's currently in public beta, but you know, you can sign up and get going and it works and it's cool. And I've been through and documented and illustrated the different features that we offer.

39:30

Um, it is, I think, the easiest way to get up and running with Mastra. Just hook up your GitHub repository, click deploy, enter a couple environment variables, and you've now got exposed endpoints. We've got a full step-by-step now for how to sign up, or sign in rather, whether that's going to be with GitHub or Google.

39:54

Connect your Mastra app to the repo, create a new project, import your Git repository, configure it, deploy, and away you go. You've got Mastra up and running with the playground. So that leads me on to the next thing, which is a revamped local dev playground.

40:14

So I'm sure if you've built a Mastra application, you've run npm run dev and you'd be looking at a playground. This gives you the opportunity to add inputs and various things, and you can see how agents, tools, and workflows work in your application. We've also covered that here. Oh, there's you, Shane. With some little

40:39

videos just to show, you know, this is what happens in the Mastra playground, which allows you to test locally. There's an example of an agent. You can see the response. And this is what

40:52

happens with workflows. This one is showing a workflow with branching, which I think looks really cool, and it allows you to visualize the steps or the decisions that are made in your workflow, and to inspect each one and check the input and output.
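A minimal sketch of the kind of Mastra setup the local playground picks up, in the spirit of the getting-started docs; the exact import paths, model id, and the example agent itself are assumptions that may differ between versions.

```ts
// Minimal Mastra entry point -- running `npm run dev` (the `mastra dev`
// command) serves the local playground where this agent shows up.
// The agent below is a made-up example.
import { openai } from "@ai-sdk/openai";
import { Agent } from "@mastra/core/agent";
import { Mastra } from "@mastra/core";

const docsAgent = new Agent({
  name: "docs-agent",
  instructions: "You answer questions about the product documentation.",
  model: openai("gpt-4o-mini"),
});

export const mastra = new Mastra({
  agents: { docsAgent },
});
```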

41:09

Can I interrupt you for one second, Paul? Because we got a question, and I think what you're illustrating here is part of the answer to that question, if that makes sense. Okay. So, the question is, can you talk briefly about the difference between Crew AI and Mastra? And so, I want to take a detour just for a second. One, Crew AI is

41:29

Python, Mastra is TypeScript. That, by and large, is going to be a big difference, right? You're choosing a different language for how you want to build agents. Crew AI started with this idea of multiple agents working together, and then they eventually added flows because they realized that you do need workflows, which you kind of just

41:47

demonstrated a little bit there, Paul. But you need more definitive workflows when you're dealing with LLMs and agents today. So Mastra has a lot of the same things: we have workflows, we have the concept of agents, we have the concept of agent networks, or groups of agents working together. One thing that Mastra has that

42:05

Crew doesn't have, at least as far as I'm aware (I haven't used it in the last few months), is this visual playground. And to be honest, one shameless plug of why we came up with this playground is because I was a Drupal developer back in the day and I liked

42:26

the admin that Drupal gave you where you can install modules and piece things together. And I know a lot of you probably have never used Drupal, but you've probably used WordPress, which I think Drupal is technically superior to WordPress, but WordPress is way better at marketing. And so, uh, but a lot of the ideas are you want this visual

42:44

testing ability, this way to kind of see things and verify that everything is working. And that's one thing I think a lot of people say: the playground is what kind of connects it all together. So, you write your code, but you can at least test and verify things in a visual playground. And that's what Paul's

43:00

showing here. So, momentary detour, Paul. I'm going to hand it back over to you. But I think that was like what

43:06

you're showing here is one of the biggest differentiators between us and a lot of the other, I guess, frameworks that are out there. Yeah, no worries. And I agree, you know, the local dev playground allows you to fully build agents, tools, and workflows without ever touching your front-end application. You don't need to introduce any UI elements to be able to have inputs that

43:30

work, and you can then inspect the way that your agents work. Now, the reason I wanted to show this and Mastra Cloud is that on Mastra Cloud, you have the same thing. You can then test your workflows in what would be, I guess, a production environment, but on Mastra Cloud. So what you have locally is what you also have when the agent's

43:56

deployed. Now, the last thing I wanted to cover, in other areas of deployment, is that we do have our own deployer packages which allow you to deploy your Mastra application to, let's say, Vercel. It's very straightforward now; we've removed all of the config in your Mastra instance. You literally just do new VercelDeployer,

44:19

install the package, and that will allow the same CI, git push and whatever, and you can deploy your Mastra application to Vercel. But the thing you don't get when you deploy to other platforms, and this is the same if you went with any of these other cloud providers, is you don't get the playground.
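A rough sketch of the deployer setup Paul is describing; the @mastra/deployer-vercel package name and the zero-config constructor are stated from memory of the docs and may differ by version, so treat this as illustrative rather than exact.

```ts
// Illustrative only: swapping in the Vercel deployer on the Mastra instance.
import { Mastra } from "@mastra/core";
import { VercelDeployer } from "@mastra/deployer-vercel";

export const mastra = new Mastra({
  // ...agents, workflows, etc.
  deployer: new VercelDeployer(), // no extra config; git push and CI do the rest
});
```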

44:36

So that is one huge advantage for me: one, Mastra Cloud is super simple to set up and deploy, but you're also effectively getting the exact same thing in the cloud that you had in local dev, which allows you to better test how your application would work in the real world. You know, you're actually talking about proper network requests and all the rest

44:53

of it. Um, so that was kind of where we were at today with the docs. Um, there are some other things that people have asked about that we've also introduced. Um, one thing that was maybe slightly

45:05

confusing for folks is when you integrate Mastra into an application. So we do have this frameworks section where, if you wanted to add Mastra to a Next.js application, you wouldn't need to do anything in addition to deploy it. This would just deploy alongside your Next.js application, and Vercel knows how

45:26

to handle it. So, lots of options, whether it's a standalone Mastra application or an integrated Mastra application. You can deploy it in a number of ways, but I think, yeah, I'm going to champion Mastra Cloud just because you get the beautiful playground experience everywhere,

45:46

dude. Sick. Good work on those docs, too. Thanks, man.

45:51

Yeah, anytime we add good docs, we get a reduction in questions about those topics. And I haven't seen a lot of questions about them, which is good. And then, dude, Mastra Cloud's looking

46:04

pretty pretty nice. That look pretty good. Yeah, looking pretty good over there.

46:09

The team's been cooking, you know. Yeah, it used to be our unwanted child around here, but now we love it just like everybody else. Yeah. Yeah. We kind of focused so much on just building the framework that

46:21

cloud had fallen a little bit behind, but now it's looking good. We're cooking. We're cooking. All right. Anything else you wanted to chat about today, Paul?

46:32

Um, no, I will be staying tuned though for your evals deep dive, because that's something that's interesting. So yeah, you can hurry me along, hurry me out of the chat. I've done my docs thing; now on to more interesting things. Well, for those of you wanting to see

46:50

what Paul's up to, you can follow Paul on X or Twitter right there on the screen. And yeah, well, we we will uh bring you on again soon, I'm sure. Yeah. Yeah. I'd love to come back. Thanks so much. All right. Thanks, Paul.

47:06

Later. Bye. All right, dude. It's game time.

47:13

124 people in here. Thanks for being here, guys. Yeah. Thanks for tuning in.

47:18

We have a few people that want some books. You know, Ben, I'll see what I can do. I I'll I'll find a book. You get a book. You get a book. So,

47:30

Ansuman, I can't send you a book. I'm sorry. Unfortunately. Unfortunately, I cannot send you a book.

47:37

Move to London. But you have the digital copy. Someday we will figure it out, and when we do, we will let everyone know. But India is one of the

47:48

countries that is not quite as easy to ship to, and we have had a lot of people ask, so we'd, you know, love to figure it out. But all right, so I guess we should talk evals. Maybe we set the stage, and then I do have kind of an FAQ blog post that we can read through together, because I think it's probably good information, and then we can dive in. I have a lot of

48:14

spicy takes on this Um, so I'm ready to go. Um, okay. So, here's the thing. Um,

48:27

Evals aren't like tests anymore. That's one thing. Um, I truly think it really is an observability thing. It's like a metric: you're collecting metrics, and then based on

48:43

metrics, you are then going to change something in your application. So this becomes more of a reliability and observability problem than some agent thing. And if you look at the market now for evals, there are a bunch of people providing eval tools, frameworks, products. And when you get down to it, and I don't

49:07

want to discredit any of those, because you're going to need tools to really do it at scale, but you don't need Big Eval to do evals. And that's my new statement. Big Eval doesn't want you to know that you could just do evals yourself in the beginning, right? You don't actually need to use anything to

49:25

do evals. You could do it yourself if you really want to. And I'm sure, you know, Hamel, whose blog post we're going to see, also has a lot of that kind of vibe as well. It's learning the fundamentals of how you score

49:36

something and then what to do with the data after. I think right now all these eval products are becoming like SOC 2. You just have to have it and you check the box off. You know,

49:48

do you have evals? Yeah, we use this product, we use LangSmith or whatever. And then maybe you don't ever even write actual evals that are good. You just have it so everybody, your investors, your customers, are like, "Oh,

50:01

do you have evals for your product?" and you say yes. That's my take right now. Sorry.

50:08

Big Eval, dude. I'm anti-Big Eval now, until we become Big Eval and then I'm good with it again. Yeah. Well, we will be the open version of Big Eval. Open Big

50:21

Eval. We'll be Open Eval. I think that's already a thing. You'll have to find a different name.

50:27

That's true. That's like That's Matt PCO's thing, I think. Yeah. Yeah, I don't know. There there's

50:33

something called Open Evals. We'll have to come up with a different name. I'm pretty sure that's his thing. He probably had the same vibe as us. Like, yo, what's all these Big Evals trying to

50:39

do stuff? Open eval. What is that?

50:44

Yeah, maybe it's LangChain. One of these got it. Yeah, someone coined it first.

50:50

Someone got it. But yeah, should we maybe just browse through this post a little bit? I think it's it could be useful and I can talk to it too. Sure.

51:02

We kind of set the stage here. You know, we'll go through it; it kind of reads like an FAQ, I guess. I mean, it's kind of long, but it was just published yesterday, actually. So, if we were to read a blog post on a live stream, this is probably a good one to read. We won't read it all, but yeah, that guy is cool, too.

51:26

Yeah. Yeah. Hamel's cool. All right, so let's go through the first

51:32

question. Is RAG dead? Let's click it up or, I mean, increase it. Oh, yeah. Yeah.

51:43

All right. All right. Is RAG dead? Should I avoid using RAG?

51:52

Oh, man. It's funny. This is where all the stuff started, right? Where people are starting to do evals.

51:58

Um, I don't think he thinks RAG is dead. Yeah, I mean, this is kind of the important part. I think RAG has been a buzzword and everyone thinks RAG means vector search. You have to use

52:11

vector search. But RAG is much more than that. It just means retrieval-augmented generation: you're going to retrieve data and feed it into the context. Even if you're using giant context windows like Gemini's,

52:24

there are still times when you're going to want to control what context goes into that, and that is going to be some kind of RAG system. So RAG isn't dead. You don't always need it, but I think it's become a little more nuanced. So I think the follow-up... I agree with this. Yeah. So rather than avoid it or embrace

52:41

it, maybe you just need to experiment with how to retrieve stuff, whether that's just from a database doing a keyword search or whether that's using a vector DB. Ultimately, you'll have to do some experimenting, do the legwork, and figure out when RAG might be useful.
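A minimal sketch of "RAG just means retrieval": a plain keyword lookup feeding retrieved text into the prompt, no vector DB involved. The docs array, the naive scoring, and the model id are stand-ins for whatever store and model you actually use.

```ts
// Keyword-based retrieval feeding an LLM call -- retrieval-augmented
// generation without any vector database.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const docs = [
  { id: "pricing", text: "The Pro plan costs $20 per seat per month." },
  { id: "limits", text: "Free accounts are limited to 3 projects." },
];

function retrieve(query: string, k = 2) {
  // naive keyword-overlap score -- swap in vector search only if you need it
  const terms = query.toLowerCase().split(/\s+/);
  return docs
    .map((d) => ({ d, score: terms.filter((t) => d.text.toLowerCase().includes(t)).length }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ d }) => d.text);
}

export async function answer(question: string) {
  const context = retrieve(question).join("\n");
  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt: `Answer using only this context:\n${context}\n\nQuestion: ${question}`,
  });
  return text;
}
```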

53:00

And then, yeah, maybe not everything is RAG. I think before it was like everything needs to be RAG. Now it's a little more nuanced, but it's definitely not dead. The problem is, now that RAG is a

53:12

buzzword, it harms other things that you do, because if you're just doing vector search, then maybe you shouldn't call it RAG. It's just that you have a vector search tool to do that. But if you think about LLM as a judge, it itself is RAG, because it needs to retrieve the input and

53:30

output from the agent call. It needs to augment it by passing it to the context window that it has. And then it needs to generate scores and prompts. So if it wasn't a buzzword, we could say everything is RAG

53:44

in a lot of ways. Yeah, in a lot of ways, but because it's a buzzword, we can't. So maybe it's "extract-augment-generate": you extract data, you augment it, and then you generate it. Um, but anyway, that's another thing. Yeah, we'll talk about that. I

54:02

think in some ways it's like the new term. First it was prompt engineering, and then it was okay, RAG, and then everyone started talking about evals being important, agents, right, it got all these buzzwords, and then the new one that's taken off is context engineering. And it's like, okay, well, RAG's a component of context engineering, prompt engineering is a component of it,

54:26

you know, you have all these things that kind of feed into it, and context engineering is now the broad term that'll eventually become overused and everyone will hate it. But that's just the way that these cycles end up going. Maybe it's working all the way back up to AI engineering. Yeah, exactly. Yeah. Everything is

54:45

just... or even broader: everything's engineering. Everything is just... Yeah, David would like that one for sure. Yeah, everything is just engineering. It's all just engineering. All right, so let's move on to the next one. Can I use the same model for both

54:57

the main task and evaluation? So this is for LLM as judge. Obby, for people who don't know, what's LLM as judge? This might be new to some people. So LLM as judge is when you take the

55:10

output of your agentic process. It's not just an agent call; it could be anything that uses LLMs. You take the input and output and all the options

55:23

and all that stuff. And then you take another LLM, it could be the same model, and you give it criteria to score that output based on the input, given good examples, bad examples. Or, another way to put it, you could score it however you want, right? That's how it should be, and that's up to you as a user. But there are these things called off-the-shelf LLM-as-judge metrics,

55:50

where they do specific things, like they test for hallucination. That's like a topic area in an auto eval. There's answer relevancy, factual correctness, tool call correctness, etc. All this stuff exists. Whether they matter to

56:05

you, I don't know. Probably not, or probably, because every agent is different. And if you read, you know, their response here, I think for the most part I agree. It basically says you can use the same model. It's typically fine because the judge is doing a different

56:24

task. You have it do one thing first, whatever your application's doing. It's doing one thing, and then you're having it evaluate, which is a completely different task. So it's not like it's going to

56:35

necessarily be too biased, you know, like thinking its own response is better for some reason. You can use a different model, of course. I think this is the statement that I would 100% recommend: start with

56:47

the most capable models. Just pay the money, use the best model for evals at the beginning, and then over time, once you have evals, of course, maybe try a lesser model to be the judge. But ultimately, you've got to think about it: if you're having someone judge output, you want the smartest person, or the smartest model, deciding if it was a good response. One thing they

57:10

mention here that's important: the judges they recommend are scoped to binary classification tasks, which means, I think, Hamel's general advice is it should be a one or a zero. Either it passed or it didn't. And the reason is you can then easily see. Now, some people don't like that. Some people want a range. And so, you know, at Mastra we give

57:32

you the ability to do either. But Hamel's take is it should be a zero or a one. You either pass or fail. And the idea is then you can decide if, you know, you have a higher

57:42

rate of passes or a higher rate of failures, rather than doing some kind of average score. Zero or one is good, or a zero-to-one scale, because the thing is, and I'll say it again, you should be able to score things however you want, right? And that's on the author of an eval. So

58:00

yeah, like yes or no, cool, or you could do whatever you want. Yeah. Which I know is not a great answer, but it's the truth. Yeah, I do think, if you were to start, having it be yes or no are the

58:12

easiest evals, because you as a human could also verify whether you agree with the LLM as a judge or not, right? And I think, if you are starting, this is probably the best place to start: is it a pass or a fail?
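Putting the LLM-as-judge description above together with the pass/fail advice, a rough sketch might look like the following; the criteria, model id, and schema fields are placeholders, not an off-the-shelf metric.

```ts
// LLM-as-judge with a binary verdict: another LLM scores the agent's
// output against explicit criteria and returns pass/fail plus reasoning.
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

export async function judge(input: string, output: string) {
  const { object } = await generateObject({
    model: openai("gpt-4o"), // use a capable model as the judge, at least at first
    schema: z.object({
      pass: z.boolean(),     // binary: did the response meet the criteria?
      reasoning: z.string(), // optional explanation -- useful, but up to you
    }),
    prompt: [
      "You are evaluating an AI assistant's answer.",
      "Criteria: the answer must directly address the question and stay grounded in the provided context.",
      `Question: ${input}`,
      `Answer: ${output}`,
      "Return pass=true only if the criteria are fully met.",
    ].join("\n"),
  });
  return object; // e.g. { pass: false, reasoning: "Answer ignored the context." }
}
```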

58:30

The other thing that Hamel recommends, which I don't 100% agree with, is, you know, not using the model for open-ended preferences or responses; basically, they don't recommend having a model provide context around whether it was a good response or not. I think there's some value in seeing some reasoning around it. So, you know, we like to provide that as an option, but again, you don't have to use it. It's

58:52

kind of up to you how you want to build your evals out. And reasoning doesn't come in the same LLM call. It's a separate one. So I could see how it could be biased if you

59:04

put everything in the same prompt. But anyway, how much time should I spend on model selection? I don't really agree with this, because everyone would be changing models every day and then they'd be complaining to us about why their stuff's broken or whatever, and they're not measuring things after they

59:27

change, or before they change, or when they upgrade. And so if you were writing evals and collecting data, you could not necessarily spend time on model selection, but you could start having data for it if you start changing, and then you can decide for yourself. Yeah. Well, I think you actually maybe agree, because I think that's what they're saying: does your error

59:47

analysis, meaning your data, suggest that your model's a problem? Then maybe change; otherwise don't mess with it. Yeah. But oftentimes I don't know if it's the model, or maybe the model just hides imperfections that you have running in your own system. Maybe, I don't know. Yeah. I mean, I think the short answer is you need

1:00:05

data. So yeah, if you're just changing models and trying to see if it's actually better or not, I think before you fixate, as he said here, on selecting a new model, you should probably just make sure you're doing evals and you have the data before you worry about trying different models. Focus on that first. What's that thing we heard during YC,

1:00:29

where, and I don't know if we knew anyone that did it, but startups are waiting for new models to start doing feature work or something, because they believe in, you know... remember that stuff? Yeah. Yeah. It's like the next model change is going to unlock new capabilities, and so they're just

1:00:46

waiting on the models to get better before their, you know, technology really works, and if you have the luxury, more power to you. Yeah. I think you shouldn't wait; you can build a lot of cool things with the models that we have today. And honestly, another hot take is I

1:01:03

think, to me, it seems like the model progress is starting to slow down. I think we're still making improvements. Some of those improvements are great, because every new improvement unlocks new capabilities.

1:01:17

But I do think that in general like the gaps of of the jumps so far have started to slow a little bit. Now, maybe there's a new model right around the corner that's going to blow people away. And I think every time there's a new model, there's something new that is pretty exciting, you know, like it is exciting, but I don't think we're having these

1:01:36

quantum leaps anymore. But every new small increment does open up like new things that could be automated or improved. So, I think even small changes are good. I just don't think we're having as big of progress as we were potentially two years, a year and a half

1:01:49

ago. We need quantum leaps. There better be some quantum leaps going on, man.

1:01:56

So, there's a question: isn't LLM as a judge a bit counterintuitive if the domain is not aligned with the LLM and we give that control to the LLM to determine a score? Indeed, that's why off-the-shelf evals may or may not be good for your agent, because they're checking some general thing.

1:02:15

Um, yeah. And I think the idea is, that's why you pay for the more expensive LLM, right? You use the best model because, let's say it's a legal review, those models will typically be trained on a larger amount of data, have more parameters, and will likely do a better job than a smaller model,

1:02:42

but it might not be specifically trained or fine-tuned on whatever the domain is. So if it's legal, or law, or accounting or something, you could eventually get to the point where you fine-tune a Qwen model or some smaller model on that corpus of data and have that be your judge, and you might get better or more accurate

1:03:03

results. So, essentially, you're kind of fine-tuning or training your own judge. So, yes, I think there there are ways to probably improve it, but once you get into that part, like that is still very expensive. you probably should make sure

1:03:16

you just have basic evals first; a more expensive LLM can actually give you reasonable results at understanding the context. But yeah, that is definitely something to think about. Yeah, big eval wants you to use the most capable models. I'm sorry, all the model companies want us to use the most

1:03:40

expensive model for sure yeah big companies want big evals to tell you to use all the big models. All right. Should I build a custom annotation tool or use something off the shelf? I disagree with this one as well, but

1:03:55

yeah, I think there's a balance. My take is you shouldn't need to write everything from scratch, because you're just reinventing the wheel. But you shouldn't assume that you can just take all these off-the-shelf evals and everything just works perfectly.

1:04:14

Exactly. My take, and it's a biased take obviously because, you know, we're the Mastra guys: you should use a lightweight framework that provides you some structure so you don't have to think about the structure. Maybe you can learn from some of the off-the-shelf

1:04:31

evals because you may want to see how they work and see how the data is structured, but then you should probably write custom evals that are specific to whatever you're trying to accomplish. So in the case of whatever it is your application is trying to do write evals that correspond to that specific task

1:04:50

and then you know so custom evals with a a lightweight framework that kind of helps you like not have to worry about the structure or data storage or all that. That's that would be my take. His question though is about the annotation part.

1:05:07

Yeah. Yeah. Yeah, cuz I Well, I mean, I think the the suggestion is his suggestion is you build your own UI annotation tool or use a spreadsheet or whatever. Yeah. I mean, I I just think there's

1:05:18

again you're you're kind of reinventing the wheel. It's the same same take. It's just you ultimately if you have if you use something that's lightweight that provides some of the some of the mechanisms to do the annotation, you don't have to worry about it. You should still focus on building your custom

1:05:34

evals, but then you don't need to invent you don't need to reinvent the wheel. Yeah. I mean, big eval definitely wants people to use off-the-shelf tools because that's how they get paid, right? They're putting a UI around a database that stored all these results for you.

1:05:49

But what if it was open source and you just use it? That's our take, you know, because who wants to build another annotation tool? It's all the same You're just adding comments to rows. you know, you're doing some stuff asynchronously that the framework would

1:06:02

have to be linked into anyway, you know, so might as well just use something that is designed for open source. And ideally, that annotation tool would allow you to like export the data if you needed to or give you access to the database. So, you could do other things if the tool doesn't provide you all the capabilities to do it yourself. So,

1:06:19

yeah, it's like, you know, you can make Excel do a lot of things, but not everything needs to be in Excel. Or you can obviously spin up different tools, vibe code different tools, but again, now you have to maintain that. I think people see how easy it is to

1:06:37

build something but then now there's the maintenance cost of okay some product manager is going to want a new feature to this annotation tool and now someone has to build that new feature. Do you want to be building features for annotation tools or do you want to be like focused on building whatever your business actually does? Do you think people are not thinking about the

1:06:54

maintenance cost of stuff? I don't think Yeah, I think they're like, it's easy to build, but yeah, unfortunately, it doesn't always mean it's easy to maintain because maybe the AI, you know, the the tools can help you maintain it a little easier. I would agree, but you still have to have someone think about the maintenance of that.

1:07:10

Yeah. Also, who has time to add features when you're doing other stuff too? I mean, you're trying to build a business; you're not trying to become big eval yourself, you know. Yeah. Yeah. I mean, obviously there's always the build versus buy, right? You always have to think about that. Sometimes it is better to build

1:07:28

things. In this case I don't typically think it is but all right. So why do you recommend binary pass fail evaluations instead of one to five ratings?

1:07:45

Yeah, makes sense. It is subjective. Yeah, I mean, yeah. No, go ahead. That's why maybe you shouldn't assign

1:07:56

scores and like maybe you should assign like qualitative labels, right? So like good, fair, or whatever, you know, great. Then you can like turn those into numbers if you want or you can use those as labels or something like that. It is more complicated than yes no for sure.

1:08:17

Yeah, I think yes, no is easier, but ultimately, let's say you do like a one to 10 scale, you end up with essentially you end up with it being like yes, maybe or no, where it's like kind of like your if you're familiar with like net promoter score, right? Like I don't I don't even remember what is like eight, nine, and 10 are like your promoters and

1:08:35

then anything you know less than a certain number is like your detractors and then the middle are just the middle. Like so you kind of still build like you would then kind of decide for yourself what those ranges are of what you consider a pass or a fail. But so I think you you'd eventually get there anyways. But

1:08:54

when in doubt if you're just starting out just yes or no is probably simple. Yeah. If you can't determine between like a score of three or four on a like a five scale. Um maybe that's okay if

1:09:06

you're getting a lot of fives, you know. But you don't want a lot of twos and threes, or maybe you get a bunch of threes and fours; you're going to have to dive into those. Because once you're done with the eval, it's not like the work is done. You have to maintain it and keep looking at how the numbers change, because now it's like an ops problem.
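As a tiny illustration of the point that rating scales collapse into buckets anyway, here is a sketch of reducing a 1 to 10 score into the yes / maybe / no view discussed above. The thresholds are assumptions; you would pick ranges based on your own error analysis.

```typescript
// Illustrative only: collapsing a 1-10 rating into pass / maybe / fail buckets.
// The thresholds here are made up; choose ranges that match your own data.
type Bucket = "pass" | "maybe" | "fail";

function bucketScore(score: number): Bucket {
  if (score >= 8) return "pass";  // clearly good, like NPS promoters
  if (score >= 5) return "maybe"; // the middle; these need a human look
  return "fail";                  // clearly bad
}

// Example: a batch of judge scores reduced to the binary-ish view you end up wanting anyway.
const scores = [9, 4, 7, 10, 3];
console.log(scores.map(bucketScore)); // ["pass", "fail", "maybe", "pass", "fail"]
```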

1:09:30

Th this one is I think a little bit tricky. How do you eval or debug multi-turn conversation traces? So assuming you're having a conversation with an agent, how do you make sure that you know eval are supposed to tell you if something worked or it didn't, if it was good or if it was bad. How what

1:09:50

about if it's multiple interactions? How do you judge that? Yeah, this one's really tricky and there needs to be more tooling around doing this. If you control memory, you can make some nice tools around this for

1:10:02

sure. You can process, you can score, the chat history itself. But dude, there could be a

1:10:09

ton of messages in there, too. Yeah, like a lot. Yeah. I mean, it's why you kind of need to be able to evaluate individual

1:10:16

messages, you know? Was this answer to this question good or not, maybe without all the context? But then you also want to be able to check the entire conversation. Like, at the end of it, did it meet

1:10:29

the goal of whatever the user was trying to do? If you call into a voice agent and you want to book an appointment, did the user actually book the appointment at the end of talking to the voice agent or they get distracted and they didn't? You know, that's a pass or a fail. Yeah, it becomes a lot easier when it's a finite time on the conversation. Like

1:10:48

if this was like your therapist buddy or whatever that has just like memories for years and you can't really eval the conversation. You'd have to figure out a time range that you'd want to eval, right? And so then that's like your finite uh points in time. Then it gets a little easier. Yeah. Yeah. So that's very tricky.
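To make the multi-turn point concrete, here is a minimal sketch: bound the conversation to a finite window, then ask a judge whether the user's goal was met by the end. The message shape and the `callModel` function are assumptions for illustration, not any framework's real API.

```typescript
// Sketch of a conversation-level check over a bounded time window.
type CallModel = (prompt: string) => Promise<string>;

interface Message {
  role: "user" | "assistant";
  content: string;
  timestamp: number; // ms since epoch
}

// Bound the evaluation to a finite window, e.g. one booking call or one session,
// since "the whole history" is rarely something you can judge in one shot.
function windowMessages(history: Message[], startMs: number, endMs: number): Message[] {
  return history.filter((m) => m.timestamp >= startMs && m.timestamp <= endMs);
}

async function didConversationMeetGoal(
  callModel: CallModel,
  messages: Message[],
  goal: string, // e.g. "the caller booked an appointment"
): Promise<boolean> {
  const transcript = messages.map((m) => `${m.role}: ${m.content}`).join("\n");
  const verdict = await callModel(
    `Goal: ${goal}\nTranscript:\n${transcript}\n` +
      `By the end of the transcript, was the goal achieved? Reply PASS or FAIL.`,
  );
  return verdict.trim().toUpperCase().startsWith("PASS");
}
```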

1:11:10

Should I build automated evaluators for every failure mode? I'm reading it along with everyone else. So to see if I to see if I agree.

1:11:34

Yeah. So I think it's basically: for every single type of failure, should you build some kind of evaluation for it, and should you use an LLM as a judge for every one of those? If you think about it, if you have a bunch of different try/catches, should you be evaling every try/catch? I don't know. Um,

1:11:55

I think I think it's it's just like test coverage you know that you need there you don't need 100% test coverage you need the right amount for you Yeah, you need like all happy paths and then absolute failures pretty much covered. Yeah, but I'm sure he hates getting questions like this because we we do as well because the answer is it depends,

1:12:15

right? Like that's the worst. Yeah. Yeah. There's not there's Yeah, there's not easy answers to some of this

1:12:21

stuff. How many people should annotate my LLM outputs? So if you want human annotators to come in and annotate your evals or your outputs, it's going to be really funny to have annotation teams. A lot of people outsource this stuff too, like annotate

1:12:43

annotation.com or some some outsourced version of this, too. Yeah. Just like there's like entire

1:12:49

companies that just do data labeling. There are going to be entire companies that just do data annotation, or eval annotation, or LLM output annotation. Yeah, there's a couple in YC, I believe, but I think it's more like, at runtime, you can have a tool call that actually talks to an expert, you know, to get an answer.

1:13:14

Yeah, I'm sure isn't scale AI just a bunch of people annotating stuff basically. Yeah, I mean they're annotating they're labeling data or for you know that to then consumed for for training, but ultimately it's just you're you're always annotating. I mean technically yes, you're just annotating data that will probably then be reused for more training down the line anyways. So yes.

1:13:37

Yeah. Most companies aren't at the sophistication of having human annotators. Um, but I'm I'm sure like it's going to become a thing as big companies get get like on board. Yeah. Well, especially if you're trying to automate

1:13:55

uh, you know, some kind of professional, whether it's accounting, law, you know, doctors, you know, things like that where you want to provide certain types of advice that are very specialized, specialized knowledge, then having someone judge. So, you know, for us, like I I couldn't judge medical, right? Like I I would not have the background, but I could probably judge, you know, JavaScript or PHP or

1:14:20

something, you know, if that's my specialization. Or, does this chunk of code do what needs to be done? I could judge that. So I think anything that has specialized knowledge... maybe you and I will eventually just be annotating code agents. Did this work? Yeah, we'll be has-beens and that'll be the only

1:14:40

job we can get: data annotation careers. Well, if you think about it like this: for all coding agents, all the users of them are the experts. So it kind of makes sense why all

1:14:59

these Lovables, etc., proliferated without necessarily writing evals at all, right? Because the users themselves are the experts, and they're just hitting like and dislike on those chats. So it's actually a great way to get feedback from live production.

1:15:19

Yeah, exactly. Uh so, back to multi-turn and Ansuman says you could also you can also check if tool calling is aligned or not. This is actually important because if you have an an LLM that can call tools, that throws external data into the mix, which makes it even harder to eval.

1:15:37

Yeah. Because you have to now eval. Did they call the right tool? Then based on the data they return, did

1:15:44

they get the right response? So you can actually eval now at multiple points. Yeah. But still, that's why, you know,

1:15:50

checking the call and the response as a whole is often easier, at least more straightforward, to get started. It's like, did it roughly do what we needed it to do? Yes or no? Now look at the actual steps, the decisions

1:16:03

it made along the way. Yeah, there are some off-the-shelf ones for this. Like tool correctness: you tell it what tool calls you expected, in what order, and what the shape of the arguments should be, because there's no way it can really know otherwise. And then there's JSON correctness, which is about the

1:16:21

output: if the output of a tool is JSON, is the shape correct, does it pass the Zod schema or whatever. So those are some interesting ones too.
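Here is a minimal sketch of those two checks, tool correctness and JSON correctness, using Zod for the schema part. The trace shape is a simplified assumption for illustration, not any particular framework's format.

```typescript
import { z } from "zod";

// Sketch of two off-the-shelf-style checks. The ToolCall shape is assumed.
interface ToolCall {
  name: string;
  args: unknown;
}

// Tool correctness: did the agent call the expected tools, in the expected order?
function toolOrderCorrect(actual: ToolCall[], expectedNames: string[]): boolean {
  return (
    actual.length === expectedNames.length &&
    actual.every((call, i) => call.name === expectedNames[i])
  );
}

// JSON correctness: does a tool's JSON output match the schema we expect?
const weatherOutput = z.object({
  city: z.string(),
  tempC: z.number(),
});

function jsonCorrect(raw: string): boolean {
  try {
    return weatherOutput.safeParse(JSON.parse(raw)).success;
  } catch {
    return false; // not even valid JSON
  }
}

// Example usage
const calls: ToolCall[] = [{ name: "getWeather", args: { city: "Paris" } }];
console.log(toolOrderCorrect(calls, ["getWeather"])); // true
console.log(jsonCorrect(`{"city":"Paris","tempC":21}`)); // true
```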

1:16:38

What gaps in eval tooling should I be prepared to fill myself? Okay, this one's kind of long, so we'll go through it. First one's error analysis and pattern discovery. Yep.

1:16:52

But people want other people to do it for them. It's human nature. Yeah. Yeah. I mean, you definitely need

1:17:00

Yeah. This is kind of like the vibe check. Does it does it pass the right vibe checks? You need someone that can be the the arbiter of vibes of like did

1:17:07

it is this the right tone? Is this the right response? You need someone that can own that. Uh, so AI powered

1:17:15

assistance throughout the workflow. You want you want the LLM to be able to basically help through this. Okay. Yeah, sure. A lot of companies are building that too. So

1:17:37

custom evaluators. Yeah, you should be prepared to build your own custom evaluators. Agree. definitely agree on this

1:17:47

and I I kind of mostly agree with this too that I I think that generic metrics are useful in that they help you understand what eval are but I don't I question their usefulness for most applications like ultimately I think if you need to if you haven't done use evals you should probably just turn a few on so you can see get the idea of like how it works

1:18:11

and then you should spend time writing your own custom evals. Yeah, Big Eval does not like that statement. Yeah, we're not here to please Big Eval. Uh

1:18:29

APIs that support custom annotation apps. So again, this is their, you know, their opinion that you should build your own custom annotation app. I don't think you need a custom annotation app because I think it leaves too much in the eye of the beholder. You're spending time building part of an application that

1:18:50

is could, you know, maybe is a spreadsheet or whatever. Like I don't know, it's not I I understand that they're not the most sophisticated tools to build, but it's then you have to worry about the maintenance. I do think that I would say if you are using a non-custom annotation app, you should then make sure that you have access to the data. So yes, you could, you know,

1:19:14

ideally you have an API. So if you needed to at some point build your own, you could. Yeah. But you need to make sure you can get

1:19:20

your data out, or you can spelunk that data, because ultimately the tool may not provide the right observability for what you need to dig into. So if you have access to the database, or you have a way to export it or something, yeah, just make sure you have that. That should be a requirement for

1:19:39

using these other platforms. And I guess they mention that here too: true bulk export, or ideally just access to the database. All right, what is the best approach for generating synthetic data? How far are we? Dang, there's a lot. Yeah, this thing's long. We're about halfway. Yeah. Yeah, we are going deep. We're going deep.

1:20:04

Maybe we can skip this question because like no one is even here yet in life. Synthetic data. Yeah. All right. Synthetic data. This is a good one.

1:20:19

But it's kind of, you know, just a basic answer. So I was talking about... Yeah. Error analysis is what you need. He's promoting the course here.

1:20:30

We don't need to go over RAG. So, you should watch Nick's workshop. He can teach you RAG.

1:20:37

Yeah. How do I choose the right chunk size? Yep. We have a we have a rag workshop. We can learn.

1:20:43

But this one's very uh I mean there's some good guidelines here you know I I would look at but I do think yeah if you need to learn the basics of rag to understand chunk size and how it actually impacts because that you know again you have to eval your rag systems as well. every time you you know introducing you know quote unquote context engineering where you

1:21:07

are manipulating the context, you need to eval against what you expected. So here we're talking about evaluating the RAG system, and, I guess I just said this, but you need to eval retrieval and you need to eval generation. That's kind of how you should think about it. You need to eval both sides.
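Here is a minimal sketch of evaluating the two sides separately: a recall-style check on retrieval and a crude groundedness check on generation. The shapes and the keyword heuristic are assumptions for illustration; a real setup would likely use an LLM judge for the generation side.

```typescript
// Sketch of evaluating retrieval and generation separately in a RAG system.
interface RetrievalCase {
  query: string;
  relevantDocIds: string[];  // ground-truth docs that should be retrieved
  retrievedDocIds: string[]; // what your retriever actually returned (top-k)
}

// Retrieval side: recall@k, i.e. how many known-relevant docs showed up?
function recallAtK(c: RetrievalCase): number {
  const retrieved = new Set(c.retrievedDocIds);
  const hits = c.relevantDocIds.filter((id) => retrieved.has(id)).length;
  return c.relevantDocIds.length === 0 ? 1 : hits / c.relevantDocIds.length;
}

// Generation side: a crude pass/fail check that the answer overlaps the
// retrieved context. A real setup would likely use an LLM judge here instead.
function answerUsesContext(answer: string, contextChunks: string[]): boolean {
  const lower = answer.toLowerCase();
  return contextChunks.some((chunk) =>
    chunk
      .toLowerCase()
      .split(/\W+/)
      .filter((w) => w.length > 5)
      .some((keyword) => lower.includes(keyword)),
  );
}

// Example
const retrieval: RetrievalCase = {
  query: "What is the default chunk size?",
  relevantDocIds: ["docs/rag/chunking"],
  retrievedDocIds: ["docs/rag/chunking", "docs/rag/embedding"],
};
console.log(recallAtK(retrieval)); // 1
console.log(answerUsesContext("The default chunking size is 512 tokens.", ["Chunking defaults to 512 tokens."])); // true
```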

1:21:32

Uh, bunch of good information here. There's a framework from Jason Liu. Shout out to him.

1:21:41

All right. What makes a good custom interface for reviewing LLM outputs? Who cares?

1:21:46

Yep. Don't do it. You can uh if you want to build your own, there you go. Have at it.

1:21:53

GLHF. If you want to build your own. Yeah. I mean, yeah. do do your thing, you know, like

1:21:58

do your thing. All right. How much am I This is kind of interesting because I think this will be contentious like Yeah. How much should you allocate to eval?

1:22:13

Is this like in terms of talent development budget or is it like tools? So I think it's more in ter terms of like how much time you should expect to spend. I think this is actually that this is the highlight. This is probably why it's bold. So in their case and

1:22:27

again, you know, you decide if it's relevant to your case, but 60 to 80% of the development time was just error analysis and evaluation. So that means, if you think about it, 20 to 40% of the time was actually building the thing. Yep. The rest of the time was making sure: does this thing actually work? And that's why most people

1:22:52

uh that do a little bit with AI, they like this is the first you're you probably went through it, right? Like I'm I'm speaking I think for 80% of the people listening to this. The first time you use it, you're like, "Holy cow, this is amazing. I can't believe all the things that this can do." And then you go a little deeper and you're like,

1:23:10

"This sucks." It doesn't It doesn't answer correctly all the time. It was like who can like 50% of the time it's right 50% of the time it leads you in some loop. Yeah. And it's it's terrible. And that is

1:23:23

probably why here it says 60 to 80% of the development time is just on making sure the quality is good enough. That's what eval right is like making sure the quality of your application is good enough because everyone's gone through that moment of like wow this can do so many amazing things. Wow, this is terrible to now

1:23:44

like okay if you want to make it good it's still just like if you want to build a you if you're writing code you want to build a good application you got to spend a lot of time on the details you can get like 80% done really quick and then like getting the polish takes forever I think that's what this is the new polish

1:24:02

when we built our docs agent the docs agent itself is like leveraging all the stuff we've already built so building it was super easy but then it just did not listen to people the way we wanted to to. And we did spend a ton of time. We wrote evals, too. But even then, we vibe checked that mostly, and it still took most of the time.

1:24:23

Yeah. I mean, you should just expect to spend it. Even though you can build things really fast with these AI tools, you can't just turn the AI tools loose and hope they're going to evaluate and improve it. Not yet. Maybe someday. So expect to spend a lot

1:24:40

of time and this is where I think most teams this is why this is I'm gonna like keep harping on this point. I don't think people understand that it actually that's where you're going to be spending a lot of your time. It's not just building the application. It's making

1:24:51

sure that the application works reliably. Yeah. Yeah. AGI is not going to save you

1:24:56

right now. And I think this is a good point too that I would agree with kind of makes sense. If you're passing 100% of the time, you probably didn't write your eval, you know, well enough. You it's probably

1:25:10

too too easy, you know, way too easy. So maybe you should be having like a lower pass rate. But you do want to make sure the reason it's nice to have that is you can make sure that as you improve, you know, quoteunquote improve things, you don't decrease your score going forward. Yeah. Uh we got a comment here from Pedro.

1:25:28

Building a product for the healthcare market here in Brazil. Mastra was a game changer. That's cool, Pedro. That's good

1:25:33

to hear. Nice. Original plans were to use lang chain, but being able to have everything on the same mono repo is awesome.

1:25:41

Cool. Thanks, Pedro. Glad you're enjoying it. Yeah. And thanks for tuning in. Big things popping.

1:25:47

All right. Why is error analysis so important in LLM evals, and how is it performed? So, we'll just go over the high level here. I encourage you to read up, you know, if you're interested in how to

1:25:59

do this, you need some kind of data set. You need some kind of human annotator and you know a big word axial coding you know basically figure out failure categories right like categorize categorize the failures and then refine it over time. So it's kind of just an iterative loop, right? Create a data set, do some annotation, categorize it,

1:26:25

refine it, repeat. This is nothing new either. Like if you've done DevOps, this is DevOps. We

1:26:31

do this all the time. Same So yeah. Yeah. Thanks for inventing new words,

1:26:37

everybody. Thank you, man. You got it's the collision of builders and machine learning experts. Like we got to have this new taxonomy of terms.

1:26:49

Like we have to learn some ML terms and they have to learn some general like Yeah, I'm gonna go do some axial coding right now. The You mean you're just going to like categorize things labels? Okay. Why don't you just say that, bro?

1:27:07

You make make others feel uh inferior with terminology. All right. What's the difference between guardrails and evaluators? Okay, this is a good one. Guardrails are inline safety check. Yes.

1:27:20

So, guardrails should be able to say like should be able to stop something from happening, right? It should be able like short circuit the response and say this thing is off the gone off the rails, you know, quote unquote guardrails, right? It's gone off the rails. We need to stop it. You could think about this as like uh, you know,

1:27:38

copyright checks in image generation, right? Like if you use OpenAI and you ask for some copyrighted stuff, sometimes it'll start generating the image and then it'll stop, if you use ChatGPT. Or it could be, you know, hate speech or something like that, right? You want to make sure that they can't get the agent

1:27:56

to say something you don't want it to say so typically guardrails need to be faster but they're less extensive than an evaluator and I think that's what I haven't read this but yeah fast and deterministic simple and explainable clear-cut high impact failures. Yeah. Like leaks, profanity. Exactly. So it's

1:28:17

like these are the quick checks and then you have the more detailed and more expensive maybe evals that run after off the path off the main path. Yeah. But the mechanism can be the same.

1:28:29

You could run an eval inside a guardrail to then determine if you want to say yes or no. Should I do that? But then the problem is latency is very high. So you really have to use really fast models

1:28:41

and then you have to have a really scoped prompt that can return very quickly because you don't want the user to wait for like your latency. Yeah. I mean yeah the mechanism you're right is is like the same thing. It's

1:28:52

like take this data, this maybe input output response or whatever and then judge it. And often guardrails are going to be pass or fail, right? It's like either one or zero. Similar what we talked about earlier.

1:29:06

And it's like, does this have profanity? Okay, stop it. Otherwise, let it go.

1:29:11

That's a very simple yes or no, right? Yes. And also, on evals as judges: I could

1:29:18

write a function that looks at the text and looks for all the bad words, right? So it is still a scoring eval, yes or no. It's all the same if you think about it. It really is. Yeah. You

1:29:31

don't have to use an LLM. Yeah. You can just write a write a function, right? And there's there's some like different libraries to do some of these checks

1:29:37

and ideally, if you can make it not have to use an LLM, it's going to be faster, right? But there are some things you might need an LLM to check for. So, yeah.
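To illustrate the "just write a function" point, here is a minimal sketch of a deterministic, inline guardrail with no LLM involved. The blocked-word list and result shape are placeholders; a real setup might use a moderation library or a small, tightly scoped model call for things a word list can't catch.

```typescript
// A minimal function-based guardrail: deterministic, fast, no LLM.
const BLOCKED = ["badword1", "badword2"]; // stand-in list

interface GuardrailResult {
  allowed: boolean;
  reason?: string;
}

function profanityGuardrail(text: string): GuardrailResult {
  const lower = text.toLowerCase();
  const hit = BLOCKED.find((w) => lower.includes(w));
  return hit
    ? { allowed: false, reason: `blocked term: ${hit}` } // short-circuit the response
    : { allowed: true };
}

// Usage: run inline before returning the agent's output; heavier, slower
// evals can run afterwards, off the main request path.
const result = profanityGuardrail("totally fine output");
if (!result.allowed) {
  // stop the response here instead of sending it to the user
  console.error(result.reason);
}
```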

1:29:54

All right. Next up, what's a minimal viable evaluation setup? Spend 30 minutes.

1:30:06

I, you know, I'm not a fan of using don't use notebooks. Yeah. Unless you I mean, if this is like this is how I know you come from the Python world. If you're like use notebooks, I don't know. Yeah, like notebooks are so good.

1:30:20

Yeah, it's whatever. Like use Google Sheets. Yeah, use your use your tools. You know,

1:30:26

I guess if you know notebooks, Yeah. use notebooks. But if you don't if you're coming from the non-Python world and you don't know what notebooks are, don't spend time learning how notebooks work. There's way other better ways to do that.

1:30:37

Yeah, for sure. Um, this one's relevant. Yeah, relevant to a a comment we just got too, but how do I evaluate agentic workflows? So, if you're actually building

1:30:50

building a long a multi-step workflow. Yep. Uh we're going to add we're going to be adding our evaluations to workflows. So, this will be built in too

1:31:01

um which will help people. But yeah, like like he said and it's very much follows the primitive path of the workflow. workflows have an execution like a run and then they have steps within that execution and then you can do control loops and all that stuff. So there's so much to evaluate if you're doing an agentic version of this because

1:31:21

sometimes, based on an LLM call, a branch could turn on or off, right? So you need both the entire input-to-output evaluation and step-level evaluation. And it doesn't have to be LLM as a judge, right? You can just have functions and whatever you need to make sure everything's good. And then you can also

1:31:40

have guardrails on steps too. So like that type of stuff can happen and you can stop execution of the workflow based on the input and output things like that. Yeah, exactly. Yeah, you you want to

1:31:52

test you want to be able to test the entire thing. Did it work or not? And then test individual steps in the thing.
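Here is a sketch of that two-level idea, scoring the whole run and individual steps. The record shapes are assumptions for illustration, not Mastra's workflow API; the point is just that a scorer is a plain function, whether it wraps an LLM or not.

```typescript
// Scoring an agentic workflow at two levels: the whole run and individual steps.
interface StepRecord {
  id: string;
  input: unknown;
  output: unknown;
}

interface WorkflowRun {
  input: unknown;
  output: unknown;
  steps: StepRecord[];
}

// A scorer is just a function; it can be plain code or wrap an LLM judge.
type StepScorer = (step: StepRecord) => number; // 0..1
type RunScorer = (run: WorkflowRun) => number;  // 0..1

function scoreRun(
  run: WorkflowRun,
  runScorers: RunScorer[],
  stepScorers: Record<string, StepScorer>, // keyed by step id
) {
  return {
    // End-to-end: did the whole input-to-output run do what we wanted?
    run: runScorers.map((s) => s(run)),
    // Step-level: score only the steps we have a scorer for.
    steps: run.steps
      .filter((step) => stepScorers[step.id])
      .map((step) => ({ id: step.id, score: stepScorers[step.id](step) })),
  };
}

// Example: one code-based run scorer and one step scorer.
const example: WorkflowRun = {
  input: { ticket: "refund request" },
  output: { status: "resolved" },
  steps: [{ id: "classify", input: "refund request", output: "billing" }],
};
console.log(
  scoreRun(example, [(r) => ((r.output as any).status === "resolved" ? 1 : 0)], {
    classify: (s) => (s.output === "billing" ? 1 : 0),
  }),
);
```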

1:31:58

So, and a couple comments. Love you guys. Thanks, Nordine. Appreciate you. Thank you. Love you, too.

1:32:12

I appreciate you. Want to give a shout-out: it says "has all the a", that should say agentic. Yeah, and somehow he corrected that. "Has all these things weaved into one: zero-config tool testing, agent testing, workflows,

1:32:26

observability. Awesome. Glad you like it. Let us know. Yeah, let us know how we can keep

1:32:31

improving. All right, so just some more uh kind of context. I'm not going to spend time digging into this. Definitely read this if you would like to learn more.

1:32:47

Oh, shots fired on this. Eval big eval vendor. Yeah, here's this is a big eval question. Yeah. Who who paid Who paid him the most?

1:33:00

Yeah. How much did they pay to have him say that? How much did you all pay to get on this list? Hamel was at the Arize conference. So

1:33:07

maybe Yeah. Maybe going on. Yeah. It's it's like, you know, it's got

1:33:13

to be a love-hate relationship, right? Yeah. He must have gotten a bunch of these types of questions at that conference, I'm sure. So, who can offer?

1:33:25

He's a consultant, right? That's his job. Yeah. Yeah. He gets paid to come in and help you build evals.

1:33:32

Yeah, that's cool. It's a good hustle. Yeah. I mean, everyone's trying to

1:33:37

figure out eval, so the market is hot. Good place to make money. Yeah. Uh, so yeah, mentioned some on

1:33:45

like eval vendors. There's a whole bunch more than this. A whole bunch.

1:33:50

Yeah. Like every day there's a new one. Yeah. And they're all doing the same thing. Yeah, there are I would say I bet you if

1:33:57

you spent 10 minutes you could find a hundred maybe 50 50 different people doing this same types of things all all advertising the same features. Yeah. Um my my general advice is to find something that's not going to completely lock you in. That's just general advice. Yep. So,

1:34:21

oh this is a really good question because we talk about this. Yeah, we talk about this a lot. We hear it a lot.

1:34:26

How are evaluations used differently in CI/CD, which typically means you're going to be running them on a data set, versus production monitoring? So, test data sets for CI are small, in many cases around 100 examples, and purpose-built. You want to test for specific things and make sure that, as you change your agent or your workflows over time, those evals don't seriously degrade,

1:34:51

right? You want to see them improve. So: core features, regression tests. And CI tests are run frequently, so the cost of each test has to be carefully considered. A lot of times this means you have something that runs in CI/CD, but only when your prompt changes or only

1:35:11

when your agent changes. So, you don't run it on every PR, but maybe you run it on PRs that affect this part of the code, right? Yeah. You could spend so much money so quickly. Yeah. Yeah. Don't do it on

1:35:24

honestly, either don't do it at all. That's my opinion. Don't do it at all. Do error analysis after the fact.

1:35:31

But then maybe you just want the most drastic error cases to be tested. Yep. And then evaluating production traffic: you can sample live traces. So you could run it on every

1:35:49

single you know trace that comes in or just a sample you know so you don't want to run it on everything. And uh you might rely more on expensive LLM as judge evaluators. Yes. Typically with a more expensive model.

1:36:08

Um, one of the things that I recommend, and I think he alludes to it here, is that oftentimes you'll look at failures that happen in production, or things that are answered really well, and those become test cases in your data set for your CI. So you look at the production data and the production evals, and then you decide which ones you want

1:36:30

to maybe pull into a data set. Either this one answered really well, so you want to make sure that as you improve your agent you don't go backwards on that correct answer, or it didn't answer well and now you want to try to solve it so it will answer that question better. Yeah.
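Here is a sketch of that loop: sample a fraction of live traces for online evaluation, and promote notable ones into a small CI dataset that runs when the prompt or agent changes. All names and shapes here are illustrative assumptions.

```typescript
// Sample production traffic, then promote interesting traces into a CI dataset.
interface Trace {
  id: string;
  input: string;
  output: string;
}

interface DatasetCase {
  input: string;
  expected: "good" | "bad"; // what the human reviewer decided
  note?: string;
}

// Online side: only evaluate a sample of traffic to keep judge costs down.
function shouldSample(sampleRate = 0.05): boolean {
  return Math.random() < sampleRate;
}

// Offline side: failures (or especially good answers) become regression cases
// that CI runs when the prompt or agent code changes.
function promoteToDataset(trace: Trace, verdict: "good" | "bad", dataset: DatasetCase[]) {
  dataset.push({ input: trace.input, expected: verdict, note: `from trace ${trace.id}` });
}

// Example usage
const ciDataset: DatasetCase[] = [];
const liveTrace: Trace = { id: "t_123", input: "cancel my order", output: "Sure, cancelled." };
if (shouldSample()) {
  // ...run an LLM-as-judge here, then a human review decides whether to keep it
  promoteToDataset(liveTrace, "good", ciDataset);
}
```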

1:36:49

It's just DevOps now. Same kind of principle. Similarity metrics. Yeah. I mean, yeah. Okay. I think for the most part,

1:37:03

I don't I don't think they're that useful, but yeah, here's that other one that you already answered, so that's good. Should I use ready to use? So, offtheshelf eval metrics.

1:37:17

Hey, no, I mean, this is kind of what I said, though. Yeah, maybe you should hire me to write your evals. I'll be a consultant; we have a new business, Shane Consulting. Yeah. Hey, the funny

1:37:29

thing is I didn't even read this. So, it's like it's funny that if you like if you talk about this stuff enough with people, you must come to these same kind of conclusions. Yeah. But this is just Yes. Gen generic evaluations I think are pretty, you

1:37:41

know, they they're they feel good metrics. They feel good, but I do think they are like they can be useful in the beginning. So, you have something you need something. You need to understand

1:37:52

how evals work. Just turn them on, see how they work. If you do decide to use them, just be selective about which ones you use. Don't turn on 500 metrics, because you're just going to be overloaded with data. So I would just

1:38:05

pick a few that you do think provide you measurable impact, you know? So maybe there's a few that you could use, but I I think for the most part I I agree with this. Yeah. And this is kind of the point. Good

1:38:22

scores on some of these metrics don't mean your system works. That's why a custom eval specifically targeting what you're trying to accomplish is usually better. Yeah.

1:38:36

How can I efficiently sample? Yeah, just sample stuff, dude. Yeah, you know, we did it. Ratios. We did it. We went through this whole thing. Dang, that took way longer than I expected. I

1:38:47

thought that was going to take like 10 minutes. Yeah, but it was full of really good information. Yeah. So So hopefully you all uh we

1:38:54

definitely went deep on on this. Of course, you can read this article. You know, we we didn't write it, but I would agree with, you know, I would say 80% of it I agree with, which is which is probably the right amount. Yeah.

1:39:09

And the good thing is most of the stuff in here will be supported in Mastra. So yeah, if not all of it, the things we agree with will definitely be in Mastra for sure. Yeah, maybe that's a good segue. We should talk at least a little bit about what are some of the

1:39:26

things we are doing in Mastra now. Because if you've used Mastra for a while, you know we do have an eval framework built in. But we've spent a lot of time, just as Hamel has, talking to a lot of people that are thinking about evals, working on evals, struggling with evals. And we've learned that it was pretty

1:39:45

apparent we needed to rework and improve it. We kind of built the initial version pretty quickly, and now we're trying to build the right version. Yeah. So when we first built evals, we built it in our second week of YC, I believe, when all the homies, the

1:40:03

founding engineers came to SF and we did a little like offsite kind of thing. And usually during offsites you like focus on certain problems and you come out with you know at that time we didn't care because we weren't we're just trying to produce anything right. So that eval from back then January which

1:40:21

is not that long ago. Um that's kind of what we stuck with and we had a lot of inspiration at that time and it was things were a little bit more naive than they are today. So we are refactoring all of our evals and we're actually not going to call them eval. Evals is like a category right like this is just a a

1:40:39

category of things that you do evaluations whatever. I don't know if we landed on a perfect name, but you know what you're really trying to do is score the score different parts of your application. So, we're trying to call them scorers right now. That's a maybe name pending. But what you'll be able to

1:40:59

do is you'll be able to create code scores which are just functions that you can then put in the eval pipeline or you could put in a guardrail pipeline. U there will be LLM as a judge, but I don't think we're going to call it LLM as a judge. Maybe it's just an agent um a specific type of agent. Um that has access to certain tools that it may need

1:41:24

to do certain things. But what we're kind of like coming down on is like there's a pattern here for running these pipelines. And I've and we're going to kind of explore that. But like there's three things you need to do to do an

1:41:36

evaluation. You need to extract data from your data points and the data points here for an agent call or LLM call um are the inputs, outputs, options like temperature etc. So you need to have all the data of what happened. So

1:41:53

yeah then you extract data. So for example if I was doing something like answer relevancy which is a offtheshelf eval. Extraction point of answer relevancy is extracting the facts or statements from the output because you know LLM responses may have a bunch of paragraphs and right? There's a bunch there could be a whole narrative in there and

1:42:17

you want to know and some some sentences are more structural than actually pro providing information. It's just how English works. So you extract the facts out of there. So that's extract. Then

1:42:30

you score it, right? Based on the input, based on these statements that I've collected and extracted, I then need a scoring process. That could be a function that is just doing some NLP with any library, or that could be another LLM call to score it based on what was extracted. And then lastly, and this is optional, just like Hamel was saying, he doesn't care about

1:42:53

reasons. If you do care about reasoning, then you can provide a reason step where it's like, hey, I've extracted this data. I've scored it. Why did it score this way? And at the end, now you have

1:43:04

three prongs of information that you can store. And then, if you did yes/no or a score, you can track it over time.
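Here is a sketch of that extract, score, reason shape expressed as plain TypeScript. This is not Mastra's actual scorer API, which was still being designed and named at the time of this episode; it's just the three-step pipeline described above, with a purely code-based example scorer.

```typescript
// The extract -> score -> reason pipeline as a plain TypeScript sketch.
interface ScoreInput {
  input: string;
  output: string;
  options?: Record<string, unknown>; // temperature, model, etc.
}

interface Scorer<E> {
  extract: (data: ScoreInput) => Promise<E> | E;
  score: (data: ScoreInput, extracted: E) => Promise<number> | number;
  reason?: (data: ScoreInput, extracted: E, score: number) => Promise<string> | string;
}

async function runScorer<E>(scorer: Scorer<E>, data: ScoreInput) {
  const extracted = await scorer.extract(data);      // e.g. pull statements out of the output
  const score = await scorer.score(data, extracted); // code, NLP, or another LLM call
  const reason = scorer.reason
    ? await scorer.reason(data, extracted, score)    // optional third step
    : undefined;
  return { extracted, score, reason };
}

// Example: a purely code-based relevancy-ish scorer (no LLM involved).
const keywordScorer: Scorer<string[]> = {
  extract: ({ output }) => output.split(/[.!?]/).map((s) => s.trim()).filter(Boolean),
  score: ({ input }, sentences) => {
    const hits = sentences.filter((s) => s.toLowerCase().includes(input.toLowerCase()));
    return sentences.length ? hits.length / sentences.length : 0;
  },
  reason: (_data, sentences, score) =>
    `${Math.round(score * 100)}% of ${sentences.length} sentences mention the input.`,
};

runScorer(keywordScorer, { input: "refund", output: "You can get a refund. Refunds take 5 days." })
  .then(console.log); // { extracted: [...], score: 1, reason: "100% of 2 sentences mention the input." }
```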

1:43:21

It should be super easy to write these things. I think a lot of people are scared to do custom evals, even though Hamel talks about them like they're the best thing to do. But if you don't know what to do, and it feels like a lot of people don't know what to do, that means there's probably some bad UX and DX out there preventing people from easily understanding how to do this, right? And then there's a bunch of questions of how do I add it to agentic workflows, and how do I add it here? How

1:43:40

do I add it there? It should just be obvious, right? So that's what we're going to do. We want to make eval cool.

1:43:46

So you can just take these different scorers that you create and just put them everywhere whether that's you should be able to do whatever you want right and you should put them anywhere everywhere collect the data and then if you like our UX and our design habits and stuff maybe you'll like our annotation tool or whatever we create we

1:44:03

haven't really thought that that far ahead but it really does come down to like it's like a pipeline guardrails are a pipeline too if you think about it but they just run you know during and before the requests. So that's what we're kind of shaking and baking on. I could show architecture diagram.

1:44:21

Yeah, let's answer a question while you get that pulled up. Pedro asks: how do you handle the observability? Do you just dump everything in PostHog with OTel, save logs to a database? So one of

1:44:34

the nice things about Mastra is you get to specify where you want this data to be stored. You can kind of bring your own storage, so to speak, so you get to pick: do you want it in a Postgres database, do you want to use Supabase. Ultimately, you can choose, and we do emit OTel,

1:44:53

so you can pump it into any observability provider that you want. Yeah, with this v2 of our evaluations, OTel traces may not be the right thing, but it's the right transport layer that all these applications speak; it's a standard. So whether you use the raw OTel trace or not, it has all the data you need. What we're planning on doing is still emitting OTel, so everyone can use OTel, and

1:45:20

that's cool. We might transform OTel into a different format, or there are new libraries coming out that are making new standards for, essentially, LLMs and traces, right? Maybe, in an old trace from back in the day, not all fields are relevant. And we see this all the time, right, Shane? A lot of trace fields are not relevant. We had to

1:45:43

trim out a bunch of stuff. It's just very verbose, you know. So yeah, we're we store it in your database, whatever database you bring. Um yeah, we don't do

1:45:54

PostHog unless you want to. I guess you could. That's on you, though. But

1:45:59

yeah, there are two kinds of data you need for evaluations. You need logs, which you don't necessarily need to store, but you should just have them, and traces, which tell you what has happened during the request: the tool calls, the LLM calls, etc., and within a workflow as well, all the steps. And with those things you can then eval

1:46:23

stuff on demand or you could do it live yeah that's the answer to that let me share my screen is there any other questions or no That's it for now. If you if you are watching this live that you know let us know if you leave a comment on LinkedIn on YouTube on X and we will uh try to answer some questions if you have them. Sorry I'm getting some un unknown number

1:46:51

hitting me up once again. All right, cool. Probably someone that saw you live and wants to get get in 818 number. Dude, they're trying to like

1:46:58

mimic area codes that I know. Um, okay. So, if you're not familiar with Mastra, some of this stuff might be unknown to you, but I'm going to go through it as if you know something. But if you don't, that's chill.

1:47:16

Mastra at its core can be run as a library or a framework. When I say library, you can set up Mastra and then use it within Next.js, Express, whatever, where you own the server. And that's cool. You could also expose Mastra as

1:47:32

its own server, which uses Hono. For example, if you're using Next.js, maybe you have an API chat endpoint, and if you're using Mastra, we have our own generate function and whatever. Any of those things, whether you use Next.js or not, or the Mastra server or whatever, Mastra internally will output OTel traces for all the internals that it

1:47:57

runs. We have this thing called the auto-tracer that we put on our primitive classes, like memory and agents, etc. And then we

1:48:09

essentially instrument the parent and child spans. So we have like for the traces we have like internal methods being traced and then excluded from the user. We have tool calls, memory and all its methods, the agent call itself, AISDK, which you know that was funny today. I

1:48:28

won't go into that. And then like anything that's happening post-processing. So this is what happens. These are

1:48:34

emitted as OTel traces. But I'm starting to realize that maybe OTel is whack. OTel is a good transport format, but maybe it's not a good consumption format, because a trace could have a parent span and a ton of spans in between. And then if you

1:48:52

wanted to then assemble data, you have to essentially assemble it, right? You have to like filter and combine all this these spans together. Also, fetching them in a database is annoying too. But

1:49:05

you know, Mastra stores all these traces in the database that you bring, in Mastra storage. So you have access to the data. What we're thinking about doing, though, is transforming the OTel traces into a different format, or maybe using some open source library that's trying to create a new standard there, where we still emit OTel for people, but then for

1:49:26

internal purposes we have something called a run or XYZ I don't know. Um, but once you assemble runs, you can then add them to data sets. So you have data sets here or you can run scores on them. And we were talking about all these concepts for the last like hour right

1:49:43

now. So I'm not going to go into them again. But what scores output are scores. Scores output scores which may relate to a certain trace, a certain run, a certain data set, a certain

1:49:56

entity, right? Because in Mastra, maybe you want to do memory evals. Maybe you want to do tool call evals, workflow evals,

1:50:03

agent network evals. There are a lot of different entity types. So you need to know what the scores are for that.

1:50:09

And then the types of scores, you know, we're thinking they're either code scores or LLM, but it all comes down to this pipeline, right? Extract, score, and reason. You could probably do a lot of with just those three steps. And

1:50:22

maybe there's some other custom steps. Um but anyway, so now you have like this structure here and then it's really easy to build your own because we have like a framework for it. Lastly, all with all these scores and stuff and now we're tracking all the input options and everything. You can start playing around

1:50:41

in what we're calling, name pending, Mastra CMS or something, which is a whole playground for creating different configurations for your agents, workflows, memory, etc. Most of big eval has prompts as the entry point, like prompting is the way. But what if it's actually that different models score better, different tool calls or MCP servers score better, different prompts

1:51:07

and contexts score better? So why version just one thing? Maybe you can version one thing, or you can version the whole configuration. And because Mastra already has dynamic agents and dynamic context, you could essentially just start A/B testing or versioning agents or workflows or any primitive, because now you're in the eval loop. So, as you can see, requests all

1:51:33

the way here. You can do some crazy stuff. You go into the CMS, you start versioning things, deploying that. Now you're looking at things as time series data and then doing error analysis and everything Hamel just said. You're in the eval loop. Now, if I see some blog posts about the eval loop, I'm going to know that they stole it

1:51:52

from me. But, uh, yeah, so that's this is the architecture of what we're trying to do. and big eval wouldn't show it to you, but we are. That's it. Sick, dude. Yeah, that's a you got a

1:52:06

glimpse into the road map of what we're building. So, that's one of the nice things about tuning in is we we're pretty open about most of the stuff we're working on. So, you can see what's coming. Some of it already exists

1:52:17

today. Some of it exists in a PR form. Some of it ex is going to exist tomorrow. Uh

1:52:24

so, a couple more comments from Pedro. New to AI, do you have any recommendations? So, a couple things that I would recommend: you can read this book here. If you are one that likes to consume books, you can get a digital copy for free. It's not all about Mastra; it's general principles of building

1:52:50

AI agents. So it has like highlevel stuff you should be thinking about. Talks about eval, talks about tracing, talks about, you know, how to write good prompts. I mean it has a kind of sections in there for all this kind of

1:53:00

stuff. So that's a good starting spot. If you are wanting to dive into Mastra and see, how do I learn that? You can go here. We have a course that you can actually use within Cursor or Windsurf

1:53:15

or any other MCP-supported editor, and you can actually take the course directly inside your IDE. So you say, start the Mastra course, and it guides you through it, helps you write the code, and tells you why it wrote the code that it writes. It's pretty cool, and

1:53:31

that's a good way, if you're like, okay, I'm convinced, I want to try Mastra, how do I actually get started? That's a good way to get started. So that's, at a high level, what I'd recommend, and

1:53:42

of course there are so many resources out there; a lot of them are good, some of them are not as good, but I would always recommend you start there. Uh, suggestion: you should crop this Excalidraw section of the video and post it as a short. Yes, we might. We just might, actually. Pedro, it's a good idea, and we've thought

1:54:01

about that. Honestly, some of it is I just uh don't have the time to always do it, but we every week we have like good sections and we So, we'll try to pull out a short of that and maybe we could actually probably just crop that whole one and post as a YouTube video as well. Yeah, maybe not today because then, you know,

1:54:18

we don't want big eval lurking on us today. Yeah, you get this is like the intimate they're not not everyone's watching this. This is like, you know, an intimate group of 185 people right now just so um but one thing about this, like we're working on it, uh our target date, and I don't like talking target dates, but I kind of want to stress the team out a little bit and put us on like a path,

1:54:42

but we were planning at the end of July to have like our first version of everything I just said. Um we'll see if we get there, but if you guys are amped, I am, too. So yeah, yeah, it's a lot of work. But the

1:54:55

someone from the team is watching this and just says, "What?" Yes. Eug, John, you know what I said.

1:55:03

Eugen's over there like, "Uh, okay." He he he's down. I know his his reaction. He's down for sure. But yeah, we we'll figure it out. We're

1:55:15

We're working. We're trying to work fast on this to get it. Ultimately, we want to get some version, you know. Ultimately, we get a earlier version in

1:55:22

your hands so you can start testing it and let us know and then then we'll iterate and improve. All right, dude. That was that was a show. Yeah, dude. It was a long show.

1:55:33

Yeah, I was actually thinking this was going to be a one-hour, get-in-and-out show. We did not make it. We doubled that time. Some would say maybe I need to improve my estimation skills. No, that Hamel thing

1:55:48

was so rich and deep of stuff, so it was totally worth it. Yeah, I mean, we definitely did a deep dive. So, hopefully if you did tune in, you get you felt like we went pretty deep. If you have questions, reach out, let us know. Come into our Discord.

1:56:01

Let's chat. If you have feature requests, bug, you know, bugs, post GitHub issues, that's the best place to go for that. And yeah. Yeah. Anything else you want to say before we

1:56:13

cut this? No. We'll see you around. get the book.

1:56:18

We're the admirals of AI, by the way, if you didn't know. Yeah. All right.

1:56:23

And yeah, follow Obby on X, Obby Aiyer, follow me, smthomas3, and we can chat. And with that, see y'all. Peace.