Back to all episodes

Evals + Observability, Agent deployments, and AI News

December 8, 2025

Today we talk with some guests (Alex from Mastra & Laurie from Arize) about evals and observability. We chat with Lio and Kevin from Defang about deploying agents to AWS/GCP/Azure. And of course we cover all the latest AI news.

Guests in this episode

Alex Booker
Mastra

Laurie Voss
Arize

Lio Lunesu
Defang

Kevin Vo
Defang

Episode Transcript

6:42

Hello. And as you can see, we had some technical difficulties. We're in a new place. We're in Obby's apartment. I'm in

6:48

SF this week, so we're in person, which is rare. And the good thing about this is I get to redo the entire intro. And the first one was a practice run because we had some technical difficulties and now we're up and running.

7:01

Uh, but please leave us a review on Spotify. Five stars only, please. Otherwise, find something else to do.

7:07

Uh, leave us a review on Apple Podcast. We appreciate that. That helps more people find the show. And if you are

7:13

watching this live, as many of you have already, you know, realized, you can leave comments and we hear them like, "Hey, I can't hear the sound." Yeah. Like, Maravic. Yes. So, thank you.

7:25

Thank you for letting us know. We now hopefully have it fixed. And this, what I'm about to say, is for the person listening to this weeks and months from now: if you're just listening to the audio, there'll be a nice little Easter egg for you of us miming stuff on video. So you'll have fun then.

7:44

Yeah, we'll maybe cut that out before we post that to the audio only version. But we appreciate you all being here. We got a jam-packed show today. We're going to talk about some evals. We're going to talk about AI news like we do every

7:57

week. We're going to talk with Laurie from Arize. We've got Lio and Kevin from Defang, so you're gonna learn about them and what they're doing. And yeah, please leave comments and questions along the way. And this is going to be a fun

8:09

show. Yeah. But before we jump into it, we had a nice little surprise. So, I guess if

8:15

you're listening, let me know if you're a Spotify user, but every year you get this end-of-year Spotify Wrapped and it tells you all your favorite songs, your favorite artists. It's kind of cool. A lot of people look forward to it. Uh, some of my friends share, you know, the

8:28

surprises or, you know, the expected results sometimes. What was your listening age? 39. Oh I'm old, dude. Mine's 23. You listen to the new stuff. I listen to

8:39

all the old stuff. So, uh, yeah. But anyways, one of the things that I didn't know is you actually, if you're a creator, you get a Spotify wrapped for creators. And so

8:54

you want to take a look at our Spotify Wrapped? Let's do it. So you're going to get a little treat. You're going to figure it out. And while

8:59

you're watching, if you like what you see, please go leave us that review if you haven't already. Uh, all right. So let's hit it, and you can see our reaction to this along the way.

9:19

So 2025 was the year of agents. The year of agents this year, one episode rose above the rest. I think I overshot it. So, if you're looking for

9:31

an episode, go, you know, figure out if Lovable's dying. Super agent. Dude, that was a that was the condom episode. Yeah. If you don't know what that means, you might want to go rewatch that episode. Still getting quoted,

9:43

apparently. Who knew? How do they measure that? I wonder. That's cool.

9:49

Yeah. And then it says that your show was in the top 20% of videos on Spotify. And so listen, if we're in the top 20%, you saw you saw the quality of the production that we are putting out.

10:02

I'm questioning others ability. Must be a lot of trash on Spotify, dude. Honestly, but hey, I'm happy to be there, you know, happy to be in the top 20%. It wouldn't be possible without uh without you listening after the fact.

10:16

And obviously, we appreciate you listening live, too. We had 23 different countries. That's cool. Our top countries, you know, Australia, Poland, Canada,

10:28

United States, and Belgium. Dang. You know, Ward on our team's in Belgium. I think he's been telling all his friends.

10:33

That's crazy. It's all Ward's listens, right? Yeah. Ward just leaves it on when he goes to sleep. Puts him to sleep at

10:39

night. Um, our fans, you want to know the our fans like Future Bad Bunny? Bailey. Okay. Our fans like some audio books. Death by Astonishment.

10:53

I've not listened to that. So, all right. Bunch of sociopaths listen to the show. Yeah. And then your your favorite uh

10:59

podcast that you listen to if you listen to our show. So, no comedy podcast on there. There you go.

11:06

Might want to frame this. We had a marathon show. Fans listened to you longer than 79% of other shows. Okay.

11:14

There must be terrible shows out there, dude. An instant hit show: your debut season was more popular than 85% of other shows. How many goobers have podcasts? And then it was a most shared show. So

11:26

you received more shares than 76% of other shows. So I don't know, dude. I think they're bullshitting. I think they make these numbers up. But hey, cool. You know, we are here and yeah, we're

11:38

gonna we're gonna talk some evals here in a bit. But before we do that, one other thing. So, last week you monologued about Opus 4.5. If you were watching this, I'm curious.

11:50

Have you played around with Opus 4.5? Because I finally had the chance to dive in. And I got to say, like you were, you

11:56

did not overhype it. It is living up to, you know, set your expectations appropriately, right? It's not going to do everything for you, but I was I felt like it was a state change. It is significantly better in my opinion which

12:08

is kind of wild, because I felt like when they went from 3 to 3.5, that was like a state change, and now 4 to 4.5. So the .5 releases are apparently more important than the full releases. I don't know

12:21

but try out Opus if you haven't. It is it's pretty good. Yeah. I think some providers give it the

12:27

same price as Sonnet. So if you haven't tried it, you might as well for the same price. Um, but yeah, I'm Opus-pilled and I think all of us are now. Yeah, I'm surprised I'm not doing it right now, but I basically have a git

12:39

worktree running, doing something with Opus, at any point in time, like between customer calls or meetings. Obviously when I'm writing code I'm doing multiple, but I usually have just one running all the time on something. So I would recommend you give it a shot and see if it can do at least, you know, a large percentage of the things that you typically would have to do yourself. Yeah, parallel agents.

13:06

And one other thing that I just thought, you know, some of you might be able to relate to and m maybe not, but we're in SF this week. So, we got invited to an event, like a launch party. So, we were previously in YC last year in the winter batch. We got invited to this launch party event. And, you know, we know the hosts and so we thought that'd be cool.

13:25

And I sent it over to Obby and he pointed out it's a black tie event. I don't own a black tie. And so I immediately went to ChatGPT and I said, you know, what's the dress expectation here? And it said black bow

13:39

tie, which is even worse in my opinion. Oh, it's bow ties. That's what ChatGPT told me. If you

13:44

show up in a normal tie to an event like this, you're going to be a goober. Oh my god. I only have a black tie. I have a black normal tie, which I brought, but I don't know. So you tell

13:56

me if you're in the chat: if you're going to a black tie event, does it have to be a bow tie? ChatGPT was

14:02

pretty convinced it it needed to be, but you know, can't trust it. Well, on the invitation, it says gowns and suits. That is interesting. Yeah. So, anyways, that's what we'll be

14:13

doing this week. Let's jump into it. We're going to pull in Alex from the Mastra team here, and he's got a talk coming up. So, Alex, if

14:24

you're there, we're going to pull you in here in just a second. But we want to talk a little bit about evals. And so before we pull him in, there was a post, I don't know who shared it, maybe you shared it with me, that said: you don't understand evals. Everybody in

14:46

eval should read this. So Maxim posted this, and of course I dug in. It got quite a bit of traction, and it is an eval guidebook, v2. So,

14:59

let's take a look at this. So, if you're looking for a place to start learning about eval, this is a good place to go. So, first of all, it kind of talks about the life cycle.

15:19

It talks about the problem it's trying to solve. And of course, we could spend an hour going through this. You can see the scroll bar here. It is

15:26

comprehensive. Yep. So, we're going to talk a little bit about our opinions on evals. Uh, Alex is

15:34

going to talk about a talk that he's doing on eval. I wouldn't claim that we're the experts, but we know a thing or two. But if you want to be an expert, I would recommend you start here. It's a good uh a good place to get started.

15:49

And I think Alex is having some issues here with his mic. He sent me a message and said, "I'm having mic issues, but not as bad as you." So, you know, we'll see if he joins. But let's talk a little bit, let's set the stage. What are evals? When should you

16:10

use them? Because we've talked in the past around, like, should you actually even use evals? A lot of the coding agent companies barely have evals or just started introducing evals.

16:22

So, okay, what are evals? Uh, we've done workshops about this too, but the quick thing is, it's a loop that you put yourself on where the goal is to improve your application. So whatever you're building, let's say a veterinary app that, you know, listens to vets, or a vertical agent like that, well, how do you know that that

16:48

agent is doing well? It may be doing well in your trials and tests of it locally, or in your prod. And what usually happens with most AI apps is, when you're building, things are going really well, but the minute you deploy to production, real users with their real prompts and their real behaviors start

17:08

using your application, and then the things that you tested in a silo are not actually happening in reality, and you don't necessarily know. So then let's say you measure it and you know that it's screwing up all this time. How do you fix it? So that's the part of evals, which is, you know,

17:25

tracking data, figuring out why things are going wrong, making adjustments, tracking again, and then the infinity loop of that. Yeah. And I think one of the things that I've seen as we've talked with more customers who are starting to do evals is that pretty much most people prototype without any kind of evals in place, right? That's just the way we think about it, right? You

17:54

know, everyone wants to talk about test-driven development if you're an engineer, but most people don't do it, right? They want to see something working and then they write the tests later. Evals typically follow the same pattern, at least in practice, from what I've seen. But what we often see is, if you're doing something where

18:13

it's harder to determine if it's right or wrong. Like, you know, a lot of times when you're writing code there's not one right way to write it. As long as you get the result, you could go about it in a lot of different ways, and we could all argue what's good or what's bad, but Obby and I could write this a

18:29

different way or get the same result by doing two completely different things. Yeah. And so in those cases, oftentimes we don't see people grab evals or use evals until later, till much later. But

18:40

if you have something that has more regulatory burden, if you're building a finance agent, if you're building something that's legal, something that deals with shipping or e-commerce. Yeah. Things where, if the cost of failure is really bad, if it messes up,

18:58

that's when evals become a lot more important. And you shouldn't wait till you're trying to get to production to start writing evals. You probably need to think about it much sooner in the process. And it typically

19:09

starts with that data set. Obby kind of talked about it. And then you're building that loop so you can continue to improve things and make sure that over time you're not getting worse, you're actually getting better. I think we bias too much on what

19:23

coding agents are doing. And because we are the users of coding agents, we can evaluate it every day, right? Like sometimes when Claude starts messing up, don't we all like tell each other Claude's having a bad day today because the code it's generating is not as good as you thought yesterday, right? And you

19:41

can visibly feel it, but dude, honestly, a forEach versus a for loop never killed anyone in real life. So that's not the type of stakes that people are talking about. If you want AI to be in the community outside of tech, then you're going to have to have use cases that are like that. And those come with higher stakes. You

20:02

know, everything's higher stakes when it comes to people and the law and getting sued and all that Yes. And Alex is here. Let's bring Alex on. Alex, what's up?

20:15

Hey, how's it going, folks? Good to be here. Yeah. Well, uh, you've been on the show

20:20

before, but maybe you should remind people who you are for those that haven't met you. And you said you were having some mic issues, but clearly not as bad as what we were having. So, you know, I'm glad you got it sorted. I just want to say I was panicking because I thought my AirPods weren't working, and I was so confused, but it turns out it was your microphone not working after all.

20:37

Yeah, we were scrambling to get the new room set up. And yeah, I like it. Clearly missed some things. It's a

20:45

start. Well, we got more to do to start off. Yeah. Well, thanks for having me. Uh, my

20:50

name is Alex. Maybe if you are here from the Mastra YouTube channel, you might recognize me from some of the videos I've been making. My vibe is all about developer experience at Mastra. I want to make sure you have the best

21:02

experience building an AI agent. And that usually means in the form of education, right? Whether that's videos, workshops, improving documentation, demos. And do you know what guys, I'm

21:13

sort of I have some experience on YouTube, but I'm foraying into giving my first conference talk in a few days. And when someone reached out to me a few months ago with the opportunity, I was like, that's loads of time to prepare for a talk. But as the days have got closer, I've realized there's a lot of work to do. Um, I was just working late

21:33

on it now. I thought maybe I could show you some notes and we can talk a little bit about evals and scorers. Your first one ever? I gave a meetup talk uh many years ago

21:43

before the pandemic, but I've never been to like an actual conference. And this one is called API Days, by the way, which is quite a big deal in Paris. Um looking forward to meet some folks in person and talk agents, dude. That's dope. Congrats. Well, don't don't congratulate me yet. Let's see how the talk goes.

22:01

Be fine, man. All right. Give it to us. Yeah. Yeah. I mean, so it's I have I've

22:06

definitely been in a situation where you you think you have plenty of time to prepare a talk and then all of a sudden you realize you have not uh you have wasted all that time or you procrastinated. So what's that thing you say Shane or something like fits to the date or something. Yeah. Like Yeah. Every every project compresses to the timeline or something like that. It's like no matter what

22:25

you're gonna like Yeah. work expands to fill all available time, right? Exactly. So the other thing is even if

22:31

you started it early, you would be in this mad rush wondering if it's good enough. So now you have a deadline and you'll get it done, and we're here to help. Yep. Yeah. Well, I think you gave a really good intro to evals. I'm gonna share my

22:48

screen here real quick and maybe Shane you can help me bring it up. Yeah. And while you're pulling that up, you know, let's go ahead. Yeah, let's take this down real quick. Yeah, I see your full

22:59

screen. Uh, so, question from Sebastian: are evals more comparable to regression testing or to test-driven-development-style testing? I had a feeling it's more about preventing what you saw go wrong during development.

23:13

And I think there's, you know, I'll answer and then feel free to for either of you to jump in. I think it's a little bit of both in some ways because you typically have some kind of data set that you will run anytime you're making major changes. So you are basically kind of doing some like test driven development. You you

23:31

want to make sure that the tests are passing before you merge, right? So that is one part of it. That's what people will often call your CI evals, or your offline evals is what some providers call them. But it's essentially an eval

23:50

that is not running on live user data that's coming in. It's running on some data set that you've prepared that you want to make sure that doesn't go backwards. And but on the other hand, you also have this idea of live evals or uh things as users are using the agent live, you're going to either run evals on a sample or all those depending on what your traffic

24:14

is and costs and all that stuff. And you're going to then pick out interesting failure modes that you may then want to incorporate into the data set. Or maybe some are bad enough where you're like, okay, we've actually got to fix it. So it's almost a

24:26

combination of these kinds of live and offline evals; those are, I think, the common phrases that people use. Uh, the nice thing is we have Laurie from Arize coming later. So if you want to know specifics about terminology, Laurie will be the expert. Uh

24:40

and we'll learn a lot more then. But Alex, do you want to pull it up? Yeah, maybe I can stick around and learn from Laurie because I've got some questions. That's for sure.
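For anyone who wants to see what that offline-versus-live split looks like in code, here is a rough TypeScript sketch. It is not any particular framework's API; the Agent, Scorer, and dataset shapes are placeholder assumptions. The offline run replays a prepared golden dataset (for example in CI), while the live path scores only a sampled fraction of real traffic.

// Rough sketch only: the Agent, Scorer, and dataset shapes below are assumptions,
// not a specific framework's API.
type ScoreResult = { pass: boolean; reason: string };
type Scorer = (input: string, output: string) => Promise<ScoreResult>;
type Agent = (input: string) => Promise<string>;

// Offline / CI evals: replay a prepared golden dataset and flag regressions.
async function runOfflineEvals(
  agent: Agent,
  dataset: { input: string; note?: string }[],
  scorers: Scorer[],
): Promise<boolean> {
  let failures = 0;
  for (const example of dataset) {
    const output = await agent(example.input);
    for (const scorer of scorers) {
      const result = await scorer(example.input, output);
      if (!result.pass) {
        failures++;
        console.error(`FAIL [${example.note ?? example.input}]: ${result.reason}`);
      }
    }
  }
  return failures === 0; // in CI you would fail the build when this is false
}

// Live evals: score only a sampled fraction of real traffic and log the result
// to whatever observability backend you use.
async function scoreLiveTurn(
  input: string,
  output: string,
  scorers: Scorer[],
  sampleRate = 0.1, // score roughly 10% of turns
): Promise<void> {
  if (Math.random() > sampleRate) return;
  for (const scorer of scorers) {
    const result = await scorer(input, output);
    console.log(JSON.stringify({ kind: "live-eval", pass: result.pass, reason: result.reason }));
  }
}

Failures surfaced by the live path are exactly the examples you would then fold back into the golden dataset, which is the loop described above.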

24:51

Yeah, this talk's an interesting one. Can you zoom in a little? Yeah, for sure. It's quite interesting because when I joined Mastra, I'd never built an agent before. And I looked at this list of documentation thinking, "Wow, I've got so much to learn." And

25:09

gradually as I learn about these features, I've been making YouTube videos about them and sort of teaching what I learn in public. The one area that I hadn't really gone in depth on that I'm super stoked to learn more about is evals and scorers. I think you touched on it when you mentioned that sometimes we only think about these

25:26

towards the end of our projects or at least when we're in production. While I'm learning, I've not been doing that so much. So, I thought this conference talk would be a good forcing function to learn more about evals. Honestly, it's turned out to be a bit harder than I

25:40

anticipated because there's just so many opinions about evals and definitions and the sentiment in the community seems to vary. I want to talk a little bit about that as we go on. But let me just start with a quick definition and a quick understanding of the challenges. Just building a little bit on what Obby said.

25:59

I think we all feel this, right? That our models work all the time, some of the time. And that's that non-determinism at play. that's so powerful but can also reduce our confidence when we go into production

26:10

and so you end up in this sort of scenario where your agent can work 95% of the time, 100% of the time. The reason I phrase it this way is because you might be sort of playing with the agent locally, doing some testing, and things kind of work, but it's when you put it into production and you start getting hundreds, maybe thousands, of different inputs that you get these kinds of reports of errors. And if we're talking

26:30

about deterministic software, that's bad enough; when we're talking about a nondeterministic agent, that can be very stressful, very difficult to track down and improve. And this could be huge for your application if it's in a regulated environment. Sure. But I would argue anything that is customer-facing, these types of tasks, essentially are very

26:50

important. Going back to the idea of the sentiment, you might have seen tweets like this; I think I had it pulled up but I closed it. I think it was swyx who tweeted about how the Claude Code team isn't using evals and other big companies aren't using evals. But I think with

27:09

those products oftentimes we're like the domain expert in the driver seat and so we're not as sensitive to when they go wrong. Um and they will be looking at proxy metrics like if we're accepting changes and things like that, but say you're building some kind of customer support agent, you really want to feel good that everything works and will continue to work. If you want to change

27:27

the agent, you want to do that with confidence. You might want to consider how the agent will behave with real user inputs you haven't seen yet. Guardrails can be essential, right? But only if they work. And you also want to get this

27:39

idea of like how the system performs in production. Like if there is some degradation due to a regression or maybe it could be a change of your model provider or maybe your tool coming from an MCP server changes, you want some kind of overview to make sure this is going in the right direction. And so the challenge is that we need to systematically measure the quality of

27:57

our AI systems and diagnose those failures. And the more specific we can be about why they're failing, the better chance we have of addressing that, right? And so the solution to these challenges is application evaluations.

28:10

And I took a little mental note here to not be confused because I feel like evals can mean different things depending on what you're talking about. When the providers release a new model, they come with evals, right, based on these benchmarks. Um but these are quite different from application or task based evals right?

28:28

Yeah, 100%. The types of things that the models are being evaluated on are completely different than your application with, like, a business use case. Also, benchmarks are very academic in some ways, too, like how good is a model at doing mathematics or history, etc. But

28:47

you're as an application developer, you're testing how good your agent is at solving the business need that you built, you know, so it's a little different scope and and your specific business need, right? These models are very general, but when we're building an agent, it's to solve a specific uh problem and

29:04

that's pro potentially something that's never been seen before, certainly with your unique sets of circumstances and context. And so an eval I think you can make this I've kind of learned that you can make this as simple or as complex as you want. And I think that might be and I'm very curious to hear your take about

29:20

the sentiment towards evals. But I think that might be where some of the perspective comes in because an eval fundamentally is a single metric that scores a specific aspect of performance. For example, does it work? Is it good quality? Does it follow the rules in the system prompt? Does it follow the tone?
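To make "a single metric that scores a specific aspect of performance" concrete, here is a minimal TypeScript sketch of one scorer. It is purely illustrative, not Mastra's or anyone else's API, and the "no refund promises" rule is a made-up stand-in for "does it follow the rules in the system prompt."

// One scorer = one metric for one aspect of performance. Illustrative only.
interface EvalScore {
  name: string;
  pass: boolean;
  reason: string;
}

// Deterministic check: stays under a length limit and never promises a refund
// (a hypothetical system-prompt rule, used here just as an example).
function scoreFollowsRules(output: string): EvalScore {
  const tooLong = output.length > 1200;
  const promisesRefund = /\brefund\b/i.test(output);
  const problems = [
    tooLong ? "response is too long" : null,
    promisesRefund ? "response promises a refund" : null,
  ].filter((p): p is string => p !== null);
  return {
    name: "follows-rules",
    pass: problems.length === 0,
    reason: problems.length === 0 ? "within limits" : problems.join("; "),
  };
}

Subjective aspects like tone are the ones Alex gets to next, where a model has to do the judging instead of a regex.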

29:37

This is an interesting example because that could be quite subjective, which will require a clear definition, and it's also something that you probably can't strictly evaluate using deterministic code. This is where we get into that interesting territory of using a model to evaluate a model's output. We'll get into that. But I highlight scorers here to make it clear that if you ever see

29:56

scorers used in the context of evals, I would say scorers is like a tactic that falls within evals to assess a specific aspect of performance. And in the simplest form, you know, you can run your agent and you can give it a score. You can say this is good or this is bad. And you can do that as a manual type of thing. And I'm sure you do during

30:14

development. But as you can appreciate with a huge variety of inputs and a more complex agent that isn't going to scale very well. So we start to look for a more systematic approach. Broadly, there are two ways of scoring an LLM's output.

30:30

The first is deterministic, which is basically where you write code to test the output of the model. This is fine for things like length, formatting, checking for string occurrence. You might as well do this because it doesn't require a model. Therefore, it's fast and cheaper. Evals don't have to be expensive, by the

30:46

way. Uh, but if it's not broke, don't fix it. You can use deterministic code in some cases. And these are basically

30:52

unit tests. You'll probably use like a different test harness and you'll be thinking about it in a different way. Um, but these are a good mental model is to compare these with unit tests. As I mentioned, some things like tone, you're going to really I don't know how to

31:04

write code to check the tone, but I do know how to ask an LLM. And so, you can use an LLM to create a scorer that will test things like quality based on a criteria, empathy, style, tone, whether it follows the rules, if it does the correct thing, all those types of things. And so, the other interesting thing about evals is that they can be

31:22

operationalized in different ways. You can run scorers on some or all live users. This is tactical sampling, where you can determine whether you want to run on 100% of users or 50% of users, and you can connect this up to some dashboards in interesting ways using observability platforms to get that kind of overview of things in production. I

31:41

only learned this recently, but you can run scorers on, like, the entire conversation. You can run them on a trace. You can run them on a span. You can run them on a tool call output, effectively, right? And it's just a way of having more

31:51

granularity to see what works. Um, but again, with a lot of granularity options comes a bit of this kind of analysis paralysis thing. And as I work on this presentation, something I'm trying to figure out, by the way, guys, is what is the right sort of, um, what is the 80/20 rule for evals, right? I

32:09

feel like you can take this so far. I feel like you can write evals for every aspect of your application and you can get very specific or you can go very broad and just test everything. You can run it on your conversations. You can

32:20

run it on your traces. You can also, by the way, run scorers on created data sets, like Obby was saying, with each config change. This is a bit like what you're used to seeing in CI pipelines with tests, right? But if you change the prompt, you might want to run your evals against the golden data set to make sure that the happy path and the typical edge

32:39

cases are working. Um, but yeah, let me I really want to get some more clarity on this because um this is these are the tactical aspects, right? These are quite easy to understand, I think. Uh, but the question is how to um find the right

32:52

balance when running evals. Like, what's a good mental model so that you don't just waste a bunch of time writing evals that aren't helpful, but just focus on the 20% that get you 80% of the way there. Yeah, this is a good question. So, we also have a question in the chat that is not the same, but I think it's somewhat

33:10

related. So, Jefferson says, "Would be nice to be able to check if tools were called with specific values." You can. Yeah, that's probably one of

33:16

the most common evals, right? If you give an agent a bunch of tools, you want to know: if this prompt came in, it should call this tool with these values, right? One of the most common failure modes is that the right tool didn't get called in the right situation. And so that is one of the most common sets of evals that you want.
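Jefferson's question, checking that a tool was called with specific values, is one of the easier checks to write down. A sketch under stated assumptions: the recorded-tool-call shape here is invented for illustration, and in practice you would read these calls off whatever trace or span format your framework emits.

// Hypothetical shape of tool calls recorded on a trace; adapt to your tracing format.
interface RecordedToolCall {
  toolName: string;
  args: Record<string, unknown>;
}

// Pass/fail: did the agent call the expected tool with the expected argument values?
function scoreToolCall(
  calls: RecordedToolCall[],
  expectedTool: string,
  expectedArgs: Record<string, unknown>,
): { pass: boolean; reason: string } {
  const match = calls.find(
    (call) =>
      call.toolName === expectedTool &&
      Object.entries(expectedArgs).every(([key, value]) => call.args[key] === value),
  );
  return match
    ? { pass: true, reason: `${expectedTool} was called with the expected values` }
    : { pass: false, reason: `${expectedTool} was not called with ${JSON.stringify(expectedArgs)}` };
}

// Example (made-up tool): scoreToolCall(trace, "getWeather", { city: "Paris" })
// should pass only if the agent actually looked up Paris.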

33:36

Uh but the reason that's important is it ties to what is your agent trying to do? What's the business value your agent's trying to solve? And you should focus on what are the one to two to three things that you need this agent to really do well and that's your 20% right and depending on the complexity of the agent. It might be more than two or three things but in most cases most of

33:57

us, when you build agents, you kind of pick a small set of tasks that you want it to do really well. You should look at the end-to-end result of did it accomplish that, and when you score it, there's a lot of contradictory, I guess, evidence of how you should set this stuff up. Maybe you should do like a gradient scale, or a linear scale of zero to one,

34:22

and it could be a 0.5, like an okay score, but that's kind of confusing, like what is it? If you're just getting started, I'd recommend doing a pass or fail: it either is good enough or it's not good enough, and that is it. Because a human, if you're thinking about it, if I give you a bunch of code, is it good enough to ship or is it not

34:40

good enough? That that's it. That's what matters. It's not like, oh, it's pretty good, but there's no in that case it

34:46

wasn't good enough. So, I guess that'd be my 80/20: focus on just pass or fails, and focus on evals that are aimed at whatever you want the business result to be. That definitely tracks. I've seen, in a few different articles and conference talks at AI Engineer, for

35:04

example, people realizing they overcomplicated it with those scales and just sticking to a true fail or pass. That seems to get you pretty far. What about categorical labels like good, fair? I mean, those can all be transformed into numbers on a scale, right? An

35:25

LLM can assign labels to things, and then you can give those labels meaning. So for example, if I had, let's just say, excellent, good, poor, terrible, right? Um, but for me, excellent means five points and good is only three points, and there you go. You can make your own numbers, right? Yeah. That comes back to: you can skin this

35:51

however you'd like. Um, but I'd like to answer your question with an actual example. Please. Like, let's say I have an agent that essentially takes my symptoms if I'm sick and then recommends medication that I can go buy at the pharmacy, or that I have to get a

36:12

prescription, right? We can get all into that, too. So if I built that application and I put it in production, I have to cover my ass, because I'm not trying to get sued or anything for doing the wrong thing. If I

36:24

say I have a headache and it gives me like, you know, Pepto-Bismol, that's not good, right? Like it has to be relevant. And that agent doesn't know what's on my inventory, right, in my pharmacy. So it must have tools to then go access that

36:37

information. I'm already setting up the stage of how you would cover your ass in the situation. One, tool call accuracy.

36:45

When the agent is finding things in inventory based on what the user prompted, did it call the tool with the right values, and did I get the right result? And if you have many tools that are imperative to the success of the application, you go do tool call accuracy on all of them, right? Second, relevancy to the user input: if I'm asking about a headache and I'm getting diarrhea medication, that's not good,

37:10

right? So we need to write evals for making sure the responses are relevant. And then lastly, you don't want to damage or hurt the user. So if they're asking about giving me OxyContin or cocaine or something, you need to have guardrails to be like, "No, we don't sell that here." Or, actually, what if it, like, goes on the Silk Road and actually

37:28

tries to buy it for the customer, right? Like we don't want that to happen. So you need to put guard rails in place. Hopefully that makes sense.
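Obby's pharmacy example maps fairly directly onto scorers. Tool call accuracy looks like the tool-call check sketched earlier; the other two are sketched below. This is a rough illustration, not a real implementation: callJudgeModel is a placeholder for whatever model client you use, and the blocked-terms list and placeOrder tool name are invented.

// 1) Relevancy, using an LLM as a judge. callJudgeModel is a placeholder.
async function scoreRelevancy(
  callJudgeModel: (prompt: string) => Promise<string>,
  symptoms: string,
  recommendation: string,
): Promise<{ pass: boolean; reason: string }> {
  const verdict = await callJudgeModel(
    `A user described these symptoms: "${symptoms}".\n` +
      `The agent recommended: "${recommendation}".\n` +
      `Is the recommendation relevant to the symptoms? Answer PASS or FAIL, then one sentence explaining why.`,
  );
  return { pass: verdict.trim().toUpperCase().startsWith("PASS"), reason: verdict };
}

// 2) Guardrail: if the user asks for a controlled substance, the agent must not
//    try to order it. Deliberately naive; list and tool name are illustrative only.
const BLOCKED_TERMS = ["oxycontin", "cocaine"];

function scoreGuardrail(
  userInput: string,
  toolCallNames: string[],
): { pass: boolean; reason: string } {
  const requested = BLOCKED_TERMS.find((term) => userInput.toLowerCase().includes(term));
  if (!requested) return { pass: true, reason: "no blocked substance requested" };
  const ordered = toolCallNames.includes("placeOrder"); // hypothetical ordering tool
  return ordered
    ? { pass: false, reason: `agent tried to order ${requested}` }
    : { pass: true, reason: `request for ${requested} was refused or not fulfilled` };
}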

37:35

Absolutely. I think some of those examples are no-brainers, right? Like we do not want to be ordering cocaine off the Silk Road. Um, I think where it gets interesting is, like, agents often behave close enough. I think it would be unlikely you ask about a headache and get diarrhea medicine, just because the model

38:01

is probably going to go on roughly the right track. But it's more about those kinds of edge cases, where it's more likely to fail if, for example, someone phrases it in a very specific way. And I think I struggled to identify when is a good time to worry about those edge

38:19

cases and when to just focus on the big picture obvious problems relating to the silk road etc. Yeah. So that's why we always recommend tracing everything from the beginning. Uh so even if you're not doing evals,

38:33

you're collecting traces for all the things coming into your system, and then you yourself, manually, over time, should be able to discover edge cases in that data, because users are non-deterministic too. Um, so you have a bunch of inputs that would probably lead you to believe that there are edge cases or something, you know, and then you can form stuff around that. This is one of those things, dude,

38:58

Alex, it's always tough with this subject because of non-determinism. Yeah, you're really trying to do the scientific method and how was science created? There was no like, you know, rules, right? You got to keep testing, experimentation, and then eventually you

39:11

have, like, hypothesis analysis, and then you're, you know, you can back up what you say. Same thing with evals. I like that. I think this is such a good point, right, that tracing is essential. And

39:26

while you might not be uh completely sure about how to run evals in your project, it's a very good idea to turn on tracing earlier than later because you're building up a data set essentially uh that you can then analyze to identify uh failure modes and then perhaps refine your evals over time. Um yeah, so we're not going to go too much into the details. We're running a little bit um short on time, but it's it's kind

39:53

of interesting to me because, and by the way, this is just something I need to learn more about and I'm excited to get into. I'm really curious to look more into platforms like Arize, by the way, because I think they can help with this. I feel very good about this idea of running scorers on a created data set because I can liken it to CI and TDD and tests and things like that. And they're

40:13

different from tests in the sense that usually tests are about um like deterministic type stuff and there's no such concept as LLM as a judge. It it makes sense that we have different nomenclature for this very different category. But then there's this idea of like having the sort of eval loop where you are always using traces to uh like

40:33

running scores against traces continuously in production and using that to gain insights which you then go on to improve and you sort of repeat this cycle um until the metrics are where you want to be or if they drop where they where they are. And and I think this gets and I just want to call out that I think this is really hard for someone new to agents to wrap their head

40:52

around unless they're truly thinking about a production use case because it's a lot easier to learn this I think when you have a hook to hang on if you can think about um what you're building and and how to do it. And so this is something I want to learn a bit more about. And we get into all kinds. I don't think we have time to get into all

41:10

of it today, but we get into these really interesting ideas like LLM-as-a-judge. And then when you're using a model to test a model, well, you could kind of say, well, that model needs its own tests, right? And that's sort of where this idea of ground truth comes into play. We also think about ideas like having a golden data set. You know,

41:26

we talk about the 80/20 rule. This is a good way of starting, I think. And I have seen these videos, obviously, and talks where people talk a lot about data sets, and they talk about error analysis. Error analysis sounds a bit complicated, but it's basically just looking at how and when the model fails

41:43

and ideally forming those into some kind of categories because if you can create a category you solve that category and you have this downstream effect of solving a bunch of issues like it makes total sense right and you might also want to look at a tactic called uh synthetic data which is where you use an LLM to generate data instead of

42:00

collecting it from real users. Suppose you're building a very specialized agent, something to do with like contracts for example, you have very few but very high value customers, you might not gather enough significant data to to run evals of scale in a meaningful way, but then you could generate some uh synthetic data. I again I I do want to learn more about these because um they

42:21

are, you know, interesting. There's this kind of distinction, right? We have this idea of running scorers in development and as part of a CI pipeline, but then this continuous monitoring and evaluation. That's something I'm still trying to wrap my head around a little bit. I wondered if you guys have a good way of

42:39

thinking about it. Yeah. So, so what I always say is if you've run a production application before, it's actually not as different as you would think. There are definitely

42:49

different Yeah. different nomenclature, different way to think about it because of the non-determinism. But if you think about your data, your golden data set that runs against CI, you you've probably had software tests that run on CI. So that's not that different. And you kind of mentioned, but the other aspect is if you've ever run a production application, you

43:08

probably have some kind of monitoring, right? Like is is the latency on my API behind like what it should be? Do I need to investigate that? It's so you almost

43:21

look at it from like a DevOps or um you know a monitoring perspective and that's what those live evals really are. It's another input and yes it is a little bit harder but sometimes if you're in DevOps you know like sometimes tracking down production issues is actually it's not as deterministic as you'd want it to be. It's actually pretty hard. So whack too. So I mean it is kind of

43:44

like that in a way. And of course it's maybe even harder in this case because of the level of non-determinism, but it is very much a monitoring exercise. And it's maybe a different persona; it's not the DevOps person doing the monitoring, maybe it's the human expert. If it's, you know, in Obby's case earlier, like a medical

44:02

pharmacist well you maybe need some doctors that are actually validating that it's not the DevOps person but it is kind of that monitoring approach of like we are monitoring the application and then we are going to do things to try to make it better if we see failure points. And so I think that's the way to think about it doesn't make it any

44:19

easier to do. But I think it's like that's that's how I think about it and it makes sense to me that way. That sounds good. I like that. Just one last question for you then and I'm so stoked to see the rest of the show uh

44:34

with Laurie from Arize and the folks from Defang as well. Um, I'm glad we got a chance to set the stage a little bit with some basic definitions around evals and scorers and things. Um, there seems to be a bit of mixed sentiment in the industry. Uh, some people, like, sing

44:51

evals' praises, other people are a bit more cynical. Um, I feel like maybe there's a trend of people looking for the minimum, simple way to use evals and get value from them, like we were talking about, right? Keeping it simple with a pass or a fail, maybe running a limited number of evals. Like, what's your sort of take? Do you think that evals are, like, um,

45:15

overhyped and not as helpful as maybe some people would want you to believe or do you think that they have like a a very useful place for for most agents? And how how do you kind of reckon between those two sides of the um conversation? I have controversial take to this as always. All right, do it. I think that there's a

45:35

big conspiracy out here. Okay. Um, now, like, social media and the whole AI industry has been up and down on evals. Uh, when we first started Mastra,

45:48

evals were such a big thing that it was one of the first things we built outside of like what we had, right? So, we we added the old version of evals back then when we didn't know anything. Then we come to find out that users are not even close to production to even care about evals. And so there was a

46:07

whole moment where no one cared about evals. Then some big companies started publishing their evals. Well, not big companies, just startups publishing their evals. So now people care about evals again. Then a lot of

46:19

companies exist like you know the the big eval companies exist. They're pushing for people to do evals because they need to make money. like that's their that's their product. And then the coding agent companies

46:31

start saying that they're anti-EVAL and they don't do them at all. So then that confuses people uh because you know we're all engineers and we're trying to like do the right thing. We're trying to like look for guidance. I bet you if

46:44

Airbnb published an eval set taught people how they did it and how it impacted the conversion of Airbnb customers, everyone would be doing evals tomorrow. So, this is all like a thing of like who are the people listening to and thought leadering about it. Um, but yeah, that's just my take on it. What's yours? Yeah, I I would say it depends. If

47:04

you're in a highly regulated industry, we've got a hot take, and then we've got the, you know, if you're building a coding agent, just ship the thing. All right. Just make sure that it passes the vibes and get it out there. And then if a lot

47:20

of people are using it and you do what, you know, Cursor and Windsurf notoriously did, where you feel like one day it's working great and the other day it sucks, if you were using it at that point, you realize maybe you wish they would have had some evals; maybe it would have caught some of the regressions. Uh, and so I think it does depend. I

47:37

think the higher the regulatory burden, the higher the chance you're going to need evals earlier in the process. Yeah. Every customer that we've talked to that has to do compliance is, like, eval-forward. Let's say pro-

47:49

eval. Anyone who's a startup who is vibe based, they're anti-eval. It slows them down. Quote unquote slows them down. So it's just interesting. Yeah. It's kind of interesting looking at it

48:00

as, it's about the risk level, is what you're saying. Yeah. I'd say, if you can get it out and get your thing in front of users quicker, then you're going to get your data set, and then you can build your evals later. If that scares you, then you probably need

48:16

eval because you probably have a reason where if your agent messes up you're on the hook for something. So yeah. Well, well, that feels like a neat little bow to put on this thing because I started by saying about how it works all the time, some of the time, right? And I think for a startup where what you care about is putting something in front

48:34

of users for them to react to, you're inching towards product market fit, then you know, it's okay to have some rough edges, right? That's not going to be deal breaking. But for some applications, especially if it's to do with like a formal process or compliance or something, if it doesn't work, it will not be successful. There's no doubt about it. So it sounds to me then it's

48:53

like a balance between risk and speed, and that's not a new consideration for most startups anyway. Um, I think Lovable just deferred an enormous VAT bill because they wanted to move fast. You know, it's just a thing in startups; it's how we roll. But yeah, thank you so much for

49:09

having me on. I hope I could set the stage for you all. Dude, your talk is going to be great. The content is amazing. It's going to be

49:15

very thoughtprovoking. Good job. Yeah. Thanks man. Hopefully it's recorded. Yeah, hopefully

49:21

it's recorded. I want to watch it. All right, see you. Alex

49:26

Sebastian said, "Eval is a loaded term in the JavaScript world." Indeed, it is. But as we know, AI was not created by JavaScript developers.

49:39

We have learned. We have learned. All right, we are behind schedule today. So, we're going to go through the news.

49:44

We might go through the news a little faster than normal. And if we don't get through it all, we'll come back and finish whatever we didn't get to, because we have Laurie, who should be coming on here in a bit. So, let's talk some news.

49:58

Anthropic's been in the news a lot. So, first up on Anthropic: Simon Willison had a blog post that said Opus has a soul document, which, when I read that, I was like, what is a soul document? What does that even mean? Um,

50:18

but apparently the thought was, rather than just being added to the system prompt, it was part of the training data, to try to give Opus a soul, you know, to basically guide how it makes its decisions. And so there are some things in this that, if you scroll down, it kind of talks through, like,

50:44

you're Claude, you're trained by Anthropic, here's our mission, we're in a peculiar position, we believe we might be building one of the most transformative and potentially dangerous technologies in history, yet we press forward anyway; this isn't cognitive dissonance but rather a calculated bet. And so it's basically

51:04

trying to instill a sense of obligation to do the right thing, I think, through this document that was part of the training data. Um, and it talks about, you know, trying to prevent prompt injection. And it's just an interesting idea; of course, we've probably put that kind of thing in a system prompt, but now they're trying to

51:24

like bake in certain preferences, and how they want the soul, or the LLM, to think, into the training itself, not just in the system prompt, which I thought was pretty interesting. Super legit. Uh, so there's also some rumors. I don't know if it's been officially confirmed. I

51:44

haven't really kept up to date over the weekend, but rumors are that Anthropic is going to go public. So, that's potentially coming. And they're doing some behavior that would maybe indicate they're planning to do that. So for

51:57

those of you in the JavaScript world, you probably saw that Anthropic, in a surprise move, purchased or acquired Bun. Huge. So yeah, that definitely caught my attention as someone who's obviously used Bun a little bit, but I'll share it here. And yeah, so Anthropic acquired Bun. You know, they were partners with Jarred on the Bun team. And you know, I think as

52:25

much as they probably wanted the technology, I think they maybe wanted the team even more, because it seems like Anthropic wants to own the coding market, right? And being close with developer people, great engineers building developer tools, is one way, yeah, to appeal to that market. What do

52:46

you think? The Bun team is super talented. Um I don't know what happens to bun per se.

52:54

um, because they are trying to focus on Claude Code. Bun had an announcement blog post as well that you can read on their website. Um, but yeah, they're taking really good talent. Also, if you think about it, Anthropic has a runtime now, and that's interesting. Um, so I would bet that they're going to keep doubling down on making Claude Code the best for engineers, and then maybe

53:20

some other new stuff that allows them to pack and ship JavaScript or write code better for these cases. Which ties into this report from The Information, and again, I don't know how to validate if this is true or not, but Anthropic has apparently told investment banks, most likely as part of its IPO plans, that it

53:42

plans more acquisitions to strengthen its coding capabilities. Coding-related tasks now make up the majority of its revenue. Also, look at that article right below: OpenRouter showing Anthropic dominance.

53:56

That's tight. And there's actually, I'll pull this up, another post. This is a tweet from swyx

54:09

that says this one chart explains everything about why OpenAI, xAI, and DeepMind dropped everything to go chase the grand prize in coding use cases. So you can see that, basically, if you look at programming, it looks like it's the highest value; you have the amount of tokens, the cost per

54:35

token, and it's kind of a sweet spot. And this is from OpenRouter, right, which we'll talk about a little bit later as well. But it's kind of that sweet spot where it just makes sense. And they've of course been on the sidelines seeing Anthropic winning and pulling in so much revenue that I think it has made all the other model labs really take

54:59

notice. Dude, coding agents are a pyramid scheme if you think about it, right? Like because we are the ones using it the most, paying the most and uh you know and that's where all the model companies want to go. Um like even

55:12

with OpenAI's code red, I wonder, are they going to try to improve coding performance or try to focus on ChatGPT, the thing that they could win? Yeah, I don't know. That's a good question. Yeah, OpenAI, you know, kind of announced they have a code red where

55:25

they're trying to focus on the product, but they're winning in consumer land. So maybe they should focus on consumer. Maybe they're going to give up. Maybe they regret the workflow builder now, because they're like, "Oh,

55:36

that was a distraction." That was a distraction. I mean, but I do think that, you know, their Codex model some people do like, you know, so it's not obviously what Anthropic has, but it's interesting. It'll be interesting

55:50

to see what OpenAI's next moves are. Do they keep doubling down on coding? I would expect they will. I think they

55:55

want to own it. I think they're going to go after two things: consumer, and they're going to try to continue to win in coding. And I think they're going to try to limit some of the other, you know, quote-unquote distractions that are not specific to those missions.

56:06

The biggest enterprise use case right now is coding. So that's where the money is. You know, not $20. I mean, a $20-a-

56:14

month OpenAI subscription is pretty good at volume, but then Anthropic is not doing that. Claude is not the best chat thing, but the API is, right? So, I wonder how many Claude Max plans have, like, increased. I have a Claude Max plan now, too. Yeah, I've just been paying API usage tokens, but now I'm starting to think I

56:32

should get in on this pyramid scheme. I should get that Max plan. Uh, yeah. Uh,

56:40

Anthropic launched Interviewer. It's a tool to really try to help understand people's perspectives on AI. It's available, you know, for a week-long pilot. This came out last

56:52

week, so, you know, go check it out. But basically you give Anthropic Interviewer a research goal and it drafts research questions, conducts interviews, and analyzes responses in collaboration with the human researcher. That's cool. So they said they tested it

57:05

on a sample of professionals about their views on work and AI. The largest sample was from the general workforce, and it's kind of an interesting approach here. Do the American people hate AI? Let's see. Oh, it's not in the thing. Yeah, it's not part of it. Uh, but

57:23

what the sentiment was from the interviews. We looked at the intensity of the most common emotions... emotion radar? Nope. So basically the goal is for Anthropic to be able to conduct more research on a

57:37

variety of topics but interesting thing that was released. Maybe it's interesting for you to check out. So there's another thing that is not anthropic related but it's because of Opus that it even came about. And so I'm

57:53

going to call this, you know, kind of an Anthropic thing. And I don't even know if it's serious; I think they want it to be serious, but I don't know that it actually is. But we're gonna talk about it anyways because that's what we do here.

58:07

I think I chatted with this guy. Introducing Opus PM, the Opus package manager. Instead of npm install lo

58:15

dash, we run Opus PM install lodash, except it doesn't install anything. It builds the entire package from scratch using your Claude Code. Never have supply chain attacks again. And so there you go. It's open source.

58:29

Uh, obviously it's a bit of a joke, but it's an interesting idea, in that as Opus gets better and it can write and manage its own code, do you need fewer packages? That's maybe why this is kind of interesting. If they're simple packages, would you just rather build

58:48

them yourself or would you go ahead and uh still want to use the package that maybe has some vulnerability, you as we've been seeing with npm and other things lately? What do you think? I don't think that's worth doing like if it's already written because you're just wasting tokens. But I, you know, I was

59:10

talking to someone who had a different perspective, like, we shouldn't care about tokens. Like, the current problem is caring about tokens and IDEs and stuff. Maybe in the future, when we don't have to care about the code, you don't even care what packages or whatever. Um, so I don't know. I don't

59:27

know how I feel right now. We definitely need to care about the code. Like you can't just be rewriting lodash every single time.

59:33

Yeah. You know, not efficient. Yeah. I mean, I think it's

59:38

in a future state. Maybe there's a some some of these packages that could go away, right? Because it's but I also think once you write the code, you still have to maintain it. And that's on you.

59:51

You know, even if you're using tools to help you maintain that code and those tools are better than you could do yourself, that's still significantly larger code base. And of course, maybe you don't need as much. There's a lot of stuff you could throw away from some packages that you don't actually use,

1:00:03

right? You might only use one part of the package. But I do think like in general having to maintain that much extra code is probably not a good thing.

1:00:15

Maybe eventually the token cost continues to drop. The speed increases and we don't even have to ever look at the code again. But I also know that LLMs uh write some insecure code as well. So you shouldn't just trust the

1:00:27

code that was written by the LLM to replace the package that might also have a security vulnerability. So use your judgment. Something relevant to us: we have this library that we were using that is now no longer supported on npm, which is a bug now in our code. It's called string-similarity. I'm not going

1:00:44

to go look for another string similarity library, because I'm going to ask Opus to write it, and that's how I'm going to pick that battle. But if I needed to implement lodash, I'm not going to ask Opus to make lodash. Well, if I needed, like, findIndex from lodash, for example, and I only needed that one function, maybe I

1:01:02

would ask Opus to do it if I didn't know that already existed, and I think that's the power of it. Yeah. And I've got to fix that bug, by the way — I just remembered it while talking about this. And yeah, so we're a little behind schedule, so we're going to move on. We're going to bring Laurie on in just a second here, but we got a question in the chat, and it looks like Meet Ryan Evans

1:01:27

says, "This is my first time watching this live stream and it's great to have all this AI new information and casual style of talk." Well, we keep it pretty casual around here, you know. But how often do we all go live? Every Monday.

1:01:38

Usually right around noon Pacific time. Some days like today, we run a little late, but noon Pacific, we're here live almost every week. And if we're not, uh, we usually make it up at some other point in the week. Uh, but it's a good

1:01:51

reminder if you are just tuning in for the first time. I'm Shane. This is Obby. This is AI Agents Hour. Please go give

1:01:57

us a review on Spotify or Apple Podcasts. And this is a live show, so please drop in your comments on YouTube, X, LinkedIn, wherever you are watching us from. And with that, I'm excited. So before I

1:02:13

bring Laurie on, I want to say Laurie— Are we back? Okay, I think we're back. Sorry about that, y'all. I think we had a big power

1:03:28

surge or something here. Okay, what were we saying? All right, are we back?

1:03:34

Uh, we we had some This has been the show of technical difficulties today. I think it's happening to him again. Uhoh.

1:03:40

It's happening to me. Are you still on? No, I was.

1:03:48

All right. Let's see if we're back. Nope.

1:03:56

What's going Well, all right. So, we're back. We are kind of back. This has been the show of technical difficulties, but

1:05:16

sorry y'all for whoever lost us on the stream. We're going to try to do our best to make this thing work. And we uh yeah, we might have lost audio quality a little bit, but we are gonna do what we do every week here and we're going to roll with it. So, uh, and we'll figure out what is in this new, uh, studio

1:05:35

setup that, uh, caused the failure, but it seems like the internet just kind of dropped on us here. But I don't know where we cut off, so I'm going to start back over and say that I'm excited to bring Laurie onto the show. They saved the day. Laurie saved the day for us at TS AI Conf.

1:05:52

Uh, we had our MC for the event get sick — I think she had an ear infection or something — and she wasn't able to make it. She let us know basically one day before that she did not think she was going to be able to do it. And so, Laurie was able to come in on short notice and kind of save the day and really help us out. So, Laurie,

1:06:11

hopefully you can hear us. Hopefully, uh we won't lose you again. Uh good to see you.

1:06:16

Hello. Thanks for having me on. And are are you there?

1:06:22

Yeah. Can you hear me? Yes, we figured it out. All right. Um,

1:06:27

yeah, maybe for those that don't know you, I think a lot of people maybe do, but for those that don't, you want to give a quick introduction and then tell us a little bit about what you're doing now. Uh, sure. Um, so, uh, my background is as a web developer. I've been a web

1:06:45

developer for 30 years now, which is, you know, longer than a lot of people in this audience have been alive. Uh, and along the way I founded some companies, including npm, Inc. So I was very interested to see Opus PM —

1:07:04

that's really a fun tie-in. Um, and I've worked at Netlify, I worked at LlamaIndex, and now I'm working at Arize AI, where I am doing developer relations and teaching everybody about evaluations and observability. Awesome. And can you tell people a

1:07:23

little bit more about Arize in general? I'm sure many people know. You know, we've been talking about evals earlier, so we'll probably talk a little bit about that, but what is Arize? How long's it been around? And, you know,

1:07:35

what's the common pain point that people end up going to Arize for? Sure. Um, so Arize has been around about seven years. It started before the LLM boom, when it was doing

1:07:49

evaluations for traditional ML. It pivoted strongly when LLMs became a big thing, for obvious reasons. And it has two main products. We have AX, which is sort of the enterprise

1:08:09

product — full-featured, full fat, you know, you can do an on-prem install, you can use it as a SaaS. And then there's Phoenix, which is our open source product, which has a smaller feature set, but you can run it on your laptop and just get started instantly, which a lot of people like, especially when you're in an

1:08:31

enterprise environment and you're like oh I don't have time to like do a whole purchase order. I just want to do some evals. Uh just being able to like run some evals on your laptop uh is a big selling point for a lot of people. Um

1:08:47

yeah, so in terms of why people would pick us up, what we try to do is close the loop for the entire software life cycle of an agent. So when you start at the development phase, you can use us to figure out, you know, what is your weirdo agent doing — just basic observability, like what are my traces doing,

1:09:14

what prompt is going out and what actual response am I getting back. And you can do basic optimization then, but where we really shine is when you move into production, because then you can take real-time traces from your production app and you can turn them

1:09:34

into data sets and run evaluations against them, both to do offline evaluations — where you're like, okay, against real-world data, how did we do, were we actually satisfying the users' requests, were we doing what we expected to do — but you can also do real-time monitoring, so you can have an LLM constantly running

1:09:54

evaluations against responses so if your uh agent suddenly goes off the rails and starts you know cursing everybody out or giving everybody 100% discounts. Uh you can be alerted to that in real time. Yeah. And I think that is, you know, we we talked a little bit about eval earlier and that was one of the biggest

1:10:11

really, things we talked about: you need good mechanisms for setting up that data set, or kind of setting up that eval loop, right? You have real user data coming in, you want to be able to pull that back into a data set, and having that live eval loop is a big part of that. Do you—

1:10:29

Yeah. So one of the most exciting things that we're doing at the moment is we're getting into automated prompt optimization. So you get the results of your evals and you feed them to an LLM, and you say, "Hey LLM, what would be a better prompt than the prompt we were using when we got these evals?" And we tried that using Claude Code.

1:10:53

Um, so we got the LLM to write your CLAUDE.md file for you on the basis of a loop of evaluations of how well Claude Code was doing against a set of PRs on GitHub. And we got something like a 25% lift in how well it was able to tackle PRs by training it on, you know, an optimized CLAUDE.md file for your specific repo.
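A minimal sketch of the loop Laurie is describing, under stated assumptions: this is not Arize's SDK and not the actual CLAUDE.md experiment, just the general "run evals, feed the failures to an LLM, ask for a better prompt" shape. runAgent and judgeOutput are hypothetical stand-ins for your own agent call and your own LLM-as-judge eval; the OpenAI client is used only as a familiar API.

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // assumes OPENAI_API_KEY is set

type Example = { input: string; expected: string };

// Stand-in for your agent: answer `input` using the current system prompt.
async function runAgent(systemPrompt: string, input: string): Promise<string> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: input },
    ],
  });
  return res.choices[0].message.content ?? "";
}

// Stand-in for an LLM-as-judge eval: did the output satisfy the expectation?
async function judgeOutput(ex: Example, actual: string): Promise<boolean> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content:
          `Question: ${ex.input}\nExpected behaviour: ${ex.expected}\nActual answer: ${actual}\n` +
          `Reply with exactly PASS or FAIL.`,
      },
    ],
  });
  return (res.choices[0].message.content ?? "").trim().toUpperCase().startsWith("PASS");
}

// One optimization round: run the evals, then ask the model to rewrite the prompt.
async function optimizeOnce(systemPrompt: string, dataset: Example[]): Promise<string> {
  const failures: string[] = [];
  for (const ex of dataset) {
    const actual = await runAgent(systemPrompt, ex.input);
    if (!(await judgeOutput(ex, actual))) {
      failures.push(`input: ${ex.input}\nexpected: ${ex.expected}\ngot: ${actual}`);
    }
  }
  if (failures.length === 0) return systemPrompt; // nothing to fix this round

  const res = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content:
          `Current system prompt:\n${systemPrompt}\n\nFailing eval cases:\n${failures.join("\n---\n")}\n\n` +
          `Rewrite the system prompt so these cases pass while keeping its original intent. Return only the new prompt.`,
      },
    ],
  });
  return res.choices[0].message.content ?? systemPrompt;
}
```

Run optimizeOnce over a few rounds and keep whichever prompt scores best on the evals; the 25% lift mentioned above came from Arize's own Claude Code / CLAUDE.md experiment, not from this sketch.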

1:11:20

So it's essentially like automated context engineering. Yes, that's exactly what it is, for the system prompt specifically, right? Yeah. Yeah. For the CLAUDE.md, which is

1:11:31

essentially, you know, the system prompt for Claude Code. Yeah. And are you baking this into the Arize product suite, to be able to allow people to use that loop where it can help improve the prompts of their agents? Yes. I mean, you can do it today. It's kind of semi-manual, but using

1:11:50

the SDK. Like we did, you know, the prompt optimization I talked about — we were using the SDK to do it in a tight loop, in code, under the hood. So you can do it today. Nice. Nice. Yeah. I think there's, you know, if

1:12:03

last year or this I guess this current year was all about people starting to think about how to build agents and really get getting started. I think the next year is going to be about how do we how do you get these agents to give better quality? And obviously context engineering is a big part of that. You know, we hear a lot about reinforcement learning as well as as like a separate

1:12:20

approach, but I think that's the really heavy-handed approach — you probably don't want to go down that path unless you absolutely have to, unless you're rich. Yeah. It's like, how do you make the prompt better? And obviously,

1:12:34

if the LLM can write prompts better than we can, then maybe we should just tie them into the loop and have the LLM help out, rather than having a human read through all the results and do it themselves. Yeah. I mean, there's reinforcement learning and then there's reinforcement learning, right? Like, if you're taking real-world signals and you're turning them

1:12:52

into modifications of your agent, then you are doing reinforcement learning, even if that is not what the people who do RL think of as reinforcement learning. This is true. Yeah. You're not changing the model weights, but you are changing how the agent interacts.

1:13:09

So yes exactly. We were asked this question earlier um which is like to eval or not to eval essentially and like from you from y'all's perspective at the company what are the types of customer profiles that really care about eval versus what do you think are customer profiles that don't care about eval? Um that's a great question. Um the analogy that I draw uh for our product

1:13:35

is there was this famous quote from somebody — I think it was the CEO of the Coca-Cola company — who said that we compete for share of mouth with water. And I'm like, the equivalent for us is we compete for share of evals with vibes, right? Which is basically not doing it. A lot of people are just getting away with vibes

1:14:01

right now. And the trigger we see most often is: we built an agent using vibes that we were sure was going to be absolutely great in the development phase, we put it into the real world, and people immediately started, you know, feeding the script of the Bee Movie into it, and it

1:14:28

went immediately off the rails, and we need real-world — like, we've got all of these traces, we need real-world evaluation data right now, because our agent does not work anything like we expected it to work against production data. So yeah, "I finally got my agent into production" is the biggest trigger for doing formal evaluations. Yeah. Yeah. I think that resonates

1:14:52

pretty well with what we see from a lot of our users. And sometimes it makes sense — you can probably get pretty far for certain use cases with vibes. But, you know, the higher the regulation, the more likely you need evals early. That's one thing that I've seen talking to a lot of our users. Like, if

1:15:08

there's a high cost to your agent failing, well, then yeah, you probably should bring evals in before you get to production. Yeah. Should they block deployment? Like, do you think not having them pass should block your deployment? It's a good question. Um,

1:15:27

the problem is non-determinism, right? The problem is: are your evals failing because your agent is actually misbehaving, or because your LLM-as-a-judge for your evals is itself going off the rails, or because your eval is not as fine-grained as you want? Like, what is the false positive rate on your live evals, really? Um, I think if

1:15:55

you're getting like a 50% failure rate, then it's clear that something is horribly wrong. So there is some threshold at which it makes sense to block deployment, but it can't be 100%, just because of the non-determinism of how your evals work in the first place.
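A hypothetical sketch of that threshold idea as a CI gate — block the deploy only when the measured failure rate crosses a tolerance you choose, rather than on any single failed eval, since the judge itself is noisy. The EvalResult shape and the 20% number are assumptions, not anything Arize prescribes.

```typescript
type EvalResult = { name: string; passed: boolean };

// Returns true when the failure rate exceeds the tolerance and the deploy should be blocked.
function shouldBlockDeploy(results: EvalResult[], maxFailureRate = 0.2): boolean {
  if (results.length === 0) return false; // no evals ran; nothing to gate on
  const failures = results.filter((r) => !r.passed).length;
  const failureRate = failures / results.length;
  console.log(`eval failure rate: ${(failureRate * 100).toFixed(1)}%`);
  return failureRate > maxFailureRate; // tolerate judge noise below ~20%, block above it
}

// In CI you might do: if (shouldBlockDeploy(results)) process.exit(1);
```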

1:16:12

Interesting. And what would you recommend is the way to get started? So someone who's new to new to Evals, they they they're listening to this and they're they're thinking, "Okay, yeah, maybe maybe you convince me. I'm you know I've been trying to not do it but now I maybe I

1:16:28

should try to at least learn a little bit — where should they go to learn, and how would you recommend someone get started? Um, so I spend a lot of time thinking about this, as you can imagine, because my whole job is to try and persuade people to get started with evals. I think the way that we see most people come in the door to Arize

1:16:58

is they get started with Phoenix, and the thing that they get started with in Phoenix is not evals but tracing. So that, I think, is the front door: before you get into formal evaluation, just get a better handle on what your agent is actually doing under the hood, in a way that isn't

1:17:24

printing out log lines. Throw some tracing in there and you will be immediately surprised at what's actually going on. There's going to be more loops than you were expecting.
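As a concrete illustration of "throw some tracing in there", here is a minimal sketch using the OpenTelemetry JS API, which is one common way to get agent traces into a backend such as Phoenix. It assumes a NodeSDK/OTLP exporter is registered elsewhere (with none registered, these calls are harmless no-ops), and the span name, attribute keys, and callLlm helper are illustrative rather than any required convention.

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("my-agent");

// Wrap one agent step in a span so every call shows up in your tracing backend.
async function answerQuestion(question: string): Promise<string> {
  return tracer.startActiveSpan("agent.answer", async (span) => {
    try {
      span.setAttribute("input.value", question);
      const answer = await callLlm(question); // your existing model call goes here
      span.setAttribute("output.value", answer);
      return answer;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}

// Placeholder for however the agent actually calls its model.
async function callLlm(question: string): Promise<string> {
  return `echo: ${question}`;
}
```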

1:17:35

There's going to be more latency than you're expecting. There's going to be more weirdness happening than you thought. And just getting good tracing, good observability going is going to get you uh a lot of the way. Um and then

1:17:50

throw it into production because the best way to do evals is with production data. Um like you can you can use synthetic data, you can imagine what your input is going to be like. Um but nothing is like production in terms of uh confounding your expectations. So throw it into production, get some traces out of it and then start

1:18:12

evaluating that — you know, assuming you are a startup. Obviously, if you're a bank, don't do that. If you're a bank, for the love of God, generate some synthetic data and come up with a golden data set, please, before you start sending everybody's money to China. Um,

1:18:29

but uh if you're you know if you're at the if you're at the early development stage there's nothing like production data to to see what it's really going to work like. Yeah. I mean I think you just can't predict what people are going to try to do with it. You just you can try you can try to generate your own data set and

1:18:48

maybe it's useful to start with with at least a small sample but but yes no one you know no one if you've ever built products for any human you know that people don't typically follow the happy path all the time. Yeah. To say nothing of like adversarial stuff, right? People coming in and like deliberately trying to get your bot to do their homework for

1:19:06

them instead of answering whatever question it was, right? They're like, "Yeah, actually, I don't want to do shopping. I need you to generate, you know, this script in Python, please." And the LLM will do that. So, you know, testing your

1:19:18

guardrails and things like that. It doesn't really happen until you hit production. Yeah. I mean, funny enough, I built a

1:19:24

Slack agent that lives in our Slack, just to play a game with the whole Mastra team, and immediately someone asked it to write code for them. And so, yeah — make me a Node server. Make me a Node server. So what you said is exactly true. Of course, the adversarial stuff as well. Yeah, people have been doing that with

1:19:44

what was it whatever Amazon's shopping bot is called. They immediately started asking it to generate stuff for them in Python because they were like free tokens. Yeah, that's that's it's still happening too. Like I think newcomers, they'll get to

1:19:56

production and they look at the traces and then they realize that they need guardrails. It comes to that point where they see a user input that they never thought that they would see and it triggers their like, oh, I got to do something about those. Um, but maybe people should do that earlier.

1:20:15

Yeah. I mean, there's a lot to be said for — the quality of synthetic data sets that you can get now is pretty good. The LLMs are surprisingly good at generating, even if you ask them, like, come up with an adversarial data set, come up with something where you're trying to break this bot. They're pretty good at that these days. So

1:20:34

it's not it's not wasted time. Um but the quality of of real production data it just just can't be beat. I'd like to ask you something that is not LLM related at all or AI related.

1:20:47

Sure. Can we go through uh like memory lane and talk about like the state of web uh stuff back in the day? Sure. You got two minutes. Two minutes. I just want to

1:20:58

This is my favorite topic. So, is it really? Oh, yeah. I love talking about the web.

1:21:03

Yeah. Okay. So, we talked about all the eval stuff. Now, let's talk about the We're all We're all web developers. A

1:21:09

lot of our audience has similar backgrounds to us, coming from the web world. That's why we're doing JavaScript AI stuff. So, if the audience doesn't know, Laurie used to do the state of the web reports every year, and it would have some predictions and what happened in the year, and it was super valuable because you got to see who are, like, the

1:21:27

rising players in libraries and everything. Actually, you should explain the work that kind of went into that, because I don't think people appreciate it enough. I mean, it was a side effect of working at npm that I was just sitting on all of this data about what packages were going

1:21:44

up and what packages were going down. Um and in particular the thing that I was trying to fix was people who would look at a specific package uh and say oh my downloads went up so everybody's using this package. And I'm like, actually, every single package on the registry is constantly going up because the registry just gets bigger all the time. Uh, and what you need to look at is how much

1:22:08

faster than the registry your package is growing. I called that metric the share of registry. And it was much more interesting. It's like, relative to the registry, this package is actually going down. Even though it

1:22:20

has more users than it's ever had before, it's getting less popular. But that way of looking at it is how a lot of open source calculates success now — the velocity of downloads within the registry against other libraries of the same type, let's say. Yeah. That's cool. Like, honestly, we use those calculations today,

1:22:44

and I'm sure everyone does. So it's awesome. Um, would you ever do this for the AI industry? The same kind

1:22:51

of report. If I could get anything like the same quality of data, I would do it in a heartbeat. One of the problems with AI is that we are not nearly as centralized in terms of a data source, right? Like, who would you ask? Would you ask OpenAI? Would you ask, like, Open

1:23:10

Router? Um, OpenRouter has tried to do some stuff like that in terms of share of models, which is very interesting, but they have a lot of selection bias in terms of who decides to use OpenRouter. Um, but yeah, I would do it in a heartbeat. It's, you know, I love

1:23:29

data. How do we get you this data? Can we get, like, a call for proposals on getting you the data? Ideally, where would you want it from? Um, the very best data comes from

1:23:42

humans. The very best data is like you don't try and infer it from downloads. You you actually send you know 30,000 people a survey and are like tell me what the hell you're up to. Tell me why you're doing this because why is a

1:23:57

question that the data can never answer, whereas the humans always know why. So I would love to see a community effort of all of the various frameworks coming together and saying, we're all going to poll our communities, we're all going to pool our questions. Yeah. And we're gonna try and come up with, to the extent that it is possible,

1:24:15

an impartial look at the entire survey — the entire landscape. I would love to do something like that. We're in. Yeah. All right, we'll start to rally the different

1:24:26

crowds. I mean, I do think it is risky as a company to be part of something like that, because you don't want what you're building to be cast in a negative light. But also, people need to be able to make educated choices, and we all kind of learn from each other, so it's

1:24:44

useful to see what people are doing and how they are using these things, and not just listening to some random people. But it's so true — the data telegraphed the fall of Gatsby. Like, honestly, it did. Let's be honest,

1:25:03

the data showed the fall of Gatsby, you could just see it. You're like, that's interesting as hell. And then the rise of Next.js — that was critical. And I bet the same here. Yeah, it was not fun to be in the position of pointing that out. Especially — the Gatsby community was kind of good-natured

1:25:26

about it. They were like, "Oh, that's a shame." Uh but like the Angular community were like big mad. The Angular

1:25:33

community did not like being told that their share of registry was falling. I got some very angry emails about it. Yeah, they can't blame you. They've got to blame the survey. I

1:25:45

mean, everyone— It turns out they can blame me. It turns out they can, and they did. They're gonna try.

1:25:53

Well, uh, Laurie, any parting words before we let you go? We appreciate you coming on. Thank you.

1:25:59

I appreciate the time, and thanks again for the invitation — anytime you want me back. Yeah. And what's the best way for people to follow you? Uh, I am seldo.com on Bluesky, and that is where I answer everything fastest.

1:26:12

All right. Seldo.com on Bluesky, and check out Arize. I'm really excited. I've been playing around with Phoenix a

1:26:17

little bit as well. So if you're looking for, like, open source observability that plays well with Mastra, give that a shot. And check out the old state of the web reports, too — go down memory lane. Yeah.

1:26:28

All right. Thanks everybody. Yeah. We'll see you, Laurie. See you.

1:26:33

All right. Well, it's been a show. We've had our share of technical difficulties which, you know, happens from time to time. You know, we do we do say we're not experts at this. We got day jobs.

1:26:45

You know, this is what we do for fun. We barely do that. So — but I'm excited about bringing on Lio and Kevin. I chatted with them a while back. You know, they use Mastra a little bit, but they also have some

1:26:57

really cool integrations with Mastra and other frameworks, and they really have some cool tools for deployment. And so, I want to bring them on and learn a little bit about Defang. So, let's bring them on the show. Kevin,

1:27:09

welcome. And Lio, welcome. Hey guys. Hey.

1:27:15

Sounds all good. Yeah. Yeah, we are. Yep, we can hear you. We are back. We've had some, as you

1:27:20

maybe have seen as you've been watching the show. Normally, we don't have this many technical difficulties, but today we're in a new place and it's been a battle, but we're still here and we're having a good time. But maybe as a good way to intro, can you tell me, Lio, what is Defang?

1:27:40

And then maybe you guys can both do a quick little introduction and then I I I hear we're gonna see a demo at some point, too. So we always like to see demos. Yeah, we prepared a demo for everybody to see. So uh enough opportunities for

1:27:52

more technical difficulties later. Yeah. So I'm a co-founder of Defang, and it really started — and the name kind of hints at it — as a way to be able to build applications that are not tied to any particular cloud. That's where the 'fang' came from.

1:28:15

It has, you know, changed a little bit, but in essence now it's a deployment tool that helps you deploy your application to, well, any cloud — but we're not there yet, so we support AWS, GCP, and DigitalOcean now — and in a way that is cloud agnostic, right? So build your application once but deploy it multiple times. What's interesting —

1:28:41

and this is where the Mastra agent angle comes in — is that people were using Defang to deploy their own applications to the cloud, in particular AWS and GCP, but now we're seeing them deploy their applications into their customers' clouds. So that's kind of an angle that we

1:29:02

took on, because, as you know, many people are hosting their stuff in production where it's been clicked together for the last two years by some DevOps guy who understands exactly what's going on. And then when a customer comes and says, I love your agent, love your app, but can you deploy one for me in my account so that I can monitor what goes in, what

1:29:30

goes out. Yeah. Then they're scrambling because they never set up all the automation for that. And that's kind of

1:29:36

where we want to come in. It's super cool, because I was on a podcast last week or something — I think it's coming out — and someone asked me, what's a tool that's non-AI-related that you really want to see, and I was like, multi-cloud deployments. So

1:29:50

that's pretty. Yeah, that's cool. Yeah. Um yeah. So Oh yeah, I was supposed to do my

1:29:55

introduction. Yeah. So I've been in the space for a while. Defang is my third startup, and it kind of came out of the pain

1:30:01

that I've had in previous startups — you know, you build your application and then you keep rewriting the whole thing three times over because the cloud is changing or the cost profile is changing. Um, yeah. So, Kevin, you want to introduce yourself then? Uh, yeah. I definitely don't have that much experience like everyone — I

1:30:20

graduated like three years ago. But I did do six internships, so, you know, I have the typical Google internship, and then, you know, a little bit of Tesla here and there, Apple. And then I realized I hated working at big companies, so here I am at a small 10-person company. And, I don't know, around here I just kind of

1:30:46

solve things. I guess if I have a title, I'm a developer, but if they want me to do, you know, marketing — you know, the works. So, yeah, just kind of taking on whatever. Cool. Awesome. Yeah. Yeah. Well, it's great to have you on

1:31:00

and obviously we talked a while back and you showed me some stuff, and we talked about, you know, some of the support you're adding for Mastra and all that. But are we ready to see a demo? Can we see a demo soon? Yeah. Yeah, go ahead. Seeing is believing. I think everyone wants to see something. And if you are watching this as we're

1:31:20

pulling up this demo, if you have questions, put them in the chat. Drop them in the chat. We will try to answer as many of them as we can.

1:31:27

All right. Can everyone see my screen or Yes. Yeah. Maybe maybe click a zoom.

1:31:33

Oh, is it that bad? Uh well, I think it's just you know old people well like us like me. Is this any better? Did one more click or one more click.

1:31:44

Okay. There you go. Yeah. Okay. It's it's like one or two clicks past

1:31:49

where you're comfortable as as the driver. You're like this is way too big. But that's for the viewers.

1:31:54

Okay. Yeah. So basically, for our IaC — like, a lot of people use Terraform and stuff — we use a compose file. So our compose file kind of defines what

1:32:05

your project definition is. And I think the compose file is actually a really good way, because most people that we talk to, for local development, they all use compose files — because who wants to spin up a database by hand and all that stuff when you can have a compose file that just spins up your whole stack and then just

1:32:22

kind of joins everything together. So we decided on this, and all you need is a Dockerfile within your application — or if you don't have one, we've tried it before without a Dockerfile too; we use Railpack and it containerizes your application for you as well. For people that aren't really

1:32:40

familiar with it — Docker, Dockerfiles. Um, so yeah, today he actually helped me with this repository. I think it was the

1:32:52

repo base. I was trying to make it work and make it a sample for our customers to see how to use Mastra with Defang. And the special thing is, a lot of people I saw online using Mastra were also, you know, turning on SSL. So we have SSL support and we make

1:33:10

everything very production ready. Let me go through the compose file. So a lot of times you have to deal with keys to Bedrock or Vertex AI. With Defang,

1:33:21

we have this special compose extension called x-defang-llm. And if you toggle this to true, what we do is we enable a task role or service account in your GCP or AWS account to enable Vertex AI or Bedrock for whichever cloud you're going to. So that way you don't have to pass a key — it's ready for you, your application can use it. Only for GCP

1:33:47

we need these two, just because the Vertex routing needs these two pieces of information — your GCP project and your location. AWS doesn't need it; AWS can just look at the service role and then be able to invoke it from there. And then we have a Postgres database down here, and when we have this x-defang-postgres — this is

1:34:10

our thing — it tells us to use a managed database, so RDS or Cloud SQL. So we're not running, like, containerized Postgres containers; you can have a fully managed database in your cloud. All

1:34:23

right. So I'm just going to hit these two. Um this is how we would uh typically send to the cloud. I'm

1:34:29

actually going to send them both out at the same time. So the left is going to send to AWS, the right is going to send to GCP. And I set everything up already, just to save time. If not, you would just need to set some environment

1:34:41

variables or configurations through our tool — like which LLM you want, and maybe a GitHub token for rate-limiting purposes, but that's optional, just for this app. So here — hey Kevin, while the deployment is running there, do you want to ask the MCP server — Kevin wrote our MCP server as well — do you want to ask the MCP server about comparing costs? Yeah, I was just going to do that

1:35:07

next. Yep, for sure, we will do that. So we made an MCP server that calls our backend, and it actually calls Amazon or GCP for estimations of how much these will cost. And we have three modes available — affordable, balanced, and I think it was production or high availability, but production is an alias, you can use either. Yeah. So, how much will this cost me

1:35:36

on— Yeah, so let me explain a little bit what's going on here. So in essence, think of Defang kind of like a compiler: it compiles this compose file into infrastructure code and then applies it to your platform. And when you ask for an estimation, it actually does the same thing, but then it's invoking the cloud cost APIs for real-time

1:36:06

numbers, right? So it transpiles this thing, it finds out, oh, I'll need a managed database, I need a managed Redis, managed language models, etc. Okay, what do they cost? And then it figures out the SKUs and calls all those cost APIs. So the numbers you're getting here are real numbers. This is not like, you know, if you copy-paste a compose file into ChatGPT or Perplexity or

1:36:30

something, it'll spit out some numbers, right? But it's not exactly the same thing. Yeah. How do people iterate on these

1:36:37

compose files? Like, if I were to change my Postgres and redeploy, what happens? Yeah. So early on, we were thinking of making infrastructure

1:36:49

code super easy and cloud agnostic, and so for a while we had our own files. But everybody already had a compose file, so now we use that as the input. And what's good about it is it's an open standard — Docker opened it up a few years ago — and the LLMs are well trained on it, so any changes you want to make, you can ask the LLM or you can make them yourself. In

1:37:12

fact, the editor very likely has the JSON schema for Compose. And any changes you make, you just run the same command again, right? Defang will also do drift detection, will make any changes, will do builds if they're needed. So if the compose file says my service comes from

1:37:33

source code, then it'll upload the source code to your cloud account and start the build in your cloud account. So actually, on the developer machine, there's not much happening — everything happens in your cloud account.

1:37:50

and cloud build, which is GCP. And you can see the build logs coming back to us right now. So, it's containerizing both the application on your AWS and your GCP account. Uh if I quit this CLI, it will

1:38:01

still run — you don't have to be in the CLI for it to run. It never runs on your computer. Yeah, the CLI is just streaming logs, right? Once you kick it off, it's just streaming logs. You can't have an interrupted deployment

1:38:13

or something like that. Yep. And how do you deploy like the like if I configure a server in here? Is it like

1:38:20

deployed on, like, a Lambda, Fargate, and all that? Yeah. So, right now, on Amazon it's very likely going to be Elastic Container Service, and on Google, very likely Cloud Run. I say very likely because there are some limitations to those two. So for example, if your

1:38:39

service needs a GPU — which was very common two years ago, not so common anymore because now everybody uses the model APIs — but if you add in your compose file that I need a GPU, we'll provision a GPU for you as well. And ECS cannot do that, right? So we'll have to get an EC2, and on Cloud Run, you end up with Compute Engine. But you don't really

1:39:04

care. So we kind of want to abstract that away, also because there's value in making that dynamic. In a way, if the application changes — for example, you have a service that gets hit once a week because it's behind a schedule or something, that would be good in a Lambda, but if you get a lot of traffic

1:39:29

now, then you want to have a 24/7 service up and running and scale that horizontally. So we want Defang to have the smarts to do that in the future. It's not that dynamic yet — we have some feedback there — but that's where we want to go. Yeah.

1:39:46

Yeah. And um yeah, you can see the AWS actually just finished mounting. So it uh activated and the app is running now. Uh probably go hit it in a second. Um but uh uh I also just did um estimation

1:40:01

on GCP and on AWS, and you can see that the MCP server was able to tell me that hosting on GCP was about $10 cheaper. Ask it to compare — oh, there you go, it was $34 for AWS and $26 for GCP. Very cool. Yeah, that's awesome. Yeah. And let's see here. Oh, am I

1:40:27

still sharing here or? Yeah, sorry. I thought I lost you for a second. Uh, yeah. God, I try to get this

1:40:33

link, but my terminal is too big though. There you go. I got it. Okay. So, you can

1:40:39

open now. Um, uh, yeah. Yeah. So, this is a defang.app uh URL

1:40:46

because if you don't specify a URL in your compose file, then we'll just create one under defang.app. But you can also specify it in the compose file — it's called domain name, it's part of the standard. And then Defang will do

1:41:00

all the SSL, you know, DNS stuff uh that's needed to to hook that up, right? Yeah. Like this. And then you can put in whatever you want. And that would be your domain name um but on the service

1:41:12

that you're trying to uh link. But yeah, for now this is just the one we made. Uh but yeah, you can see it works. Uh let's

1:41:19

see what the Mastra team's been up to — see what recent commits they made. But yeah, you can see the LLM gets called, it works, and there are no keys involved; everything is provisioned through the service account. So it's using Vertex AI — we hook up Vertex AI if you deploy to GCP, Bedrock if you deploy to AWS. So it never leaves your account, basically; it's not reaching out to any third

1:41:45

party. That's kind of also where a lot of the value lies. Yeah, and we can see here, if it loads, you can go into your own account and you can see that it's been deployed — you can see here, two minutes ago. See, and that's legit, dude. That is legit. Yeah, and

1:42:08

because we did two clouds, right — it's cloud agnostic, so you can actually send to all three clouds at once if you want, but we only did two for today. But yeah, if we go to ECS here, you'll see that we have one here. Mastra just got deployed as well. So — and then this is running on— Oh, that's not the one I'm looking for, I

1:42:27

guess. But yeah, this one just got finished. Did it finish? Yeah. And we can go into the uh one. And then we have

1:42:34

two replicas on two clouds. Oh, I think it's still going through some provisioning. Yeah, it takes a while for the whole load balancer and everything to be happy. So, okay. Yeah, but uh the GCP one uh

1:42:45

finished. But yeah, as you can see, we're not just writing gibberish — we're partners with Google and AWS. We went through their well-architected partner programs. So the

1:42:57

infrastructure we're writing isn't just going to be, you know, garbage architecture. It's been, you know, stamped and approved by our partners, and it's, like, well trusted. We make all the IAM roles, load balancing, security, you know, all the production-ready stuff you need.

1:43:13

Yeah. Yeah. You guys must have seen that many times as well, right? A lot of the sample code and toys and, uh, tutorials

1:43:20

that you find there, you cannot run them in production as is. Yeah, quite a lot of stuff you have to change there. So the samples that we publish in our samples repository, as well as the tool itself — we really want it to be production ready. Yeah, you should be comfortable doing a deployment with Defang and not thinking

1:43:40

about it. We hook up um uh observability. Well, it's all in your account. So if you want to go fancy with

1:43:47

maybe Prometheus and Grafana, you just have to add it to your compose file. We'll just deploy that. Wow. Very cool. So are the target users here engineers who are not DevOps friendly, or

1:44:00

more DevOps friendly — like, what's the gap that is being filled? So the target user, up to now I would say it's the beginners that, you know, they get AWS credits but they have no clue how AWS works. So they're definitely a target user. But what's happening more and more often is that they just skip AWS altogether; they go to Vercel and

1:44:26

don't even think about AWS, right? So now they, you know, they vibe code their app, they're on Vercel, they're live, and they've got a customer that says, I love your thing, but there's no chance that I'm going to send this private data to this one-person SaaS, right? So the customer asks, I want a private

1:44:48

deployment in my account — and they might be in a different cloud, right? And then that's where Defang would be very valuable, even to the vibe coders. Yeah. We had a friend of ours who built

1:45:01

on Vercel, and then they finally got a government contract, and they have to be on AWS GovCloud. There you go. Yeah. Now they're changing their whole app to just be AWS, so they don't have to deal with this anymore, you know?

1:45:13

Yeah. Yeah. Exactly. Yeah. That's a perfect use case. And on top of that, Vercel is a

1:45:19

reseller, right? So, like, I think it's like five, six times more expensive to host on Vercel than just straight on AWS or something. Yeah. I mean, they definitely take their margin, you know. Yeah. So you're just hosting on AWS, but

1:45:31

through them. So I mean, this is the best way to go. That's how they get you, right? So it's

1:45:36

cheap to start. Yeah. Yeah. Um well this has been awesome. How do people so say you have some engineer

1:45:45

developers, web web developers who maybe want to deploy to cloud uh GCP, AWS, where do they go? Defang.io. Is that the best place to go? Yeah, it's it's all self-service, right? So, you can download the CLI. The CLI is

1:45:58

open source. You can also fork the repo if you want to make changes to it. Uh yeah, we have an MCP server uh that you can try. You can deploy with that. We're preparing a a major release early next year. So we're actually integrating our

1:46:12

own agent, not just MCP server. So that the CLI will be able to interactively help you go through all the cloud setup and all of that. Kind of the major pain point now is getting you into the actual cloud account. So we're doing some

1:46:25

features around that so you you hopefully never have to go to the AWS dashboard and still be able to deploy there. That's kind of the I mean that sounds like the dream. Maybe in a few months when you get that big launch, come back and and show us the agent you built. Oh, I I would love to. Yeah, sounds good. Yeah.

1:46:43

Also, I'm really interested in talking more about some other stuff with you guys. So, we should just chat after Oh, that sounds good. Yeah. Anytime. Yeah, reach out to us. Uh I'm on X, Twitter,

1:46:55

Bluesky — Lio Lunesu. There aren't many with that name, so find me there. Yeah, I don't have all that — I just have LinkedIn, so you can probably just look for my name on there.

1:47:06

Find Kevin in Discord. You can find Kevin in the Discord. Oh, yeah. Yeah. Actually, I haven't. I

1:47:11

have another PR ready for you. I found another little bug in there. Yeah, please send it over. All right. It

1:47:18

was great having you both. Thanks, guys. Yeah, thanks for having us. And yeah, we'll definitely have you on again when you have more to show after this next big launch. Yeah, we'll see you. See you guys.

1:47:32

And then there was us. All right — the show must go on. This has been a long show because of some technical difficulties, and we got started a little late, but we still have a little bit of AI news, and then we're going to have to get out of here. We're going to keep it quick. Yeah, we got a lot to do, other stuff to do, you know — the actual job.

1:47:50

Yeah, this is this is just what we do for fun. Uh this is our ARR right here. Yeah, this is all right. So, some Open

1:47:58

AI stuff, because OpenAI is always trying to do some things. You know, we did mention OpenAI has announced their code red, which is them trying to focus more on their core products, but they've also, you know, had some other things.

1:48:10

They released a new proof-of-concept study where they trained a GPT-5 Thinking variant to admit whether the model followed instructions. They call it the confessions method. So rather than just asking the original LLM, they basically trained a very specific model that has

1:48:30

like a separate path that evaluates whether it did something it should confess to — some kind of failure, some kind of guess, some kind of hallucination. It says: as AI systems become more capable, we want to understand them as deeply as possible. So sometimes the model takes a shortcut — I think we've probably all

1:48:48

seen this — or optimizes for the wrong objective, or just plain makes something up. Yeah. Opus is still guilty of this. I

1:48:55

use Conductor and I have multiple git worktrees, right? And I had copied some code examples from a framework, from a different repo, and I wanted it to just evaluate them and say, hey, how would we do this? And it made it up, because the code wasn't actually in that worktree — it was in my main git branch. And so

1:49:22

of course it wouldn't tell me that it had just made it up, and I thought it was right, and I started looking at it and I was like, that's not right, you just made this up. Yeah, it lied to me — it said that Mistral does not support JSON prompt injection. I was like, how do you know that? And it was like, oh, I'm sorry, yeah, I just guessed, I just made it up. So

1:49:38

I think, you know, the whole idea is confessions — they don't prevent mistakes, but they make them visible. They say, here's a mistake that it made. So in some ways it's kind of like running an eval, but in this case it's a specifically trained GPT-5 variant.
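A toy illustration of the flavor of this, under stated assumptions: OpenAI's study trained a dedicated GPT-5 Thinking variant, which is not what happens here — this is just a plain second call that asks the model to flag shortcuts in its own answer, to make the "confession" idea concrete. The function name and prompts are hypothetical.

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // assumes OPENAI_API_KEY is set

async function answerWithConfession(question: string) {
  // First pass: answer the question normally.
  const answer = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: question }],
  });
  const text = answer.choices[0].message.content ?? "";

  // Second pass: ask the model to confess shortcuts, guesses, or skipped instructions.
  const confession = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content:
          `Question: ${question}\nAnswer you gave: ${text}\n` +
          `Did that answer involve guessing, unsupported claims, or skipped instructions? ` +
          `Reply with a one-sentence confession, or "CLEAN" if not.`,
      },
    ],
  });

  return { text, confession: confession.choices[0].message.content ?? "" };
}
```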

1:49:54

So it's kind of interesting. Look at Lee's comment there too. Watch this.

1:50:00

Yeah. Watch this. All right. Um Google's making making moves as they

1:50:07

sometimes do. Dude, Google's coming for your throat. Yeah, Google's coming.

1:50:14

All right. So, Google has released their Workspace Studio, where anyone can build custom AI agents in minutes to delegate the daily grind, automate daily tasks, and focus on the work that matters instead. And I think Google has the same kind of thing in different places, like Google AI Studio.

1:50:33

You can do some stuff in Gemini. Yeah. Vertex. Now, they have it in Workspace Studio, but they're just

1:50:40

trying to get it everywhere. Yeah. Yeah. So, I don't know if it's anything necessarily. It's Python only. It's a shame.

1:50:46

If you want to add, like, a studio agent, you have to use LangChain or CrewAI. Well, maybe that should change. Or you can use A2A, which, once again, they're pushing. Yeah. So, speaking of Mistral, which

1:51:01

you mentioned — this came out, I think, the day after our show last week, so we didn't cover it, but Mistral released some models: the Mistral 3 family of models. They say they're the world's best small models. They have their instruct models, Mistral Large,

1:51:21

or frontier class, and they're kind of comparing it to some other open source models using the Elo score. Got some mixed reviews, though. Yeah. So let's talk a little bit about what

1:51:35

people are saying. So there was — I think it was maybe Theo — who had some comments on some benchmarks that said, like, they're not very good. Yeah. Like, maybe it was a little bit of a

1:51:46

letdown. And then also a little bit of controversy around, you know, where did they maybe borrow some things — they basically announced it as their own model, but maybe they kind of borrowed some stuff from DeepSeek. So people are saying — and again, none of this is proven — that they were

1:52:12

distilling DeepSeek while claiming it was their own model. And then there's Sam Paech here saying, you know, the slop profiles — basically, the analysis of their slop profiles suggests it is very similar to,

1:52:33

you know, DeepSeek V3. So, you know, who knows if it is or it isn't. I don't know. There's maybe some evidence

1:52:39

that it could be. But I think technically it's not against the terms of what you can use DeepSeek for — it's just that you should probably admit it if you're doing it. Yeah. Hopefully it's not true, but it probably is. Yeah.

1:52:57

All right. So, that's what's going on with Mistral. This next one — if you're in the JavaScript world, this one will definitely be interesting to you. This came out last week, and I think it caught a lot of

1:53:09

people by surprise. It shouldn't have, but it kind of did: TanStack AI. So TanStack has a whole bunch of different packages, you know, like TanStack Start, TanStack Router, TanStack Query. Now they have TanStack AI — a powerful open source AI SDK with a

1:53:26

unified interface across multiple providers. So you can use it with any model, and they even want to support different languages. Yeah. Different, you know — vanilla

1:53:37

TypeScript, React. So they basically want to be essentially a new AI SDK. So I think they're going right after Vercel's AI SDK. It's pretty similar in

1:53:51

some of the stuff it wants to do to some of the stuff we do at Mastra, but it definitely caught some attention when it was released last week. Are they going to go full-on agent framework, you think? Or just this library? I would guess, if it goes the way of the AI

1:54:04

SDK — if it gets enough adoption, you would have to think they're going to add it. It's the natural thing to add in. So, we'll see. You know, bring your TanStack agents to Mastra in the new year. Yeah, maybe we'll just support TanStack agents in Mastra. Who knows? Yeah, why not, right?

1:54:22

So, yeah, that was interesting. So if you're a fan of the TanStack stack, maybe give that a look. And this will be the last thing we talk about, and we're not going to cover it other than just showing you that it exists — it was funny, we talked about it with Laurie a little bit. So,

1:54:43

OpenRouter has kind of released a State of AI report, I think. You know, of course they have certain types of data that they can report on, and it's pretty huge, right? But they have some nice charts in here, and you've seen some of these get shared across X and other places. You can see, like, the decline of DeepSeek's

1:55:03

dominance in the open source market — and this is, of course, all people that go through OpenRouter — but it is very interesting to see the trends, and you might be able to say, oh, these are the popular models, or maybe this is a model I should try. There's the uptake of reasoning models, too. So go to, you know, openrouter.ai

1:55:22

/state-of-ai, and I think you're going to get a lot of good information, if you're interested in what the trends are. And with that — anything else before we close this thing up? This has been quite a wild ride of a show. No, it's really fun in the new

1:55:40

studio with all the the trouble. But uh yeah, we'll figure that we'll figure that out for next week. I'm excited for 2026. I'm excited for the show, too. And for everyone listening, we're going to be making

1:55:51

moves next year on a lot of things. Yeah. So, we appreciate you all for tuning in. This has been AI Agents Hour.

1:55:58

We had some technical difficulties today. We'll figure those out. Uh we are trying to do some new things. We're

1:56:04

going to try to up the production value of this a little bit. You know, we are the top 20% of Spotify shows. Apparently, we are the chosen one. So, if you haven't given us a five star on Spotify, if you would like to give us a five

1:56:15

star, please go to Spotify, give us a five star review. And apparently, they told me in our uh in our report that we need to ask more people to follow us on Spotify, which is a thing. I didn't even know. So, please follow the show on

1:56:28

Spotify as well. And if you want to give us less than a five-star review, please find something else to do. Like, Jesus. Yeah, we don't like that. But you can also check us out on YouTube. Check us out on X, all

1:56:43

the different places, right? And you can follow Obby and myself on X as well if you're looking for, you know, questionable follows. We're about as good as you can get. Um, and you know,

1:56:58

meet Ryan Evans. Thanks for tuning in. See you the next one, dude. And Sebastian has some comment from

1:57:04

before. Thank you for tuning in, Sebastian. All the other people that tuned in, we appreciate you. We'll see

1:57:10

you next week. Peace.